linux-nvme.lists.infradead.org archive mirror
 help / color / mirror / Atom feed
* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
       [not found]               ` <alpine.DEB.2.21.1908281605190.23149@nanos.tec.linutronix.de>
@ 2019-09-03  3:30                 ` Ming Lei
  2019-09-03  5:59                   ` Daniel Lezcano
  0 siblings, 1 reply; 40+ messages in thread
From: Ming Lei @ 2019-09-03  3:30 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jens Axboe, Hannes Reinecke, Daniel Lezcano, Sagi Grimberg,
	linux-scsi, Peter Zijlstra, Long Li, John Garry, LKML,
	linux-nvme, Keith Busch, Ingo Molnar, Christoph Hellwig

On Wed, Aug 28, 2019 at 04:07:19PM +0200, Thomas Gleixner wrote:
> On Wed, 28 Aug 2019, Ming Lei wrote:
> > On Wed, Aug 28, 2019 at 01:23:06PM +0200, Thomas Gleixner wrote:
> > > On Wed, 28 Aug 2019, Ming Lei wrote:
> > > > On Wed, Aug 28, 2019 at 01:09:44AM +0200, Thomas Gleixner wrote:
> > > > > > > Also how is that supposed to work when sched_clock is jiffies based?
> > > > > > 
> > > > > > Good catch, looks ktime_get_ns() is needed.
> > > > > 
> > > > > And what is ktime_get_ns() returning when the only available clocksource is
> > > > > jiffies?
> > > > 
> > > > IMO, it isn't one issue. If the only clocksource is jiffies, we needn't to
> > > > expect high IO performance. Then it is fine to always handle the irq in
> > > > interrupt context or thread context.
> > > > 
> > > > However, if it can be recognized runtime, irq_flood_detected() can
> > > > always return true or false.
> > > 
> > > Right. The clocksource is determined at runtime. And if there is no high
> > > resolution clocksource then that function will return crap.
> > 
> > This patch still works even though the only clocksource is jiffies.
> 
> Works by some definition of works, right?

I am not sure there is such system which doesn't provide any high resolution
clocksource, meantime there is one high performance storage device
attached, and expect top IO performance can be reached.

Suppose there is such system: I mean that irq_flood_detected() returns either
true or false, then the actual IO performance should be accepted on
system without high resolution clocksource from user view.

> 
> > > Well, yes. But it's trivial enough to utilize parts of it for your
> > > purposes.
> > 
> > >From the code of kernel/irq/timing.c:
> > 
> > 1) record_irq_time() only records the start time of one irq, and not
> > consider the time taken in interrupt handler, so we can't figure out
> > the real interval between two do_IRQ() on one CPU
> 
> I said utilize and that means that the infrastructure can be used and
> extended. I did not say that it solves your problem, right?

The infrastructure is for predicating when the next interrupt comes,
which is used in PM cases(usually for mobile phone or power sensitive
cases). However, IRQ flood is used in high performance system(usually
enterprise case). The two use cases are actually orthogonal, also:

1) if the irq timing infrastructure is used, we have to apply the
management code on irq flood detection, for example, we have to
build the irq timing code in kernel and enable it. Then performance
regression might be caused for enterprise application.

2) irq timing's runtime overload is much higher, irq_timings_push()
touches much more memory footprint, since it records recent 32
irq's timestamp. That isn't what IRQ flood detection wants, also
not enough for flood detection.

3) irq flood detection itself is very simple, just one EWMA
calculation, see the following code:

irq_update_interval() (called from irq_enter())
        int cpu = raw_smp_processor_id();
        struct irq_interval *inter = per_cpu_ptr(&avg_irq_interval, cpu);
        u64 delta = sched_clock_cpu(cpu) - inter->last_irq_end;

        inter->avg = (inter->avg * IRQ_INTERVAL_EWMA_PREV_FACTOR +
                delta * IRQ_INTERVAL_EWMA_CURR_FACTOR) /
                IRQ_INTERVAL_EWMA_WEIGHT;

bool irq_flood_detected(void) (called from __handle_irq_event_percpu())
{
        return raw_cpu_ptr(&avg_irq_interval)->avg <= IRQ_FLOOD_THRESHOLD_NS;
}

irq_exit()
	inter->last_irq_end = sched_clock_cpu(smp_processor_id());

So there is basically nothing shared between the two, only one percpu
variable is needed for detecting irq flood.

> 
> > 2) irq/timing doesn't cover softirq
> 
> That's solvable, right?

Yeah, we can extend irq/timing, but ugly for irq/timing, since irq/timing
focuses on hardirq predication, and softirq isn't involved in that
purpose.

>  
> > Daniel, could you take a look and see if irq flood detection can be
> > implemented easily by irq/timing.c?
> 
> I assume you can take a look as well, right?

Yeah, I have looked at the code for a while, but I think that irq/timing
could become complicated unnecessarily for covering irq flood detection,
meantime it is much less efficient for detecting IRQ flood.

thanks,
Ming

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-03  3:30                 ` [PATCH 1/4] softirq: implement IRQ flood detection mechanism Ming Lei
@ 2019-09-03  5:59                   ` Daniel Lezcano
  2019-09-03  6:31                     ` Ming Lei
  0 siblings, 1 reply; 40+ messages in thread
From: Daniel Lezcano @ 2019-09-03  5:59 UTC (permalink / raw)
  To: Ming Lei, Thomas Gleixner
  Cc: Jens Axboe, Hannes Reinecke, Sagi Grimberg, linux-scsi,
	Peter Zijlstra, Long Li, John Garry, LKML, linux-nvme,
	Keith Busch, Ingo Molnar, Christoph Hellwig


Hi Ming Lei,

On 03/09/2019 05:30, Ming Lei wrote:

[ ... ]


>>> 2) irq/timing doesn't cover softirq
>>
>> That's solvable, right?
> 
> Yeah, we can extend irq/timing, but ugly for irq/timing, since irq/timing
> focuses on hardirq predication, and softirq isn't involved in that
> purpose.
> 
>>  
>>> Daniel, could you take a look and see if irq flood detection can be
>>> implemented easily by irq/timing.c?
>>
>> I assume you can take a look as well, right?
> 
> Yeah, I have looked at the code for a while, but I think that irq/timing
> could become complicated unnecessarily for covering irq flood detection,
> meantime it is much less efficient for detecting IRQ flood.

In the series, there is nothing describing rigorously the problem (I can
only guess) and why the proposed solution solves it.

What is your definition of an 'irq flood'? A high irq load? An irq
arriving while we are processing the previous one in the bottom halves?

The patch 2/4 description says "however IO completion is only done on
one of these submission CPU cores". That describes the bottleneck and
then the patch says "Add IRQF_RESCUE_THREAD to create one interrupt
thread handler", what is the rational between the bottleneck (problem)
and the irqf_rescue_thread (solution)?

Is it really the solution to track the irq timings to detect a flood?



-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-03  5:59                   ` Daniel Lezcano
@ 2019-09-03  6:31                     ` Ming Lei
  2019-09-03  6:40                       ` Daniel Lezcano
  0 siblings, 1 reply; 40+ messages in thread
From: Ming Lei @ 2019-09-03  6:31 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Jens Axboe, Hannes Reinecke, Sagi Grimberg, linux-scsi,
	Peter Zijlstra, Long Li, John Garry, LKML, linux-nvme,
	Keith Busch, Ingo Molnar, Thomas Gleixner, Christoph Hellwig

Hi Daniel,

On Tue, Sep 03, 2019 at 07:59:39AM +0200, Daniel Lezcano wrote:
> 
> Hi Ming Lei,
> 
> On 03/09/2019 05:30, Ming Lei wrote:
> 
> [ ... ]
> 
> 
> >>> 2) irq/timing doesn't cover softirq
> >>
> >> That's solvable, right?
> > 
> > Yeah, we can extend irq/timing, but ugly for irq/timing, since irq/timing
> > focuses on hardirq predication, and softirq isn't involved in that
> > purpose.
> > 
> >>  
> >>> Daniel, could you take a look and see if irq flood detection can be
> >>> implemented easily by irq/timing.c?
> >>
> >> I assume you can take a look as well, right?
> > 
> > Yeah, I have looked at the code for a while, but I think that irq/timing
> > could become complicated unnecessarily for covering irq flood detection,
> > meantime it is much less efficient for detecting IRQ flood.
> 
> In the series, there is nothing describing rigorously the problem (I can
> only guess) and why the proposed solution solves it.
> 
> What is your definition of an 'irq flood'? A high irq load? An irq
> arriving while we are processing the previous one in the bottom halves?

So far, it means that handling interrupt & softirq takes all utilization
of one CPU, then processes can't be run on this CPU basically, usually
sort of CPU lockup warning will be triggered.

> 
> The patch 2/4 description says "however IO completion is only done on
> one of these submission CPU cores". That describes the bottleneck and
> then the patch says "Add IRQF_RESCUE_THREAD to create one interrupt
> thread handler", what is the rational between the bottleneck (problem)
> and the irqf_rescue_thread (solution)?

The solution is to switch to handle this interrupt on the created rescue
irq thread context when irq flood is detected, and 'this interrupt' means
the interrupt requested with IRQF_RESCUE_THREAD.

> 
> Is it really the solution to track the irq timings to detect a flood?

The solution tracks the time taken on running do_IRQ() for each CPU.


Thanks,
Ming

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-03  6:31                     ` Ming Lei
@ 2019-09-03  6:40                       ` Daniel Lezcano
  2019-09-03  7:28                         ` Ming Lei
  0 siblings, 1 reply; 40+ messages in thread
From: Daniel Lezcano @ 2019-09-03  6:40 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Hannes Reinecke, Sagi Grimberg, linux-scsi,
	Peter Zijlstra, Long Li, John Garry, LKML, linux-nvme,
	Keith Busch, Ingo Molnar, Thomas Gleixner, Christoph Hellwig

On 03/09/2019 08:31, Ming Lei wrote:
> Hi Daniel,
> 
> On Tue, Sep 03, 2019 at 07:59:39AM +0200, Daniel Lezcano wrote:
>>
>> Hi Ming Lei,
>>
>> On 03/09/2019 05:30, Ming Lei wrote:
>>
>> [ ... ]
>>
>>
>>>>> 2) irq/timing doesn't cover softirq
>>>>
>>>> That's solvable, right?
>>>
>>> Yeah, we can extend irq/timing, but ugly for irq/timing, since irq/timing
>>> focuses on hardirq predication, and softirq isn't involved in that
>>> purpose.
>>>
>>>>  
>>>>> Daniel, could you take a look and see if irq flood detection can be
>>>>> implemented easily by irq/timing.c?
>>>>
>>>> I assume you can take a look as well, right?
>>>
>>> Yeah, I have looked at the code for a while, but I think that irq/timing
>>> could become complicated unnecessarily for covering irq flood detection,
>>> meantime it is much less efficient for detecting IRQ flood.
>>
>> In the series, there is nothing describing rigorously the problem (I can
>> only guess) and why the proposed solution solves it.
>>
>> What is your definition of an 'irq flood'? A high irq load? An irq
>> arriving while we are processing the previous one in the bottom halves?
> 
> So far, it means that handling interrupt & softirq takes all utilization
> of one CPU, then processes can't be run on this CPU basically, usually
> sort of CPU lockup warning will be triggered.

It is a scheduler problem then ?

>> The patch 2/4 description says "however IO completion is only done on
>> one of these submission CPU cores". That describes the bottleneck and
>> then the patch says "Add IRQF_RESCUE_THREAD to create one interrupt
>> thread handler", what is the rational between the bottleneck (problem)
>> and the irqf_rescue_thread (solution)?
> 
> The solution is to switch to handle this interrupt on the created rescue
> irq thread context when irq flood is detected, and 'this interrupt' means
> the interrupt requested with IRQF_RESCUE_THREAD.
> 
>>
>> Is it really the solution to track the irq timings to detect a flood?
> 
> The solution tracks the time taken on running do_IRQ() for each CPU.




-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-03  6:40                       ` Daniel Lezcano
@ 2019-09-03  7:28                         ` Ming Lei
  2019-09-03  7:50                           ` Daniel Lezcano
  2019-09-03  8:09                           ` Thomas Gleixner
  0 siblings, 2 replies; 40+ messages in thread
From: Ming Lei @ 2019-09-03  7:28 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Jens Axboe, Hannes Reinecke, Sagi Grimberg, linux-scsi,
	Peter Zijlstra, Long Li, John Garry, LKML, linux-nvme,
	Keith Busch, Ingo Molnar, Thomas Gleixner, Christoph Hellwig

On Tue, Sep 03, 2019 at 08:40:35AM +0200, Daniel Lezcano wrote:
> On 03/09/2019 08:31, Ming Lei wrote:
> > Hi Daniel,
> > 
> > On Tue, Sep 03, 2019 at 07:59:39AM +0200, Daniel Lezcano wrote:
> >>
> >> Hi Ming Lei,
> >>
> >> On 03/09/2019 05:30, Ming Lei wrote:
> >>
> >> [ ... ]
> >>
> >>
> >>>>> 2) irq/timing doesn't cover softirq
> >>>>
> >>>> That's solvable, right?
> >>>
> >>> Yeah, we can extend irq/timing, but ugly for irq/timing, since irq/timing
> >>> focuses on hardirq predication, and softirq isn't involved in that
> >>> purpose.
> >>>
> >>>>  
> >>>>> Daniel, could you take a look and see if irq flood detection can be
> >>>>> implemented easily by irq/timing.c?
> >>>>
> >>>> I assume you can take a look as well, right?
> >>>
> >>> Yeah, I have looked at the code for a while, but I think that irq/timing
> >>> could become complicated unnecessarily for covering irq flood detection,
> >>> meantime it is much less efficient for detecting IRQ flood.
> >>
> >> In the series, there is nothing describing rigorously the problem (I can
> >> only guess) and why the proposed solution solves it.
> >>
> >> What is your definition of an 'irq flood'? A high irq load? An irq
> >> arriving while we are processing the previous one in the bottom halves?
> > 
> > So far, it means that handling interrupt & softirq takes all utilization
> > of one CPU, then processes can't be run on this CPU basically, usually
> > sort of CPU lockup warning will be triggered.
> 
> It is a scheduler problem then ?

Scheduler can do nothing if the CPU is taken completely by handling
interrupt & softirq, so seems not a scheduler problem, IMO.


Thanks,
Ming

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-03  7:28                         ` Ming Lei
@ 2019-09-03  7:50                           ` Daniel Lezcano
  2019-09-03  9:30                             ` Ming Lei
  2019-09-04 17:07                             ` Bart Van Assche
  2019-09-03  8:09                           ` Thomas Gleixner
  1 sibling, 2 replies; 40+ messages in thread
From: Daniel Lezcano @ 2019-09-03  7:50 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Hannes Reinecke, Sagi Grimberg, linux-scsi,
	Peter Zijlstra, Long Li, John Garry, LKML, linux-nvme,
	Keith Busch, Ingo Molnar, Thomas Gleixner, Christoph Hellwig

On 03/09/2019 09:28, Ming Lei wrote:
> On Tue, Sep 03, 2019 at 08:40:35AM +0200, Daniel Lezcano wrote:
>> On 03/09/2019 08:31, Ming Lei wrote:
>>> Hi Daniel,
>>>
>>> On Tue, Sep 03, 2019 at 07:59:39AM +0200, Daniel Lezcano wrote:
>>>>
>>>> Hi Ming Lei,
>>>>
>>>> On 03/09/2019 05:30, Ming Lei wrote:
>>>>
>>>> [ ... ]
>>>>
>>>>
>>>>>>> 2) irq/timing doesn't cover softirq
>>>>>>
>>>>>> That's solvable, right?
>>>>>
>>>>> Yeah, we can extend irq/timing, but ugly for irq/timing, since irq/timing
>>>>> focuses on hardirq predication, and softirq isn't involved in that
>>>>> purpose.
>>>>>
>>>>>>  
>>>>>>> Daniel, could you take a look and see if irq flood detection can be
>>>>>>> implemented easily by irq/timing.c?
>>>>>>
>>>>>> I assume you can take a look as well, right?
>>>>>
>>>>> Yeah, I have looked at the code for a while, but I think that irq/timing
>>>>> could become complicated unnecessarily for covering irq flood detection,
>>>>> meantime it is much less efficient for detecting IRQ flood.
>>>>
>>>> In the series, there is nothing describing rigorously the problem (I can
>>>> only guess) and why the proposed solution solves it.
>>>>
>>>> What is your definition of an 'irq flood'? A high irq load? An irq
>>>> arriving while we are processing the previous one in the bottom halves?
>>>
>>> So far, it means that handling interrupt & softirq takes all utilization
>>> of one CPU, then processes can't be run on this CPU basically, usually
>>> sort of CPU lockup warning will be triggered.
>>
>> It is a scheduler problem then ?
> 
> Scheduler can do nothing if the CPU is taken completely by handling
> interrupt & softirq, so seems not a scheduler problem, IMO.

Why? If there is a irq pressure on one CPU reducing its capacity, the
scheduler will balance the tasks on another CPU, no?



-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-03  7:28                         ` Ming Lei
  2019-09-03  7:50                           ` Daniel Lezcano
@ 2019-09-03  8:09                           ` Thomas Gleixner
  2019-09-03  9:24                             ` Ming Lei
  1 sibling, 1 reply; 40+ messages in thread
From: Thomas Gleixner @ 2019-09-03  8:09 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Hannes Reinecke, John Garry, Sagi Grimberg,
	linux-scsi, Peter Zijlstra, Long Li, Daniel Lezcano, LKML,
	linux-nvme, Keith Busch, Ingo Molnar, Christoph Hellwig

On Tue, 3 Sep 2019, Ming Lei wrote:
> Scheduler can do nothing if the CPU is taken completely by handling
> interrupt & softirq, so seems not a scheduler problem, IMO.

Well, but thinking more about it, the solution you are proposing is more a
bandaid than anything else.

If you look at the networking NAPI mechanism. It handles that situation
gracefully by:

  - Disabling the interrupt at the device level

  - Polling the device in softirq context until empty and then reenabling
    interrupts

  - In case the softirq handles more packets than a defined budget it
    forces the softirq into the softirqd thread context which also
    allows rescheduling once the budget is completed.

With your adhoc workaround you handle one specific case. But it does not
work at all when an overload situation occurs in a case where the queues
are truly per cpu simply. Because then the interrupt and the thread
affinity are the same and single CPU targets and you replace the interrupt
with a threaded handler which runs by default with RT priority.

So instead of hacking something half baken into the hard/softirq code, why
can't block do a budget limitation and once that is reached switch to
something NAPI like as a general solution?

Thanks,

	tglx

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-03  8:09                           ` Thomas Gleixner
@ 2019-09-03  9:24                             ` Ming Lei
  0 siblings, 0 replies; 40+ messages in thread
From: Ming Lei @ 2019-09-03  9:24 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jens Axboe, Hannes Reinecke, John Garry, Sagi Grimberg,
	linux-scsi, Peter Zijlstra, Long Li, Daniel Lezcano, LKML,
	linux-nvme, Keith Busch, Ingo Molnar, Christoph Hellwig

On Tue, Sep 03, 2019 at 10:09:57AM +0200, Thomas Gleixner wrote:
> On Tue, 3 Sep 2019, Ming Lei wrote:
> > Scheduler can do nothing if the CPU is taken completely by handling
> > interrupt & softirq, so seems not a scheduler problem, IMO.
> 
> Well, but thinking more about it, the solution you are proposing is more a
> bandaid than anything else.
> 
> If you look at the networking NAPI mechanism. It handles that situation
> gracefully by:
> 
>   - Disabling the interrupt at the device level

I guess you mean we disable the interrupt in the softirq context.

IO performance could be affected by the extra action of disabling/enabling
interrupt every time.

IOPS for the discussed device is several millions.

> 
>   - Polling the device in softirq context until empty and then reenabling
>     interrupts

blk-mq switches to complete req in interrupt context for avoiding extra
performance loss, so switching back to softirq context every time may
cause performance regression.

> 
>   - In case the softirq handles more packets than a defined budget it
>     forces the softirq into the softirqd thread context which also
>     allows rescheduling once the budget is completed.


It can be hard to figure out one perfect defined budget.

In the patchset of V2[1], IRQF_ONESHOT is applied on the irq thread,
and interrupt isn't enabled until the interrupt has been handled in
the irq thread context.

[1] https://github.com/ming1/linux/commits/v5.3-genirq-for-5.4

The approach in this patchset is actually very similar with the above
NAPI based way. The difference is that softirq is avoided, and interrupt
is always handled in interrupt context in case that CPU won't be stalled,
so performance won't be affected. And we only switch to handle interrupt
in thread context if CPU stall is going to happen.

> 
> With your adhoc workaround you handle one specific case. But it does not
> work at all when an overload situation occurs in a case where the queues
> are truly per cpu simply.

There isn't such CPU stall issue in case of single submission vs. single
completion, because submission side and completion side share same single
CPU, and the submission side will slow down if completion side takes all
the CPU. 

> Because then the interrupt and the thread
> affinity are the same and single CPU targets and you replace the interrupt
> with a threaded handler which runs by default with RT priority.

Even though the threaded handler is RT priority and the thread is run
on same CPU with the interrupt, CPU/rcu stall still can be avoided.

Also we can switch to use irq affinity for the irq thread instead of effective
affinity.

> 
> So instead of hacking something half baken into the hard/softirq code, why
> can't block do a budget limitation and once that is reached switch to
> something NAPI like as a general solution?

Another big reason is that multiple submission vs. single completion isn't
common case, I knew that there are only small number of such device,
so re-inventing NAPI based approach may takes lots of effort, meantime
only small number of devices can get the benefit, not sure if block
community would like to consider that.

IMO, it might be the simplest generic way to solve the problem from genirq.


Thanks,
Ming

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-03  7:50                           ` Daniel Lezcano
@ 2019-09-03  9:30                             ` Ming Lei
  2019-09-04 17:07                             ` Bart Van Assche
  1 sibling, 0 replies; 40+ messages in thread
From: Ming Lei @ 2019-09-03  9:30 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Keith Busch, Hannes Reinecke, Sagi Grimberg, linux-scsi,
	Peter Zijlstra, Long Li, John Garry, LKML, linux-nvme,
	Jens Axboe, Ingo Molnar, Thomas Gleixner, Christoph Hellwig

On Tue, Sep 03, 2019 at 09:50:06AM +0200, Daniel Lezcano wrote:
> On 03/09/2019 09:28, Ming Lei wrote:
> > On Tue, Sep 03, 2019 at 08:40:35AM +0200, Daniel Lezcano wrote:
> >> On 03/09/2019 08:31, Ming Lei wrote:
> >>> Hi Daniel,
> >>>
> >>> On Tue, Sep 03, 2019 at 07:59:39AM +0200, Daniel Lezcano wrote:
> >>>>
> >>>> Hi Ming Lei,
> >>>>
> >>>> On 03/09/2019 05:30, Ming Lei wrote:
> >>>>
> >>>> [ ... ]
> >>>>
> >>>>
> >>>>>>> 2) irq/timing doesn't cover softirq
> >>>>>>
> >>>>>> That's solvable, right?
> >>>>>
> >>>>> Yeah, we can extend irq/timing, but ugly for irq/timing, since irq/timing
> >>>>> focuses on hardirq predication, and softirq isn't involved in that
> >>>>> purpose.
> >>>>>
> >>>>>>  
> >>>>>>> Daniel, could you take a look and see if irq flood detection can be
> >>>>>>> implemented easily by irq/timing.c?
> >>>>>>
> >>>>>> I assume you can take a look as well, right?
> >>>>>
> >>>>> Yeah, I have looked at the code for a while, but I think that irq/timing
> >>>>> could become complicated unnecessarily for covering irq flood detection,
> >>>>> meantime it is much less efficient for detecting IRQ flood.
> >>>>
> >>>> In the series, there is nothing describing rigorously the problem (I can
> >>>> only guess) and why the proposed solution solves it.
> >>>>
> >>>> What is your definition of an 'irq flood'? A high irq load? An irq
> >>>> arriving while we are processing the previous one in the bottom halves?
> >>>
> >>> So far, it means that handling interrupt & softirq takes all utilization
> >>> of one CPU, then processes can't be run on this CPU basically, usually
> >>> sort of CPU lockup warning will be triggered.
> >>
> >> It is a scheduler problem then ?
> > 
> > Scheduler can do nothing if the CPU is taken completely by handling
> > interrupt & softirq, so seems not a scheduler problem, IMO.
> 
> Why? If there is a irq pressure on one CPU reducing its capacity, the
> scheduler will balance the tasks on another CPU, no?

For example, the following two kinds[1][2] of warning can be triggered
easily when heavy fio test is run on NVMe. The 1st one could be that
network irq is affected by the nvme irq flood. The 2nd one is rcu_sched stall.

[1]
[  186.531069] NETDEV WATCHDOG: enP48661s1 (mlx4_core): transmit queue 12 timed out
[  186.545259] WARNING: CPU: 9 PID: 0 at net/sched/sch_generic.c:443 dev_watchdog+0x252/0x260
[  186.546222] Modules linked in: xt_owner ext4 mbcache jbd2 ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter dm_mirror dm_region_hash dm_log dm_mod crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel glue_helper crypto_simd cryptd mlx4_en mlx4_core nvme nvme_core hv_utils sg pcspkr i2c_piix4 ptp hv_balloon pci_hyperv pps_core joydev ip_tables xfs libcrc32c sd_mod ata_generic pata_acpi hv_storvsc hv_netvsc hyperv_keyboard hid_hyperv scsi_transport_fc ata_piix hyperv_fb crc32c_intel libata hv_vmbus floppy serio_raw
[  186.546222] CPU: 9 PID: 0 Comm: swapper/9 Not tainted 5.3.0-rc6+ #17
[  186.546222] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007  05/18/2018
[  186.546222] RIP: 0010:dev_watchdog+0x252/0x260
[  186.546222] Code: c0 75 e4 eb 98 4c 89 ef c6 05 38 ca c7 00 01 e8 04 93 fb ff 89 d9 48 89 c2 4c 89 ee 48 c7 c7 f0 9e 54 be 31 c0 e8 4e 14 95 ff <0f> 0b e9 75 ff ff ff 0f 1f 80 00 00 00 00 0f 1f 44 00 00 41 57 ba
[  186.546222] RSP: 0018:ffffae5518ee0e88 EFLAGS: 00010282
[  186.546222] RAX: 0000000000000000 RBX: 000000000000000c RCX: 0000000000000000
[  186.546222] RDX: 0000000000000001 RSI: ffff9d6b5f8577b8 RDI: ffff9d6b5f8577b8
[  186.546222] RBP: ffff9d5b4c300440 R08: 0000000000000363 R09: 0000000000000363
[  186.546222] R10: 0000000000000001 R11: 0000000000aaaaaa R12: 0000000000000100
[  186.546222] R13: ffff9d5b4c300000 R14: ffff9d5b4c30041c R15: 0000000000000009
[  186.546222] FS:  0000000000000000(0000) GS:ffff9d6b5f840000(0000) knlGS:0000000000000000
[  186.546222] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  186.546222] CR2: 00007f6324aa19b8 CR3: 000000308981c000 CR4: 00000000003406e0
[  186.546222] Call Trace:
[  186.546222]  <IRQ>
[  186.546222]  ? pfifo_fast_enqueue+0x110/0x110
[  186.546222]  call_timer_fn+0x2f/0x140
[  186.546222]  run_timer_softirq+0x1f6/0x460
[  186.546222]  ? timerqueue_add+0x54/0x80
[  186.546222]  ? enqueue_hrtimer+0x38/0x90
[  186.546222]  __do_softirq+0xda/0x2a8
[  186.546222]  irq_exit+0xc5/0xd0
[  186.546222]  hv_stimer0_vector_handler+0x5a/0x70
[  186.546222]  hv_stimer0_callback_vector+0xf/0x20
[  186.546222]  </IRQ>
[  186.546222] RIP: 0010:native_safe_halt+0xe/0x10
[  186.546222] Code: 01 00 f0 80 48 02 20 48 8b 00 a8 08 0f 84 7a ff ff ff eb bc 90 90 90 90 90 90 90 90 e9 07 00 00 00 0f 00 2d a6 74 59 00 fb f4 <c3> 90 e9 07 00 00 00 0f 00 2d 96 74 59 00 f4 c3 90 90 0f 1f 44 00
[  186.546222] RSP: 0018:ffffae5518993eb0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff12
[  186.546222] RAX: ffffffffbdc72970 RBX: ffff9d5ba5b79700 RCX: 0000000000000000
[  186.546222] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000002af36665fe
[  186.546222] RBP: 0000000000000009 R08: 0000000000000000 R09: fffffffe56427f06
[  186.546222] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
[  186.546222] R13: 0000000000000000 R14: ffff9d5ba5b79700 R15: ffff9d5ba5b79700
[  186.546222]  ? __cpuidle_text_start+0x8/0x8
[  186.546222]  default_idle+0x1a/0x140
[  186.546222]  do_idle+0x1a6/0x290
[  186.546222]  cpu_startup_entry+0x19/0x20
[  186.546222]  start_secondary+0x166/0x1c0
[  186.546222]  secondary_startup_64+0xa4/0xb0
[  186.546222] ---[ end trace 8d7ef273033e5d7a ]---

[2]
[  247.296117] rcu: INFO: rcu_sched self-detected stall on CPU
[  247.441146] rcu: 	64-....: (1 GPs behind) idle=1fe/1/0x4000000000000002 softirq=498/499 fqs=13379 
[  247.588155] 	(t=60291 jiffies g=6261 q=16750)
[  247.726156] NMI backtrace for cpu 64
[  247.862150] CPU: 64 PID: 7772 Comm: fio Tainted: G        W         5.3.0-rc6+ #17
[  248.004148] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007  05/18/2018
[  248.150158] Call Trace:
[  248.279195]  <IRQ>
[  248.401151]  dump_stack+0x5a/0x73
[  248.522126]  nmi_cpu_backtrace+0x90/0xa0
[  248.640147]  ? lapic_can_unplug_cpu+0xa0/0xa0
[  248.757151]  nmi_trigger_cpumask_backtrace+0x6c/0x110
[  248.871145]  rcu_dump_cpu_stacks+0x8a/0xb8
[  248.980152]  rcu_sched_clock_irq+0x55a/0x7a0
[  249.089152]  ? smp_call_function_single_async+0x1f/0x40
[  249.197151]  ? blk_mq_complete_request+0x8e/0xf0
[  249.301157]  ? tick_sched_do_timer+0x80/0x80
[  249.404177]  update_process_times+0x28/0x50
[  249.505152]  tick_sched_handle+0x25/0x60
[  249.605151]  tick_sched_timer+0x37/0x70
[  249.705149]  __hrtimer_run_queues+0xfb/0x270
[  249.804160]  hrtimer_interrupt+0x122/0x270
[  249.904166]  hv_stimer0_vector_handler+0x33/0x70
[  250.004148]  hv_stimer0_callback_vector+0xf/0x20
[  250.102144]  </IRQ>
[  250.195164] RIP: 0010:nvme_queue_rq+0x42/0x710 [nvme]
[  250.294147] Code: 68 48 8b a8 c8 00 00 00 48 8b 97 b8 00 00 00 48 8b 1e 65 48 8b 0c 25 28 00 00 00 48 89 4c 24 60 31 c9 48 8b 7a 70 4c 8b 6d 00 <c7> 83 3c 01 00 00 00 00 00 00 c7 83 40 01 00 00 ff ff ff ff c7 83
[  250.518141] RSP: 0018:ffffae551c7ef8f8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff12
[  250.630150] RAX: ffff9d5b49fb6c00 RBX: ffff9d5b5aaf4500 RCX: 0000000000000000
[  250.743121] RDX: ffff9d5b4a719020 RSI: ffffae551c7ef9a8 RDI: ffff9d5b5e8e2e80
[  250.856148] RBP: ffff9d5b4ba20100 R08: 0000000000000001 R09: ffff9d4bc51ef080
[  250.970154] R10: 000000000000007f R11: ffff9d5b4a719020 R12: ffffae551c7ef9a8
[  251.084177] R13: ffff9d5b5e430000 R14: ffffae551c7efab0 R15: ffffce44ffa0bbc0
[  251.200157]  ? __sbitmap_get_word+0x2a/0x80
[  251.311146]  ? sbitmap_get+0x61/0x130
[  251.421145]  __blk_mq_try_issue_directly+0x143/0x1f0
[  251.533144]  blk_mq_request_issue_directly+0x4d/0xa0
[  251.646148]  blk_mq_try_issue_list_directly+0x51/0xc0
[  251.758155]  blk_mq_sched_insert_requests+0xbb/0x110
[  251.865149]  blk_mq_flush_plug_list+0x191/0x2c0
[  251.971149]  blk_flush_plug_list+0xcc/0xf0
[  252.074146]  blk_finish_plug+0x27/0x40
[  252.177154]  blkdev_direct_IO+0x410/0x4a0
[  252.280155]  generic_file_read_iter+0xb1/0xc70
[  252.384166]  ? common_interrupt+0xa/0xf
[  252.486149]  ? _cond_resched+0x15/0x30
[  252.588168]  aio_read+0xf6/0x150
[  252.688145]  ? common_interrupt+0xa/0xf
[  252.789147]  ? __switch_to_asm+0x40/0x70
[  252.892151]  ? __switch_to_asm+0x34/0x70
[  252.994125]  ? __check_object_size+0x75/0x1b0
[  253.097148]  ? _copy_to_user+0x22/0x30
[  253.198147]  ? aio_read_events+0x279/0x300
[  253.298149]  ? kmem_cache_alloc+0x15c/0x200
[  253.396155]  io_submit_one+0x199/0xb50
[  253.492163]  ? read_events+0x5c/0x160
[  253.588152]  ? __switch_to_asm+0x40/0x70
[  253.683146]  ? __switch_to_asm+0x34/0x70
[  253.773145]  ? __switch_to_asm+0x40/0x70
[  253.863148]  ? __switch_to_asm+0x34/0x70
[  253.950431]  ? __switch_to_asm+0x40/0x70
[  254.037146]  __x64_sys_io_submit+0xb3/0x1a0
[  254.125153]  ? __audit_syscall_exit+0x1d9/0x280
[  254.214149]  do_syscall_64+0x5b/0x1d0
[  254.300154]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  254.390145] RIP: 0033:0x7f63580f1697
[  254.478137] Code: 49 83 38 00 75 eb 49 83 78 08 00 75 e4 8b 47 0c 39 47 08 75 dc 31 c0 c3 66 2e 0f 1f 84 00 00 00 00 00 90 b8 d1 00 00 00 0f 05 <c3> 0f 1f 84 00 00 00 00 00 b8 d2 00 00 00 0f 05 c3 0f 1f 84 00 00
[  254.682145] RSP: 002b:00007ffe5fee3d18 EFLAGS: 00000293 ORIG_RAX: 00000000000000d1
[  254.784142] RAX: ffffffffffffffda RBX: 0000000001defb90 RCX: 00007f63580f1697
[  254.885146] RDX: 0000000001defdf8 RSI: 0000000000000001 RDI: 00007f636c142000
[  254.989153] RBP: 0000000000000000 R08: 0000000000000010 R09: 0000000001dee3e0
[  255.092126] R10: 0000000001dde000 R11: 0000000000000293 R12: 0000000000000001
[  255.194148] R13: 0000000000000018 R14: 00007f632cb45980 R15: 0000000001defe70
[  299.228276] nvme nvme7: I/O 220 QID 2 timeout, aborting
[  299.298199] nvme nvme7: Abort status: 0x0
[  300.425912] mlx4_en: enP48661s1: Steering Mode 2
[  438.694592] rcu: INFO: rcu_sched self-detected stall on CPU
[  438.783121] rcu: 	77-....: (1 GPs behind) idle=c1e/1/0x4000000000000004 softirq=795/796 fqs=12978 
[  438.929120] 	(t=60235 jiffies g=6541 q=3725)
[  439.029124] NMI backtrace for cpu 77
[  439.130120] CPU: 77 PID: 7718 Comm: fio Tainted: G        W         5.3.0-rc6+ #17
[  439.245123] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007  05/18/2018
[  439.357125] Call Trace:
[  439.467157]  <IRQ>
[  439.522121]  dump_stack+0x5a/0x73
[  439.670123]  nmi_cpu_backtrace+0x90/0xa0
[  439.767124]  ? lapic_can_unplug_cpu+0xa0/0xa0
[  439.877119]  nmi_trigger_cpumask_backtrace+0x6c/0x110
[  439.987122]  rcu_dump_cpu_stacks+0x8a/0xb8
[  440.087121]  rcu_sched_clock_irq+0x55a/0x7a0
[  440.186126]  ? smp_call_function_single_async+0x1f/0x40
[  440.293125]  ? blk_mq_complete_request+0x8e/0xf0
[  440.399122]  ? tick_sched_do_timer+0x80/0x80
[  440.505118]  update_process_times+0x28/0x50
[  440.607119]  tick_sched_handle+0x25/0x60
[  440.709122]  tick_sched_timer+0x37/0x70
[  440.814119]  __hrtimer_run_queues+0xfb/0x270
[  440.919122]  hrtimer_interrupt+0x122/0x270
[  441.024116]  hv_stimer0_vector_handler+0x33/0x70
[  441.108126]  hv_stimer0_callback_vector+0xf/0x20
[  441.220121] RIP: 0010:__do_softirq+0x77/0x2a8
[  441.319130] Code: 0f b7 ca 89 4c 24 04 c7 44 24 20 0a 00 00 00 48 89 44 24 08 48 89 44 24 18 65 66 c7 05 d0 a5 02 42 00 00 fb 66 0f 1f 44 00 00 <b8> ff ff ff ff 48 c7 c5 00 51 60 be 0f bc 44 24 04 83 c0 01 89 04
[  441.552119] RSP: 0018:ffffae5519cc0f78 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff12
[  441.698119] RAX: ffff9d7b5ac7dc00 RBX: 0000000000000000 RCX: 0000000000000002
[  441.826119] RDX: ffff9ddba50d0002 RSI: 0000000000000000 RDI: 0000000000000000
[  441.943122] RBP: 0000000000000000 R08: 0000000000000000 R09: 01484a5406289ee5
[  442.052121] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  442.180121] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[  442.307119]  ? hv_stimer0_callback_vector+0xa/0x20
[  442.423123]  ? clockevents_program_event+0x79/0xf0
[  442.543117]  irq_exit+0xc5/0xd0
[  442.655124]  hv_stimer0_vector_handler+0x5a/0x70
[  442.783119]  hv_stimer0_callback_vector+0xf/0x20
[  442.896120]  </IRQ>
[  443.004119] RIP: 0010:__blk_mq_try_issue_directly+0x157/0x1f0


thanks,
Ming

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-03  7:50                           ` Daniel Lezcano
  2019-09-03  9:30                             ` Ming Lei
@ 2019-09-04 17:07                             ` Bart Van Assche
  2019-09-04 17:31                               ` Daniel Lezcano
  1 sibling, 1 reply; 40+ messages in thread
From: Bart Van Assche @ 2019-09-04 17:07 UTC (permalink / raw)
  To: Daniel Lezcano, Ming Lei
  Cc: Keith Busch, Hannes Reinecke, Sagi Grimberg, linux-scsi,
	Peter Zijlstra, Long Li, John Garry, LKML, linux-nvme,
	Jens Axboe, Ingo Molnar, Thomas Gleixner, Christoph Hellwig

On 9/3/19 12:50 AM, Daniel Lezcano wrote:
> On 03/09/2019 09:28, Ming Lei wrote:
>> On Tue, Sep 03, 2019 at 08:40:35AM +0200, Daniel Lezcano wrote:
>>> It is a scheduler problem then ?
>>
>> Scheduler can do nothing if the CPU is taken completely by handling
>> interrupt & softirq, so seems not a scheduler problem, IMO.
> 
> Why? If there is a irq pressure on one CPU reducing its capacity, the
> scheduler will balance the tasks on another CPU, no?

Only if CONFIG_IRQ_TIME_ACCOUNTING has been enabled. However, I don't 
know any Linux distro that enables that option. That's probably because 
that option introduces two rdtsc() calls in each interrupt. Given the 
overhead introduced by this option, I don't think this is the solution 
Ming is looking for.

See also irqtime_account_irq() in kernel/sched/cputime.c.

Bart.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-04 17:07                             ` Bart Van Assche
@ 2019-09-04 17:31                               ` Daniel Lezcano
  2019-09-04 17:38                                 ` Bart Van Assche
  2019-09-05  9:06                                 ` Ming Lei
  0 siblings, 2 replies; 40+ messages in thread
From: Daniel Lezcano @ 2019-09-04 17:31 UTC (permalink / raw)
  To: Bart Van Assche, Ming Lei
  Cc: Keith Busch, Hannes Reinecke, Sagi Grimberg, linux-scsi,
	Peter Zijlstra, Long Li, John Garry, LKML, linux-nvme,
	Jens Axboe, Ingo Molnar, Thomas Gleixner, Christoph Hellwig

Hi,

On 04/09/2019 19:07, Bart Van Assche wrote:
> On 9/3/19 12:50 AM, Daniel Lezcano wrote:
>> On 03/09/2019 09:28, Ming Lei wrote:
>>> On Tue, Sep 03, 2019 at 08:40:35AM +0200, Daniel Lezcano wrote:
>>>> It is a scheduler problem then ?
>>>
>>> Scheduler can do nothing if the CPU is taken completely by handling
>>> interrupt & softirq, so seems not a scheduler problem, IMO.
>>
>> Why? If there is a irq pressure on one CPU reducing its capacity, the
>> scheduler will balance the tasks on another CPU, no?
> 
> Only if CONFIG_IRQ_TIME_ACCOUNTING has been enabled. However, I don't
> know any Linux distro that enables that option. That's probably because
> that option introduces two rdtsc() calls in each interrupt. Given the
> overhead introduced by this option, I don't think this is the solution
> Ming is looking for.

Was this overhead reported somewhere ?

> See also irqtime_account_irq() in kernel/sched/cputime.c.

From my POV, this framework could be interesting to detect this situation.


-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-04 17:31                               ` Daniel Lezcano
@ 2019-09-04 17:38                                 ` Bart Van Assche
  2019-09-04 18:02                                   ` Peter Zijlstra
  2019-09-05  9:06                                 ` Ming Lei
  1 sibling, 1 reply; 40+ messages in thread
From: Bart Van Assche @ 2019-09-04 17:38 UTC (permalink / raw)
  To: Daniel Lezcano, Ming Lei
  Cc: Jens Axboe, Hannes Reinecke, Sagi Grimberg, linux-scsi,
	Peter Zijlstra, Long Li, John Garry, LKML, linux-nvme,
	Keith Busch, Ingo Molnar, Thomas Gleixner, Christoph Hellwig

On 9/4/19 10:31 AM, Daniel Lezcano wrote:
> On 04/09/2019 19:07, Bart Van Assche wrote:
>> Only if CONFIG_IRQ_TIME_ACCOUNTING has been enabled. However, I don't
>> know any Linux distro that enables that option. That's probably because
>> that option introduces two rdtsc() calls in each interrupt. Given the
>> overhead introduced by this option, I don't think this is the solution
>> Ming is looking for.
> 
> Was this overhead reported somewhere ?
  I think it is widely known that rdtsc is a relatively slow x86 
instruction. So I expect that using that instruction will cause a 
measurable overhead if it is called frequently enough. I'm not aware of 
any publicly available measurement data however.

Bart.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-04 17:38                                 ` Bart Van Assche
@ 2019-09-04 18:02                                   ` Peter Zijlstra
  2019-09-04 19:47                                     ` Bart Van Assche
  0 siblings, 1 reply; 40+ messages in thread
From: Peter Zijlstra @ 2019-09-04 18:02 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, Hannes Reinecke, John Garry, Sagi Grimberg,
	linux-scsi, Long Li, Daniel Lezcano, LKML, linux-nvme, Ming Lei,
	Keith Busch, Ingo Molnar, Thomas Gleixner, Christoph Hellwig

On Wed, Sep 04, 2019 at 10:38:59AM -0700, Bart Van Assche wrote:
> On 9/4/19 10:31 AM, Daniel Lezcano wrote:
> > On 04/09/2019 19:07, Bart Van Assche wrote:
> > > Only if CONFIG_IRQ_TIME_ACCOUNTING has been enabled. However, I don't
> > > know any Linux distro that enables that option. That's probably because
> > > that option introduces two rdtsc() calls in each interrupt. Given the
> > > overhead introduced by this option, I don't think this is the solution
> > > Ming is looking for.
> > 
> > Was this overhead reported somewhere ?
>  I think it is widely known that rdtsc is a relatively slow x86 instruction.
> So I expect that using that instruction will cause a measurable overhead if
> it is called frequently enough. I'm not aware of any publicly available
> measurement data however.

https://www.agner.org/optimize/instruction_tables.pdf

RDTSC, Ryzen: ~36
RDTSC, Skylake: ~20

Sadly those same tables don't list the cost of actual exceptions or even
IRET :/

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-04 18:02                                   ` Peter Zijlstra
@ 2019-09-04 19:47                                     ` Bart Van Assche
  2019-09-05  9:11                                       ` Ming Lei
  0 siblings, 1 reply; 40+ messages in thread
From: Bart Van Assche @ 2019-09-04 19:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jens Axboe, Hannes Reinecke, John Garry, Sagi Grimberg,
	linux-scsi, Long Li, Daniel Lezcano, LKML, linux-nvme, Ming Lei,
	Keith Busch, Ingo Molnar, Thomas Gleixner, Christoph Hellwig

On 9/4/19 11:02 AM, Peter Zijlstra wrote:
> On Wed, Sep 04, 2019 at 10:38:59AM -0700, Bart Van Assche wrote:
>> I think it is widely known that rdtsc is a relatively slow x86 instruction.
>> So I expect that using that instruction will cause a measurable overhead if
>> it is called frequently enough. I'm not aware of any publicly available
>> measurement data however.
> 
> https://www.agner.org/optimize/instruction_tables.pdf
> 
> RDTSC, Ryzen: ~36
> RDTSC, Skylake: ~20
> 
> Sadly those same tables don't list the cost of actual exceptions or even
> IRET :/

Thanks Peter for having looked up these numbers. These numbers are much 
better than last time I checked. Ming, would CONFIG_IRQ_TIME_ACCOUNTING 
help your workload?

Does anyone know which CPUs the following text from 
Documentation/admin-guide/kernel-parameters.txt refers to?

tsc=		[ ... ]
		[x86] noirqtime: Do not use TSC to do irq accounting.
		Used to run time disable IRQ_TIME_ACCOUNTING on any
		platforms where RDTSC is slow and this accounting
		can add overhead.

Bart.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-04 17:31                               ` Daniel Lezcano
  2019-09-04 17:38                                 ` Bart Van Assche
@ 2019-09-05  9:06                                 ` Ming Lei
  2019-09-05 10:37                                   ` Daniel Lezcano
  1 sibling, 1 reply; 40+ messages in thread
From: Ming Lei @ 2019-09-05  9:06 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Keith Busch, Hannes Reinecke, Bart Van Assche, linux-scsi,
	Peter Zijlstra, Long Li, John Garry, LKML, linux-nvme,
	Jens Axboe, Ingo Molnar, Thomas Gleixner, Christoph Hellwig,
	Sagi Grimberg

On Wed, Sep 04, 2019 at 07:31:48PM +0200, Daniel Lezcano wrote:
> Hi,
> 
> On 04/09/2019 19:07, Bart Van Assche wrote:
> > On 9/3/19 12:50 AM, Daniel Lezcano wrote:
> >> On 03/09/2019 09:28, Ming Lei wrote:
> >>> On Tue, Sep 03, 2019 at 08:40:35AM +0200, Daniel Lezcano wrote:
> >>>> It is a scheduler problem then ?
> >>>
> >>> Scheduler can do nothing if the CPU is taken completely by handling
> >>> interrupt & softirq, so seems not a scheduler problem, IMO.
> >>
> >> Why? If there is a irq pressure on one CPU reducing its capacity, the
> >> scheduler will balance the tasks on another CPU, no?
> > 
> > Only if CONFIG_IRQ_TIME_ACCOUNTING has been enabled. However, I don't
> > know any Linux distro that enables that option. That's probably because
> > that option introduces two rdtsc() calls in each interrupt. Given the
> > overhead introduced by this option, I don't think this is the solution
> > Ming is looking for.
> 
> Was this overhead reported somewhere ?

The syscall of gettimeofday() calls ktime_get_real_ts64() which finally
calls tk_clock_read() which calls rdtsc too.

But gettimeofday() is often used in fast path, and block IO_STAT needs to
read it too.

> 
> > See also irqtime_account_irq() in kernel/sched/cputime.c.
> 
> From my POV, this framework could be interesting to detect this situation.

Now we are talking about IRQ_TIME_ACCOUNTING instead of IRQ_TIMINGS, and the
former one could be used to implement the detection. And the only sharing
should be the read of timestamp.

Thanks,
Ming

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-04 19:47                                     ` Bart Van Assche
@ 2019-09-05  9:11                                       ` Ming Lei
  0 siblings, 0 replies; 40+ messages in thread
From: Ming Lei @ 2019-09-05  9:11 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, Hannes Reinecke, John Garry, Sagi Grimberg,
	linux-scsi, Peter Zijlstra, Long Li, Daniel Lezcano, LKML,
	linux-nvme, Keith Busch, Ingo Molnar, Thomas Gleixner,
	Christoph Hellwig

On Wed, Sep 04, 2019 at 12:47:13PM -0700, Bart Van Assche wrote:
> On 9/4/19 11:02 AM, Peter Zijlstra wrote:
> > On Wed, Sep 04, 2019 at 10:38:59AM -0700, Bart Van Assche wrote:
> > > I think it is widely known that rdtsc is a relatively slow x86 instruction.
> > > So I expect that using that instruction will cause a measurable overhead if
> > > it is called frequently enough. I'm not aware of any publicly available
> > > measurement data however.
> > 
> > https://www.agner.org/optimize/instruction_tables.pdf
> > 
> > RDTSC, Ryzen: ~36
> > RDTSC, Skylake: ~20
> > 
> > Sadly those same tables don't list the cost of actual exceptions or even
> > IRET :/
> 
> Thanks Peter for having looked up these numbers. These numbers are much
> better than last time I checked. Ming, would CONFIG_IRQ_TIME_ACCOUNTING help
> your workload?

In my fio test on azure L80sv2, IRQ_TIME_ACCOUNTING isn't enabled.
However the irq flood detection introduces two RDTSC for each do_IRQ(),
not see obvious IOPS difference.

Thanks,
Ming

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-05  9:06                                 ` Ming Lei
@ 2019-09-05 10:37                                   ` Daniel Lezcano
  2019-09-06  1:22                                     ` Long Li
  2019-09-06  1:48                                     ` Ming Lei
  0 siblings, 2 replies; 40+ messages in thread
From: Daniel Lezcano @ 2019-09-05 10:37 UTC (permalink / raw)
  To: Ming Lei
  Cc: Keith Busch, Hannes Reinecke, Bart Van Assche, linux-scsi,
	Peter Zijlstra, Long Li, John Garry, LKML, linux-nvme,
	Jens Axboe, Ingo Molnar, Thomas Gleixner, Christoph Hellwig,
	Sagi Grimberg


Hi Ming,

On 05/09/2019 11:06, Ming Lei wrote:
> On Wed, Sep 04, 2019 at 07:31:48PM +0200, Daniel Lezcano wrote:
>> Hi,
>>
>> On 04/09/2019 19:07, Bart Van Assche wrote:
>>> On 9/3/19 12:50 AM, Daniel Lezcano wrote:
>>>> On 03/09/2019 09:28, Ming Lei wrote:
>>>>> On Tue, Sep 03, 2019 at 08:40:35AM +0200, Daniel Lezcano wrote:
>>>>>> It is a scheduler problem then ?
>>>>>
>>>>> Scheduler can do nothing if the CPU is taken completely by handling
>>>>> interrupt & softirq, so seems not a scheduler problem, IMO.
>>>>
>>>> Why? If there is a irq pressure on one CPU reducing its capacity, the
>>>> scheduler will balance the tasks on another CPU, no?
>>>
>>> Only if CONFIG_IRQ_TIME_ACCOUNTING has been enabled. However, I don't
>>> know any Linux distro that enables that option. That's probably because
>>> that option introduces two rdtsc() calls in each interrupt. Given the
>>> overhead introduced by this option, I don't think this is the solution
>>> Ming is looking for.
>>
>> Was this overhead reported somewhere ?
> 
> The syscall of gettimeofday() calls ktime_get_real_ts64() which finally
> calls tk_clock_read() which calls rdtsc too.
> 
> But gettimeofday() is often used in fast path, and block IO_STAT needs to
> read it too.
> 
>>
>>> See also irqtime_account_irq() in kernel/sched/cputime.c.
>>
>> From my POV, this framework could be interesting to detect this situation.
> 
> Now we are talking about IRQ_TIME_ACCOUNTING instead of IRQ_TIMINGS, and the
> former one could be used to implement the detection. And the only sharing
> should be the read of timestamp.

You did not share yet the analysis of the problem (the kernel warnings
give the symptoms) and gave the reasoning for the solution. It is hard
to understand what you are looking for exactly and how to connect the dots.

AFAIU, there are fast medium where the responses to requests are faster
than the time to process them, right?

I don't see how detecting IRQ flooding and use a threaded irq is the
solution, can you explain?

If the responses are coming at a very high rate, whatever the solution
(interrupts, threaded interrupts, polling), we are still in the same
situation.

My suggestion was initially to see if the interrupt load will be taken
into accounts in the cpu load and favorize task migration with the
scheduler load balance to a less loaded CPU, thus the CPU processing
interrupts will end up doing only that while other CPUs will handle the
"threaded" side.

Beside that, I'm wondering if the block scheduler should be somehow
involved in that [1]

  -- Daniel

[1]
https://www.linaro.org/blog/io-bandwidth-management-for-production-quality-services/



-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-05 10:37                                   ` Daniel Lezcano
@ 2019-09-06  1:22                                     ` Long Li
  2019-09-06  4:36                                       ` Daniel Lezcano
  2019-09-06  1:48                                     ` Ming Lei
  1 sibling, 1 reply; 40+ messages in thread
From: Long Li @ 2019-09-06  1:22 UTC (permalink / raw)
  To: Daniel Lezcano, Ming Lei
  Cc: Keith Busch, Hannes Reinecke, Bart Van Assche, linux-scsi,
	Peter Zijlstra, John Garry, LKML, linux-nvme, Jens Axboe,
	Ingo Molnar, Thomas Gleixner, Christoph Hellwig, Sagi Grimberg

>Subject: Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
>
>
>Hi Ming,
>
>On 05/09/2019 11:06, Ming Lei wrote:
>> On Wed, Sep 04, 2019 at 07:31:48PM +0200, Daniel Lezcano wrote:
>>> Hi,
>>>
>>> On 04/09/2019 19:07, Bart Van Assche wrote:
>>>> On 9/3/19 12:50 AM, Daniel Lezcano wrote:
>>>>> On 03/09/2019 09:28, Ming Lei wrote:
>>>>>> On Tue, Sep 03, 2019 at 08:40:35AM +0200, Daniel Lezcano wrote:
>>>>>>> It is a scheduler problem then ?
>>>>>>
>>>>>> Scheduler can do nothing if the CPU is taken completely by
>>>>>> handling interrupt & softirq, so seems not a scheduler problem, IMO.
>>>>>
>>>>> Why? If there is a irq pressure on one CPU reducing its capacity,
>>>>> the scheduler will balance the tasks on another CPU, no?
>>>>
>>>> Only if CONFIG_IRQ_TIME_ACCOUNTING has been enabled. However, I
>>>> don't know any Linux distro that enables that option. That's
>>>> probably because that option introduces two rdtsc() calls in each
>>>> interrupt. Given the overhead introduced by this option, I don't
>>>> think this is the solution Ming is looking for.
>>>
>>> Was this overhead reported somewhere ?
>>
>> The syscall of gettimeofday() calls ktime_get_real_ts64() which
>> finally calls tk_clock_read() which calls rdtsc too.
>>
>> But gettimeofday() is often used in fast path, and block IO_STAT needs
>> to read it too.
>>
>>>
>>>> See also irqtime_account_irq() in kernel/sched/cputime.c.
>>>
>>> From my POV, this framework could be interesting to detect this situation.
>>
>> Now we are talking about IRQ_TIME_ACCOUNTING instead of
>IRQ_TIMINGS,
>> and the former one could be used to implement the detection. And the
>> only sharing should be the read of timestamp.
>
>You did not share yet the analysis of the problem (the kernel warnings give
>the symptoms) and gave the reasoning for the solution. It is hard to
>understand what you are looking for exactly and how to connect the dots.
>
>AFAIU, there are fast medium where the responses to requests are faster
>than the time to process them, right?
>
>I don't see how detecting IRQ flooding and use a threaded irq is the solution,
>can you explain?
>
>If the responses are coming at a very high rate, whatever the solution
>(interrupts, threaded interrupts, polling), we are still in the same situation.
>
>My suggestion was initially to see if the interrupt load will be taken into
>accounts in the cpu load and favorize task migration with the scheduler load
>balance to a less loaded CPU, thus the CPU processing interrupts will end up
>doing only that while other CPUs will handle the "threaded" side.
>
>Beside that, I'm wondering if the block scheduler should be somehow
>involved in that [1]
>
>  -- Daniel

Hi Daniel

I want to share some test results with IRQ_TIME_ACCOUNTING. During tests, the kernel had warnings on RCU stall and soft lockup:

An example of RCU stall;
[ 3016.148250] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 3016.148299] rcu:     66-....: (1 GPs behind) idle=cc2/0/0x3 softirq=10037/10037 fqs=4717
[ 3016.148299]  (detected by 27, t=15011 jiffies, g=45173, q=17194)
[ 3016.148299] Sending NMI from CPU 27 to CPUs 66:
[ 3016.148299] NMI backtrace for cpu 66
[ 3016.148299] CPU: 66 PID: 0 Comm: swapper/66 Tainted: G             L    5.3.0-rc6+ #68
[ 3016.148299] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007  05/18/2018
[ 3016.148299] RIP: 0010:0xffff9c4740013003
[ 3016.148299] Code: Bad RIP value.
[ 3016.148299] RSP: 0018:ffff9c4759acc8d0 EFLAGS: 00000046
[ 3016.148299] RAX: 0000000000000000 RBX: 0000000000000080 RCX: 000000000001000b
[ 3016.148299] RDX: 00000000000000fb RSI: ffff9c4740013000 RDI: ffff9c4759acc920
[ 3016.148299] RBP: ffff9c4759acc920 R08: 0000000000000008 R09: 01484a845c6de350
[ 3016.148299] R10: ffff9c4759accd30 R11: 0000000000000001 R12: 00000000000000fb
[ 3016.148299] R13: 0000000000000042 R14: ffff8a7d9b771f80 R15: 00000000000001e1
[ 3016.148299] FS:  0000000000000000(0000) GS:ffff8afd9f880000(0000) knlGS:0000000000000000
[ 3016.148299] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3016.148299] CR2: ffff9c4740012fd9 CR3: 000000208b9bc000 CR4: 00000000003406e0
[ 3016.148299] Call Trace:
[ 3016.148299]  <IRQ>
[ 3016.148299]  ? __send_ipi_mask+0x145/0x2e0
[ 3016.148299]  ? __send_ipi_one+0x3a/0x60
[ 3016.148299]  ? hv_send_ipi+0x10/0x30
[ 3016.148299]  ? generic_exec_single+0x63/0xe0
[ 3016.148299]  ? smp_call_function_single_async+0x1f/0x40
[ 3016.148299]  ? blk_mq_complete_request+0xdf/0xf0
[ 3016.148299]  ? nvme_irq+0x144/0x240 [nvme]
[ 3016.148299]  ? tick_sched_do_timer+0x80/0x80
[ 3016.148299]  ? __handle_irq_event_percpu+0x40/0x190
[ 3016.148299]  ? handle_irq_event_percpu+0x30/0x70
[ 3016.148299]  ? handle_irq_event+0x36/0x60
[ 3016.148299]  ? handle_edge_irq+0x7e/0x190
[ 3016.148299]  ? handle_irq+0x1c/0x30
[ 3016.148299]  ? do_IRQ+0x49/0xd0
[ 3016.148299]  ? common_interrupt+0xf/0xf
[ 3016.148299]  ? common_interrupt+0xa/0xf
[ 3016.148299]  ? __do_softirq+0x76/0x2e3
[ 3016.148299]  ? __do_softirq+0x4b/0x2e3
[ 3016.148299]  ? sched_clock_local+0x12/0x80
[ 3016.148299]  ? irq_exit+0xdd/0xf0
[ 3016.148299]  ? hv_stimer0_vector_handler+0x62/0x70
[ 3016.148299]  ? hv_stimer0_callback_vector+0xf/0x20
[ 3016.148299]  </IRQ>
[ 3016.148299]  ? __sched_text_end+0x2/0x2
[ 3016.148299]  ? default_idle+0x25/0x150
[ 3016.148299]  ? do_idle+0x1ad/0x250
[ 3016.148299]  ? cpu_startup_entry+0x19/0x20
[ 3016.148299]  ? start_secondary+0x156/0x1a0
[ 3016.148299]  ? secondary_startup_64+0xa4/0xb0

An example of soft lockup:
[ 3082.490820] watchdog: BUG: soft lockup - CPU#66 stuck for 22s! [swapper/66:0]
[ 3082.501353] Modules linked in: xt_owner xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_security nls_iso8859_1 nvme nvme_core pci_hyperv hv_balloon serio_raw joydev sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_decompress zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel hyperv_fb hid_generic aes_x86_64 crypto_simd hid_hyperv cfbfillrect cryptd hv_netvsc glue_helper cfbimgblt hid hv_utils pata_acpi hyperv_keyboard cfbcopyarea
[ 3082.501353] CPU: 66 PID: 0 Comm: swapper/66 Tainted: G             L    5.3.0-rc6+ #68
[ 3082.501353] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007  05/18/2018
[ 3082.501353] RIP: 0010:__do_softirq+0x76/0x2e3
[ 3082.501353] Code: 81 05 72 5c e1 55 00 01 00 00 c7 44 24 20 0a 00 00 00 44 89 74 24 04 48 c7 c0 80 98 02 00 65 66 c7 00 00 00 fb b8 ff ff ff ff <0f> bc 44 24 04 83 c0 01 89 44 24 10 0f 84 a0 00 00 00 48 c7 44 24
[ 3082.501353] RSP: 0018:ffff9c4759accf78 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff12
[ 3082.501353] RAX: 00000000ffffffff RBX: 0000000000000000 RCX: 000002aff0e90b1d
[ 3082.501353] RDX: 00000000000000b5 RSI: 000002aff0e90bd2 RDI: 00000000000000b5
[ 3082.501353] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000029900
[ 3082.501353] R10: ffff9c4759accf20 R11: 0000000002b140e3 R12: 0000000000000000
[ 3082.501353] R13: 0000000000000000 R14: 0000000000000282 R15: 0000000000000000
[ 3082.501353] FS:  0000000000000000(0000) GS:ffff8afd9f880000(0000) knlGS:0000000000000000
[ 3082.501353] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3082.501353] CR2: 0000000000638a60 CR3: 000000208b9bc000 CR4: 00000000003406e0
[ 3082.501353] Call Trace:
[ 3082.501353]  <IRQ>
[ 3082.501353]  ? sched_clock_local+0x12/0x80
[ 3082.501353]  irq_exit+0xdd/0xf0
[ 3082.501353]  hv_stimer0_vector_handler+0x62/0x70
[ 3082.501353]  hv_stimer0_callback_vector+0xf/0x20
[ 3082.501353]  </IRQ>
[ 3082.501353] RIP: 0010:default_idle+0x25/0x150
[ 3082.501353] Code: f2 75 ff 90 90 0f 1f 44 00 00 41 55 41 54 55 53 65 8b 2d 7e 49 0f 56 0f 1f 44 00 00 e9 07 00 00 00 0f 00 2d 5f 9a 4e 00 fb f4 <65> 8b 2d 64 49 0f 56 0f 1f 44 00 00 5b 5d 41 5c 41 5d c3 65 8b 05
[ 3082.501353] RSP: 0018:ffff9c4758b5bec0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff12
[ 3082.501353] RAX: ffffffffa9f1b9c0 RBX: 0000000000000042 RCX: 0000000000000000
[ 3082.501353] RDX: 0000000000000042 RSI: 0000000000000000 RDI: 000002af5fd2e9ef
[ 3082.501353] RBP: 0000000000000042 R08: 0000000000000000 R09: 0000000000029900
[ 3082.501353] R10: ffff9c4758b5be98 R11: 0000000000008a0c R12: 0000000000000000
[ 3082.501353] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 3082.501353]  ? __sched_text_end+0x2/0x2
[ 3082.501353]  do_idle+0x1ad/0x250
[ 3082.501353]  cpu_startup_entry+0x19/0x20
[ 3082.501353]  start_secondary+0x156/0x1a0
[ 3082.501353]  secondary_startup_64+0xa4/0xb0

Tracing shows that the CPU was in either hardirq or softirq all the time before warnings. During tests, the system was unresponsive at times.

Ming's patch fixed this problem. The system was responsive throughout tests.

As for performance hit, both resulted in a small drop in peak IOPS. With IRQ_TIME_ACCOUNTING I see a 3% drop. With Ming's patch it is 1% drop.

For the tests, I used the following fio command on 10 NVMe disks:
fio --bs=4k --ioengine=libaio --iodepth=128 --filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1 --direct=1 --runtime=12000 --numjobs=80 --rw=randread --name=test --group_reporting --gtod_reduce=1


Thanks

Long

>
>[1]
>https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww
>.linaro.org%2Fblog%2Fio-bandwidth-management-for-production-quality-
>services%2F&amp;data=02%7C01%7Clongli%40microsoft.com%7C57db27acef
>38457bf8f808d731ed05e6%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0
>%7C637032766394916978&amp;sdata=ADJU0iEvTITaPCkLcKsGCbqr2GEbQ8U1
>S81jmarmrMU%3D&amp;reserved=0
>
>
>
>--
>
><https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww
>.linaro.org%2F&amp;data=02%7C01%7Clongli%40microsoft.com%7C57db27ac
>ef38457bf8f808d731ed05e6%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C
>0%7C637032766394926977&amp;sdata=1IoPpi8C87%2BGPz4RYEMrEYBma0jSe
>Bbd%2FS%2FaAb%2BUKSA%3D&amp;reserved=0> Linaro.org │ Open source
>software for ARM SoCs
>
>Follow Linaro:
><https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww
>.facebook.com%2Fpages%2FLinaro&amp;data=02%7C01%7Clongli%40microso
>ft.com%7C57db27acef38457bf8f808d731ed05e6%7C72f988bf86f141af91ab2d7
>cd011db47%7C1%7C0%7C637032766394926977&amp;sdata=8ycbXr2QuArabxf
>yARomp2ApmadaOG9lmzfC7IT3%2Bj0%3D&amp;reserved=0> Facebook |
><https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Ftwitt
>er.com%2F%23!%2Flinaroorg&amp;data=02%7C01%7Clongli%40microsoft.co
>m%7C57db27acef38457bf8f808d731ed05e6%7C72f988bf86f141af91ab2d7cd01
>1db47%7C1%7C0%7C637032766394926977&amp;sdata=aQllrUNHWZ3Vkyp7lJ4
>Rd9Qxx3DMcjYv9AQ7g9ytZzU%3D&amp;reserved=0> Twitter |
><https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww
>.linaro.org%2Flinaro-
>blog%2F&amp;data=02%7C01%7Clongli%40microsoft.com%7C57db27acef3845
>7bf8f808d731ed05e6%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C6
>37032766394926977&amp;sdata=o2SbnrMPZxRA%2Bu51Zzuew2AQhUvcF0TG
>XsZz%2Bzrivb8%3D&amp;reserved=0> Blog

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-05 10:37                                   ` Daniel Lezcano
  2019-09-06  1:22                                     ` Long Li
@ 2019-09-06  1:48                                     ` Ming Lei
  2019-09-06  5:14                                       ` Daniel Lezcano
  2019-09-06 14:18                                       ` Keith Busch
  1 sibling, 2 replies; 40+ messages in thread
From: Ming Lei @ 2019-09-06  1:48 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Jens Axboe, Hannes Reinecke, Bart Van Assche, linux-scsi,
	Peter Zijlstra, Long Li, John Garry, LKML, linux-nvme,
	Keith Busch, Ingo Molnar, Thomas Gleixner, Christoph Hellwig,
	Sagi Grimberg

Hi Daniel,

On Thu, Sep 05, 2019 at 12:37:13PM +0200, Daniel Lezcano wrote:
> 
> Hi Ming,
> 
> On 05/09/2019 11:06, Ming Lei wrote:
> > On Wed, Sep 04, 2019 at 07:31:48PM +0200, Daniel Lezcano wrote:
> >> Hi,
> >>
> >> On 04/09/2019 19:07, Bart Van Assche wrote:
> >>> On 9/3/19 12:50 AM, Daniel Lezcano wrote:
> >>>> On 03/09/2019 09:28, Ming Lei wrote:
> >>>>> On Tue, Sep 03, 2019 at 08:40:35AM +0200, Daniel Lezcano wrote:
> >>>>>> It is a scheduler problem then ?
> >>>>>
> >>>>> Scheduler can do nothing if the CPU is taken completely by handling
> >>>>> interrupt & softirq, so seems not a scheduler problem, IMO.
> >>>>
> >>>> Why? If there is a irq pressure on one CPU reducing its capacity, the
> >>>> scheduler will balance the tasks on another CPU, no?
> >>>
> >>> Only if CONFIG_IRQ_TIME_ACCOUNTING has been enabled. However, I don't
> >>> know any Linux distro that enables that option. That's probably because
> >>> that option introduces two rdtsc() calls in each interrupt. Given the
> >>> overhead introduced by this option, I don't think this is the solution
> >>> Ming is looking for.
> >>
> >> Was this overhead reported somewhere ?
> > 
> > The syscall of gettimeofday() calls ktime_get_real_ts64() which finally
> > calls tk_clock_read() which calls rdtsc too.
> > 
> > But gettimeofday() is often used in fast path, and block IO_STAT needs to
> > read it too.
> > 
> >>
> >>> See also irqtime_account_irq() in kernel/sched/cputime.c.
> >>
> >> From my POV, this framework could be interesting to detect this situation.
> > 
> > Now we are talking about IRQ_TIME_ACCOUNTING instead of IRQ_TIMINGS, and the
> > former one could be used to implement the detection. And the only sharing
> > should be the read of timestamp.
> 
> You did not share yet the analysis of the problem (the kernel warnings
> give the symptoms) and gave the reasoning for the solution. It is hard
> to understand what you are looking for exactly and how to connect the dots.

Let me explain it one more time:

When one IRQ flood happens on one CPU:

1) softirq handling on this CPU can't make progress

2) kernel thread bound to this CPU can't make progress

For example, network may require softirq to xmit packets, or another irq
thread for handling keyboards/mice or whatever, or rcu_sched may depend
on that CPU for making progress, then the irq flood stalls the whole
system.

> 
> AFAIU, there are fast medium where the responses to requests are faster
> than the time to process them, right?

Usually medium may not be faster than CPU, now we are talking about
interrupts, which can be originated from lots of devices concurrently,
for example, in Long Li'test, there are 8 NVMe drives involved.

> 
> I don't see how detecting IRQ flooding and use a threaded irq is the
> solution, can you explain?

When IRQ flood is detected, we reserve a bit little time for providing
chance to make softirq/threads scheduled by scheduler, then the above
problem can be avoided.

> 
> If the responses are coming at a very high rate, whatever the solution
> (interrupts, threaded interrupts, polling), we are still in the same
> situation.

When we moving the interrupt handling into irq thread, other softirq/
threaded interrupt/thread gets chance to be scheduled, so we can avoid
to stall the whole system.

> 
> My suggestion was initially to see if the interrupt load will be taken
> into accounts in the cpu load and favorize task migration with the
> scheduler load balance to a less loaded CPU, thus the CPU processing
> interrupts will end up doing only that while other CPUs will handle the
> "threaded" side.
> 
> Beside that, I'm wondering if the block scheduler should be somehow
> involved in that [1]

For NVMe or any multi-queue storage, the default scheduler is 'none',
which basically does nothing except for submitting IO asap.


Thanks,
Ming

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-06  1:22                                     ` Long Li
@ 2019-09-06  4:36                                       ` Daniel Lezcano
  2019-09-06  4:44                                         ` Long Li
  0 siblings, 1 reply; 40+ messages in thread
From: Daniel Lezcano @ 2019-09-06  4:36 UTC (permalink / raw)
  To: Long Li, Ming Lei
  Cc: Keith Busch, Hannes Reinecke, Bart Van Assche, linux-scsi,
	Peter Zijlstra, John Garry, LKML, linux-nvme, Jens Axboe,
	Ingo Molnar, Thomas Gleixner, Christoph Hellwig, Sagi Grimberg


On 06/09/2019 03:22, Long Li wrote:
[ ... ]
> 

> Tracing shows that the CPU was in either hardirq or softirq all the
> time before warnings. During tests, the system was unresponsive at
> times.
> 
> Ming's patch fixed this problem. The system was responsive throughout
> tests.
> 
> As for performance hit, both resulted in a small drop in peak IOPS.
> With IRQ_TIME_ACCOUNTING I see a 3% drop. With Ming's patch it is 1%
> drop.

Do you mean IRQ_TIME_ACCOUNTING + irq threaded ?


> For the tests, I used the following fio command on 10 NVMe disks: fio
> --bs=4k --ioengine=libaio --iodepth=128
> --filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1
> --direct=1 --runtime=12000 --numjobs=80 --rw=randread --name=test
> --group_reporting --gtod_reduce=1

-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-06  4:36                                       ` Daniel Lezcano
@ 2019-09-06  4:44                                         ` Long Li
  0 siblings, 0 replies; 40+ messages in thread
From: Long Li @ 2019-09-06  4:44 UTC (permalink / raw)
  To: Daniel Lezcano, Ming Lei
  Cc: Keith Busch, Hannes Reinecke, Bart Van Assche, linux-scsi,
	Peter Zijlstra, John Garry, LKML, linux-nvme, Jens Axboe,
	Ingo Molnar, Thomas Gleixner, Christoph Hellwig, Sagi Grimberg

>Subject: Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
>
>
>On 06/09/2019 03:22, Long Li wrote:
>[ ... ]
>>
>
>> Tracing shows that the CPU was in either hardirq or softirq all the
>> time before warnings. During tests, the system was unresponsive at
>> times.
>>
>> Ming's patch fixed this problem. The system was responsive throughout
>> tests.
>>
>> As for performance hit, both resulted in a small drop in peak IOPS.
>> With IRQ_TIME_ACCOUNTING I see a 3% drop. With Ming's patch it is 1%
>> drop.
>
>Do you mean IRQ_TIME_ACCOUNTING + irq threaded ?

It's just IRQ_TIME_ACCOUNTING.

>
>
>> For the tests, I used the following fio command on 10 NVMe disks: fio
>> --bs=4k --ioengine=libaio --iodepth=128
>> --
>filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev
>/nv
>>
>me4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/n
>vme9n1
>> --direct=1 --runtime=12000 --numjobs=80 --rw=randread --name=test
>> --group_reporting --gtod_reduce=1
>
>--
>
><https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww
>.linaro.org%2F&amp;data=02%7C01%7Clongli%40microsoft.com%7Cf142f9f9e
>15145434dd608d73283c817%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0
>%7C637033413903343340&amp;sdata=FRCGiKyxpdqyIPob1nWITGvymRdI3fSG
>vyBJovpwVw4%3D&amp;reserved=0> Linaro.org │ Open source software for
>ARM SoCs
>
>Follow Linaro:
><https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww
>.facebook.com%2Fpages%2FLinaro&amp;data=02%7C01%7Clongli%40microso
>ft.com%7Cf142f9f9e15145434dd608d73283c817%7C72f988bf86f141af91ab2d7c
>d011db47%7C1%7C0%7C637033413903343340&amp;sdata=P6t7wiGUESJoFuKi
>u3VrjRMGBYUWAW7TEYinUiFrlQs%3D&amp;reserved=0> Facebook |
><https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Ftwitt
>er.com%2F%23!%2Flinaroorg&amp;data=02%7C01%7Clongli%40microsoft.co
>m%7Cf142f9f9e15145434dd608d73283c817%7C72f988bf86f141af91ab2d7cd011
>db47%7C1%7C0%7C637033413903343340&amp;sdata=UB%2FOZZ1Mz38PQiDa
>BiJOHS4qr%2FWCejI0aKX9JRPNZ3s%3D&amp;reserved=0> Twitter |
><https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww
>.linaro.org%2Flinaro-
>blog%2F&amp;data=02%7C01%7Clongli%40microsoft.com%7Cf142f9f9e15145
>434dd608d73283c817%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C6
>37033413903343340&amp;sdata=7%2BrawoAWuFzou90GTgIUJV%2Fasv2N2Og
>ciePvYmblDFM%3D&amp;reserved=0> Blog

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-06  1:48                                     ` Ming Lei
@ 2019-09-06  5:14                                       ` Daniel Lezcano
  2019-09-06 18:30                                         ` Sagi Grimberg
  2019-09-06 14:18                                       ` Keith Busch
  1 sibling, 1 reply; 40+ messages in thread
From: Daniel Lezcano @ 2019-09-06  5:14 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Hannes Reinecke, Bart Van Assche, linux-scsi,
	Peter Zijlstra, Long Li, John Garry, LKML, linux-nvme,
	Keith Busch, Ingo Molnar, Thomas Gleixner, Christoph Hellwig,
	Sagi Grimberg


Hi,

On 06/09/2019 03:48, Ming Lei wrote:

[ ... ]

>> You did not share yet the analysis of the problem (the kernel warnings
>> give the symptoms) and gave the reasoning for the solution. It is hard
>> to understand what you are looking for exactly and how to connect the dots.
> 
> Let me explain it one more time:>
> When one IRQ flood happens on one CPU:
> 
> 1) softirq handling on this CPU can't make progress
> 
> 2) kernel thread bound to this CPU can't make progress
> 
> For example, network may require softirq to xmit packets, or another irq
> thread for handling keyboards/mice or whatever, or rcu_sched may depend
> on that CPU for making progress, then the irq flood stalls the whole
> system.
> 
>>
>> AFAIU, there are fast medium where the responses to requests are faster
>> than the time to process them, right?
> 
> Usually medium may not be faster than CPU, now we are talking about
> interrupts, which can be originated from lots of devices concurrently,
> for example, in Long Li'test, there are 8 NVMe drives involved.
> 
>>
>> I don't see how detecting IRQ flooding and use a threaded irq is the
>> solution, can you explain?
> 
> When IRQ flood is detected, we reserve a bit little time for providing
> chance to make softirq/threads scheduled by scheduler, then the above
> problem can be avoided.
> 
>>
>> If the responses are coming at a very high rate, whatever the solution
>> (interrupts, threaded interrupts, polling), we are still in the same
>> situation.
> 
> When we moving the interrupt handling into irq thread, other softirq/
> threaded interrupt/thread gets chance to be scheduled, so we can avoid
> to stall the whole system.

Ok, so the real problem is per-cpu bounded tasks.

I share Thomas opinion about a NAPI like approach.

I do believe you should also rely on the IRQ_TIME_ACCOUNTING (may be get
it optimized) to contribute to the CPU load and enforce task migration
at load balance.

>> My suggestion was initially to see if the interrupt load will be taken
>> into accounts in the cpu load and favorize task migration with the
>> scheduler load balance to a less loaded CPU, thus the CPU processing
>> interrupts will end up doing only that while other CPUs will handle the
>> "threaded" side.
>>
>> Beside that, I'm wondering if the block scheduler should be somehow
>> involved in that [1]
> 
> For NVMe or any multi-queue storage, the default scheduler is 'none',
> which basically does nothing except for submitting IO asap.
> 
> 
> Thanks,
> Ming
> 


-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/4] genirq: use irq's affinity for threaded irq with IRQF_RESCUE_THREAD
       [not found] ` <20190827085344.30799-5-ming.lei@redhat.com>
@ 2019-09-06  8:50   ` John Garry
  0 siblings, 0 replies; 40+ messages in thread
From: John Garry @ 2019-09-06  8:50 UTC (permalink / raw)
  To: Ming Lei, Thomas Gleixner
  Cc: Jens Axboe, Hannes Reinecke, Sagi Grimberg, linux-scsi,
	chenxiang, Peter Zijlstra, Long Li, daniel.lezcano, linux-kernel,
	linux-nvme, Keith Busch, Ingo Molnar, Christoph Hellwig

On 27/08/2019 09:53, Ming Lei wrote:
> In case of IRQF_RESCUE_THREAD, the threaded handler is only used to
> handle interrupt when IRQ flood comes, use irq's affinity for this thread
> so that scheduler may select other not too busy CPUs for handling the
> interrupt.
>
> Cc: Long Li <longli@microsoft.com>
> Cc: Ingo Molnar <mingo@redhat.com>,
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Keith Busch <keith.busch@intel.com>
> Cc: Jens Axboe <axboe@fb.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Sagi Grimberg <sagi@grimberg.me>
> Cc: John Garry <john.garry@huawei.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: linux-nvme@lists.infradead.org
> Cc: linux-scsi@vger.kernel.org
> Signed-off-by: Ming Lei <ming.lei@redhat.com>


> ---
>  kernel/irq/manage.c | 13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> index 1566abbf50e8..03bc041348b7 100644
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -968,7 +968,18 @@ irq_thread_check_affinity(struct irq_desc *desc, struct irqaction *action)
>  	if (cpumask_available(desc->irq_common_data.affinity)) {
>  		const struct cpumask *m;
>
> -		m = irq_data_get_effective_affinity_mask(&desc->irq_data);
> +		/*
> +		 * Managed IRQ's affinity is setup gracefull on MUNA locality,

gracefully

> +		 * also if IRQF_RESCUE_THREAD is set, interrupt flood has been
> +		 * triggered, so ask scheduler to run the thread on CPUs
> +		 * specified by this interrupt's affinity.
> +		 */

Hi Ming,

> +		if ((action->flags & IRQF_RESCUE_THREAD) &&
> +				irqd_affinity_is_managed(&desc->irq_data))

This doesn't look to solve the other issue I reported - that being that 
we handle the interrupt in a threaded handler natively, and the hard 
irq+threaded handler fully occupies the cpu, limiting throughput.

So can we expand the scope to cover that scenario also? I don't think 
that it’s right to solve that separately. So if we're continuing this 
approach, can we add separate judgment for spreading the cpumask for the 
threaded part?

Thanks,
John

> +			m = desc->irq_common_data.affinity;
> +		else
> +			m = irq_data_get_effective_affinity_mask(
> +					&desc->irq_data);
>  		cpumask_copy(mask, m);
>  	} else {
>  		valid = false;
>



_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-06  1:48                                     ` Ming Lei
  2019-09-06  5:14                                       ` Daniel Lezcano
@ 2019-09-06 14:18                                       ` Keith Busch
  2019-09-06 17:50                                         ` Long Li
  1 sibling, 1 reply; 40+ messages in thread
From: Keith Busch @ 2019-09-06 14:18 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Hannes Reinecke, John Garry, Bart Van Assche,
	linux-scsi, Peter Zijlstra, Long Li, Daniel Lezcano, LKML,
	linux-nvme, Keith Busch, Ingo Molnar, Thomas Gleixner,
	Christoph Hellwig, Sagi Grimberg

On Fri, Sep 06, 2019 at 09:48:21AM +0800, Ming Lei wrote:
> When one IRQ flood happens on one CPU:
> 
> 1) softirq handling on this CPU can't make progress
> 
> 2) kernel thread bound to this CPU can't make progress
> 
> For example, network may require softirq to xmit packets, or another irq
> thread for handling keyboards/mice or whatever, or rcu_sched may depend
> on that CPU for making progress, then the irq flood stalls the whole
> system.
> 
> > 
> > AFAIU, there are fast medium where the responses to requests are faster
> > than the time to process them, right?
> 
> Usually medium may not be faster than CPU, now we are talking about
> interrupts, which can be originated from lots of devices concurrently,
> for example, in Long Li'test, there are 8 NVMe drives involved.

Why are all 8 nvmes sharing the same CPU for interrupt handling?
Shouldn't matrix_find_best_cpu_managed() handle selecting the least used
CPU from the cpumask for the effective interrupt handling?

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-06 14:18                                       ` Keith Busch
@ 2019-09-06 17:50                                         ` Long Li
  2019-09-06 22:19                                           ` Ming Lei
  0 siblings, 1 reply; 40+ messages in thread
From: Long Li @ 2019-09-06 17:50 UTC (permalink / raw)
  To: Keith Busch, Ming Lei
  Cc: Jens Axboe, Hannes Reinecke, John Garry, Bart Van Assche,
	linux-scsi, Peter Zijlstra, Daniel Lezcano, LKML, linux-nvme,
	Keith Busch, Ingo Molnar, Thomas Gleixner, Christoph Hellwig,
	Sagi Grimberg

>Subject: Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
>
>On Fri, Sep 06, 2019 at 09:48:21AM +0800, Ming Lei wrote:
>> When one IRQ flood happens on one CPU:
>>
>> 1) softirq handling on this CPU can't make progress
>>
>> 2) kernel thread bound to this CPU can't make progress
>>
>> For example, network may require softirq to xmit packets, or another
>> irq thread for handling keyboards/mice or whatever, or rcu_sched may
>> depend on that CPU for making progress, then the irq flood stalls the
>> whole system.
>>
>> >
>> > AFAIU, there are fast medium where the responses to requests are
>> > faster than the time to process them, right?
>>
>> Usually medium may not be faster than CPU, now we are talking about
>> interrupts, which can be originated from lots of devices concurrently,
>> for example, in Long Li'test, there are 8 NVMe drives involved.
>
>Why are all 8 nvmes sharing the same CPU for interrupt handling?
>Shouldn't matrix_find_best_cpu_managed() handle selecting the least used
>CPU from the cpumask for the effective interrupt handling?

The tests run on 10 NVMe disks on a system of 80 CPUs. Each NVMe disk has 32 hardware queues.
It seems matrix_find_best_cpu_managed() has done its job, but we may still have CPUs that service several hardware queues mapped from other issuing CPUs.
Another thing to consider is that there may be other managed interrupts on the system, so NVMe interrupts may not end up evenly distributed on such a system.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-06  5:14                                       ` Daniel Lezcano
@ 2019-09-06 18:30                                         ` Sagi Grimberg
  2019-09-06 18:52                                           ` Keith Busch
  2019-09-07  0:01                                           ` Ming Lei
  0 siblings, 2 replies; 40+ messages in thread
From: Sagi Grimberg @ 2019-09-06 18:30 UTC (permalink / raw)
  To: Daniel Lezcano, Ming Lei
  Cc: Keith Busch, Hannes Reinecke, Bart Van Assche, linux-scsi,
	Peter Zijlstra, Long Li, John Garry, LKML, linux-nvme,
	Jens Axboe, Ingo Molnar, Thomas Gleixner, Christoph Hellwig


> 
> Ok, so the real problem is per-cpu bounded tasks.
> 
> I share Thomas opinion about a NAPI like approach.

We already have that, its irq_poll, but it seems that for this
use-case, we get lower performance for some reason. I'm not
entirely sure why that is, maybe its because we need to mask interrupts
because we don't have an "arm" register in nvme like network devices
have?

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-06 18:30                                         ` Sagi Grimberg
@ 2019-09-06 18:52                                           ` Keith Busch
  2019-09-07  0:01                                           ` Ming Lei
  1 sibling, 0 replies; 40+ messages in thread
From: Keith Busch @ 2019-09-06 18:52 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Keith Busch, Hannes Reinecke, John Garry, Bart Van Assche,
	linux-scsi, Peter Zijlstra, Long Li, Daniel Lezcano, LKML,
	linux-nvme, Ming Lei, Jens Axboe, Ingo Molnar, Thomas Gleixner,
	Christoph Hellwig

On Fri, Sep 06, 2019 at 11:30:57AM -0700, Sagi Grimberg wrote:
> 
> > 
> > Ok, so the real problem is per-cpu bounded tasks.
> > 
> > I share Thomas opinion about a NAPI like approach.
> 
> We already have that, its irq_poll, but it seems that for this
> use-case, we get lower performance for some reason. I'm not
> entirely sure why that is, maybe its because we need to mask interrupts
> because we don't have an "arm" register in nvme like network devices
> have?

For MSI, that's the INTMS/INTMC NVMe registers. MSI-x, though, has to
disarm it in its table entry, and the Linux implementation will do a
posted read in that path, which is a bit too expensive.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-06 17:50                                         ` Long Li
@ 2019-09-06 22:19                                           ` Ming Lei
  2019-09-06 22:25                                             ` Keith Busch
  2019-09-10  0:24                                             ` Ming Lei
  0 siblings, 2 replies; 40+ messages in thread
From: Ming Lei @ 2019-09-06 22:19 UTC (permalink / raw)
  To: Long Li
  Cc: Jens Axboe, Hannes Reinecke, John Garry, Bart Van Assche,
	linux-scsi, Peter Zijlstra, Daniel Lezcano, LKML, linux-nvme,
	Keith Busch, Ingo Molnar, Keith Busch, Thomas Gleixner,
	Christoph Hellwig, Sagi Grimberg

On Fri, Sep 06, 2019 at 05:50:49PM +0000, Long Li wrote:
> >Subject: Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
> >
> >On Fri, Sep 06, 2019 at 09:48:21AM +0800, Ming Lei wrote:
> >> When one IRQ flood happens on one CPU:
> >>
> >> 1) softirq handling on this CPU can't make progress
> >>
> >> 2) kernel thread bound to this CPU can't make progress
> >>
> >> For example, network may require softirq to xmit packets, or another
> >> irq thread for handling keyboards/mice or whatever, or rcu_sched may
> >> depend on that CPU for making progress, then the irq flood stalls the
> >> whole system.
> >>
> >> >
> >> > AFAIU, there are fast medium where the responses to requests are
> >> > faster than the time to process them, right?
> >>
> >> Usually medium may not be faster than CPU, now we are talking about
> >> interrupts, which can be originated from lots of devices concurrently,
> >> for example, in Long Li'test, there are 8 NVMe drives involved.
> >
> >Why are all 8 nvmes sharing the same CPU for interrupt handling?
> >Shouldn't matrix_find_best_cpu_managed() handle selecting the least used
> >CPU from the cpumask for the effective interrupt handling?
> 
> The tests run on 10 NVMe disks on a system of 80 CPUs. Each NVMe disk has 32 hardware queues.

Then there are total 320 NVMe MSI/X vectors, and 80 CPUs, so irq matrix
can't avoid effective CPUs overlapping at all.

> It seems matrix_find_best_cpu_managed() has done its job, but we may still have CPUs that service several hardware queues mapped from other issuing CPUs.
> Another thing to consider is that there may be other managed interrupts on the system, so NVMe interrupts may not end up evenly distributed on such a system.

Another improvement could be to try to not overlap effective CPUs among
vectors of fast device first, meantime allow the overlap between slow
vectors and fast vectors.

This way could improve in case that total fast vectors are <= nr_cpu_cores.

thanks,
Ming

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-06 22:19                                           ` Ming Lei
@ 2019-09-06 22:25                                             ` Keith Busch
  2019-09-06 23:13                                               ` Ming Lei
  2019-09-10  0:24                                             ` Ming Lei
  1 sibling, 1 reply; 40+ messages in thread
From: Keith Busch @ 2019-09-06 22:25 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Hannes Reinecke, John Garry, Bart Van Assche,
	linux-scsi, Peter Zijlstra, Long Li, Daniel Lezcano, LKML,
	linux-nvme, Keith Busch, Ingo Molnar, Thomas Gleixner,
	Christoph Hellwig, Sagi Grimberg

On Sat, Sep 07, 2019 at 06:19:21AM +0800, Ming Lei wrote:
> On Fri, Sep 06, 2019 at 05:50:49PM +0000, Long Li wrote:
> > >Subject: Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
> > >
> > >Why are all 8 nvmes sharing the same CPU for interrupt handling?
> > >Shouldn't matrix_find_best_cpu_managed() handle selecting the least used
> > >CPU from the cpumask for the effective interrupt handling?
> > 
> > The tests run on 10 NVMe disks on a system of 80 CPUs. Each NVMe disk has 32 hardware queues.
> 
> Then there are total 320 NVMe MSI/X vectors, and 80 CPUs, so irq matrix
> can't avoid effective CPUs overlapping at all.

Sure, but it's at most half, meanwhile the CPU that's dispatching requests
would naturally be throttled by the other half who's completions are
interrupting that CPU, no?

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-06 22:25                                             ` Keith Busch
@ 2019-09-06 23:13                                               ` Ming Lei
  0 siblings, 0 replies; 40+ messages in thread
From: Ming Lei @ 2019-09-06 23:13 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, Hannes Reinecke, John Garry, Bart Van Assche,
	linux-scsi, Peter Zijlstra, Long Li, Daniel Lezcano, LKML,
	linux-nvme, Keith Busch, Ingo Molnar, Thomas Gleixner,
	Christoph Hellwig, Sagi Grimberg

On Fri, Sep 06, 2019 at 04:25:55PM -0600, Keith Busch wrote:
> On Sat, Sep 07, 2019 at 06:19:21AM +0800, Ming Lei wrote:
> > On Fri, Sep 06, 2019 at 05:50:49PM +0000, Long Li wrote:
> > > >Subject: Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
> > > >
> > > >Why are all 8 nvmes sharing the same CPU for interrupt handling?
> > > >Shouldn't matrix_find_best_cpu_managed() handle selecting the least used
> > > >CPU from the cpumask for the effective interrupt handling?
> > > 
> > > The tests run on 10 NVMe disks on a system of 80 CPUs. Each NVMe disk has 32 hardware queues.
> > 
> > Then there are total 320 NVMe MSI/X vectors, and 80 CPUs, so irq matrix
> > can't avoid effective CPUs overlapping at all.
> 
> Sure, but it's at most half, meanwhile the CPU that's dispatching requests
> would naturally be throttled by the other half who's completions are
> interrupting that CPU, no?

The root cause is that multiple submission vs. single completion.

Let's see two cases:

1) 10 NVMe, each 8 queues, 80 CPU cores
- suppose genriq matrix can avoid effective cpu overlap, each cpu
  only handles one nvme interrupt
- there can be concurrent submissions from 10 CPUs, and all may be
  completed on one CPU
- IRQ flood couldn't happen for this case, given each CPU is only
  handling completion from one NVMe drive, which shouldn't be fast
  than CPU.

2) 10 NVMe, each 32 queues, 80 CPU cores
- one CPU may handle 4 NVMe interrupts, each from different NVMe drive
- then there may be 4*3 submissions aimed at single completion, then IRQ
  flood should be easy triggered on CPU for handing 4 NVMe interrupts.
  Because IO from 4 NVMe drive may be quicker than one CPU.

I can observe IRQ flood on the case #1, but there are still CPUs for
handling 2 NVMe interrupt, as the reason mentioned by Long. We could
improve for this case.

Thanks,
Ming

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-06 18:30                                         ` Sagi Grimberg
  2019-09-06 18:52                                           ` Keith Busch
@ 2019-09-07  0:01                                           ` Ming Lei
  2019-09-10  3:10                                             ` Sagi Grimberg
  1 sibling, 1 reply; 40+ messages in thread
From: Ming Lei @ 2019-09-07  0:01 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Jens Axboe, Hannes Reinecke, John Garry, Bart Van Assche,
	linux-scsi, Peter Zijlstra, Long Li, Daniel Lezcano, LKML,
	linux-nvme, Keith Busch, Ingo Molnar, Thomas Gleixner,
	Christoph Hellwig

On Fri, Sep 06, 2019 at 11:30:57AM -0700, Sagi Grimberg wrote:
> 
> > 
> > Ok, so the real problem is per-cpu bounded tasks.
> > 
> > I share Thomas opinion about a NAPI like approach.
> 
> We already have that, its irq_poll, but it seems that for this
> use-case, we get lower performance for some reason. I'm not
> entirely sure why that is, maybe its because we need to mask interrupts
> because we don't have an "arm" register in nvme like network devices
> have?

Long observed that IOPS drops much too by switching to threaded irq. If
softirqd is waken up for handing softirq, the performance shouldn't
be better than threaded irq. Especially, Long found that context
switch is increased a lot after applying your irq poll patch.

http://lists.infradead.org/pipermail/linux-nvme/2019-August/026788.html

Thanks,
Ming

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-06 22:19                                           ` Ming Lei
  2019-09-06 22:25                                             ` Keith Busch
@ 2019-09-10  0:24                                             ` Ming Lei
  1 sibling, 0 replies; 40+ messages in thread
From: Ming Lei @ 2019-09-10  0:24 UTC (permalink / raw)
  To: Long Li
  Cc: Jens Axboe, Hannes Reinecke, John Garry, Bart Van Assche,
	linux-scsi, Peter Zijlstra, Daniel Lezcano, LKML, linux-nvme,
	Keith Busch, Ingo Molnar, Keith Busch, Thomas Gleixner,
	Christoph Hellwig, Sagi Grimberg

On Sat, Sep 07, 2019 at 06:19:20AM +0800, Ming Lei wrote:
> On Fri, Sep 06, 2019 at 05:50:49PM +0000, Long Li wrote:
> > >Subject: Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
> > >
> > >On Fri, Sep 06, 2019 at 09:48:21AM +0800, Ming Lei wrote:
> > >> When one IRQ flood happens on one CPU:
> > >>
> > >> 1) softirq handling on this CPU can't make progress
> > >>
> > >> 2) kernel thread bound to this CPU can't make progress
> > >>
> > >> For example, network may require softirq to xmit packets, or another
> > >> irq thread for handling keyboards/mice or whatever, or rcu_sched may
> > >> depend on that CPU for making progress, then the irq flood stalls the
> > >> whole system.
> > >>
> > >> >
> > >> > AFAIU, there are fast medium where the responses to requests are
> > >> > faster than the time to process them, right?
> > >>
> > >> Usually medium may not be faster than CPU, now we are talking about
> > >> interrupts, which can be originated from lots of devices concurrently,
> > >> for example, in Long Li'test, there are 8 NVMe drives involved.
> > >
> > >Why are all 8 nvmes sharing the same CPU for interrupt handling?
> > >Shouldn't matrix_find_best_cpu_managed() handle selecting the least used
> > >CPU from the cpumask for the effective interrupt handling?
> > 
> > The tests run on 10 NVMe disks on a system of 80 CPUs. Each NVMe disk has 32 hardware queues.
> 
> Then there are total 320 NVMe MSI/X vectors, and 80 CPUs, so irq matrix
> can't avoid effective CPUs overlapping at all.
> 
> > It seems matrix_find_best_cpu_managed() has done its job, but we may still have CPUs that service several hardware queues mapped from other issuing CPUs.
> > Another thing to consider is that there may be other managed interrupts on the system, so NVMe interrupts may not end up evenly distributed on such a system.
> 
> Another improvement could be to try to not overlap effective CPUs among
> vectors of fast device first, meantime allow the overlap between slow
> vectors and fast vectors.
> 
> This way could improve in case that total fast vectors are <= nr_cpu_cores.

For this particular case, it can't be done, because:

1) this machine has 10 NUMA nodes, and each NVMe has 8 hw queues, so too
many CPUs are assigned to the 1st two hw queues, see the code branch of
'if (numvecs <= nodes)' in __irq_build_affinity_masks().

2) then less CPUs are assigned to the other 6 hw queues

3) finally same effective CPU is shared by two IRQ vector.

Also looks matrix_find_best_cpu_managed() has been doing well enough for
choosing best effective CPU.


Thanks,
Ming

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-07  0:01                                           ` Ming Lei
@ 2019-09-10  3:10                                             ` Sagi Grimberg
  2019-09-18  0:00                                               ` Long Li
  2019-09-18 14:37                                               ` Ming Lei
  0 siblings, 2 replies; 40+ messages in thread
From: Sagi Grimberg @ 2019-09-10  3:10 UTC (permalink / raw)
  To: Ming Lei
  Cc: Keith Busch, Hannes Reinecke, Daniel Lezcano, Bart Van Assche,
	linux-scsi, Peter Zijlstra, Long Li, John Garry, LKML,
	linux-nvme, Jens Axboe, Ingo Molnar, Thomas Gleixner,
	Christoph Hellwig

Hey Ming,

>>> Ok, so the real problem is per-cpu bounded tasks.
>>>
>>> I share Thomas opinion about a NAPI like approach.
>>
>> We already have that, its irq_poll, but it seems that for this
>> use-case, we get lower performance for some reason. I'm not
>> entirely sure why that is, maybe its because we need to mask interrupts
>> because we don't have an "arm" register in nvme like network devices
>> have?
> 
> Long observed that IOPS drops much too by switching to threaded irq. If
> softirqd is waken up for handing softirq, the performance shouldn't
> be better than threaded irq.

Its true that it shouldn't be any faster, but what irqpoll already has
and we don't need to reinvent is a proper budgeting mechanism that
needs to occur when multiple devices map irq vectors to the same cpu
core.

irqpoll already maintains a percpu list and dispatch the ->poll with
a budget that the backend enforces and irqpoll multiplexes between them.
Having this mechanism in irq (hard or threaded) context sounds
unnecessary a bit.

It seems like we're attempting to stay in irq context for as long as we
can instead of scheduling to softirq/thread context if we have more than
a minimal amount of work to do. Without at least understanding why
softirq/thread degrades us so much this code seems like the wrong
approach to me. Interrupt context will always be faster, but it is
not a sufficient reason to spend as much time as possible there, is it?

We should also keep in mind, that the networking stack has been doing
this for years, I would try to understand why this cannot work for nvme
before dismissing.

> Especially, Long found that context
> switch is increased a lot after applying your irq poll patch.
> 
> http://lists.infradead.org/pipermail/linux-nvme/2019-August/026788.html

Oh, I didn't see that one, wonder why... thanks!

5% improvement, I guess we can buy that for other users as is :)

If we suffer from lots of context switches while the CPU is flooded with
interrupts, then I would argue that we're re-raising softirq too much.
In this use-case, my assumption is that the cpu cannot keep up with the
interrupts and not that it doesn't reap enough (we also reap the first
batch in interrupt context...)

Perhaps making irqpoll continue until it must resched would improve
things further? Although this is a latency vs. efficiency tradeoff,
looks like MAX_SOFTIRQ_TIME is set to 2ms:

"
  * The MAX_SOFTIRQ_TIME provides a nice upper bound in most cases, but in
  * certain cases, such as stop_machine(), jiffies may cease to
  * increment and so we need the MAX_SOFTIRQ_RESTART limit as
  * well to make sure we eventually return from this method.
  *
  * These limits have been established via experimentation.
  * The two things to balance is latency against fairness -
  * we want to handle softirqs as soon as possible, but they
  * should not be able to lock up the box.
"

Long, does this patch make any difference?
--
diff --git a/lib/irq_poll.c b/lib/irq_poll.c
index 2f17b488d58e..d8eab563fa77 100644
--- a/lib/irq_poll.c
+++ b/lib/irq_poll.c
@@ -12,8 +12,6 @@
  #include <linux/irq_poll.h>
  #include <linux/delay.h>

-static unsigned int irq_poll_budget __read_mostly = 256;
-
  static DEFINE_PER_CPU(struct list_head, blk_cpu_iopoll);

  /**
@@ -77,42 +75,29 @@ EXPORT_SYMBOL(irq_poll_complete);

  static void __latent_entropy irq_poll_softirq(struct softirq_action *h)
  {
-       struct list_head *list = this_cpu_ptr(&blk_cpu_iopoll);
-       int rearm = 0, budget = irq_poll_budget;
-       unsigned long start_time = jiffies;
+       struct list_head *irqpoll_list = this_cpu_ptr(&blk_cpu_iopoll);
+       LIST_HEAD(list);

         local_irq_disable();
+       list_splice_init(irqpoll_list, &list);
+       local_irq_enable();

-       while (!list_empty(list)) {
+       while (!list_empty(&list)) {
                 struct irq_poll *iop;
                 int work, weight;

-               /*
-                * If softirq window is exhausted then punt.
-                */
-               if (budget <= 0 || time_after(jiffies, start_time)) {
-                       rearm = 1;
-                       break;
-               }
-
-               local_irq_enable();
-
                 /* Even though interrupts have been re-enabled, this
                  * access is safe because interrupts can only add new
                  * entries to the tail of this list, and only ->poll()
                  * calls can remove this head entry from the list.
                  */
-               iop = list_entry(list->next, struct irq_poll, list);
+               iop = list_first_entry(&list, struct irq_poll, list);

                 weight = iop->weight;
                 work = 0;
                 if (test_bit(IRQ_POLL_F_SCHED, &iop->state))
                         work = iop->poll(iop, weight);

-               budget -= work;
-
-               local_irq_disable();
-
                 /*
                  * Drivers must not modify the iopoll state, if they
                  * consume their assigned weight (or more, some drivers 
can't
@@ -125,11 +110,21 @@ static void __latent_entropy 
irq_poll_softirq(struct softirq_action *h)
                         if (test_bit(IRQ_POLL_F_DISABLE, &iop->state))
                                 __irq_poll_complete(iop);
                         else
-                               list_move_tail(&iop->list, list);
+                               list_move_tail(&iop->list, &list);
                 }
+
+               /*
+                * If softirq window is exhausted then punt.
+                */
+               if (need_resched())
+                       break;
         }

-       if (rearm)
+       local_irq_disable();
+
+       list_splice_tail_init(irqpoll_list, &list);
+       list_splice(&list, irqpoll_list);
+       if (!list_empty(irqpoll_list))
                 __raise_softirq_irqoff(IRQ_POLL_SOFTIRQ);

         local_irq_enable();
--

Reminder to the nvme side (slightly modified):
--
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 52205f8d90b4..09dc6da67b05 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -24,6 +24,7 @@
  #include <linux/io-64-nonatomic-lo-hi.h>
  #include <linux/sed-opal.h>
  #include <linux/pci-p2pdma.h>
+#include <linux/irq_poll.h>

  #include "trace.h"
  #include "nvme.h"
@@ -32,6 +33,7 @@
  #define CQ_SIZE(q)     ((q)->q_depth * sizeof(struct nvme_completion))

  #define SGES_PER_PAGE  (PAGE_SIZE / sizeof(struct nvme_sgl_desc))
+#define NVME_POLL_BUDGET_IRQ   256

  /*
   * These can be higher, but we need to ensure that any command doesn't
@@ -189,6 +191,7 @@ struct nvme_queue {
         u32 *dbbuf_cq_db;
         u32 *dbbuf_sq_ei;
         u32 *dbbuf_cq_ei;
+       struct irq_poll iop;
         struct completion delete_done;
  };

@@ -1014,11 +1017,29 @@ static inline int nvme_process_cq(struct 
nvme_queue *nvmeq, u16 *start,
         return found;
  }

+static int nvme_irqpoll_handler(struct irq_poll *iop, int budget)
+{
+       struct nvme_queue *nvmeq = container_of(iop, struct nvme_queue, 
iop);
+       struct pci_dev *pdev = to_pci_dev(nvmeq->dev->dev);
+       u16 start, end;
+       int completed;
+
+       completed = nvme_process_cq(nvmeq, &start, &end, budget);
+       nvme_complete_cqes(nvmeq, start, end);
+       if (completed < budget) {
+               irq_poll_complete(&nvmeq->iop);
+               enable_irq(pci_irq_vector(pdev, nvmeq->cq_vector));
+       }
+
+       return completed;
+}
+
  static irqreturn_t nvme_irq(int irq, void *data)
  {
         struct nvme_queue *nvmeq = data;
         irqreturn_t ret = IRQ_NONE;
         u16 start, end;
+       int budget = nvmeq->q_depth;

         /*
          * The rmb/wmb pair ensures we see all updates from a previous 
run of
@@ -1027,13 +1048,23 @@ static irqreturn_t nvme_irq(int irq, void *data)
         rmb();
         if (nvmeq->cq_head != nvmeq->last_cq_head)
                 ret = IRQ_HANDLED;
-       nvme_process_cq(nvmeq, &start, &end, -1);
+
+       /* reap here up to a budget of the size the queue depth */
+       do {
+               budget -= nvme_process_cq(nvmeq, &start, &end, budget);
+               if (start != end) {
+                       nvme_complete_cqes(nvmeq, start, end);
+                       ret = IRQ_HANDLED;
+               }
+       } while (start != end && budget > 0);
+
         nvmeq->last_cq_head = nvmeq->cq_head;
         wmb();

-       if (start != end) {
-               nvme_complete_cqes(nvmeq, start, end);
-               return IRQ_HANDLED;
+       /* if we still have cqes to reap, schedule irqpoll */
+       if (start != end && nvme_cqe_pending(nvmeq)) {
+               disable_irq_nosync(irq);
+               irq_poll_sched(&nvmeq->iop);
         }

         return ret;
@@ -1346,6 +1377,7 @@ static enum blk_eh_timer_return 
nvme_timeout(struct request *req, bool reserved)

  static void nvme_free_queue(struct nvme_queue *nvmeq)
  {
+       irq_poll_disable(&nvmeq->iop);
         dma_free_coherent(nvmeq->dev->dev, CQ_SIZE(nvmeq),
                                 (void *)nvmeq->cqes, nvmeq->cq_dma_addr);
         if (!nvmeq->sq_cmds)
@@ -1480,6 +1512,7 @@ static int nvme_alloc_queue(struct nvme_dev *dev, 
int qid, int depth)
         nvmeq->dev = dev;
         spin_lock_init(&nvmeq->sq_lock);
         spin_lock_init(&nvmeq->cq_poll_lock);
+       irq_poll_init(&nvmeq->iop, NVME_POLL_BUDGET_IRQ, 
nvme_irqpoll_handler);
         nvmeq->cq_head = 0;
         nvmeq->cq_phase = 1;
         nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride];
--

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* RE: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-10  3:10                                             ` Sagi Grimberg
@ 2019-09-18  0:00                                               ` Long Li
  2019-09-20 17:14                                                 ` Sagi Grimberg
  2019-09-18 14:37                                               ` Ming Lei
  1 sibling, 1 reply; 40+ messages in thread
From: Long Li @ 2019-09-18  0:00 UTC (permalink / raw)
  To: Sagi Grimberg, Ming Lei
  Cc: Keith Busch, Hannes Reinecke, Daniel Lezcano, Bart Van Assche,
	linux-scsi, Peter Zijlstra, John Garry, LKML, linux-nvme,
	Jens Axboe, Ingo Molnar, Thomas Gleixner, Christoph Hellwig

>Subject: Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
>
>Hey Ming,
>
>>>> Ok, so the real problem is per-cpu bounded tasks.
>>>>
>>>> I share Thomas opinion about a NAPI like approach.
>>>
>>> We already have that, its irq_poll, but it seems that for this
>>> use-case, we get lower performance for some reason. I'm not entirely
>>> sure why that is, maybe its because we need to mask interrupts
>>> because we don't have an "arm" register in nvme like network devices
>>> have?
>>
>> Long observed that IOPS drops much too by switching to threaded irq.
>> If softirqd is waken up for handing softirq, the performance shouldn't
>> be better than threaded irq.
>
>Its true that it shouldn't be any faster, but what irqpoll already has and we
>don't need to reinvent is a proper budgeting mechanism that needs to occur
>when multiple devices map irq vectors to the same cpu core.
>
>irqpoll already maintains a percpu list and dispatch the ->poll with a budget
>that the backend enforces and irqpoll multiplexes between them.
>Having this mechanism in irq (hard or threaded) context sounds unnecessary a
>bit.
>
>It seems like we're attempting to stay in irq context for as long as we can
>instead of scheduling to softirq/thread context if we have more than a
>minimal amount of work to do. Without at least understanding why
>softirq/thread degrades us so much this code seems like the wrong approach
>to me. Interrupt context will always be faster, but it is not a sufficient reason
>to spend as much time as possible there, is it?
>
>We should also keep in mind, that the networking stack has been doing this
>for years, I would try to understand why this cannot work for nvme before
>dismissing.
>
>> Especially, Long found that context
>> switch is increased a lot after applying your irq poll patch.
>>
>> https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Flists
>> .infradead.org%2Fpipermail%2Flinux-nvme%2F2019-
>August%2F026788.html&am
>>
>p;data=02%7C01%7Clongli%40microsoft.com%7C20391b0810844821325908d73
>59c
>>
>64d2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637036818140279
>742&a
>>
>mp;sdata=GyBWILwPvwHYvrTGSAVZbdl%2Fcoz3twSXe2DrH2t1MeQ%3D&am
>p;reserved
>> =0
>
>Oh, I didn't see that one, wonder why... thanks!
>
>5% improvement, I guess we can buy that for other users as is :)
>
>If we suffer from lots of context switches while the CPU is flooded with
>interrupts, then I would argue that we're re-raising softirq too much.
>In this use-case, my assumption is that the cpu cannot keep up with the
>interrupts and not that it doesn't reap enough (we also reap the first batch in
>interrupt context...)
>
>Perhaps making irqpoll continue until it must resched would improve things
>further? Although this is a latency vs. efficiency tradeoff, looks like
>MAX_SOFTIRQ_TIME is set to 2ms:
>
>"
>  * The MAX_SOFTIRQ_TIME provides a nice upper bound in most cases, but in
>  * certain cases, such as stop_machine(), jiffies may cease to
>  * increment and so we need the MAX_SOFTIRQ_RESTART limit as
>  * well to make sure we eventually return from this method.
>  *
>  * These limits have been established via experimentation.
>  * The two things to balance is latency against fairness -
>  * we want to handle softirqs as soon as possible, but they
>  * should not be able to lock up the box.
>"
>
>Long, does this patch make any difference?

Sagi,

Sorry it took a while to bring my system back online.

With the patch, the IOPS is about the same drop with the 1st patch. I think the excessive context switches are causing the drop in IOPS.

The following are captured by "perf sched record" for 30 seconds during tests.

"perf sched latency"
With patch:
  fio:(82)              | 937632.706 ms |  1782255 | avg:    0.209 ms | max:   63.123 ms | max at:    768.274023 s

without patch:
  fio:(82)              |2348323.432 ms |    18848 | avg:    0.295 ms | max:   28.446 ms | max at:   6447.310255 s

Look closer at each CPU, we can see ksoftirqd is competing CPU with fio (and effectively throttle other fio processes)
(captured in /sys/kernel/debug/tracing, echo sched:* >set_event)

On CPU1 with patch: (note that the prev_state for fio is "R", it's preemptively scheduled)
           <...>-4077  [001] d... 66456.805062: sched_switch: prev_comm=fio prev_pid=4077 prev_prio=120 prev_state=R ==> next_comm=ksoftirqd/1 next_pid=17 next_prio=120
           <...>-17    [001] d... 66456.805859: sched_switch: prev_comm=ksoftirqd/1 prev_pid=17 prev_prio=120 prev_state=S ==> next_comm=fio next_pid=4077 next_prio=120
           <...>-4077  [001] d... 66456.844049: sched_switch: prev_comm=fio prev_pid=4077 prev_prio=120 prev_state=R ==> next_comm=ksoftirqd/1 next_pid=17 next_prio=120
           <...>-17    [001] d... 66456.844607: sched_switch: prev_comm=ksoftirqd/1 prev_pid=17 prev_prio=120 prev_state=S ==> next_comm=fio next_pid=4077 next_prio=120

On CPU1 without patch: (the prev_state for fio is "S", it's voluntarily scheduled)
          <idle>-0     [001] d...  6725.392308: sched_switch: prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=fio next_pid=14342 next_prio=120
             fio-14342 [001] d...  6725.392332: sched_switch: prev_comm=fio prev_pid=14342 prev_prio=120 prev_state=S ==> next_comm=swapper/1 next_pid=0 next_prio=120
          <idle>-0     [001] d...  6725.392356: sched_switch: prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=fio next_pid=14342 next_prio=120
             fio-14342 [001] d...  6725.392425: sched_switch: prev_comm=fio prev_pid=14342 prev_prio=120 prev_state=S ==> next_comm=swapper/1 next_pid=0 next_prio=120

Thanks

Long

>--
>diff --git a/lib/irq_poll.c b/lib/irq_poll.c index 2f17b488d58e..d8eab563fa77
>100644
>--- a/lib/irq_poll.c
>+++ b/lib/irq_poll.c
>@@ -12,8 +12,6 @@
>  #include <linux/irq_poll.h>
>  #include <linux/delay.h>
>
>-static unsigned int irq_poll_budget __read_mostly = 256;
>-
>  static DEFINE_PER_CPU(struct list_head, blk_cpu_iopoll);
>
>  /**
>@@ -77,42 +75,29 @@ EXPORT_SYMBOL(irq_poll_complete);
>
>  static void __latent_entropy irq_poll_softirq(struct softirq_action *h)
>  {
>-       struct list_head *list = this_cpu_ptr(&blk_cpu_iopoll);
>-       int rearm = 0, budget = irq_poll_budget;
>-       unsigned long start_time = jiffies;
>+       struct list_head *irqpoll_list = this_cpu_ptr(&blk_cpu_iopoll);
>+       LIST_HEAD(list);
>
>         local_irq_disable();
>+       list_splice_init(irqpoll_list, &list);
>+       local_irq_enable();
>
>-       while (!list_empty(list)) {
>+       while (!list_empty(&list)) {
>                 struct irq_poll *iop;
>                 int work, weight;
>
>-               /*
>-                * If softirq window is exhausted then punt.
>-                */
>-               if (budget <= 0 || time_after(jiffies, start_time)) {
>-                       rearm = 1;
>-                       break;
>-               }
>-
>-               local_irq_enable();
>-
>                 /* Even though interrupts have been re-enabled, this
>                  * access is safe because interrupts can only add new
>                  * entries to the tail of this list, and only ->poll()
>                  * calls can remove this head entry from the list.
>                  */
>-               iop = list_entry(list->next, struct irq_poll, list);
>+               iop = list_first_entry(&list, struct irq_poll, list);
>
>                 weight = iop->weight;
>                 work = 0;
>                 if (test_bit(IRQ_POLL_F_SCHED, &iop->state))
>                         work = iop->poll(iop, weight);
>
>-               budget -= work;
>-
>-               local_irq_disable();
>-
>                 /*
>                  * Drivers must not modify the iopoll state, if they
>                  * consume their assigned weight (or more, some drivers can't @@ -
>125,11 +110,21 @@ static void __latent_entropy irq_poll_softirq(struct
>softirq_action *h)
>                         if (test_bit(IRQ_POLL_F_DISABLE, &iop->state))
>                                 __irq_poll_complete(iop);
>                         else
>-                               list_move_tail(&iop->list, list);
>+                               list_move_tail(&iop->list, &list);
>                 }
>+
>+               /*
>+                * If softirq window is exhausted then punt.
>+                */
>+               if (need_resched())
>+                       break;
>         }
>
>-       if (rearm)
>+       local_irq_disable();
>+
>+       list_splice_tail_init(irqpoll_list, &list);
>+       list_splice(&list, irqpoll_list);
>+       if (!list_empty(irqpoll_list))
>                 __raise_softirq_irqoff(IRQ_POLL_SOFTIRQ);
>
>         local_irq_enable();
>--
>
>Reminder to the nvme side (slightly modified):
>--
>diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index
>52205f8d90b4..09dc6da67b05 100644
>--- a/drivers/nvme/host/pci.c
>+++ b/drivers/nvme/host/pci.c
>@@ -24,6 +24,7 @@
>  #include <linux/io-64-nonatomic-lo-hi.h>
>  #include <linux/sed-opal.h>
>  #include <linux/pci-p2pdma.h>
>+#include <linux/irq_poll.h>
>
>  #include "trace.h"
>  #include "nvme.h"
>@@ -32,6 +33,7 @@
>  #define CQ_SIZE(q)     ((q)->q_depth * sizeof(struct nvme_completion))
>
>  #define SGES_PER_PAGE  (PAGE_SIZE / sizeof(struct nvme_sgl_desc))
>+#define NVME_POLL_BUDGET_IRQ   256
>
>  /*
>   * These can be higher, but we need to ensure that any command doesn't
>@@ -189,6 +191,7 @@ struct nvme_queue {
>         u32 *dbbuf_cq_db;
>         u32 *dbbuf_sq_ei;
>         u32 *dbbuf_cq_ei;
>+       struct irq_poll iop;
>         struct completion delete_done;
>  };
>
>@@ -1014,11 +1017,29 @@ static inline int nvme_process_cq(struct
>nvme_queue *nvmeq, u16 *start,
>         return found;
>  }
>
>+static int nvme_irqpoll_handler(struct irq_poll *iop, int budget) {
>+       struct nvme_queue *nvmeq = container_of(iop, struct nvme_queue,
>iop);
>+       struct pci_dev *pdev = to_pci_dev(nvmeq->dev->dev);
>+       u16 start, end;
>+       int completed;
>+
>+       completed = nvme_process_cq(nvmeq, &start, &end, budget);
>+       nvme_complete_cqes(nvmeq, start, end);
>+       if (completed < budget) {
>+               irq_poll_complete(&nvmeq->iop);
>+               enable_irq(pci_irq_vector(pdev, nvmeq->cq_vector));
>+       }
>+
>+       return completed;
>+}
>+
>  static irqreturn_t nvme_irq(int irq, void *data)
>  {
>         struct nvme_queue *nvmeq = data;
>         irqreturn_t ret = IRQ_NONE;
>         u16 start, end;
>+       int budget = nvmeq->q_depth;
>
>         /*
>          * The rmb/wmb pair ensures we see all updates from a previous run of
>@@ -1027,13 +1048,23 @@ static irqreturn_t nvme_irq(int irq, void *data)
>         rmb();
>         if (nvmeq->cq_head != nvmeq->last_cq_head)
>                 ret = IRQ_HANDLED;
>-       nvme_process_cq(nvmeq, &start, &end, -1);
>+
>+       /* reap here up to a budget of the size the queue depth */
>+       do {
>+               budget -= nvme_process_cq(nvmeq, &start, &end, budget);
>+               if (start != end) {
>+                       nvme_complete_cqes(nvmeq, start, end);
>+                       ret = IRQ_HANDLED;
>+               }
>+       } while (start != end && budget > 0);
>+
>         nvmeq->last_cq_head = nvmeq->cq_head;
>         wmb();
>
>-       if (start != end) {
>-               nvme_complete_cqes(nvmeq, start, end);
>-               return IRQ_HANDLED;
>+       /* if we still have cqes to reap, schedule irqpoll */
>+       if (start != end && nvme_cqe_pending(nvmeq)) {
>+               disable_irq_nosync(irq);
>+               irq_poll_sched(&nvmeq->iop);
>         }
>
>         return ret;
>@@ -1346,6 +1377,7 @@ static enum blk_eh_timer_return
>nvme_timeout(struct request *req, bool reserved)
>
>  static void nvme_free_queue(struct nvme_queue *nvmeq)
>  {
>+       irq_poll_disable(&nvmeq->iop);
>         dma_free_coherent(nvmeq->dev->dev, CQ_SIZE(nvmeq),
>                                 (void *)nvmeq->cqes, nvmeq->cq_dma_addr);
>         if (!nvmeq->sq_cmds)
>@@ -1480,6 +1512,7 @@ static int nvme_alloc_queue(struct nvme_dev *dev,
>int qid, int depth)
>         nvmeq->dev = dev;
>         spin_lock_init(&nvmeq->sq_lock);
>         spin_lock_init(&nvmeq->cq_poll_lock);
>+       irq_poll_init(&nvmeq->iop, NVME_POLL_BUDGET_IRQ,
>nvme_irqpoll_handler);
>         nvmeq->cq_head = 0;
>         nvmeq->cq_phase = 1;
>         nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride];
>--
_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-10  3:10                                             ` Sagi Grimberg
  2019-09-18  0:00                                               ` Long Li
@ 2019-09-18 14:37                                               ` Ming Lei
  2019-09-20 17:09                                                 ` Sagi Grimberg
  1 sibling, 1 reply; 40+ messages in thread
From: Ming Lei @ 2019-09-18 14:37 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Jens Axboe, Hannes Reinecke, John Garry, Bart Van Assche,
	linux-scsi, Peter Zijlstra, Long Li, Daniel Lezcano, LKML,
	linux-nvme, Keith Busch, Ingo Molnar, Thomas Gleixner,
	Christoph Hellwig

On Mon, Sep 09, 2019 at 08:10:07PM -0700, Sagi Grimberg wrote:
> Hey Ming,
> 
> > > > Ok, so the real problem is per-cpu bounded tasks.
> > > > 
> > > > I share Thomas opinion about a NAPI like approach.
> > > 
> > > We already have that, its irq_poll, but it seems that for this
> > > use-case, we get lower performance for some reason. I'm not
> > > entirely sure why that is, maybe its because we need to mask interrupts
> > > because we don't have an "arm" register in nvme like network devices
> > > have?
> > 
> > Long observed that IOPS drops much too by switching to threaded irq. If
> > softirqd is waken up for handing softirq, the performance shouldn't
> > be better than threaded irq.
> 
> Its true that it shouldn't be any faster, but what irqpoll already has
> and we don't need to reinvent is a proper budgeting mechanism that
> needs to occur when multiple devices map irq vectors to the same cpu
> core.
> 
> irqpoll already maintains a percpu list and dispatch the ->poll with
> a budget that the backend enforces and irqpoll multiplexes between them.
> Having this mechanism in irq (hard or threaded) context sounds
> unnecessary a bit.
> 
> It seems like we're attempting to stay in irq context for as long as we
> can instead of scheduling to softirq/thread context if we have more than
> a minimal amount of work to do. Without at least understanding why
> softirq/thread degrades us so much this code seems like the wrong
> approach to me. Interrupt context will always be faster, but it is
> not a sufficient reason to spend as much time as possible there, is it?

If extra latency is added in IO completion path, this latency will be
introduced in the submission path, because the hw queue depth is fixed,
which is often small. Especially in case of multiple submission vs.
single(shared) completion, the whole hw queue tags can be exhausted
easily. 

I guess no such effect for networking IO.

> 
> We should also keep in mind, that the networking stack has been doing
> this for years, I would try to understand why this cannot work for nvme
> before dismissing.

The above may be one reason.

Thanks,
Ming

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-18 14:37                                               ` Ming Lei
@ 2019-09-20 17:09                                                 ` Sagi Grimberg
  0 siblings, 0 replies; 40+ messages in thread
From: Sagi Grimberg @ 2019-09-20 17:09 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Hannes Reinecke, John Garry, Bart Van Assche,
	linux-scsi, Peter Zijlstra, Long Li, Daniel Lezcano, LKML,
	linux-nvme, Keith Busch, Ingo Molnar, Thomas Gleixner,
	Christoph Hellwig


>> It seems like we're attempting to stay in irq context for as long as we
>> can instead of scheduling to softirq/thread context if we have more than
>> a minimal amount of work to do. Without at least understanding why
>> softirq/thread degrades us so much this code seems like the wrong
>> approach to me. Interrupt context will always be faster, but it is
>> not a sufficient reason to spend as much time as possible there, is it?
> 
> If extra latency is added in IO completion path, this latency will be
> introduced in the submission path, because the hw queue depth is fixed,
> which is often small. Especially in case of multiple submission vs.
> single(shared) completion, the whole hw queue tags can be exhausted
> easily.

This is why the patch is reaping the first batch from hard-irq, but only
if it has more will defer to softirq. So I'm not sure the short QD use
case applies here...

> I guess no such effect for networking IO.

Maybe, maybe not...

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-18  0:00                                               ` Long Li
@ 2019-09-20 17:14                                                 ` Sagi Grimberg
  2019-09-20 19:12                                                   ` Long Li
  0 siblings, 1 reply; 40+ messages in thread
From: Sagi Grimberg @ 2019-09-20 17:14 UTC (permalink / raw)
  To: Long Li, Ming Lei
  Cc: Keith Busch, Hannes Reinecke, Daniel Lezcano, Bart Van Assche,
	linux-scsi, Peter Zijlstra, John Garry, LKML, linux-nvme,
	Jens Axboe, Ingo Molnar, Thomas Gleixner, Christoph Hellwig


>> Hey Ming,
>>
>>>>> Ok, so the real problem is per-cpu bounded tasks.
>>>>>
>>>>> I share Thomas opinion about a NAPI like approach.
>>>>
>>>> We already have that, its irq_poll, but it seems that for this
>>>> use-case, we get lower performance for some reason. I'm not entirely
>>>> sure why that is, maybe its because we need to mask interrupts
>>>> because we don't have an "arm" register in nvme like network devices
>>>> have?
>>>
>>> Long observed that IOPS drops much too by switching to threaded irq.
>>> If softirqd is waken up for handing softirq, the performance shouldn't
>>> be better than threaded irq.
>>
>> Its true that it shouldn't be any faster, but what irqpoll already has and we
>> don't need to reinvent is a proper budgeting mechanism that needs to occur
>> when multiple devices map irq vectors to the same cpu core.
>>
>> irqpoll already maintains a percpu list and dispatch the ->poll with a budget
>> that the backend enforces and irqpoll multiplexes between them.
>> Having this mechanism in irq (hard or threaded) context sounds unnecessary a
>> bit.
>>
>> It seems like we're attempting to stay in irq context for as long as we can
>> instead of scheduling to softirq/thread context if we have more than a
>> minimal amount of work to do. Without at least understanding why
>> softirq/thread degrades us so much this code seems like the wrong approach
>> to me. Interrupt context will always be faster, but it is not a sufficient reason
>> to spend as much time as possible there, is it?
>>
>> We should also keep in mind, that the networking stack has been doing this
>> for years, I would try to understand why this cannot work for nvme before
>> dismissing.
>>
>>> Especially, Long found that context
>>> switch is increased a lot after applying your irq poll patch.
>>>
>>> https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Flists
>>> .infradead.org%2Fpipermail%2Flinux-nvme%2F2019-
>> August%2F026788.html&am
>>>
>> p;data=02%7C01%7Clongli%40microsoft.com%7C20391b0810844821325908d73
>> 59c
>>>
>> 64d2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637036818140279
>> 742&a
>>>
>> mp;sdata=GyBWILwPvwHYvrTGSAVZbdl%2Fcoz3twSXe2DrH2t1MeQ%3D&am
>> p;reserved
>>> =0
>>
>> Oh, I didn't see that one, wonder why... thanks!
>>
>> 5% improvement, I guess we can buy that for other users as is :)
>>
>> If we suffer from lots of context switches while the CPU is flooded with
>> interrupts, then I would argue that we're re-raising softirq too much.
>> In this use-case, my assumption is that the cpu cannot keep up with the
>> interrupts and not that it doesn't reap enough (we also reap the first batch in
>> interrupt context...)
>>
>> Perhaps making irqpoll continue until it must resched would improve things
>> further? Although this is a latency vs. efficiency tradeoff, looks like
>> MAX_SOFTIRQ_TIME is set to 2ms:
>>
>> "
>>   * The MAX_SOFTIRQ_TIME provides a nice upper bound in most cases, but in
>>   * certain cases, such as stop_machine(), jiffies may cease to
>>   * increment and so we need the MAX_SOFTIRQ_RESTART limit as
>>   * well to make sure we eventually return from this method.
>>   *
>>   * These limits have been established via experimentation.
>>   * The two things to balance is latency against fairness -
>>   * we want to handle softirqs as soon as possible, but they
>>   * should not be able to lock up the box.
>> "
>>
>> Long, does this patch make any difference?
> 
> Sagi,
> 
> Sorry it took a while to bring my system back online.
> 
> With the patch, the IOPS is about the same drop with the 1st patch. I think the excessive context switches are causing the drop in IOPS.
> 
> The following are captured by "perf sched record" for 30 seconds during tests.
> 
> "perf sched latency"
> With patch:
>    fio:(82)              | 937632.706 ms |  1782255 | avg:    0.209 ms | max:   63.123 ms | max at:    768.274023 s
> 
> without patch:
>    fio:(82)              |2348323.432 ms |    18848 | avg:    0.295 ms | max:   28.446 ms | max at:   6447.310255 s

Without patch means the proposed hard-irq patch?

If we are context switching too much, it means the soft-irq operation is
not efficient, not necessarily the fact that the completion path is
running in soft-irq..

Is your kernel compiled with full preemption or voluntary preemption?

> Look closer at each CPU, we can see ksoftirqd is competing CPU with fio (and effectively throttle other fio processes)
> (captured in /sys/kernel/debug/tracing, echo sched:* >set_event)
> 
> On CPU1 with patch: (note that the prev_state for fio is "R", it's preemptively scheduled)
>             <...>-4077  [001] d... 66456.805062: sched_switch: prev_comm=fio prev_pid=4077 prev_prio=120 prev_state=R ==> next_comm=ksoftirqd/1 next_pid=17 next_prio=120
>             <...>-17    [001] d... 66456.805859: sched_switch: prev_comm=ksoftirqd/1 prev_pid=17 prev_prio=120 prev_state=S ==> next_comm=fio next_pid=4077 next_prio=120
>             <...>-4077  [001] d... 66456.844049: sched_switch: prev_comm=fio prev_pid=4077 prev_prio=120 prev_state=R ==> next_comm=ksoftirqd/1 next_pid=17 next_prio=120
>             <...>-17    [001] d... 66456.844607: sched_switch: prev_comm=ksoftirqd/1 prev_pid=17 prev_prio=120 prev_state=S ==> next_comm=fio next_pid=4077 next_prio=120
> 
> On CPU1 without patch: (the prev_state for fio is "S", it's voluntarily scheduled)
>            <idle>-0     [001] d...  6725.392308: sched_switch: prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=fio next_pid=14342 next_prio=120
>               fio-14342 [001] d...  6725.392332: sched_switch: prev_comm=fio prev_pid=14342 prev_prio=120 prev_state=S ==> next_comm=swapper/1 next_pid=0 next_prio=120
>            <idle>-0     [001] d...  6725.392356: sched_switch: prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=fio next_pid=14342 next_prio=120
>               fio-14342 [001] d...  6725.392425: sched_switch: prev_comm=fio prev_pid=14342 prev_prio=120 prev_state=S ==> next_comm=swapper/1 next_pid=0 next_prio=12

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-20 17:14                                                 ` Sagi Grimberg
@ 2019-09-20 19:12                                                   ` Long Li
  2019-09-20 20:45                                                     ` Sagi Grimberg
  0 siblings, 1 reply; 40+ messages in thread
From: Long Li @ 2019-09-20 19:12 UTC (permalink / raw)
  To: Sagi Grimberg, Ming Lei
  Cc: Keith Busch, Hannes Reinecke, Daniel Lezcano, Bart Van Assche,
	linux-scsi, Peter Zijlstra, John Garry, LKML, linux-nvme,
	Jens Axboe, Ingo Molnar, Thomas Gleixner, Christoph Hellwig

> >> Long, does this patch make any difference?
> >
> > Sagi,
> >
> > Sorry it took a while to bring my system back online.
> >
> > With the patch, the IOPS is about the same drop with the 1st patch. I think
> the excessive context switches are causing the drop in IOPS.
> >
> > The following are captured by "perf sched record" for 30 seconds during
> tests.
> >
> > "perf sched latency"
> > With patch:
> >    fio:(82)              | 937632.706 ms |  1782255 | avg:    0.209 ms | max:   63.123
> ms | max at:    768.274023 s
> >
> > without patch:
> >    fio:(82)              |2348323.432 ms |    18848 | avg:    0.295 ms | max:   28.446
> ms | max at:   6447.310255 s
> 
> Without patch means the proposed hard-irq patch?

It means the current upstream code without any patch. But It's prone to soft lockup.

Ming's proposed hard-irq patch gets similar results to "without patch", however it fixes the soft lockup.

> 
> If we are context switching too much, it means the soft-irq operation is not
> efficient, not necessarily the fact that the completion path is running in soft-
> irq..
> 
> Is your kernel compiled with full preemption or voluntary preemption?

The tests are based on Ubuntu 18.04 kernel configuration. Here are the parameters:

# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set

> 
> > Look closer at each CPU, we can see ksoftirqd is competing CPU with
> > fio (and effectively throttle other fio processes) (captured in
> > /sys/kernel/debug/tracing, echo sched:* >set_event)
> >
> > On CPU1 with patch: (note that the prev_state for fio is "R", it's
> preemptively scheduled)
> >             <...>-4077  [001] d... 66456.805062: sched_switch: prev_comm=fio
> prev_pid=4077 prev_prio=120 prev_state=R ==> next_comm=ksoftirqd/1
> next_pid=17 next_prio=120
> >             <...>-17    [001] d... 66456.805859: sched_switch:
> prev_comm=ksoftirqd/1 prev_pid=17 prev_prio=120 prev_state=S ==>
> next_comm=fio next_pid=4077 next_prio=120
> >             <...>-4077  [001] d... 66456.844049: sched_switch: prev_comm=fio
> prev_pid=4077 prev_prio=120 prev_state=R ==> next_comm=ksoftirqd/1
> next_pid=17 next_prio=120
> >             <...>-17    [001] d... 66456.844607: sched_switch:
> prev_comm=ksoftirqd/1 prev_pid=17 prev_prio=120 prev_state=S ==>
> next_comm=fio next_pid=4077 next_prio=120
> >
> > On CPU1 without patch: (the prev_state for fio is "S", it's voluntarily
> scheduled)
> >            <idle>-0     [001] d...  6725.392308: sched_switch:
> prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=R ==>
> next_comm=fio next_pid=14342 next_prio=120
> >               fio-14342 [001] d...  6725.392332: sched_switch: prev_comm=fio
> prev_pid=14342 prev_prio=120 prev_state=S ==> next_comm=swapper/1
> next_pid=0 next_prio=120
> >            <idle>-0     [001] d...  6725.392356: sched_switch:
> prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=R ==>
> next_comm=fio next_pid=14342 next_prio=120
> >               fio-14342 [001] d...  6725.392425: sched_switch:
> > prev_comm=fio prev_pid=14342 prev_prio=120 prev_state=S ==>
> > next_comm=swapper/1 next_pid=0 next_prio=12
_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-20 19:12                                                   ` Long Li
@ 2019-09-20 20:45                                                     ` Sagi Grimberg
  2019-09-24  0:57                                                       ` Long Li
  0 siblings, 1 reply; 40+ messages in thread
From: Sagi Grimberg @ 2019-09-20 20:45 UTC (permalink / raw)
  To: Long Li, Ming Lei
  Cc: Keith Busch, Hannes Reinecke, Daniel Lezcano, Bart Van Assche,
	linux-scsi, Peter Zijlstra, John Garry, LKML, linux-nvme,
	Jens Axboe, Ingo Molnar, Thomas Gleixner, Christoph Hellwig


>>> Sagi,
>>>
>>> Sorry it took a while to bring my system back online.
>>>
>>> With the patch, the IOPS is about the same drop with the 1st patch. I think
>> the excessive context switches are causing the drop in IOPS.
>>>
>>> The following are captured by "perf sched record" for 30 seconds during
>> tests.
>>>
>>> "perf sched latency"
>>> With patch:
>>>     fio:(82)              | 937632.706 ms |  1782255 | avg:    0.209 ms | max:   63.123
>> ms | max at:    768.274023 s
>>>
>>> without patch:
>>>     fio:(82)              |2348323.432 ms |    18848 | avg:    0.295 ms | max:   28.446
>> ms | max at:   6447.310255 s
>>
>> Without patch means the proposed hard-irq patch?
> 
> It means the current upstream code without any patch. But It's prone to soft lockup.
> 
> Ming's proposed hard-irq patch gets similar results to "without patch", however it fixes the soft lockup.

Thanks for the clarification.

The problem with what Ming is proposing in my mind (and its an existing
problem that exists today), is that nvme is taking precedence over
anything else until it absolutely cannot hog the cpu in hardirq.

In the thread Ming referenced a case where today if the cpu core has a
net softirq activity it cannot make forward progress. So with Ming's
suggestion, net softirq will eventually make progress, but it creates an
inherent fairness issue. Who said that nvme completions should come
faster then the net rx/tx or another I/O device (or hrtimers or sched
events...)?

As much as I'd like nvme to complete as soon as possible, I might have
other activities in the system that are as important if not more. So
I don't think we can solve this with something that is not cooperative
or fair with the rest of the system.

>> If we are context switching too much, it means the soft-irq operation is not
>> efficient, not necessarily the fact that the completion path is running in soft-
>> irq..
>>
>> Is your kernel compiled with full preemption or voluntary preemption?
> 
> The tests are based on Ubuntu 18.04 kernel configuration. Here are the parameters:
> 
> # CONFIG_PREEMPT_NONE is not set
> CONFIG_PREEMPT_VOLUNTARY=y
> # CONFIG_PREEMPT is not set

I see, so it still seems that irq_poll_softirq is still not efficient in
reaping completions. reaping the completions on its own is pretty much
the same in hard and soft irq, so its really the scheduling part that is
creating the overhead (which does not exist in hard irq).

Question:
when you test with without the patch (completions are coming in 
hard-irq), do the fio threads that run on the cpu cores that are 
assigned to the cores that are handling interrupts get substantially lower
throughput than the rest of the fio threads? I would expect that
the fio threads that are running on the first 32 cores to get very low
iops (overpowered by the nvme interrupts) and the rest doing much more
given that nvme has almost no limits to how much time it can spend on
processing completions.

If need_resched() is causing us to context switch too aggressively, does 
changing that to local_softirq_pending() make things better?
--
diff --git a/lib/irq_poll.c b/lib/irq_poll.c
index d8eab563fa77..05d524fcaf04 100644
--- a/lib/irq_poll.c
+++ b/lib/irq_poll.c
@@ -116,7 +116,7 @@ static void __latent_entropy irq_poll_softirq(struct 
softirq_action *h)
                 /*
                  * If softirq window is exhausted then punt.
                  */
-               if (need_resched())
+               if (local_softirq_pending())
                         break;
         }
--

Although, this can potentially cause other threads from making forward
progress.. If it is better, perhaps we also need a time limit as well.

Perhaps we should add statistics/tracing on how many completions we are
reaping per invocation...

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* RE: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-20 20:45                                                     ` Sagi Grimberg
@ 2019-09-24  0:57                                                       ` Long Li
  0 siblings, 0 replies; 40+ messages in thread
From: Long Li @ 2019-09-24  0:57 UTC (permalink / raw)
  To: Sagi Grimberg, Ming Lei
  Cc: Keith Busch, Hannes Reinecke, Daniel Lezcano, Bart Van Assche,
	linux-scsi, Peter Zijlstra, John Garry, LKML, linux-nvme,
	Jens Axboe, Ingo Molnar, Thomas Gleixner, Christoph Hellwig

>Thanks for the clarification.
>
>The problem with what Ming is proposing in my mind (and its an existing
>problem that exists today), is that nvme is taking precedence over anything
>else until it absolutely cannot hog the cpu in hardirq.
>
>In the thread Ming referenced a case where today if the cpu core has a net
>softirq activity it cannot make forward progress. So with Ming's suggestion,
>net softirq will eventually make progress, but it creates an inherent fairness
>issue. Who said that nvme completions should come faster then the net rx/tx
>or another I/O device (or hrtimers or sched events...)?
>
>As much as I'd like nvme to complete as soon as possible, I might have other
>activities in the system that are as important if not more. So I don't think we
>can solve this with something that is not cooperative or fair with the rest of
>the system.
>
>>> If we are context switching too much, it means the soft-irq operation
>>> is not efficient, not necessarily the fact that the completion path
>>> is running in soft- irq..
>>>
>>> Is your kernel compiled with full preemption or voluntary preemption?
>>
>> The tests are based on Ubuntu 18.04 kernel configuration. Here are the
>parameters:
>>
>> # CONFIG_PREEMPT_NONE is not set
>> CONFIG_PREEMPT_VOLUNTARY=y
>> # CONFIG_PREEMPT is not set
>
>I see, so it still seems that irq_poll_softirq is still not efficient in reaping
>completions. reaping the completions on its own is pretty much the same in
>hard and soft irq, so its really the scheduling part that is creating the overhead
>(which does not exist in hard irq).
>
>Question:
>when you test with without the patch (completions are coming in hard-irq),
>do the fio threads that run on the cpu cores that are assigned to the cores that
>are handling interrupts get substantially lower throughput than the rest of the
>fio threads? I would expect that the fio threads that are running on the first 32
>cores to get very low iops (overpowered by the nvme interrupts) and the rest
>doing much more given that nvme has almost no limits to how much time it
>can spend on processing completions.
>
>If need_resched() is causing us to context switch too aggressively, does
>changing that to local_softirq_pending() make things better?
>--
>diff --git a/lib/irq_poll.c b/lib/irq_poll.c index d8eab563fa77..05d524fcaf04
>100644
>--- a/lib/irq_poll.c
>+++ b/lib/irq_poll.c
>@@ -116,7 +116,7 @@ static void __latent_entropy irq_poll_softirq(struct
>softirq_action *h)
>                 /*
>                  * If softirq window is exhausted then punt.
>                  */
>-               if (need_resched())
>+               if (local_softirq_pending())
>                         break;
>         }
>--
>
>Although, this can potentially cause other threads from making forward
>progress.. If it is better, perhaps we also need a time limit as well.

Thanks for this patch. The IOPS was about the same. (it tends to fluctuate more but within 3% variation)

I captured the following from one of the CPUs. All CPUs tend to have similar numbers. The following numbers are captured during 5 seconds and averaged:

Context switches/s:
Without any patch: 5
With the previous patch: 640
With this patch: 522

Process migrated/s:
Without any patch: 0.6
With the previous patch: 104
With this patch: 121

>
>Perhaps we should add statistics/tracing on how many completions we are
>reaping per invocation...

I'll look into a bit more on completion. From the numbers I think the increased number of context switches/migrations are hurting most on performance.

Thanks

Long
_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2019-09-24  0:57 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20190827085344.30799-1-ming.lei@redhat.com>
     [not found] ` <20190827085344.30799-2-ming.lei@redhat.com>
     [not found]   ` <alpine.DEB.2.21.1908271633450.1939@nanos.tec.linutronix.de>
     [not found]     ` <20190827225827.GA5263@ming.t460p>
     [not found]       ` <alpine.DEB.2.21.1908280104330.1939@nanos.tec.linutronix.de>
     [not found]         ` <20190828110633.GC15524@ming.t460p>
     [not found]           ` <alpine.DEB.2.21.1908281316230.1869@nanos.tec.linutronix.de>
     [not found]             ` <20190828135054.GA23861@ming.t460p>
     [not found]               ` <alpine.DEB.2.21.1908281605190.23149@nanos.tec.linutronix.de>
2019-09-03  3:30                 ` [PATCH 1/4] softirq: implement IRQ flood detection mechanism Ming Lei
2019-09-03  5:59                   ` Daniel Lezcano
2019-09-03  6:31                     ` Ming Lei
2019-09-03  6:40                       ` Daniel Lezcano
2019-09-03  7:28                         ` Ming Lei
2019-09-03  7:50                           ` Daniel Lezcano
2019-09-03  9:30                             ` Ming Lei
2019-09-04 17:07                             ` Bart Van Assche
2019-09-04 17:31                               ` Daniel Lezcano
2019-09-04 17:38                                 ` Bart Van Assche
2019-09-04 18:02                                   ` Peter Zijlstra
2019-09-04 19:47                                     ` Bart Van Assche
2019-09-05  9:11                                       ` Ming Lei
2019-09-05  9:06                                 ` Ming Lei
2019-09-05 10:37                                   ` Daniel Lezcano
2019-09-06  1:22                                     ` Long Li
2019-09-06  4:36                                       ` Daniel Lezcano
2019-09-06  4:44                                         ` Long Li
2019-09-06  1:48                                     ` Ming Lei
2019-09-06  5:14                                       ` Daniel Lezcano
2019-09-06 18:30                                         ` Sagi Grimberg
2019-09-06 18:52                                           ` Keith Busch
2019-09-07  0:01                                           ` Ming Lei
2019-09-10  3:10                                             ` Sagi Grimberg
2019-09-18  0:00                                               ` Long Li
2019-09-20 17:14                                                 ` Sagi Grimberg
2019-09-20 19:12                                                   ` Long Li
2019-09-20 20:45                                                     ` Sagi Grimberg
2019-09-24  0:57                                                       ` Long Li
2019-09-18 14:37                                               ` Ming Lei
2019-09-20 17:09                                                 ` Sagi Grimberg
2019-09-06 14:18                                       ` Keith Busch
2019-09-06 17:50                                         ` Long Li
2019-09-06 22:19                                           ` Ming Lei
2019-09-06 22:25                                             ` Keith Busch
2019-09-06 23:13                                               ` Ming Lei
2019-09-10  0:24                                             ` Ming Lei
2019-09-03  8:09                           ` Thomas Gleixner
2019-09-03  9:24                             ` Ming Lei
     [not found] ` <20190827085344.30799-5-ming.lei@redhat.com>
2019-09-06  8:50   ` [PATCH 4/4] genirq: use irq's affinity for threaded irq with IRQF_RESCUE_THREAD John Garry

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).