[3/3] nvme: complete request in work queue on CPU with flooded interrupts
diff mbox series

Message ID 1566281669-48212-4-git-send-email-longli@linuxonhyperv.com
State New
Headers show
Series
  • fix interrupt swamp in NVMe
Related show

Commit Message

Long Li Aug. 20, 2019, 6:14 a.m. UTC
From: Long Li <longli@microsoft.com>

When a NVMe hardware queue is mapped to several CPU queues, it is possible
that the CPU this hardware queue is bound to is flooded by returning I/O for
other CPUs.

For example, consider the following scenario:
1. CPU 0, 1, 2 and 3 share the same hardware queue
2. the hardware queue interrupts CPU 0 for I/O response
3. processes from CPU 1, 2 and 3 keep sending I/Os

CPU 0 may be flooded with interrupts from NVMe device that are I/O responses
for CPU 1, 2 and 3. Under heavy I/O load, it is possible that CPU 0 spends
all the time serving NVMe and other system interrupts, but doesn't have a
chance to run in process context.

To fix this, CPU 0 can schedule a work to complete the I/O request when it
detects the scheduler is not making progress. This serves multiple purposes:

1. This CPU has to be scheduled to complete the request. The other CPUs can't
issue more I/Os until some previous I/Os are completed. This helps this CPU
get out of NVMe interrupts.

2. This acts a throttling mechanisum for NVMe devices, in that it can not
starve a CPU while servicing I/Os from other CPUs.

3. This CPU can make progress on RCU and other work items on its queue.

Signed-off-by: Long Li <longli@microsoft.com>
---
 drivers/nvme/host/core.c | 57 +++++++++++++++++++++++++++++++++++++++-
 drivers/nvme/host/nvme.h |  1 +
 2 files changed, 57 insertions(+), 1 deletion(-)

Comments

Peter Zijlstra Aug. 20, 2019, 9:52 a.m. UTC | #1
On Mon, Aug 19, 2019 at 11:14:29PM -0700, longli@linuxonhyperv.com wrote:
> From: Long Li <longli@microsoft.com>
> 
> When a NVMe hardware queue is mapped to several CPU queues, it is possible
> that the CPU this hardware queue is bound to is flooded by returning I/O for
> other CPUs.
> 
> For example, consider the following scenario:
> 1. CPU 0, 1, 2 and 3 share the same hardware queue
> 2. the hardware queue interrupts CPU 0 for I/O response
> 3. processes from CPU 1, 2 and 3 keep sending I/Os
> 
> CPU 0 may be flooded with interrupts from NVMe device that are I/O responses
> for CPU 1, 2 and 3. Under heavy I/O load, it is possible that CPU 0 spends
> all the time serving NVMe and other system interrupts, but doesn't have a
> chance to run in process context.

Ideally -- and there is some code to affect this, the load-balancer will
move tasks away from this CPU.

> To fix this, CPU 0 can schedule a work to complete the I/O request when it
> detects the scheduler is not making progress. This serves multiple purposes:

Suppose the task waiting for the IO completion is a RT task, and you've
just queued it to a regular work. This is an instant priority inversion.

> 1. This CPU has to be scheduled to complete the request. The other CPUs can't
> issue more I/Os until some previous I/Os are completed. This helps this CPU
> get out of NVMe interrupts.
> 
> 2. This acts a throttling mechanisum for NVMe devices, in that it can not
> starve a CPU while servicing I/Os from other CPUs.
> 
> 3. This CPU can make progress on RCU and other work items on its queue.
> 
> Signed-off-by: Long Li <longli@microsoft.com>
> ---
>  drivers/nvme/host/core.c | 57 +++++++++++++++++++++++++++++++++++++++-
>  drivers/nvme/host/nvme.h |  1 +
>  2 files changed, 57 insertions(+), 1 deletion(-)

WTH does this live in the NVME driver? Surely something like this should
be in the block layer. I'm thinking there's fiber channel connected
storage that should be able to trigger much the same issues.

> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 6a9dd68c0f4f..576bb6fce293 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c

> @@ -260,9 +270,54 @@ static void nvme_retry_req(struct request *req)
>  	blk_mq_delay_kick_requeue_list(req->q, delay);
>  }
>  
> +static void nvme_complete_rq_work(struct work_struct *work)
> +{
> +	struct nvme_request *nvme_rq =
> +		container_of(work, struct nvme_request, work);
> +	struct request *req = blk_mq_rq_from_pdu(nvme_rq);
> +
> +	nvme_complete_rq(req);
> +}
> +
> +
>  void nvme_complete_rq(struct request *req)
>  {
> -	blk_status_t status = nvme_error_status(req);
> +	blk_status_t status;
> +	int cpu;
> +	u64 switches;
> +	struct nvme_request *nvme_rq;
> +
> +	if (!in_interrupt())
> +		goto skip_check;
> +
> +	nvme_rq = nvme_req(req);
> +	cpu = smp_processor_id();
> +	if (idle_cpu(cpu))
> +		goto skip_check;
> +
> +	/* Check if this CPU is flooded with interrupts */
> +	switches = get_cpu_rq_switches(cpu);
> +	if (this_cpu_read(last_switch) == switches) {
> +		/*
> +		 * If this CPU hasn't made a context switch in
> +		 * MAX_SCHED_TIMEOUT ns (and it's not idle), schedule a work to
> +		 * complete this I/O. This forces this CPU run non-interrupt
> +		 * code and throttle the other CPU issuing the I/O
> +		 */

What if there was only a single task on that CPU? Then we'd never
need/want to context switch in the first place.

AFAICT all this is just a whole bunch of gruesome hacks piled on top one
another.

> +		if (sched_clock() - this_cpu_read(last_clock)
> +				> MAX_SCHED_TIMEOUT) {
> +			INIT_WORK(&nvme_rq->work, nvme_complete_rq_work);
> +			schedule_work_on(cpu, &nvme_rq->work);
> +			return;
> +		}
> +
> +	} else {
> +		this_cpu_write(last_switch, switches);
> +		this_cpu_write(last_clock, sched_clock());
> +	}
> +
> +skip_check:

Aside from everything else; this is just sodding poor coding style. What
is wrong with something like:

	if (nvme_complete_throttle(...))
		return;

> +	status = nvme_error_status(req);
>  
>  	trace_nvme_complete_rq(req);
>
Sagi Grimberg Aug. 20, 2019, 5:33 p.m. UTC | #2
> From: Long Li <longli@microsoft.com>
> 
> When a NVMe hardware queue is mapped to several CPU queues, it is possible
> that the CPU this hardware queue is bound to is flooded by returning I/O for
> other CPUs.
> 
> For example, consider the following scenario:
> 1. CPU 0, 1, 2 and 3 share the same hardware queue
> 2. the hardware queue interrupts CPU 0 for I/O response
> 3. processes from CPU 1, 2 and 3 keep sending I/Os
> 
> CPU 0 may be flooded with interrupts from NVMe device that are I/O responses
> for CPU 1, 2 and 3. Under heavy I/O load, it is possible that CPU 0 spends
> all the time serving NVMe and other system interrupts, but doesn't have a
> chance to run in process context.
> 
> To fix this, CPU 0 can schedule a work to complete the I/O request when it
> detects the scheduler is not making progress. This serves multiple purposes:
> 
> 1. This CPU has to be scheduled to complete the request. The other CPUs can't
> issue more I/Os until some previous I/Os are completed. This helps this CPU
> get out of NVMe interrupts.
> 
> 2. This acts a throttling mechanisum for NVMe devices, in that it can not
> starve a CPU while servicing I/Os from other CPUs.
> 
> 3. This CPU can make progress on RCU and other work items on its queue.

The problem is indeed real, but this is the wrong approach in my mind.

We already have irqpoll which takes care proper budgeting polling
cycles and not hogging the cpu.

I've sent rfc for this particular problem before [1]. At the time IIRC,
Christoph suggested that we will poll the first batch directly from
the irq context and reap the rest in irqpoll handler.

[1]: 
http://lists.infradead.org/pipermail/linux-nvme/2016-October/006497.html

How about something like this instead:
--
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 71127a366d3c..84bf16d75109 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -24,6 +24,7 @@
  #include <linux/io-64-nonatomic-lo-hi.h>
  #include <linux/sed-opal.h>
  #include <linux/pci-p2pdma.h>
+#include <linux/irq_poll.h>

  #include "trace.h"
  #include "nvme.h"
@@ -32,6 +33,7 @@
  #define CQ_SIZE(q)     ((q)->q_depth * sizeof(struct nvme_completion))

  #define SGES_PER_PAGE  (PAGE_SIZE / sizeof(struct nvme_sgl_desc))
+#define NVME_POLL_BUDGET_IRQ   256

  /*
   * These can be higher, but we need to ensure that any command doesn't
@@ -189,6 +191,7 @@ struct nvme_queue {
         u32 *dbbuf_cq_db;
         u32 *dbbuf_sq_ei;
         u32 *dbbuf_cq_ei;
+       struct irq_poll iop;
         struct completion delete_done;
  };

@@ -1015,6 +1018,23 @@ static inline int nvme_process_cq(struct 
nvme_queue *nvmeq, u16 *start,
         return found;
  }

+static int nvme_irqpoll_handler(struct irq_poll *iop, int budget)
+{
+       struct nvme_queue *nvmeq = container_of(iop, struct nvme_queue, 
iop);
+       struct pci_dev *pdev = to_pci_dev(nvmeq->dev->dev);
+       u16 start, end;
+       int completed;
+
+       completed = nvme_process_cq(nvmeq, &start, &end, budget);
+       nvme_complete_cqes(nvmeq, start, end);
+       if (completed < budget) {
+               irq_poll_complete(&nvmeq->iop);
+               enable_irq(pci_irq_vector(pdev, nvmeq->cq_vector));
+       }
+
+       return completed;
+}
+
  static irqreturn_t nvme_irq(int irq, void *data)
  {
         struct nvme_queue *nvmeq = data;
@@ -1028,12 +1048,16 @@ static irqreturn_t nvme_irq(int irq, void *data)
         rmb();
         if (nvmeq->cq_head != nvmeq->last_cq_head)
                 ret = IRQ_HANDLED;
-       nvme_process_cq(nvmeq, &start, &end, -1);
+       nvme_process_cq(nvmeq, &start, &end, NVME_POLL_BUDGET_IRQ);
         nvmeq->last_cq_head = nvmeq->cq_head;
         wmb();

         if (start != end) {
                 nvme_complete_cqes(nvmeq, start, end);
+               if (nvme_cqe_pending(nvmeq)) {
+                       disable_irq_nosync(irq);
+                       irq_poll_sched(&nvmeq->iop);
+               }
                 return IRQ_HANDLED;
         }

@@ -1347,6 +1371,7 @@ static enum blk_eh_timer_return 
nvme_timeout(struct request *req, bool reserved)

  static void nvme_free_queue(struct nvme_queue *nvmeq)
  {
+       irq_poll_disable(&nvmeq->iop);
         dma_free_coherent(nvmeq->dev->dev, CQ_SIZE(nvmeq),
                                 (void *)nvmeq->cqes, nvmeq->cq_dma_addr);
         if (!nvmeq->sq_cmds)
@@ -1481,6 +1506,7 @@ static int nvme_alloc_queue(struct nvme_dev *dev, 
int qid, int depth)
         nvmeq->dev = dev;
         spin_lock_init(&nvmeq->sq_lock);
         spin_lock_init(&nvmeq->cq_poll_lock);
+       irq_poll_init(&nvmeq->iop, NVME_POLL_BUDGET_IRQ, 
nvme_irqpoll_handler);
         nvmeq->cq_head = 0;
         nvmeq->cq_phase = 1;
         nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride];
--
Long Li Aug. 21, 2019, 8:37 a.m. UTC | #3
>>>Subject: Re: [PATCH 3/3] nvme: complete request in work queue on CPU
>>>with flooded interrupts
>>>
>>>On Mon, Aug 19, 2019 at 11:14:29PM -0700, longli@linuxonhyperv.com
>>>wrote:
>>>> From: Long Li <longli@microsoft.com>
>>>>
>>>> When a NVMe hardware queue is mapped to several CPU queues, it is
>>>> possible that the CPU this hardware queue is bound to is flooded by
>>>> returning I/O for other CPUs.
>>>>
>>>> For example, consider the following scenario:
>>>> 1. CPU 0, 1, 2 and 3 share the same hardware queue 2. the hardware
>>>> queue interrupts CPU 0 for I/O response 3. processes from CPU 1, 2 and
>>>> 3 keep sending I/Os
>>>>
>>>> CPU 0 may be flooded with interrupts from NVMe device that are I/O
>>>> responses for CPU 1, 2 and 3. Under heavy I/O load, it is possible
>>>> that CPU 0 spends all the time serving NVMe and other system
>>>> interrupts, but doesn't have a chance to run in process context.
>>>
>>>Ideally -- and there is some code to affect this, the load-balancer will move
>>>tasks away from this CPU.
>>>
>>>> To fix this, CPU 0 can schedule a work to complete the I/O request
>>>> when it detects the scheduler is not making progress. This serves multiple
>>>purposes:
>>>
>>>Suppose the task waiting for the IO completion is a RT task, and you've just
>>>queued it to a regular work. This is an instant priority inversion.

This is a choice. We can either not "lock up" the CPU, or finish the I/O on time from IRQ handler. I think throttling only happens in extreme conditions, which is rare. The purpose is to make the whole system responsive and happy.

>>>
>>>> 1. This CPU has to be scheduled to complete the request. The other
>>>> CPUs can't issue more I/Os until some previous I/Os are completed.
>>>> This helps this CPU get out of NVMe interrupts.
>>>>
>>>> 2. This acts a throttling mechanisum for NVMe devices, in that it can
>>>> not starve a CPU while servicing I/Os from other CPUs.
>>>>
>>>> 3. This CPU can make progress on RCU and other work items on its queue.
>>>>
>>>> Signed-off-by: Long Li <longli@microsoft.com>
>>>> ---
>>>>  drivers/nvme/host/core.c | 57
>>>> +++++++++++++++++++++++++++++++++++++++-
>>>>  drivers/nvme/host/nvme.h |  1 +
>>>>  2 files changed, 57 insertions(+), 1 deletion(-)
>>>
>>>WTH does this live in the NVME driver? Surely something like this should be
>>>in the block layer. I'm thinking there's fiber channel connected storage that
>>>should be able to trigger much the same issues.

Yes this can be done in block layer. I'm not sure the best way to accomplish this so implemented a NVMe patch to help test. The test results are promising in that we are getting 99.5% of performance while avoided CPU lockup. The challenge is to find a way to throttle a fast storage device.

>>>
>>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index
>>>> 6a9dd68c0f4f..576bb6fce293 100644
>>>> --- a/drivers/nvme/host/core.c
>>>> +++ b/drivers/nvme/host/core.c
>>>
>>>> @@ -260,9 +270,54 @@ static void nvme_retry_req(struct request *req)
>>>>  	blk_mq_delay_kick_requeue_list(req->q, delay);  }
>>>>
>>>> +static void nvme_complete_rq_work(struct work_struct *work) {
>>>> +	struct nvme_request *nvme_rq =
>>>> +		container_of(work, struct nvme_request, work);
>>>> +	struct request *req = blk_mq_rq_from_pdu(nvme_rq);
>>>> +
>>>> +	nvme_complete_rq(req);
>>>> +}
>>>> +
>>>> +
>>>>  void nvme_complete_rq(struct request *req)  {
>>>> -	blk_status_t status = nvme_error_status(req);
>>>> +	blk_status_t status;
>>>> +	int cpu;
>>>> +	u64 switches;
>>>> +	struct nvme_request *nvme_rq;
>>>> +
>>>> +	if (!in_interrupt())
>>>> +		goto skip_check;
>>>> +
>>>> +	nvme_rq = nvme_req(req);
>>>> +	cpu = smp_processor_id();
>>>> +	if (idle_cpu(cpu))
>>>> +		goto skip_check;
>>>> +
>>>> +	/* Check if this CPU is flooded with interrupts */
>>>> +	switches = get_cpu_rq_switches(cpu);
>>>> +	if (this_cpu_read(last_switch) == switches) {
>>>> +		/*
>>>> +		 * If this CPU hasn't made a context switch in
>>>> +		 * MAX_SCHED_TIMEOUT ns (and it's not idle), schedule a
>>>work to
>>>> +		 * complete this I/O. This forces this CPU run non-interrupt
>>>> +		 * code and throttle the other CPU issuing the I/O
>>>> +		 */
>>>
>>>What if there was only a single task on that CPU? Then we'd never
>>>need/want to context switch in the first place.
>>>
>>>AFAICT all this is just a whole bunch of gruesome hacks piled on top one
>>>another.
>>>
>>>> +		if (sched_clock() - this_cpu_read(last_clock)
>>>> +				> MAX_SCHED_TIMEOUT) {
>>>> +			INIT_WORK(&nvme_rq->work,
>>>nvme_complete_rq_work);
>>>> +			schedule_work_on(cpu, &nvme_rq->work);
>>>> +			return;
>>>> +		}
>>>> +
>>>> +	} else {
>>>> +		this_cpu_write(last_switch, switches);
>>>> +		this_cpu_write(last_clock, sched_clock());
>>>> +	}
>>>> +
>>>> +skip_check:
>>>
>>>Aside from everything else; this is just sodding poor coding style. What is
>>>wrong with something like:
>>>
>>>	if (nvme_complete_throttle(...))
>>>		return;
>>>
>>>> +	status = nvme_error_status(req);
>>>>
>>>>  	trace_nvme_complete_rq(req);
>>>>
Long Li Aug. 21, 2019, 8:39 a.m. UTC | #4
>>>Subject: Re: [PATCH 3/3] nvme: complete request in work queue on CPU
>>>with flooded interrupts
>>>
>>>
>>>> From: Long Li <longli@microsoft.com>
>>>>
>>>> When a NVMe hardware queue is mapped to several CPU queues, it is
>>>> possible that the CPU this hardware queue is bound to is flooded by
>>>> returning I/O for other CPUs.
>>>>
>>>> For example, consider the following scenario:
>>>> 1. CPU 0, 1, 2 and 3 share the same hardware queue 2. the hardware
>>>> queue interrupts CPU 0 for I/O response 3. processes from CPU 1, 2 and
>>>> 3 keep sending I/Os
>>>>
>>>> CPU 0 may be flooded with interrupts from NVMe device that are I/O
>>>> responses for CPU 1, 2 and 3. Under heavy I/O load, it is possible
>>>> that CPU 0 spends all the time serving NVMe and other system
>>>> interrupts, but doesn't have a chance to run in process context.
>>>>
>>>> To fix this, CPU 0 can schedule a work to complete the I/O request
>>>> when it detects the scheduler is not making progress. This serves multiple
>>>purposes:
>>>>
>>>> 1. This CPU has to be scheduled to complete the request. The other
>>>> CPUs can't issue more I/Os until some previous I/Os are completed.
>>>> This helps this CPU get out of NVMe interrupts.
>>>>
>>>> 2. This acts a throttling mechanisum for NVMe devices, in that it can
>>>> not starve a CPU while servicing I/Os from other CPUs.
>>>>
>>>> 3. This CPU can make progress on RCU and other work items on its queue.
>>>
>>>The problem is indeed real, but this is the wrong approach in my mind.
>>>
>>>We already have irqpoll which takes care proper budgeting polling cycles
>>>and not hogging the cpu.
>>>
>>>I've sent rfc for this particular problem before [1]. At the time IIRC,
>>>Christoph suggested that we will poll the first batch directly from the irq
>>>context and reap the rest in irqpoll handler.

Thanks for the pointer. I will test and report back.

>>>
>>>[1]:
>>>https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Flists.
>>>infradead.org%2Fpipermail%2Flinux-nvme%2F2016-
>>>October%2F006497.html&amp;data=02%7C01%7Clongli%40microsoft.com%
>>>7C0ebf36eff15c4182116608d725948b93%7C72f988bf86f141af91ab2d7cd011d
>>>b47%7C1%7C0%7C637019192254250361&amp;sdata=fJ%2Fkc8HLSmfzaY3BY
>>>E66zlZKD6FjcXgMJZzVGCVqI%2FU%3D&amp;reserved=0
>>>
>>>How about something like this instead:
>>>--
>>>diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index
>>>71127a366d3c..84bf16d75109 100644
>>>--- a/drivers/nvme/host/pci.c
>>>+++ b/drivers/nvme/host/pci.c
>>>@@ -24,6 +24,7 @@
>>>  #include <linux/io-64-nonatomic-lo-hi.h>
>>>  #include <linux/sed-opal.h>
>>>  #include <linux/pci-p2pdma.h>
>>>+#include <linux/irq_poll.h>
>>>
>>>  #include "trace.h"
>>>  #include "nvme.h"
>>>@@ -32,6 +33,7 @@
>>>  #define CQ_SIZE(q)     ((q)->q_depth * sizeof(struct nvme_completion))
>>>
>>>  #define SGES_PER_PAGE  (PAGE_SIZE / sizeof(struct nvme_sgl_desc))
>>>+#define NVME_POLL_BUDGET_IRQ   256
>>>
>>>  /*
>>>   * These can be higher, but we need to ensure that any command doesn't
>>>@@ -189,6 +191,7 @@ struct nvme_queue {
>>>         u32 *dbbuf_cq_db;
>>>         u32 *dbbuf_sq_ei;
>>>         u32 *dbbuf_cq_ei;
>>>+       struct irq_poll iop;
>>>         struct completion delete_done;
>>>  };
>>>
>>>@@ -1015,6 +1018,23 @@ static inline int nvme_process_cq(struct
>>>nvme_queue *nvmeq, u16 *start,
>>>         return found;
>>>  }
>>>
>>>+static int nvme_irqpoll_handler(struct irq_poll *iop, int budget) {
>>>+       struct nvme_queue *nvmeq = container_of(iop, struct nvme_queue,
>>>iop);
>>>+       struct pci_dev *pdev = to_pci_dev(nvmeq->dev->dev);
>>>+       u16 start, end;
>>>+       int completed;
>>>+
>>>+       completed = nvme_process_cq(nvmeq, &start, &end, budget);
>>>+       nvme_complete_cqes(nvmeq, start, end);
>>>+       if (completed < budget) {
>>>+               irq_poll_complete(&nvmeq->iop);
>>>+               enable_irq(pci_irq_vector(pdev, nvmeq->cq_vector));
>>>+       }
>>>+
>>>+       return completed;
>>>+}
>>>+
>>>  static irqreturn_t nvme_irq(int irq, void *data)
>>>  {
>>>         struct nvme_queue *nvmeq = data; @@ -1028,12 +1048,16 @@ static
>>>irqreturn_t nvme_irq(int irq, void *data)
>>>         rmb();
>>>         if (nvmeq->cq_head != nvmeq->last_cq_head)
>>>                 ret = IRQ_HANDLED;
>>>-       nvme_process_cq(nvmeq, &start, &end, -1);
>>>+       nvme_process_cq(nvmeq, &start, &end, NVME_POLL_BUDGET_IRQ);
>>>         nvmeq->last_cq_head = nvmeq->cq_head;
>>>         wmb();
>>>
>>>         if (start != end) {
>>>                 nvme_complete_cqes(nvmeq, start, end);
>>>+               if (nvme_cqe_pending(nvmeq)) {
>>>+                       disable_irq_nosync(irq);
>>>+                       irq_poll_sched(&nvmeq->iop);
>>>+               }
>>>                 return IRQ_HANDLED;
>>>         }
>>>
>>>@@ -1347,6 +1371,7 @@ static enum blk_eh_timer_return
>>>nvme_timeout(struct request *req, bool reserved)
>>>
>>>  static void nvme_free_queue(struct nvme_queue *nvmeq)
>>>  {
>>>+       irq_poll_disable(&nvmeq->iop);
>>>         dma_free_coherent(nvmeq->dev->dev, CQ_SIZE(nvmeq),
>>>                                 (void *)nvmeq->cqes, nvmeq->cq_dma_addr);
>>>         if (!nvmeq->sq_cmds)
>>>@@ -1481,6 +1506,7 @@ static int nvme_alloc_queue(struct nvme_dev
>>>*dev, int qid, int depth)
>>>         nvmeq->dev = dev;
>>>         spin_lock_init(&nvmeq->sq_lock);
>>>         spin_lock_init(&nvmeq->cq_poll_lock);
>>>+       irq_poll_init(&nvmeq->iop, NVME_POLL_BUDGET_IRQ,
>>>nvme_irqpoll_handler);
>>>         nvmeq->cq_head = 0;
>>>         nvmeq->cq_phase = 1;
>>>         nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride];
>>>--
Peter Zijlstra Aug. 21, 2019, 10:35 a.m. UTC | #5
On Wed, Aug 21, 2019 at 08:37:55AM +0000, Long Li wrote:
> >>>Subject: Re: [PATCH 3/3] nvme: complete request in work queue on CPU
> >>>with flooded interrupts
> >>>
> >>>On Mon, Aug 19, 2019 at 11:14:29PM -0700, longli@linuxonhyperv.com
> >>>wrote:
> >>>> From: Long Li <longli@microsoft.com>
> >>>>
> >>>> When a NVMe hardware queue is mapped to several CPU queues, it is
> >>>> possible that the CPU this hardware queue is bound to is flooded by
> >>>> returning I/O for other CPUs.
> >>>>
> >>>> For example, consider the following scenario:
> >>>> 1. CPU 0, 1, 2 and 3 share the same hardware queue 2. the hardware
> >>>> queue interrupts CPU 0 for I/O response 3. processes from CPU 1, 2 and
> >>>> 3 keep sending I/Os
> >>>>
> >>>> CPU 0 may be flooded with interrupts from NVMe device that are I/O
> >>>> responses for CPU 1, 2 and 3. Under heavy I/O load, it is possible
> >>>> that CPU 0 spends all the time serving NVMe and other system
> >>>> interrupts, but doesn't have a chance to run in process context.
> >>>
> >>>Ideally -- and there is some code to affect this, the load-balancer will move
> >>>tasks away from this CPU.
> >>>
> >>>> To fix this, CPU 0 can schedule a work to complete the I/O request
> >>>> when it detects the scheduler is not making progress. This serves multiple
> >>>purposes:
> >>>
> >>>Suppose the task waiting for the IO completion is a RT task, and you've just
> >>>queued it to a regular work. This is an instant priority inversion.
> 
> This is a choice. We can either not "lock up" the CPU, or finish the I/O on time from IRQ handler. I think throttling only happens in extreme conditions, which is rare. The purpose is to make the whole system responsive and happy.

Can you please use a sane MUA.. this is unreadable garbage.
Long Li Aug. 21, 2019, 5:36 p.m. UTC | #6
>>>Subject: RE: [PATCH 3/3] nvme: complete request in work queue on CPU
>>>with flooded interrupts
>>>
>>>>>>Subject: Re: [PATCH 3/3] nvme: complete request in work queue on
>>>CPU
>>>>>>with flooded interrupts
>>>>>>
>>>>>>
>>>>>>> From: Long Li <longli@microsoft.com>
>>>>>>>
>>>>>>> When a NVMe hardware queue is mapped to several CPU queues, it is
>>>>>>> possible that the CPU this hardware queue is bound to is flooded by
>>>>>>> returning I/O for other CPUs.
>>>>>>>
>>>>>>> For example, consider the following scenario:
>>>>>>> 1. CPU 0, 1, 2 and 3 share the same hardware queue 2. the hardware
>>>>>>> queue interrupts CPU 0 for I/O response 3. processes from CPU 1, 2
>>>>>>> and
>>>>>>> 3 keep sending I/Os
>>>>>>>
>>>>>>> CPU 0 may be flooded with interrupts from NVMe device that are I/O
>>>>>>> responses for CPU 1, 2 and 3. Under heavy I/O load, it is possible
>>>>>>> that CPU 0 spends all the time serving NVMe and other system
>>>>>>> interrupts, but doesn't have a chance to run in process context.
>>>>>>>
>>>>>>> To fix this, CPU 0 can schedule a work to complete the I/O request
>>>>>>> when it detects the scheduler is not making progress. This serves
>>>>>>> multiple
>>>>>>purposes:
>>>>>>>
>>>>>>> 1. This CPU has to be scheduled to complete the request. The other
>>>>>>> CPUs can't issue more I/Os until some previous I/Os are completed.
>>>>>>> This helps this CPU get out of NVMe interrupts.
>>>>>>>
>>>>>>> 2. This acts a throttling mechanisum for NVMe devices, in that it
>>>>>>> can not starve a CPU while servicing I/Os from other CPUs.
>>>>>>>
>>>>>>> 3. This CPU can make progress on RCU and other work items on its
>>>queue.
>>>>>>
>>>>>>The problem is indeed real, but this is the wrong approach in my mind.
>>>>>>
>>>>>>We already have irqpoll which takes care proper budgeting polling
>>>>>>cycles and not hogging the cpu.
>>>>>>
>>>>>>I've sent rfc for this particular problem before [1]. At the time
>>>>>>IIRC, Christoph suggested that we will poll the first batch directly
>>>>>>from the irq context and reap the rest in irqpoll handler.
>>>
>>>Thanks for the pointer. I will test and report back.

Sagi,

Here are the test results.

Benchmark command:
fio --bs=4k --ioengine=libaio --iodepth=64 --filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1 --direct=1 --runtime=90 --numjobs=80 --rw=randread --name=test --group_reporting --gtod_reduce=1

With your patch: 1720k IOPS
With threaded interrupts: 1320k IOPS
With just interrupts: 3720k IOPS

Interrupts are the fastest but we need to find a way to throttle it.

Thanks

Long


>>>
>>>>>>
>>>>>>[1]:
>>>>>>https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Fl
>>>ists.
>>>>>>infradead.org%2Fpipermail%2Flinux-nvme%2F2016-
>>>>>>October%2F006497.html&amp;data=02%7C01%7Clongli%40microsoft.co
>>>m%
>>>>>>7C0ebf36eff15c4182116608d725948b93%7C72f988bf86f141af91ab2d7cd0
>>>11d
>>>>>>b47%7C1%7C0%7C637019192254250361&amp;sdata=fJ%2Fkc8HLSmfzaY
>>>3BY
>>>>>>E66zlZKD6FjcXgMJZzVGCVqI%2FU%3D&amp;reserved=0
>>>>>>
>>>>>>How about something like this instead:
>>>>>>--
>>>>>>diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index
>>>>>>71127a366d3c..84bf16d75109 100644
>>>>>>--- a/drivers/nvme/host/pci.c
>>>>>>+++ b/drivers/nvme/host/pci.c
>>>>>>@@ -24,6 +24,7 @@
>>>>>>  #include <linux/io-64-nonatomic-lo-hi.h>
>>>>>>  #include <linux/sed-opal.h>
>>>>>>  #include <linux/pci-p2pdma.h>
>>>>>>+#include <linux/irq_poll.h>
>>>>>>
>>>>>>  #include "trace.h"
>>>>>>  #include "nvme.h"
>>>>>>@@ -32,6 +33,7 @@
>>>>>>  #define CQ_SIZE(q)     ((q)->q_depth * sizeof(struct nvme_completion))
>>>>>>
>>>>>>  #define SGES_PER_PAGE  (PAGE_SIZE / sizeof(struct nvme_sgl_desc))
>>>>>>+#define NVME_POLL_BUDGET_IRQ   256
>>>>>>
>>>>>>  /*
>>>>>>   * These can be higher, but we need to ensure that any command
>>>>>>doesn't @@ -189,6 +191,7 @@ struct nvme_queue {
>>>>>>         u32 *dbbuf_cq_db;
>>>>>>         u32 *dbbuf_sq_ei;
>>>>>>         u32 *dbbuf_cq_ei;
>>>>>>+       struct irq_poll iop;
>>>>>>         struct completion delete_done;  };
>>>>>>
>>>>>>@@ -1015,6 +1018,23 @@ static inline int nvme_process_cq(struct
>>>>>>nvme_queue *nvmeq, u16 *start,
>>>>>>         return found;
>>>>>>  }
>>>>>>
>>>>>>+static int nvme_irqpoll_handler(struct irq_poll *iop, int budget) {
>>>>>>+       struct nvme_queue *nvmeq = container_of(iop, struct
>>>>>>+nvme_queue,
>>>>>>iop);
>>>>>>+       struct pci_dev *pdev = to_pci_dev(nvmeq->dev->dev);
>>>>>>+       u16 start, end;
>>>>>>+       int completed;
>>>>>>+
>>>>>>+       completed = nvme_process_cq(nvmeq, &start, &end, budget);
>>>>>>+       nvme_complete_cqes(nvmeq, start, end);
>>>>>>+       if (completed < budget) {
>>>>>>+               irq_poll_complete(&nvmeq->iop);
>>>>>>+               enable_irq(pci_irq_vector(pdev, nvmeq->cq_vector));
>>>>>>+       }
>>>>>>+
>>>>>>+       return completed;
>>>>>>+}
>>>>>>+
>>>>>>  static irqreturn_t nvme_irq(int irq, void *data)
>>>>>>  {
>>>>>>         struct nvme_queue *nvmeq = data; @@ -1028,12 +1048,16 @@
>>>>>>static irqreturn_t nvme_irq(int irq, void *data)
>>>>>>         rmb();
>>>>>>         if (nvmeq->cq_head != nvmeq->last_cq_head)
>>>>>>                 ret = IRQ_HANDLED;
>>>>>>-       nvme_process_cq(nvmeq, &start, &end, -1);
>>>>>>+       nvme_process_cq(nvmeq, &start, &end,
>>>NVME_POLL_BUDGET_IRQ);
>>>>>>         nvmeq->last_cq_head = nvmeq->cq_head;
>>>>>>         wmb();
>>>>>>
>>>>>>         if (start != end) {
>>>>>>                 nvme_complete_cqes(nvmeq, start, end);
>>>>>>+               if (nvme_cqe_pending(nvmeq)) {
>>>>>>+                       disable_irq_nosync(irq);
>>>>>>+                       irq_poll_sched(&nvmeq->iop);
>>>>>>+               }
>>>>>>                 return IRQ_HANDLED;
>>>>>>         }
>>>>>>
>>>>>>@@ -1347,6 +1371,7 @@ static enum blk_eh_timer_return
>>>>>>nvme_timeout(struct request *req, bool reserved)
>>>>>>
>>>>>>  static void nvme_free_queue(struct nvme_queue *nvmeq)  {
>>>>>>+       irq_poll_disable(&nvmeq->iop);
>>>>>>         dma_free_coherent(nvmeq->dev->dev, CQ_SIZE(nvmeq),
>>>>>>                                 (void *)nvmeq->cqes, nvmeq->cq_dma_addr);
>>>>>>         if (!nvmeq->sq_cmds)
>>>>>>@@ -1481,6 +1506,7 @@ static int nvme_alloc_queue(struct nvme_dev
>>>>>>*dev, int qid, int depth)
>>>>>>         nvmeq->dev = dev;
>>>>>>         spin_lock_init(&nvmeq->sq_lock);
>>>>>>         spin_lock_init(&nvmeq->cq_poll_lock);
>>>>>>+       irq_poll_init(&nvmeq->iop, NVME_POLL_BUDGET_IRQ,
>>>>>>nvme_irqpoll_handler);
>>>>>>         nvmeq->cq_head = 0;
>>>>>>         nvmeq->cq_phase = 1;
>>>>>>         nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride];
>>>>>>--
Sagi Grimberg Aug. 21, 2019, 9:54 p.m. UTC | #7
> Sagi,
> 
> Here are the test results.
> 
> Benchmark command:
> fio --bs=4k --ioengine=libaio --iodepth=64 --filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1 --direct=1 --runtime=90 --numjobs=80 --rw=randread --name=test --group_reporting --gtod_reduce=1
> 
> With your patch: 1720k IOPS
> With threaded interrupts: 1320k IOPS
> With just interrupts: 3720k IOPS
> 
> Interrupts are the fastest but we need to find a way to throttle it.

This is the workload that generates the flood?
If so I did not expect that this would be the perf difference..

If completions keep pounding on the cpu, I would expect irqpoll
to simply keep running forever and poll the cqs. There is no
fundamental reason why polling would be faster in an interrupt,
what matters could be:
1. we reschedule more than we need to
2. we don't reap enough completions in every poll round, which
will trigger rearming the interrupt and then when it fires reschedule
another softirq...

Maybe we need to take care of some irq_poll optimizations?

Does this (untested) patch make any difference?
--
diff --git a/lib/irq_poll.c b/lib/irq_poll.c
index 2f17b488d58e..0e94183eba15 100644
--- a/lib/irq_poll.c
+++ b/lib/irq_poll.c
@@ -12,7 +12,8 @@
  #include <linux/irq_poll.h>
  #include <linux/delay.h>

-static unsigned int irq_poll_budget __read_mostly = 256;
+static unsigned int irq_poll_budget __read_mostly = 3000;
+unsigned int __read_mostly irqpoll_budget_usecs = 2000;

  static DEFINE_PER_CPU(struct list_head, blk_cpu_iopoll);

@@ -77,32 +78,26 @@ EXPORT_SYMBOL(irq_poll_complete);

  static void __latent_entropy irq_poll_softirq(struct softirq_action *h)
  {
-       struct list_head *list = this_cpu_ptr(&blk_cpu_iopoll);
-       int rearm = 0, budget = irq_poll_budget;
-       unsigned long start_time = jiffies;
+       struct list_head *irqpoll_list = this_cpu_ptr(&blk_cpu_iopoll);
+       unsigned int budget = irq_poll_budget;
+       unsigned long time_limit =
+               jiffies + usecs_to_jiffies(irqpoll_budget_usecs);
+       LIST_HEAD(list);

         local_irq_disable();
+       list_splice_init(irqpoll_list, &list);
+       local_irq_enable();

-       while (!list_empty(list)) {
+       while (!list_empty(&list)) {
                 struct irq_poll *iop;
                 int work, weight;

-               /*
-                * If softirq window is exhausted then punt.
-                */
-               if (budget <= 0 || time_after(jiffies, start_time)) {
-                       rearm = 1;
-                       break;
-               }
-
-               local_irq_enable();
-
                 /* Even though interrupts have been re-enabled, this
                  * access is safe because interrupts can only add new
                  * entries to the tail of this list, and only ->poll()
                  * calls can remove this head entry from the list.
                  */
-               iop = list_entry(list->next, struct irq_poll, list);
+               iop = list_first_entry(&list, struct irq_poll, list);

                 weight = iop->weight;
                 work = 0;
@@ -111,8 +106,6 @@ static void __latent_entropy irq_poll_softirq(struct 
softirq_action *h)

                 budget -= work;

-               local_irq_disable();
-
                 /*
                  * Drivers must not modify the iopoll state, if they
                  * consume their assigned weight (or more, some drivers 
can't
@@ -125,11 +118,21 @@ static void __latent_entropy 
irq_poll_softirq(struct softirq_action *h)
                         if (test_bit(IRQ_POLL_F_DISABLE, &iop->state))
                                 __irq_poll_complete(iop);
                         else
-                               list_move_tail(&iop->list, list);
+                               list_move_tail(&iop->list, &list);
                 }
+
+               /*
+                * If softirq window is exhausted then punt.
+                */
+               if (budget <= 0 || time_after_eq(jiffies, time_limit))
+                       break;
         }

-       if (rearm)
+       local_irq_disable();
+
+       list_splice_tail_init(irqpoll_list, &list);
+       list_splice(&list, irqpoll_list);
+       if (!list_empty(irqpoll_list))
                 __raise_softirq_irqoff(IRQ_POLL_SOFTIRQ);

         local_irq_enable();
--
Ming Lei Aug. 23, 2019, 3:21 a.m. UTC | #8
On Tue, Aug 20, 2019 at 10:33:38AM -0700, Sagi Grimberg wrote:
> 
> > From: Long Li <longli@microsoft.com>
> > 
> > When a NVMe hardware queue is mapped to several CPU queues, it is possible
> > that the CPU this hardware queue is bound to is flooded by returning I/O for
> > other CPUs.
> > 
> > For example, consider the following scenario:
> > 1. CPU 0, 1, 2 and 3 share the same hardware queue
> > 2. the hardware queue interrupts CPU 0 for I/O response
> > 3. processes from CPU 1, 2 and 3 keep sending I/Os
> > 
> > CPU 0 may be flooded with interrupts from NVMe device that are I/O responses
> > for CPU 1, 2 and 3. Under heavy I/O load, it is possible that CPU 0 spends
> > all the time serving NVMe and other system interrupts, but doesn't have a
> > chance to run in process context.
> > 
> > To fix this, CPU 0 can schedule a work to complete the I/O request when it
> > detects the scheduler is not making progress. This serves multiple purposes:
> > 
> > 1. This CPU has to be scheduled to complete the request. The other CPUs can't
> > issue more I/Os until some previous I/Os are completed. This helps this CPU
> > get out of NVMe interrupts.
> > 
> > 2. This acts a throttling mechanisum for NVMe devices, in that it can not
> > starve a CPU while servicing I/Os from other CPUs.
> > 
> > 3. This CPU can make progress on RCU and other work items on its queue.
> 
> The problem is indeed real, but this is the wrong approach in my mind.
> 
> We already have irqpoll which takes care proper budgeting polling
> cycles and not hogging the cpu.

The issue isn't unique to NVMe, and can be any fast devices which
interrupts CPU too frequently, meantime the interrupt/softirq handler may
take a bit much time, then CPU is easy to be lockup by the interrupt/sofirq
handler, especially in case that multiple submission CPUs vs. single
completion CPU.

Some SCSI devices has the same problem too.

Could we consider to add one generic mechanism to cover this kind of
problem?

One approach I thought of is to allocate one backup thread for handling
such interrupt, which can be marked as IRQF_BACKUP_THREAD by drivers. 

Inside do_IRQ(), irqtime is accounted, before calling action->handler(),
check if this CPU has taken too long time for handling IRQ(interrupt or
softirq) and see if this CPU could be lock up. If yes, wakeup the backup
thread to handle the interrupt for avoiding lockup this CPU.

The threaded interrupt framework is there, and this way could be easier
to implement. Meantime most time the handler is run in interrupt context
and we may avoid the performance loss when CPU isn't busy enough.

Any comment on this approach?

Thanks,
Ming
Long Li Aug. 24, 2019, 12:13 a.m. UTC | #9
>>>Subject: Re: [PATCH 3/3] nvme: complete request in work queue on CPU
>>>with flooded interrupts
>>>
>>>
>>>> Sagi,
>>>>
>>>> Here are the test results.
>>>>
>>>> Benchmark command:
>>>> fio --bs=4k --ioengine=libaio --iodepth=64
>>>> --
>>>filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/d
>>>ev/nv
>>>>
>>>me4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev
>>>/nvme9n1
>>>> --direct=1 --runtime=90 --numjobs=80 --rw=randread --name=test
>>>> --group_reporting --gtod_reduce=1
>>>>
>>>> With your patch: 1720k IOPS
>>>> With threaded interrupts: 1320k IOPS
>>>> With just interrupts: 3720k IOPS
>>>>
>>>> Interrupts are the fastest but we need to find a way to throttle it.
>>>
>>>This is the workload that generates the flood?
>>>If so I did not expect that this would be the perf difference..
>>>
>>>If completions keep pounding on the cpu, I would expect irqpoll to simply
>>>keep running forever and poll the cqs. There is no fundamental reason why
>>>polling would be faster in an interrupt, what matters could be:
>>>1. we reschedule more than we need to
>>>2. we don't reap enough completions in every poll round, which will trigger
>>>rearming the interrupt and then when it fires reschedule another softirq...
>>>

Yes I think it's the rescheduling that takes some. With the patch there are lots of ksoftirqd activities. (compared to nearly none without the patch)
A 90 seconds FIO run shows a big difference of context switches on all CPUs:
With patch: 5755849
Without patch: 1462931

>>>Maybe we need to take care of some irq_poll optimizations?
>>>
>>>Does this (untested) patch make any difference?
>>>--
>>>diff --git a/lib/irq_poll.c b/lib/irq_poll.c index 2f17b488d58e..0e94183eba15
>>>100644
>>>--- a/lib/irq_poll.c
>>>+++ b/lib/irq_poll.c
>>>@@ -12,7 +12,8 @@
>>>  #include <linux/irq_poll.h>
>>>  #include <linux/delay.h>
>>>
>>>-static unsigned int irq_poll_budget __read_mostly = 256;
>>>+static unsigned int irq_poll_budget __read_mostly = 3000; unsigned int
>>>+__read_mostly irqpoll_budget_usecs = 2000;
>>>
>>>  static DEFINE_PER_CPU(struct list_head, blk_cpu_iopoll);
>>>
>>>@@ -77,32 +78,26 @@ EXPORT_SYMBOL(irq_poll_complete);
>>>
>>>  static void __latent_entropy irq_poll_softirq(struct softirq_action *h)
>>>  {
>>>-       struct list_head *list = this_cpu_ptr(&blk_cpu_iopoll);
>>>-       int rearm = 0, budget = irq_poll_budget;
>>>-       unsigned long start_time = jiffies;
>>>+       struct list_head *irqpoll_list = this_cpu_ptr(&blk_cpu_iopoll);
>>>+       unsigned int budget = irq_poll_budget;
>>>+       unsigned long time_limit =
>>>+               jiffies + usecs_to_jiffies(irqpoll_budget_usecs);
>>>+       LIST_HEAD(list);
>>>
>>>         local_irq_disable();
>>>+       list_splice_init(irqpoll_list, &list);
>>>+       local_irq_enable();
>>>
>>>-       while (!list_empty(list)) {
>>>+       while (!list_empty(&list)) {
>>>                 struct irq_poll *iop;
>>>                 int work, weight;
>>>
>>>-               /*
>>>-                * If softirq window is exhausted then punt.
>>>-                */
>>>-               if (budget <= 0 || time_after(jiffies, start_time)) {
>>>-                       rearm = 1;
>>>-                       break;
>>>-               }
>>>-
>>>-               local_irq_enable();
>>>-
>>>                 /* Even though interrupts have been re-enabled, this
>>>                  * access is safe because interrupts can only add new
>>>                  * entries to the tail of this list, and only ->poll()
>>>                  * calls can remove this head entry from the list.
>>>                  */
>>>-               iop = list_entry(list->next, struct irq_poll, list);
>>>+               iop = list_first_entry(&list, struct irq_poll, list);
>>>
>>>                 weight = iop->weight;
>>>                 work = 0;
>>>@@ -111,8 +106,6 @@ static void __latent_entropy irq_poll_softirq(struct
>>>softirq_action *h)
>>>
>>>                 budget -= work;
>>>
>>>-               local_irq_disable();
>>>-
>>>                 /*
>>>                  * Drivers must not modify the iopoll state, if they
>>>                  * consume their assigned weight (or more, some drivers can't @@
>>>-125,11 +118,21 @@ static void __latent_entropy irq_poll_softirq(struct
>>>softirq_action *h)
>>>                         if (test_bit(IRQ_POLL_F_DISABLE, &iop->state))
>>>                                 __irq_poll_complete(iop);
>>>                         else
>>>-                               list_move_tail(&iop->list, list);
>>>+                               list_move_tail(&iop->list, &list);
>>>                 }
>>>+
>>>+               /*
>>>+                * If softirq window is exhausted then punt.
>>>+                */
>>>+               if (budget <= 0 || time_after_eq(jiffies, time_limit))
>>>+                       break;
>>>         }
>>>
>>>-       if (rearm)
>>>+       local_irq_disable();
>>>+
>>>+       list_splice_tail_init(irqpoll_list, &list);
>>>+       list_splice(&list, irqpoll_list);
>>>+       if (!list_empty(irqpoll_list))
>>>                 __raise_softirq_irqoff(IRQ_POLL_SOFTIRQ);
>>>
>>>         local_irq_enable();
>>>--

It's got slightly better IOPS. With this 2nd patch, the number of context switches is at 5445863 (~5% improvement over the 1st patch).
Long Li Aug. 24, 2019, 12:27 a.m. UTC | #10
>>>Subject: Re: [PATCH 3/3] nvme: complete request in work queue on CPU
>>>with flooded interrupts
>>>
>>>On Tue, Aug 20, 2019 at 10:33:38AM -0700, Sagi Grimberg wrote:
>>>>
>>>> > From: Long Li <longli@microsoft.com>
>>>> >
>>>> > When a NVMe hardware queue is mapped to several CPU queues, it is
>>>> > possible that the CPU this hardware queue is bound to is flooded by
>>>> > returning I/O for other CPUs.
>>>> >
>>>> > For example, consider the following scenario:
>>>> > 1. CPU 0, 1, 2 and 3 share the same hardware queue 2. the hardware
>>>> > queue interrupts CPU 0 for I/O response 3. processes from CPU 1, 2
>>>> > and 3 keep sending I/Os
>>>> >
>>>> > CPU 0 may be flooded with interrupts from NVMe device that are I/O
>>>> > responses for CPU 1, 2 and 3. Under heavy I/O load, it is possible
>>>> > that CPU 0 spends all the time serving NVMe and other system
>>>> > interrupts, but doesn't have a chance to run in process context.
>>>> >
>>>> > To fix this, CPU 0 can schedule a work to complete the I/O request
>>>> > when it detects the scheduler is not making progress. This serves
>>>multiple purposes:
>>>> >
>>>> > 1. This CPU has to be scheduled to complete the request. The other
>>>> > CPUs can't issue more I/Os until some previous I/Os are completed.
>>>> > This helps this CPU get out of NVMe interrupts.
>>>> >
>>>> > 2. This acts a throttling mechanisum for NVMe devices, in that it
>>>> > can not starve a CPU while servicing I/Os from other CPUs.
>>>> >
>>>> > 3. This CPU can make progress on RCU and other work items on its
>>>queue.
>>>>
>>>> The problem is indeed real, but this is the wrong approach in my mind.
>>>>
>>>> We already have irqpoll which takes care proper budgeting polling
>>>> cycles and not hogging the cpu.
>>>
>>>The issue isn't unique to NVMe, and can be any fast devices which
>>>interrupts CPU too frequently, meantime the interrupt/softirq handler may
>>>take a bit much time, then CPU is easy to be lockup by the interrupt/sofirq
>>>handler, especially in case that multiple submission CPUs vs. single
>>>completion CPU.
>>>
>>>Some SCSI devices has the same problem too.
>>>
>>>Could we consider to add one generic mechanism to cover this kind of
>>>problem?
>>>
>>>One approach I thought of is to allocate one backup thread for handling such
>>>interrupt, which can be marked as IRQF_BACKUP_THREAD by drivers.
>>>
>>>Inside do_IRQ(), irqtime is accounted, before calling action->handler(),
>>>check if this CPU has taken too long time for handling IRQ(interrupt or
>>>softirq) and see if this CPU could be lock up. If yes, wakeup the backup

How do you know if this CPU is spending all the time in do_IRQ()?

Is it something like:
If (IRQ_time /elapsed_time > a threshold value)
	wake up the backup thread

>>>thread to handle the interrupt for avoiding lockup this CPU.
>>>
>>>The threaded interrupt framework is there, and this way could be easier to
>>>implement. Meantime most time the handler is run in interrupt context and
>>>we may avoid the performance loss when CPU isn't busy enough.
>>>
>>>Any comment on this approach?
>>>
>>>Thanks,
>>>Ming
Ming Lei Aug. 24, 2019, 12:55 p.m. UTC | #11
On Sat, Aug 24, 2019 at 12:27:18AM +0000, Long Li wrote:
> >>>Subject: Re: [PATCH 3/3] nvme: complete request in work queue on CPU
> >>>with flooded interrupts
> >>>
> >>>On Tue, Aug 20, 2019 at 10:33:38AM -0700, Sagi Grimberg wrote:
> >>>>
> >>>> > From: Long Li <longli@microsoft.com>
> >>>> >
> >>>> > When a NVMe hardware queue is mapped to several CPU queues, it is
> >>>> > possible that the CPU this hardware queue is bound to is flooded by
> >>>> > returning I/O for other CPUs.
> >>>> >
> >>>> > For example, consider the following scenario:
> >>>> > 1. CPU 0, 1, 2 and 3 share the same hardware queue 2. the hardware
> >>>> > queue interrupts CPU 0 for I/O response 3. processes from CPU 1, 2
> >>>> > and 3 keep sending I/Os
> >>>> >
> >>>> > CPU 0 may be flooded with interrupts from NVMe device that are I/O
> >>>> > responses for CPU 1, 2 and 3. Under heavy I/O load, it is possible
> >>>> > that CPU 0 spends all the time serving NVMe and other system
> >>>> > interrupts, but doesn't have a chance to run in process context.
> >>>> >
> >>>> > To fix this, CPU 0 can schedule a work to complete the I/O request
> >>>> > when it detects the scheduler is not making progress. This serves
> >>>multiple purposes:
> >>>> >
> >>>> > 1. This CPU has to be scheduled to complete the request. The other
> >>>> > CPUs can't issue more I/Os until some previous I/Os are completed.
> >>>> > This helps this CPU get out of NVMe interrupts.
> >>>> >
> >>>> > 2. This acts a throttling mechanisum for NVMe devices, in that it
> >>>> > can not starve a CPU while servicing I/Os from other CPUs.
> >>>> >
> >>>> > 3. This CPU can make progress on RCU and other work items on its
> >>>queue.
> >>>>
> >>>> The problem is indeed real, but this is the wrong approach in my mind.
> >>>>
> >>>> We already have irqpoll which takes care proper budgeting polling
> >>>> cycles and not hogging the cpu.
> >>>
> >>>The issue isn't unique to NVMe, and can be any fast devices which
> >>>interrupts CPU too frequently, meantime the interrupt/softirq handler may
> >>>take a bit much time, then CPU is easy to be lockup by the interrupt/sofirq
> >>>handler, especially in case that multiple submission CPUs vs. single
> >>>completion CPU.
> >>>
> >>>Some SCSI devices has the same problem too.
> >>>
> >>>Could we consider to add one generic mechanism to cover this kind of
> >>>problem?
> >>>
> >>>One approach I thought of is to allocate one backup thread for handling such
> >>>interrupt, which can be marked as IRQF_BACKUP_THREAD by drivers.
> >>>
> >>>Inside do_IRQ(), irqtime is accounted, before calling action->handler(),
> >>>check if this CPU has taken too long time for handling IRQ(interrupt or
> >>>softirq) and see if this CPU could be lock up. If yes, wakeup the backup
> 
> How do you know if this CPU is spending all the time in do_IRQ()?
> 
> Is it something like:
> If (IRQ_time /elapsed_time > a threshold value)
> 	wake up the backup thread

Yeah, the above could work in theory.

Another approach I thought of is to monitor average irq gap time on each
CPU.

We could use EWMA(Exponential Weighted Moving Average) to do it simply,
such as:

	curr_irq_gap(cpu) = current start time of do_IRQ() on 'cpu' -
			end time of last do_IRQ() on 'cpu'
	avg_irq_gap(cpu) = weight_prev * avg_irq_gap(cpu) + weight_curr * curr_irq_gap(cpu) 

	note:
		weight_prev + weight_curr = 1

When avg_irq_gap(cpu) is close to one small enough threshold, we think irq flood is
detected.

'weight_prev' could be chosen as one big enough value for avoiding short-time flood.


Thanks,
Ming

Patch
diff mbox series

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 6a9dd68c0f4f..576bb6fce293 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -28,6 +28,7 @@ 
 #include <linux/t10-pi.h>
 #include <linux/pm_qos.h>
 #include <asm/unaligned.h>
+#include <linux/sched.h>
 
 #define CREATE_TRACE_POINTS
 #include "trace.h"
@@ -97,6 +98,15 @@  static dev_t nvme_chr_devt;
 static struct class *nvme_class;
 static struct class *nvme_subsys_class;
 
+/*
+ * The following are for detecting if this CPU is flooded with interrupts.
+ * The timestamp for the last context switch is recorded. If that is at least
+ * MAX_SCHED_TIMEOUT ago, try to recover from interrupt flood
+ */
+static DEFINE_PER_CPU(u64, last_switch);
+static DEFINE_PER_CPU(u64, last_clock);
+#define MAX_SCHED_TIMEOUT 2000000000	// 2 seconds in ns
+
 static int nvme_revalidate_disk(struct gendisk *disk);
 static void nvme_put_subsystem(struct nvme_subsystem *subsys);
 static void nvme_remove_invalid_namespaces(struct nvme_ctrl *ctrl,
@@ -260,9 +270,54 @@  static void nvme_retry_req(struct request *req)
 	blk_mq_delay_kick_requeue_list(req->q, delay);
 }
 
+static void nvme_complete_rq_work(struct work_struct *work)
+{
+	struct nvme_request *nvme_rq =
+		container_of(work, struct nvme_request, work);
+	struct request *req = blk_mq_rq_from_pdu(nvme_rq);
+
+	nvme_complete_rq(req);
+}
+
+
 void nvme_complete_rq(struct request *req)
 {
-	blk_status_t status = nvme_error_status(req);
+	blk_status_t status;
+	int cpu;
+	u64 switches;
+	struct nvme_request *nvme_rq;
+
+	if (!in_interrupt())
+		goto skip_check;
+
+	nvme_rq = nvme_req(req);
+	cpu = smp_processor_id();
+	if (idle_cpu(cpu))
+		goto skip_check;
+
+	/* Check if this CPU is flooded with interrupts */
+	switches = get_cpu_rq_switches(cpu);
+	if (this_cpu_read(last_switch) == switches) {
+		/*
+		 * If this CPU hasn't made a context switch in
+		 * MAX_SCHED_TIMEOUT ns (and it's not idle), schedule a work to
+		 * complete this I/O. This forces this CPU run non-interrupt
+		 * code and throttle the other CPU issuing the I/O
+		 */
+		if (sched_clock() - this_cpu_read(last_clock)
+				> MAX_SCHED_TIMEOUT) {
+			INIT_WORK(&nvme_rq->work, nvme_complete_rq_work);
+			schedule_work_on(cpu, &nvme_rq->work);
+			return;
+		}
+
+	} else {
+		this_cpu_write(last_switch, switches);
+		this_cpu_write(last_clock, sched_clock());
+	}
+
+skip_check:
+	status = nvme_error_status(req);
 
 	trace_nvme_complete_rq(req);
 
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 0a4a7f5e0de7..a8876e69e476 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -113,6 +113,7 @@  struct nvme_request {
 	u8			flags;
 	u16			status;
 	struct nvme_ctrl	*ctrl;
+	struct work_struct	work;
 };
 
 /*