* [PATCH] nvme: Acknowledge completion queue on each iteration
From: Sinan Kaya @ 2017-07-17 22:36 UTC (permalink / raw)
  To: linux-nvme, timur
  Cc: linux-arm-msm, linux-arm-kernel, Sinan Kaya, Keith Busch,
	Jens Axboe, Christoph Hellwig, Sagi Grimberg, linux-kernel

The code currently rings the completion queue doorbell only after processing
all completed events, even though it sends callbacks to the block layer on
each iteration.

This causes a performance drop when many jobs are queued towards the HW.
Ring the completion queue doorbell on each loop iteration instead, allowing
the HW to queue new jobs sooner.

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
---
 drivers/nvme/host/pci.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index d10d2f2..33d9b5b 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -810,13 +810,12 @@ static void nvme_process_cq(struct nvme_queue *nvmeq)
 
 	while (nvme_read_cqe(nvmeq, &cqe)) {
 		nvme_handle_cqe(nvmeq, &cqe);
+		nvme_ring_cq_doorbell(nvmeq);
 		consumed++;
 	}
 
-	if (consumed) {
-		nvme_ring_cq_doorbell(nvmeq);
+	if (consumed)
 		nvmeq->cqe_seen = 1;
-	}
 }
 
 static irqreturn_t nvme_irq(int irq, void *data)
-- 
1.9.1

* Re: [PATCH] nvme: Acknowledge completion queue on each iteration
From: Keith Busch @ 2017-07-17 22:45 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: linux-nvme, timur, linux-arm-msm, linux-arm-kernel, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, linux-kernel

On Mon, Jul 17, 2017 at 06:36:23PM -0400, Sinan Kaya wrote:
> The code currently rings the completion queue doorbell only after processing
> all completed events, even though it sends callbacks to the block layer on
> each iteration.
> 
> This causes a performance drop when many jobs are queued towards the HW.
> Ring the completion queue doorbell on each loop iteration instead, allowing
> the HW to queue new jobs sooner.

That doesn't make sense. Aggregating doorbell writes should be much more
efficient for high depth workloads.

* Re: [PATCH] nvme: Acknowledge completion queue on each iteration
From: Sinan Kaya @ 2017-07-17 22:46 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-nvme, timur, linux-arm-msm, linux-arm-kernel, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, linux-kernel

Hi Keith,

On 7/17/2017 6:45 PM, Keith Busch wrote:
> On Mon, Jul 17, 2017 at 06:36:23PM -0400, Sinan Kaya wrote:
>> The code currently rings the completion queue doorbell only after processing
>> all completed events, even though it sends callbacks to the block layer on
>> each iteration.
>>
>> This causes a performance drop when many jobs are queued towards the HW.
>> Ring the completion queue doorbell on each loop iteration instead, allowing
>> the HW to queue new jobs sooner.
> 
> That doesn't make sense. Aggregating doorbell writes should be much more
> efficient for high depth workloads.
> 

The problem is that the code throttles the HW, since the HW cannot queue more
completions until the SW gets a chance to clear the ones already posted.

As an example:

for each completion in N
{
	blk_layer()
}
ring doorbell

The HW cannot queue new jobs until N blk_layer operations are processed and
queue element ownership is passed back to the HW after the loop. The HW just
sits idle there if no queue entries are available.
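
In terms of the driver code, the two orderings look roughly like the sketch
below. This is simplified from the diff above (locking, cqe_seen and error
handling are left out), so treat it as an illustration rather than the exact
pci.c code:

/* Current behaviour: one doorbell write after the whole batch. */
static void process_cq_batched(struct nvme_queue *nvmeq)
{
	struct nvme_completion cqe;
	int consumed = 0;

	while (nvme_read_cqe(nvmeq, &cqe)) {
		nvme_handle_cqe(nvmeq, &cqe);	/* callback into the block layer */
		consumed++;
	}

	if (consumed)
		nvme_ring_cq_doorbell(nvmeq);	/* HW regains the CQ slots only here */
}

/* Proposed behaviour: acknowledge each entry as soon as it is handled. */
static void process_cq_early_ack(struct nvme_queue *nvmeq)
{
	struct nvme_completion cqe;

	while (nvme_read_cqe(nvmeq, &cqe)) {
		nvme_handle_cqe(nvmeq, &cqe);
		nvme_ring_cq_doorbell(nvmeq);	/* one MMIO write per completion */
	}
}

The trade-off is exactly the one being debated: the second form hands CQ
entries back to the HW sooner, at the cost of one extra MMIO write per
completion.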

Sinan

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

* Re: [PATCH] nvme: Acknowledge completion queue on each iteration
From: Keith Busch @ 2017-07-17 22:56 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: linux-nvme, timur, linux-arm-msm, linux-arm-kernel, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, linux-kernel

On Mon, Jul 17, 2017 at 06:46:11PM -0400, Sinan Kaya wrote:
> Hi Keith,
> 
> On 7/17/2017 6:45 PM, Keith Busch wrote:
> > On Mon, Jul 17, 2017 at 06:36:23PM -0400, Sinan Kaya wrote:
> >> The code currently rings the completion queue doorbell only after processing
> >> all completed events, even though it sends callbacks to the block layer on
> >> each iteration.
> >>
> >> This causes a performance drop when many jobs are queued towards the HW.
> >> Ring the completion queue doorbell on each loop iteration instead, allowing
> >> the HW to queue new jobs sooner.
> > 
> > That doesn't make sense. Aggregating doorbell writes should be much more
> > efficient for high depth workloads.
> > 
> 
> The problem is that the code throttles the HW, since the HW cannot queue more
> completions until the SW gets a chance to clear the ones already posted.
> 
> As an example:
> 
> for each completion in N
> {
> 	blk_layer()
> }
> ring doorbell
> 
> The HW cannot queue new jobs until N blk_layer operations are processed and
> queue element ownership is passed back to the HW after the loop. The HW just
> sits idle there if no queue entries are available.

If no completion queue entries are available, then there can't possibly
be any submission queue entries for the HW to work on either.
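
What makes that hold is the queue-pair sizing: the I/O submission queue and
its completion queue are created with the same depth, and a command keeps its
SQ slot until the host has consumed its completion. A small illustrative
sketch of the invariant (the struct and names below are made up for the
example, not driver code):

/*
 * Illustrative only: with matching SQ/CQ depths, and at most depth - 1
 * commands on the SQ ring, the controller can never run out of CQ slots
 * for the commands it still holds.
 */
struct qp_sketch {
	unsigned int depth;		/* same for SQ and CQ */
	unsigned int inflight;		/* submitted, completion not yet consumed */
	unsigned int posted_cqes;	/* completions posted, not yet consumed */
};

static bool controller_starved_for_cq_space(const struct qp_sketch *q)
{
	unsigned int free_cq = q->depth - q->posted_cqes;
	unsigned int still_running = q->inflight - q->posted_cqes;

	/* never true: free_cq - still_running == depth - inflight >= 1 */
	return free_cq < still_running;
}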

* Re: [PATCH] nvme: Acknowledge completion queue on each iteration
From: okaya @ 2017-07-17 23:07 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-nvme, timur, linux-arm-msm, linux-arm-kernel, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, linux-kernel

On 2017-07-17 18:56, Keith Busch wrote:
> On Mon, Jul 17, 2017 at 06:46:11PM -0400, Sinan Kaya wrote:
>> Hi Keith,
>> 
>> On 7/17/2017 6:45 PM, Keith Busch wrote:
>> > On Mon, Jul 17, 2017 at 06:36:23PM -0400, Sinan Kaya wrote:
>> >> The code currently rings the completion queue doorbell only after processing
>> >> all completed events, even though it sends callbacks to the block layer on
>> >> each iteration.
>> >>
>> >> This causes a performance drop when many jobs are queued towards the HW.
>> >> Ring the completion queue doorbell on each loop iteration instead, allowing
>> >> the HW to queue new jobs sooner.
>> >
>> > That doesn't make sense. Aggregating doorbell writes should be much more
>> > efficient for high depth workloads.
>> >
>> 
>> The problem is that the code throttles the HW, since the HW cannot queue more
>> completions until the SW gets a chance to clear the ones already posted.
>> 
>> As an example:
>> 
>> for each completion in N
>> {
>> 	blk_layer()
>> }
>> ring doorbell
>> 
>> The HW cannot queue new jobs until N blk_layer operations are processed and
>> queue element ownership is passed back to the HW after the loop. The HW just
>> sits idle there if no queue entries are available.
> 
> If no completion queue entries are available, then there can't possibly
> be any submission queue entries for the HW to work on either.

Maybe I need to understand the design better. I was curious why the
completion and submission queues were protected by a single lock, causing
lock contention.

I was treating each queue independently, and I have seen slightly better
performance from an early doorbell. That was my explanation.
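
For reference, the pattern I am looking at is roughly the following
(paraphrased, not the exact driver code; it assumes the per-queue q_lock and
the nvme_process_cq() helper from the patch context):

/* Submission side: runs in process context, so interrupts are disabled. */
static void submit_sketch(struct nvme_queue *nvmeq)
{
	spin_lock_irq(&nvmeq->q_lock);
	/* copy the command into the SQ and ring the SQ tail doorbell */
	spin_unlock_irq(&nvmeq->q_lock);
}

/* Completion side: the IRQ handler takes the same lock. */
static irqreturn_t irq_sketch(int irq, void *data)
{
	struct nvme_queue *nvmeq = data;

	spin_lock(&nvmeq->q_lock);
	nvme_process_cq(nvmeq);		/* reap CQEs, ring the CQ head doorbell */
	spin_unlock(&nvmeq->q_lock);

	return IRQ_HANDLED;
}

Both directions serialize on the one q_lock, which is where the contention I
am measuring comes from.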

* Re: [PATCH] nvme: Acknowledge completion queue on each iteration
From: Keith Busch @ 2017-07-18 14:36 UTC (permalink / raw)
  To: okaya
  Cc: linux-nvme, timur, linux-arm-msm, linux-arm-kernel, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, linux-kernel

On Mon, Jul 17, 2017 at 07:07:00PM -0400, okaya@codeaurora.org wrote:
> Maybe I need to understand the design better. I was curious why the
> completion and submission queues were protected by a single lock, causing
> lock contention.

Ideally the queues are tied to CPUs, so you couldn't have one thread
submitting to a particular queue-pair while another thread is reaping
completions from it. Such a setup wouldn't get lock contention.

Some machines have so many CPUs, though, that sharing hardware queues
is required. We've experimented with separate submission and completion
locks for such cases, but I've never seen improved performance as a
result.
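
For what it's worth, the split we tried looks roughly like this (sketch only;
the field names are invented for the example and this is not merged code):

/*
 * Split-lock sketch: submission and completion each get their own lock, so
 * a submitter and the IRQ handler no longer serialize against each other.
 */
struct nvme_queue_split_sketch {
	spinlock_t sq_lock;	/* protects sq_tail and the SQ doorbell */
	spinlock_t cq_lock;	/* protects cq_head, phase and the CQ doorbell */
	/* ...rest of struct nvme_queue... */
};

static void submit_split_sketch(struct nvme_queue_split_sketch *nvmeq)
{
	spin_lock_irq(&nvmeq->sq_lock);
	/* write the command, bump sq_tail, ring the SQ doorbell */
	spin_unlock_irq(&nvmeq->sq_lock);
}

static void complete_split_sketch(struct nvme_queue_split_sketch *nvmeq)
{
	spin_lock(&nvmeq->cq_lock);
	/* reap CQEs, advance cq_head, ring the CQ doorbell */
	spin_unlock(&nvmeq->cq_lock);
}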

* Re: [PATCH] nvme: Acknowledge completion queue on each iteration
From: Sinan Kaya @ 2017-07-18 18:52 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-nvme, timur, linux-arm-msm, linux-arm-kernel, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, linux-kernel

On 7/18/2017 10:36 AM, Keith Busch wrote:
> On Mon, Jul 17, 2017 at 07:07:00PM -0400, okaya@codeaurora.org wrote:
>> Maybe I need to understand the design better. I was curious why the
>> completion and submission queues were protected by a single lock, causing
>> lock contention.
> Ideally the queues are tied to CPUs, so you couldn't have one thread
> submitting to a particular queue-pair while another thread is reaping
> completions from it. Such a setup wouldn't get lock contention.

I do see that the NVMe driver is creating a completion interrupt on
each CPU core for the completions. No problems with that. 

However, I don't think you can guarantee that there will always be a single
CPU core targeting one submission queue especially with asynchronous IO.

Lock contention counters from CONFIG_LOCK_STAT are pointing to nvmeq->lock
in my FIO tests.

Did I miss something?

> 
> Some machines have so many CPUs, though, that sharing hardware queues
> is required. We've experimented with separate submission and completion
> locks for such cases, but I've never seen improved performance as a
> result.
> 

I have also experimented with multiple locks with no significant gains. 
However, I was curious if somebody else had a better implementation than mine.

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

* Re: [PATCH] nvme: Acknowledge completion queue on each iteration
From: Keith Busch @ 2017-07-18 21:26 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: linux-nvme, timur, linux-arm-msm, linux-arm-kernel, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, linux-kernel

On Tue, Jul 18, 2017 at 02:52:26PM -0400, Sinan Kaya wrote:
> On 7/18/2017 10:36 AM, Keith Busch wrote:
> 
> I do see that the NVMe driver is creating a completion interrupt on
> each CPU core for the completions. No problems with that. 
> 
> However, I don't think you can guarantee that there will always be a single
> CPU core targeting one submission queue especially with asynchronous IO.
>
> Lock contention counters from CONFIG_LOCK_STAT are pointing to nvmeq->lock
> in my FIO tests.
> 
> Did I miss something?

I think that must mean your machine has many more CPUs than your nvme
controller has IO queues.
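
To put a shape on it (illustrative only; the real CPU-to-queue assignment is
built by blk-mq, this is just the effect of it):

/*
 * When nr_cpus > nr_io_queues, several CPUs share one hardware queue and
 * their submissions contend on that queue's lock.
 */
static unsigned int cpu_to_hw_queue_sketch(unsigned int cpu,
					   unsigned int nr_io_queues)
{
	return cpu % nr_io_queues;	/* e.g. 64 CPUs, 16 queues -> 4 CPUs per queue */
}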

* Re: [PATCH] nvme: Acknowledge completion queue on each iteration
From: Sagi Grimberg @ 2017-07-19  9:20 UTC (permalink / raw)
  To: Sinan Kaya, linux-nvme, timur
  Cc: linux-arm-msm, linux-arm-kernel, Keith Busch, Jens Axboe,
	Christoph Hellwig, linux-kernel

> The code currently rings the completion queue doorbell only after processing
> all completed events, even though it sends callbacks to the block layer on
> each iteration.
> 
> This causes a performance drop when many jobs are queued towards the HW.
> Ring the completion queue doorbell on each loop iteration instead, allowing
> the HW to queue new jobs sooner.
> 
> Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
> ---
>   drivers/nvme/host/pci.c | 5 ++---
>   1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index d10d2f2..33d9b5b 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -810,13 +810,12 @@ static void nvme_process_cq(struct nvme_queue *nvmeq)
>   
>   	while (nvme_read_cqe(nvmeq, &cqe)) {
>   		nvme_handle_cqe(nvmeq, &cqe);
> +		nvme_ring_cq_doorbell(nvmeq);
>   		consumed++;
>   	}
>   
> -	if (consumed) {
> -		nvme_ring_cq_doorbell(nvmeq);
> +	if (consumed)
>   		nvmeq->cqe_seen = 1;
> -	}
>   }

Agree with Keith that this is definitely not the way to go; it adds MMIO
operations in the hot path with very little gain (if any).

* Re: [PATCH] nvme: Acknowledge completion queue on each iteration
From: Sinan Kaya @ 2017-07-19 10:37 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme, timur
  Cc: linux-arm-msm, linux-arm-kernel, Keith Busch, Jens Axboe,
	Christoph Hellwig, linux-kernel

On 7/19/2017 5:20 AM, Sagi Grimberg wrote:
>> The code currently rings the completion queue doorbell only after processing
>> all completed events, even though it sends callbacks to the block layer on
>> each iteration.
>>
>> This causes a performance drop when many jobs are queued towards the HW.
>> Ring the completion queue doorbell on each loop iteration instead, allowing
>> the HW to queue new jobs sooner.
>>
>> Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
>> ---
>>   drivers/nvme/host/pci.c | 5 ++---
>>   1 file changed, 2 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
>> index d10d2f2..33d9b5b 100644
>> --- a/drivers/nvme/host/pci.c
>> +++ b/drivers/nvme/host/pci.c
>> @@ -810,13 +810,12 @@ static void nvme_process_cq(struct nvme_queue *nvmeq)
>>         while (nvme_read_cqe(nvmeq, &cqe)) {
>>           nvme_handle_cqe(nvmeq, &cqe);
>> +        nvme_ring_cq_doorbell(nvmeq);
>>           consumed++;
>>       }
>>   -    if (consumed) {
>> -        nvme_ring_cq_doorbell(nvmeq);
>> +    if (consumed)
>>           nvmeq->cqe_seen = 1;
>> -    }
>>   }
> 
> Agree with Keith that this is definitely not the way to go, it
> adds mmio operations in the hot path with very little gain (if
> at all).
> 

Understood; different architectures can have different latencies when accessing
the HW registers. As you indicated, the MMIO write might be expensive on some
platforms, and this change would make that worse.

I'm doing a self-NACK as well.

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
