* [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
@ 2017-01-11 13:43 ` Johannes Thumshirn
From: Johannes Thumshirn @ 2017-01-11 13:43 UTC (permalink / raw)
  To: lsf-pc
  Cc: linux-block, Linux-scsi, Sagi Grimberg, linux-nvme,
	Christoph Hellwig, Keith Busch

Hi all,

I'd like to attend LSF/MM and would like to discuss polling for block drivers.

Currently there is blk-iopoll, but it is not as widely used as NAPI is in the
networking field, and according to Sagi's findings in [1] performance with
polling is not on par with IRQ usage.

At LSF/MM I'd like to discuss whether it is desirable to have NAPI-like polling in
more block drivers and how to overcome the performance issues currently seen.

[1] http://lists.infradead.org/pipermail/linux-nvme/2016-October/006975.html

Byte,
	Johannes

-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
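
For reference, a minimal sketch of how a block driver's completion path can
hook into irq_poll (the renamed blk-iopoll). The irq_poll_*() calls are the
real API; struct my_queue and the my_dev_*() helpers are hypothetical
placeholders for driver-specific code:

#include <linux/interrupt.h>
#include <linux/irq_poll.h>
#include <linux/kernel.h>

#define MY_POLL_BUDGET	64

struct my_queue {
	struct irq_poll iop;
	/* ... device completion queue state ... */
};

/* Hypothetical driver-specific helpers, not a real API: */
void my_dev_disable_irq(struct my_queue *q);
void my_dev_enable_irq(struct my_queue *q);
int my_dev_reap_completions(struct my_queue *q, int budget);

/* Hard-IRQ handler: mask the queue's interrupt and defer to softirq. */
static irqreturn_t my_irq_handler(int irq, void *data)
{
	struct my_queue *q = data;

	my_dev_disable_irq(q);
	irq_poll_sched(&q->iop);	/* runs my_poll() from softirq context */
	return IRQ_HANDLED;
}

/* Softirq poll callback: reap up to @budget completions per invocation. */
static int my_poll(struct irq_poll *iop, int budget)
{
	struct my_queue *q = container_of(iop, struct my_queue, iop);
	int done = my_dev_reap_completions(q, budget);

	if (done < budget) {
		/* Queue drained: stop polling and re-arm the interrupt. */
		irq_poll_complete(iop);
		my_dev_enable_irq(q);
	}
	return done;
}

static void my_queue_setup_poll(struct my_queue *q)
{
	irq_poll_init(&q->iop, MY_POLL_BUDGET, my_poll);
}

This mirrors the NAPI pattern: the interrupt stays masked while the softirq
makes progress, and is only re-armed once the completion queue has been
drained within budget.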

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
  2017-01-11 13:43 ` Johannes Thumshirn
@ 2017-01-11 13:46   ` Hannes Reinecke
From: Hannes Reinecke @ 2017-01-11 13:46 UTC (permalink / raw)
  To: Johannes Thumshirn, lsf-pc
  Cc: linux-block, Linux-scsi, Sagi Grimberg, linux-nvme,
	Christoph Hellwig, Keith Busch

On 01/11/2017 02:43 PM, Johannes Thumshirn wrote:
> Hi all,
> 
> I'd like to attend LSF/MM and would like to discuss polling for block drivers.
> 
> Currently there is blk-iopoll, but it is not as widely used as NAPI is in the
> networking field, and according to Sagi's findings in [1] performance with
> polling is not on par with IRQ usage.
> 
> At LSF/MM I'd like to discuss whether it is desirable to have NAPI-like polling in
> more block drivers and how to overcome the performance issues currently seen.
> 
> [1] http://lists.infradead.org/pipermail/linux-nvme/2016-October/006975.html
> 
Yup.

I'm all for it.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
  2017-01-11 13:43 ` Johannes Thumshirn
@ 2017-01-11 15:07   ` Jens Axboe
From: Jens Axboe @ 2017-01-11 15:07 UTC (permalink / raw)
  To: Johannes Thumshirn, lsf-pc
  Cc: linux-block, Linux-scsi, Sagi Grimberg, linux-nvme,
	Christoph Hellwig, Keith Busch

On 01/11/2017 06:43 AM, Johannes Thumshirn wrote:
> Hi all,
> 
> I'd like to attend LSF/MM and would like to discuss polling for block drivers.
> 
> Currently there is blk-iopoll, but it is not as widely used as NAPI is in the
> networking field, and according to Sagi's findings in [1] performance with
> polling is not on par with IRQ usage.
> 
> At LSF/MM I'd like to discuss whether it is desirable to have NAPI-like polling in
> more block drivers and how to overcome the performance issues currently seen.

It would be an interesting topic to discuss, as it is a shame that blk-iopoll
isn't used more widely.

-- 
Jens Axboe

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
  2017-01-11 15:07   ` Jens Axboe
@ 2017-01-11 15:13     ` Jens Axboe
From: Jens Axboe @ 2017-01-11 15:13 UTC (permalink / raw)
  To: Johannes Thumshirn, lsf-pc
  Cc: linux-block, Linux-scsi, Sagi Grimberg, linux-nvme,
	Christoph Hellwig, Keith Busch

On 01/11/2017 08:07 AM, Jens Axboe wrote:
> On 01/11/2017 06:43 AM, Johannes Thumshirn wrote:
>> Hi all,
>>
>> I'd like to attend LSF/MM and would like to discuss polling for block drivers.
>>
>> Currently there is blk-iopoll, but it is not as widely used as NAPI is in the
>> networking field, and according to Sagi's findings in [1] performance with
>> polling is not on par with IRQ usage.
>>
>> At LSF/MM I'd like to discuss whether it is desirable to have NAPI-like polling in
>> more block drivers and how to overcome the performance issues currently seen.
> 
> It would be an interesting topic to discuss, as it is a shame that blk-iopoll
> isn't used more widely.

Forgot to mention - it should only be a topic if experimentation has
been done and results gathered to pinpoint what the issues are, so we
have something concrete to discuss. I'm not at all interested in a
hand-wavy discussion on the topic.

-- 
Jens Axboe

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
  2017-01-11 15:07   ` Jens Axboe
@ 2017-01-11 15:16     ` Hannes Reinecke
From: Hannes Reinecke @ 2017-01-11 15:16 UTC (permalink / raw)
  To: Jens Axboe, Johannes Thumshirn, lsf-pc
  Cc: linux-block, Linux-scsi, Sagi Grimberg, linux-nvme,
	Christoph Hellwig, Keith Busch

On 01/11/2017 04:07 PM, Jens Axboe wrote:
> On 01/11/2017 06:43 AM, Johannes Thumshirn wrote:
>> Hi all,
>>
>> I'd like to attend LSF/MM and would like to discuss polling for block drivers.
>>
>> Currently there is blk-iopoll, but it is not as widely used as NAPI is in the
>> networking field, and according to Sagi's findings in [1] performance with
>> polling is not on par with IRQ usage.
>>
>> At LSF/MM I'd like to discuss whether it is desirable to have NAPI-like polling in
>> more block drivers and how to overcome the performance issues currently seen.
> 
> It would be an interesting topic to discuss, as it is a shame that blk-iopoll
> isn't used more widely.
> 
Indeed; some drivers like lpfc already _have_ a polling mode, but not
hooked up to blk-iopoll. Would be really cool to get that going.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
  2017-01-11 13:43 ` Johannes Thumshirn
@ 2017-01-11 16:08   ` Bart Van Assche
From: Bart Van Assche @ 2017-01-11 16:08 UTC (permalink / raw)
  To: jthumshirn, lsf-pc
  Cc: Linux-scsi, hch, keith.busch, linux-nvme, linux-block, sagi

On Wed, 2017-01-11 at 14:43 +0100, Johannes Thumshirn wrote:
> I'd like to attend LSF/MM and would like to discuss polling for block
> drivers.
> 
> Currently there is blk-iopoll, but it is not as widely used as NAPI is in
> the networking field, and according to Sagi's findings in [1] performance
> with polling is not on par with IRQ usage.
> 
> At LSF/MM I'd like to discuss whether it is desirable to have NAPI-like
> polling in more block drivers and how to overcome the performance issues
> currently seen.
> 
> [1] http://lists.infradead.org/pipermail/linux-nvme/2016-October/006975.html

A typical Ethernet network adapter delays the generation of an interrupt
after it has received a packet. A typical block device or HBA does not delay
the generation of an interrupt that reports an I/O completion. I think that
is why polling is more effective for network adapters than for block
devices. I'm not sure whether it is possible to achieve benefits similar to
NAPI for block devices without implementing interrupt coalescing in the
block device firmware. Note: for block device implementations that use the
RDMA API, the RDMA API supports interrupt coalescing (see also
ib_modify_cq()).

An example of the interrupt coalescing parameters for a network adapter:

# ethtool -c em1 | grep -E 'rx-usecs:|tx-usecs:'
rx-usecs: 3
tx-usecs: 0

Bart.
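
For comparison, a minimal sketch of completion-queue moderation through the
RDMA API mentioned above (ib_modify_cq()); the count/period values are
arbitrary examples and their exact semantics are provider specific:

#include <rdma/ib_verbs.h>

/* Ask the provider to raise a CQ event only after 16 completions have
 * accumulated or roughly 3 usecs have passed, analogous to the NIC
 * rx-usecs coalescing shown above. */
static int my_moderate_cq(struct ib_cq *cq)
{
	return ib_modify_cq(cq, 16, 3);
}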

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
  2017-01-11 16:08   ` Bart Van Assche
@ 2017-01-11 16:12     ` hch
From: hch @ 2017-01-11 16:12 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: jthumshirn, lsf-pc, Linux-scsi, hch, keith.busch, linux-nvme,
	linux-block, sagi

On Wed, Jan 11, 2017 at 04:08:31PM +0000, Bart Van Assche wrote:
> A typical Ethernet network adapter delays the generation of an interrupt
> after it has received a packet. A typical block device or HBA does not delay
> the generation of an interrupt that reports an I/O completion.

NVMe allows for configurable interrupt coalescing, as do a few modern
SCSI HBAs.

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
  2017-01-11 16:08   ` Bart Van Assche
@ 2017-01-11 16:14     ` Johannes Thumshirn
From: Johannes Thumshirn @ 2017-01-11 16:14 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: lsf-pc, Linux-scsi, hch, keith.busch, linux-nvme, linux-block, sagi

On Wed, Jan 11, 2017 at 04:08:31PM +0000, Bart Van Assche wrote:

[...]

> A typical Ethernet network adapter delays the generation of an interrupt
> after it has received a packet. A typical block device or HBA does not delay
> the generation of an interrupt that reports an I/O completion. I think that
> is why polling is more effective for network adapters than for block
> devices. I'm not sure whether it is possible to achieve benefits similar to
> NAPI for block devices without implementing interrupt coalescing in the
> block device firmware. Note: for block device implementations that use the
> RDMA API, the RDMA API supports interrupt coalescing (see also
> ib_modify_cq()).

Well you can always turn off IRQ generation in the HBA just before scheduling
the poll handler and re-enable it after you've exhausted your budget or used
too much time, can't you? 

I'll do some prototyping and tests tomorrow so we have some more ground for
discussion.

Byte,
	Johannes
-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
  2017-01-11 16:12     ` hch
@ 2017-01-11 16:15       ` Jens Axboe
From: Jens Axboe @ 2017-01-11 16:15 UTC (permalink / raw)
  To: hch, Bart Van Assche
  Cc: jthumshirn, lsf-pc, Linux-scsi, keith.busch, linux-nvme,
	linux-block, sagi

On 01/11/2017 09:12 AM, hch@infradead.org wrote:
> On Wed, Jan 11, 2017 at 04:08:31PM +0000, Bart Van Assche wrote:
>> A typical Ethernet network adapter delays the generation of an interrupt
>> after it has received a packet. A typical block device or HBA does not delay
>> the generation of an interrupt that reports an I/O completion.
> 
> NVMe allows for configurable interrupt coalescing, as do a few modern
> SCSI HBAs.

Unfortunately it's way too coarse on NVMe, with the timer being in 100
usec increments... I've had mixed success with the depth trigger.

-- 
Jens Axboe
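
For reference, a sketch of how the coalescing parameters referred to above
are encoded in NVMe's Interrupt Coalescing feature (Feature ID 08h,
NVME_FEAT_IRQ_COALESCE); the helper is hypothetical, the field layout is
taken from the NVMe specification:

#include <linux/types.h>

/* Set Features, Feature ID 08h (Interrupt Coalescing), Dword 11:
 *   bits  7:0  Aggregation Threshold (number of completions, 0's based)
 *   bits 15:8  Aggregation Time, in 100 usec increments -- hence the
 *              coarse granularity mentioned above.
 */
static inline __u32 nvme_irq_coalesce_dw11(__u8 thr, __u8 time_100us)
{
	return thr | ((__u32)time_100us << 8);
}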

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
  2017-01-11 16:12     ` hch
@ 2017-01-11 16:22       ` Hannes Reinecke
From: Hannes Reinecke @ 2017-01-11 16:22 UTC (permalink / raw)
  To: hch, Bart Van Assche
  Cc: jthumshirn, lsf-pc, Linux-scsi, keith.busch, linux-nvme,
	linux-block, sagi

On 01/11/2017 05:12 PM, hch@infradead.org wrote:
> On Wed, Jan 11, 2017 at 04:08:31PM +0000, Bart Van Assche wrote:
>> A typical Ethernet network adapter delays the generation of an interrupt
>> after it has received a packet. A typical block device or HBA does not delay
>> the generation of an interrupt that reports an I/O completion.
> 
> NVMe allows for configurable interrupt coalescing, as do a few modern
> SCSI HBAs.

Essentially every modern SCSI HBA does interrupt coalescing; otherwise
the queuing interface won't work efficiently.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
  2017-01-11 16:22       ` Hannes Reinecke
@ 2017-01-11 16:26         ` Bart Van Assche
From: Bart Van Assche @ 2017-01-11 16:26 UTC (permalink / raw)
  To: hch, hare
  Cc: Linux-scsi, keith.busch, jthumshirn, linux-nvme, lsf-pc,
	linux-block, sagi

On Wed, 2017-01-11 at 17:22 +0100, Hannes Reinecke wrote:
> On 01/11/2017 05:12 PM, hch@infradead.org wrote:
> > On Wed, Jan 11, 2017 at 04:08:31PM +0000, Bart Van Assche wrote:
> > > A typical Ethernet network adapter delays the generation of an
> > > interrupt after it has received a packet. A typical block device or
> > > HBA does not delay the generation of an interrupt that reports an
> > > I/O completion.
> > 
> > NVMe allows for configurable interrupt coalescing, as do a few modern
> > SCSI HBAs.
> 
> Essentially every modern SCSI HBA does interrupt coalescing; otherwise
> the queuing interface won't work efficiently.

Hello Hannes,

The first e-mail in this e-mail thread referred to measurements against a
block device for which interrupt coalescing was not enabled. I think that
the measurements have to be repeated against a block device for which
interrupt coalescing is enabled.

Bart.

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
  2017-01-11 16:26         ` Bart Van Assche
@ 2017-01-11 16:45           ` Hannes Reinecke
From: Hannes Reinecke @ 2017-01-11 16:45 UTC (permalink / raw)
  To: Bart Van Assche, hch
  Cc: Linux-scsi, keith.busch, jthumshirn, linux-nvme, lsf-pc,
	linux-block, sagi

On 01/11/2017 05:26 PM, Bart Van Assche wrote:
> On Wed, 2017-01-11 at 17:22 +0100, Hannes Reinecke wrote:
>> On 01/11/2017 05:12 PM, hch@infradead.org wrote:
>>> On Wed, Jan 11, 2017 at 04:08:31PM +0000, Bart Van Assche wrote:
>>>> A typical Ethernet network adapter delays the generation of an
>>>> interrupt after it has received a packet. A typical block device or
>>>> HBA does not delay the generation of an interrupt that reports an
>>>> I/O completion.
>>>
>>> NVMe allows for configurable interrupt coalescing, as do a few modern
>>> SCSI HBAs.
>>
>> Essentially every modern SCSI HBA does interrupt coalescing; otherwise
>> the queuing interface won't work efficiently.
> 
> Hello Hannes,
> 
> The first e-mail in this e-mail thread referred to measurements against a
> block device for which interrupt coalescing was not enabled. I think that
> the measurements have to be repeated against a block device for which
> interrupt coalescing is enabled.
> 
Guess what we'll be doing in the next few days ...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
  2017-01-11 15:07   ` Jens Axboe
@ 2017-01-12  4:36   ` Stephen Bates
From: Stephen Bates @ 2017-01-12  4:36 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Johannes Thumshirn, lsf-pc, Christoph Hellwig, Sagi Grimberg,
	linux-scsi, linux-nvme, linux-block, Keith Busch

>>
>> I'd like to attend LSF/MM and would like to discuss polling for block
>> drivers.
>>
>> Currently there is blk-iopoll, but it is not as widely used as NAPI is
>> in the networking field, and according to Sagi's findings in [1]
>> performance with polling is not on par with IRQ usage.
>>
>> At LSF/MM I'd like to discuss whether it is desirable to have NAPI-like
>> polling in more block drivers and how to overcome the currently seen
>> performance issues.
>
> It would be an interesting topic to discuss, as it is a shame that
> blk-iopoll isn't used more widely.
>
> --
> Jens Axboe
>

I'd also be interested in this topic. Given that iopoll only really makes
sense for low-latency, low queue depth environments (i.e. down below
10-20us) I'd like to discuss which drivers we think will need/want to be
upgraded (aside from NVMe ;-)).

I'd also be interested in discussing how best to enable and disable
polling. In the past some of us have pushed for a "big hammer" to turn
polling on for a given device or HW queue [1]. I'd like to discuss this
again as well as looking at other methods above and beyond the preadv2
system call and the HIPRI flag.

Stephen

[1] http://marc.info/?l=linux-block&m=146307410101827&w=2
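
For reference, a minimal userspace sketch of the preadv2()/RWF_HIPRI path
mentioned above. It assumes a glibc that exposes the preadv2() wrapper and
RWF_HIPRI (2.26+), a device opened with O_DIRECT, and polling enabled for
the queue (/sys/block/<dev>/queue/io_poll):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	struct iovec iov;
	void *buf;
	ssize_t ret;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <block device>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	if (posix_memalign(&buf, 4096, 4096))
		return 1;

	iov.iov_base = buf;
	iov.iov_len = 4096;

	/* RWF_HIPRI asks the block layer to poll for the completion
	 * instead of sleeping until the IRQ arrives. */
	ret = preadv2(fd, &iov, 1, 0, RWF_HIPRI);
	if (ret < 0)
		perror("preadv2");
	else
		printf("read %zd bytes (polled)\n", ret);

	free(buf);
	close(fd);
	return 0;
}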

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
  2017-01-12  4:36   ` Stephen Bates
@ 2017-01-12  4:44       ` Jens Axboe
From: Jens Axboe @ 2017-01-12  4:44 UTC (permalink / raw)
  To: Stephen Bates
  Cc: Johannes Thumshirn, lsf-pc, Christoph Hellwig, Sagi Grimberg,
	linux-scsi, linux-nvme, linux-block, Keith Busch

On 01/11/2017 09:36 PM, Stephen Bates wrote:
>>>
>>> I'd like to attend LSF/MM and would like to discuss polling for block
>>> drivers.
>>>
>>> Currently there is blk-iopoll, but it is not as widely used as NAPI is
>>> in the networking field, and according to Sagi's findings in [1]
>>> performance with polling is not on par with IRQ usage.
>>>
>>> At LSF/MM I'd like to discuss whether it is desirable to have NAPI-like
>>> polling in more block drivers and how to overcome the currently seen
>>> performance issues.
>>
>> It would be an interesting topic to discuss, as it is a shame that
>> blk-iopoll isn't used more widely.
>>
>> --
>> Jens Axboe
>>
> 
> I'd also be interested in this topic. Given that iopoll only really makes
> sense for low-latency, low queue depth environments (i.e. down below
> 10-20us) I'd like to discuss which drivers we think will need/want to be
> upgraded (aside from NVMe ;-)).
> 
> I'd also be interested in discussing how best to enable and disable
> polling. In the past some of us have pushed for a "big hammer" to turn
> polling on for a given device or HW queue [1]. I'd like to discuss this
> again as well as looking at other methods above and beyond the preadv2
> system call and the HIPRI flag.

This is a separate topic. The initial proposal is about polling for
interrupt mitigation; you are talking about polling in the context of
polling for completion of an IO.

We can definitely talk about this form of polling as well, but it should
be a separate topic and probably proposed independently.

-- 
Jens Axboe

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
  2017-01-12  4:44       ` Jens Axboe
@ 2017-01-12  4:56         ` Stephen Bates
From: Stephen Bates @ 2017-01-12  4:56 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Johannes Thumshirn, lsf-pc, Christoph Hellwig, Sagi Grimberg,
	linux-scsi, linux-nvme, linux-block, Keith Busch

>
> This is a separate topic. The initial proposal is for polling for
> interrupt mitigation, you are talking about polling in the context of
> polling for completion of an IO.
>
> We can definitely talk about this form of polling as well, but it should
> be a separate topic and probably proposed independently.
>
> --
> Jens Axboe
>
>

Jens

Oh thanks for the clarification. I will propose this as a separate topic.

Thanks

Stephen

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
  2017-01-11 15:13     ` Jens Axboe
@ 2017-01-12  8:23       ` Sagi Grimberg
From: Sagi Grimberg @ 2017-01-12  8:23 UTC (permalink / raw)
  To: Jens Axboe, Johannes Thumshirn, lsf-pc
  Cc: linux-block, Linux-scsi, linux-nvme, Christoph Hellwig, Keith Busch


>>> Hi all,
>>>
>>> I'd like to attend LSF/MM and would like to discuss polling for block drivers.
>>>
>>> Currently there is blk-iopoll, but it is not as widely used as NAPI is in the
>>> networking field, and according to Sagi's findings in [1] performance with
>>> polling is not on par with IRQ usage.
>>>
>>> At LSF/MM I'd like to discuss whether it is desirable to have NAPI-like polling in
>>> more block drivers and how to overcome the performance issues currently seen.
>>
>> It would be an interesting topic to discuss, as it is a shame that blk-iopoll
>> isn't used more widely.
>
> Forgot to mention - it should only be a topic, if experimentation has
> been done and results gathered to pin point what the issues are, so we
> have something concrete to discus. I'm not at all interested in a hand
> wavy discussion on the topic.
>

Hey all,

Indeed I attempted to convert nvme to use irq-poll (let's use its
new name) but experienced some unexplained performance degradations.

Keith reported a 700ns degradation for QD=1 with his Xpoint devices;
this sort of degradation is acceptable, I guess, because we do schedule
a soft-irq before consuming the completion, but I noticed a ~10% IOPS
degradation for QD=32, which is not acceptable.

I agree with Jens that we'll need some analysis if we want the
discussion to be effective, and I can spend some time on this if I
can find volunteers with high-end nvme devices (I only have access
to client nvme devices).

I can add debugfs statistics on the average number of completions I
consume per interrupt, and I can also trace the interrupt and soft-irq
start/end. Any other interesting stats I can add?

I also tried a hybrid mode where the first 4 completions were handled
in the interrupt and the rest in soft-irq but that didn't make much
of a difference.

Any other thoughts?
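
For reference, a minimal sketch of exporting such counters through debugfs;
the structure, the names and the places where the counters would be bumped
are hypothetical:

#include <linux/debugfs.h>
#include <linux/types.h>

struct my_poll_stats {
	u64 interrupts;		/* bumped in the hard-IRQ handler */
	u64 completions;	/* bumped in the irq_poll callback */
	struct dentry *dir;
};

static void my_poll_stats_register(struct my_poll_stats *s, const char *name)
{
	s->dir = debugfs_create_dir(name, NULL);
	debugfs_create_u64("interrupts", 0444, s->dir, &s->interrupts);
	debugfs_create_u64("completions", 0444, s->dir, &s->completions);
	/* average completions per interrupt = completions / interrupts */
}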

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
  2017-01-11 16:08   ` Bart Van Assche
@ 2017-01-12  8:41     ` Sagi Grimberg
From: Sagi Grimberg @ 2017-01-12  8:41 UTC (permalink / raw)
  To: Bart Van Assche, jthumshirn, lsf-pc
  Cc: hch, keith.busch, linux-block, linux-nvme, Linux-scsi


>> I'd like to attend LSF/MM and would like to discuss polling for block
>> drivers.
>>
>> Currently there is blk-iopoll, but it is not as widely used as NAPI is in
>> the networking field, and according to Sagi's findings in [1] performance
>> with polling is not on par with IRQ usage.
>>
>> At LSF/MM I'd like to discuss whether it is desirable to have NAPI-like
>> polling in more block drivers and how to overcome the performance issues
>> currently seen.
>>
>> [1] http://lists.infradead.org/pipermail/linux-nvme/2016-October/006975.html
>
> A typical Ethernet network adapter delays the generation of an interrupt
> after it has received a packet. A typical block device or HBA does not delay
> the generation of an interrupt that reports an I/O completion. I think that
> is why polling is more effective for network adapters than for block
> devices. I'm not sure whether it is possible to achieve benefits similar to
> NAPI for block devices without implementing interrupt coalescing in the
> block device firmware. Note: for block device implementations that use the
> RDMA API, the RDMA API supports interrupt coalescing (see also
> ib_modify_cq()).

Hey Bart,

I don't agree that interrupt coalescing is the reason why irq-poll is
not suitable for nvme or storage devices.

First, when the nvme device fires an interrupt, the driver consumes
the completion(s) from the interrupt (usually there will be some more
completions waiting in the cq by the time the host starts processing it).
With irq-poll, we disable further interrupts and schedule a soft-irq for
processing, which, if anything, improves the completions-per-interrupt
utilization (because it takes slightly longer before processing the cq).

Moreover, irq-poll budgets the completion queue processing, which is
important for a couple of reasons.

1. It prevents the hard-irq context abuse we have today. If other cpu
    cores keep pounding the same queue with more submissions, we might
    get into a hard lockup (which I've seen happen).

2. irq-poll maintains fairness between devices by correctly budgeting
    the processing of the different completion queues that share the same
    affinity. This can become crucial when working with multiple nvme
    devices, each of which has multiple io queues that share the same IRQ
    assignment.

3. It reduces (or at least should reduce) the overall number of
    interrupts in the system because we only enable interrupts again
    when the completion queue is completely processed.

So overall, I think it's very useful for nvme and other modern HBAs,
but unfortunately, other than solving (1), I wasn't able to see a
performance improvement but rather a slight regression, and I can't
explain where it's coming from...

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
  2017-01-11 16:26         ` Bart Van Assche
@ 2017-01-12  8:52           ` sagi grimberg
  -1 siblings, 0 replies; 120+ messages in thread
From: sagi grimberg @ 2017-01-12  8:52 UTC (permalink / raw)
  To: Bart Van Assche, hch, hare
  Cc: Linux-scsi, keith.busch, jthumshirn, linux-nvme, lsf-pc, linux-block

>>>> A typical Ethernet network adapter delays the generation of an
>>>> interrupt
>>>> after it has received a packet. A typical block device or HBA does not
>>>> delay
>>>> the generation of an interrupt that reports an I/O completion.
>>>
>>> NVMe allows for configurable interrupt coalescing, as do a few modern
>>> SCSI HBAs.
>>
>> Essentially every modern SCSI HBA does interrupt coalescing; otherwise
>> the queuing interface won't work efficiently.
>
> Hello Hannes,
>
> The first e-mail in this e-mail thread referred to measurements against a
> block device for which interrupt coalescing was not enabled. I think that
> the measurements have to be repeated against a block device for which
> interrupt coalescing is enabled.

Hey Bart,

I see how interrupt coalescing can help, but even without it, I think
irq-poll should do better.

Moreover, I don't think that strict moderation is something that can
work. The only way interrupt moderation can be effective is if it's
adaptive and adjusts itself to the workload. Note that this feature
is on by default in most modern Ethernet devices (adaptive-rx).
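
To illustrate what I mean by adaptive (a toy sketch only, not how any real NIC
or NVMe firmware, or the ethtool adaptive-rx logic, actually does it; the
thresholds and my_apply_coalescing() are made up):
--
struct my_moderation {
	unsigned int usecs;		/* current coalescing delay */
	unsigned int completions;	/* completions seen in the last period */
};

static void my_apply_coalescing(unsigned int usecs);	/* program the device */

static void my_adjust_moderation(struct my_moderation *mod)
{
	if (mod->completions > 10000) {
		/* busy period: batch harder, capped at 64us */
		if (mod->usecs < 64)
			mod->usecs = mod->usecs ? mod->usecs * 2 : 8;
	} else if (mod->completions < 1000) {
		/* mostly idle: favour latency, turn coalescing off */
		mod->usecs = 0;
	}
	my_apply_coalescing(mod->usecs);
	mod->completions = 0;		/* start a new sampling period */
}
--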

IMHO, irq-poll vs. plain interrupt-driven completions should be compared
without relying on the underlying device capabilities.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
  2017-01-12  8:23       ` Sagi Grimberg
  (?)
@ 2017-01-12 10:02         ` Johannes Thumshirn
  -1 siblings, 0 replies; 120+ messages in thread
From: Johannes Thumshirn @ 2017-01-12 10:02 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Jens Axboe, lsf-pc, linux-block, Christoph Hellwig, Keith Busch,
	linux-nvme, Linux-scsi

On Thu, Jan 12, 2017 at 10:23:47AM +0200, Sagi Grimberg wrote:
> 
> >>>Hi all,
> >>>
> >>>I'd like to attend LSF/MM and would like to discuss polling for block drivers.
> >>>
> >>>Currently there is blk-iopoll but it is neither as widely used as NAPI in the
> >>>networking field and accoring to Sagi's findings in [1] performance with
> >>>polling is not on par with IRQ usage.
> >>>
> >>>On LSF/MM I'd like to whether it is desirable to have NAPI like polling in
> >>>more block drivers and how to overcome the currently seen performance issues.
> >>
> >>It would be an interesting topic to discuss, as it is a shame that blk-iopoll
> >>isn't used more widely.
> >
> >Forgot to mention - it should only be a topic, if experimentation has
> >been done and results gathered to pin point what the issues are, so we
> >have something concrete to discus. I'm not at all interested in a hand
> >wavy discussion on the topic.
> >
> 
> Hey all,
> 
> Indeed I attempted to convert nvme to use irq-poll (let's use its
> new name) but experienced some unexplained performance degradations.
> 
> Keith reported a 700ns degradation for QD=1 with his Xpoint devices,
> this sort of degradation are acceptable I guess because we do schedule
> a soft-irq before consuming the completion, but I noticed ~10% IOPs
> degradation fr QD=32 which is not acceptable.
> 
> I agree with Jens that we'll need some analysis if we want the
> discussion to be affective, and I can spend some time this if I
> can find volunteers with high-end nvme devices (I only have access
> to client nvme devices.

I have a P3700 but somehow burned the FW. Let me see if I can bring it back to
life.

I have also converted AHCI to the irq_poll interface and will run some tests.
I also have some hpsa devices on which I could run tests once that driver is
adapted.

But can we agree on a common testing methodology so we don't compare apples with
oranges? Sagi, do you still have the fio job file from your last tests lying
around somewhere, and if so, could you share it?

Byte,
	Johannes

-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
  2017-01-12 10:02         ` Johannes Thumshirn
@ 2017-01-12 11:44           ` Sagi Grimberg
  -1 siblings, 0 replies; 120+ messages in thread
From: Sagi Grimberg @ 2017-01-12 11:44 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Jens Axboe, lsf-pc, linux-block, Christoph Hellwig, Keith Busch,
	linux-nvme, Linux-scsi


>> I agree with Jens that we'll need some analysis if we want the
>> discussion to be affective, and I can spend some time this if I
>> can find volunteers with high-end nvme devices (I only have access
>> to client nvme devices.
>
> I have a P3700 but somehow burned the FW. Let me see if I can bring it back to
> live.
>
> I also have converted AHCI to the irq_poll interface and will run some tests.
> I do also have some hpsa devices on which I could run tests once the driver is
> adopted.
>
> But can we come to a common testing methology not to compare apples with
> oranges? Sagi do you still have the fio job file from your last tests laying
> somewhere and if yes could you share it?

It's pretty basic:
--
[global]
group_reporting
cpus_allowed=0
cpus_allowed_policy=split
rw=randrw
bs=4k
numjobs=4
iodepth=32
runtime=60
time_based
loops=1
ioengine=libaio
direct=1
invalidate=1
randrepeat=1
norandommap
exitall

[job]
--

**Note: when I ran multiple threads on more CPUs the performance
degradation phenomenon disappeared, but I tested on a VM with
qemu emulation backed by null_blk, so I figured I had some other
bottleneck somewhere (that's why I asked for some more testing).

Note that I ran randrw because I was backed by null_blk; when testing
with a real nvme device you should run either randread or write, and
if you do a write you can't run it multi-threaded (well, you can, but
you'll get unpredictable performance...).

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
  2017-01-12 11:44           ` Sagi Grimberg
  (?)
@ 2017-01-12 12:53             ` Johannes Thumshirn
  -1 siblings, 0 replies; 120+ messages in thread
From: Johannes Thumshirn @ 2017-01-12 12:53 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Jens Axboe, lsf-pc, linux-block, Christoph Hellwig, Keith Busch,
	linux-nvme, Linux-scsi

On Thu, Jan 12, 2017 at 01:44:05PM +0200, Sagi Grimberg wrote:
[...]
> Its pretty basic:
> --
> [global]
> group_reporting
> cpus_allowed=0
> cpus_allowed_policy=split
> rw=randrw
> bs=4k
> numjobs=4
> iodepth=32
> runtime=60
> time_based
> loops=1
> ioengine=libaio
> direct=1
> invalidate=1
> randrepeat=1
> norandommap
> exitall
> 
> [job]
> --
> 
> **Note: when I ran multiple threads on more cpus the performance
> degradation phenomenon disappeared, but I tested on a VM with
> qemu emulation backed by null_blk so I figured I had some other
> bottleneck somewhere (that's why I asked for some more testing).

That could be because of the vmexits: every MMIO access in the guest
triggers a vmexit, and if you poll with a low budget you do more MMIOs and
hence more vmexits.

Did you do the testing only in qemu, or with real H/W as well?

> 
> Note that I ran randrw because I was backed with null_blk, testing
> with a real nvme device, you should either run randread or write, and
> if you do a write, you can't run it multi-threaded (well you can, but
> you'll get unpredictable performance...).

Noted, thanks.

Byte,
	Johannes
-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
  2017-01-12 12:53             ` Johannes Thumshirn
@ 2017-01-12 14:41               ` Sagi Grimberg
  -1 siblings, 0 replies; 120+ messages in thread
From: Sagi Grimberg @ 2017-01-12 14:41 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Jens Axboe, Keith Busch, Linux-scsi, Christoph Hellwig,
	linux-nvme, linux-block, lsf-pc


>> **Note: when I ran multiple threads on more cpus the performance
>> degradation phenomenon disappeared, but I tested on a VM with
>> qemu emulation backed by null_blk so I figured I had some other
>> bottleneck somewhere (that's why I asked for some more testing).
>
> That could be because of the vmexits as every MMIO access in the guest
> triggers a vmexit and if you poll with a low budget you do more MMIOs hence
> you have more vmexits.
>
> Did you do testing only in qemu or with real H/W as well?

I tried once. IIRC, I saw the same phenomena...

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
  2017-01-12 14:41               ` Sagi Grimberg
  (?)
@ 2017-01-12 18:59                 ` Johannes Thumshirn
  -1 siblings, 0 replies; 120+ messages in thread
From: Johannes Thumshirn @ 2017-01-12 18:59 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Jens Axboe, Keith Busch, Linux-scsi, Christoph Hellwig,
	linux-nvme, linux-block, lsf-pc

On Thu, Jan 12, 2017 at 04:41:00PM +0200, Sagi Grimberg wrote:
> 
> >>**Note: when I ran multiple threads on more cpus the performance
> >>degradation phenomenon disappeared, but I tested on a VM with
> >>qemu emulation backed by null_blk so I figured I had some other
> >>bottleneck somewhere (that's why I asked for some more testing).
> >
> >That could be because of the vmexits as every MMIO access in the guest
> >triggers a vmexit and if you poll with a low budget you do more MMIOs hence
> >you have more vmexits.
> >
> >Did you do testing only in qemu or with real H/W as well?
> 
> I tried once. IIRC, I saw the same phenomenons...

JFTR, I tried my AHCI irq_poll patch on the Qemu emulation and the read
throughput dropped from ~1GB/s to ~350MB/s. But I think this can be related to
Qemu's I/O weirdness as well. I'll try on real hardware tomorrow.

-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
  2017-01-12  8:41     ` Sagi Grimberg
  (?)
@ 2017-01-12 19:13       ` Bart Van Assche
  -1 siblings, 0 replies; 120+ messages in thread
From: Bart Van Assche @ 2017-01-12 19:13 UTC (permalink / raw)
  To: jthumshirn, lsf-pc, sagi
  Cc: hch, keith.busch, linux-block, linux-nvme, Linux-scsi

On Thu, 2017-01-12 at 10:41 +0200, Sagi Grimberg wrote:
> First, when the nvme device fires an interrupt, the driver consumes
> the completion(s) from the interrupt (usually there will be some more
> completions waiting in the cq by the time the host start processing it).
> With irq-poll, we disable further interrupts and schedule soft-irq for
> processing, which if at all, improve the completions per interrupt
> utilization (because it takes slightly longer before processing the cq).
> 
> Moreover, irq-poll is budgeting the completion queue processing which is
> important for a couple of reasons.
> 
> 1. it prevents hard-irq context abuse like we do today. if other cpu
>     cores are pounding with more submissions on the same queue, we might
>     get into a hard-lockup (which I've seen happening).
> 
> 2. irq-poll maintains fairness between devices by correctly budgeting
>     the processing of different completions queues that share the same
>     affinity. This can become crucial when working with multiple nvme
>     devices, each has multiple io queues that share the same IRQ
>     assignment.
> 
> 3. It reduces (or at least should reduce) the overall number of
>     interrupts in the system because we only enable interrupts again
>     when the completion queue is completely processed.
> 
> So overall, I think it's very useful for nvme and other modern HBAs,
> but unfortunately, other than solving (1), I wasn't able to see
> performance improvement but rather a slight regression, but I can't
> explain where its coming from...

Hello Sagi,

Thank you for the additional clarification. Although I am not sure whether
irq-poll is the ideal solution for the problems that have been described
above, I agree that it would help to discuss this topic further during
LSF/MM.

Bart.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
  2017-01-11 15:13     ` Jens Axboe
  (?)
@ 2017-01-13 15:56       ` Johannes Thumshirn
  -1 siblings, 0 replies; 120+ messages in thread
From: Johannes Thumshirn @ 2017-01-13 15:56 UTC (permalink / raw)
  To: Jens Axboe
  Cc: lsf-pc, linux-block, Linux-scsi, Sagi Grimberg, linux-nvme,
	Christoph Hellwig, Keith Busch

On Wed, Jan 11, 2017 at 08:13:02AM -0700, Jens Axboe wrote:
> On 01/11/2017 08:07 AM, Jens Axboe wrote:
> > On 01/11/2017 06:43 AM, Johannes Thumshirn wrote:
> >> Hi all,
> >>
> >> I'd like to attend LSF/MM and would like to discuss polling for block drivers.
> >>
> >> Currently there is blk-iopoll but it is neither as widely used as NAPI in the
> >> networking field and accoring to Sagi's findings in [1] performance with
> >> polling is not on par with IRQ usage.
> >>
> >> On LSF/MM I'd like to whether it is desirable to have NAPI like polling in
> >> more block drivers and how to overcome the currently seen performance issues.
> > 
> > It would be an interesting topic to discuss, as it is a shame that blk-iopoll
> > isn't used more widely.
> 
> Forgot to mention - it should only be a topic, if experimentation has
> been done and results gathered to pin point what the issues are, so we
> have something concrete to discus. I'm not at all interested in a hand
> wavy discussion on the topic.

So here are my first real numbers on this topic with some spinning rust:

All of this is done with 4.10-rc3, and we at least see no performance degradation
with a poll budget of 128 or 256 (oddly, the maximum that irq_poll currently
allows you to have). Clearly the disk is the limiting factor here and we have
already saturated it. I'll do AHCI SSD tests on Monday. Hannes did some tests
with mptXsas and an SSD; maybe he can share his findings as well.

scsi-sq:
--------
baseline:
  read : io=66776KB, bw=1105.5KB/s, iops=276, runt= 60406msec
  write: io=65812KB, bw=1089.6KB/s, iops=272, runt= 60406msec

AHCI irq_poll budget 31:
  read : io=53372KB, bw=904685B/s, iops=220, runt= 60411msec
  write: io=52596KB, bw=891531B/s, iops=217, runt= 60411msec

AHCI irq_poll budget 128:
  read : io=66664KB, bw=1106.4KB/s, iops=276, runt= 60257msec
  write: io=65608KB, bw=1088.9KB/s, iops=272, runt= 60257msec

AHCI irq_poll budget 256:
  read : io=67048KB, bw=1111.2KB/s, iops=277, runt= 60296msec
  write: io=65916KB, bw=1093.3KB/s, iops=273, runt= 60296msec

scsi-mq:
--------
baseline:
  read : io=78220KB, bw=1300.7KB/s, iops=325, runt= 60140msec
  write: io=77104KB, bw=1282.8KB/s, iops=320, runt= 60140msec

AHCI irq_poll budget 256:
  read : io=78316KB, bw=1301.7KB/s, iops=325, runt= 60167msec
  write: io=77172KB, bw=1282.7KB/s, iops=320, runt= 60167msec


-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
  2017-01-12  8:23       ` Sagi Grimberg
@ 2017-01-17 15:38         ` Sagi Grimberg
  -1 siblings, 0 replies; 120+ messages in thread
From: Sagi Grimberg @ 2017-01-17 15:38 UTC (permalink / raw)
  To: Jens Axboe, Johannes Thumshirn, lsf-pc
  Cc: linux-block, Linux-scsi, linux-nvme, Christoph Hellwig, Keith Busch

[-- Attachment #1: Type: text/plain, Size: 5615 bytes --]

Hey, so I did some initial analysis of what's going on with
irq-poll.

First, I sampled how much time passes between getting the interrupt in
nvme_irq and the initial visit to nvme_irqpoll_handler. I ran a single-threaded
fio with QD=32 of 4K reads. These are two samples of a histogram of that
latency (usecs):
--
[1]
queue = b'nvme0q1'
      usecs               : count     distribution
          0 -> 1          : 7310     |****************************************|
          2 -> 3          : 11       |                                        |
          4 -> 7          : 10       |                                        |
          8 -> 15         : 20       |                                        |
         16 -> 31         : 0        |                                        |
         32 -> 63         : 0        |                                        |
         64 -> 127        : 1        |                                        |

[2]
queue = b'nvme0q1'
      usecs               : count     distribution
          0 -> 1          : 7309     |****************************************|
          2 -> 3          : 14       |                                        |
          4 -> 7          : 7        |                                        |
          8 -> 15         : 17       |                                        |

We can see that most of the time our latency is pretty good (in the 0-1 usecs
bucket) but with huge tail latencies (some in 8-15 usecs and even one in the
64-127 usecs bucket).
**NOTE: in order to reduce the tracing impact on performance I sampled only
every 100th interrupt.

I also sampled for multiple threads/queues with QD=32 of 4K reads.
This is a collection of histograms for 5 queues (5 fio threads):
queue = b'nvme0q1'
      usecs               : count     distribution
          0 -> 1          : 701      |****************************************|
          2 -> 3          : 177      |**********                              |
          4 -> 7          : 56       |***                                     |
          8 -> 15         : 24       |*                                       |
         16 -> 31         : 6        |                                        |
         32 -> 63         : 1        |                                        |

queue = b'nvme0q2'
      usecs               : count     distribution
          0 -> 1          : 412      |****************************************|
          2 -> 3          : 52       |*****                                   |
          4 -> 7          : 19       |*                                       |
          8 -> 15         : 13       |*                                       |
         16 -> 31         : 5        |                                        |

queue = b'nvme0q3'
      usecs               : count     distribution
          0 -> 1          : 381      |****************************************|
          2 -> 3          : 74       |*******                                 |
          4 -> 7          : 26       |**                                      |
          8 -> 15         : 12       |*                                       |
         16 -> 31         : 3        |                                        |
         32 -> 63         : 0        |                                        |
         64 -> 127        : 0        |                                        |
        128 -> 255        : 1        |                                        |

queue = b'nvme0q4'
      usecs               : count     distribution
          0 -> 1          : 386      |****************************************|
          2 -> 3          : 63       |******                                  |
          4 -> 7          : 30       |***                                     |
          8 -> 15         : 11       |*                                       |
         16 -> 31         : 7        |                                        |
         32 -> 63         : 1        |                                        |

queue = b'nvme0q5'
      usecs               : count     distribution
          0 -> 1          : 384      |****************************************|
          2 -> 3          : 69       |*******                                 |
          4 -> 7          : 25       |**                                      |
          8 -> 15         : 15       |*                                       |
         16 -> 31         : 3        |                                        |

Overall it looks pretty much the same, but with some more samples in the tails...

Next, I sampled how many completions we are able to consume per interrupt.
Two examples of these histograms:
--
queue = b'nvme0q1'
      completed     : count     distribution
         0          : 0        |                                        |
         1          : 11690    |****************************************|
         2          : 46       |                                        |
         3          : 1        |                                        |

queue = b'nvme0q1'
      completed     : count     distribution
         0          : 0        |                                        |
         1          : 944      |****************************************|
         2          : 8        |                                        |
--

So it looks like we are very inefficient here: most of the time we catch only
1 completion per interrupt, and the whole point is that we need to find more!
This fio is single-threaded with QD=32, so I'd expect us to be somewhere in
the 8-31 bucket almost all the time... I also tried QD=1024 and the histogram
is still the same.
**NOTE: here I also sampled only every 100th interrupt.


I'll try to run the counter on the current nvme driver and see what I get.



I attached the bpf scripts I wrote (nvme-trace-irq, nvme-count-comps)
in the hope that someone is interested enough to try to reproduce these
numbers on their setup and maybe suggest some other useful tracing
we can do.

Prerequisites:
1. iovisor is needed for python bpf support.
   $ echo "deb [trusted=yes] https://repo.iovisor.org/apt/xenial xenial-nightly main" | sudo tee /etc/apt/sources.list.d/iovisor.list
   $ sudo apt-get update -y
   $ sudo apt-get install bcc-tools -y
   # Nasty hack: bcc is only packaged for python2 but is compliant with python3.
   $ sudo cp -r /usr/lib/python2.7/dist-packages/bcc /usr/lib/python3/dist-packages/

2. Because we don't have the nvme-pci symbols exported, the nvme.h file is needed
   on the test machine (where the bpf code is running). I used an nfs mount for
   the linux source (this is why I include from /mnt/linux in the scripts).


[-- Attachment #2: nvme-count-comps --]
[-- Type: text/plain, Size: 6398 bytes --]

#!/usr/bin/python3
# @lint-avoid-python-3-compatibility-imports

from __future__ import print_function
from bcc import BPF
from time import sleep, strftime
import argparse

# arguments
examples = """examples:
    ./nvme_comp_cout            # summarize completions consumed per interrupt as a histogram
    ./nvme_comp_cout 1 10       # print 1 second summaries, 10 times
    ./nvme_comp_cout -mT 1      # 1s summaries, milliseconds, and timestamps
    ./nvme_comp_cout -Q         # show each nvme queue device separately
"""
parser = argparse.ArgumentParser(
    description="Summarize NVMe completions consumed per interrupt as a histogram",
    formatter_class=argparse.RawDescriptionHelpFormatter,
    epilog=examples)
parser.add_argument("-T", "--timestamp", action="store_true",
    help="include timestamp on output")
parser.add_argument("-m", "--milliseconds", action="store_true",
    help="millisecond histogram")
parser.add_argument("-Q", "--queues", action="store_true",
    help="print a histogram per queue")
parser.add_argument("--freq", help="Account every N-th request",
    type=int, required=False)
parser.add_argument("interval", nargs="?", default=2,
    help="output interval, in seconds")
parser.add_argument("count", nargs="?", default=99999999,
    help="number of outputs")
args = parser.parse_args()
countdown = int(args.count)
debug = 0

# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
/*****************************************************************
 * Nasty hack because we don't have the nvme-pci structs exported
 *****************************************************************/
#include <linux/aer.h>
#include <linux/bitops.h>
#include <linux/blkdev.h>
#include <linux/blk-mq.h>
#include <linux/blk-mq-pci.h>
#include <linux/cpu.h>
#include <linux/delay.h>
#include <linux/errno.h>
#include <linux/fs.h>
#include <linux/genhd.h>
#include <linux/hdreg.h>
#include <linux/idr.h>
#include <linux/init.h>
#include <linux/interrupt.h>
#include <linux/io.h>
#include <linux/kdev_t.h>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/mutex.h>
#include <linux/pci.h>
#include <linux/poison.h>
#include <linux/ptrace.h>
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/t10-pi.h>
#include <linux/timer.h>
#include <linux/types.h>
#include <linux/io-64-nonatomic-lo-hi.h>
#include <asm/unaligned.h>
#include <linux/irq_poll.h>
#include "/mnt/linux/drivers/nvme/host/nvme.h"

struct nvme_dev;
struct nvme_queue;
/*
 * Represents an NVM Express device.  Each nvme_dev is a PCI function.
 */
struct nvme_dev {
	struct nvme_queue **queues;
	struct blk_mq_tag_set tagset;
	struct blk_mq_tag_set admin_tagset;
	u32 __iomem *dbs;
	struct device *dev;
	struct dma_pool *prp_page_pool;
	struct dma_pool *prp_small_pool;
	unsigned queue_count;
	unsigned online_queues;
	unsigned max_qid;
	int q_depth;
	u32 db_stride;
	void __iomem *bar;
	struct work_struct reset_work;
	struct work_struct remove_work;
	struct timer_list watchdog_timer;
	struct mutex shutdown_lock;
	bool subsystem;
	void __iomem *cmb;
	dma_addr_t cmb_dma_addr;
	u64 cmb_size;
	u32 cmbsz;
	u32 cmbloc;
	struct nvme_ctrl ctrl;
	struct completion ioq_wait;
};

/*
 * An NVM Express queue.  Each device has at least two (one for admin
 * commands and one for I/O commands).
 */
struct nvme_queue {
	struct device *q_dmadev;
	struct nvme_dev *dev;
	char irqname[24];
	spinlock_t sq_lock;
	spinlock_t cq_lock;
	struct nvme_command *sq_cmds;
	struct nvme_command __iomem *sq_cmds_io;
	volatile struct nvme_completion *cqes;
	struct blk_mq_tags **tags;
	dma_addr_t sq_dma_addr;
	dma_addr_t cq_dma_addr;
	u32 __iomem *q_db;
	u16 q_depth;
	s16 cq_vector;
	u16 sq_tail;
	u16 cq_head;
	u16 qid;
	u8 cq_phase;
	struct irq_poll	iop;
};

typedef struct queue_key {
    char queue[24];
    u64 slot;
} queue_key_t;

/* Completion counter context */
struct nvmeq {
    struct nvme_queue *q;
    u64 completed;
};

BPF_TABLE("percpu_array", int, struct nvmeq, qarr, 1);
BPF_TABLE("percpu_array", int, int, call_count, 1);
STORAGE

/* trace nvme interrupt */
int trace_interrupt_start(struct pt_regs *ctx, int irq, void *data)
{
    __CALL__COUNT__FILTER__

    struct nvmeq q ={};
    int index = 0;

    q.q = data;
    q.completed = 0; /* reset completions */

    qarr.update(&index, &q);
    return 0;
}

/* count completed on each irqpoll end */
int trace_irqpoll_end(struct pt_regs *ctx)
{
    __CALL__COUNT__FILTER__

    struct nvmeq zero = {};
    int index = 0;
    struct nvmeq *q;
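    /* kretprobe context: on x86-64, ctx->ax holds nvme_irqpoll_handler()'s return value (completions consumed) */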
    int completed = ctx->ax;

    q = qarr.lookup_or_init(&index, &zero);
    if (q == NULL) {
	goto out;
    }

    q->completed += completed;
    /* No variables in kretprobe :( 64 is our budget */
    if (completed < 64) {
        /* store as histogram */
        STORE
        q->completed = 0;
    }

out:
    return 0;
}
"""

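# Optional sampling filter: with --freq N, only every N-th probe hit falls
# through; the other hits return early so the tracing overhead stays low.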
call_count_filter = """
{
    int zero = 0;
    int index =0;
    int *skip;

    skip = call_count.lookup_or_init(&index, &zero);

    if ((*skip) < %d) {
        (*skip)++;
        return 0;
    }
    (*skip) = 0;
}
"""

# code substitutions
if args.queues:
    bpf_text = bpf_text.replace('STORAGE',
        'BPF_HISTOGRAM(dist, queue_key_t);')
    bpf_text = bpf_text.replace('STORE',
        'queue_key_t key = {.slot = q->completed}; ' +
        'bpf_probe_read(&key.queue, sizeof(key.queue), ' +
        'q->q->irqname); dist.increment(key);')
else:
    bpf_text = bpf_text.replace('STORAGE', 'BPF_HISTOGRAM(dist);')
    bpf_text = bpf_text.replace('STORE',
        'dist.increment(q->completed);')

bpf_text = bpf_text.replace("__CALL__COUNT__FILTER__", call_count_filter % (args.freq - 1) if args.freq is not None else "")
if debug:
    print(bpf_text)


# load BPF program
b = BPF(text=bpf_text)
b.attach_kprobe(event="nvme_irq", fn_name="trace_interrupt_start")
b.attach_kretprobe(event="nvme_irqpoll_handler", fn_name="trace_irqpoll_end")

print("Tracing nvme I/O interrupt/irqpoll... Hit Ctrl-C to end.")

# output
exiting = 0 if args.interval else 1
dist = b.get_table("dist")
while (1):
    try:
        sleep(int(args.interval))
    except KeyboardInterrupt:
        exiting = 1

    print()
    if args.timestamp:
        print("%-8s\n" % strftime("%H:%M:%S"), end="")

    dist.print_linear_hist("completed", "queue")
    dist.clear()

    countdown -= 1
    if exiting or countdown == 0:
        exit()

[-- Attachment #3: nvme-trace-irq --]
[-- Type: text/plain, Size: 6397 bytes --]

#!/usr/bin/python3
# @lint-avoid-python-3-compatibility-imports

from __future__ import print_function
from bcc import BPF
from time import sleep, strftime
import argparse

# arguments
examples = """examples:
    ./nvmetrace            # summarize interrupt->irqpoll latency as a histogram
    ./nvmetrace 1 10       # print 1 second summaries, 10 times
    ./nvmetrace -mT 1      # 1s summaries, milliseconds, and timestamps
    ./nvmetrace -Q         # show each nvme queue device separately
"""
parser = argparse.ArgumentParser(
    description="Summarize interrupt to softirq latency as a histogram",
    formatter_class=argparse.RawDescriptionHelpFormatter,
    epilog=examples)
parser.add_argument("-T", "--timestamp", action="store_true",
    help="include timestamp on output")
parser.add_argument("-m", "--milliseconds", action="store_true",
    help="millisecond histogram")
parser.add_argument("-Q", "--queues", action="store_true",
    help="print a histogram per queue")
parser.add_argument("--freq", help="Account every N-th request", type=int, required=False)
parser.add_argument("interval", nargs="?", default=2,
    help="output interval, in seconds")
parser.add_argument("count", nargs="?", default=99999999,
    help="number of outputs")
args = parser.parse_args()
countdown = int(args.count)
debug = 0

# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
/*****************************************************************
 * Nasty hack because we don't have the nvme-pci structs exported
 *****************************************************************/
#include <linux/aer.h>
#include <linux/bitops.h>
#include <linux/blkdev.h>
#include <linux/blk-mq.h>
#include <linux/blk-mq-pci.h>
#include <linux/cpu.h>
#include <linux/delay.h>
#include <linux/errno.h>
#include <linux/fs.h>
#include <linux/genhd.h>
#include <linux/hdreg.h>
#include <linux/idr.h>
#include <linux/init.h>
#include <linux/interrupt.h>
#include <linux/io.h>
#include <linux/kdev_t.h>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/mutex.h>
#include <linux/pci.h>
#include <linux/poison.h>
#include <linux/ptrace.h>
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/t10-pi.h>
#include <linux/timer.h>
#include <linux/types.h>
#include <linux/io-64-nonatomic-lo-hi.h>
#include <asm/unaligned.h>
#include <linux/irq_poll.h>

/* location of nvme.h */
#include "/mnt/linux/drivers/nvme/host/nvme.h"

struct nvme_dev;
struct nvme_queue;
/*
 * Represents an NVM Express device.  Each nvme_dev is a PCI function.
 */
struct nvme_dev {
	struct nvme_queue **queues;
	struct blk_mq_tag_set tagset;
	struct blk_mq_tag_set admin_tagset;
	u32 __iomem *dbs;
	struct device *dev;
	struct dma_pool *prp_page_pool;
	struct dma_pool *prp_small_pool;
	unsigned queue_count;
	unsigned online_queues;
	unsigned max_qid;
	int q_depth;
	u32 db_stride;
	void __iomem *bar;
	struct work_struct reset_work;
	struct work_struct remove_work;
	struct timer_list watchdog_timer;
	struct mutex shutdown_lock;
	bool subsystem;
	void __iomem *cmb;
	dma_addr_t cmb_dma_addr;
	u64 cmb_size;
	u32 cmbsz;
	u32 cmbloc;
	struct nvme_ctrl ctrl;
	struct completion ioq_wait;
};

/*
 * An NVM Express queue.  Each device has at least two (one for admin
 * commands and one for I/O commands).
 */
struct nvme_queue {
	struct device *q_dmadev;
	struct nvme_dev *dev;
	char irqname[24];
	spinlock_t sq_lock;
	spinlock_t cq_lock;
	struct nvme_command *sq_cmds;
	struct nvme_command __iomem *sq_cmds_io;
	volatile struct nvme_completion *cqes;
	struct blk_mq_tags **tags;
	dma_addr_t sq_dma_addr;
	dma_addr_t cq_dma_addr;
	u32 __iomem *q_db;
	u16 q_depth;
	s16 cq_vector;
	u16 sq_tail;
	u16 cq_head;
	u16 qid;
	u8 cq_phase;
	struct irq_poll	iop;
};

typedef struct queue_key {
    char queue[24];
    u64 slot;
} queue_key_t;

BPF_HASH(start, struct nvme_queue *);
BPF_TABLE("percpu_array", int, int, call_count, 1);
STORAGE

/* timestamp nvme interrupt */
int trace_interrupt_start(struct pt_regs *ctx, int irq, void *data)
{
    __CALL__COUNT__FILTER__

    struct nvme_queue *q = data;
    u64 ts = bpf_ktime_get_ns();
    start.update(&q, &ts);
    return 0;
}

/* timestamp nvme irqpoll */
int trace_irqpoll_start(struct pt_regs *ctx, struct irq_poll *iop, int budget)
{
    struct nvme_queue *q = container_of(iop, struct nvme_queue, iop);
    u64 *tsp, delta;

    /* fetch timestamp and calculate delta */
    tsp = start.lookup(&q);
    if (tsp == 0) {
        return 0;   /* missed issue */
    }
    delta = bpf_ktime_get_ns() - *tsp;
    FACTOR

    /* store as histogram */
    STORE
    start.delete(&q);

    return 0;
}
"""

# code substitutions
call_count_filter = """
{
    int zero = 0;
    int index =0;
    int *skip;

    skip = call_count.lookup_or_init(&index, &zero);

    if ((*skip) < %d) {
        (*skip)++;
        return 0;
    }
    (*skip) = 0;
}
"""

if args.milliseconds:
    bpf_text = bpf_text.replace('FACTOR', 'delta /= 1000000;')
    label = "msecs"
else:
    bpf_text = bpf_text.replace('FACTOR', 'delta /= 1000;')
    label = "usecs"
if args.queues:
    bpf_text = bpf_text.replace('STORAGE',
        'BPF_HISTOGRAM(dist, queue_key_t);')
    bpf_text = bpf_text.replace('STORE',
        'queue_key_t key = {.slot = bpf_log2l(delta)}; ' +
        'bpf_probe_read(&key.queue, sizeof(key.queue), ' +
        'q->irqname); dist.increment(key);')
else:
    bpf_text = bpf_text.replace('STORAGE', 'BPF_HISTOGRAM(dist);')
    bpf_text = bpf_text.replace('STORE',
        'dist.increment(bpf_log2l(delta));')

bpf_text = bpf_text.replace("__CALL__COUNT__FILTER__", call_count_filter % (args.freq - 1) if args.freq is not None else "")
if debug:
    print(bpf_text)

# load BPF program
b = BPF(text=bpf_text)
b.attach_kprobe(event="nvme_irq", fn_name="trace_interrupt_start")
b.attach_kprobe(event="nvme_irqpoll_handler", fn_name="trace_irqpoll_start")

print("Tracing nvme I/O interrupt/irqpoll... Hit Ctrl-C to end.")

# output
exiting = 0 if args.interval else 1
dist = b.get_table("dist")
while (1):
    try:
        sleep(int(args.interval))
    except KeyboardInterrupt:
        exiting = 1

    print()
    if args.timestamp:
        print("%-8s\n" % strftime("%H:%M:%S"), end="")

    dist.print_log2_hist(label, "queue")
    dist.clear()

    countdown -= 1
    if exiting or countdown == 0:
        exit()

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
  2017-01-17 15:38         ` Sagi Grimberg
@ 2017-01-17 15:45           ` Sagi Grimberg
  -1 siblings, 0 replies; 120+ messages in thread
From: Sagi Grimberg @ 2017-01-17 15:45 UTC (permalink / raw)
  To: Jens Axboe, Johannes Thumshirn, lsf-pc
  Cc: linux-block, Linux-scsi, linux-nvme, Christoph Hellwig, Keith Busch


> --
> [1]
> queue = b'nvme0q1'
>      usecs               : count     distribution
>          0 -> 1          : 7310 |****************************************|
>          2 -> 3          : 11       |      |
>          4 -> 7          : 10       |      |
>          8 -> 15         : 20       |      |
>         16 -> 31         : 0        |      |
>         32 -> 63         : 0        |      |
>         64 -> 127        : 1        |      |
>
> [2]
> queue = b'nvme0q1'
>      usecs               : count     distribution
>          0 -> 1          : 7309 |****************************************|
>          2 -> 3          : 14       |      |
>          4 -> 7          : 7        |      |
>          8 -> 15         : 17       |      |
>

Rrr, email made the histograms look funky (tabs vs. spaces...)
The count is what's important anyways...

Just adding that I used an Intel P3500 nvme device.

> We can see that most of the time our latency is pretty good (<1ns) but with
> huge tail latencies (some 8-15 ns and even one in 32-63 ns).

Obviously that's micro-seconds and not nano-seconds (I wish...)

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
  2017-01-17 15:38         ` Sagi Grimberg
@ 2017-01-17 16:15           ` Sagi Grimberg
  -1 siblings, 0 replies; 120+ messages in thread
From: Sagi Grimberg @ 2017-01-17 16:15 UTC (permalink / raw)
  To: Jens Axboe, Johannes Thumshirn, lsf-pc
  Cc: linux-block, Linux-scsi, linux-nvme, Christoph Hellwig, Keith Busch

Oh, and the current code that was tested can be found at:

git://git.infradead.org/nvme.git nvme-irqpoll

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
  2017-01-17 16:15           ` Sagi Grimberg
  (?)
@ 2017-01-17 16:27             ` Johannes Thumshirn
  -1 siblings, 0 replies; 120+ messages in thread
From: Johannes Thumshirn @ 2017-01-17 16:27 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Jens Axboe, lsf-pc, linux-block, Linux-scsi, linux-nvme,
	Christoph Hellwig, Keith Busch

On Tue, Jan 17, 2017 at 06:15:43PM +0200, Sagi Grimberg wrote:
> Oh, and the current code that was tested can be found at:
> 
> git://git.infradead.org/nvme.git nvme-irqpoll

Just for the record, all tests you've run are with the upper irq_poll_budget of
256 [1]?

We (Hannes and me) recently stumbed accross this when trying to poll for more
than 256 queue entries in the drivers we've been testing.

Did your system load reduce with irq polling? In theory it should but I have
seen increases with AHCI at least according to fio. IIRC Hannes saw decreases
with his SAS HBA tests, as expected.

[1] lib/irq_poll.c:13
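
For reference, the shared budget referred to in [1] behaves roughly as in the
sketch below. This is a simplified illustration, not the actual lib/irq_poll.c
source; next_scheduled_instance() is a made-up helper standing in for the real
per-cpu list walk.

#include <linux/irq_poll.h>

/* Simplified sketch of the shared per-softirq budget (names approximate). */
static unsigned int irq_poll_budget = 256;	/* the limit referenced in [1] */

static void irq_poll_softirq_sketch(void)
{
	int budget = irq_poll_budget;
	struct irq_poll *iop;

	/* walk every irq_poll instance scheduled on this CPU */
	while ((iop = next_scheduled_instance()) != NULL) {	/* made-up helper */
		/* each driver callback may consume up to its own weight per call... */
		int work = iop->poll(iop, iop->weight);

		/* ...but all instances share one budget per softirq run */
		budget -= work;
		if (budget <= 0)
			break;	/* budget exhausted; the real code re-raises the softirq here */
	}
}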

Byte,
	Johannes

-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
  2017-01-17 16:27             ` Johannes Thumshirn
@ 2017-01-17 16:38               ` Sagi Grimberg
  -1 siblings, 0 replies; 120+ messages in thread
From: Sagi Grimberg @ 2017-01-17 16:38 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Jens Axboe, lsf-pc, linux-block, Linux-scsi, linux-nvme,
	Christoph Hellwig, Keith Busch


> Just for the record, all tests you've run are with the upper irq_poll_budget of
> 256 [1]?

Yes, but that's the point, I never ever reach this budget because
I'm only processing 1-2 completions per interrupt.

> We (Hannes and me) recently stumbed accross this when trying to poll for more
> than 256 queue entries in the drivers we've been testing.

What do you mean by stumbed? irq-poll should be agnostic to the fact
that drivers can poll more than their given budget?

> Did your system load reduce with irq polling? In theory it should but I have
> seen increases with AHCI at least according to fio. IIRC Hannes saw decreases
> with his SAS HBA tests, as expected.

I didn't see any reduction. When I tested on a single cpu core (to
simplify for a single queue) the cpu was at 100% but got fewer iops
(which makes sense, a single cpu-core is not enough to max out the nvme
device, at least not the core I'm using). Before irqpoll I got
~230 KIOPs on a single cpu-core and after irqpoll I got ~205 KIOPs
which is consistent with the ~10% iops decrease I've reported in the
original submission.
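
For readers without the branch checked out, the irqpoll pattern being measured
looks roughly like the sketch below. It only assumes the generic
linux/irq_poll.h API (irq_poll_sched(), irq_poll_complete(), irq_poll_init())
and the nvme_queue layout shown in the attached bpf scripts;
nvme_process_cq_sketch() is a hypothetical stand-in for the driver's CQ
reaping, so see the nvme.git nvme-irqpoll branch mentioned earlier for the
real code.

#include <linux/interrupt.h>
#include <linux/irq_poll.h>

/* Hard interrupt handler: do almost nothing, just hand off to irq_poll (sketch). */
static irqreturn_t nvme_irq_sketch(int irq, void *data)
{
	struct nvme_queue *nvmeq = data;

	irq_poll_sched(&nvmeq->iop);	/* defer completion processing to softirq context */
	return IRQ_HANDLED;
}

/* irq_poll callback: reap completions, but no more than the budget we were given (sketch). */
static int nvme_irqpoll_handler_sketch(struct irq_poll *iop, int budget)
{
	struct nvme_queue *nvmeq = container_of(iop, struct nvme_queue, iop);
	int completed = nvme_process_cq_sketch(nvmeq, budget);	/* hypothetical CQ reaping helper */

	if (completed < budget)
		irq_poll_complete(iop);	/* CQ drained for now, allow rescheduling on the next IRQ */
	return completed;
}

/* at queue init, e.g.: irq_poll_init(&nvmeq->iop, 64, nvme_irqpoll_handler_sketch); */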

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
  2017-01-17 15:38         ` Sagi Grimberg
                           ` (2 preceding siblings ...)
  (?)
@ 2017-01-17 16:44         ` Andrey Kuzmin
  2017-01-17 16:50             ` Sagi Grimberg
  -1 siblings, 1 reply; 120+ messages in thread
From: Andrey Kuzmin @ 2017-01-17 16:44 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Jens Axboe, Johannes Thumshirn, lsf-pc, linux-block,
	Christoph Hellwig, Keith Busch, linux-nvme, Linux-scsi

On Tue, Jan 17, 2017 at 6:38 PM, Sagi Grimberg <sagi@grimberg.me> wrote:

> Hey, so I made some initial analysis of whats going on with
> irq-poll.
>
> First, I sampled how much time it takes before we
> get the interrupt in nvme_irq and the initial visit
> to nvme_irqpoll_handler. I ran a single threaded fio
> with QD=32 of 4K reads. This is two displays of a
> histogram of the latency (ns):
> --
> [1]
> queue = b'nvme0q1'
>      usecs               : count     distribution
>          0 -> 1          : 7310 |****************************************|
>          2 -> 3          : 11       |      |
>          4 -> 7          : 10       |      |
>          8 -> 15         : 20       |      |
>         16 -> 31         : 0        |      |
>         32 -> 63         : 0        |      |
>         64 -> 127        : 1        |      |
>
> [2]
> queue = b'nvme0q1'
>      usecs               : count     distribution
>          0 -> 1          : 7309 |****************************************|
>          2 -> 3          : 14       |      |
>          4 -> 7          : 7        |      |
>          8 -> 15         : 17       |      |
>
> We can see that most of the time our latency is pretty good (<1ns) but with
> huge tail latencies (some 8-15 ns and even one in 32-63 ns).
> **NOTE, in order to reduce the tracing impact on performance I sampled
> for every 100 interrupts.
>
> I also sampled for a multiple threads/queues with QD=32 of 4K reads.
> This is a collection of histograms for 5 queues (5 fio threads):
> queue = b'nvme0q1'
>      usecs               : count     distribution
>          0 -> 1          : 701 |****************************************|
>          2 -> 3          : 177      |**********      |
>          4 -> 7          : 56       |***      |
>          8 -> 15         : 24       |*      |
>         16 -> 31         : 6        |      |
>         32 -> 63         : 1        |      |
>
> queue = b'nvme0q2'
>      usecs               : count     distribution
>          0 -> 1          : 412 |****************************************|
>          2 -> 3          : 52       |*****      |
>          4 -> 7          : 19       |*      |
>          8 -> 15         : 13       |*      |
>         16 -> 31         : 5        |      |
>
> queue = b'nvme0q3'
>      usecs               : count     distribution
>          0 -> 1          : 381 |****************************************|
>          2 -> 3          : 74       |*******      |
>          4 -> 7          : 26       |**      |
>          8 -> 15         : 12       |*      |
>         16 -> 31         : 3        |      |
>         32 -> 63         : 0        |      |
>         64 -> 127        : 0        |      |
>        128 -> 255        : 1        |      |
>
> queue = b'nvme0q4'
>      usecs               : count     distribution
>          0 -> 1          : 386 |****************************************|
>          2 -> 3          : 63       |******      |
>          4 -> 7          : 30       |***      |
>          8 -> 15         : 11       |*      |
>         16 -> 31         : 7        |      |
>         32 -> 63         : 1        |      |
>
> queue = b'nvme0q5'
>      usecs               : count     distribution
>          0 -> 1          : 384 |****************************************|
>          2 -> 3          : 69       |*******      |
>          4 -> 7          : 25       |**      |
>          8 -> 15         : 15       |*      |
>         16 -> 31         : 3        |      |
>
> Overall looks pretty much the same but some more samples with tails...
>
> Next, I sampled how many completions we are able to consume per interrupt.
> Two exaples of histograms of how many completions we take per interrupt.
> --
> queue = b'nvme0q1'
>      completed     : count     distribution
>         0          : 0        |                                        |
>         1          : 11690    |****************************************|
>         2          : 46       |                                        |
>         3          : 1        |                                        |
>
> queue = b'nvme0q1'
>      completed     : count     distribution
>         0          : 0        |                                        |
>         1          : 944      |****************************************|
>         2          : 8        |                                        |
> --
>
> So it looks like we are super not efficient because most of the times we
> catch 1
> completion per interrupt and the whole point is that we need to find more!
> This fio
> is single threaded with QD=32 so I'd expect that we be somewhere in 8-31
> almost all
> the time... I also tried QD=1024, histogram is still the same.
>

It looks like it takes you longer to submit an I/O than to service an
interrupt, so increasing queue depth in the single-threaded case doesn't
make much difference. You might want to try multiple threads per core with
QD, say, 32 (but beware that Intel limits the aggregate queue depth to 256
and even 128 for some models).

Regards,
Andrey




> **NOTE: Here I also sampled for every 100 interrupts.
>
>
> I'll try to run the counter on the current nvme driver and see what I get.
>
>
>
> I attached the bpf scripts I wrote (nvme-trace-irq, nvme-count-comps)
> with hope that someone is interested enough to try and reproduce these
> numbers on his/hers setup and maybe suggest some other useful tracing
> we can do.
>
> Prerequisites:
> 1. iovisor is needed for python bpf support.
>   $ echo "deb [trusted=yes] https://repo.iovisor.org/apt/xenial
> xenial-nightly main" | sudo tee /etc/apt/sources.list.d/iovisor.list
>   $ sudo apt-get update -y
>   $ sudo apt-get install bcc-tools -y
>   # Nastty hack .. bcc only available in python2 but copliant with
> python3..
>   $ sudo cp -r /usr/lib/python2.7/dist-packages/bcc
> /usr/lib/python3/dist-packages/
>
> 2. Because we don't have the nvme-pci symbols exported, The nvme.h file is
> needed on the
>    test machine (where the bpf code is running). I used nfs mount for the
> linux source (this
>    is why I include from /mnt/linux in the scripts).
>
>
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme
>
>

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
  2017-01-17 16:44         ` Andrey Kuzmin
@ 2017-01-17 16:50             ` Sagi Grimberg
  0 siblings, 0 replies; 120+ messages in thread
From: Sagi Grimberg @ 2017-01-17 16:50 UTC (permalink / raw)
  To: Andrey Kuzmin
  Cc: Jens Axboe, Johannes Thumshirn, lsf-pc, linux-block,
	Christoph Hellwig, Keith Busch, linux-nvme, Linux-scsi


>     So it looks like we are super not efficient because most of the
>     times we catch 1
>     completion per interrupt and the whole point is that we need to find
>     more! This fio
>     is single threaded with QD=32 so I'd expect that we be somewhere in
>     8-31 almost all
>     the time... I also tried QD=1024, histogram is still the same.
>
>
> It looks like it takes you longer to submit an I/O than to service an
> interrupt,

Well, with irq-poll we do practically nothing in the interrupt handler,
only schedule irq-poll. Note that the latency measures are only from
the point the interrupt arrives and the point we actually service it
by polling for completions.

> so increasing queue depth in the single-threaded case doesn't
> make much difference. You might want to try multiple threads per core
> with QD, say, 32

This is how I ran, QD=32.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
  2017-01-17 16:38               ` Sagi Grimberg
  (?)
@ 2017-01-18 13:51                 ` Johannes Thumshirn
  -1 siblings, 0 replies; 120+ messages in thread
From: Johannes Thumshirn @ 2017-01-18 13:51 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Jens Axboe, Christoph Hellwig, Linux-scsi, linux-nvme,
	linux-block, Keith Busch, lsf-pc

On Tue, Jan 17, 2017 at 06:38:43PM +0200, Sagi Grimberg wrote:
> 
> >Just for the record, all tests you've run are with the upper irq_poll_budget of
> >256 [1]?
> 
> Yes, but that's the point, I never ever reach this budget because
> I'm only processing 1-2 completions per interrupt.
> 
> >We (Hannes and me) recently stumbed accross this when trying to poll for more
> >than 256 queue entries in the drivers we've been testing.

s/stumbed/stumbled/

> 
> What do you mean by stumbed? irq-poll should be agnostic to the fact
> that drivers can poll more than their given budget?

So what you say is you saw a consomed == 1 [1] most of the time?

[1] from http://git.infradead.org/nvme.git/commitdiff/eed5a9d925c59e43980047059fde29e3aa0b7836

-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
  2017-01-17 16:50             ` Sagi Grimberg
@ 2017-01-18 14:02               ` Hannes Reinecke
  -1 siblings, 0 replies; 120+ messages in thread
From: Hannes Reinecke @ 2017-01-18 14:02 UTC (permalink / raw)
  To: Sagi Grimberg, Andrey Kuzmin
  Cc: Jens Axboe, Johannes Thumshirn, lsf-pc, linux-block,
	Christoph Hellwig, Keith Busch, linux-nvme, Linux-scsi

On 01/17/2017 05:50 PM, Sagi Grimberg wrote:
> 
>>     So it looks like we are super not efficient because most of the
>>     times we catch 1
>>     completion per interrupt and the whole point is that we need to find
>>     more! This fio
>>     is single threaded with QD=32 so I'd expect that we be somewhere in
>>     8-31 almost all
>>     the time... I also tried QD=1024, histogram is still the same.
>>
>>
>> It looks like it takes you longer to submit an I/O than to service an
>> interrupt,
> 
> Well, with irq-poll we do practically nothing in the interrupt handler,
> only schedule irq-poll. Note that the latency measures are only from
> the point the interrupt arrives and the point we actually service it
> by polling for completions.
> 
>> so increasing queue depth in the singe-threaded case doesn't
>> make much difference. You might want to try multiple threads per core
>> with QD, say, 32
> 
> This is how I ran, QD=32.

The one thing which I found _really_ curious is this:

  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued    : total=r=7673377/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=256

(note the lines starting with 'submit' and 'complete').
They are _always_ 4, irrespective of the hardware and/or tests which I
run. Jens, what are these numbers supposed to mean?
Is this intended?
ATM the information content from those two lines is essentially 0,
seeing that they never change irrespective of the tests I'm doing.
(And which fio version I'm using ...)

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
  2017-01-18 13:51                 ` Johannes Thumshirn
@ 2017-01-18 14:27                   ` Sagi Grimberg
  -1 siblings, 0 replies; 120+ messages in thread
From: Sagi Grimberg @ 2017-01-18 14:27 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Jens Axboe, Christoph Hellwig, Linux-scsi, linux-nvme,
	linux-block, Keith Busch, lsf-pc


> So what you say is you saw consumed == 1 [1] most of the time?
>
> [1] from http://git.infradead.org/nvme.git/commitdiff/eed5a9d925c59e43980047059fde29e3aa0b7836

Exactly. By processing 1 completion per interrupt it makes perfect sense
why this performs poorly, it's not worth paying the soft-irq schedule
for only a single completion.

What I'm curious about is how consistent this is with different devices
(wish I had some...)
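
(For reference, the irq-poll pattern being measured is roughly the one
below. This is a sketch only: nvme_irqpoll_handler(), nvmeq_irq() and a
nvme_process_cq() that takes a budget and returns the number of reaped
completions are stand-ins for the helpers in the experimental branch, not
mainline code.)

static irqreturn_t nvme_irq(int irq, void *data)
{
        struct nvme_queue *nvmeq = data;

        /* the hard irq does almost nothing: mask the vector and defer
         * completion processing to the irq_poll softirq */
        disable_irq_nosync(irq);
        irq_poll_sched(&nvmeq->iop);
        return IRQ_HANDLED;
}

static int nvme_irqpoll_handler(struct irq_poll *iop, int budget)
{
        struct nvme_queue *nvmeq = container_of(iop, struct nvme_queue, iop);
        int done;

        /* reap up to 'budget' completions; with only ~1 CQE pending per
         * interrupt the softirq round trip costs more than it saves */
        done = nvme_process_cq(nvmeq, budget);
        if (done < budget) {
                irq_poll_complete(iop);
                enable_irq(nvmeq_irq(nvmeq));
        }
        return done;
}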

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
  2017-01-18 14:27                   ` Sagi Grimberg
@ 2017-01-18 14:36                     ` Andrey Kuzmin
  -1 siblings, 0 replies; 120+ messages in thread
From: Andrey Kuzmin @ 2017-01-18 14:36 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Johannes Thumshirn, Jens Axboe, Keith Busch, Linux-scsi,
	linux-nvme, Christoph Hellwig, linux-block, lsf-pc

On Wed, Jan 18, 2017 at 5:27 PM, Sagi Grimberg <sagi@grimberg.me> wrote:
>
>> So what you say is you saw consumed == 1 [1] most of the time?
>>
>> [1] from
>> http://git.infradead.org/nvme.git/commitdiff/eed5a9d925c59e43980047059fde29e3aa0b7836
>
>
> Exactly. By processing 1 completion per interrupt it makes perfect sense
> why this performs poorly, it's not worth paying the soft-irq schedule
> for only a single completion.

Your report provided these stats with one-completion dominance for the
single-threaded case. Does it also hold if you run multiple fio
threads per core?

Regards,
Andrey

>
> What I'm curious about is how consistent this is with different devices
> (wish I had some...)
>
>
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
  2017-01-18 14:36                     ` Andrey Kuzmin
@ 2017-01-18 14:40                       ` Sagi Grimberg
  -1 siblings, 0 replies; 120+ messages in thread
From: Sagi Grimberg @ 2017-01-18 14:40 UTC (permalink / raw)
  To: Andrey Kuzmin
  Cc: Johannes Thumshirn, Jens Axboe, Keith Busch, Linux-scsi,
	linux-nvme, Christoph Hellwig, linux-block, lsf-pc


> Your report provided these stats with one-completion dominance for the
> single-threaded case. Does it also hold if you run multiple fio
> threads per core?

It's useless to run more threads on that core, it's already fully
utilized. That single thread is already posting a fair amount of
submissions, so I don't see how adding more fio jobs can help in any
way.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
  2017-01-18 14:27                   ` Sagi Grimberg
  (?)
@ 2017-01-18 14:58                     ` Johannes Thumshirn
  -1 siblings, 0 replies; 120+ messages in thread
From: Johannes Thumshirn @ 2017-01-18 14:58 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Jens Axboe, Christoph Hellwig, Linux-scsi, linux-nvme,
	linux-block, Keith Busch, lsf-pc

On Wed, Jan 18, 2017 at 04:27:24PM +0200, Sagi Grimberg wrote:
> 
> >So what you say is you saw consumed == 1 [1] most of the time?
> >
> >[1] from http://git.infradead.org/nvme.git/commitdiff/eed5a9d925c59e43980047059fde29e3aa0b7836
> 
> Exactly. By processing 1 completion per interrupt it makes perfect sense
> why this performs poorly, it's not worth paying the soft-irq schedule
> for only a single completion.
> 
> What I'm curious about is how consistent this is with different devices
> (wish I had some...)

Hannes just spotted this:
static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
                         const struct blk_mq_queue_data *bd)
{
[...]
        __nvme_submit_cmd(nvmeq, &cmnd);
        nvme_process_cq(nvmeq);
        spin_unlock_irq(&nvmeq->q_lock);
        return BLK_MQ_RQ_QUEUE_OK;
out_cleanup_iod:
        nvme_free_iod(dev, req);
out_free_cmd:
        nvme_cleanup_cmd(req);
        return ret;
}

So we're draining the CQ on submit. This of course makes polling for
completions in the IRQ handler rather pointless as we already did in the
submission path. 

-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
  2017-01-18 14:58                     ` Johannes Thumshirn
@ 2017-01-18 15:14                       ` Sagi Grimberg
  -1 siblings, 0 replies; 120+ messages in thread
From: Sagi Grimberg @ 2017-01-18 15:14 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Jens Axboe, Christoph Hellwig, Linux-scsi, linux-nvme,
	linux-block, Keith Busch, lsf-pc


> Hannes just spotted this:
> static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
>                          const struct blk_mq_queue_data *bd)
> {
> [...]
>         __nvme_submit_cmd(nvmeq, &cmnd);
>         nvme_process_cq(nvmeq);
>         spin_unlock_irq(&nvmeq->q_lock);
>         return BLK_MQ_RQ_QUEUE_OK;
> out_cleanup_iod:
>         nvme_free_iod(dev, req);
> out_free_cmd:
>         nvme_cleanup_cmd(req);
>         return ret;
> }
>
> So we're draining the CQ on submit. This of course makes polling for
> completions in the IRQ handler rather pointless as we already did in the
> submission path.

I think you missed:
http://git.infradead.org/nvme.git/commit/49c91e3e09dc3c9dd1718df85112a8cce3ab7007

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
  2017-01-18 15:14                       ` Sagi Grimberg
  (?)
@ 2017-01-18 15:16                         ` Johannes Thumshirn
  -1 siblings, 0 replies; 120+ messages in thread
From: Johannes Thumshirn @ 2017-01-18 15:16 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Jens Axboe, Christoph Hellwig, Linux-scsi, linux-nvme,
	linux-block, Keith Busch, lsf-pc

On Wed, Jan 18, 2017 at 05:14:36PM +0200, Sagi Grimberg wrote:
> 
> >Hannes just spotted this:
> >static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
> >                         const struct blk_mq_queue_data *bd)
> >{
> >[...]
> >        __nvme_submit_cmd(nvmeq, &cmnd);
> >        nvme_process_cq(nvmeq);
> >        spin_unlock_irq(&nvmeq->q_lock);
> >        return BLK_MQ_RQ_QUEUE_OK;
> >out_cleanup_iod:
> >        nvme_free_iod(dev, req);
> >out_free_cmd:
> >        nvme_cleanup_cmd(req);
> >        return ret;
> >}
> >
> >So we're draining the CQ on submit. This of course makes polling for
> >completions in the IRQ handler rather pointless as we already did in the
> >submission path.
> 
> I think you missed:
> http://git.infradead.org/nvme.git/commit/49c91e3e09dc3c9dd1718df85112a8cce3ab7007

I indeed did, thanks.

-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
  2017-01-18 14:40                       ` Sagi Grimberg
@ 2017-01-18 15:35                         ` Andrey Kuzmin
  -1 siblings, 0 replies; 120+ messages in thread
From: Andrey Kuzmin @ 2017-01-18 15:35 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Johannes Thumshirn, Jens Axboe, Keith Busch, Linux-scsi,
	linux-nvme, Christoph Hellwig, linux-block, lsf-pc

On Wed, Jan 18, 2017 at 5:40 PM, Sagi Grimberg <sagi@grimberg.me> wrote:
>
>> Your report provided this stats with one-completion dominance for the
>> single-threaded case. Does it also hold if you run multiple fio
>> threads per core?
>
>
> It's useless to run more threads on that core, it's already fully
> utilized. That single thread is already posting a fair amount of
> submissions, so I don't see how adding more fio jobs can help in any
> way.

With a single thread, your completion processing/submission is
completely serialized. From my experience, it takes fio a couple of
microseconds to process a completion and submit the next request, and
that's (much) larger than your interrupt processing time.

Regards,
Andrey

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
  2017-01-18 15:16                         ` Johannes Thumshirn
  (?)
@ 2017-01-18 15:39                           ` Hannes Reinecke
  -1 siblings, 0 replies; 120+ messages in thread
From: Hannes Reinecke @ 2017-01-18 15:39 UTC (permalink / raw)
  To: Johannes Thumshirn, Sagi Grimberg
  Cc: Jens Axboe, Christoph Hellwig, Linux-scsi, linux-nvme,
	linux-block, Keith Busch, lsf-pc

On 01/18/2017 04:16 PM, Johannes Thumshirn wrote:
> On Wed, Jan 18, 2017 at 05:14:36PM +0200, Sagi Grimberg wrote:
>>
>>> Hannes just spotted this:
>>> static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
>>>                         const struct blk_mq_queue_data *bd)
>>> {
>>> [...]
>>>        __nvme_submit_cmd(nvmeq, &cmnd);
>>>        nvme_process_cq(nvmeq);
>>>        spin_unlock_irq(&nvmeq->q_lock);
>>>        return BLK_MQ_RQ_QUEUE_OK;
>>> out_cleanup_iod:
>>>        nvme_free_iod(dev, req);
>>> out_free_cmd:
>>>        nvme_cleanup_cmd(req);
>>>        return ret;
>>> }
>>>
>>> So we're draining the CQ on submit. This of course makes polling for
>>> completions in the IRQ handler rather pointless as we already did in the
>>> submission path.
>>
>> I think you missed:
>> http://git.infradead.org/nvme.git/commit/49c91e3e09dc3c9dd1718df85112a8cce3ab7007
> 
> I indeed did, thanks.
> 
But it doesn't help.

We're still having to wait for the first interrupt, and if we're really
fast that's the only completion we have to process.

Try this:


diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b4b32e6..e2dd9e2 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -623,6 +623,8 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
        }
        __nvme_submit_cmd(nvmeq, &cmnd);
        spin_unlock(&nvmeq->sq_lock);
+       disable_irq_nosync(nvmeq_irq(irq));
+       irq_poll_sched(&nvmeq->iop);
        return BLK_MQ_RQ_QUEUE_OK;
 out_cleanup_iod:
        nvme_free_iod(dev, req);

That should avoid the first interrupt, and with a bit of luck reduce the
number of interrupts _drastically_.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply related	[flat|nested] 120+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
  2017-01-18 15:39                           ` Hannes Reinecke
@ 2017-01-19  8:12                             ` Sagi Grimberg
  -1 siblings, 0 replies; 120+ messages in thread
From: Sagi Grimberg @ 2017-01-19  8:12 UTC (permalink / raw)
  To: Hannes Reinecke, Johannes Thumshirn
  Cc: Jens Axboe, Christoph Hellwig, Linux-scsi, linux-nvme,
	linux-block, Keith Busch, lsf-pc


>>> I think you missed:
>>> http://git.infradead.org/nvme.git/commit/49c91e3e09dc3c9dd1718df85112a8cce3ab7007
>>
>> I indeed did, thanks.
>>
> But it doesn't help.
>
> We're still having to wait for the first interrupt, and if we're really
> fast that's the only completion we have to process.
>
> Try this:
>
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index b4b32e6..e2dd9e2 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -623,6 +623,8 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
>         }
>         __nvme_submit_cmd(nvmeq, &cmnd);
>         spin_unlock(&nvmeq->sq_lock);
> +       disable_irq_nosync(nvmeq_irq(irq));
> +       irq_poll_sched(&nvmeq->iop);

a. This would trigger a condition where we disable the irq twice, which
is wrong, at least because it will generate a warning.

b. This would cause way too many triggers of ksoftirqd. In order for
it to be effective we need it to run only when it should, and optimally
when the completion queue has a batch of completions waiting.

After a deeper analysis, I agree with Bart that interrupt coalescing is
needed for it to work. The problem with nvme coalescing, as Jens said, is
a death penalty of 100us granularity. Hannes, Johannes, how does it look
with the devices you are testing with?
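
(For reference, the coalescing knob in question: a sketch of programming
the NVMe Interrupt Coalescing feature via Set Features. The exact kernel
call signature may differ by version; the dword11 layout is per the NVMe
spec.)

static int nvme_tune_coalescing(struct nvme_ctrl *ctrl, u8 thr, u8 time_100us)
{
        u32 result;
        /* feature 0x08: bits 7:0 aggregation threshold (CQ entries),
         * bits 15:8 aggregation time in 100us increments, hence the
         * coarse granularity mentioned above */
        u32 dword11 = (time_100us << 8) | thr;

        return nvme_set_features(ctrl, NVME_FEAT_IRQ_COALESCE, dword11,
                                 NULL, 0, &result);
}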

Also, I think that adaptive moderation is needed in order for it to
work well. I know that some networking drivers implemented adaptive
moderation in SW before having HW support for it. It can be done by
maintaining stats and having a periodic work that looks at it and
changes the moderation parameters.

Does anyone think that this is something we should consider?
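
A minimal sketch of the stats + periodic work skeleton (everything here is
hypothetical: the names, the policy and the 100ms period are made up, and
whether the adjusted knob is a HW coalescing parameter or something like
the irq_poll budget is exactly the open question):

struct cq_moder {
        u64                     irqs;   /* interrupts since last tick */
        u64                     cqes;   /* completions since last tick */
        unsigned int            budget; /* current moderation parameter */
        struct delayed_work     dwork;
};

static void cq_moder_work(struct work_struct *work)
{
        struct cq_moder *m = container_of(to_delayed_work(work),
                                          struct cq_moder, dwork);
        u64 per_irq = m->irqs ? m->cqes / m->irqs : 0;

        /* crude policy: grow the budget while we keep seeing batches,
         * shrink it when we mostly see single completions */
        if (per_irq > m->budget / 2)
                m->budget = min(m->budget * 2, 256U);
        else if (per_irq <= 1 && m->budget > 4)
                m->budget /= 2;

        m->irqs = 0;
        m->cqes = 0;
        schedule_delayed_work(&m->dwork, msecs_to_jiffies(100));
}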

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
  2017-01-19  8:12                             ` Sagi Grimberg
@ 2017-01-19  8:23                               ` Sagi Grimberg
  -1 siblings, 0 replies; 120+ messages in thread
From: Sagi Grimberg @ 2017-01-19  8:23 UTC (permalink / raw)
  To: Hannes Reinecke, Johannes Thumshirn
  Cc: Jens Axboe, Christoph Hellwig, Linux-scsi, linux-nvme,
	linux-block, Keith Busch, lsf-pc

Christoph suggested to me once that we could take a hybrid
approach where we consume a small number of completions (say 4)
right away from the interrupt handler, and if we have more
we schedule irq-poll to reap the rest. But back then it
didn't work better, which is not aligned with my observation
that we consume only 1 completion per interrupt...

I can give it another go... What do people think about it?
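
Something along these lines, maybe (sketch only; as above, a
nvme_process_cq() that takes a budget and returns the number of reaped
completions is an assumption, not the current mainline signature):

#define NVME_IRQ_INLINE_BUDGET  4

static irqreturn_t nvme_irq(int irq, void *data)
{
        struct nvme_queue *nvmeq = data;
        int done;

        /* reap small batches directly from the hard irq handler... */
        done = nvme_process_cq(nvmeq, NVME_IRQ_INLINE_BUDGET);

        /* ...and only pay for the softirq when there is a real backlog */
        if (done == NVME_IRQ_INLINE_BUDGET) {
                disable_irq_nosync(irq);
                irq_poll_sched(&nvmeq->iop);
        }

        return done ? IRQ_HANDLED : IRQ_NONE;
}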

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
  2017-01-19  8:12                             ` Sagi Grimberg
  (?)
@ 2017-01-19  9:13                               ` Johannes Thumshirn
  -1 siblings, 0 replies; 120+ messages in thread
From: Johannes Thumshirn @ 2017-01-19  9:13 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Hannes Reinecke, Jens Axboe, Christoph Hellwig, Linux-scsi,
	linux-nvme, linux-block, Keith Busch, lsf-pc

On Thu, Jan 19, 2017 at 10:12:17AM +0200, Sagi Grimberg wrote:
> 
> >>>I think you missed:
> >>>http://git.infradead.org/nvme.git/commit/49c91e3e09dc3c9dd1718df85112a8cce3ab7007
> >>
> >>I indeed did, thanks.
> >>
> >But it doesn't help.
> >
> >We're still having to wait for the first interrupt, and if we're really
> >fast that's the only completion we have to process.
> >
> >Try this:
> >
> >
> >diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> >index b4b32e6..e2dd9e2 100644
> >--- a/drivers/nvme/host/pci.c
> >+++ b/drivers/nvme/host/pci.c
> >@@ -623,6 +623,8 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
> >        }
> >        __nvme_submit_cmd(nvmeq, &cmnd);
> >        spin_unlock(&nvmeq->sq_lock);
> >+       disable_irq_nosync(nvmeq_irq(irq));
> >+       irq_poll_sched(&nvmeq->iop);
> 
> a. This would trigger a condition where we disable the irq twice, which
> is wrong, at least because it will generate a warning.
> 
> b. This would cause way too many triggers of ksoftirqd. In order for
> it to be effective we need it to run only when it should, and optimally
> when the completion queue has a batch of completions waiting.
> 
> After a deeper analysis, I agree with Bart that interrupt coalescing is
> needed for it to work. The problem with nvme coalescing, as Jens said, is
> a death penalty of 100us granularity. Hannes, Johannes, how does it look
> with the devices you are testing with?

I haven't had a look at AHCI's Command Completion Coalescing yet, but
hopefully I'll find the time today (+SSD testing!!!).

Don't know if Hannes did (but I _think_ not). The problem is we've already
maxed out our test HW w/o irq_poll, and so the only change we're seeing
currently is an increase in wasted CPU cycles. Not what we wanted to have.

> 
> Also, I think that adaptive moderation is needed in order for it to
> work well. I know that some networking drivers implemented adaptive
> moderation in SW before having HW support for it. It can be done by
> maintaining stats and having a periodic work that looks at it and
> changes the moderation parameters.
> 
> Does anyone think that this is something we should consider?

Yes, we've been discussing this internally as well and it sounds good, but
that's still all pure theory and nothing has actually been implemented and
tested.

Byte,
	Johannes

-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
  2017-01-19  8:23                               ` Sagi Grimberg
  (?)
@ 2017-01-19  9:18                                 ` Johannes Thumshirn
  -1 siblings, 0 replies; 120+ messages in thread
From: Johannes Thumshirn @ 2017-01-19  9:18 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Hannes Reinecke, Jens Axboe, Christoph Hellwig, Linux-scsi,
	linux-nvme, linux-block, Keith Busch, lsf-pc

On Thu, Jan 19, 2017 at 10:23:28AM +0200, Sagi Grimberg wrote:
> Christoph suggested to me once that we could take a hybrid
> approach where we consume a small number of completions (say 4)
> right away from the interrupt handler, and if we have more
> we schedule irq-poll to reap the rest. But back then it
> didn't work better, which is not aligned with my observation
> that we consume only 1 completion per interrupt...
> 
> I can give it another go... What do people think about it?

This could be good.

What's also possible (see the answer to my previous mail) is measuring the
time it takes for a completion to arrive, and if the average time is lower
than the context switch time, just busy loop instead of waiting for the IRQ
to arrive. If it is higher, we can always schedule a timer to hit _before_
the IRQ will likely arrive and start polling. Is this something that sounds
reasonable to you guys as well?
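
Roughly this kind of per-queue decision (a sketch under the assumption that
we track a moving average of the completion latency somewhere; avg_cmpl_ns,
poll_timer, nvme_cqe_pending() and the two constants are made up for
illustration):

static void nvme_wait_for_cqe(struct nvme_queue *nvmeq)
{
        u64 avg_ns = READ_ONCE(nvmeq->avg_cmpl_ns);

        if (avg_ns < CTX_SWITCH_COST_NS) {
                /* cheaper to spin than to sleep and take the interrupt */
                while (!nvme_cqe_pending(nvmeq))
                        cpu_relax();
                nvme_process_cq(nvmeq);
        } else {
                /* arm a timer slightly before the expected completion and
                 * start polling from there instead of waiting for the IRQ */
                hrtimer_start(&nvmeq->poll_timer,
                              ns_to_ktime(avg_ns - POLL_SLACK_NS),
                              HRTIMER_MODE_REL);
        }
}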

	Johannes

-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
  2017-01-11 15:07   ` Jens Axboe
@ 2017-01-19 10:57     ` Ming Lei
  -1 siblings, 0 replies; 120+ messages in thread
From: Ming Lei @ 2017-01-19 10:57 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Johannes Thumshirn, lsf-pc, linux-block, Linux SCSI List,
	Sagi Grimberg, linux-nvme, Christoph Hellwig, Keith Busch

On Wed, Jan 11, 2017 at 11:07 PM, Jens Axboe <axboe@kernel.dk> wrote:
> On 01/11/2017 06:43 AM, Johannes Thumshirn wrote:
>> Hi all,
>>
>> I'd like to attend LSF/MM and would like to discuss polling for block drivers.
>>
>> Currently there is blk-iopoll but it is neither as widely used as NAPI in the
>> networking field and accoring to Sagi's findings in [1] performance with
>> polling is not on par with IRQ usage.
>>
>> On LSF/MM I'd like to whether it is desirable to have NAPI like polling in
>> more block drivers and how to overcome the currently seen performance issues.
>
> It would be an interesting topic to discuss, as it is a shame that blk-iopoll
> isn't used more widely.

I remembered that Keith and I discussed some issues of blk-iopoll:

    http://marc.info/?l=linux-block&m=147576999016407&w=2

which don't seem to have been addressed yet.


Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
  2017-01-19 10:57     ` Ming Lei
@ 2017-01-19 11:03       ` Hannes Reinecke
  -1 siblings, 0 replies; 120+ messages in thread
From: Hannes Reinecke @ 2017-01-19 11:03 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: Johannes Thumshirn, lsf-pc, linux-block, Linux SCSI List,
	Sagi Grimberg, linux-nvme, Christoph Hellwig, Keith Busch

On 01/19/2017 11:57 AM, Ming Lei wrote:
> On Wed, Jan 11, 2017 at 11:07 PM, Jens Axboe <axboe@kernel.dk> wrote:
>> On 01/11/2017 06:43 AM, Johannes Thumshirn wrote:
>>> Hi all,
>>>
>>> I'd like to attend LSF/MM and would like to discuss polling for block drivers.
>>>
>>> Currently there is blk-iopoll but it is neither as widely used as NAPI in the
>>> networking field and accoring to Sagi's findings in [1] performance with
>>> polling is not on par with IRQ usage.
>>>
>>> On LSF/MM I'd like to whether it is desirable to have NAPI like polling in
>>> more block drivers and how to overcome the currently seen performance issues.
>>
>> It would be an interesting topic to discuss, as it is a shame that blk-iopoll
>> isn't used more widely.
> 
> I remembered that Keith and I discussed some issues of blk-iopoll:
> 
>     http://marc.info/?l=linux-block&m=147576999016407&w=2
> 
> which don't seem to have been addressed yet.
> 
That's a different poll.

For some obscure reasons you have a blk-mq-poll function (via
q->mq_ops->poll) and an irqpoll function.
The former is for polling completion of individual block-layer tags, the
latter for polling completions from the hardware instead of relying on
interrupts.

We're discussing the latter, so that thread isn't really applicable here.
However, there have been requests to discuss the former at LSF/MM, too,
so there might be a chance of restarting that discussion.
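
To make the distinction concrete (declarations only, signatures roughly as
of this kernel generation, shown for illustration):

/* blk-mq poll hook: the block layer busy-waits for the completion of one
 * specific tag, e.g. for direct I/O submitted with RWF_HIPRI */
int (*poll)(struct blk_mq_hw_ctx *hctx, unsigned int tag);

/* irq_poll: the driver's interrupt handler schedules a softirq which then
 * drains the hardware completion queue up to a weight/budget */
void irq_poll_init(struct irq_poll *iop, int weight, irq_poll_fn *poll_fn);
void irq_poll_sched(struct irq_poll *iop);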

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
  2017-01-18 14:02               ` Hannes Reinecke
@ 2017-01-20  0:13                 ` Jens Axboe
  -1 siblings, 0 replies; 120+ messages in thread
From: Jens Axboe @ 2017-01-20  0:13 UTC (permalink / raw)
  To: Hannes Reinecke, Sagi Grimberg, Andrey Kuzmin
  Cc: Johannes Thumshirn, lsf-pc, linux-block, Christoph Hellwig,
	Keith Busch, linux-nvme, Linux-scsi

On 01/18/2017 06:02 AM, Hannes Reinecke wrote:
> On 01/17/2017 05:50 PM, Sagi Grimberg wrote:
>>
>>>     So it looks like we are super not efficient, because most of the
>>>     time we catch one completion per interrupt, and the whole point is
>>>     that we need to find more! This fio run is single threaded with
>>>     QD=32, so I'd expect that we'd be somewhere in 8-31 almost all the
>>>     time... I also tried QD=1024; the histogram is still the same.
>>>
>>>
>>> It looks like it takes you longer to submit an I/O than to service an
>>> interrupt,
>>
>> Well, with irq-poll we do practically nothing in the interrupt handler,
>> only schedule irq-poll. Note that the latency is measured only from
>> the point the interrupt arrives to the point we actually service it
>> by polling for completions.
>>
>>> so increasing the queue depth in the single-threaded case doesn't
>>> make much difference. You might want to try multiple threads per core
>>> with a QD of, say, 32
>>
>> This is how I ran, QD=32.
> 
> The one thing which I found _really_ curious is this:
> 
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>      issued    : total=r=7673377/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=256
> 
> (note the lines starting with 'submit' and 'complete').
> They are _always_ 4, irrespective of the hardware and/or tests which I
> run. Jens, what are these numbers supposed to mean?
> Is this intended?

It's bucketized. 0=0.0% means that 0% of the submissions didn't submit
anything (unsurprisingly), and ditto for the complete side. The next bucket
is 1..4, so 100% of submissions and completions were in that range.
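
To illustrate, the bucketing could be computed roughly like this (a sketch
only, not fio's actual code; the bucket boundaries beyond 1..4 are an
assumption read off the labels in the output above):

#include <stdio.h>

/* Sketch of the bucketing described above -- not fio's actual code.
 * Bucket labels as printed: 0, 4, 8, 16, 32, 64, >=64. */
static const int bucket_limit[] = { 0, 4, 8, 16, 32, 64 };

static int to_bucket(int n)
{
        int i;

        for (i = 0; i < 6; i++)
                if (n <= bucket_limit[i])
                        return i;       /* e.g. n in 1..4 lands in the "4" bucket */
        return 6;                       /* overflow bucket, printed as ">=64" */
}

int main(void)
{
        printf("%d %d %d\n", to_bucket(0), to_bucket(3), to_bucket(100)); /* 0 1 6 */
        return 0;
}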

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
@ 2017-01-20  0:13                 ` Jens Axboe
  0 siblings, 0 replies; 120+ messages in thread
From: Jens Axboe @ 2017-01-20  0:13 UTC (permalink / raw)


On 01/18/2017 06:02 AM, Hannes Reinecke wrote:
> On 01/17/2017 05:50 PM, Sagi Grimberg wrote:
>>
>>>     So it looks like we are super not efficient, because most of the
>>>     time we catch one completion per interrupt, and the whole point is
>>>     that we need to find more! This fio run is single threaded with
>>>     QD=32, so I'd expect that we'd be somewhere in 8-31 almost all the
>>>     time... I also tried QD=1024; the histogram is still the same.
>>>
>>>
>>> It looks like it takes you longer to submit an I/O than to service an
>>> interrupt,
>>
>> Well, with irq-poll we do practically nothing in the interrupt handler,
>> only schedule irq-poll. Note that the latency is measured only from
>> the point the interrupt arrives to the point we actually service it
>> by polling for completions.
>>
>>> so increasing the queue depth in the single-threaded case doesn't
>>> make much difference. You might want to try multiple threads per core
>>> with a QD of, say, 32
>>
>> This is how I ran, QD=32.
> 
> The one thing which I found _really_ curious is this:
> 
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>      issued    : total=r=7673377/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=256
> 
> (note the lines starting with 'submit' and 'complete').
> They are _always_ 4, irrespective of the hardware and/or tests which I
> run. Jens, what are these numbers supposed to mean?
> Is this intended?

It's bucketized. 0=0.0% means that 0% of the submissions didn't submit
anything (unsurprisingly), and ditto for the complete side. The next bucket
is 1..4, so 100% of submissions and completions were in that range.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
  2017-01-17 15:45           ` Sagi Grimberg
  (?)
@ 2017-01-20 12:22             ` Johannes Thumshirn
  -1 siblings, 0 replies; 120+ messages in thread
From: Johannes Thumshirn @ 2017-01-20 12:22 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Jens Axboe, lsf-pc, linux-block, Linux-scsi, linux-nvme,
	Christoph Hellwig, Keith Busch

On Tue, Jan 17, 2017 at 05:45:53PM +0200, Sagi Grimberg wrote:
> 
> >--
> >[1]
> >queue = b'nvme0q1'
> >     usecs               : count     distribution
> >         0 -> 1          : 7310 |****************************************|
> >         2 -> 3          : 11       |      |
> >         4 -> 7          : 10       |      |
> >         8 -> 15         : 20       |      |
> >        16 -> 31         : 0        |      |
> >        32 -> 63         : 0        |      |
> >        64 -> 127        : 1        |      |
> >
> >[2]
> >queue = b'nvme0q1'
> >     usecs               : count     distribution
> >         0 -> 1          : 7309 |****************************************|
> >         2 -> 3          : 14       |      |
> >         4 -> 7          : 7        |      |
> >         8 -> 15         : 17       |      |
> >
> 
> Rrr, email made the histograms look funky (tabs vs. spaces...)
> The count is what's important anyways...
> 
> Just adding that I used an Intel P3500 nvme device.
> 
> >We can see that most of the time our latency is pretty good (<1ns) but with
> >huge tail latencies (some 8-15 ns and even one in 32-63 ns).
> 
> Obviously that is micro-seconds and not nano-seconds (I wish...)

So to share yesterday's (and today's) findings:

On AHCI I see only one completion polled as well.

This is probably because, in contrast to networking (with NAPI), the block
layer has a direct link between submission and completion, whereas in
networking RX and TX are decoupled. So if we send out one request, we get
exactly one completion for it.

What we'd need is a link to know "we've sent 10 requests out, now poll for
the 10 completions after the 1st IRQ". So basically what the NVMe driver
already did by calling __nvme_process_cq() after submission. Maybe we should
even disable IRQs when submitting and re-enable them after submitting, so the
submission path doesn't get preempted by a completion.
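
A rough sketch of that idea (illustrative only; all my_* names are
hypothetical helpers and this is not the NVMe driver's code, just the flow
described above):

#include <linux/atomic.h>
#include <linux/interrupt.h>
#include <linux/irqflags.h>

struct my_queue {
        atomic_t inflight;      /* requests submitted but not yet completed */
        /* ... hardware submission/completion queue state ... */
};

/* Submit a batch with local interrupts off, so a completion interrupt
 * cannot preempt the submission path half-way through. */
static void my_submit_batch(struct my_queue *q, void **cmds, int n)
{
        unsigned long flags;
        int i;

        local_irq_save(flags);
        for (i = 0; i < n; i++)
                my_hw_submit(q, cmds[i]);       /* hypothetical helper */
        atomic_add(n, &q->inflight);
        local_irq_restore(flags);
}

/* On the first interrupt, keep reaping until everything we know to be
 * in flight has completed, instead of handling a single completion. */
static irqreturn_t my_irq_handler(int irq, void *data)
{
        struct my_queue *q = data;
        int reaped;

        do {
                reaped = my_hw_reap_completions(q, atomic_read(&q->inflight));
                atomic_sub(reaped, &q->inflight);
        } while (reaped > 0 && atomic_read(&q->inflight) > 0);

        return IRQ_HANDLED;
}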

Does this make sense?

Byte,
	Johannes

-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
@ 2017-01-20 12:22             ` Johannes Thumshirn
  0 siblings, 0 replies; 120+ messages in thread
From: Johannes Thumshirn @ 2017-01-20 12:22 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Jens Axboe, lsf-pc, linux-block, Linux-scsi, linux-nvme,
	Christoph Hellwig, Keith Busch

On Tue, Jan 17, 2017 at 05:45:53PM +0200, Sagi Grimberg wrote:
> 
> >--
> >[1]
> >queue = b'nvme0q1'
> >     usecs               : count     distribution
> >         0 -> 1          : 7310 |****************************************|
> >         2 -> 3          : 11       |      |
> >         4 -> 7          : 10       |      |
> >         8 -> 15         : 20       |      |
> >        16 -> 31         : 0        |      |
> >        32 -> 63         : 0        |      |
> >        64 -> 127        : 1        |      |
> >
> >[2]
> >queue = b'nvme0q1'
> >     usecs               : count     distribution
> >         0 -> 1          : 7309 |****************************************|
> >         2 -> 3          : 14       |      |
> >         4 -> 7          : 7        |      |
> >         8 -> 15         : 17       |      |
> >
> 
> Rrr, email made the histograms look funky (tabs vs. spaces...)
> The count is what's important anyways...
> 
> Just adding that I used an Intel P3500 nvme device.
> 
> >We can see that most of the time our latency is pretty good (<1ns) but with
> >huge tail latencies (some 8-15 ns and even one in 32-63 ns).
> 
> Obviously that is micro-seconds and not nano-seconds (I wish...)

So to share yesterday's (and today's) findings:

On AHCI I see only one completion polled as well.

This is probably because, in contrast to networking (with NAPI), the block
layer has a direct link between submission and completion, whereas in
networking RX and TX are decoupled. So if we send out one request, we get
exactly one completion for it.

What we'd need is a link to know "we've sent 10 requests out, now poll for
the 10 completions after the 1st IRQ". So basically what the NVMe driver
already did by calling __nvme_process_cq() after submission. Maybe we should
even disable IRQs when submitting and re-enable them after submitting, so the
submission path doesn't get preempted by a completion.

Does this make sense?

Byte,
	Johannes

-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
@ 2017-01-20 12:22             ` Johannes Thumshirn
  0 siblings, 0 replies; 120+ messages in thread
From: Johannes Thumshirn @ 2017-01-20 12:22 UTC (permalink / raw)


On Tue, Jan 17, 2017 at 05:45:53PM +0200, Sagi Grimberg wrote:
> 
> >--
> >[1]
> >queue = b'nvme0q1'
> >     usecs               : count     distribution
> >         0 -> 1          : 7310 |****************************************|
> >         2 -> 3          : 11       |      |
> >         4 -> 7          : 10       |      |
> >         8 -> 15         : 20       |      |
> >        16 -> 31         : 0        |      |
> >        32 -> 63         : 0        |      |
> >        64 -> 127        : 1        |      |
> >
> >[2]
> >queue = b'nvme0q1'
> >     usecs               : count     distribution
> >         0 -> 1          : 7309 |****************************************|
> >         2 -> 3          : 14       |      |
> >         4 -> 7          : 7        |      |
> >         8 -> 15         : 17       |      |
> >
> 
> Rrr, email made the histograms look funky (tabs vs. spaces...)
> The count is what's important anyways...
> 
> Just adding that I used an Intel P3500 nvme device.
> 
> >We can see that most of the time our latency is pretty good (<1ns) but with
> >huge tail latencies (some 8-15 ns and even one in 32-63 ns).
> 
> Obviously that is micro-seconds and not nano-seconds (I wish...)

So to share yesterday's (and today's) findings:

On AHCI I see only one completion polled as well.

This is probably because, in contrast to networking (with NAPI), the block
layer has a direct link between submission and completion, whereas in
networking RX and TX are decoupled. So if we send out one request, we get
exactly one completion for it.

What we'd need is a link to know "we've sent 10 requests out, now poll for
the 10 completions after the 1st IRQ". So basically what the NVMe driver
already did by calling __nvme_process_cq() after submission. Maybe we should
even disable IRQs when submitting and re-enable them after submitting, so the
submission path doesn't get preempted by a completion.

Does this make sense?

Byte,
	Johannes

-- 
Johannes Thumshirn                                          Storage
jthumshirn at suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 120+ messages in thread

end of thread, other threads:[~2017-01-20 12:43 UTC | newest]

Thread overview: 120+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-01-11 13:43 [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers Johannes Thumshirn
2017-01-11 13:43 ` Johannes Thumshirn
2017-01-11 13:43 ` Johannes Thumshirn
2017-01-11 13:46 ` Hannes Reinecke
2017-01-11 13:46   ` Hannes Reinecke
2017-01-11 13:46   ` Hannes Reinecke
2017-01-11 15:07 ` Jens Axboe
2017-01-11 15:07   ` Jens Axboe
2017-01-11 15:13   ` Jens Axboe
2017-01-11 15:13     ` Jens Axboe
2017-01-12  8:23     ` Sagi Grimberg
2017-01-12  8:23       ` Sagi Grimberg
2017-01-12 10:02       ` Johannes Thumshirn
2017-01-12 10:02         ` Johannes Thumshirn
2017-01-12 10:02         ` Johannes Thumshirn
2017-01-12 11:44         ` Sagi Grimberg
2017-01-12 11:44           ` Sagi Grimberg
2017-01-12 12:53           ` Johannes Thumshirn
2017-01-12 12:53             ` Johannes Thumshirn
2017-01-12 12:53             ` Johannes Thumshirn
2017-01-12 14:41             ` [Lsf-pc] " Sagi Grimberg
2017-01-12 14:41               ` Sagi Grimberg
2017-01-12 18:59               ` Johannes Thumshirn
2017-01-12 18:59                 ` Johannes Thumshirn
2017-01-12 18:59                 ` Johannes Thumshirn
2017-01-17 15:38       ` Sagi Grimberg
2017-01-17 15:38         ` Sagi Grimberg
2017-01-17 15:45         ` Sagi Grimberg
2017-01-17 15:45           ` Sagi Grimberg
2017-01-20 12:22           ` Johannes Thumshirn
2017-01-20 12:22             ` Johannes Thumshirn
2017-01-20 12:22             ` Johannes Thumshirn
2017-01-17 16:15         ` Sagi Grimberg
2017-01-17 16:15           ` Sagi Grimberg
2017-01-17 16:27           ` Johannes Thumshirn
2017-01-17 16:27             ` Johannes Thumshirn
2017-01-17 16:27             ` Johannes Thumshirn
2017-01-17 16:38             ` Sagi Grimberg
2017-01-17 16:38               ` Sagi Grimberg
2017-01-18 13:51               ` Johannes Thumshirn
2017-01-18 13:51                 ` Johannes Thumshirn
2017-01-18 13:51                 ` Johannes Thumshirn
2017-01-18 14:27                 ` Sagi Grimberg
2017-01-18 14:27                   ` Sagi Grimberg
2017-01-18 14:36                   ` Andrey Kuzmin
2017-01-18 14:36                     ` Andrey Kuzmin
2017-01-18 14:40                     ` Sagi Grimberg
2017-01-18 14:40                       ` Sagi Grimberg
2017-01-18 15:35                       ` Andrey Kuzmin
2017-01-18 15:35                         ` Andrey Kuzmin
2017-01-18 14:58                   ` Johannes Thumshirn
2017-01-18 14:58                     ` Johannes Thumshirn
2017-01-18 14:58                     ` Johannes Thumshirn
2017-01-18 15:14                     ` Sagi Grimberg
2017-01-18 15:14                       ` Sagi Grimberg
2017-01-18 15:16                       ` Johannes Thumshirn
2017-01-18 15:16                         ` Johannes Thumshirn
2017-01-18 15:16                         ` Johannes Thumshirn
2017-01-18 15:39                         ` Hannes Reinecke
2017-01-18 15:39                           ` Hannes Reinecke
2017-01-18 15:39                           ` Hannes Reinecke
2017-01-19  8:12                           ` Sagi Grimberg
2017-01-19  8:12                             ` Sagi Grimberg
2017-01-19  8:23                             ` Sagi Grimberg
2017-01-19  8:23                               ` Sagi Grimberg
2017-01-19  9:18                               ` Johannes Thumshirn
2017-01-19  9:18                                 ` Johannes Thumshirn
2017-01-19  9:18                                 ` Johannes Thumshirn
2017-01-19  9:13                             ` Johannes Thumshirn
2017-01-19  9:13                               ` Johannes Thumshirn
2017-01-19  9:13                               ` Johannes Thumshirn
2017-01-17 16:44         ` Andrey Kuzmin
2017-01-17 16:50           ` Sagi Grimberg
2017-01-17 16:50             ` Sagi Grimberg
2017-01-18 14:02             ` Hannes Reinecke
2017-01-18 14:02               ` Hannes Reinecke
2017-01-20  0:13               ` Jens Axboe
2017-01-20  0:13                 ` Jens Axboe
2017-01-13 15:56     ` Johannes Thumshirn
2017-01-13 15:56       ` Johannes Thumshirn
2017-01-13 15:56       ` Johannes Thumshirn
2017-01-11 15:16   ` Hannes Reinecke
2017-01-11 15:16     ` Hannes Reinecke
2017-01-11 15:16     ` Hannes Reinecke
2017-01-12  4:36   ` Stephen Bates
2017-01-12  4:44     ` Jens Axboe
2017-01-12  4:44       ` Jens Axboe
2017-01-12  4:56       ` Stephen Bates
2017-01-12  4:56         ` Stephen Bates
2017-01-19 10:57   ` Ming Lei
2017-01-19 10:57     ` Ming Lei
2017-01-19 11:03     ` Hannes Reinecke
2017-01-19 11:03       ` Hannes Reinecke
2017-01-11 16:08 ` Bart Van Assche
2017-01-11 16:08   ` Bart Van Assche
2017-01-11 16:08   ` Bart Van Assche
2017-01-11 16:12   ` hch
2017-01-11 16:12     ` hch
2017-01-11 16:15     ` Jens Axboe
2017-01-11 16:15       ` Jens Axboe
2017-01-11 16:22     ` Hannes Reinecke
2017-01-11 16:22       ` Hannes Reinecke
2017-01-11 16:22       ` Hannes Reinecke
2017-01-11 16:26       ` Bart Van Assche
2017-01-11 16:26         ` Bart Van Assche
2017-01-11 16:26         ` Bart Van Assche
2017-01-11 16:45         ` Hannes Reinecke
2017-01-11 16:45           ` Hannes Reinecke
2017-01-11 16:45           ` Hannes Reinecke
2017-01-12  8:52         ` sagi grimberg
2017-01-12  8:52           ` sagi grimberg
2017-01-11 16:14   ` Johannes Thumshirn
2017-01-11 16:14     ` Johannes Thumshirn
2017-01-11 16:14     ` Johannes Thumshirn
2017-01-12  8:41   ` Sagi Grimberg
2017-01-12  8:41     ` Sagi Grimberg
2017-01-12  8:41     ` Sagi Grimberg
2017-01-12 19:13     ` Bart Van Assche
2017-01-12 19:13       ` Bart Van Assche
2017-01-12 19:13       ` Bart Van Assche
