* Re: [SPDK] Request for comments regarding latency measurements
@ 2017-11-14 14:56 Luse, Paul E
  0 siblings, 0 replies; 22+ messages in thread
From: Luse, Paul E @ 2017-11-14 14:56 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 2190 bytes --]

FWIW option 2 sounds like the most appropriate to me...

Thx
Paul

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Paul Von-Stamwitz
Sent: Monday, November 13, 2017 7:29 PM
To: spdk(a)lists.01.org
Subject: [SPDK] Request for comments regarding latency measurements

Background:
Bdev.c currently maintains a count of reads/writes/bytes_read/bytes_written in channel->stat. This information is retrieved (and reset) via spdk_bdev_get_io_stat.

Proposal:
Add latency information to channel->stat and enable the option to provide a histogram.
We can measure the latency of each IO and keep a running total on a read/write basis. We can also use the measured latency to keep a running count of reads/writes in their associated histogram "buckets".
The question is, how do we measure latency?

Option 1:
Measure latency at the NVMe tracker.
Currently, we already timestamp every command placed on the tracker if the abort timeout callback is registered. When we remove a completed IO from the tracker, we can timestamp it again and calculate the latency.
There are several issues here that need to be considered.
We need to get the latency information back up to the bdev layer, most likely through a callback argument, but this would require a change to the NVMe API. If the bdev breaks down a request into smaller IOs, it can add up the latencies of the child IOs for io_stat and histogram purposes.
Also, this method does not take into account any latency added by the SPDK bdev/NVMe layers (except the poller). If a request was queued before being placed on the tracker, then the time it was queued is not factored into the latency calculation.
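
To make the data flow concrete, here is a minimal sketch of what Option 1 implies (the struct, field, and callback names below are purely illustrative, not the actual SPDK NVMe driver internals):

#include <stdint.h>

struct tracker_sketch {
        uint64_t submit_tick;   /* taken when the command is placed on the tracker */
        void (*latency_cb)(void *cb_arg, uint64_t latency_ticks);
        void *cb_arg;
};

/* On submit: record the tick, as is already done today when the timeout callback is registered. */
static void tracker_submit(struct tracker_sketch *tr, uint64_t now)
{
        tr->submit_tick = now;
}

/* On completion: compute the delta and hand it back up.  The extra callback
 * argument is exactly the NVMe API change mentioned above. */
static void tracker_complete(struct tracker_sketch *tr, uint64_t now)
{
        if (tr->latency_cb != NULL) {
                tr->latency_cb(tr->cb_arg, now - tr->submit_tick);
        }
}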

Option 2:
Measure latency at the bdev layer.
We can timestamp at submit and again at completion. This would keep all io_stat information local to bdev and would take into account the overhead of most queued operations. Any applications written directly to the NVMe layer would have to calculate their own latencies, but that is currently true for all io_stats.
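
For illustration, a minimal sketch of Option 2, including the histogram update from the proposal above (LATENCY_BUCKETS, the stat fields, and the helper name are assumptions for this sketch, not current bdev internals; spdk_get_ticks() is the existing SPDK tick counter from spdk/env.h):

#include <stdbool.h>
#include <stdint.h>
#include "spdk/env.h"

#define LATENCY_BUCKETS 32

struct latency_stat_sketch {
        uint64_t read_latency_ticks;
        uint64_t write_latency_ticks;
        uint64_t read_histogram[LATENCY_BUCKETS];   /* bucket N counts IOs with latency in [2^N, 2^(N+1)) ticks */
        uint64_t write_histogram[LATENCY_BUCKETS];
};

static void record_latency(struct latency_stat_sketch *stat, uint64_t submit_tick, bool is_read)
{
        uint64_t latency = spdk_get_ticks() - submit_tick;
        unsigned int bucket = 0;

        while ((latency >> (bucket + 1)) != 0 && bucket < LATENCY_BUCKETS - 1) {
                bucket++;
        }

        if (is_read) {
                stat->read_latency_ticks += latency;
                stat->read_histogram[bucket]++;
        } else {
                stat->write_latency_ticks += latency;
                stat->write_histogram[bucket]++;
        }
}

The submit side would just stash spdk_get_ticks() on the bdev_io when it is submitted and pass that value in at completion time.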

I'm sure that there are other issues I am missing, but I would appreciate any comments on how best to move forward on this.

Thanks,
Paul




* Re: [SPDK] Request for comments regarding latency measurements
@ 2017-12-08  1:05 Sreeni Busam
  0 siblings, 0 replies; 22+ messages in thread
From: Sreeni Busam @ 2017-12-08  1:05 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 11161 bytes --]

Thanks Isaac.

From: Isaac Otsiabah [mailto:IOtsiabah(a)us.fujitsu.com]
Sent: Thursday, December 7, 2017 4:46 PM
To: Storage Performance Development Kit <spdk(a)lists.01.org>; Paul Von-Stamwitz <PVonStamwitz(a)us.fujitsu.com>; Sreeni (Sreenivasa) Busam (Stellus) <s.busam(a)stellus.com>
Cc: Edward Yang <eyang(a)us.fujitsu.com>; Isaac Otsiabah <IOtsiabah(a)us.fujitsu.com>
Subject: RE: [SPDK] Request for comments regarding latency measurements

Hi Sreeni, I work for Paul. We used fio_plugin under bdev as the ioengine. We also modified spdk_fio_cleanup_thread(..) in examples/bdev/fio_plugin/fio_plugin.c to print out the stats in the for loop.

I.e.
spdk_bdev_get_io_stat(target->bdev, target->ch, &stat);
fprintf(stdout, "read_latency:  %f\n", (double)stat.read_latency_ticks / (stat.ticks_us_rate * stat.num_read_ops));
fprintf(stdout, "write_latency: %f\n", (double)stat.write_latency_ticks / (stat.ticks_us_rate * stat.num_write_ops));

Isaac
From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Sreeni (Sreenivasa) Busam (Stellus)
Sent: Thursday, December 07, 2017 3:07 PM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>; Paul Von-Stamwitz <PVonStamwitz(a)us.fujitsu.com<mailto:PVonStamwitz(a)us.fujitsu.com>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

Hi Paul,

Did you use the same fio_plugin code in the SPDK directory to test your SPDK changes, or did you use a different application? Please share the code for the test application if you made any changes to test the bdev latency.
If there was any testing with an NVMe-oF configuration with your changes, please let me know what performance you got.

Thanks a lot,
Sreeni

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Sreeni (Sreenivasa) Busam (Stellus)
Sent: Wednesday, December 6, 2017 3:48 PM
To: Paul Von-Stamwitz <PVonStamwitz(a)us.fujitsu.com<mailto:PVonStamwitz(a)us.fujitsu.com>>; Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

Thanks Paul. It helps in understanding the changes.

From: Paul Von-Stamwitz [mailto:PVonStamwitz(a)us.fujitsu.com]
Sent: Wednesday, December 6, 2017 3:43 PM
To: Sreeni (Sreenivasa) Busam (Stellus) <s.busam(a)stellus.com<mailto:s.busam(a)stellus.com>>; Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: RE: [SPDK] Request for comments regarding latency measurements

Hi Sreeni,

The first two columns are fio numbers, where clat is the average time between submission to bdev and completion. The third column is calculated from spdk_bdev_get_io_stat, which returns:

struct spdk_bdev_io_stat {
                uint64_t bytes_read;
                uint64_t num_read_ops;
                uint64_t bytes_written;
                uint64_t num_write_ops;
                uint64_t read_latency_ticks;
                uint64_t write_latency_ticks;
                uint64_t ticks_us_rate;
};

All numbers in spdk_bdev_io_stat are cumulative. To get the cumulative latency in usec, divide (read/write)_latency_ticks by ticks_us_rate. Then, divide that by the num_(read/write)_ops to get the average latency per command.
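
As a small sketch, the two-step conversion above written as a helper (the field names are from the struct shown; the zero checks just avoid dividing by zero on an idle channel):

#include <stdint.h>

static double avg_latency_us(uint64_t latency_ticks, uint64_t ticks_us_rate, uint64_t num_ops)
{
        if (num_ops == 0 || ticks_us_rate == 0) {
                return 0.0;
        }
        /* cumulative usec = ticks / ticks_us_rate; average = cumulative usec / ops */
        return (double)latency_ticks / (double)ticks_us_rate / (double)num_ops;
}

e.g. avg_latency_us(stat.read_latency_ticks, stat.ticks_us_rate, stat.num_read_ops) gives the per-read average in microseconds, which is what the bdev latency column below reports.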

Best regards,
Paul

From: Sreeni (Sreenivasa) Busam (Stellus) [mailto:s.busam(a)stellus.com]
Sent: Wednesday, December 06, 2017 3:22 PM
To: Paul Von-Stamwitz; Storage Performance Development Kit
Subject: RE: [SPDK] Request for comments regarding latency measurements

Hi Paul,

Thanks a lot for submitting the patch.
The “clat” latency is the difference between I/O completion ticks and submission ticks for an I/O request, is that correct?
Please let me know how you calculated bdev latency for a request.

Sreeni

From: Paul Von-Stamwitz [mailto:PVonStamwitz(a)us.fujitsu.com]
Sent: Wednesday, December 6, 2017 3:10 PM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Cc: Sreeni (Sreenivasa) Busam (Stellus) <s.busam(a)stellus.com<mailto:s.busam(a)stellus.com>>
Subject: RE: [SPDK] Request for comments regarding latency measurements

Hi Sreeni,

We submitted a PR for the latency measurements (change 390654<https://review.gerrithub.io/390654>).

I tried to add you as a reviewer, but your name did not come up.

We tested this against the fio_plugin for bdev and the numbers matched well.

Units = microseconds

Test run#       op      fio clat    fio avg latency    bdev latency
1 Qdepth(2)     write       7.80           8.52              8.56
                read       95.36          96.06             97.138
2 Qdepth(4)     write       7.98           8.70              8.32
                read      133.88         134.59            128.85
3 Qdepth(8)     write       8.83           9.85             10.87
                read      175.61         176.48            180.66
4 Qdepth(16)    write       9.79          10.81             10.282
                read      240.71         241.6             236.913
5 Qdepth(32)    write      11.87          12.88             12.384
                read      329.8          330.67            327.648
6 Qdepth(64)    write      20.64          21                20.707
                read      471.02         471.91            467.118
7 Qdepth(128)   write     187.53         188.57            182.92
                read      704.93         705.81            697.49


Best regards,
Paul

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Sreeni (Sreenivasa) Busam (Stellus)
Sent: Tuesday, November 14, 2017 11:21 AM
To: Storage Performance Development Kit
Subject: Re: [SPDK] Request for comments regarding latency measurements

Hi Paul,
That would be great.
Please add me as the reviewer for this task. It would be really helpful.

Thanks,
Sreeni

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Paul Von-Stamwitz
Sent: Tuesday, November 14, 2017 10:46 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

Hi Sreeni,

Since we have consensus on Option #2, we do plan on submitting a patch for it. We can certainly include you as a reviewer if you like.

-Paul

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Sreeni (Sreenivasa) Busam (Stellus)
Sent: Tuesday, November 14, 2017 10:28 AM
To: Storage Performance Development Kit
Subject: Re: [SPDK] Request for comments regarding latency measurements

Hi Jim,

I am not an SPDK driver expert, but I have been working on the same problem for the last 2 weeks, and in my investigation I came to the same conclusion as you wrote in the email: option 2 looks like an easy and good solution.
The bdev interfaces are well developed for NVMe devices and look similar to the Linux kernel storage driver, except for the polling vs. interrupt differences in getting completion status. These are just my comments based on what I have understood so far.
I have already been working on collecting I/O statistics for applications at the NVMe layer. For any applications using the bdev layer, I would like to use your changes. When are you planning to submit the patch to the repository?
Please let me know when you complete your investigation.

Thanks,
Sreeni

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Harris, James R
Sent: Tuesday, November 14, 2017 7:43 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

Agreed - #2 is the way to go.

While we’re on the topic - should we be resetting the stats when spdk_bdev_get_io_stat() is called?  This eliminates the ability for two separate applications to monitor stats in parallel – or even a separate application plus some future yet-to-be-written internal load-balancing monitor.  I’m thinking that bdev should just keep the running total and let the caller own tracking differences from the last time it called spdk_bdev_get_io_stat().
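
A sketch of the caller-side bookkeeping this implies, using the cumulative counters (the stat struct is the one discussed in this thread; the helper name and the static snapshot are made up for illustration):

#include <stdio.h>

static struct spdk_bdev_io_stat prev_stat;  /* one snapshot per monitoring consumer */

static void report_interval(const struct spdk_bdev_io_stat *cur)
{
        printf("reads since last poll: %lu, read latency ticks since last poll: %lu\n",
               (unsigned long)(cur->num_read_ops - prev_stat.num_read_ops),
               (unsigned long)(cur->read_latency_ticks - prev_stat.read_latency_ticks));
        prev_stat = *cur;   /* remember this snapshot for the next interval */
}

Each consumer keeps its own prev_stat, so any number of monitors can poll without resetting anything in the bdev layer.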

-Jim


From: SPDK <spdk-bounces(a)lists.01.org<mailto:spdk-bounces(a)lists.01.org>> on behalf of Nathan Marushak <nathan.marushak(a)intel.com<mailto:nathan.marushak(a)intel.com>>
Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Date: Tuesday, November 14, 2017 at 7:29 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

So, not being the technical expert so to speak ☺, please take this with a grain of salt. Agree with option #2. Seems like this would allow for keeping stats across the different device types e.g. NVMe, NVML, NVMe-oF (although that has a transport within NVMe, so it might be covered for either option).

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Luse, Paul E
Sent: Tuesday, November 14, 2017 7:57 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

FWIW option 2 sounds like the most appropriate to me…

Thx
Paul

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Paul Von-Stamwitz
Sent: Monday, November 13, 2017 7:29 PM
To: spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>
Subject: [SPDK] Request for comments regarding latency measurements

Background:
Bdev.c currently maintains a count of reads/writes/bytes_read/bytes_written in channel->stat. This information is retrieved (and reset) via spdk_bdev_get_io_stat.

Proposal:
Add latency information to channel->stat and enable the option to provide a histogram.
We can measure the latency of each IO and keep a running total on a read/write basis. We can also use the measured latency to keep a running count of reads/writes in their associated histogram “buckets”.
The question is, how do we measure latency?

Option 1:
Measure latency at the NVMe tracker.
Currently, we already timestamp every command placed on the tracker if the abort timeout callback is registered. When we remove a completed IO from the tracker, we can timestamp it again and calculate the latency.
There are several issues here that need to be considered.
We need to get the latency information back up to the bdev layer, most likely through a callback argument, but this would require a change to the NVMe API. If the bdev breaks down a request into smaller IOs, it can add up the latencies of the child IOs for io_stat and histogram purposes.
Also, this method does not take into account any latency added by the SPDK bdev/NVMe layers (except the poller). If a request was queued before being placed on the tracker, then the time it was queued is not factored into the latency calculation.

Option 2:
Measure latency at the bdev layer.
We can timestamp at submit and again at completion. This would keep all io_stat information local to bdev and would take into account the overhead of most queued operations. Any applications written directly to the NVMe layer would have to calculate their own latencies, but that is currently true for all io_stats.

I’m sure that there are other issues I am missing, but I would appreciate any comments on how best to move forward on this.

Thanks,
Paul




* Re: [SPDK] Request for comments regarding latency measurements
@ 2017-12-08  0:45 Isaac Otsiabah
  0 siblings, 0 replies; 22+ messages in thread
From: Isaac Otsiabah @ 2017-12-08  0:45 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 10641 bytes --]

Hi Sreeni, I work for Paul. We used fio_plugin under bdev as the ioengine. We also modified spdk_fio_cleanup_thread(..) in examples/bdev/fio_plugin/fio_plugin.c to print out the stats in the for loop.

I.e.
spdk_bdev_get_io_stat(target->bdev, target->ch, &stat);
fprintf(stdout, "read_latency:  %f\n", (double)stat.read_latency_ticks / (stat.ticks_us_rate * stat.num_read_ops));
fprintf(stdout, "write_latency: %f\n", (double)stat.write_latency_ticks / (stat.ticks_us_rate * stat.num_write_ops));

Isaac
From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Sreeni (Sreenivasa) Busam (Stellus)
Sent: Thursday, December 07, 2017 3:07 PM
To: Storage Performance Development Kit <spdk(a)lists.01.org>; Paul Von-Stamwitz <PVonStamwitz(a)us.fujitsu.com>
Subject: Re: [SPDK] Request for comments regarding latency measurements

Hi Paul,

Did you use the same fio_plugin code in the SPDK directory to test your SPDK changes, or did you use a different application? Please share the code for the test application if you made any changes to test the bdev latency.
If there was any testing with an NVMe-oF configuration with your changes, please let me know what performance you got.

Thanks a lot,
Sreeni

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Sreeni (Sreenivasa) Busam (Stellus)
Sent: Wednesday, December 6, 2017 3:48 PM
To: Paul Von-Stamwitz <PVonStamwitz(a)us.fujitsu.com<mailto:PVonStamwitz(a)us.fujitsu.com>>; Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

Thanks Paul. It helps in understanding the changes.

From: Paul Von-Stamwitz [mailto:PVonStamwitz(a)us.fujitsu.com]
Sent: Wednesday, December 6, 2017 3:43 PM
To: Sreeni (Sreenivasa) Busam (Stellus) <s.busam(a)stellus.com<mailto:s.busam(a)stellus.com>>; Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: RE: [SPDK] Request for comments regarding latency measurements

Hi Sreeni,

The first two columns are fio numbers, where clat is the average time between submission to bdev and completion. The third column is calculated from spdk_bdev_get_io_stat, which returns:

struct spdk_bdev_io_stat {
                uint64_t bytes_read;
                uint64_t num_read_ops;
                uint64_t bytes_written;
                uint64_t num_write_ops;
                uint64_t read_latency_ticks;
                uint64_t write_latency_ticks;
                uint64_t ticks_us_rate;
};

All numbers in spdk_bdev_io_stat are cumulative. To get the cumulative latency in usec, divide (read/write)_latency_ticks by ticks_us_rate. Then, divide that by the num_(read/write)_ops to get the average latency per command.

Best regards,
Paul

From: Sreeni (Sreenivasa) Busam (Stellus) [mailto:s.busam(a)stellus.com]
Sent: Wednesday, December 06, 2017 3:22 PM
To: Paul Von-Stamwitz; Storage Performance Development Kit
Subject: RE: [SPDK] Request for comments regarding latency measurements

Hi Paul,

Thanks a lot for submitting the patch.
The “clat” latency is the difference between I/O completion ticks and submission ticks for an I/O request, is that correct?
Please let me know how you calculated bdev latency for a request.

Sreeni

From: Paul Von-Stamwitz [mailto:PVonStamwitz(a)us.fujitsu.com]
Sent: Wednesday, December 6, 2017 3:10 PM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Cc: Sreeni (Sreenivasa) Busam (Stellus) <s.busam(a)stellus.com<mailto:s.busam(a)stellus.com>>
Subject: RE: [SPDK] Request for comments regarding latency measurements

Hi Sreeni,

We submitted a PR for the latency measurements (change 390654<https://review.gerrithub.io/390654>).

I tried to add you as a reviewer, but your name did not come up.

We tested this against the fio_plugin for bdev and the numbers matched well.

Units = microseconds

Test run#       op      fio clat    fio avg latency    bdev latency
1 Qdepth(2)     write       7.80           8.52              8.56
                read       95.36          96.06             97.138
2 Qdepth(4)     write       7.98           8.70              8.32
                read      133.88         134.59            128.85
3 Qdepth(8)     write       8.83           9.85             10.87
                read      175.61         176.48            180.66
4 Qdepth(16)    write       9.79          10.81             10.282
                read      240.71         241.6             236.913
5 Qdepth(32)    write      11.87          12.88             12.384
                read      329.8          330.67            327.648
6 Qdepth(64)    write      20.64          21                20.707
                read      471.02         471.91            467.118
7 Qdepth(128)   write     187.53         188.57            182.92
                read      704.93         705.81            697.49


Best regards,
Paul

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Sreeni (Sreenivasa) Busam (Stellus)
Sent: Tuesday, November 14, 2017 11:21 AM
To: Storage Performance Development Kit
Subject: Re: [SPDK] Request for comments regarding latency measurements

Hi Paul,
That would be great.
Please add me as the reviewer for this task. It would be really helpful.

Thanks,
Sreeni

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Paul Von-Stamwitz
Sent: Tuesday, November 14, 2017 10:46 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

Hi Sreeni,

Since we have consensus on Option #2, we do plan on submitting a patch for it. We can certainly include you as a reviewer if you like.

-Paul

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Sreeni (Sreenivasa) Busam (Stellus)
Sent: Tuesday, November 14, 2017 10:28 AM
To: Storage Performance Development Kit
Subject: Re: [SPDK] Request for comments regarding latency measurements

Hi Jim,

I am not an SPDK driver expert, but I have been working on the same problem for the last 2 weeks, and in my investigation I came to the same conclusion as you wrote in the email: option 2 looks like an easy and good solution.
The bdev interfaces are well developed for NVMe devices and look similar to the Linux kernel storage driver, except for the polling vs. interrupt differences in getting completion status. These are just my comments based on what I have understood so far.
I have already been working on collecting I/O statistics for applications at the NVMe layer. For any applications using the bdev layer, I would like to use your changes. When are you planning to submit the patch to the repository?
Please let me know when you complete your investigation.

Thanks,
Sreeni

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Harris, James R
Sent: Tuesday, November 14, 2017 7:43 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

Agreed - #2 is the way to go.

While we’re on the topic - should we be resetting the stats when spdk_bdev_get_io_stat() is called?  This eliminates the ability for two separate applications to monitor stats in parallel – or even a separate application plus some future yet-to-be-written internal load-balancing monitor.  I’m thinking that bdev should just keep the running total and let the caller own tracking differences from the last time it called spdk_bdev_get_io_stat().

-Jim


From: SPDK <spdk-bounces(a)lists.01.org<mailto:spdk-bounces(a)lists.01.org>> on behalf of Nathan Marushak <nathan.marushak(a)intel.com<mailto:nathan.marushak(a)intel.com>>
Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Date: Tuesday, November 14, 2017 at 7:29 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

So, not being the technical expert so to speak ☺, please take this with a grain of salt. Agree with option #2. Seems like this would allow for keeping stats across the different device types e.g. NVMe, NVML, NVMe-oF (although that has a transport within NVMe, so it might be covered for either option).

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Luse, Paul E
Sent: Tuesday, November 14, 2017 7:57 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

FWIW option 2 sounds like the most appropriate to me…

Thx
Paul

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Paul Von-Stamwitz
Sent: Monday, November 13, 2017 7:29 PM
To: spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>
Subject: [SPDK] Request for comments regarding latency measurements

Background:
Bdev.c currently maintains a count of reads/writes/bytes_read/bytes_written in channel->stat. This information is retrieved (and reset) via spdk_bdev_get_io_stat.

Proposal:
Add latency information to channel->stat and enable the option to provide a histogram.
We can measure the latency of each IO and keep a running total on a read/write basis. We can also use the measured latency to keep a running count of reads/writes in their associated histogram “buckets”.
The question is, how do we measure latency?

Option 1:
Measure latency at the NVMe tracker.
Currently, we already timestamp every command placed on the tracker if the abort timeout callback is registered. When we remove a completed IO from the tracker, we can timestamp it again and calculate the latency.
There are several issues here that need to be considered.
We need to get the latency information back up to the bdev layer, most likely through a callback argument, but this would require a change to the NVMe API. If the bdev breaks down a request into smaller IOs, it can add up the latencies of the child IOs for io_stat and histogram purposes.
Also, this method does not take into account any latency added by the SPDK bdev/NVMe layers (except the poller). If a request was queued before being placed on the tracker, then the time it was queued is not factored into the latency calculation.

Option 2:
Measure latency at the bdev layer.
We can timestamp at submit and again at completion. This would keep all io_stat information local to bdev and would take into account the overhead of most queued operations. Any applications written directly to the NVMe layer would have to calculate their own latencies, but that is currently true for all io_stats.

I’m sure that there are other issues I am missing, but I would appreciate any comments on how best to move forward on this.

Thanks,
Paul




* Re: [SPDK] Request for comments regarding latency measurements
@ 2017-12-07 23:06 Sreeni Busam
  0 siblings, 0 replies; 22+ messages in thread
From: Sreeni Busam @ 2017-12-07 23:06 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 9723 bytes --]

Hi Paul,

Did you use the same fio_plugin code in the SPDK directory to test your SPDK changes, or did you use a different application? Please share the code for the test application if you made any changes to test the bdev latency.
If there was any testing with an NVMe-oF configuration with your changes, please let me know what performance you got.

Thanks a lot,
Sreeni

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Sreeni (Sreenivasa) Busam (Stellus)
Sent: Wednesday, December 6, 2017 3:48 PM
To: Paul Von-Stamwitz <PVonStamwitz(a)us.fujitsu.com>; Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: Re: [SPDK] Request for comments regarding latency measurements

Thanks Paul. It helps in understanding the changes.

From: Paul Von-Stamwitz [mailto:PVonStamwitz(a)us.fujitsu.com]
Sent: Wednesday, December 6, 2017 3:43 PM
To: Sreeni (Sreenivasa) Busam (Stellus) <s.busam(a)stellus.com<mailto:s.busam(a)stellus.com>>; Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: RE: [SPDK] Request for comments regarding latency measurements

Hi Sreeni,

The first two columns are fio numbers, where clat is the average time between submission to bdev and completion. The third column is calculated from spdk_bdev_get_io_stat, which returns:

struct spdk_bdev_io_stat {
                uint64_t bytes_read;
                uint64_t num_read_ops;
                uint64_t bytes_written;
                uint64_t num_write_ops;
                uint64_t read_latency_ticks;
                uint64_t write_latency_ticks;
                uint64_t ticks_us_rate;
};

All numbers in spdk_bdev_io_stat are cumulative. To get the cumulative latency in usec, divide (read/write)_latency_ticks by ticks_us_rate. Then, divide that by the num_(read/write)_ops to get the average latency per command.

Best regards,
Paul

From: Sreeni (Sreenivasa) Busam (Stellus) [mailto:s.busam(a)stellus.com]
Sent: Wednesday, December 06, 2017 3:22 PM
To: Paul Von-Stamwitz; Storage Performance Development Kit
Subject: RE: [SPDK] Request for comments regarding latency measurements

Hi Paul,

Thanks a lot for submitting the patch.
The “clat” latency is the difference between I/O completion ticks and submission ticks for an I/O request, is that correct?
Please let me know how you calculated bdev latency for a request.

Sreeni

From: Paul Von-Stamwitz [mailto:PVonStamwitz(a)us.fujitsu.com]
Sent: Wednesday, December 6, 2017 3:10 PM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Cc: Sreeni (Sreenivasa) Busam (Stellus) <s.busam(a)stellus.com<mailto:s.busam(a)stellus.com>>
Subject: RE: [SPDK] Request for comments regarding latency measurements

Hi Sreeni,

We submitted a PR for the latency measurements (change 390654<https://review.gerrithub.io/390654>).

I tried to add you as a reviewer, but your name did not come up.

We tested this against the fio_plugin for bdev and the numbers matched well.

Units = microseconds

Test run#       op      fio clat    fio avg latency    bdev latency
1 Qdepth(2)     write       7.80           8.52              8.56
                read       95.36          96.06             97.138
2 Qdepth(4)     write       7.98           8.70              8.32
                read      133.88         134.59            128.85
3 Qdepth(8)     write       8.83           9.85             10.87
                read      175.61         176.48            180.66
4 Qdepth(16)    write       9.79          10.81             10.282
                read      240.71         241.6             236.913
5 Qdepth(32)    write      11.87          12.88             12.384
                read      329.8          330.67            327.648
6 Qdepth(64)    write      20.64          21                20.707
                read      471.02         471.91            467.118
7 Qdepth(128)   write     187.53         188.57            182.92
                read      704.93         705.81            697.49


Best regards,
Paul

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Sreeni (Sreenivasa) Busam (Stellus)
Sent: Tuesday, November 14, 2017 11:21 AM
To: Storage Performance Development Kit
Subject: Re: [SPDK] Request for comments regarding latency measurements

Hi Paul,
That would be great.
Please add me as the reviewer for this task. It would be really helpful.

Thanks,
Sreeni

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Paul Von-Stamwitz
Sent: Tuesday, November 14, 2017 10:46 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

Hi Sreeni,

Since we have consensus on Option #2, we do plan on submitting a patch for it. We can certainly include you as a reviewer if you like.

-Paul

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Sreeni (Sreenivasa) Busam (Stellus)
Sent: Tuesday, November 14, 2017 10:28 AM
To: Storage Performance Development Kit
Subject: Re: [SPDK] Request for comments regarding latency measurements

Hi Jim,

I am not an SPDK driver expert, but I have been working on the same problem for the last 2 weeks, and in my investigation I came to the same conclusion as you wrote in the email: option 2 looks like an easy and good solution.
The bdev interfaces are well developed for NVMe devices and look similar to the Linux kernel storage driver, except for the polling vs. interrupt differences in getting completion status. These are just my comments based on what I have understood so far.
I have already been working on collecting I/O statistics for applications at the NVMe layer. For any applications using the bdev layer, I would like to use your changes. When are you planning to submit the patch to the repository?
Please let me know when you complete your investigation.

Thanks,
Sreeni

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Harris, James R
Sent: Tuesday, November 14, 2017 7:43 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

Agreed - #2 is the way to go.

While we’re on the topic - should we be resetting the stats when spdk_bdev_get_io_stat() is called?  This eliminates the ability for two separate applications to monitor stats in parallel – or even a separate application plus some future yet-to-be-written internal load-balancing monitor.  I’m thinking that bdev should just keep the running total and let the caller own tracking differences from the last time it called spdk_bdev_get_io_stat().

-Jim


From: SPDK <spdk-bounces(a)lists.01.org<mailto:spdk-bounces(a)lists.01.org>> on behalf of Nathan Marushak <nathan.marushak(a)intel.com<mailto:nathan.marushak(a)intel.com>>
Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Date: Tuesday, November 14, 2017 at 7:29 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

So, not being the technical expert so to speak ☺, please take this with a grain of salt. Agree with option #2. Seems like this would allow for keeping stats across the different device types e.g. NVMe, NVML, NVMe-oF (although that has a transport within NVMe, so it might be covered for either option).

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Luse, Paul E
Sent: Tuesday, November 14, 2017 7:57 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

FWIW option 2 sounds like the most appropriate to me…

Thx
Paul

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Paul Von-Stamwitz
Sent: Monday, November 13, 2017 7:29 PM
To: spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>
Subject: [SPDK] Request for comments regarding latency measurements

Background:
Bdev.c currently maintains a count of reads/writes/bytes_read/bytes_written in channel->stat. This information is retrieved (and reset) via spdk_bdev_get_io_stat.

Proposal:
Add latency information to channel->stat and enable the option to provide a histogram.
We can measure the latency of each IO and keep a running total on a read/write basis. We can also use the measured latency to keep a running count of reads/writes in their associated histogram “buckets”.
The question is, how do we measure latency?

Option 1:
Measure latency at the NVMe tracker.
Currently, we already timestamp every command placed on the tracker if the abort timeout callback is registered. When we remove a completed IO from the tracker, we can timestamp it again and calculate the latency.
There are several issues here that need to be considered.
We need to get the latency information back up to the bdev layer, most likely through a callback argument, but this would require a change to the NVMe API. If the bdev breaks down a request into smaller IOs, it can add up the latencies of the child IOs for io_stat and histogram purposes.
Also, this method does not take into account any latency added by the SPDK bdev/NVMe layers (except the poller). If a request was queued before being placed on the tracker, then the time it was queued is not factored into the latency calculation.

Option 2:
Measure latency at the bdev layer.
We can timestamp at submit and again at completion. This would keep all io_stat information local to bdev and would take into account the overhead of most queued operations. Any applications written directly to the NVMe layer would have to calculate their own latencies, but that is currently true for all io_stats.

I’m sure that there are other issues I am missing, but I would appreciate any comments on how best to move forward on this.

Thanks,
Paul




* Re: [SPDK] Request for comments regarding latency measurements
@ 2017-12-06 23:48 Sreeni Busam
  0 siblings, 0 replies; 22+ messages in thread
From: Sreeni Busam @ 2017-12-06 23:48 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 8967 bytes --]

Thanks Paul. It helps in understanding the changes.

From: Paul Von-Stamwitz [mailto:PVonStamwitz(a)us.fujitsu.com]
Sent: Wednesday, December 6, 2017 3:43 PM
To: Sreeni (Sreenivasa) Busam (Stellus) <s.busam(a)stellus.com>; Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: RE: [SPDK] Request for comments regarding latency measurements

Hi Sreeni,

The first two columns are fio numbers, where clat is the average time between submission to bdev and completion. The third column is calculated from spdk_bdev_get_io_stat, which returns:

struct spdk_bdev_io_stat {
                uint64_t bytes_read;
                uint64_t num_read_ops;
                uint64_t bytes_written;
                uint64_t num_write_ops;
                uint64_t read_latency_ticks;
                uint64_t write_latency_ticks;
                uint64_t ticks_us_rate;
};

All numbers in spdk_bdev_io_stat are cumulative. To get the cumulative latency in usec, divide (read/write)_latency_ticks by ticks_us_rate. Then, divide that by the num_(read/write)_ops to get the average latency per command.

Best regards,
Paul

From: Sreeni (Sreenivasa) Busam (Stellus) [mailto:s.busam(a)stellus.com]
Sent: Wednesday, December 06, 2017 3:22 PM
To: Paul Von-Stamwitz; Storage Performance Development Kit
Subject: RE: [SPDK] Request for comments regarding latency measurements

Hi Paul,

Thanks a lot for submitting the patch.
The “clat” latency is the difference between I/O completion ticks and submission ticks for an I/O request, is that correct?
Please let me know how you calculated bdev latency for a request.

Sreeni

From: Paul Von-Stamwitz [mailto:PVonStamwitz(a)us.fujitsu.com]
Sent: Wednesday, December 6, 2017 3:10 PM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Cc: Sreeni (Sreenivasa) Busam (Stellus) <s.busam(a)stellus.com<mailto:s.busam(a)stellus.com>>
Subject: RE: [SPDK] Request for comments regarding latency measurements

Hi Sreeni,

We submitted a PR for the latency measurements (change 390654<https://review.gerrithub.io/390654>).

I tried to add you as a reviewer, but your name did not come up.

We tested this against the fio_plugin for bdev and the numbers matched well.

Units = microseconds

Test run#       op      fio clat    fio avg latency    bdev latency
1 Qdepth(2)     write       7.80           8.52              8.56
                read       95.36          96.06             97.138
2 Qdepth(4)     write       7.98           8.70              8.32
                read      133.88         134.59            128.85
3 Qdepth(8)     write       8.83           9.85             10.87
                read      175.61         176.48            180.66
4 Qdepth(16)    write       9.79          10.81             10.282
                read      240.71         241.6             236.913
5 Qdepth(32)    write      11.87          12.88             12.384
                read      329.8          330.67            327.648
6 Qdepth(64)    write      20.64          21                20.707
                read      471.02         471.91            467.118
7 Qdepth(128)   write     187.53         188.57            182.92
                read      704.93         705.81            697.49


Best regards,
Paul

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Sreeni (Sreenivasa) Busam (Stellus)
Sent: Tuesday, November 14, 2017 11:21 AM
To: Storage Performance Development Kit
Subject: Re: [SPDK] Request for comments regarding latency measurements

Hi Paul,
That would be great.
Please add me as the reviewer for this task. It would be really helpful.

Thanks,
Sreeni

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Paul Von-Stamwitz
Sent: Tuesday, November 14, 2017 10:46 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

Hi Sreeni,

Since we have consensus on Option #2, we do plan on submitting a patch for it. We can certainly include you as a reviewer if you like.

-Paul

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Sreeni (Sreenivasa) Busam (Stellus)
Sent: Tuesday, November 14, 2017 10:28 AM
To: Storage Performance Development Kit
Subject: Re: [SPDK] Request for comments regarding latency measurements

Hi Jim,

I am not an SPDK driver expert, but I have been working on the same problem for the last 2 weeks, and in my investigation I came to the same conclusion as you wrote in the email: option 2 looks like an easy and good solution.
The bdev interfaces are well developed for NVMe devices and look similar to the Linux kernel storage driver, except for the polling vs. interrupt differences in getting completion status. These are just my comments based on what I have understood so far.
I have already been working on collecting I/O statistics for applications at the NVMe layer. For any applications using the bdev layer, I would like to use your changes. When are you planning to submit the patch to the repository?
Please let me know when you complete your investigation.

Thanks,
Sreeni

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Harris, James R
Sent: Tuesday, November 14, 2017 7:43 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

Agreed - #2 is the way to go.

While we’re on the topic - should we be resetting the stats when spdk_bdev_get_io_stat() is called?  This eliminates the ability for two separate applications to monitor stats in parallel – or even a separate application plus some future yet-to-be-written internal load-balancing monitor.  I’m thinking that bdev should just keep the running total and let the caller own tracking differences from the last time it called spdk_bdev_get_io_stat().

-Jim


From: SPDK <spdk-bounces(a)lists.01.org<mailto:spdk-bounces(a)lists.01.org>> on behalf of Nathan Marushak <nathan.marushak(a)intel.com<mailto:nathan.marushak(a)intel.com>>
Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Date: Tuesday, November 14, 2017 at 7:29 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

So, not being the technical expert so to speak ☺, please take this with a grain of salt. Agree with option #2. Seems like this would allow for keeping stats across the different device types e.g. NVMe, NVML, NVMe-oF (although that has a transport within NVMe, so it might be covered for either option).

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Luse, Paul E
Sent: Tuesday, November 14, 2017 7:57 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

FWIW option 2 sounds like the most appropriate to me…

Thx
Paul

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Paul Von-Stamwitz
Sent: Monday, November 13, 2017 7:29 PM
To: spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>
Subject: [SPDK] Request for comments regarding latency measurements

Background:
Bdev.c currently maintains a count of reads/writes/bytes_read/bytes_written in channel->stat. This information is retrieved (and reset) via spdk_bdev_get_io_stat.

Proposal:
Add latency information to channel->stat and enable the option to provide a histogram.
We can measure the latency of each IO and keep a running total on a read/write basis. We can also use the measured latency to keep a running count of reads/writes in their associated histogram “buckets”.
The question is, how do we measure latency?

Option 1:
Measure latency at the NVMe tracker.
Currently, we already timestamp every command placed on the tracker if the abort timeout callback is registered. When we remove a completed IO from the tracker, we can timestamp it again and calculate the latency.
There are several issues here that need to be considered.
We need to get the latency information back up to the bdev layer, most likely through a callback argument, but this would require a change to the NVMe API. If the bdev breaks down a request into smaller IOs, it can add up the latencies of the child IOs for io_stat and histogram purposes.
Also, this method does not take into account any latency added by the SPDK bdev/NVMe layers (except the poller). If a request was queued before being placed on the tracker, then the time it was queued is not factored into the latency calculation.

Option 2:
Measure latency at the bdev layer.
We can timestamp at submit and again at completion. This would keep all io_stat information local to bdev and would take into account the overhead of most queued operations. Any applications written directly to the NVMe layer would have to calculate their own latencies, but that is currently true for all io_stats.

I’m sure that there are other issues I am missing, but I would appreciate any comments on how best to move forward on this.

Thanks,
Paul




* Re: [SPDK] Request for comments regarding latency measurements
@ 2017-12-06 23:43 Paul Von-Stamwitz
  0 siblings, 0 replies; 22+ messages in thread
From: Paul Von-Stamwitz @ 2017-12-06 23:43 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 8606 bytes --]

Hi Sreeni,

The first two columns are fio numbers, where clat is the average time between submission to bdev and completion. The third column is calculated from spdk_bdev_get_io_stat, which returns:

struct spdk_bdev_io_stat {
                uint64_t bytes_read;
                uint64_t num_read_ops;
                uint64_t bytes_written;
                uint64_t num_write_ops;
                uint64_t read_latency_ticks;
                uint64_t write_latency_ticks;
                uint64_t ticks_us_rate;
};

All numbers in spdk_bdev_io_stat are cumulative. To get the cumulative latency in usec, divide (read/write)_latency_ticks by ticks_us_rate. Then, divide that by the num_(read/write)_ops to get the average latency per command.

Best regards,
Paul

From: Sreeni (Sreenivasa) Busam (Stellus) [mailto:s.busam(a)stellus.com]
Sent: Wednesday, December 06, 2017 3:22 PM
To: Paul Von-Stamwitz; Storage Performance Development Kit
Subject: RE: [SPDK] Request for comments regarding latency measurements

Hi Paul,

Thanks a lot for submitting the patch.
The “clat” latency is the difference between I/O completion ticks and submission ticks for an I/O request, is that correct?
Please let me know how you calculated bdev latency for a request.

Sreeni

From: Paul Von-Stamwitz [mailto:PVonStamwitz(a)us.fujitsu.com]
Sent: Wednesday, December 6, 2017 3:10 PM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Cc: Sreeni (Sreenivasa) Busam (Stellus) <s.busam(a)stellus.com<mailto:s.busam(a)stellus.com>>
Subject: RE: [SPDK] Request for comments regarding latency measurements

Hi Sreeni,

We submitted a PR for the latency measurements (change 390654<https://review.gerrithub.io/390654>).

I tried to add you as a reviewer, but your name did not come up.

We tested this against the fio_plugin for bdev and the numbers matched well.

Units = microseconds

Test run#       op      fio clat    fio avg latency    bdev latency
1 Qdepth(2)     write       7.80           8.52              8.56
                read       95.36          96.06             97.138
2 Qdepth(4)     write       7.98           8.70              8.32
                read      133.88         134.59            128.85
3 Qdepth(8)     write       8.83           9.85             10.87
                read      175.61         176.48            180.66
4 Qdepth(16)    write       9.79          10.81             10.282
                read      240.71         241.6             236.913
5 Qdepth(32)    write      11.87          12.88             12.384
                read      329.8          330.67            327.648
6 Qdepth(64)    write      20.64          21                20.707
                read      471.02         471.91            467.118
7 Qdepth(128)   write     187.53         188.57            182.92
                read      704.93         705.81            697.49


Best regards,
Paul

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Sreeni (Sreenivasa) Busam (Stellus)
Sent: Tuesday, November 14, 2017 11:21 AM
To: Storage Performance Development Kit
Subject: Re: [SPDK] Request for comments regarding latency measurements

Hi Paul,
That would be great.
Please add me as the reviewer for this task. It would be really helpful.

Thanks,
Sreeni

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Paul Von-Stamwitz
Sent: Tuesday, November 14, 2017 10:46 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

Hi Sreeni,

Since we have consensus on Option #2, we do plan on submitting a patch for it. We can certainly include you as a reviewer if you like.

-Paul

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Sreeni (Sreenivasa) Busam (Stellus)
Sent: Tuesday, November 14, 2017 10:28 AM
To: Storage Performance Development Kit
Subject: Re: [SPDK] Request for comments regarding latency measurements

Hi Jim,

I am not an SPDK driver expert, but I have been working on the same problem for the last 2 weeks, and in my investigation I came to the same conclusion as you wrote in the email: option 2 looks like an easy and good solution.
The bdev interfaces are well developed for NVMe devices and look similar to the Linux kernel storage driver, except for the polling vs. interrupt differences in getting completion status. These are just my comments based on what I have understood so far.
I have already been working on collecting I/O statistics for applications at the NVMe layer. For any applications using the bdev layer, I would like to use your changes. When are you planning to submit the patch to the repository?
Please let me know when you complete your investigation.

Thanks,
Sreeni

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Harris, James R
Sent: Tuesday, November 14, 2017 7:43 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

Agreed - #2 is the way to go.

While we’re on the topic - should we be resetting the stats when spdk_bdev_get_io_stat() is called?  This eliminates the ability for two separate applications to monitor stats in parallel – or even a separate application plus some future yet-to-be-written internal load-balancing monitor.  I’m thinking that bdev should just keep the running total and let the caller own tracking differences from the last time it called spdk_bdev_get_io_stat().

-Jim


From: SPDK <spdk-bounces(a)lists.01.org<mailto:spdk-bounces(a)lists.01.org>> on behalf of Nathan Marushak <nathan.marushak(a)intel.com<mailto:nathan.marushak(a)intel.com>>
Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Date: Tuesday, November 14, 2017 at 7:29 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

So, not being the technical expert so to speak ☺, please take this with a grain of salt. Agree with option #2. Seems like this would allow for keeping stats across the different device types e.g. NVMe, NVML, NVMe-oF (although that has a transport within NVMe, so it might be covered for either option).

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Luse, Paul E
Sent: Tuesday, November 14, 2017 7:57 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

FWIW option 2 sounds like the most appropriate to me…

Thx
Paul

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Paul Von-Stamwitz
Sent: Monday, November 13, 2017 7:29 PM
To: spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>
Subject: [SPDK] Request for comments regarding latency measurements

Background:
Bdev.c currently maintains a count of reads/writes/bytes_read/bytes_written in channel->stat. This information is retrieved (and reset) via spdk_bdev_get_io_stat.

Proposal:
Add latency information to channel->stat and enable the option to provide a histogram.
We can measure the latency of each IO and keep a running total on a read/write basis. We can also use the measured latency to keep a running count of reads/writes in their associated histogram “buckets”.
The question is, how do we measure latency?

Option 1:
Measure latency at the NVMe tracker.
Currently, we already timestamp every command placed on the tracker if the abort timeout callback is registered. When we remove a completed IO from the tracker, we can timestamp it again and calculate the latency.
There are several issues here that need to be considered.
We need to get the latency information back up to the bdev layer, most likely through a callback argument, but this would require a change to the NVMe API. If the bdev breaks down a request into smaller IOs, it can add up the latencies of the child IOs for io_stat and histogram purposes.
Also, this method does not take into account any latency added by the SPDK bdev/NVMe layers (except the poller). If a request was queued before being placed on the tracker, then the time it was queued is not factored into the latency calculation.

Option 2:
Measure latency at the bdev layer.
We can timestamp at submit and again at completion. This would keep all io_stat information local to bdev and would take into account the overhead of most queued operations. Any applications written directly to the NVMe layer would have to calculate their own latencies, but that is currently true for all io_stats.

I’m sure that there are other issues I am missing, but I would appreciate any comments on how best to move forward on this.

Thanks,
Paul




* Re: [SPDK] Request for comments regarding latency measurements
@ 2017-12-06 23:21 Sreeni Busam
  0 siblings, 0 replies; 22+ messages in thread
From: Sreeni Busam @ 2017-12-06 23:21 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 7515 bytes --]

Hi Paul,

Thanks a lot for submitting the patch.
The “clat” latency is the difference between I/O completion ticks and submission ticks for an I/O request, is that correct?
Please let me know how you calculated bdev latency for a request.

Sreeni

From: Paul Von-Stamwitz [mailto:PVonStamwitz(a)us.fujitsu.com]
Sent: Wednesday, December 6, 2017 3:10 PM
To: Storage Performance Development Kit <spdk(a)lists.01.org>
Cc: Sreeni (Sreenivasa) Busam (Stellus) <s.busam(a)stellus.com>
Subject: RE: [SPDK] Request for comments regarding latency measurements

Hi Sreeni,

We submitted a PR for the latency measurements (change 390654<https://review.gerrithub.io/390654>).

I tried to add you as a reviewer, but your name did not come up.

We tested this against the fio_plugin for bdev and the numbers matched well.

Units = microseconds

Test run#       op      fio clat    fio avg latency    bdev latency
1 Qdepth(2)     write       7.80           8.52              8.56
                read       95.36          96.06             97.138
2 Qdepth(4)     write       7.98           8.70              8.32
                read      133.88         134.59            128.85
3 Qdepth(8)     write       8.83           9.85             10.87
                read      175.61         176.48            180.66
4 Qdepth(16)    write       9.79          10.81             10.282
                read      240.71         241.6             236.913
5 Qdepth(32)    write      11.87          12.88             12.384
                read      329.8          330.67            327.648
6 Qdepth(64)    write      20.64          21                20.707
                read      471.02         471.91            467.118
7 Qdepth(128)   write     187.53         188.57            182.92
                read      704.93         705.81            697.49


Best regards,
Paul

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Sreeni (Sreenivasa) Busam (Stellus)
Sent: Tuesday, November 14, 2017 11:21 AM
To: Storage Performance Development Kit
Subject: Re: [SPDK] Request for comments regarding latency measurements

Hi Paul,
That would be great.
Please add me as the reviewer for this task. It would be really helpful.

Thanks,
Sreeni

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Paul Von-Stamwitz
Sent: Tuesday, November 14, 2017 10:46 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

Hi Sreeni,

Since we have consensus on Option #2, we do plan on submitting a patch for it. We can certainly include you as a reviewer if you like.

-Paul

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Sreeni (Sreenivasa) Busam (Stellus)
Sent: Tuesday, November 14, 2017 10:28 AM
To: Storage Performance Development Kit
Subject: Re: [SPDK] Request for comments regarding latency measurements

Hi Jim,

I am not an SPDK driver expert, but I have been working on the same problem for the last 2 weeks, and in my investigation I came to the same conclusion as you wrote in the email: option 2 looks like an easy and good solution.
The bdev interfaces are well developed for NVMe devices and look similar to the Linux kernel storage driver, except for the polling vs. interrupt differences in getting completion status. These are just my comments based on what I have understood so far.
I have already been working on collecting I/O statistics for applications at the NVMe layer. For any applications using the bdev layer, I would like to use your changes. When are you planning to submit the patch to the repository?
Please let me know when you complete your investigation.

Thanks,
Sreeni

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Harris, James R
Sent: Tuesday, November 14, 2017 7:43 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

Agreed - #2 is the way to go.

While we’re on the topic - should we be resetting the stats when spdk_bdev_get_io_stat() is called?  This eliminates the ability for two separate applications to monitor stats in parallel – or even a separate application plus some future yet-to-be-written internal load-balancing monitor.  I’m thinking that bdev should just keep the running total and let the caller own tracking differences from the last time it called spdk_bdev_get_io_stat().

-Jim


From: SPDK <spdk-bounces(a)lists.01.org<mailto:spdk-bounces(a)lists.01.org>> on behalf of Nathan Marushak <nathan.marushak(a)intel.com<mailto:nathan.marushak(a)intel.com>>
Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Date: Tuesday, November 14, 2017 at 7:29 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

So, not being the technical expert so to speak ☺, please take this with a grain of salt. Agree with option #2. Seems like this would allow for keeping stats across the different device types e.g. NVMe, NVML, NVMe-oF (although that has a transport within NVMe, so it might be covered for either option).

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Luse, Paul E
Sent: Tuesday, November 14, 2017 7:57 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

FWIW option 2 sounds like the most appropriate to me…

Thx
Paul

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Paul Von-Stamwitz
Sent: Monday, November 13, 2017 7:29 PM
To: spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>
Subject: [SPDK] Request for comments regarding latency measurements

Background:
Bdev.c currently maintains a count of reads/writes/bytes_read/bytes_written in channel->stat. This information is retrieved (and reset) via spdk_bdev_get_io_stat.

Proposal:
Add latency information to channel->stat and enable the option to provide a histogram.
We can measure the latency of each IO and keep a running total on a read/write basis. We can also use the measured latency to keep a running count of reads/writes in their associated histogram “buckets”.
The question is, how do we measure latency?

Option 1:
Measure latency at the NVMe tracker.
Currently, we already timestamp every command placed on the tracker if the abort timeout callback is registered. When we remove a completed IO from the tracker, we can timestamp it again and calculate the latency.
There are several issues here that need to be considered.
We need to get the latency information back up to the bdev layer, most likely through a callback argument, but this would require a change to the NVMe API. If the bdev breaks down a request into smaller IOs, it can add up the latencies of the child IOs for io_stat and histogram purposes.
Also, this method does not take into account any latency added by the spdk bdev/nvme layers (except the poller). If a request was queued before being placed on the tracker, then the time it was queued is not factored into the latency calculation.

Option 2:
Measure latency at the bdev layer.
We can timestamp at submit and again at completion. This would keep all io_stat information local to bdev and would take into account the overhead of most queued operations. Any applications written directly to the NVMe layer would have to calculate their own latencies, but that is currently true for all io_stats.
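
As a rough sketch of what Option 2 implies (illustrative only, not the eventual patch: the struct and function names below are made up, and only spdk_get_ticks() is an existing SPDK helper), the submit path stashes a timestamp on the I/O and the completion path accumulates the delta into the per-channel stats:

#include <stdbool.h>
#include <stdint.h>
#include "spdk/env.h"

/* Hypothetical stand-ins for the bdev I/O and per-channel stats. */
struct example_bdev_io {
        uint64_t submit_tsc;             /* captured when the I/O is submitted */
        bool     is_read;
};

struct example_io_stat {
        uint64_t num_read_ops;
        uint64_t num_write_ops;
        uint64_t read_time_ticks;        /* cumulative read latency, TSC ticks */
        uint64_t write_time_ticks;       /* cumulative write latency, TSC ticks */
};

static void
example_submit(struct example_bdev_io *io)
{
        io->submit_tsc = spdk_get_ticks();
}

static void
example_complete(struct example_bdev_io *io, struct example_io_stat *stat)
{
        uint64_t delta = spdk_get_ticks() - io->submit_tsc;

        if (io->is_read) {
                stat->num_read_ops++;
                stat->read_time_ticks += delta;
        } else {
                stat->num_write_ops++;
                stat->write_time_ticks += delta;
        }
}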

I’m sure that there are other issues I am missing, but I would appreciate any comments on how best to move forward on this.

Thanks,
Paul


[-- Attachment #2: attachment.html --]
[-- Type: text/html, Size: 34652 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [SPDK] Request for comments regarding latency measurements
@ 2017-12-06 23:10 Paul Von-Stamwitz
  0 siblings, 0 replies; 22+ messages in thread
From: Paul Von-Stamwitz @ 2017-12-06 23:10 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 6959 bytes --]

Hi Sreeni,

We submitted a PR for the latency measurements (change 390654<https://review.gerrithub.io/390654>).

I tried to add you as a reviewer, but your name did not come up.

We tested this against the fio_plugin for bdev and the numbers matched well.

Units = microseconds

Run  Qdepth  Op     fio clat  fio avg latency  bdev latency
1         2  write      7.80             8.52          8.56
1         2  read      95.36            96.06         97.138
2         4  write      7.98             8.70          8.32
2         4  read     133.88           134.59        128.85
3         8  write      8.83             9.85         10.87
3         8  read     175.61           176.48        180.66
4        16  write      9.79            10.81         10.282
4        16  read     240.71           241.6         236.913
5        32  write     11.87            12.88         12.384
5        32  read     329.8            330.67        327.648
6        64  write     20.64            21            20.707
6        64  read     471.02           471.91        467.118
7       128  write    187.53           188.57        182.92
7       128  read     704.93           705.81        697.49


Best regards,
Paul

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Sreeni (Sreenivasa) Busam (Stellus)
Sent: Tuesday, November 14, 2017 11:21 AM
To: Storage Performance Development Kit
Subject: Re: [SPDK] Request for comments regarding latency measurements

Hi Paul,
That would be great.
Please add me as the reviewer for this task. It would be really helpful.

Thanks,
Sreeni

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Paul Von-Stamwitz
Sent: Tuesday, November 14, 2017 10:46 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

Hi Sreeni,

Since we have consensus on Option #2, we do plan on submitting a patch for it. We can certainly include you as a reviewer if you like.

-Paul

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Sreeni (Sreenivasa) Busam (Stellus)
Sent: Tuesday, November 14, 2017 10:28 AM
To: Storage Performance Development Kit
Subject: Re: [SPDK] Request for comments regarding latency measurements

Hi Jim,

I am not an SPDK driver expert, but I have been working on the same problem for the last 2 weeks, and in my investigation I reached the same conclusion you wrote in the email: option 2 looked like an easy and good solution.
The bdev interfaces are well developed for NVMe devices, and the layer looks similar to the Linux kernel storage driver except for the polling versus interrupt differences in how completion status is obtained. These are just my comments based on what I have understood so far.
I have already been working on collecting I/O statistics for applications at the NVMe layer. For any applications using the bdev layer, I would like to use your changes. When are you planning to submit the patch to the repository?
Please let me know when you complete your investigation.

Thanks,
Sreeni

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Harris, James R
Sent: Tuesday, November 14, 2017 7:43 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

Agreed - #2 is the way to go.

While we’re on the topic - should we be resetting the stats when spdk_bdev_get_io_stat() is called?  This eliminates the ability for two separate applications to monitor stats in parallel – or even a separate application plus some future yet-to-be-written internal load-balancing monitor.  I’m thinking that bdev should just keep the running total and let the caller own tracking differences from the last time it called spdk_bdev_get_io_stat().

-Jim


From: SPDK <spdk-bounces(a)lists.01.org<mailto:spdk-bounces(a)lists.01.org>> on behalf of Nathan Marushak <nathan.marushak(a)intel.com<mailto:nathan.marushak(a)intel.com>>
Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Date: Tuesday, November 14, 2017 at 7:29 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

So, not being the technical expert so to speak ☺, please take this with a grain of salt. Agree with option #2. Seems like this would allow for keeping stats across the different device types e.g. NVMe, NVML, NVMe-oF (although that has a transport within NVMe, so it might be covered for either option).

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Luse, Paul E
Sent: Tuesday, November 14, 2017 7:57 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

FWIW option 2 sounds like the most appropriate to me…

Thx
Paul

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Paul Von-Stamwitz
Sent: Monday, November 13, 2017 7:29 PM
To: spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>
Subject: [SPDK] Request for comments regarding latency measurements

Background:
Bdev.c currently maintains a count of reads/writes/bytes_read/bytes_written in channel->stat. This information is retrieved (and reset) via spdk_bdev_get_io_stat.

Proposal:
Add latency information to channel->stat and enable the option to provide a histogram.
We can measure the latency of each IO and keep a running total on a read/write basis. We can also use the measured latency to keep a running count of reads/writes in their associated histogram “buckets”.
The question is, how do we measure latency?

Option 1:
Measure latency at the NVMe tracker.
Currently, we already timestamp every command placed on the tracker if the abort timeout callback is registered. When we remove a completed IO from the tracker, we can timestamp it again and calculate the latency.
There are several issues here that need to be considered.
We need to get the latency information back up to the bdev layer, most likely through a callback argument, but this would require a change to the NVMe API. If the bdev breaks down a request into smaller IOs, it can add up the latencies of the child IOs for io_stat and histogram purposes.
Also, this method does not take into account any latency added by the spdk bdev/nvme layers (except the poller). If a request was queued before being placed on the tracker, then the time it was queued is not factored into the latency calculation.

Option 2:
Measure latency at the bdev layer.
We can timestamp at submit and again at completion. This would keep all io_stat information local to bdev and would take into account the overhead of most queued operations. Any applications written directly to the NVMe layer would have to calculate their own latencies, but that is currently true for all io_stats.

I’m sure that there are other issues I am missing, but I would appreciate any comments on how best to move forward on this.

Thanks,
Paul


[-- Attachment #2: attachment.html --]
[-- Type: text/html, Size: 31467 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [SPDK] Request for comments regarding latency measurements
@ 2017-11-14 22:33 Paul Von-Stamwitz
  0 siblings, 0 replies; 22+ messages in thread
From: Paul Von-Stamwitz @ 2017-11-14 22:33 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 8316 bytes --]

Okay. We’ll change ns to tsc and add the tsc_rate so the caller can do the conversion.
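
The caller-side conversion is then just ticks * 10^6 / tsc_rate; something like this (sketch only, names are not final):

#include <stdint.h>

/* Latency is reported as raw TSC ticks plus the tick rate; convert offline. */
static double
avg_latency_us(uint64_t total_ticks, uint64_t num_ops, uint64_t tsc_rate)
{
        if (num_ops == 0 || tsc_rate == 0) {
                return 0.0;
        }
        /* ticks -> microseconds, then divide by the op count for the mean */
        return ((double)total_ticks * 1000000.0 / (double)tsc_rate) / (double)num_ops;
}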

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Harris, James R
Sent: Tuesday, November 14, 2017 2:30 PM
To: Storage Performance Development Kit
Subject: Re: [SPDK] Request for comments regarding latency measurements

The histogram code also buckets by TSC, with conversions done “offline” by the caller.

-Jim


From: SPDK <spdk-bounces(a)lists.01.org<mailto:spdk-bounces(a)lists.01.org>> on behalf of Paul Von-Stamwitz <PVonStamwitz(a)us.fujitsu.com<mailto:PVonStamwitz(a)us.fujitsu.com>>
Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Date: Tuesday, November 14, 2017 at 2:28 PM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

I thought of that. I don’t like doing the conversion on every command either, but I was thinking about how we would handle the histogram and I was assuming that the buckets would be segmented in units of time.

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Harris, James R
Sent: Tuesday, November 14, 2017 2:23 PM
To: Storage Performance Development Kit
Subject: Re: [SPDK] Request for comments regarding latency measurements

Hi Paul,

Agree on handling reset and histogram stuff separately.

Instead of ns could we store as tsc?  I’d like to avoid tsc => ns conversions in the I/O path if possible.  Then store the TSC rate as a separate member in this structure.

-Jim


From: SPDK <spdk-bounces(a)lists.01.org<mailto:spdk-bounces(a)lists.01.org>> on behalf of Paul Von-Stamwitz <PVonStamwitz(a)us.fujitsu.com<mailto:PVonStamwitz(a)us.fujitsu.com>>
Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Date: Tuesday, November 14, 2017 at 2:16 PM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

I don’t have a problem taking out the reset, but we need to check with the VTune guys since they are resetting their data at each interval as well. At the end of the day I would like the two methods to eventually converge, or at least be consistent.

I’m thinking for now, just appending a cumulative I/O time in nanoseconds to io_stat.

struct spdk_bdev_io_stat {
        uint64_t bytes_read;
        uint64_t num_read_ops;
        uint64_t bytes_written;
        uint64_t num_write_ops;
        uint64_t read_time_ns;
        uint64_t write_time_ns;
};

We can address the reset issue and histogram in subsequent PRs.

Is this okay?
-Paul


From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Walker, Benjamin
Sent: Tuesday, November 14, 2017 8:36 AM
To: Storage Performance Development Kit
Subject: Re: [SPDK] Request for comments regarding latency measurements

The only reason to reset is to deal with overflow. But the time between overflows would be very large so the user could potentially account for it on their end.


-------- Original message --------
From: "Harris, James R" <james.r.harris(a)intel.com<mailto:james.r.harris(a)intel.com>>
Date: 11/14/17 9:01 AM (GMT-07:00)
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

Or we leave out the reset stats API and provide helper functions on the client side (in Python) to diff results from two calls to spdk_bdev_get_io_stat().

From: SPDK <spdk-bounces(a)lists.01.org<mailto:spdk-bounces(a)lists.01.org>> on behalf of Paul E Luse <paul.e.luse(a)intel.com<mailto:paul.e.luse(a)intel.com>>
Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Date: Tuesday, November 14, 2017 at 7:50 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

Or maybe as an initial separate patch remove the auto-reset and add a new API to reset stats….

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Harris, James R
Sent: Tuesday, November 14, 2017 8:43 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

Agreed - #2 is the way to go.

While we’re on the topic - should we be resetting the stats when spdk_bdev_get_io_stat() is called?  This eliminates the ability for two separate applications to monitor stats in parallel – or even a separate application plus some future yet-to-be-written internal load-balancing monitor.  I’m thinking that bdev should just keep the running total and let the caller own tracking differences from the last time it called spdk_bdev_get_io_stat().

-Jim


From: SPDK <spdk-bounces(a)lists.01.org<mailto:spdk-bounces(a)lists.01.org>> on behalf of Nathan Marushak <nathan.marushak(a)intel.com<mailto:nathan.marushak(a)intel.com>>
Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Date: Tuesday, November 14, 2017 at 7:29 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

So, not being the technical expert so to speak ☺, please take this with a grain of salt. Agree with option #2. Seems like this would allow for keeping stats across the different device types e.g. NVMe, NVML, NVMe-oF (although that has a transport within NVMe, so it might be covered for either option).

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Luse, Paul E
Sent: Tuesday, November 14, 2017 7:57 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

FWIW option 2 sounds like the most appropriate to me…

Thx
Paul

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Paul Von-Stamwitz
Sent: Monday, November 13, 2017 7:29 PM
To: spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>
Subject: [SPDK] Request for comments regarding latency measurements

Background:
Bdev.c currently maintains a count of reads/writes/bytes_read/bytes_written in channel->stat. This information is retrieved (and reset) via spdk_bdev_get_io_stat.

Proposal:
Add latency information to channel->stat and enable the option to provide a histogram.
We can measure the latency of each IO and keep a running total on a read/write basis. We can also use the measured latency to keep a running count of reads/writes in their associated histogram “buckets”.
The question is, how do we measure latency?

Option 1:
Measure latency at the NVMe tracker.
Currently, we already timestamp every command placed on the tracker if the abort timeout callback is registered. When we remove a completed IO from the tracker, we can timestamp it again and calculate the latency.
There are several issues here that need to be considered.
We need to get the latency information back up to the bdev layer, most likely through a callback argument, but this would require a change to the NVMe API. If the bdev breaks down a request into smaller IOs, it can add up the latencies of the child IOs for io_stat and histogram purposes.
Also, this method does not take into account any latency added by the spdk bdev/nvme layers (except the poller). If a request was queued before being placed on the tracker, then the time it was queued is not factored into the latency calculation.

Option 2:
Measure latency at the bdev layer.
We can timestamp at submit and again at completion. This would keep all io_stat information local to bdev and would take into account the overhead of most queued operations. Any applications written directly to the NVMe layer would have to calculate their own latencies, but that is currently true for all io_stats.

I’m sure that there are other issues I am missing, but I would appreciate any comments on how best to move forward on this.

Thanks,
Paul


[-- Attachment #2: attachment.html --]
[-- Type: text/html, Size: 24478 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [SPDK] Request for comments regarding latency measurements
@ 2017-11-14 22:30 Harris, James R
  0 siblings, 0 replies; 22+ messages in thread
From: Harris, James R @ 2017-11-14 22:30 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 7861 bytes --]

The histogram code also buckets by TSC, with conversions done “offline” by the caller.
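
In other words, the hot path only tallies raw TSC deltas into power-of-two buckets, and the caller converts bucket boundaries to time when it reads the histogram out. A simplified illustration (not the actual histogram code):

#include <stdint.h>

#define EXAMPLE_NUM_BUCKETS 64

struct example_tsc_histogram {
        /* bucket[i] counts latencies of roughly [2^i, 2^(i+1)) ticks */
        uint64_t bucket[EXAMPLE_NUM_BUCKETS];
};

static void
example_tally(struct example_tsc_histogram *h, uint64_t tsc_delta)
{
        unsigned int i = 0;

        while (tsc_delta > 1 && i < EXAMPLE_NUM_BUCKETS - 1) {
                tsc_delta >>= 1;
                i++;
        }
        h->bucket[i]++;
}

/* Offline: convert a bucket's lower bound to microseconds for reporting. */
static double
example_bucket_floor_us(unsigned int i, uint64_t tsc_rate)
{
        return (double)(1ULL << i) * 1000000.0 / (double)tsc_rate;
}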

-Jim


From: SPDK <spdk-bounces(a)lists.01.org> on behalf of Paul Von-Stamwitz <PVonStamwitz(a)us.fujitsu.com>
Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org>
Date: Tuesday, November 14, 2017 at 2:28 PM
To: Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: Re: [SPDK] Request for comments regarding latency measurements

I thought of that. I don’t like doing the conversion on every command either, but I was thinking about how we would handle the histogram and I was assuming that the buckets would be segmented in units of time.

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Harris, James R
Sent: Tuesday, November 14, 2017 2:23 PM
To: Storage Performance Development Kit
Subject: Re: [SPDK] Request for comments regarding latency measurements

Hi Paul,

Agree on handling reset and histogram stuff separately.

Instead of ns could we store as tsc?  I’d like to avoid tsc => ns conversions in the I/O path if possible.  Then store the TSC rate as a separate member in this structure.

-Jim


From: SPDK <spdk-bounces(a)lists.01.org<mailto:spdk-bounces(a)lists.01.org>> on behalf of Paul Von-Stamwitz <PVonStamwitz(a)us.fujitsu.com<mailto:PVonStamwitz(a)us.fujitsu.com>>
Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Date: Tuesday, November 14, 2017 at 2:16 PM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

I don’t have a problem taking out the reset, but we need to check with the VTune guys since they are resetting their data at each interval as well. At the end of the day I would like the two methods to eventually converge, or at least be consistent.

I’m thinking for now, just appending a cumulative I/O time in nanoseconds to io_stat.

struct spdk_bdev_io_stat {
        uint64_t bytes_read;
        uint64_t num_read_ops;
        uint64_t bytes_written;
        uint64_t num_write_ops;
        uint64_t read_time_ns;
        uint64_t write_time_ns;
};

We can address the reset issue and histogram in subsequent PRs.

Is this okay?
-Paul


From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Walker, Benjamin
Sent: Tuesday, November 14, 2017 8:36 AM
To: Storage Performance Development Kit
Subject: Re: [SPDK] Request for comments regarding latency measurements

The only reason to reset is to deal with overflow. But the time between overflows would be very large so the user could potentially account for it on their end.


-------- Original message --------
From: "Harris, James R" <james.r.harris(a)intel.com<mailto:james.r.harris(a)intel.com>>
Date: 11/14/17 9:01 AM (GMT-07:00)
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

Or we leave out the reset stats API and provide helper functions on the client side (in Python) to diff results from two calls to spdk_bdev_get_io_stat().

From: SPDK <spdk-bounces(a)lists.01.org<mailto:spdk-bounces(a)lists.01.org>> on behalf of Paul E Luse <paul.e.luse(a)intel.com<mailto:paul.e.luse(a)intel.com>>
Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Date: Tuesday, November 14, 2017 at 7:50 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

Or maybe as an initial separate patch remove the auto-reset and add a new API to reset stats….

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Harris, James R
Sent: Tuesday, November 14, 2017 8:43 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

Agreed - #2 is the way to go.

While we’re on the topic - should we be resetting the stats when spdk_bdev_get_io_stat() is called?  This eliminates the ability for two separate applications to monitor stats in parallel – or even a separate application plus some future yet-to-be-written internal load-balancing monitor.  I’m thinking that bdev should just keep the running total and let the caller own tracking differences from the last time it called spdk_bdev_get_io_stat().

-Jim


From: SPDK <spdk-bounces(a)lists.01.org<mailto:spdk-bounces(a)lists.01.org>> on behalf of Nathan Marushak <nathan.marushak(a)intel.com<mailto:nathan.marushak(a)intel.com>>
Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Date: Tuesday, November 14, 2017 at 7:29 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

So, not being the technical expert so to speak ☺, please take this with a grain of salt. Agree with option #2. Seems like this would allow for keeping stats across the different device types e.g. NVMe, NVML, NVMe-oF (although that has a transport within NVMe, so it might be covered for either option).

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Luse, Paul E
Sent: Tuesday, November 14, 2017 7:57 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

FWIW option 2 sounds like the most appropriate to me…

Thx
Paul

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Paul Von-Stamwitz
Sent: Monday, November 13, 2017 7:29 PM
To: spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>
Subject: [SPDK] Request for comments regarding latency measurements

Background:
Bdev.c currently maintains a count of reads/writes/bytes_read/bytes_written in channel->stat. This information is retrieved (and reset) via spdk_bdev_get_io_stat.

Proposal:
Add latency information to channel->stat and enable the option to provide a histogram.
We can measure the latency of each IO and keep a running total on a read/write basis. We can also use the measured latency to keep a running count of reads/writes in their associated histogram “buckets”.
The question is, how do we measure latency?

Option 1:
Measure latency at the NVMe tracker.
Currently, we already timestamp every command placed on the tracker if the abort timeout callback is registered. When we remove a completed IO from the tracker, we can timestamp it again and calculate the latency.
There are several issues here that need to be considered.
We need to get the latency information back up to the bdev layer, most likely through a callback argument, but this would require a change to the NVMe API. If the bdev breaks down a request into smaller IOs, it can add up the latencies of the child IOs for io_stat and histogram purposes.
Also, this method does not take into account any latency added by the spdk bdev/nvme layers (except the poller). If a request was queued before being placed on the tracker, then the time it was queued is not factored into the latency calculation.

Option 2:
Measure latency at the bdev layer.
We can timestamp at submit and again at completion. This would keep all io_stat information local to bdev and would take into account the overhead of most queued operations. Any applications written directly to the NVMe layer would have to calculate their own latencies, but that is currently true for all io_stats.

I’m sure that there are other issues I am missing, but I would appreciate any comments on how best to move forward on this.

Thanks,
Paul


[-- Attachment #2: attachment.html --]
[-- Type: text/html, Size: 23972 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [SPDK] Request for comments regarding latency measurements
@ 2017-11-14 22:28 Paul Von-Stamwitz
  0 siblings, 0 replies; 22+ messages in thread
From: Paul Von-Stamwitz @ 2017-11-14 22:28 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 7400 bytes --]

I thought of that. I don’t like doing the conversion on every command either, but I was thinking about how we would handle the histogram and I was assuming that the buckets would be segmented in units of time.

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Harris, James R
Sent: Tuesday, November 14, 2017 2:23 PM
To: Storage Performance Development Kit
Subject: Re: [SPDK] Request for comments regarding latency measurements

Hi Paul,

Agree on handling reset and histogram stuff separately.

Instead of ns could we store as tsc?  I’d like to avoid tsc => ns conversions in the I/O path if possible.  Then store the TSC rate as a separate member in this structure.

-Jim


From: SPDK <spdk-bounces(a)lists.01.org<mailto:spdk-bounces(a)lists.01.org>> on behalf of Paul Von-Stamwitz <PVonStamwitz(a)us.fujitsu.com<mailto:PVonStamwitz(a)us.fujitsu.com>>
Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Date: Tuesday, November 14, 2017 at 2:16 PM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

I don’t have a problem taking out the reset, but we need to check with the VTune guys since they are resetting their data at each interval as well. At the end of the day I would like the two methods to eventually converge, or at least be consistent.

I’m thinking for now, just appending a cumulative I/O time in nanoseconds to io_stat.

struct spdk_bdev_io_stat {
        uint64_t bytes_read;
        uint64_t num_read_ops;
        uint64_t bytes_written;
        uint64_t num_write_ops;
        uint64_t read_time_ns;
        uint64_t write_time_ns;
};

We can address the reset issue and histogram in subsequent PRs.

Is this okay?
-Paul


From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Walker, Benjamin
Sent: Tuesday, November 14, 2017 8:36 AM
To: Storage Performance Development Kit
Subject: Re: [SPDK] Request for comments regarding latency measurements

The only reason to reset is to deal with overflow. But the time between overflows would be very large so the user could potentially account for it on their end.


-------- Original message --------
From: "Harris, James R" <james.r.harris(a)intel.com<mailto:james.r.harris(a)intel.com>>
Date: 11/14/17 9:01 AM (GMT-07:00)
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

Or we leave out the reset stats API and provide helper functions on the client side (in Python) to diff results from two calls to spdk_bdev_get_io_stat().

From: SPDK <spdk-bounces(a)lists.01.org<mailto:spdk-bounces(a)lists.01.org>> on behalf of Paul E Luse <paul.e.luse(a)intel.com<mailto:paul.e.luse(a)intel.com>>
Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Date: Tuesday, November 14, 2017 at 7:50 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

Or maybe as an initial separate patch remove the auto-reset and add a new API to reset stats….

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Harris, James R
Sent: Tuesday, November 14, 2017 8:43 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

Agreed - #2 is the way to go.

While we’re on the topic - should we be resetting the stats when spdk_bdev_get_io_stat() is called?  This eliminates the ability for two separate applications to monitor stats in parallel – or even a separate application plus some future yet-to-be-written internal load-balancing monitor.  I’m thinking that bdev should just keep the running total and let the caller own tracking differences from the last time it called spdk_bdev_get_io_stat().

-Jim


From: SPDK <spdk-bounces(a)lists.01.org<mailto:spdk-bounces(a)lists.01.org>> on behalf of Nathan Marushak <nathan.marushak(a)intel.com<mailto:nathan.marushak(a)intel.com>>
Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Date: Tuesday, November 14, 2017 at 7:29 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

So, not being the technical expert so to speak ☺, please take this with a grain of salt. Agree with option #2. Seems like this would allow for keeping stats across the different device types e.g. NVMe, NVML, NVMe-oF (although that has a transport within NVMe, so it might be covered for either option).

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Luse, Paul E
Sent: Tuesday, November 14, 2017 7:57 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

FWIW option 2 sounds like the most appropriate to me…

Thx
Paul

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Paul Von-Stamwitz
Sent: Monday, November 13, 2017 7:29 PM
To: spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>
Subject: [SPDK] Request for comments regarding latency measurements

Background:
Bdev.c currently maintains a count of reads/writes/bytes_read/bytes_written in channel->stat. This information is retrieved (and reset) via spdk_bdev_get_io_stat.

Proposal:
Add latency information to channel->stat and enable the option to provide a histogram.
We can measure the latency of each IO and keep a running total on a read/write basis. We can also use the measured latency to keep a running count of reads/writes in their associated histogram “buckets”.
The question is, how do we measure latency?

Option 1:
Measure latency at the NVMe tracker.
Currently, we already timestamp every command placed on the tracker if the abort timeout callback is registered. When we remove a completed IO from the tracker, we can timestamp it again and calculate the latency.
There are several issues here that need to be considered.
We need to get the latency information back up to the bdev layer, most likely through a callback argument, but this would require a change to the NVMe API. If the bdev breaks down a request into smaller IOs, it can add up the latencies of the child IOs for io_stat and histogram purposes.
Also, this method does not take into account any latency added by the spdk bdev/nvme layers (except the poller). If a request was queued before being placed on the tracker, then the time it was queued is not factored into the latency calculation.

Option 2:
Measure latency at the bdev layer.
We can timestamp at submit and again at completion. This would keep all io_stat information local to bdev and would take into account the overhead of most queued operations. Any applications written directly to the NVMe layer would have to calculate their own latencies, but that is currently true for all io_stats.

I’m sure that there are other issues I am missing, but I would appreciate any comments on how best to move forward on this.

Thanks,
Paul


[-- Attachment #2: attachment.html --]
[-- Type: text/html, Size: 21830 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [SPDK] Request for comments regarding latency measurements
@ 2017-11-14 22:22 Harris, James R
  0 siblings, 0 replies; 22+ messages in thread
From: Harris, James R @ 2017-11-14 22:22 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 6819 bytes --]

Hi Paul,

Agree on handling reset and histogram stuff separately.

Instead of ns could we store as tsc?  I’d like to avoid tsc => ns conversions in the I/O path if possible.  Then store the TSC rate as a separate member in this structure.

-Jim


From: SPDK <spdk-bounces(a)lists.01.org> on behalf of Paul Von-Stamwitz <PVonStamwitz(a)us.fujitsu.com>
Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org>
Date: Tuesday, November 14, 2017 at 2:16 PM
To: Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: Re: [SPDK] Request for comments regarding latency measurements

I don’t have a problem taking out the reset, but we need to check with the VTune guys since they are resetting their data at each interval as well. At the end of the day I would like the two methods to eventually converge, or at least be consistent.

I’m thinking for now, just appending a cumulative I/O time in nanoseconds to io_stat.

struct spdk_bdev_io_stat {
        uint64_t bytes_read;
        uint64_t num_read_ops;
        uint64_t bytes_written;
        uint64_t num_write_ops;
        uint64_t read_time_ns;
        uint64_t write_time_ns;
};

We can address the reset issue and histogram in subsequent PRs.

Is this okay?
-Paul


From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Walker, Benjamin
Sent: Tuesday, November 14, 2017 8:36 AM
To: Storage Performance Development Kit
Subject: Re: [SPDK] Request for comments regarding latency measurements

The only reason to reset is to deal with overflow. But the time between overflows would be very large so the user could potentially account for it on their end.


-------- Original message --------
From: "Harris, James R" <james.r.harris(a)intel.com<mailto:james.r.harris(a)intel.com>>
Date: 11/14/17 9:01 AM (GMT-07:00)
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

Or we leave out the reset stats API and provide helper functions on the client side (in Python) to diff results from two calls to spdk_bdev_get_io_stat().

From: SPDK <spdk-bounces(a)lists.01.org<mailto:spdk-bounces(a)lists.01.org>> on behalf of Paul E Luse <paul.e.luse(a)intel.com<mailto:paul.e.luse(a)intel.com>>
Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Date: Tuesday, November 14, 2017 at 7:50 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

Or maybe as an initial separate patch remove the auto-reset and add a new API to reset stats….

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Harris, James R
Sent: Tuesday, November 14, 2017 8:43 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

Agreed - #2 is the way to go.

While we’re on the topic - should we be resetting the stats when spdk_bdev_get_io_stat() is called?  This eliminates the ability for two separate applications to monitor stats in parallel – or even a separate application plus some future yet-to-be-written internal load-balancing monitor.  I’m thinking that bdev should just keep the running total and let the caller own tracking differences from the last time it called spdk_bdev_get_io_stat().

-Jim


From: SPDK <spdk-bounces(a)lists.01.org<mailto:spdk-bounces(a)lists.01.org>> on behalf of Nathan Marushak <nathan.marushak(a)intel.com<mailto:nathan.marushak(a)intel.com>>
Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Date: Tuesday, November 14, 2017 at 7:29 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

So, not being the technical expert so to speak ☺, please take this with a grain of salt. Agree with option #2. Seems like this would allow for keeping stats across the different device types e.g. NVMe, NVML, NVMe-oF (although that has a transport within NVMe, so it might be covered for either option).

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Luse, Paul E
Sent: Tuesday, November 14, 2017 7:57 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

FWIW option 2 sounds like the most appropriate to me…

Thx
Paul

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Paul Von-Stamwitz
Sent: Monday, November 13, 2017 7:29 PM
To: spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>
Subject: [SPDK] Request for comments regarding latency measurements

Background:
Bdev.c currently maintains a count of reads/writes/bytes_read/bytes_written in channel->stat. This information is retrieved (and reset) via spdk_bdev_get_io_stat.

Proposal:
Add latency information to channel->stat and enable the option to provide a histogram.
We can measure the latency of each IO and keep a running total on a read/write basis. We can also use the measured latency to keep a running count of reads/writes in their associated histogram “buckets”.
The question is, how do we measure latency?

Option 1:
Measure latency at the NVMe tracker.
Currently, we already timestamp every command placed on the tracker if the abort timeout callback is registered. When we remove a completed IO from the tracker, we can timestamp it again and calculate the latency.
There are several issues here that need to be considered.
We need to get the latency information back up to the bdev layer, most likely through a callback argument, but this would require a change to the NVMe API. If the bdev breaks down a request into smaller IOs, it can add up the latencies of the child IOs for io_stat and histogram purposes.
Also, this method does not take into account any latency added by the spdk bdev/nvme layers (except the poller). If a request was queued before being placed on the tracker, then the time it was queued is not factored into the latency calculation.

Option 2:
Measure latency at the bdev layer.
We can timestamp at submit and again at completion. This would keep all io_stat information local to bdev and would take into account the overhead of most queued operations. Any applications written directly to the NVMe layer would have to calculate their own latencies, but that is currently true for all io_stats.

I’m sure that there are other issues I am missing, but I would appreciate any comments on how best to move forward on this.

Thanks,
Paul


[-- Attachment #2: attachment.html --]
[-- Type: text/html, Size: 21261 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [SPDK] Request for comments regarding latency measurements
@ 2017-11-14 22:16 Paul Von-Stamwitz
  0 siblings, 0 replies; 22+ messages in thread
From: Paul Von-Stamwitz @ 2017-11-14 22:16 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 6187 bytes --]

I don't have a problem taking out the reset, but we need to check with the VTune guys since they are resetting their data at each interval as well. At the end of the day I would like the two methods to eventually converge, or at least be consistent.

I'm thinking for now, just appending a cumulative I/O time in nanoseconds to io_stat.

struct spdk_bdev_io_stat {
        uint64_t bytes_read;
        uint64_t num_read_ops;
        uint64_t bytes_written;
        uint64_t num_write_ops;
        uint64_t read_time_ns;
        uint64_t write_time_ns;
};
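
To make the intent concrete, the completion path would do roughly the following with the struct above (sketch only: submit_tsc is a hypothetical field stashed on the I/O at submission, and the per-I/O ticks-to-nanoseconds conversion shown here is the cost that keeping raw ticks, as discussed elsewhere in this thread, would avoid):

#include <stdbool.h>
#include <stdint.h>
#include "spdk/env.h"

static void
example_update_stat(struct spdk_bdev_io_stat *stat, uint64_t submit_tsc, bool is_read)
{
        /* ticks -> nanoseconds; overflow is ignored for simplicity */
        uint64_t delta_ns = (spdk_get_ticks() - submit_tsc) * 1000000000ULL /
                            spdk_get_ticks_hz();

        if (is_read) {
                stat->num_read_ops++;
                stat->read_time_ns += delta_ns;
        } else {
                stat->num_write_ops++;
                stat->write_time_ns += delta_ns;
        }
}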

We can address the reset issue and histogram in subsequent PRs.

Is this okay?
-Paul


From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Walker, Benjamin
Sent: Tuesday, November 14, 2017 8:36 AM
To: Storage Performance Development Kit
Subject: Re: [SPDK] Request for comments regarding latency measurements

The only reason to reset is to deal with overflow. But the time between overflows would be very large so the user could potentially account for it on their end.


-------- Original message --------
From: "Harris, James R" <james.r.harris(a)intel.com<mailto:james.r.harris(a)intel.com>>
Date: 11/14/17 9:01 AM (GMT-07:00)
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

Or we leave out the reset stats API and provide helper functions on the client side (in Python) to diff results from two calls to spdk_bdev_get_io_stat().

From: SPDK <spdk-bounces(a)lists.01.org<mailto:spdk-bounces(a)lists.01.org>> on behalf of Paul E Luse <paul.e.luse(a)intel.com<mailto:paul.e.luse(a)intel.com>>
Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Date: Tuesday, November 14, 2017 at 7:50 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

Or maybe as an initial separate patch remove the auto-reset and add a new API to reset stats....

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Harris, James R
Sent: Tuesday, November 14, 2017 8:43 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

Agreed - #2 is the way to go.

While we're on the topic - should we be resetting the stats when spdk_bdev_get_io_stat() is called?  This eliminates the ability for two separate applications to monitor stats in parallel - or even a separate application plus some future yet-to-be-written internal load-balancing monitor.  I'm thinking that bdev should just keep the running total and let the caller own tracking differences from the last time it called spdk_bdev_get_io_stat().

-Jim


From: SPDK <spdk-bounces(a)lists.01.org<mailto:spdk-bounces(a)lists.01.org>> on behalf of Nathan Marushak <nathan.marushak(a)intel.com<mailto:nathan.marushak(a)intel.com>>
Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Date: Tuesday, November 14, 2017 at 7:29 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

So, not being the technical expert so to speak :), please take this with a grain of salt. Agree with option #2. Seems like this would allow for keeping stats across the different device types e.g. NVMe, NVML, NVMe-oF (although that has a transport within NVMe, so it might be covered for either option).

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Luse, Paul E
Sent: Tuesday, November 14, 2017 7:57 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

FWIW option 2 sounds like the most appropriate to me...

Thx
Paul

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Paul Von-Stamwitz
Sent: Monday, November 13, 2017 7:29 PM
To: spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>
Subject: [SPDK] Request for comments regarding latency measurements

Background:
Bdev.c currently maintains a count of reads/writes/bytes_read/bytes_written in channel->stat. This information is retrieved (and reset) via spdk_bdev_get_io_stat.

Proposal:
Add latency information to channel->stat and enable the option to provide a histogram.
We can measure the latency of each IO and keep a running total on a read/write basis. We can also use the measured latency to keep a running count of reads/writes in their associated histogram "buckets".
The question is, how do we measure latency?

Option 1:
Measure latency at the NVMe tracker.
Currently, we already timestamp every command placed on the tracker if the abort timeout callback is registered. When we remove a completed IO from the tracker, we can timestamp it again and calculate the latency.
There are several issues here that need to be considered.
We need to get the latency information back up to the bdev layer, most likely through a callback argument, but this would require a change to the NVMe API. If the bdev breaks down a request into smaller IOs, it can add up the latencies of the child IOs for io_stat and histogram purposes.
Also, this method does not take into account any latency added by the spdk bdev/nvme layers (except the poller). If a request was queued before being placed on the tracker, then the time it was queued is not factored into the latency calculation.

Option 2:
Measure latency at the bdev layer.
We can timestamp at submit and again at completion. This would keep all io_stat information local to bdev and would take into account the overhead of most queued operations. Any applications written directly to the NVMe layer would have to calculate their own latencies, but that is currently true for all io_stats.

I'm sure that there are other issues I am missing, but I would appreciate any comments on how best to move forward on this.

Thanks,
Paul


[-- Attachment #2: attachment.html --]
[-- Type: text/html, Size: 18249 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [SPDK] Request for comments regarding latency measurements
@ 2017-11-14 19:20 Sreeni Busam
  0 siblings, 0 replies; 22+ messages in thread
From: Sreeni Busam @ 2017-11-14 19:20 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 5676 bytes --]

Hi Paul,
That would be great.
Please add me as the reviewer for this task. It would be really helpful.

Thanks,
Sreeni

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Paul Von-Stamwitz
Sent: Tuesday, November 14, 2017 10:46 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: Re: [SPDK] Request for comments regarding latency measurements

Hi Sreeni,

Since we have consensus on Option #2, we do plan on submitting a patch for it. We can certainly include you as a reviewer if you like.

-Paul

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Sreeni (Sreenivasa) Busam (Stellus)
Sent: Tuesday, November 14, 2017 10:28 AM
To: Storage Performance Development Kit
Subject: Re: [SPDK] Request for comments regarding latency measurements

Hi Jim,

I am not an SPDK driver expert, but I have been working on the same problem for the last 2 weeks, and in my investigation I reached the same conclusion you wrote in the email: option 2 looked like an easy and good solution.
The bdev interfaces are well developed for NVMe devices, and the layer looks similar to the Linux kernel storage driver except for the polling versus interrupt differences in how completion status is obtained. These are just my comments based on what I have understood so far.
I have already been working on collecting I/O statistics for applications at the NVMe layer. For any applications using the bdev layer, I would like to use your changes. When are you planning to submit the patch to the repository?
Please let me know when you complete your investigation.

Thanks,
Sreeni

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Harris, James R
Sent: Tuesday, November 14, 2017 7:43 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

Agreed - #2 is the way to go.

While we’re on the topic - should we be resetting the stats when spdk_bdev_get_io_stat() is called?  This eliminates the ability for two separate applications to monitor stats in parallel – or even a separate application plus some future yet-to-be-written internal load-balancing monitor.  I’m thinking that bdev should just keep the running total and let the caller own tracking differences from the last time it called spdk_bdev_get_io_stat().

-Jim


From: SPDK <spdk-bounces(a)lists.01.org<mailto:spdk-bounces(a)lists.01.org>> on behalf of Nathan Marushak <nathan.marushak(a)intel.com<mailto:nathan.marushak(a)intel.com>>
Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Date: Tuesday, November 14, 2017 at 7:29 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

So, not being the technical expert so to speak ☺, please take this with a grain of salt. Agree with option #2. Seems like this would allow for keeping stats across the different device types e.g. NVMe, NVML, NVMe-oF (although that has a transport within NVMe, so it might be covered for either option).

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Luse, Paul E
Sent: Tuesday, November 14, 2017 7:57 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

FWIW option 2 sounds like the most appropriate to me…

Thx
Paul

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Paul Von-Stamwitz
Sent: Monday, November 13, 2017 7:29 PM
To: spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>
Subject: [SPDK] Request for comments regarding latency measurements

Background:
Bdev.c currently maintains a count of reads/writes/bytes_read/bytes_written in channel->stat. This information is retrieved (and reset) via spdk_bdev_get_io_stat.

Proposal:
Add latency information to channel->stat and enable the option to provide a histogram.
We can measure the latency of each IO and keep a running total on a read/write basis. We can also use the measured latency to keep a running count of reads/writes in their associated histogram “buckets”.
The question is, how do we measure latency?

Option 1:
Measure latency at the NVMe tracker.
Currently, we already timestamp every command placed on the tracker if the abort timeout callback is registered. When we remove a completed IO from the tracker, we can timestamp it again and calculate the latency.
There are several issues here that need to be considered.
We need to get the latency information back up to the bdev layer, most likely through a callback argument, but this would require a change to the NVMe API. If the bdev breaks down a request into smaller IOs, it can add up the latencies of the child IOs for io_stat and histogram purposes.
Also, this method does not take into account any latency added by the spdk bdev/nvme layers (except the poller). If a request was queued before being placed on the tracker, then the time it was queued is not factored into the latency calculation.

Option 2:
Measure latency at the bdev layer.
We can timestamp at submit and again at completion. This would keep all io_stat information local to bdev and would take into account the overhead of most queued operations. Any applications written directly to the NVMe layer would have to calculate their own latencies, but that is currently true for all io_stats.

I’m sure that there are other issues I am missing, but I would appreciate any comments on how best to move forward on this.

Thanks,
Paul


[-- Attachment #2: attachment.html --]
[-- Type: text/html, Size: 15614 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [SPDK] Request for comments regarding latency measurements
@ 2017-11-14 18:46 Paul Von-Stamwitz
  0 siblings, 0 replies; 22+ messages in thread
From: Paul Von-Stamwitz @ 2017-11-14 18:46 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 5288 bytes --]

Hi Sreeni,

Since we have consensus on Option #2, we do plan on submitting a patch for it. We can certainly include you as a reviewer if you like.

-Paul

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Sreeni (Sreenivasa) Busam (Stellus)
Sent: Tuesday, November 14, 2017 10:28 AM
To: Storage Performance Development Kit
Subject: Re: [SPDK] Request for comments regarding latency measurements

Hi Jim,

I am not an SPDK driver expert, but I have been working on the same problem for the last 2 weeks, and in my investigation I reached the same conclusion you wrote in the email: option 2 looked like an easy and good solution.
The bdev interfaces are well developed for NVMe devices, and the layer looks similar to the Linux kernel storage driver except for the polling versus interrupt differences in how completion status is obtained. These are just my comments based on what I have understood so far.
I have already been working on collecting I/O statistics for applications at the NVMe layer. For any applications using the bdev layer, I would like to use your changes. When are you planning to submit the patch to the repository?
Please let me know when you complete your investigation.

Thanks,
Sreeni

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Harris, James R
Sent: Tuesday, November 14, 2017 7:43 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

Agreed - #2 is the way to go.

While we’re on the topic - should we be resetting the stats when spdk_bdev_get_io_stat() is called?  This eliminates the ability for two separate applications to monitor stats in parallel – or even a separate application plus some future yet-to-be-written internal load-balancing monitor.  I’m thinking that bdev should just keep the running total and let the caller own tracking differences from the last time it called spdk_bdev_get_io_stat().

-Jim


From: SPDK <spdk-bounces(a)lists.01.org<mailto:spdk-bounces(a)lists.01.org>> on behalf of Nathan Marushak <nathan.marushak(a)intel.com<mailto:nathan.marushak(a)intel.com>>
Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Date: Tuesday, November 14, 2017 at 7:29 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

So, not being the technical expert so to speak ☺, please take this with a grain of salt. Agree with option #2. Seems like this would allow for keeping stats across the different device types e.g. NVMe, NVML, NVMe-oF (although that has a transport within NVMe, so it might be covered for either option).

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Luse, Paul E
Sent: Tuesday, November 14, 2017 7:57 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

FWIW option 2 sounds like the most appropriate to me…

Thx
Paul

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Paul Von-Stamwitz
Sent: Monday, November 13, 2017 7:29 PM
To: spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>
Subject: [SPDK] Request for comments regarding latency measurements

Background:
Bdev.c currently maintains a count of reads/writes/bytes_read/bytes_written in channel->stat. This information is retrieved (and reset) via spdk_bdev_get_io_stat.

Proposal:
Add latency information to channel->stat and enable the option to provide a histogram.
We can measure the latency of each IO and keep a running total on a read/write basis. We can also use the measured latency to keep a running count of reads/writes in their associated histogram “buckets”.
The question is, how do we measure latency?

Option 1:
Measure latency at the NVMe tracker.
Currently, we already timestamp every command placed on the tracker if the abort timeout callback is registered. When we remove a completed IO from the tracker, we can timestamp it again and calculate the latency.
There are several issues here that need to be considered.
We need to get the latency information back up to the bdev layer, most likely through a callback argument, but this would require a change to the NVMe API. If the bdev breaks down a request into smaller IOs, it can add up the latencies of the child IOs for io_stat and histogram purposes.
Also, this method does not take into account any latency added by the SPDK bdev/nvme layers (except the poller). If a request was queued before being placed on the tracker, then the time it was queued is not factored into the latency calculation.

Option 2:
Measure latency at the bdev layer.
We can timestamp at submit and again at completion. This would keep all io_stat information local to bdev and would take into account the overhead of most queued operations. Any applications written directly to the NVMe layer would have to calculate their own latencies, but that is currently true for all io_stats.

I’m sure that there are other issues I am missing, but I would appreciate any comments on how best to move forward on this.

Thanks,
Paul


[-- Attachment #2: attachment.html --]
[-- Type: text/html, Size: 13742 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [SPDK] Request for comments regarding latency measurements
@ 2017-11-14 18:27 Sreeni Busam
  0 siblings, 0 replies; 22+ messages in thread
From: Sreeni Busam @ 2017-11-14 18:27 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 4842 bytes --]

Hi Jim,

I am not an SPDK driver expert, but I have been working on the same problem for the last 2 weeks, and my investigation reached the same conclusion you wrote in the email: option 2 looks like an easy and good solution.
The bdev interfaces are well developed for NVMe devices and look similar to the Linux kernel storage driver, except for the polling vs. interrupt differences in retrieving completion status. These are just my comments based on what I have understood so far.
I have already been working on collecting I/O statistics for applications at the NVMe layer. For any applications using the bdev layer, I would like to use your changes. When are you planning to submit the patch to the repository?
Please let me know when you complete your investigation.

Thanks,
Sreeni

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Harris, James R
Sent: Tuesday, November 14, 2017 7:43 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: Re: [SPDK] Request for comments regarding latency measurements

Agreed - #2 is the way to go.

While we’re on the topic - should we be resetting the stats when spdk_bdev_get_io_stat() is called?  This eliminates the ability for two separate applications to monitor stats in parallel – or even a separate application plus some future yet-to-be-written internal load-balancing monitor.  I’m thinking that bdev should just keep the running total and let the caller own tracking differences from the last time it called spdk_bdev_get_io_stat().

-Jim


From: SPDK <spdk-bounces(a)lists.01.org<mailto:spdk-bounces(a)lists.01.org>> on behalf of Nathan Marushak <nathan.marushak(a)intel.com<mailto:nathan.marushak(a)intel.com>>
Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Date: Tuesday, November 14, 2017 at 7:29 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

So, not being the technical expert so to speak ☺, please take this with a grain of salt. Agree with option #2. Seems like this would allow for keeping stats across the different device types e.g. NVMe, NVML, NVMe-oF (although that has a transport within NVMe, so it might be covered for either option).

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Luse, Paul E
Sent: Tuesday, November 14, 2017 7:57 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

FWIW option 2 sounds like the most appropriate to me…

Thx
Paul

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Paul Von-Stamwitz
Sent: Monday, November 13, 2017 7:29 PM
To: spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>
Subject: [SPDK] Request for comments regarding latency measurements

Background:
Bdev.c currently maintains a count of reads/writes/bytes_read/bytes_written in channel->stat. This information is retrieved (and reset) via spdk_bdev_get_io_stat.

Proposal:
Add latency information to channel->stat and enable the option to provide a histogram.
We can measure the latency of each IO and keep a running total on a read/write basis. We can also use the measured latency to keep a running count of reads/writes in their associated histogram “buckets”.
The question is, how do we measure latency?

Option 1:
Measure latency at the NVMe tracker.
Currently, we already timestamp every command placed on the tracker if the abort timeout callback is registered. When we remove a completed IO from the tracker, we can timestamp it again and calculate the latency.
There are several issues here that need to be considered.
We need to get the latency information back up to the bdev layer, most likely through a callback argument, but this would require a change to the NVMe API. If the bdev breaks down a request into smaller IOs, it can add up the latencies of the child IOs for io_stat and histogram purposes.
Also, this method does not take into account any latency added by the SPDK bdev/nvme layers (except the poller). If a request was queued before being placed on the tracker, then the time it was queued is not factored into the latency calculation.

Option 2:
Measure latency at the bdev layer.
We can timestamp at submit and again at completion. This would keep all io_stat information local to bdev and would take into account the overhead of most queued operations. Any applications written directly to the NVMe layer would have to calculate their own latencies, but that is currently true for all io_stats.

I’m sure that there are other issues I am missing, but I would appreciate any comments on how best to move forward on this.

Thanks,
Paul


[-- Attachment #2: attachment.html --]
[-- Type: text/html, Size: 12465 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [SPDK] Request for comments regarding latency measurements
@ 2017-11-14 16:36 Walker, Benjamin
  0 siblings, 0 replies; 22+ messages in thread
From: Walker, Benjamin @ 2017-11-14 16:36 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 5084 bytes --]

The only reason to reset is to deal with overflow, but the time between overflows would be very large, so the user could potentially account for it on their end.
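
For example, a minimal sketch (assuming the counters stay unsigned 64-bit):

    #include <stdint.h>

    /* Unsigned subtraction yields the correct number of events even if the
     * running counter wrapped once between the two reads. */
    static uint64_t
    counter_delta(uint64_t prev, uint64_t cur)
    {
        return cur - prev;
    }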


-------- Original message --------
From: "Harris, James R" <james.r.harris(a)intel.com>
Date: 11/14/17 9:01 AM (GMT-07:00)
To: Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: Re: [SPDK] Request for comments regarding latency measurements

Or we leave out the reset stats API and provide helper functions on the client side (in Python) to diff results from two calls to spdk_bdev_get_io_stat().

From: SPDK <spdk-bounces(a)lists.01.org> on behalf of Paul E Luse <paul.e.luse(a)intel.com>
Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org>
Date: Tuesday, November 14, 2017 at 7:50 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: Re: [SPDK] Request for comments regarding latency measurements

Or maybe as an initial separate patch remove the auto-reset and add a new API to reset stats….

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Harris, James R
Sent: Tuesday, November 14, 2017 8:43 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: Re: [SPDK] Request for comments regarding latency measurements

Agreed - #2 is the way to go.

While we’re on the topic - should we be resetting the stats when spdk_bdev_get_io_stat() is called?  This eliminates the ability for two separate applications to monitor stats in parallel – or even a separate application plus some future yet-to-be-written internal load-balancing monitor.  I’m thinking that bdev should just keep the running total and let the caller own tracking differences from the last time it called spdk_bdev_get_io_stat().

-Jim


From: SPDK <spdk-bounces(a)lists.01.org<mailto:spdk-bounces(a)lists.01.org>> on behalf of Nathan Marushak <nathan.marushak(a)intel.com<mailto:nathan.marushak(a)intel.com>>
Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Date: Tuesday, November 14, 2017 at 7:29 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

So, not being the technical expert so to speak :), please take this with a grain of salt. Agree with option #2. Seems like this would allow for keeping stats across the different device types e.g. NVMe, NVML, NVMe-oF (although that has a transport within NVMe, so it might be covered for either option).

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Luse, Paul E
Sent: Tuesday, November 14, 2017 7:57 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

FWIW option 2 sounds like the most appropriate to me…

Thx
Paul

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Paul Von-Stamwitz
Sent: Monday, November 13, 2017 7:29 PM
To: spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>
Subject: [SPDK] Request for comments regarding latency measurements

Background:
Bdev.c currently maintains a count of reads/writes/bytes_read/bytes_written in channel->stat. This information is retrieved (and reset) via spdk_bdev_get_io_stat.

Proposal:
Add latency information to channel->stat and enable the option to provide a histogram.
We can measure the latency of each IO and keep a running total on a read/write basis. We can also use the measured latency to keep a running count of reads/writes in their associated histogram “buckets”.
The question is, how do we measure latency?

Option 1:
Measure latency at the NVMe tracker.
Currently, we already timestamp every command placed on the tracker if the abort timeout callback is registered. When we remove a completed IO from the tracker, we can timestamp it again and calculate the latency.
There are several issues here that need to be considered.
We need to get the latency information back up to the bdev layer, most likely through a callback argument, but this would require a change to the NVMe API. If the bdev breaks down a request into smaller IOs, it can add up the latencies of the child IOs for io_stat and histogram purposes.
Also, this method does not take into account any latency added by the SPDK bdev/nvme layers (except the poller). If a request was queued before being placed on the tracker, then the time it was queued is not factored into the latency calculation.

Option 2:
Measure latency at the bdev layer.
We can timestamp at submit and again at completion. This would keep all io_stat information local to bdev and would take into account the overhead of most queued operations. Any applications written directly to the NVMe layer would have to calculate their own latencies, but that is currently true for all io_stats.

I’m sure that there are other issues I am missing, but I would appreciate any comments on how best to move forward on this.

Thanks,
Paul


[-- Attachment #2: attachment.html --]
[-- Type: text/html, Size: 11283 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [SPDK] Request for comments regarding latency measurements
@ 2017-11-14 16:00 Harris, James R
  0 siblings, 0 replies; 22+ messages in thread
From: Harris, James R @ 2017-11-14 16:00 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 4655 bytes --]

Or we leave out the reset stats API and provide helper functions on the client side (in Python) to diff results from two calls to spdk_bdev_get_io_stat().
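
For illustration, the delta such a helper would compute, sketched here in C (the snapshot struct below is a stand-in for whatever the stat structure ends up containing, not the actual SPDK definition):

    #include <stdint.h>

    struct io_stat_snapshot {
        uint64_t num_read_ops;
        uint64_t num_write_ops;
        uint64_t bytes_read;
        uint64_t bytes_written;
    };

    /* newer minus older, field by field; with running totals the bdev layer
     * never resets, so every consumer can compute its own interval this way. */
    static struct io_stat_snapshot
    io_stat_delta(const struct io_stat_snapshot *older,
                  const struct io_stat_snapshot *newer)
    {
        struct io_stat_snapshot d = {
            .num_read_ops  = newer->num_read_ops  - older->num_read_ops,
            .num_write_ops = newer->num_write_ops - older->num_write_ops,
            .bytes_read    = newer->bytes_read    - older->bytes_read,
            .bytes_written = newer->bytes_written - older->bytes_written,
        };

        return d;
    }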

From: SPDK <spdk-bounces(a)lists.01.org> on behalf of Paul E Luse <paul.e.luse(a)intel.com>
Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org>
Date: Tuesday, November 14, 2017 at 7:50 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: Re: [SPDK] Request for comments regarding latency measurements

Or maybe as an initial separate patch remove the auto-reset and add a new API to reset stats….

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Harris, James R
Sent: Tuesday, November 14, 2017 8:43 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: Re: [SPDK] Request for comments regarding latency measurements

Agreed - #2 is the way to go.

While we’re on the topic - should we be resetting the stats when spdk_bdev_get_io_stat() is called?  This eliminates the ability for two separate applications to monitor stats in parallel – or even a separate application plus some future yet-to-be-written internal load-balancing monitor.  I’m thinking that bdev should just keep the running total and let the caller own tracking differences from the last time it called spdk_bdev_get_io_stat().

-Jim


From: SPDK <spdk-bounces(a)lists.01.org<mailto:spdk-bounces(a)lists.01.org>> on behalf of Nathan Marushak <nathan.marushak(a)intel.com<mailto:nathan.marushak(a)intel.com>>
Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Date: Tuesday, November 14, 2017 at 7:29 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

So, not being the technical expert so to speak ☺, please take this with a grain of salt. Agree with option #2. Seems like this would allow for keeping stats across the different device types e.g. NVMe, NVML, NVMe-oF (although that has a transport within NVMe, so it might be covered for either option).

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Luse, Paul E
Sent: Tuesday, November 14, 2017 7:57 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

FWIW option 2 sounds like the most appropriate to me…

Thx
Paul

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Paul Von-Stamwitz
Sent: Monday, November 13, 2017 7:29 PM
To: spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>
Subject: [SPDK] Request for comments regarding latency measurements

Background:
Bdev.c currently maintains a count of reads/writes/bytes_read/bytes_written in channel->stat. This information is retrieved (and reset) via spdk_bdev_get_io_stat.

Proposal:
Add latency information to channel->stat and enable the option to provide a histogram.
We can measure the latency of each IO and keep a running total on a read/write basis. We can also use the measured latency to keep a running count of reads/writes in their associated histogram “buckets”.
The question is, how do we measure latency?

Option 1:
Measure latency at the NVMe tracker.
Currently, we already timestamp every command placed on the tracker if the abort timeout callback is registered. When we remove a completed IO from the tracker, we can timestamp it again and calculate the latency.
There are several issues here that need to be considered.
We need to get the latency information back up to the bdev layer, most likely through a callback argument, but this would require a change to the NVMe API. If the bdev breaks down a request into smaller IOs, it can add up the latencies of the child IOs for io_stat and histogram purposes.
Also, this method does not take into account any latency added by the SPDK bdev/nvme layers (except the poller). If a request was queued before being placed on the tracker, then the time it was queued is not factored into the latency calculation.

Option 2:
Measure latency at the bdev layer.
We can timestamp at submit and again at completion. This would keep all io_stat information local to bdev and would take into account the overhead of most queued operations. Any applications written directly to the NVMe layer would have to calculate their own latencies, but that is currently true for all io_stats.

I’m sure that there are other issues I am missing, but I would appreciate any comments on how best to move forward on this.

Thanks,
Paul


[-- Attachment #2: attachment.html --]
[-- Type: text/html, Size: 12379 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [SPDK] Request for comments regarding latency measurements
@ 2017-11-14 15:50 Luse, Paul E
  0 siblings, 0 replies; 22+ messages in thread
From: Luse, Paul E @ 2017-11-14 15:50 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 4152 bytes --]

Or maybe as an initial separate patch remove the auto-reset and add a new API to reset stats….
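
Purely as a sketch of that split (none of these names or behaviors exist today): the get call would stop zeroing the counters and a separate, explicit reset would be added.

    #include <stdint.h>
    #include <string.h>

    struct io_stat {
        uint64_t num_read_ops;
        uint64_t num_write_ops;
        uint64_t bytes_read;
        uint64_t bytes_written;
    };

    /* Copies out the running totals without clearing them. */
    static void
    io_stat_get(const struct io_stat *src, struct io_stat *out)
    {
        *out = *src;
    }

    /* Explicit reset for callers that still want interval-style counters. */
    static void
    io_stat_reset(struct io_stat *stat)
    {
        memset(stat, 0, sizeof(*stat));
    }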

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Harris, James R
Sent: Tuesday, November 14, 2017 8:43 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: Re: [SPDK] Request for comments regarding latency measurements

Agreed - #2 is the way to go.

While we’re on the topic - should we be resetting the stats when spdk_bdev_get_io_stat() is called?  This eliminates the ability for two separate applications to monitor stats in parallel – or even a separate application plus some future yet-to-be-written internal load-balancing monitor.  I’m thinking that bdev should just keep the running total and let the caller own tracking differences from the last time it called spdk_bdev_get_io_stat().

-Jim


From: SPDK <spdk-bounces(a)lists.01.org<mailto:spdk-bounces(a)lists.01.org>> on behalf of Nathan Marushak <nathan.marushak(a)intel.com<mailto:nathan.marushak(a)intel.com>>
Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Date: Tuesday, November 14, 2017 at 7:29 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

So, not being the technical expert so to speak ☺, please take this with a grain of salt. Agree with option #2. Seems like this would allow for keeping stats across the different device types e.g. NVMe, NVML, NVMe-oF (although that has a transport within NVMe, so it might be covered for either option).

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Luse, Paul E
Sent: Tuesday, November 14, 2017 7:57 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] Request for comments regarding latency measurements

FWIW option 2 sounds like the most appropriate to me…

Thx
Paul

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Paul Von-Stamwitz
Sent: Monday, November 13, 2017 7:29 PM
To: spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>
Subject: [SPDK] Request for comments regarding latency measurements

Background:
Bdev.c currently maintains a count of reads/writes/bytes_read/bytes_written in channel->stat. This information is retrieved (and reset) via spdk_bdev_get_io_stat.

Proposal:
Add latency information to channel->stat and enable the option to provide a histogram.
We can measure the latency of each IO and keep a running total on a read/write basis. We can also use the measured latency to keep a running count of reads/writes in their associated histogram “buckets”.
The question is, how do we measure latency?

Option 1:
Measure latency at the NVMe tracker.
Currently, we already timestamp every command placed on the tracker if the abort timeout callback is registered. When we remove a completed IO from the tracker, we can timestamp it again and calculate the latency.
There are several issues here that need to be considered.
We need to get the latency information back up to the bdev layer, most likely through a callback argument, but this would require a change to the NVMe API. If the bdev breaks down a request into smaller IOs, it can add up the latencies of the child IOs for io_stat and histogram purposes.
Also, this method does not take into account any latency added by the SPDK bdev/nvme layers (except the poller). If a request was queued before being placed on the tracker, then the time it was queued is not factored into the latency calculation.

Option 2:
Measure latency at the bdev layer.
We can timestamp at submit and again at completion. This would keep all io_stat information local to bdev and would take into account the overhead of most queued operations. Any applications written directly to the NVMe layer would have to calculate their own latencies, but that is currently true for all io_stats.

I’m sure that there are other issues I am missing, but I would appreciate any comments on how best to move forward on this.

Thanks,
Paul


[-- Attachment #2: attachment.html --]
[-- Type: text/html, Size: 11105 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [SPDK] Request for comments regarding latency measurements
@ 2017-11-14 15:42 Harris, James R
  0 siblings, 0 replies; 22+ messages in thread
From: Harris, James R @ 2017-11-14 15:42 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 3638 bytes --]

Agreed - #2 is the way to go.

While we’re on the topic - should we be resetting the stats when spdk_bdev_get_io_stat() is called?  This eliminates the ability for two separate applications to monitor stats in parallel – or even a separate application plus some future yet-to-be-written internal load-balancing monitor.  I’m thinking that bdev should just keep the running total and let the caller own tracking differences from the last time it called spdk_bdev_get_io_stat().
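
A minimal sketch of the consumer side under that model (names here are assumptions): each monitor keeps its own previous snapshot, so two of them can poll the same running totals without stepping on each other.

    #include <stdint.h>

    struct monitor {
        uint64_t prev_read_ops;   /* this consumer's last observed running total */
    };

    /* current_read_ops is the running total reported by the bdev layer. */
    static uint64_t
    reads_since_last_poll(struct monitor *m, uint64_t current_read_ops)
    {
        uint64_t delta = current_read_ops - m->prev_read_ops;

        m->prev_read_ops = current_read_ops;
        return delta;
    }

An RPC client and an internal load-balancing monitor would each hold their own struct monitor, which is exactly what the auto-reset behavior prevents today.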

-Jim


From: SPDK <spdk-bounces(a)lists.01.org> on behalf of Nathan Marushak <nathan.marushak(a)intel.com>
Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org>
Date: Tuesday, November 14, 2017 at 7:29 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: Re: [SPDK] Request for comments regarding latency measurements

So, not being the technical expert so to speak ☺, please take this with a grain of salt. Agree with option #2. Seems like this would allow for keeping stats across the different device types e.g. NVMe, NVML, NVMe-oF (although that has a transport within NVMe, so it might be covered for either option).

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Luse, Paul E
Sent: Tuesday, November 14, 2017 7:57 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: Re: [SPDK] Request for comments regarding latency measurements

FWIW option 2 sounds like the most appropriate to me…

Thx
Paul

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Paul Von-Stamwitz
Sent: Monday, November 13, 2017 7:29 PM
To: spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>
Subject: [SPDK] Request for comments regarding latency measurements

Background:
Bdev.c currently maintains a count of reads/writes/bytes_read/bytes_written in channel->stat. This information is retrieved (and reset) via spdk_bdev_get_io_stat.

Proposal:
Add latency information to channel->stat and enable the option to provide a histogram.
We can measure the latency of each IO and keep a running total on a read/write basis. We can also use the measured latency to keep a running count of reads/writes in their associated histogram “buckets”.
The question is, how do we measure latency?

Option 1:
Measure latency at the NVMe tracker.
Currently, we already timestamp every command placed on the tracker if the abort timeout callback is registered. When we remove a completed IO from the tracker, we can timestamp it again and calculate the latency.
There are several issues here that need to be considered.
We need to get the latency information back up to the bdev layer, most likely through a callback argument, but this would require a change to the NVMe API. If the bdev breaks down a request into smaller IOs, it can add up the latencies of the child IOs for io_stat and histogram purposes.
Also, this method does not take into account any latency added by the SPDK bdev/nvme layers (except the poller). If a request was queued before being placed on the tracker, then the time it was queued is not factored into the latency calculation.

Option 2:
Measure latency at the bdev layer.
We can timestamp at submit and again at completion. This would keep all io_stat information local to bdev and would take into account the overhead of most queued operations. Any applications written directly to the NVMe layer would have to calculate their own latencies, but that is currently true for all io_stats.

I’m sure that there are other issues I am missing, but I would appreciate any comments on how best to move forward on this.

Thanks,
Paul


[-- Attachment #2: attachment.html --]
[-- Type: text/html, Size: 9878 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [SPDK] Request for comments regarding latency measurements
@ 2017-11-14 15:29 Marushak, Nathan
  0 siblings, 0 replies; 22+ messages in thread
From: Marushak, Nathan @ 2017-11-14 15:29 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 2780 bytes --]

So, not being the technical expert so to speak :), please take this with a grain of salt. Agree with option #2. Seems like this would allow for keeping stats across the different device types e.g. NVMe, NVML, NVMe-oF (although that has a transport within NVMe, so it might be covered for either option).

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Luse, Paul E
Sent: Tuesday, November 14, 2017 7:57 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: Re: [SPDK] Request for comments regarding latency measurements

FWIW option 2 sounds like the most appropriate to me...

Thx
Paul

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Paul Von-Stamwitz
Sent: Monday, November 13, 2017 7:29 PM
To: spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>
Subject: [SPDK] Request for comments regarding latency measurements

Background:
Bdev.c currently maintains a count of reads/writes/bytes_read/bytes_written in channel->stat. This information is retrieved (and reset) via spdk_bdev_get_io_stat.

Proposal:
Add latency information to channel->stat and enable the option to provide a histogram.
We can measure the latency of each IO and keep a running total on a read/write basis. We can also use the measured latency to keep a running count of reads/writes in their associated histogram "buckets".
The question is, how do we measure latency?

Option 1:
Measure latency at the NVMe tracker.
Currently, we already timestamp every command placed on the tracker if the abort timeout callback is registered. When we remove a completed IO from the tracker, we can timestamp it again and calculate the latency.
There are several issues here that need to be considered.
We need to get the latency information back up to the bdev layer, most likely through a callback argument, but this would require a change to the NVMe API. If the bdev breaks down a request into smaller IOs, it can add up the latencies of the child IOs for io_stat and histogram purposes.
Also, this method does not take into account any latency added by the SPDK bdev/nvme layers (except the poller). If a request was queued before being placed on the tracker, then the time it was queued is not factored into the latency calculation.

Option 2:
Measure latency at the bdev layer.
We can timestamp at submit and again at completion. This would keep all io_stat information local to bdev and would take into account the overhead of most queued operations. Any applications written directly to the NVMe layer would have to calculate their own latencies, but that is currently true for all io_stats.

I'm sure that there are other issues I am missing, but I would appreciate any comments on how best to move forward on this.

Thanks,
Paul


[-- Attachment #2: attachment.html --]
[-- Type: text/html, Size: 7268 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [SPDK] Request for comments regarding latency measurements
@ 2017-11-14  2:29 Paul Von-Stamwitz
  0 siblings, 0 replies; 22+ messages in thread
From: Paul Von-Stamwitz @ 2017-11-14  2:29 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 1901 bytes --]

Background:
Bdev.c currently maintains a count of reads/writes/bytes_read/bytes_written in channel->stat. This information is retrieved (and reset) via spdk_bdev_get_io_stat.

Proposal:
Add latency information to channel->stat and enable the option to provide a histogram.
We can measure the latency of each IO and keep a running total on a read/write basis. We can also use the measured latency to keep a running count of reads/writes in their associated histogram "buckets".
The question is, how do we measure latency?

Option 1:
Measure latency at the NVMe tracker.
Currently, we already timestamp every command placed on the tracker if the abort timeout callback is registered. When we remove a completed IO from the tracker, we can timestamp it again and calculate the latency.
There are several issues here that need to be considered.
We need to get the latency information back up to the bdev layer, most likely through a callback argument, but this would require a change to the NVMe API. If the bdev breaks down a request into smaller IOs, it can add up the latencies of the child IOs for io_stat and histogram purposes.
Also, this method does not take into account any latency added by the SPDK bdev/nvme layers (except the poller). If a request was queued before being placed on the tracker, then the time it was queued is not factored into the latency calculation.

Option 2:
Measure latency at the bdev layer.
We can timestamp at submit and again at completion. This would keep all io_stat information local to bdev and would take into account the overhead of most queued operations. Any applications written directly to the NVMe layer would have to calculate their own latencies, but that is currently true for all io_stats.
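
As a rough illustration only (struct and field names below are assumptions, not existing SPDK API; the timestamps themselves would come from spdk_get_ticks() in the env layer), the bdev-layer bookkeeping could look something like this:

    #include <stdbool.h>
    #include <stdint.h>

    #define LATENCY_BUCKETS 32

    struct latency_stat {
        uint64_t read_latency_ticks;              /* running total of read latency */
        uint64_t write_latency_ticks;             /* running total of write latency */
        uint64_t read_buckets[LATENCY_BUCKETS];   /* read count per log2(ticks) bucket */
        uint64_t write_buckets[LATENCY_BUCKETS];  /* write count per log2(ticks) bucket */
    };

    static uint32_t
    latency_bucket(uint64_t ticks)
    {
        /* bucket index = floor(log2(ticks)), clamped to the table size */
        uint32_t b = ticks ? (uint32_t)(63 - __builtin_clzll(ticks)) : 0;

        return b < LATENCY_BUCKETS ? b : LATENCY_BUCKETS - 1;
    }

    /* Called at completion with the tick counts taken at submit and at complete. */
    static void
    latency_stat_update(struct latency_stat *stat, bool is_read,
                        uint64_t submit_tsc, uint64_t complete_tsc)
    {
        uint64_t ticks = complete_tsc - submit_tsc;

        if (is_read) {
            stat->read_latency_ticks += ticks;
            stat->read_buckets[latency_bucket(ticks)]++;
        } else {
            stat->write_latency_ticks += ticks;
            stat->write_buckets[latency_bucket(ticks)]++;
        }
    }

The running totals divided by the existing op counts give average read/write latency, and the buckets give the histogram.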

I'm sure that there are other issues I am missing, but I would appreciate any comments on how best to move forward on this.

Thanks,
Paul


[-- Attachment #2: attachment.html --]
[-- Type: text/html, Size: 5762 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2017-12-08  1:05 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-14 14:56 [SPDK] Request for comments regarding latency measurements Luse, Paul E
  -- strict thread matches above, loose matches on Subject: below --
2017-12-08  1:05 Sreeni Busam
2017-12-08  0:45 Isaac Otsiabah
2017-12-07 23:06 Sreeni Busam
2017-12-06 23:48 Sreeni Busam
2017-12-06 23:43 Paul Von-Stamwitz
2017-12-06 23:21 Sreeni Busam
2017-12-06 23:10 Paul Von-Stamwitz
2017-11-14 22:33 Paul Von-Stamwitz
2017-11-14 22:30 Harris, James R
2017-11-14 22:28 Paul Von-Stamwitz
2017-11-14 22:22 Harris, James R
2017-11-14 22:16 Paul Von-Stamwitz
2017-11-14 19:20 Sreeni Busam
2017-11-14 18:46 Paul Von-Stamwitz
2017-11-14 18:27 Sreeni Busam
2017-11-14 16:36 Walker, Benjamin
2017-11-14 16:00 Harris, James R
2017-11-14 15:50 Luse, Paul E
2017-11-14 15:42 Harris, James R
2017-11-14 15:29 Marushak, Nathan
2017-11-14  2:29 Paul Von-Stamwitz

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.