* Re: [SPDK] bdev: Added latency to channel statistics patch testing
@ 2018-01-18 0:52 Isaac Otsiabah
From: Isaac Otsiabah @ 2018-01-18 0:52 UTC (permalink / raw)
To: spdk
Paul, here is a sample test with qdepth = 128; it is not too bad (although I don’t know why the first 5 lines came out that way). I also found that the way I printed out yesterday’s result was not a good way to present the data, because the print statement had a huge effect on the test.
[root(a)spdk3 spdk]# ./test/lib/bdev/bdevperf/bdevperf -s 4096 -q 128 -m 0x01 -w write -t 60 -c /usr/local/etc/spdk/iscsi.conf
Starting DPDK 17.11.0 initialization...
[ DPDK EAL parameters: bdevtest -c 0x01 -m 1024 --file-prefix=spdk_pid26447 ]
EAL: Detected 16 lcore(s)
EAL: No free hugepages reported in hugepages-1048576kB
EAL: Probing VFIO support...
EAL: VFIO support initialized
Total cores available: 1
Occupied cpu socket mask is 0x1
reactor.c: 425:_spdk_reactor_run: *NOTICE*: Reactor started on core 0 on socket 0
copy_engine_ioat.c: 305:copy_engine_ioat_init: *NOTICE*: Ioat Copy Engine Offload Enabled
EAL: PCI device 0000:04:00.0 on NUMA socket 0
EAL: probe driver: 8086:953 spdk_nvme
EAL: using IOMMU type 1 (Type 1)
EAL: Releasing pci mapped resource for 0000:04:00.0
EAL: Calling pci_unmap_resource for 0000:04:00.0 at 0x7f7329600000
EAL: PCI device 0000:05:00.0 on NUMA socket 0
EAL: probe driver: 8086:953 spdk_nvme
EAL: using IOMMU type 1 (Type 1)
Running I/O for 60 seconds...
num_read_ops bytes_read read_latency num_write_ops bytes_written write_latency
------------ ---------- ------------ ------------- ------------- -------------
0 0 0.0000 94570 387358720 1351.1851
0 0 0.0000 99974 409493504 1279.4091
0 0 0.0000 114813 470274048 1116.5582
0 0 0.0000 92996 380911616 1371.9102
0 0 0.0000 83009 340004864 1545.4854
0 0 0.0000 309289 1266847744 414.2505
0 0 0.0000 401524 1644642304 318.8416
0 0 0.0000 408972 1675149312 312.8732
0 0 0.0000 402966 1650548736 317.6119
0 0 0.0000 400038 1638555648 319.8758
0 0 0.0000 404605 1657262080 316.3150
0 0 0.0000 403147 1651290112 317.4082
0 0 0.0000 400947 1642278912 319.1372
0 0 0.0000 404645 1657425920 316.3060
0 0 0.0000 405415 1660579840 315.6657
0 0 0.0000 398918 1633968128 320.7656
0 0 0.0000 402784 1649803264 317.7172
0 0 0.0000 380796 1559740416 335.9359
0 0 0.0000 394401 1615466496 324.5751
0 0 0.0000 404962 1658724352 316.0541
0 0 0.0000 408758 1674272768 313.0753
0 0 0.0000 396287 1623191552 322.8262
0 0 0.0000 398845 1633669120 320.9502
0 0 0.0000 405815 1662218240 315.2997
0 0 0.0000 405171 1659580416 315.6694
0 0 0.0000 409331 1676619776 312.8568
0 0 0.0000 404625 1657344000 316.1800
0 0 0.0000 401002 1642504192 319.1404
0 0 0.0000 408305 1672417280 313.4192
0 0 0.0000 399390 1635901440 320.4554
0 0 0.0000 401766 1645633536 318.5347
0 0 0.0000 402426 1648336896 317.9712
0 0 0.0000 411577 1685819392 310.9624
0 0 0.0000 405196 1659682816 315.8021
0 0 0.0000 394127 1614344192 324.6732
0 0 0.0000 403309 1651953664 317.3189
0 0 0.0000 418917 1715884032 305.1955
0 0 0.0000 397012 1626161152 322.6149
0 0 0.0000 393817 1613074432 325.0013
0 0 0.0000 404538 1656987648 316.3576
0 0 0.0000 410705 1682247680 311.5600
0 0 0.0000 397694 1628954624 321.8006
0 0 0.0000 403170 1651384320 317.3719
0 0 0.0000 410538 1681563648 311.7305
0 0 0.0000 412626 1690116096 310.1745
0 0 0.0000 397344 1627521024 322.0545
0 0 0.0000 402063 1646850048 318.0674
0 0 0.0000 409945 1679134720 312.3769
0 0 0.0000 402902 1650286592 317.6064
0 0 0.0000 402773 1649758208 317.7294
0 0 0.0000 399598 1636753408 320.2229
0 0 0.0000 407974 1671061504 313.7196
0 0 0.0000 404373 1656311808 316.4473
0 0 0.0000 401165 1643171840 318.9727
0 0 0.0000 399465 1636208640 320.3532
0 0 0.0000 402221 1647497216 318.1874
0 0 0.0000 405536 1661075456 315.5483
0 0 0.0000 407820 1670430720 313.8303
0 0 0.0000 395988 1621966848 323.1163
Logical core: 0
Nvme1n1 : 376042.50 IO/s 1468.92 MB/s
=====================================================
Total : 376042.50 IO/s 1468.92 MB/s
From: Isaac Otsiabah
Sent: Tuesday, January 16, 2018 11:29 AM
To: 'Harris, James R' <james.r.harris(a)intel.com>; 'spdk(a)lists.01.org' <spdk(a)lists.01.org>
Cc: Paul Von-Stamwitz <PVonStamwitz(a)us.fujitsu.com>; Edward Yang <eyang(a)us.fujitsu.com>; Isaac Otsiabah <IOtsiabah(a)us.fujitsu.com>
Subject: RE: bdev: Added latency to channel statistics patch testing
Hi Jim, Paul,
1. I used spdk_io_channel_get_thread(ch) to obtain the thread pointer and then used it to obtain the cached tsc value. This was the result using test/lib/bdev/bdevperf/bdevperf -s 4096 -q 1 -m 0x01 -w write -t 60 -c /usr/local/etc/spdk/iscsi.conf for write operations at varied qdepth:
Num of core = 1, Devices = 1, runtime = 60 sec, operation = write
                 Cached tsc implementation                 Old implementation
                 (th = spdk_io_channel_get_thread(ch),     (not caching spdk_get_ticks() values)
                  tsc = spdk_get_thread_tsc(th))
Test#   Qdepth   Num of ops    Latency (us)                Num of ops    Latency (us)
-----   ------   ----------    ------------                ----------    ------------
1       1         6907524        8.686                      6727646        8.849
2       2        10243559       11.714                     10298056       11.583
3       4        21901367       11.113                     22210442       10.736
4       8        23422354       20.493                     23059073       20.747
5       16       22208910       43.226                     22940400       41.778
6       32       22644871       84.787                     25266787       75.920
7       64       23435674      163.853                     23995579      159.959
8       128      23144750      331.825                     24022691      319.627
2. With this implementation, I found that at higher qdepths (qdepth = 32, 64, 128) the same cached tsc value is used as the timestamp for several I/O requests. Here is a captured sample.
tsc value in spdk_bdev_io_complete() tsc value in spdk_bdev_io_submit(..)
complete_tsc = 13715333455706062 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455708426 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455708426 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455713392 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455715831 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455718269 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455720788 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455723360 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455725594 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455728240 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455730711 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455733099 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455735591 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455738065 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455740592 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
Isaac
From: Harris, James R [mailto:james.r.harris(a)intel.com]
Sent: Thursday, January 11, 2018 8:00 PM
To: Isaac Otsiabah <IOtsiabah(a)us.fujitsu.com>; 'spdk(a)lists.01.org' <spdk(a)lists.01.org>
Cc: Paul Von-Stamwitz <PVonStamwitz(a)us.fujitsu.com>; Edward Yang <eyang(a)us.fujitsu.com>
Subject: Re: bdev: Added latency to channel statistics patch testing
Hi Isaac,
Thanks for your detailed analysis.
For #1 and #2 – spdk_io_channel_get_thread() can be used instead which does not require a lock – each io_channel is associated with the thread it was allocated from and this gets stored in the spdk_io_channel structure.
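A minimal sketch of that suggestion, assuming the spdk_get_thread_tsc() accessor Isaac experiments with elsewhere in this thread (a proposed addition, not existing SPDK API); the helper name is illustrative only:

#include "spdk/io_channel.h"

/* Proposed in this thread, not existing SPDK API: returns the tsc value cached
 * for the given thread by the reactor loop. */
uint64_t spdk_get_thread_tsc(struct spdk_thread *thread);

/* Fetch the cached tsc for the thread that owns this channel without taking a
 * lock: spdk_io_channel_get_thread() just returns the thread pointer that was
 * stored in the channel when it was allocated. */
static uint64_t
bdev_channel_cached_tsc(struct spdk_io_channel *ch)
{
        struct spdk_thread *th = spdk_io_channel_get_thread(ch);

        return spdk_get_thread_tsc(th);
}

/* e.g. in spdk_bdev_io_submit(): bdev_io->submit_tsc = bdev_channel_cached_tsc(ch); */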
Could you re-run your data from #4 with a newer implementation based on using spdk_io_channel_get_thread() instead of spdk_get_thread()?
-Jim
From: Isaac Otsiabah <IOtsiabah(a)us.fujitsu.com>
Date: Wednesday, January 10, 2018 at 3:54 PM
To: James Harris <james.r.harris(a)intel.com>, "'spdk(a)lists.01.org'" <spdk(a)lists.01.org>
Cc: Paul Von-Stamwitz <PVonStamwitz(a)us.fujitsu.com>, Edward Yang <eyang(a)us.fujitsu.com>, Isaac Otsiabah <IOtsiabah(a)us.fujitsu.com>
Subject: RE: bdev: Added latency to channel statistics patch testing
Hi Jim, how are you and happy new year? We investigated the cached tsc value suggestion and found the following.
1. To retrieve the cached tsc value from the spdk_thread structure, spdk_thread_get_ticks() would be implemented like:

uint64_t spdk_thread_get_ticks(void)
{
        struct spdk_thread *thread = spdk_get_thread();

        if (thread) {
                return thread->tsc;
        }
        return 0;
}

The issue is that spdk_get_thread() acquires a lock each time it is called.
2. Saving the tsc value from the _spdk_reactor_run() function requires either (a) acquiring the lock at least once to get the address of the tsc variable inside spdk_thread and saving that address in a pointer variable, or (b) using a function that repeatedly acquires the lock each time it needs to save the value.
3. It is also possible that a cached value is used to process more than one I/O command, which can result in erroneously low latencies at higher qdepths.
4. I compared the results of the spdk_thread_get_ticks() implementation with just calling spdk_get_ticks() (not caching the tsc) using spdk/test/lib/bdev/bdevperf/bdevperf. The results showed lower latency at higher qdepths.
Num of core = 1, Devices = 1, runtime = 60 sec, operation = write
                 Cached tsc implementation                 Old implementation
                 (spdk_thread_get_ticks())                 (not caching spdk_get_ticks() values)
Test#   Qdepth   Num of ops    Latency (us)                Num of ops    Latency (us)
-----   ------   ----------    ------------                ----------    ------------
1       1         6383242        9.399                      6638771        8.972
2       2        10309024       11.640                     10260596       11.630
3       4        21594909       11.113                     22122622       10.783
4       8        22333247       21.492                     22187344       21.569
5       16       22584995       42.506                     21974123       43.622
6       32       23407821       82.024                     23418909       81.919
7       64       23745706      161.713                     23039526      166.604
8       128      25408188      302.265                     23890885      321.395
Conclusion: Given these issues, what if we keep the latency measurement implementation we submitted as is, and add an RPC command that enables/disables the latency measurement logic? That way, applications running production code will not pay for the time taken to execute the latency measurement logic. On the other hand, if a user needs to quickly measure latency, they can enable latency measurement with the RPC command, take the measurement, and then disable it when done. What do you think?
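A minimal sketch of the RPC-gated approach described above, assuming a hypothetical enable flag and helper (none of these names are from the submitted patch):

#include <stdbool.h>
#include <stdint.h>
#include "spdk/env.h"

/* Toggled by a hypothetical enable/disable RPC. */
static bool g_bdev_latency_enabled = false;

/* Only pay for a tsc read when measurement is enabled. */
static inline uint64_t
bdev_latency_tsc(void)
{
        return g_bdev_latency_enabled ? spdk_get_ticks() : 0;
}

/* In spdk_bdev_io_submit():
 *         bdev_io->submit_tsc = bdev_latency_tsc();
 * In spdk_bdev_io_complete(), only accumulate when enabled:
 *         if (g_bdev_latency_enabled) {
 *                 latency_ticks += spdk_get_ticks() - bdev_io->submit_tsc;
 *         }
 */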
Isaac
From: Harris, James R [mailto:james.r.harris(a)intel.com]
Sent: Tuesday, December 19, 2017 3:47 PM
To: Isaac Otsiabah <IOtsiabah(a)us.fujitsu.com>; Verkamp, Daniel <daniel.verkamp(a)intel.com>
Cc: Paul Von-Stamwitz <PVonStamwitz(a)us.fujitsu.com>; Edward Yang <eyang(a)us.fujitsu.com>
Subject: Re: bdev: Added latency to channel statistics patch testing
Hi Isaac,
Sorry about the delay. Since this failed the test pool it didn’t get through my review filter. We do have some intermittent failures in the test pool that we are debugging – if one of these ever hits you, please do not hesitate to send an e-mail and we will be happy to re-run it for you.
One concern I have is that this adds two get_ticks calls on each I/O. And for cases like SPDK logical volumes – where the NVMe namespace is a bdev, and then there is a logical volume bdev on top of it – it means four get_ticks calls for each I/O submitted to the logical volume.
The get_ticks() calls are relatively expensive – on my E5 v4 Xeon system, each get_ticks() takes about 11ns. So if we do 1M IO/s on one core with SPDK logical volumes on my system, this would be 4 * 1M * 11ns = 44ms of CPU time per second (or 4.4% of the CPU cycles) spent just on these get_ticks() calls.
Ben and I were talking about this in the lab and have an idea. In the main reactor loop, we already need to do a get_ticks() call each time through the loop. We could easily save this value in the spdk_thread structure. Then when the bdev layer needs it for something like I/O statistics, it could use that value, instead of calling get_ticks() again. We would call this something like spdk_thread_get_ticks(). The downside is that value would not be 100% precise – it would resolve to when the poller function was called, which may be 100-200ns off.
We could also provide an option with an RPC/conf file parameter so that a call to spdk_thread_get_ticks() would return a precise tick count instead of the cached value. Then Fujitsu could enable this to get precise values while the default would use the cached values.
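A compilable sketch of the caching scheme from the two paragraphs above; in real SPDK the tsc cache would live in struct spdk_thread and be filled in by _spdk_reactor_run(), but here file-scope variables stand in for it, and g_precise_ticks stands in for the proposed RPC/conf-file option (none of these names are existing SPDK API):

#include <stdbool.h>
#include <stdint.h>
#include "spdk/env.h"

static uint64_t g_cached_tsc;            /* stand-in for the per-thread cache */
static bool     g_precise_ticks = false; /* stand-in for the RPC/conf option  */

/* Called once per reactor loop iteration, where spdk_get_ticks() is already
 * being read anyway. */
static void
reactor_cache_ticks(void)
{
        g_cached_tsc = spdk_get_ticks();
}

/* What spdk_thread_get_ticks() would return to the bdev layer. */
static uint64_t
thread_get_ticks(void)
{
        if (g_precise_ticks) {
                return spdk_get_ticks();  /* precise reading on demand */
        }
        /* Default: cached value, at most one poller iteration (~100-200 ns) stale. */
        return g_cached_tsc;
}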
What do you think? If this would suit Fujitsu’s needs, I would be happy to put a patch out for review.
-Jim
From: Isaac Otsiabah <IOtsiabah(a)us.fujitsu.com>
Date: Tuesday, December 19, 2017 at 2:51 PM
To: James Harris <james.r.harris(a)intel.com>, Daniel Verkamp <daniel.verkamp(a)intel.com>
Cc: Paul Von-Stamwitz <PVonStamwitz(a)us.fujitsu.com>, Isaac Otsiabah <IOtsiabah(a)us.fujitsu.com>, Edward Yang <eyang(a)us.fujitsu.com>
Subject: FW: bdev: Added latency to channel statistics patch testing
Hello Jim, how are you? Please, can we get some movement on this patch to move it forward? The changes in lib/bdev/bdev.c were very small. I tested it with the bdev fio_plugin and the latency results were comparable to the fio results. One thing is, the build and test on machine fedora-06 kept failing but passed on all other machines. I do not see a reason for it to fail on fedora-06 since the changes were not platform related. Please, can you help us move this forward? Thank you.
Isaac
From: Isaac Otsiabah
Sent: Tuesday, December 12, 2017 12:59 PM
To: 'daniel.verkamp(a)intel.com' <daniel.verkamp(a)intel.com>; 'Harris, James R' <james.r.harris(a)intel.com>
Cc: Paul Von-Stamwitz <PVonStamwitz(a)us.fujitsu.com>; Isaac Otsiabah <IOtsiabah(a)us.fujitsu.com>; Edward Yang <eyang(a)us.fujitsu.com>
Subject: bdev: Added latency to channel statistics patch testing
Hi Daniel, this is a bdev fio_plugin test that compares fio latency measurements with the latency measurements from the https://review.gerrithub.io/#/c/390654 (bdev: Added latency to channel statistics) patch. As expected, the latency measurements from the patch are comparable to the fio measurements.
Test run#   Qdepth   Op      fio clat (us)   fio avg latency (us)   bdev latency (us) from spdk_bdev_get_io_stat(..)
---------   ------   -----   -------------   --------------------   -------------------------------------------------
1           2        write     7.80            8.52                   8.56
                     read     95.36           96.06                  97.138
2           4        write     7.98            8.70                   8.32
                     read    133.88          134.59                 128.85
3           8        write     8.83            9.85                  10.87
                     read    175.61          176.48                 180.66
4           16       write     9.79           10.81                  10.282
                     read    240.71          241.6                  236.913
5           32       write    11.87           12.88                  12.384
                     read    329.8           330.67                 327.648
6           64       write    20.64           21                     20.707
                     read    471.02          471.91                 467.118
7           128      write   187.53          188.57                 182.92
                     read    704.93          705.81                 697.49
Isaac
* Re: [SPDK] bdev: Added latency to channel statistics patch testing
@ 2018-01-18 18:58 Isaac Otsiabah
From: Isaac Otsiabah @ 2018-01-18 18:58 UTC (permalink / raw)
To: spdk
Paul, here are the results.
1. Using:
        th = spdk_io_channel_get_thread(ch);
        tsc = spdk_get_thread_tsc(th);
[root(a)spdk3 spdk]# ./test/lib/bdev/bdevperf/bdevperf -s 4096 -q 128 -m 0x01 -w write -t 60 -c /usr/local/etc/spdk/iscsi.conf
num_read_ops bytes_read read_latency num_write_ops bytes_written write_latency
------------ ---------- ------------ ------------- ------------- -------------
0 0 0.0000 435578 1784127488 293.8193
0 0 0.0000 428631 1755672576 298.5830
0 0 0.0000 423610 1735106560 302.2442
0 0 0.0000 430952 1765179392 296.9787
0 0 0.0000 426278 1746034688 300.2768
0 0 0.0000 424090 1737072640 301.8508
0 0 0.0000 426037 1745047552 300.4211
0 0 0.0000 420957 1724239872 304.0467
0 0 0.0000 434842 1781112832 294.3727
0 0 0.0000 429074 1757487104 298.3152
0 0 0.0000 425031 1740926976 301.1486
0 0 0.0000 428866 1756635136 298.4947
0 0 0.0000 423979 1736617984 300.7845
0 0 0.0000 433749 1776635904 296.1650
0 0 0.0000 424780 1739898880 301.3642
0 0 0.0000 425455 1742663680 300.8231
0 0 0.0000 427162 1749655552 299.6304
0 0 0.0000 424388 1738293248 301.6453
0 0 0.0000 428316 1754382336 298.8146
0 0 0.0000 425336 1742176256 300.9551
0 0 0.0000 421322 1725734912 303.8081
0 0 0.0000 423862 1736138752 302.0131
0 0 0.0000 430603 1763749888 297.2477
0 0 0.0000 427503 1751052288 299.3274
0 0 0.0000 420815 1723658240 304.2671
0 0 0.0000 422581 1730891776 302.8817
0 0 0.0000 424177 1737428992 301.7465
0 0 0.0000 428037 1753239552 299.0869
0 0 0.0000 427170 1749688320 299.6150
0 0 0.0000 422081 1728843776 303.2310
0 0 0.0000 424933 1740525568 301.2518
0 0 0.0000 428345 1754501120 298.8024
0 0 0.0000 425661 1743507456 300.6915
0 0 0.0000 430579 1763651584 297.3311
0 0 0.0000 421490 1726423040 303.6662
0 0 0.0000 426280 1746042880 300.2608
0 0 0.0000 421358 1725882368 303.7821
0 0 0.0000 422450 1730355200 303.0324
0 0 0.0000 427154 1749622784 299.6184
0 0 0.0000 421889 1728057344 303.3832
0 0 0.0000 428059 1753329664 299.0453
0 0 0.0000 429738 1760206848 297.8697
0 0 0.0000 421209 1725272064 303.8654
0 0 0.0000 425380 1742356480 300.9057
0 0 0.0000 427348 1750417408 299.5208
0 0 0.0000 428519 1755213824 298.6997
0 0 0.0000 427863 1752526848 299.0434
0 0 0.0000 425101 1741213696 301.2247
0 0 0.0000 424648 1739358208 301.4586
0 0 0.0000 432119 1769959424 296.2284
0 0 0.0000 429455 1759047680 298.0216
0 0 0.0000 422632 1731100672 302.8620
0 0 0.0000 422966 1732468736 302.6300
0 0 0.0000 426696 1747746816 300.0038
0 0 0.0000 425899 1744482304 300.5288
0 0 0.0000 426041 1745063936 300.4273
0 0 0.0000 421260 1725480960 303.8689
0 0 0.0000 419565 1718538240 305.0604
0 0 0.0000 409050 1675468800 312.8973
Logical core: 0
Nvme1n1 : 425527.28 IO/s 1662.22 MB/s
=====================================================
Total : 425527.28 IO/s 1662.22 MB/s
2. Using spdk_get_ticks():
[root(a)spdk3 spdk]# ./test/lib/bdev/bdevperf/bdevperf -s 4096 -q 128 -m 0x01 -w write -t 60 -c /usr/local/etc/spdk/iscsi.conf
num_read_ops bytes_read read_latency num_write_ops bytes_written write_latency
------------ ---------- ------------ ------------- ------------- -------------
0 0 0.0000 134761 551981056 949.1912
0 0 0.0000 198945 814878720 643.4745
0 0 0.0000 305658 1251975168 418.7685
0 0 0.0000 440324 1803567104 290.4658
0 0 0.0000 428217 1753976832 299.0289
0 0 0.0000 428025 1753190400 298.9819
0 0 0.0000 426583 1747283968 299.9747
0 0 0.0000 421772 1727578112 303.3840
0 0 0.0000 433964 1777516544 294.9032
0 0 0.0000 425320 1742110720 300.8883
0 0 0.0000 423417 1734316032 302.2486
0 0 0.0000 425988 1744846848 300.3928
0 0 0.0000 424872 1740275712 301.2081
0 0 0.0000 431960 1769308160 296.1758
0 0 0.0000 423112 1733066752 301.8376
0 0 0.0000 425583 1743187968 301.3565
0 0 0.0000 430877 1764872192 296.9813
0 0 0.0000 427167 1749676032 299.5763
0 0 0.0000 430766 1764417536 297.0981
0 0 0.0000 427232 1749942272 299.4718
0 0 0.0000 423493 1734627328 302.2339
0 0 0.0000 432767 1772613632 295.6960
0 0 0.0000 428598 1755537408 298.5770
0 0 0.0000 426942 1748754432 299.6748
0 0 0.0000 427373 1750519808 299.4715
0 0 0.0000 420913 1724059648 304.0575
0 0 0.0000 432179 1770205184 296.0884
0 0 0.0000 429501 1759236096 297.9613
0 0 0.0000 423451 1734455296 302.2133
0 0 0.0000 430140 1761853440 297.4968
0 0 0.0000 421935 1728245760 303.2698
0 0 0.0000 435399 1783394304 292.8442
0 0 0.0000 426387 1746481152 301.2575
0 0 0.0000 422693 1731350528 302.6834
0 0 0.0000 430104 1761705984 297.4744
0 0 0.0000 426741 1747931136 299.9834
0 0 0.0000 433475 1775513600 295.1906
0 0 0.0000 423791 1735847936 301.9070
0 0 0.0000 425118 1741283328 301.1013
0 0 0.0000 430139 1761849344 297.4558
0 0 0.0000 428332 1754447872 298.8155
0 0 0.0000 431087 1765732352 296.8460
0 0 0.0000 426818 1748246528 299.8442
0 0 0.0000 419429 1717981184 305.0739
0 0 0.0000 425118 1741283328 301.0078
0 0 0.0000 429540 1759395840 297.9101
0 0 0.0000 426933 1748717568 299.7829
0 0 0.0000 427088 1749352448 299.6166
0 0 0.0000 423446 1734434816 302.2267
0 0 0.0000 433727 1776545792 295.0228
0 0 0.0000 427909 1752715264 299.0406
0 0 0.0000 427282 1750147072 299.5439
0 0 0.0000 430196 1762082816 297.4285
0 0 0.0000 422415 1730211840 302.9736
0 0 0.0000 433799 1776840704 295.0092
0 0 0.0000 427605 1751470080 299.2512
0 0 0.0000 426354 1746345984 300.1782
0 0 0.0000 429645 1759825920 297.8310
0 0 0.0000 412214 1688428544 310.4126
Logical core: 0
Nvme1n1 : 416261.88 IO/s 1626.02 MB/s
=====================================================
Total : 416261.88 IO/s 1626.02 MB/s
From: Isaac Otsiabah
Sent: Wednesday, January 17, 2018 4:53 PM
To: 'Harris, James R' <james.r.harris(a)intel.com>; 'spdk(a)lists.01.org' <spdk(a)lists.01.org>
Cc: Paul Von-Stamwitz <PVonStamwitz(a)us.fujitsu.com>; Edward Yang <eyang(a)us.fujitsu.com>
Subject: RE: bdev: Added latency to channel statistics patch testing
Paul, here is a sample test with qdepth = 128; it is not too bad (although I don’t know why the first 5 lines came out that way). I also found that the way I printed out yesterday’s result was not a good way to present the data, because the print statement had a huge effect on the test.
[root(a)spdk3 spdk]# ./test/lib/bdev/bdevperf/bdevperf -s 4096 -q 128 -m 0x01 -w write -t 60 -c /usr/local/etc/spdk/iscsi.conf
Starting DPDK 17.11.0 initialization...
[ DPDK EAL parameters: bdevtest -c 0x01 -m 1024 --file-prefix=spdk_pid26447 ]
EAL: Detected 16 lcore(s)
EAL: No free hugepages reported in hugepages-1048576kB
EAL: Probing VFIO support...
EAL: VFIO support initialized
Total cores available: 1
Occupied cpu socket mask is 0x1
reactor.c: 425:_spdk_reactor_run: *NOTICE*: Reactor started on core 0 on socket 0
copy_engine_ioat.c: 305:copy_engine_ioat_init: *NOTICE*: Ioat Copy Engine Offload Enabled
EAL: PCI device 0000:04:00.0 on NUMA socket 0
EAL: probe driver: 8086:953 spdk_nvme
EAL: using IOMMU type 1 (Type 1)
EAL: Releasing pci mapped resource for 0000:04:00.0
EAL: Calling pci_unmap_resource for 0000:04:00.0 at 0x7f7329600000
EAL: PCI device 0000:05:00.0 on NUMA socket 0
EAL: probe driver: 8086:953 spdk_nvme
EAL: using IOMMU type 1 (Type 1)
Running I/O for 60 seconds...
num_read_ops bytes_read read_latency num_write_ops bytes_written write_latency
------------ ---------- ------------ ------------- ------------- -------------
0 0 0.0000 94570 387358720 1351.1851
0 0 0.0000 99974 409493504 1279.4091
0 0 0.0000 114813 470274048 1116.5582
0 0 0.0000 92996 380911616 1371.9102
0 0 0.0000 83009 340004864 1545.4854
0 0 0.0000 309289 1266847744 414.2505
0 0 0.0000 401524 1644642304 318.8416
0 0 0.0000 408972 1675149312 312.8732
0 0 0.0000 402966 1650548736 317.6119
0 0 0.0000 400038 1638555648 319.8758
0 0 0.0000 404605 1657262080 316.3150
0 0 0.0000 403147 1651290112 317.4082
0 0 0.0000 400947 1642278912 319.1372
0 0 0.0000 404645 1657425920 316.3060
0 0 0.0000 405415 1660579840 315.6657
0 0 0.0000 398918 1633968128 320.7656
0 0 0.0000 402784 1649803264 317.7172
0 0 0.0000 380796 1559740416 335.9359
0 0 0.0000 394401 1615466496 324.5751
0 0 0.0000 404962 1658724352 316.0541
0 0 0.0000 408758 1674272768 313.0753
0 0 0.0000 396287 1623191552 322.8262
0 0 0.0000 398845 1633669120 320.9502
0 0 0.0000 405815 1662218240 315.2997
0 0 0.0000 405171 1659580416 315.6694
0 0 0.0000 409331 1676619776 312.8568
0 0 0.0000 404625 1657344000 316.1800
0 0 0.0000 401002 1642504192 319.1404
0 0 0.0000 408305 1672417280 313.4192
0 0 0.0000 399390 1635901440 320.4554
0 0 0.0000 401766 1645633536 318.5347
0 0 0.0000 402426 1648336896 317.9712
0 0 0.0000 411577 1685819392 310.9624
0 0 0.0000 405196 1659682816 315.8021
0 0 0.0000 394127 1614344192 324.6732
0 0 0.0000 403309 1651953664 317.3189
0 0 0.0000 418917 1715884032 305.1955
0 0 0.0000 397012 1626161152 322.6149
0 0 0.0000 393817 1613074432 325.0013
0 0 0.0000 404538 1656987648 316.3576
0 0 0.0000 410705 1682247680 311.5600
0 0 0.0000 397694 1628954624 321.8006
0 0 0.0000 403170 1651384320 317.3719
0 0 0.0000 410538 1681563648 311.7305
0 0 0.0000 412626 1690116096 310.1745
0 0 0.0000 397344 1627521024 322.0545
0 0 0.0000 402063 1646850048 318.0674
0 0 0.0000 409945 1679134720 312.3769
0 0 0.0000 402902 1650286592 317.6064
0 0 0.0000 402773 1649758208 317.7294
0 0 0.0000 399598 1636753408 320.2229
0 0 0.0000 407974 1671061504 313.7196
0 0 0.0000 404373 1656311808 316.4473
0 0 0.0000 401165 1643171840 318.9727
0 0 0.0000 399465 1636208640 320.3532
0 0 0.0000 402221 1647497216 318.1874
0 0 0.0000 405536 1661075456 315.5483
0 0 0.0000 407820 1670430720 313.8303
0 0 0.0000 395988 1621966848 323.1163
Logical core: 0
Nvme1n1 : 376042.50 IO/s 1468.92 MB/s
=====================================================
Total : 376042.50 IO/s 1468.92 MB/s
From: Isaac Otsiabah
Sent: Tuesday, January 16, 2018 11:29 AM
To: 'Harris, James R' <james.r.harris(a)intel.com>; 'spdk(a)lists.01.org' <spdk(a)lists.01.org>
Cc: Paul Von-Stamwitz <PVonStamwitz(a)us.fujitsu.com>; Edward Yang <eyang(a)us.fujitsu.com>; Isaac Otsiabah <IOtsiabah(a)us.fujitsu.com>
Subject: RE: bdev: Added latency to channel statistics patch testing
Hi Jim, Paul,
1. I used spdk_io_channel_get_thread(ch) to obtain the thread pointer and then used it to obtain the cached tsc value. This was the result using test/lib/bdev/bdevperf/bdevperf -s 4096 -q 1 -m 0x01 -w write -t 60 -c /usr/local/etc/spdk/iscsi.conf for write operations at varied qdepth:
Num of core = 1, Devices = 1, runtime = 60 sec, operation = write
                 Cached tsc implementation                 Old implementation
                 (th = spdk_io_channel_get_thread(ch),     (not caching spdk_get_ticks() values)
                  tsc = spdk_get_thread_tsc(th))
Test#   Qdepth   Num of ops    Latency (us)                Num of ops    Latency (us)
-----   ------   ----------    ------------                ----------    ------------
1       1         6907524        8.686                      6727646        8.849
2       2        10243559       11.714                     10298056       11.583
3       4        21901367       11.113                     22210442       10.736
4       8        23422354       20.493                     23059073       20.747
5       16       22208910       43.226                     22940400       41.778
6       32       22644871       84.787                     25266787       75.920
7       64       23435674      163.853                     23995579      159.959
8       128      23144750      331.825                     24022691      319.627
2. With this implementation, I found that at higher qdepths (qdepth = 32, 64, 128) the same cached tsc value is used as the timestamp for several I/O requests. Here is a captured sample.
tsc value in spdk_bdev_io_complete() tsc value in spdk_bdev_io_submit(..)
complete_tsc = 13715333455706062 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455708426 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455708426 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455713392 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455715831 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455718269 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455720788 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455723360 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455725594 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455728240 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455730711 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455733099 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455735591 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455738065 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455740592 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
Isaac
From: Harris, James R [mailto:james.r.harris(a)intel.com]
Sent: Thursday, January 11, 2018 8:00 PM
To: Isaac Otsiabah <IOtsiabah(a)us.fujitsu.com>; 'spdk(a)lists.01.org' <spdk(a)lists.01.org>
Cc: Paul Von-Stamwitz <PVonStamwitz(a)us.fujitsu.com>; Edward Yang <eyang(a)us.fujitsu.com>
Subject: Re: bdev: Added latency to channel statistics patch testing
Hi Isaac,
Thanks for your detailed analysis.
For #1 and #2 – spdk_io_channel_get_thread() can be used instead which does not require a lock – each io_channel is associated with the thread it was allocated from and this gets stored in the spdk_io_channel structure.
Could you re-run your data from #4 with a newer implementation based on using spdk_io_channel_get_thread() instead of spdk_get_thread()?
-Jim
From: Isaac Otsiabah <IOtsiabah(a)us.fujitsu.com>
Date: Wednesday, January 10, 2018 at 3:54 PM
To: James Harris <james.r.harris(a)intel.com>, "'spdk(a)lists.01.org'" <spdk(a)lists.01.org>
Cc: Paul Von-Stamwitz <PVonStamwitz(a)us.fujitsu.com>, Edward Yang <eyang(a)us.fujitsu.com>, Isaac Otsiabah <IOtsiabah(a)us.fujitsu.com>
Subject: RE: bdev: Added latency to channel statistics patch testing
Hi Jim, how are you and happy new year? We investigated the cached tsc value suggestion and found the following.
1. To retrieve the cached tsc value from the spdk_thread structure, spdk_thread_get_ticks() would be implemented like:

uint64_t spdk_thread_get_ticks(void)
{
        struct spdk_thread *thread = spdk_get_thread();

        if (thread) {
                return thread->tsc;
        }
        return 0;
}

The issue is that spdk_get_thread() acquires a lock each time it is called.
2. Saving the tsc value from the _spdk_reactor_run() function requires either (a) acquiring the lock at least once to get the address of the tsc variable inside spdk_thread and saving that address in a pointer variable, or (b) using a function that repeatedly acquires the lock each time it needs to save the value.
3. It is also possible that a cached value is used to process more than one I/O command, which can result in erroneously low latencies at higher qdepths.
4. I compared the results of the spdk_thread_get_ticks() implementation with just calling spdk_get_ticks() (not caching the tsc) using spdk/test/lib/bdev/bdevperf/bdevperf. The results showed lower latency at higher qdepths.
Num of core = 1, Devices = 1, runtime = 60 sec, operation = write
                 Cached tsc implementation                 Old implementation
                 (spdk_thread_get_ticks())                 (not caching spdk_get_ticks() values)
Test#   Qdepth   Num of ops    Latency (us)                Num of ops    Latency (us)
-----   ------   ----------    ------------                ----------    ------------
1       1         6383242        9.399                      6638771        8.972
2       2        10309024       11.640                     10260596       11.630
3       4        21594909       11.113                     22122622       10.783
4       8        22333247       21.492                     22187344       21.569
5       16       22584995       42.506                     21974123       43.622
6       32       23407821       82.024                     23418909       81.919
7       64       23745706      161.713                     23039526      166.604
8       128      25408188      302.265                     23890885      321.395
Conclusion: Given these issues, what if we keep the latency measurement implementation we submitted as is, and add an RPC command that enables/disables the latency measurement logic? That way, applications running production code will not pay for the time taken to execute the latency measurement logic. On the other hand, if a user needs to quickly measure latency, they can enable latency measurement with the RPC command, take the measurement, and then disable it when done. What do you think?
Isaac
From: Harris, James R [mailto:james.r.harris(a)intel.com]
Sent: Tuesday, December 19, 2017 3:47 PM
To: Isaac Otsiabah <IOtsiabah(a)us.fujitsu.com>; Verkamp, Daniel <daniel.verkamp(a)intel.com>
Cc: Paul Von-Stamwitz <PVonStamwitz(a)us.fujitsu.com>; Edward Yang <eyang(a)us.fujitsu.com>
Subject: Re: bdev: Added latency to channel statistics patch testing
Hi Isaac,
Sorry about the delay. Since this failed the test pool it didn’t get through my review filter. We do have some intermittent failures in the test pool that we are debugging – if one of these ever hits you, please do not hesitate to send an e-mail and we will be happy to re-run it for you.
One concern I have is that this adds two get_ticks calls on each I/O. And for cases like SPDK logical volumes – where the NVMe namespace is a bdev, and then there is a logical volume bdev on top of it – it means four get_ticks calls for each I/O submitted to the logical volume.
The get_ticks() calls are relatively expensive – on my E5 v4 Xeon system, each get_ticks() takes about 11ns. So if we do 1M IO/s on one core with SPDK logical volumes on my system, this would be 4 * 1M * 11ns = 44ms of CPU time per second (or 4.4% of the CPU cycles) spent just on these get_ticks() calls.
Ben and I were talking about this in the lab and have an idea. In the main reactor loop, we already need to do a get_ticks() call each time through the loop. We could easily save this value in the spdk_thread structure. Then when the bdev layer needs it for something like I/O statistics, it could use that value, instead of calling get_ticks() again. We would call this something like spdk_thread_get_ticks(). The downside is that value would not be 100% precise – it would resolve to when the poller function was called, which may be 100-200ns off.
We could also provide an option with an RPC/conf file parameter so that a call to spdk_thread_get_ticks() would return a precise tick count instead of the cached value. Then Fujitsu could enable this to get precise values while the default would use the cached values.
What do you think? If this would suit Fujitsu’s needs, I would be happy to put a patch out for review.
-Jim
From: Isaac Otsiabah <IOtsiabah(a)us.fujitsu.com>
Date: Tuesday, December 19, 2017 at 2:51 PM
To: James Harris <james.r.harris(a)intel.com>, Daniel Verkamp <daniel.verkamp(a)intel.com>
Cc: Paul Von-Stamwitz <PVonStamwitz(a)us.fujitsu.com>, Isaac Otsiabah <IOtsiabah(a)us.fujitsu.com>, Edward Yang <eyang(a)us.fujitsu.com>
Subject: FW: bdev: Added latency to channel statistics patch testing
Hello Jim, how are you? Please, can we get some movement on this patch to move it forward? The changes in lib/bdev/bdev.c were very small. I tested it with the bdev fio_plugin and the latency results were comparable to the fio results. One thing is, the build and test on machine fedora-06 kept failing but passed on all other machines. I do not see a reason for it to fail on fedora-06 since the changes were not platform related. Please, can you help us move this forward? Thank you.
Isaac
From: Isaac Otsiabah
Sent: Tuesday, December 12, 2017 12:59 PM
To: 'daniel.verkamp(a)intel.com' <daniel.verkamp(a)intel.com>; 'Harris, James R' <james.r.harris(a)intel.com>
Cc: Paul Von-Stamwitz <PVonStamwitz(a)us.fujitsu.com>; Isaac Otsiabah <IOtsiabah(a)us.fujitsu.com>; Edward Yang <eyang(a)us.fujitsu.com>
Subject: bdev: Added latency to channel statistics patch testing
Hi Daniel, this is a bdev fio_plugin test that compares fio latency measurements with the latency measurements from the https://review.gerrithub.io/#/c/390654 (bdev: Added latency to channel statistics) patch. As expected, the latency measurements from the patch are comparable to the fio measurements.
Test run#   Qdepth   Op      fio clat (us)   fio avg latency (us)   bdev latency (us) from spdk_bdev_get_io_stat(..)
---------   ------   -----   -------------   --------------------   -------------------------------------------------
1           2        write     7.80            8.52                   8.56
                     read     95.36           96.06                  97.138
2           4        write     7.98            8.70                   8.32
                     read    133.88          134.59                 128.85
3           8        write     8.83            9.85                  10.87
                     read    175.61          176.48                 180.66
4           16       write     9.79           10.81                  10.282
                     read    240.71          241.6                  236.913
5           32       write    11.87           12.88                  12.384
                     read    329.8           330.67                 327.648
6           64       write    20.64           21                     20.707
                     read    471.02          471.91                 467.118
7           128      write   187.53          188.57                 182.92
                     read    704.93          705.81                 697.49
Isaac
* Re: [SPDK] bdev: Added latency to channel statistics patch testing
@ 2018-01-16 19:29 Isaac Otsiabah
From: Isaac Otsiabah @ 2018-01-16 19:29 UTC (permalink / raw)
To: spdk
Hi Jim, Paul,
1. I used spdk_io_channel_get_thread(ch) to obtain the thread pointer and then used it to obtain the cached tsc value. This was the result using test/lib/bdev/bdevperf/bdevperf -s 4096 -q 1 -m 0x01 -w write -t 60 -c /usr/local/etc/spdk/iscsi.conf for write operations at varied qdepth:
Num of core = 1, Devices = 1, runtime = 60 sec, operation = write
                 Cached tsc implementation                 Old implementation
                 (th = spdk_io_channel_get_thread(ch),     (not caching spdk_get_ticks() values)
                  tsc = spdk_get_thread_tsc(th))
Test#   Qdepth   Num of ops    Latency (us)                Num of ops    Latency (us)
-----   ------   ----------    ------------                ----------    ------------
1       1         6907524        8.686                      6727646        8.849
2       2        10243559       11.714                     10298056       11.583
3       4        21901367       11.113                     22210442       10.736
4       8        23422354       20.493                     23059073       20.747
5       16       22208910       43.226                     22940400       41.778
6       32       22644871       84.787                     25266787       75.920
7       64       23435674      163.853                     23995579      159.959
8       128      23144750      331.825                     24022691      319.627
2. With this implementation, I found that at higher qdepths (qdepth = 32, 64, 128) the same cached tsc value is used as the timestamp for several I/O requests. Here is a captured sample.
tsc value in spdk_bdev_io_complete() tsc value in spdk_bdev_io_submit(..)
complete_tsc = 13715333455706062 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455708426 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455708426 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455713392 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455715831 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455718269 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455720788 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455723360 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455725594 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455728240 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455730711 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455733099 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455735591 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455738065 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455740592 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
complete_tsc = 13715333455743081 bdev_io->submit_tsc = 13715333454866625
Isaac
From: Harris, James R [mailto:james.r.harris(a)intel.com]
Sent: Thursday, January 11, 2018 8:00 PM
To: Isaac Otsiabah <IOtsiabah(a)us.fujitsu.com>; 'spdk(a)lists.01.org' <spdk(a)lists.01.org>
Cc: Paul Von-Stamwitz <PVonStamwitz(a)us.fujitsu.com>; Edward Yang <eyang(a)us.fujitsu.com>
Subject: Re: bdev: Added latency to channel statistics patch testing
Hi Isaac,
Thanks for your detailed analysis.
For #1 and #2 – spdk_io_channel_get_thread() can be used instead which does not require a lock – each io_channel is associated with the thread it was allocated from and this gets stored in the spdk_io_channel structure.
Could you re-run your data from #4 with a newer implementation based on using spdk_io_channel_get_thread() instead of spdk_get_thread()?
-Jim
From: Isaac Otsiabah <IOtsiabah(a)us.fujitsu.com>
Date: Wednesday, January 10, 2018 at 3:54 PM
To: James Harris <james.r.harris(a)intel.com>, "'spdk(a)lists.01.org'" <spdk(a)lists.01.org>
Cc: Paul Von-Stamwitz <PVonStamwitz(a)us.fujitsu.com>, Edward Yang <eyang(a)us.fujitsu.com>, Isaac Otsiabah <IOtsiabah(a)us.fujitsu.com>
Subject: RE: bdev: Added latency to channel statistics patch testing
Hi Jim, how are you and happy new year? We investigated the cached tsc value suggestion and found the following.
1. To retrieve the cached tsc value from the spdk_thread structure, spdk_thread_get_ticks() would be implemented like:

uint64_t spdk_thread_get_ticks(void)
{
        struct spdk_thread *thread = spdk_get_thread();

        if (thread) {
                return thread->tsc;
        }
        return 0;
}

The issue is that spdk_get_thread() acquires a lock each time it is called.
2. Saving the tsc value from the _spdk_reactor_run() function requires either (a) acquiring the lock at least once to get the address of the tsc variable inside spdk_thread and saving that address in a pointer variable, or (b) using a function that repeatedly acquires the lock each time it needs to save the value.
3. It is also possible that a cached value is used to process more than one I/O command, which can result in erroneously low latencies at higher qdepths.
4. I compared the results of the spdk_thread_get_ticks() implementation with just calling spdk_get_ticks() (not caching the tsc) using spdk/test/lib/bdev/bdevperf/bdevperf. The results showed lower latency at higher qdepths.
Num of core = 1, Devices = 1, runtime = 60 sec, operation = write
                 Cached tsc implementation                 Old implementation
                 (spdk_thread_get_ticks())                 (not caching spdk_get_ticks() values)
Test#   Qdepth   Num of ops    Latency (us)                Num of ops    Latency (us)
-----   ------   ----------    ------------                ----------    ------------
1       1         6383242        9.399                      6638771        8.972
2       2        10309024       11.640                     10260596       11.630
3       4        21594909       11.113                     22122622       10.783
4       8        22333247       21.492                     22187344       21.569
5       16       22584995       42.506                     21974123       43.622
6       32       23407821       82.024                     23418909       81.919
7       64       23745706      161.713                     23039526      166.604
8       128      25408188      302.265                     23890885      321.395
Conclusion: Given these issues, what if we keep the latency measurement implementation we submitted as is, and add an RPC command that enables/disables the latency measurement logic? That way, applications running production code will not pay for the time taken to execute the latency measurement logic. On the other hand, if a user needs to quickly measure latency, they can enable latency measurement with the RPC command, take the measurement, and then disable it when done. What do you think?
Isaac
From: Harris, James R [mailto:james.r.harris(a)intel.com]
Sent: Tuesday, December 19, 2017 3:47 PM
To: Isaac Otsiabah <IOtsiabah(a)us.fujitsu.com>; Verkamp, Daniel <daniel.verkamp(a)intel.com>
Cc: Paul Von-Stamwitz <PVonStamwitz(a)us.fujitsu.com>; Edward Yang <eyang(a)us.fujitsu.com>
Subject: Re: bdev: Added latency to channel statistics patch testing
Hi Isaac,
Sorry about the delay. Since this failed the test pool it didn’t get through my review filter. We do have some intermittent failures in the test pool that we are debugging – if one of these ever hits you, please do not hesitate to send an e-mail and we will be happy to re-run it for you.
One concern I have is that this adds two get_ticks calls on each I/O. And for cases like SPDK logical volumes – where the NVMe namespace is a bdev, and then there is a logical volume bdev on top of it – it means four get_ticks calls for each I/O submitted to the logical volume.
The get_ticks() calls are relatively expensive – on my E5 v4 Xeon system, each get_ticks() takes about 11ns. So if we do 1M IO/s on one core with SPDK logical volumes on my system, this would be 4 * 1M * 11ns = 44ms of CPU time per second (or 4.4% of the CPU cycles) spent just on these get_ticks() calls.
Ben and I were talking about this in the lab and have an idea. In the main reactor loop, we already need to do a get_ticks() call each time through the loop. We could easily save this value in the spdk_thread structure. Then when the bdev layer needs it for something like I/O statistics, it could use that value, instead of calling get_ticks() again. We would call this something like spdk_thread_get_ticks(). The downside is that value would not be 100% precise – it would resolve to when the poller function was called, which may be 100-200ns off.
We could also provide an option with an RPC/conf file parameter so that a call to spdk_thread_get_ticks() would return a precise tick count instead of the cached value. Then Fujitsu could enable this to get precise values while the default would use the cached values.
What do you think? If this would suit Fujitsu’s needs, I would be happy to put a patch out for review.
-Jim
From: Isaac Otsiabah <IOtsiabah(a)us.fujitsu.com>
Date: Tuesday, December 19, 2017 at 2:51 PM
To: James Harris <james.r.harris(a)intel.com>, Daniel Verkamp <daniel.verkamp(a)intel.com>
Cc: Paul Von-Stamwitz <PVonStamwitz(a)us.fujitsu.com>, Isaac Otsiabah <IOtsiabah(a)us.fujitsu.com>, Edward Yang <eyang(a)us.fujitsu.com>
Subject: FW: bdev: Added latency to channel statistics patch testing
Hello Jim, how are you? Please, can we get some movement on this patch to move it forward? The changes in lib/bdev/bdev.c were very small. I tested it with the bdev fio_plugin and the latency results were comparable to the fio results. One thing is, the build and test on machine fedora-06 kept failing but passed on all other machines. I do not see a reason for it to fail on fedora-06 since the changes were not platform related. Please, can you help us move this forward? Thank you.
Isaac
From: Isaac Otsiabah
Sent: Tuesday, December 12, 2017 12:59 PM
To: 'daniel.verkamp(a)intel.com' <daniel.verkamp(a)intel.com>; 'Harris, James R' <james.r.harris(a)intel.com>
Cc: Paul Von-Stamwitz <PVonStamwitz(a)us.fujitsu.com>; Isaac Otsiabah <IOtsiabah(a)us.fujitsu.com>; Edward Yang <eyang(a)us.fujitsu.com>
Subject: bdev: Added latency to channel statistics patch testing
Hi Daniel, this is a bdev fio_plugin test that compares fio latency measurements with the latency measurements from the https://review.gerrithub.io/#/c/390654 (bdev: Added latency to channel statistics) patch. As expected, the latency measurements from the patch are comparable to the fio measurements.
Test run#   Qdepth   Op      fio clat (us)   fio avg latency (us)   bdev latency (us) from spdk_bdev_get_io_stat(..)
---------   ------   -----   -------------   --------------------   -------------------------------------------------
1           2        write     7.80            8.52                   8.56
                     read     95.36           96.06                  97.138
2           4        write     7.98            8.70                   8.32
                     read    133.88          134.59                 128.85
3           8        write     8.83            9.85                  10.87
                     read    175.61          176.48                 180.66
4           16       write     9.79           10.81                  10.282
                     read    240.71          241.6                  236.913
5           32       write    11.87           12.88                  12.384
                     read    329.8           330.67                 327.648
6           64       write    20.64           21                     20.707
                     read    471.02          471.91                 467.118
7           128      write   187.53          188.57                 182.92
                     read    704.93          705.81                 697.49
Isaac
* Re: [SPDK] bdev: Added latency to channel statistics patch testing
@ 2018-01-12 4:00 Harris, James R
From: Harris, James R @ 2018-01-12 4:00 UTC (permalink / raw)
To: spdk
Hi Isaac,
Thanks for your detailed analysis.
For #1 and #2 – spdk_io_channel_get_thread() can be used instead which does not require a lock – each io_channel is associated with the thread it was allocated from and this gets stored in the spdk_io_channel structure.
Could you re-run your data from #4 with a newer implementation based on using spdk_io_channel_get_thread() instead of spdk_get_thread()?
-Jim
From: Isaac Otsiabah <IOtsiabah(a)us.fujitsu.com>
Date: Wednesday, January 10, 2018 at 3:54 PM
To: James Harris <james.r.harris(a)intel.com>, "'spdk(a)lists.01.org'" <spdk(a)lists.01.org>
Cc: Paul Von-Stamwitz <PVonStamwitz(a)us.fujitsu.com>, Edward Yang <eyang(a)us.fujitsu.com>, Isaac Otsiabah <IOtsiabah(a)us.fujitsu.com>
Subject: RE: bdev: Added latency to channel statistics patch testing
Hi Jim, how are you and happy new year? We investigated the cached tsc value suggestion and found the following.
1. To retrieve the cached tsc value from the spdk_thread structure, spdk_thread_get_ticks() would be implemented like:

uint64_t spdk_thread_get_ticks(void)
{
        struct spdk_thread *thread = spdk_get_thread();

        if (thread) {
                return thread->tsc;
        }
        return 0;
}

The issue is that spdk_get_thread() acquires a lock each time it is called.
2. Saving the tsc value from the _spdk_reactor_run() function requires either (a) acquiring the lock at least once to get the address of the tsc variable inside spdk_thread and saving that address in a pointer variable, or (b) using a function that repeatedly acquires the lock each time it needs to save the value.
3. It is also possible that a cached value is used to process more than one I/O command, which can result in erroneously low latencies at higher qdepths.
4. I compared the results of the spdk_thread_get_ticks() implementation with just calling spdk_get_ticks() (not caching the tsc) using spdk/test/lib/bdev/bdevperf/bdevperf. The results showed lower latency at higher qdepths.
Num of cores = 1, Devices = 1, runtime = 60 sec, operation = write

               spdk_thread_get_ticks() (cached tsc)   Old implementation (spdk_get_ticks(), not cached)
Test#  Qdepth  Num of ops   Latency (us)              Num of ops   Latency (us)
-----  ------  ----------   ------------              ----------   ------------
1      1       6383242      9.399                     6638771      8.972
2      2       10309024     11.640                    10260596     11.630
3      4       21594909     11.113                    22122622     10.783
4      8       22333247     21.492                    22187344     21.569
5      16      22584995     42.506                    21974123     43.622
6      32      23407821     82.024                    23418909     81.919
7      64      23745706     161.713                   23039526     166.604
8      128     25408188     302.265                   23890885     321.395
Conclusion: Given these issues, what if we keep the latency measurement implementation we submitted as is, and add an RPC command that enables/disables the latency measurement logic? That way, applications running production code would not pay the cost of executing the latency measurement logic, while a user who needs to measure latency quickly can enable it with the RPC command, take the measurements, and then disable it when done. What do you think? A rough sketch of the idea is below.
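The sketch assumes a hypothetical RPC (none of these names exist upstream) that simply flips a flag gating the tick reads in the I/O path:
#include <stdbool.h>
#include "spdk/env.h"

/* Toggled by a hypothetical "set_bdev_latency_tracking" RPC. */
static bool g_bdev_latency_enabled = false;

static inline uint64_t
bdev_io_get_submit_tsc(void)
{
	/* Production workloads skip the tick read entirely when the
	 * latency measurement logic is disabled. */
	return g_bdev_latency_enabled ? spdk_get_ticks() : 0;
}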
Isaac
From: Harris, James R [mailto:james.r.harris(a)intel.com]
Sent: Tuesday, December 19, 2017 3:47 PM
To: Isaac Otsiabah <IOtsiabah(a)us.fujitsu.com>; Verkamp, Daniel <daniel.verkamp(a)intel.com>
Cc: Paul Von-Stamwitz <PVonStamwitz(a)us.fujitsu.com>; Edward Yang <eyang(a)us.fujitsu.com>
Subject: Re: bdev: Added latency to channel statistics patch testing
Hi Isaac,
Sorry about the delay. Since this failed the test pool it didn’t get through my review filter. We do have some intermittent failures in the test pool that we are debugging – if one of these ever hits you, please do not hesitate to send an e-mail and we will be happy to re-run it for you.
One concern I have is that this adds two get_ticks calls on each I/O. And for cases like SPDK logical volumes – where the NVMe namespace is a bdev, and then there is a logical volume bdev on top of it – it means four get_ticks calls for each I/O submitted to the logical volume.
The get_ticks() calls are relatively expensive – on my E5 v4 Xeon system, each get_ticks() takes about 11ns. So if we do 1M IO/s on one core with SPDK logical volumes on my system, this would be 4 * 1M * 11ns = 44ms of CPU time per second (or 4.4% of the CPU cycles) spent just on these get_ticks() calls.
Ben and I were talking about this in the lab and have an idea. In the main reactor loop, we already need to do a get_ticks() call each time through the loop. We could easily save this value in the spdk_thread structure. Then when the bdev layer needs it for something like I/O statistics, it could use that value, instead of calling get_ticks() again. We would call this something like spdk_thread_get_ticks(). The downside is that value would not be 100% precise – it would resolve to when the poller function was called, which may be 100-200ns off.
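A minimal sketch of that idea – spdk_thread_get_ticks() and the cached value are hypothetical at this point, and a global stands in for the field that would live on struct spdk_thread:
#include "spdk/env.h"

static uint64_t g_cached_tsc;	/* would live on struct spdk_thread */

static void
reactor_loop_sketch(void)
{
	for (;;) {
		/* The reactor already reads the tick counter once per
		 * iteration; stash that value for later reuse. */
		g_cached_tsc = spdk_get_ticks();

		/* ... run pollers and process events ... */
	}
}

/* The proposed call would return the cached value instead of re-reading
 * the counter, at the cost of 100-200ns of staleness. */
uint64_t
spdk_thread_get_ticks(void)
{
	return g_cached_tsc;
}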
We could also provide an option with an RPC/conf file parameter so that a call to spdk_thread_get_ticks() would return a precise tick count instead of the cached value. Then Fujitsu could enable this to get precise values while the default would use the cached values.
What do you think? If this would suit Fujitsu’s needs, I would be happy to put a patch out for review.
-Jim
From: Isaac Otsiabah <IOtsiabah(a)us.fujitsu.com>
Date: Tuesday, December 19, 2017 at 2:51 PM
To: James Harris <james.r.harris(a)intel.com>, Daniel Verkamp <daniel.verkamp(a)intel.com>
Cc: Paul Von-Stamwitz <PVonStamwitz(a)us.fujitsu.com>, Isaac Otsiabah <IOtsiabah(a)us.fujitsu.com>, Edward Yang <eyang(a)us.fujitsu.com>
Subject: FW: bdev: Added latency to channel statistics patch testing
Hello Jim, how are you? Please, can we get some movement on this patch? The changes are very small and confined to lib/bdev/bdev.c. I tested it with the bdev fio_plugin, and the latency results were comparable to the fio results. One thing to note: the build and test kept failing on the fedora-06 machine but passed on all the other machines. I do not see a reason for it to fail on fedora-06 since the changes are not platform related. Please, can you help us move this forward? Thank you.
Isaac
From: Isaac Otsiabah
Sent: Tuesday, December 12, 2017 12:59 PM
To: 'daniel.verkamp(a)intel.com' <daniel.verkamp(a)intel.com>; 'Harris, James R' <james.r.harris(a)intel.com>
Cc: Paul Von-Stamwitz <PVonStamwitz(a)us.fujitsu.com>; Isaac Otsiabah <IOtsiabah(a)us.fujitsu.com>; Edward Yang <eyang(a)us.fujitsu.com>
Subject: bdev: Added latency to channel statistics patch testing
Hi Daniel, this is a bdev fio_plugin test that compares fio's latency measurements with the latency measurements from the https://review.gerrithub.io/#/c/390654 (bdev: Added latency to channel statistics) patch. As expected, the latency measurements from the patch are comparable to the fio measurements.
Test run#  Qdepth  op     fio clat (us)  fio avg latency (us)  bdev latency (us) from spdk_bdev_get_io_stat(..)
---------  ------  -----  -------------  --------------------  -------------------------------------------------
1          2       write  7.80           8.52                  8.56
                   read   95.36          96.06                 97.138
2          4       write  7.98           8.70                  8.32
                   read   133.88         134.59                128.85
3          8       write  8.83           9.85                  10.87
                   read   175.61         176.48                180.66
4          16      write  9.79           10.81                 10.282
                   read   240.71         241.6                 236.913
5          32      write  11.87          12.88                 12.384
                   read   329.8          330.67                327.648
6          64      write  20.64          21                    20.707
                   read   471.02         471.91                467.118
7          128     write  187.53         188.57                182.92
                   read   704.93         705.81                697.49
Isaac
[-- Attachment #2: attachment.html --]
[-- Type: text/html, Size: 51517 bytes --]
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [SPDK] bdev: Added latency to channel statistics patch testing
@ 2018-01-10 22:54 Isaac Otsiabah
0 siblings, 0 replies; 5+ messages in thread
From: Isaac Otsiabah @ 2018-01-10 22:54 UTC (permalink / raw)
To: spdk
[-- Attachment #1: Type: text/plain, Size: 7384 bytes --]
Hi Jim, how are you and happy new year? We investigated the cached tsc value suggestion and found the following.
1. To retrieve the cached tsc value from the spdk_thread structure, spdk_thread_get_ticks() would be implemented like this:
uint64_t
spdk_thread_get_ticks(void)
{
	struct spdk_thread *thread = spdk_get_thread();

	if (thread) {
		/* return the tsc value cached on this thread */
		return thread->tsc;
	}

	return 0;
}
The issue is that spdk_get_thread() acquires a lock each time it is called.
2. Saving the tsc value from the _spdk_reactor_run() function requires either a) acquiring the lock at least once to get the address of the tsc variable from spdk_thread and saving that address in a pointer variable, or b) using a function that repeatedly acquires the lock whenever it needs to save the value.
3. It is also possible for a cached value to be used to process more than one I/O command, which can result in erroneously lower latencies at higher qdepths.
4. I compared the results of the spdk_thread_get_ticks() implementation with simply calling spdk_get_ticks() (not caching the tsc) using spdk/test/lib/bdev/bdevperf/bdevperf. The cached implementation reported lower latency at the higher qdepths.
Num of cores = 1, Devices = 1, runtime = 60 sec, operation = write

               spdk_thread_get_ticks() (cached tsc)   Old implementation (spdk_get_ticks(), not cached)
Test#  Qdepth  Num of ops   Latency (us)              Num of ops   Latency (us)
-----  ------  ----------   ------------              ----------   ------------
1      1       6383242      9.399                     6638771      8.972
2      2       10309024     11.640                    10260596     11.630
3      4       21594909     11.113                    22122622     10.783
4      8       22333247     21.492                    22187344     21.569
5      16      22584995     42.506                    21974123     43.622
6      32      23407821     82.024                    23418909     81.919
7      64      23745706     161.713                   23039526     166.604
8      128     25408188     302.265                   23890885     321.395
Conclusion: Given these issues, what if we keep the latency measurement implementation we submitted as is, and add an RPC command that enables/disables the latency measurement logic? That way, applications running production code would not pay the cost of executing the latency measurement logic, while a user who needs to measure latency quickly can enable it with the RPC command, take the measurements, and then disable it when done. What do you think?
Isaac
From: Harris, James R [mailto:james.r.harris(a)intel.com]
Sent: Tuesday, December 19, 2017 3:47 PM
To: Isaac Otsiabah <IOtsiabah(a)us.fujitsu.com>; Verkamp, Daniel <daniel.verkamp(a)intel.com>
Cc: Paul Von-Stamwitz <PVonStamwitz(a)us.fujitsu.com>; Edward Yang <eyang(a)us.fujitsu.com>
Subject: Re: bdev: Added latency to channel statistics patch testing
Hi Isaac,
Sorry about the delay. Since this failed the test pool it didn’t get through my review filter. We do have some intermittent failures in the test pool that we are debugging – if one of these ever hits you, please do not hesitate to send an e-mail and we will be happy to re-run it for you.
One concern I have is that this adds two get_ticks calls on each I/O. And for cases like SPDK logical volumes – where the NVMe namespace is a bdev, and then there is a logical volume bdev on top of it – it means four get_ticks calls for each I/O submitted to the logical volume.
The get_ticks() calls are relatively expensive – on my E5 v4 Xeon system, each get_ticks() takes about 11ns. So if we do 1M IO/s on one core with SPDK logical volumes on my system, this would be 4 * 1M * 11ns = 44ms of CPU time per second (or 4.4% of the CPU cycles) spent just on these get_ticks() calls.
Ben and I were talking about this in the lab and have an idea. In the main reactor loop, we already need to do a get_ticks() call each time through the loop. We could easily save this value in the spdk_thread structure. Then when the bdev layer needs it for something like I/O statistics, it could use that value, instead of calling get_ticks() again. We would call this something like spdk_thread_get_ticks(). The downside is that value would not be 100% precise – it would resolve to when the poller function was called, which may be 100-200ns off.
We could also provide an option with an RPC/conf file parameter so that a call to spdk_thread_get_ticks() would return a precise tick count instead of the cached value. Then Fujitsu could enable this to get precise values while the default would use the cached values.
What do you think? If this would suit Fujitsu’s needs, I would be happy to put a patch out for review.
-Jim
From: Isaac Otsiabah <IOtsiabah(a)us.fujitsu.com>
Date: Tuesday, December 19, 2017 at 2:51 PM
To: James Harris <james.r.harris(a)intel.com>, Daniel Verkamp <daniel.verkamp(a)intel.com>
Cc: Paul Von-Stamwitz <PVonStamwitz(a)us.fujitsu.com>, Isaac Otsiabah <IOtsiabah(a)us.fujitsu.com>, Edward Yang <eyang(a)us.fujitsu.com>
Subject: FW: bdev: Added latency to channel statistics patch testing
Hello Jim, how are you? Please, can we get some movement on this patch? The changes are very small and confined to lib/bdev/bdev.c. I tested it with the bdev fio_plugin, and the latency results were comparable to the fio results. One thing to note: the build and test kept failing on the fedora-06 machine but passed on all the other machines. I do not see a reason for it to fail on fedora-06 since the changes are not platform related. Please, can you help us move this forward? Thank you.
Isaac
From: Isaac Otsiabah
Sent: Tuesday, December 12, 2017 12:59 PM
To: 'daniel.verkamp(a)intel.com' <daniel.verkamp(a)intel.com>; 'Harris, James R' <james.r.harris(a)intel.com>
Cc: Paul Von-Stamwitz <PVonStamwitz(a)us.fujitsu.com>; Isaac Otsiabah <IOtsiabah(a)us.fujitsu.com>; Edward Yang <eyang(a)us.fujitsu.com>
Subject: bdev: Added latency to channel statistics patch testing
Hi Daniel, this is a bdev fio_plugin test that compares fio's latency measurements with the latency measurements from the https://review.gerrithub.io/#/c/390654 (bdev: Added latency to channel statistics) patch. As expected, the latency measurements from the patch are comparable to the fio measurements.
Test run#  Qdepth  op     fio clat (us)  fio avg latency (us)  bdev latency (us) from spdk_bdev_get_io_stat(..)
---------  ------  -----  -------------  --------------------  -------------------------------------------------
1          2       write  7.80           8.52                  8.56
                   read   95.36          96.06                 97.138
2          4       write  7.98           8.70                  8.32
                   read   133.88         134.59                128.85
3          8       write  8.83           9.85                  10.87
                   read   175.61         176.48                180.66
4          16      write  9.79           10.81                 10.282
                   read   240.71         241.6                 236.913
5          32      write  11.87          12.88                 12.384
                   read   329.8          330.67                327.648
6          64      write  20.64          21                    20.707
                   read   471.02         471.91                467.118
7          128     write  187.53         188.57                182.92
                   read   704.93         705.81                697.49
Isaac
[-- Attachment #2: attachment.html --]
[-- Type: text/html, Size: 46516 bytes --]
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2018-01-18 18:58 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-18 0:52 [SPDK] bdev: Added latency to channel statistics patch testing Isaac Otsiabah
-- strict thread matches above, loose matches on Subject: below --
2018-01-18 18:58 Isaac Otsiabah
2018-01-16 19:29 Isaac Otsiabah
2018-01-12 4:00 Harris, James R
2018-01-10 22:54 Isaac Otsiabah