* Bcache in writes direct with fsync. Are IOPS limited?
       [not found] <958894243.922478.1652201375900.ref@mail.yahoo.com>
@ 2022-05-10 16:49 ` Adriano Silva
  2022-05-11  6:20   ` Matthias Ferdinand
  2022-05-18  1:22   ` Eric Wheeler
  0 siblings, 2 replies; 37+ messages in thread
From: Adriano Silva @ 2022-05-10 16:49 UTC (permalink / raw)
  To: Bcache Linux

Hello.

I'm trying to set up an NVMe flash disk as a cache for two or three separate spinning disks (I will use 2TB disks, but in these tests I used a 1TB one) that I have on a Linux 5.4.174 (Proxmox) node.

I'm using an NVMe device (a 960GB datacenter model with tantalum capacitors) as the cache.

The goal, depending on the results I get in these benchmark tests, would be to set up an identical configuration on all ten of my hyperconverged Ceph nodes, running the OSDs on top of bcache with the DB/WAL on the same NVMe, but on a separate partition.

Testing with fio directly on the NVMe, it performs well enough at 4K random writes, even with the direct and fsync flags.

root@pve-20:~# fio --filename=/dev/nvme0n1p2 --direct=1 --fsync=1 --rw=randwrite --bs=4K --numjobs=1 --iodepth=1 --runtime=10 --time_based --group_reporting --name=journal-test --ioengine=libaio

  write: IOPS=32.9k, BW=129MiB/s (135MB/s)(1286MiB/10001msec); 0 zone resets
  lat (nsec)   : 1000=0.01%
  lat (usec)   : 2=0.01%, 20=0.01%, 50=99.73%, 100=0.12%, 250=0.01%
  lat (usec)   : 500=0.02%, 750=0.11%, 1000=0.01%
  cpu          : usr=11.59%, sys=18.37%, ctx=329115, majf=0, minf=14
  IO depths    : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,329119,0,329118 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=129MiB/s (135MB/s), 129MiB/s-129MiB/s (135MB/s-135MB/s), io=1286MiB (1348MB), run=10001-10001msec


But when I run the same test on bcache in writeback mode, the performance drops a lot. Of course, it's better than the performance of the spinning disks, but much worse than when the NVMe device is accessed directly.

root@pve-20:~# fio --filename=/dev/bcache0 --direct=1 --fsync=1 --rw=randwrite --bs=4K --numjobs=1 --iodepth=1 --runtime=10 --time_based --group_reporting --name=journal-test --ioengine=libaio
  write: IOPS=1548, BW=6193KiB/s (6342kB/s)(60.5MiB/10001msec); 0 zone resets
  lat (usec)   : 50=0.41%, 100=31.42%, 250=66.20%, 500=1.01%, 750=0.31%
  lat (usec)   : 1000=0.15%
  lat (msec)   : 2=0.20%, 4=0.08%, 10=0.08%, 20=0.15%
  cpu          : usr=3.72%, sys=11.67%, ctx=44541, majf=0, minf=12
Run status group 0 (all jobs):
  WRITE: bw=6193KiB/s (6342kB/s), 6193KiB/s-6193KiB/s (6342kB/s-6342kB/s), io=60.5MiB (63.4MB), run=10001-10001msec

Disk stats (read/write):
    bcache0: ios=0/30596, merge=0/0, ticks=0/8492, in_queue=8492, util=98.99%, aggrios=0/16276, aggrmerge=0/0, aggrticks=0/4528, aggrin_queue=578, aggrutil=98.17%
  sdb: ios=0/2, merge=0/0, ticks=0/1158, in_queue=1156, util=5.59%
  nvme0n1: ios=1/32550, merge=0/0, ticks=1/7898, in_queue=0, util=98.17%


As we can see, the same test done on the bcache0 device only got 1548 IOPS and that yielded only 6.3 MB/s.

This is much more than any spinning HDD could give me, but many times less than the result obtained by NVMe.

I've noticed in several tests, varying the number of jobs and the block size, that the larger the blocks, the closer the bcache device gets to the performance of the physical device. But the number of IOPS always seems to be limited to somewhere around 1500-1800 (maximum). By increasing the number of jobs I get better results and more total IOPS, but if you divide the total IOPS by the number of jobs, you can see that IOPS are always limited to the 1500-1800 range per job.
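
(For reference, the multi-job runs were variations of the same command, just with numjobs raised - a sketch, since the exact values varied between runs:)

root@pve-20:~# fio --filename=/dev/bcache0 --direct=1 --fsync=1 --rw=randwrite --bs=4K --numjobs=4 --iodepth=1 --runtime=10 --time_based --group_reporting --name=journal-test --ioengine=libaio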

The commands used to configure bcache were:

# echo writeback > /sys/block/bcache0/bcache/cache_mode
# echo 0 > /sys/block/bcache0/bcache/sequential_cutoff
##
## Then I tried everything also with the commands below, but there was no improvement.
##
# echo 0 > /sys/fs/bcache/<cache set>/congested_read_threshold_us
# echo 0 > /sys/fs/bcache/<cache set>/congested_write_threshold_us
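
To double-check that the settings took effect, the same sysfs entries can simply be read back, e.g.:

# cat /sys/block/bcache0/bcache/cache_mode
# cat /sys/block/bcache0/bcache/sequential_cutoff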


Monitoring with dstat, it is possible to see that while the fio command is running all the writes go to the cache device (a second partition of the NVMe) until the end of the test. The spinning disk is only written to after some time has passed, when it is possible to see reads on the NVMe and writes on the spinning disk (i.e. the data being transferred in the background).
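
(The dstat invocation used for this monitoring is the same one shown again further below:)

root@pve-20:/# dstat -drnlcyt -D sdb,nvme0n1,bcache0 --aio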

--dsk/sdb---dsk/nvme0n1-dsk/bcache0 ---io/sdb----io/nvme0n1--io/bcache0 -net/total- ---load-avg--- --total-cpu-usage-- ---system-- ----system---- async
 read  writ: read  writ: read  writ| read  writ: read  writ: read  writ| recv  send| 1m   5m  15m |usr sys idl wai stl| int   csw |     time     | #aio
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |8462B 8000B|0.03 0.15 0.31|  1   0  99   0   0| 250   383 |09-05 15:19:47|   0
   0     0 :4096B  454k:   0   336k|   0     0 :1.00   184 :   0   170 |4566B 4852B|0.03 0.15 0.31|  2   2  94   1   0|1277  3470 |09-05 15:19:48|   1B
   0  8192B:   0  8022k:   0  6512k|   0  2.00 :   0  3388 :   0  3254 |3261B 2827B|0.11 0.16 0.32|  0   2  93   5   0|4397    16k|09-05 15:19:49|   1B
   0     0 :   0  7310k:   0  6460k|   0     0 :   0  3240 :   0  3231 |6773B 6428B|0.11 0.16 0.32|  0   1  93   6   0|4190    16k|09-05 15:19:50|   1B
   0     0 :   0  7313k:   0  6504k|   0     0 :   0  3252 :   0  3251 |6719B 6201B|0.11 0.16 0.32|  0   2  92   6   0|4482    16k|09-05 15:19:51|   1B
   0     0 :   0  7313k:   0  6496k|   0     0 :   0  3251 :   0  3250 |4743B 4016B|0.11 0.16 0.32|  0   1  93   6   0|4243    16k|09-05 15:19:52|   1B
   0     0 :   0  7329k:   0  6496k|   0     0 :   0  3289 :   0  3245 |6107B 6062B|0.11 0.16 0.32|  1   1  90   8   0|4706    18k|09-05 15:19:53|   1B
   0     0 :   0  5373k:   0  4184k|   0     0 :   0  2946 :   0  2095 |6387B 6062B|0.26 0.19 0.33|  0   2  95   4   0|3774    12k|09-05 15:19:54|   1B
   0     0 :   0  6966k:   0  5668k|   0     0 :   0  3270 :   0  2834 |7264B 7546B|0.26 0.19 0.33|  0   1  93   5   0|4214    15k|09-05 15:19:55|   1B
   0     0 :   0  7271k:   0  6252k|   0     0 :   0  3258 :   0  3126 |5928B 4584B|0.26 0.19 0.33|  0   2  93   5   0|4156    16k|09-05 15:19:56|   1B
   0     0 :   0  7419k:   0  6504k|   0     0 :   0  3308 :   0  3251 |5226B 5650B|0.26 0.19 0.33|  2   1  91   6   0|4433    16k|09-05 15:19:57|   1B
   0     0 :   0  6444k:   0  5704k|   0     0 :   0  2873 :   0  2851 |6494B 8021B|0.26 0.19 0.33|  1   1  91   7   0|4352    16k|09-05 15:19:58|   0
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |6030B 7204B|0.24 0.19 0.32|  0   0 100   0   0| 209   279 |09-05 15:19:59|   0


This means that the writeback cache mechanism appears to be working as it should, except for the performance limitation.

With ioping it is also possible to see a limitation: the latency of the bcache0 device is around 1.5 ms, while the same test on the raw device (an NVMe partition) gives only 82.1 us.

root@pve-20:~# ioping -c10 /dev/bcache0 -D -Y -WWW -s4k
4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=1 time=1.52 ms (warmup)
4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=2 time=1.60 ms
4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=3 time=1.55 ms
4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=4 time=1.59 ms
4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=5 time=1.52 ms
4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=6 time=1.44 ms
4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=7 time=1.01 ms (fast)
4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=8 time=968.6 us (fast)
4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=9 time=1.12 ms
4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=10 time=1.12 ms

--- /dev/bcache0 (block device 931.5 GiB) ioping statistics ---
9 requests completed in 11.9 ms, 36 KiB written, 754 iops, 2.95 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 968.6 us / 1.33 ms / 1.60 ms / 249.1 us

-------------------------------------------------------------------

root@pve-20:/# dstat -drnlcyt -D sdb,nvme0n1,bcache0 --aio
--dsk/sdb---dsk/nvme0n1-dsk/bcache0 ---io/sdb----io/nvme0n1--io/bcache0 -net/total- ---load-avg--- --total-cpu-usage-- ---system-- ----system---- async
 read  writ: read  writ: read  writ| read  writ: read  writ: read  writ| recv  send| 1m   5m  15m |usr sys idl wai stl| int   csw |     time     | #aio
 332B  181k: 167k  937k:  20B  303k|0.01  11.2 :11.9  42.1 :0.00  5.98 |   0     0 |0.10 0.31 0.36|  0   0  99   0   0| 392   904 |09-05 15:26:35|   0
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |2200B 2506B|0.09 0.31 0.36|  0   0  99   0   0| 437   538 |09-05 15:26:40|   0
   0     0 :   0  5632B:   0  4096B|   0     0 :   0  4.00 :   0  3.00 |8868B 8136B|0.09 0.31 0.36|  0   0 100   0   0| 247   339 |09-05 15:26:41|   0
   0     0 :   0  5632B:   0  4096B|   0     0 :   0  4.00 :   0  3.00 |7318B 7372B|0.09 0.31 0.36|  0   0  99   0   0| 520  2153 |09-05 15:26:42|   0
   0     0 :   0  5632B:   0  4096B|   0     0 :   0  4.00 :   0  3.00 |3315B 2768B|0.09 0.31 0.36|  1   0  97   2   0|1130  2214 |09-05 15:26:43|   0
   0     0 :   0  5632B:   0  4096B|   0     0 :   0  4.00 :   0  3.00 |9526B   12k|0.09 0.31 0.36|  1   0  99   0   0| 339   564 |09-05 15:26:44|   0
   0  4096B:4096B 6656B:   0  4096B|   0  1.00 :1.00  6.00 :   0  3.00 |6142B 6536B|0.08 0.30 0.36|  0   1  98   0   0| 316   375 |09-05 15:26:45|   0
   0  4096B:4096B 5632B:   0  4096B|   0  1.00 :1.00  4.00 :   0  3.00 |3378B 3714B|0.08 0.30 0.36|  0   0 100   0   0| 191   328 |09-05 15:26:46|   0
   0  4096B:4096B 6656B:   0  4096B|   0  1.00 :1.00  6.00 :   0  3.00 |  10k   21k|0.08 0.30 0.36|  1   0  99   0   0| 387   468 |09-05 15:26:47|   0
   0  4096B:4096B 5632B:   0  4096B|   0  1.00 :1.00  4.00 :   0  3.00 |7650B 8602B|0.08 0.30 0.36|  0   0  97   2   0| 737  2627 |09-05 15:26:48|   0
   0  4096B:4096B 6144B:   0  4096B|   0  1.00 :1.00  5.00 :   0  3.00 |9025B 8083B|0.08 0.30 0.36|  0   0 100   0   0| 335   510 |09-05 15:26:49|   0
   0  4096B:4096B 5632B:   0  4096B|   0  1.00 :1.00  4.00 :   0  3.00 |  12k   11k|0.08 0.30 0.35|  0   0 100   0   0| 290   496 |09-05 15:26:50|   0
   0  4096B:4096B    0 :   0     0 |   0  1.00 :1.00     0 :   0     0 |5467B 5365B|0.08 0.30 0.35|  0   0 100   0   0| 404   300 |09-05 15:26:51|   0
   0  4096B:4096B    0 :   0     0 |   0  1.00 :1.00     0 :   0     0 |7973B 7315B|0.08 0.30 0.35|  0   0 100   0   0| 195   304 |09-05 15:26:52|   0
   0  4096B:4096B    0 :   0     0 |   0  1.00 :1.00     0 :   0     0 |6183B 4929B|0.08 0.30 0.35|  0   0  99   1   0| 683  2542 |09-05 15:26:53|   0
   0  4096B:4096B   12k:   0     0 |   0  1.00 :1.00  2.00 :   0     0 |4995B 4998B|0.08 0.30 0.35|  0   0 100   0   0| 199   422 |09-05 15:26:54|   0
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |8353B 8059B|0.07 0.29 0.35|  0   0 100   0   0| 164   217 |09-05 15:26:55|   0
=====================================================================================================

root@pve-20:~# ioping -c10 /dev/nvme0n1p2 -D -Y -WWW -s4k
4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=1 time=81.2 us (warmup)
4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=2 time=82.7 us
4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=3 time=82.4 us
4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=4 time=94.4 us
4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=5 time=95.1 us
4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=6 time=67.5 us
4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=7 time=85.1 us
4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=8 time=63.5 us (fast)
4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=9 time=82.2 us
4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=10 time=86.1 us

--- /dev/nvme0n1p2 (block device 300 GiB) ioping statistics ---
9 requests completed in 739.2 us, 36 KiB written, 12.2 k iops, 47.6 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 63.5 us / 82.1 us / 95.1 us / 10.0 us

-----------------------------------------------------------------------------------------

root@pve-20:/# dstat -drnlcyt -D sdb,nvme0n1,bcache0 --aio
--dsk/sdb---dsk/nvme0n1-dsk/bcache0 ---io/sdb----io/nvme0n1--io/bcache0 -net/total- ---load-avg--- --total-cpu-usage-- ---system-- ----system---- async
 read  writ: read  writ: read  writ| read  writ: read  writ: read  writ| recv  send| 1m   5m  15m |usr sys idl wai stl| int   csw |     time     | #aio
 332B  181k: 167k  935k:  20B  302k|0.01  11.2 :11.9  42.0 :0.00  5.96 |   0     0 |0.18 0.25 0.32|  0   0  99   0   0| 392   904 |09-05 15:30:49|   0
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |4443B 4548B|0.16 0.25 0.32|  0   0 100   0   0| 108   209 |09-05 15:30:55|   0
   0     0 :   0  4096B:   0     0 |   0     0 :   0  1.00 :   0     0 |3526B 3844B|0.16 0.25 0.32|  1   0  99   0   0| 316   434 |09-05 15:30:56|   0
   0     0 :   0  4096B:   0     0 |   0     0 :   0  1.00 :   0     0 |5855B 4707B|0.16 0.25 0.32|  0   0 100   0   0| 146   277 |09-05 15:30:57|   0
   0     0 :   0  4096B:   0     0 |   0     0 :   0  1.00 :   0     0 |8897B 7349B|0.16 0.25 0.32|  0   0  99   1   0| 740  2323 |09-05 15:30:58|   0
   0     0 :   0  4096B:   0     0 |   0     0 :   0  1.00 :   0     0 |7802B 7280B|0.15 0.24 0.32|  0   0 100   0   0| 118   235 |09-05 15:30:59|   0
   0     0 :   0  4096B:   0     0 |   0     0 :   0  1.00 :   0     0 |5610B 4593B|0.15 0.24 0.32|  2   0  98   0   0| 667   682 |09-05 15:31:00|   0
   0     0 :   0  4096B:   0     0 |   0     0 :   0  1.00 :   0     0 |9046B 8254B|0.15 0.24 0.32|  4   0  96   0   0| 515   707 |09-05 15:31:01|   0
   0     0 :   0  4096B:   0     0 |   0     0 :   0  1.00 :   0     0 |5323B 5129B|0.15 0.24 0.32|  0   0 100   0   0| 191   247 |09-05 15:31:02|   0
   0     0 :   0  4096B:   0     0 |   0     0 :   0  1.00 :   0     0 |4249B 3549B|0.15 0.24 0.32|  0   0  98   2   0| 708  2565 |09-05 15:31:03|   0
   0     0 :   0  4096B:   0     0 |   0     0 :   0  1.00 :   0     0 |7577B 7351B|0.14 0.24 0.32|  0   0 100   0   0| 291   350 |09-05 15:31:04|   0
   0     0 :2080k 4096B:   0     0 |   0     0 :62.0  1.00 :   0     0 |5731B 5692B|0.14 0.24 0.32|  0   0 100   0   0| 330   462 |09-05 15:31:05|   0
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |7347B 5852B|0.14 0.24 0.32|  1   0  99   0   0| 306   419 |09-05 15:31:06|   0



The cache was configured directly on one of the NVMe partitions (in this case, the first one). I ran several tests using fio and ioping: on a partition of the NVMe device, on the raw block device without any partition, on the first partition, on the second, with and without bcache configured. I did all this to remove any doubt about the method. The results of tests performed directly on the hardware device, without going through bcache, are always fast and similar.

But tests through bcache are always slower. With writethrough, of course, it gets much worse, because the performance is equal to that of the raw spinning disk.

Using writeback improves a lot, but still doesn't use the full speed of NVMe (honestly, much less than full speed).

I've also noticed that there is a limit on sequential writes, at a little more than half of the maximum write rate the NVMe device shows in direct tests.

Processing doesn't seem to be going up like the tests.

Please would anyone know, what could be causing these limits?

Thanks

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Bcache in writes direct with fsync. Are IOPS limited?
  2022-05-10 16:49 ` Bcache in writes direct with fsync. Are IOPS limited? Adriano Silva
@ 2022-05-11  6:20   ` Matthias Ferdinand
  2022-05-11 12:58     ` Adriano Silva
  2022-05-18  1:22   ` Eric Wheeler
  1 sibling, 1 reply; 37+ messages in thread
From: Matthias Ferdinand @ 2022-05-11  6:20 UTC (permalink / raw)
  To: Adriano Silva; +Cc: Bcache Linux

On Tue, May 10, 2022 at 04:49:35PM +0000, Adriano Silva wrote:
> As we can see, the same test done on the bcache0 device only got 1548 IOPS and that yielded only 6.3 MB/s.
> 
> This is much more than any spinning HDD could give me, but many times less than the result obtained by NVMe.


Hi,

bcache needs to do a lot of metadata work, resulting in a noticeable
write amplification. My testing with bcache (some years ago and only with
SATA SSDs) showed that bcache latency increases a lot with high amounts
of dirty data, so I used to tune down writeback_percent, usually to 1,
and used to keep the cache device size low at around 40GB.
I also found performance to increase slightly when a bcache device
was created with 4k block size instead of default 512bytes.

Still quite a decrease in iops. Maybe you could monitor with iostat,
it gives those _await columns, there might be some hints.
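
Something along these lines (just a sketch; watch the r_await/w_await
columns while fio runs):

    iostat -x sdb nvme0n1 bcache0 1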

Matthias

> I've noticed in several tests, varying the amount of jobs or increasing the size of the blocks, that the larger the size of the blocks, the more I approximate the performance of the physical device to the bcache device. But it always seems that the amount of IOPS is limited to somewhere around 1500-1800 IOPS (maximum). By increasing the amount of jobs, I get better results and more IOPS, but if you divide the total IOPS by the amount of jobs, you can see that the IOPS are always limited in the range 1500-1800 per job.
> 
> The commands used to configure bcache were:
> 
> # echo writeback > /sys/block/bcache0/bcache/cache_mode
> # echo 0 > /sys/block/bcache0/bcache/sequential_cutoff
> ##
> ## Then I tried everything also with the commands below, but there was no improvement.
> ##
> # echo 0 > /sys/fs/bcache/<cache set>/congested_read_threshold_us
> # echo 0 > /sys/fs/bcache/<cache set>/congested_write_threshold_us
> 
> 
> Monitoring with dstat, it is possible to notice that when activating the fio command, the writing is all done in the cache device (a second partition of NVMe), until the end of the test. The spinning disk is only written after the time has passed and it is possible to see the read on the NVMe and the write on the spinning disk (which means the transfer of data in the background).
> 
> --dsk/sdb---dsk/nvme0n1-dsk/bcache0 ---io/sdb----io/nvme0n1--io/bcache0 -net/total- ---load-avg--- --total-cpu-usage-- ---system-- ----system---- async
>  read  writ: read  writ: read  writ| read  writ: read  writ: read  writ| recv  send| 1m   5m  15m |usr sys idl wai stl| int   csw |     time     | #aio
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |8462B 8000B|0.03 0.15 0.31|  1   0  99   0   0| 250   383 |09-05 15:19:47|   0
>    0     0 :4096B  454k:   0   336k|   0     0 :1.00   184 :   0   170 |4566B 4852B|0.03 0.15 0.31|  2   2  94   1   0|1277  3470 |09-05 15:19:48|   1B
>    0  8192B:   0  8022k:   0  6512k|   0  2.00 :   0  3388 :   0  3254 |3261B 2827B|0.11 0.16 0.32|  0   2  93   5   0|4397    16k|09-05 15:19:49|   1B
>    0     0 :   0  7310k:   0  6460k|   0     0 :   0  3240 :   0  3231 |6773B 6428B|0.11 0.16 0.32|  0   1  93   6   0|4190    16k|09-05 15:19:50|   1B
>    0     0 :   0  7313k:   0  6504k|   0     0 :   0  3252 :   0  3251 |6719B 6201B|0.11 0.16 0.32|  0   2  92   6   0|4482    16k|09-05 15:19:51|   1B
>    0     0 :   0  7313k:   0  6496k|   0     0 :   0  3251 :   0  3250 |4743B 4016B|0.11 0.16 0.32|  0   1  93   6   0|4243    16k|09-05 15:19:52|   1B
>    0     0 :   0  7329k:   0  6496k|   0     0 :   0  3289 :   0  3245 |6107B 6062B|0.11 0.16 0.32|  1   1  90   8   0|4706    18k|09-05 15:19:53|   1B
>    0     0 :   0  5373k:   0  4184k|   0     0 :   0  2946 :   0  2095 |6387B 6062B|0.26 0.19 0.33|  0   2  95   4   0|3774    12k|09-05 15:19:54|   1B
>    0     0 :   0  6966k:   0  5668k|   0     0 :   0  3270 :   0  2834 |7264B 7546B|0.26 0.19 0.33|  0   1  93   5   0|4214    15k|09-05 15:19:55|   1B
>    0     0 :   0  7271k:   0  6252k|   0     0 :   0  3258 :   0  3126 |5928B 4584B|0.26 0.19 0.33|  0   2  93   5   0|4156    16k|09-05 15:19:56|   1B
>    0     0 :   0  7419k:   0  6504k|   0     0 :   0  3308 :   0  3251 |5226B 5650B|0.26 0.19 0.33|  2   1  91   6   0|4433    16k|09-05 15:19:57|   1B
>    0     0 :   0  6444k:   0  5704k|   0     0 :   0  2873 :   0  2851 |6494B 8021B|0.26 0.19 0.33|  1   1  91   7   0|4352    16k|09-05 15:19:58|   0
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |6030B 7204B|0.24 0.19 0.32|  0   0 100   0   0| 209   279 |09-05 15:19:59|   0

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Bcache in writes direct with fsync. Are IOPS limited?
  2022-05-11  6:20   ` Matthias Ferdinand
@ 2022-05-11 12:58     ` Adriano Silva
  2022-05-11 21:21       ` Matthias Ferdinand
  0 siblings, 1 reply; 37+ messages in thread
From: Adriano Silva @ 2022-05-11 12:58 UTC (permalink / raw)
  To: Matthias Ferdinand; +Cc: Bcache Linux

Thank you for your answer!

> bcache needs to do a lot of metadata work, resulting in a noticeable
> write amplification. My testing with bcache (some years ago and only with
> SATA SSDs) showed that bcache latency increases a lot with high amounts
> of dirty data

I'm testing with empty devices, no data.

Wouldn't write amplification be noticeable in dstat? It doesn't seem significant during the tests, since I monitor reads and writes on all the disks with dstat.

> I also found performance to increase slightly when a bcache device
> was created with 4k block size instead of default 512bytes.

Are you talking about changing the block size for the cache device or the backing device?

I tried changing it on the cache device, but bcache gave an error when I later tried to attach the backing device. It only worked when I kept the default value (512). When creating the cache, I was only able to change the bucket size to 16K (which is the value I found reported for my NVMe, though I don't even know if it's correct), but unfortunately that didn't change the IOPS or the latency.

> so I used to tune down writeback_percent, usually to 1,
> and used to keep the cache device size low at around 40GB.

I think it must be some fine tuning.

One curious thing I noticed is that the writes always land on the flash, never on the spinning disk. This is expected and should give the same fast response as the flash device itself. However, that is not what happens when going through bcache.

But when I remove the fsync flag from the fio test (the flag that makes the application wait for the write to be acknowledged), the 4K writes get much faster, reaching 73.6 MB/s and 17k IOPS. That is half the device's performance, but it's more than enough for my case. The fsync flag makes no significant difference to the performance of my flash disk when testing directly on it. The fact that bcache speeds up when the fsync flag is removed makes me believe that bcache is not slow to write, but that for some reason it takes a while to acknowledge that the write is complete. I think that must be the point!
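
(To be explicit, that is the same fio command as before, with only the --fsync=1 flag removed:)

root@pve-20:~# fio --filename=/dev/bcache0 --direct=1 --rw=randwrite --bs=4K --numjobs=1 --iodepth=1 --runtime=10 --time_based --group_reporting --name=journal-test --ioengine=libaio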

And without fsync, ioping tests also speed up, albeit less. In this case, I can see that the latency drops to something around 600~700us.

Nothing compared to the 84us obtained when writing directly to the flash device (with or without fsync), but still much better than the 1.5ms you get through bcache when you add the fsync flag and wait for the write acknowledgement.

In other words, it looks like the bcache layer adds a wait between receiving the write, waiting for the disk's response, and then returning the completion to the application. This increases latency and consequently reduces performance. I think it must be some fine tuning (or not?).

I suspect this tool (bcache) is not used much, at least not in this way, because I'm having difficulty finding feedback on the Internet. I didn't even know where to look for help.

In fact, writing small blocks with the fsync and direct flags is not very common. It is typically done by database servers and other data center storage tools that need to be sure the data is physically written to the device immediately after each operation. These applications need a guarantee that the writes were actually performed, and disk caches are made of volatile memory, which provides no such guarantee: a power failure can occur and the data that was only in the cache is lost. That's why each operation asks for the data to be written directly, bypassing the cache, with the acknowledgement coming back only after the write has been done.

This makes operations very slow in nature.

And everything gets even slower when each operation is as small as 4K, for example. That is, for every 4K write request, an instruction is sent along with it asking that the data not be held in the disk cache (since that cache is assumed to be volatile memory) but written immediately, with the device confirming the write afterwards. This significantly increases latency.

That's why, in these environments, it is usually recommended to use RAID cards with battery-backed cache, which ignore the direct and fsync instructions but still guarantee the data is saved even on power failure, precisely because of the batteries.

Still, nowadays, with enterprise flash devices containing tantalum capacitors that act as a true built-in UPS, RAID arrays, besides being expensive, are no longer considered that fast.

In the same way, flash devices with built-in supercapacitors can ignore fsync flags and still guarantee the write, even in case of power failure.

Writes on these devices become so fast that it hardly even looks like a physical write confirmation was requested for each operation. The operations are as fast for databases as the simple writes that would normally land in the cache of a consumer flash disk.

But enterprise data center flash disks are very expensive! So the idea is to use spinning disks for capacity and an enterprise datacenter flash disk (NVMe) as a cache with bcache. Theoretically, bcache would always divert writes (especially small ones) directly to the NVMe drive, and I would benefit from the drive's low latency, high throughput, and IOPS for most writes and reads.

Unfortunately something is not working out as I imagined: something is limiting IOPS and increasing latency a lot.

I think it might be something I'm doing wrong in the configuration. Or some fine tuning I don't know how to do.




Thank you! The search continues. If anyone else can help, I'd appreciate it!

On Wednesday, May 11, 2022, at 03:20:18 BRT, Matthias Ferdinand <bcache@mfedv.net> wrote:

On Tue, May 10, 2022 at 04:49:35PM +0000, Adriano Silva wrote:
> As we can see, the same test done on the bcache0 device only got 1548 IOPS and that yielded only 6.3 MB/s.
> 
> This is much more than any spinning HDD could give me, but many times less than the result obtained by NVMe.


Hi,

bcache needs to do a lot of metadata work, resulting in a noticeable
write amplification. My testing with bcache (some years ago and only with
SATA SSDs) showed that bcache latency increases a lot with high amounts
of dirty data, so I used to tune down writeback_percent, usually to 1,
and used to keep the cache device size low at around 40GB.
I also found performance to increase slightly when a bcache device
was created with 4k block size instead of default 512bytes.

Still quite a decrease in iops. Maybe you could monitor with iostat,
it gives those _await columns, there might be some hints.

Matthias


> I've noticed in several tests, varying the amount of jobs or increasing the size of the blocks, that the larger the size of the blocks, the more I approximate the performance of the physical device to the bcache device. But it always seems that the amount of IOPS is limited to somewhere around 1500-1800 IOPS (maximum). By increasing the amount of jobs, I get better results and more IOPS, but if you divide the total IOPS by the amount of jobs, you can see that the IOPS are always limited in the range 1500-1800 per job.
> 
> The commands used to configure bcache were:
> 
> # echo writeback > /sys/block/bcache0/bcache/cache_mode
> # echo 0 > /sys/block/bcache0/bcache/sequential_cutoff
> ##
> ## Then I tried everything also with the commands below, but there was no improvement.
> ##
> # echo 0 > /sys/fs/bcache/<cache set>/congested_read_threshold_us
> # echo 0 > /sys/fs/bcache/<cache set>/congested_write_threshold_us
> 
> 
> Monitoring with dstat, it is possible to notice that when activating the fio command, the writing is all done in the cache device (a second partition of NVMe), until the end of the test. The spinning disk is only written after the time has passed and it is possible to see the read on the NVMe and the write on the spinning disk (which means the transfer of data in the background).
> 
> --dsk/sdb---dsk/nvme0n1-dsk/bcache0 ---io/sdb----io/nvme0n1--io/bcache0 -net/total- ---load-avg--- --total-cpu-usage-- ---system-- ----system---- async
>  read  writ: read  writ: read  writ| read  writ: read  writ: read  writ| recv  send| 1m   5m  15m |usr sys idl wai stl| int   csw |     time     | #aio
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |8462B 8000B|0.03 0.15 0.31|  1   0  99   0   0| 250   383 |09-05 15:19:47|   0
>    0     0 :4096B  454k:   0   336k|   0     0 :1.00   184 :   0   170 |4566B 4852B|0.03 0.15 0.31|  2   2  94   1   0|1277  3470 |09-05 15:19:48|   1B
>    0  8192B:   0  8022k:   0  6512k|   0  2.00 :   0  3388 :   0  3254 |3261B 2827B|0.11 0.16 0.32|  0   2  93   5   0|4397    16k|09-05 15:19:49|   1B
>    0     0 :   0  7310k:   0  6460k|   0     0 :   0  3240 :   0  3231 |6773B 6428B|0.11 0.16 0.32|  0   1  93   6   0|4190    16k|09-05 15:19:50|   1B
>    0     0 :   0  7313k:   0  6504k|   0     0 :   0  3252 :   0  3251 |6719B 6201B|0.11 0.16 0.32|  0   2  92   6   0|4482    16k|09-05 15:19:51|   1B
>    0     0 :   0  7313k:   0  6496k|   0     0 :   0  3251 :   0  3250 |4743B 4016B|0.11 0.16 0.32|  0   1  93   6   0|4243    16k|09-05 15:19:52|   1B
>    0     0 :   0  7329k:   0  6496k|   0     0 :   0  3289 :   0  3245 |6107B 6062B|0.11 0.16 0.32|  1   1  90   8   0|4706    18k|09-05 15:19:53|   1B
>    0     0 :   0  5373k:   0  4184k|   0     0 :   0  2946 :   0  2095 |6387B 6062B|0.26 0.19 0.33|  0   2  95   4   0|3774    12k|09-05 15:19:54|   1B
>    0     0 :   0  6966k:   0  5668k|   0     0 :   0  3270 :   0  2834 |7264B 7546B|0.26 0.19 0.33|  0   1  93   5   0|4214    15k|09-05 15:19:55|   1B
>    0     0 :   0  7271k:   0  6252k|   0     0 :   0  3258 :   0  3126 |5928B 4584B|0.26 0.19 0.33|  0   2  93   5   0|4156    16k|09-05 15:19:56|   1B
>    0     0 :   0  7419k:   0  6504k|   0     0 :   0  3308 :   0  3251 |5226B 5650B|0.26 0.19 0.33|  2   1  91   6   0|4433    16k|09-05 15:19:57|   1B
>    0     0 :   0  6444k:   0  5704k|   0     0 :   0  2873 :   0  2851 |6494B 8021B|0.26 0.19 0.33|  1   1  91   7   0|4352    16k|09-05 15:19:58|   0
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |6030B 7204B|0.24 0.19 0.32|  0   0 100   0   0| 209   279 |09-05 15:19:59|   0

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Bcache in writes direct with fsync. Are IOPS limited?
  2022-05-11 12:58     ` Adriano Silva
@ 2022-05-11 21:21       ` Matthias Ferdinand
  0 siblings, 0 replies; 37+ messages in thread
From: Matthias Ferdinand @ 2022-05-11 21:21 UTC (permalink / raw)
  To: Adriano Silva; +Cc: Bcache Linux

On Wed, May 11, 2022 at 12:58:48PM +0000, Adriano Silva wrote:
> Thank you for your answer!
> 
> > bcache needs to do a lot of metadata work, resulting in a noticeable
> > write amplification. My testing with bcache (some years ago and only with
> > SATA SSDs) showed that bcache latency increases a lot with high amounts
> > of dirty data
> 
> I'm testing with empty devices, no data.
> 
> Wouldn't write amplification be noticeable in dstat? Because it doesn't seem significant during the tests, since I monitor reads and writes in all disks in dstat.

yes, you are right, that would be visible. I was misled by the ~3k writes/s
to nvme (vs. ~1.5k writes/s from fio), but the same ~3k writes/s show up on
bcache.

> > I also found performance to increase slightly when a bcache device
> > was created with 4k block size instead of default 512bytes.
> 
> Are you talking about changing the block size for the cache device or the backing device?

neither - it was the "-w" argument to make-bcache. I found an old
logfile from my tests: where both hdd and ssd showed up as
512b-sector devices, the command to create the bcache device was
    make-bcache --data_offset 2048 --wipe-bcache -w 4k -C /dev/sde1 -B /dev/sdb
In /sys/block/bcacheX/queue/hw_sector_size it then says "4096".


> But when I remove the fsync flag in the test with fio, which tells the application to wait for the write response, the 4K write happens much faster, reaching 73.6 MB/s and 17k IOPS. This is half the device's performance, but it's more than enough for my case. The fsync flag makes no significant difference to the performance of my flash disk when testing directly on it. The fact that bcache speeds up when the fsync flag is removed makes me believe that bcache is not slow to write, but for some reason, bcache is taking a while to respond that the write is complete. I think that should be the point!

I can't claim to fully understand what fsync does (or how a block
device driver is supposed to handle it), but this might account for the
roughly doubled writes shown with dstat as opposed to the fio results.

From the name "journal-test" I guess you are trying something like
    https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
He uses very similar parameters, except with "--sync=1", not
"--fsync=1".

This is a proper benchmark for the old ceph filestore journal, as this
was written linearly, and in the worst case could have been written in
chunks as small as 4k.

As you are using proxmox, I guess you want to use its ceph component.
They use the modern ceph bluestore format, and there is no journal
anymore.  I don't know if the bluestore WAL exhibits similar access
patterns as the old journal and if this benchmark still has real-world
relevance.  But if you have enough NVMe disk space, you are advised to
put bluestore WAL and ideally also the bluestore DB directly on NVMe,
and use bcache only for the bluestore data part. If you do so, make sure
to set rotational=1 on the bcache device before creating the OSD, or
ceph will use unsuitable bluestore parameters, possibly overwhelming the
hdd:

    https://www.spinics.net/lists/ceph-users/msg71646.html
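
Concretely, that would be something like this before creating the OSD (a
sketch; the bcache device name is just an example):

    echo 1 > /sys/block/bcache0/queue/rotational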

Matthias

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Bcache in writes direct with fsync. Are IOPS limited?
  2022-05-10 16:49 ` Bcache in writes direct with fsync. Are IOPS limited? Adriano Silva
  2022-05-11  6:20   ` Matthias Ferdinand
@ 2022-05-18  1:22   ` Eric Wheeler
  2022-05-23 14:07     ` Coly Li
                       ` (2 more replies)
  1 sibling, 3 replies; 37+ messages in thread
From: Eric Wheeler @ 2022-05-18  1:22 UTC (permalink / raw)
  To: Adriano Silva; +Cc: Bcache Linux, Coly Li, Matthias Ferdinand

On Tue, 10 May 2022, Adriano Silva wrote:
> I'm trying to set up a flash disk NVMe as a disk cache for two or three 
> isolated (I will use 2TB disks, but in these tests I used a 1TB one) 
> spinning disks that I have on a Linux 5.4.174 (Proxmox node).

Coly has been adding quite a few optimizations over the years.  You might 
try a new kernel and see if that helps.  More below.

> I'm using a NVMe (960GB datacenter devices with tantalum capacitors) as 
> a cache.
> [...]
>
> But when I do the same test on bcache writeback, the performance drops a 
> lot. Of course, it's better than the performance of spinning disks, but 
> much worse than when accessed directly from the NVMe device hardware.
>
> [...]
> As we can see, the same test done on the bcache0 device only got 1548 
> IOPS and that yielded only 6.3 MB/s.

Well done on the benchmarking!  I always thought our new NVMes performed 
slower than expected but hadn't gotten around to investigating. 

> I've noticed in several tests, varying the amount of jobs or increasing 
> the size of the blocks, that the larger the size of the blocks, the more 
> I approximate the performance of the physical device to the bcache 
> device.

You said "blocks" but did you mean bucket size (make-bcache -b) or block 
size (make-bcache -w) ?

If larger buckets make it slower, then that actually surprises me: bigger 
buckets mean less metadata and better sequential writeback to the 
spinning disks (though you hadn't yet hit writeback to spinning disks in 
your stats).  Maybe you already tried, but varying the bucket size might 
help.  Try graphing bucket size (powers of 2) against IOPS, maybe there is 
a "sweet spot"?

Be aware that 4k blocks (so-called "4Kn") is unsafe for the cache device, 
unless Coly has patched that.  Make sure your `blockdev --getss` reports 
512 for your NVMe!
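
For example:

blockdev --getss /dev/nvme0n1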

Hi Coly,

Some time ago you ordered an SSD to test the 4k cache issue; has that been 
fixed?  I've kept an eye out for the patch but am not sure if it was released.

You have a really great test rig setup with NVMes for stress
testing bcache. Can you replicate Adriano's `ioping` numbers below?

> With ioping it is also possible to notice a limitation, as the latency 
> of the bcache0 device is around 1.5ms, while in the case of the raw 
> device (a partition of NVMe), the same test is only 82.1us.
> 
> root@pve-20:~# ioping -c10 /dev/bcache0 -D -Y -WWW -s4k
> 4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=1 time=1.52 ms (warmup)
> 4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=2 time=1.60 ms
> 4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=3 time=1.55 ms
>
> root@pve-20:~# ioping -c10 /dev/nvme0n1p2 -D -Y -WWW -s4k
> 4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=1 time=81.2 us (warmup)
> 4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=2 time=82.7 us
> 4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=3 time=82.4 us

Wow, almost 20x higher latency, sounds convincing that something is wrong.

A few things to try:

1. Try ioping without -Y.  How does it compare?

2. Maybe this is an inter-socket latency issue.  Is your server 
   multi-socket?  If so, then as a first pass you could set the kernel 
   cmdline `isolcpus` for testing to limit all processes to a single 
   socket where the NVMe is connected (see `lscpu`).  Check `hwloc-ls`
   or your motherboard manual to see how the NVMe port is wired to your
   CPUs.

   If that helps then fine tune with `numactl -cN ioping` and 
   /proc/irq/<n>/smp_affinity_list (and `grep nvme /proc/interrupts`) to 
   make sure your NVMe's are locked to IRQs on the same socket.
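
   As a rough sketch of that pinning (the IRQ number and CPU list below
   are placeholders, not values from your system):

	lscpu                                     # NUMA node <-> CPU mapping
	grep nvme /proc/interrupts                # find the NVMe IRQ numbers
	echo 0-15 > /proc/irq/<n>/smp_affinity_list   # pin one NVMe IRQ to socket-0 CPUs
	numactl --cpunodebind=0 --membind=0 ioping -c10 /dev/bcache0 -D -Y -WWW -s4k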

3a. sysfs:

> # echo 0 > /sys/block/bcache0/bcache/sequential_cutoff

good.

> # echo 0 > /sys/fs/bcache/<cache set>/congested_read_threshold_us
> # echo 0 > /sys/fs/bcache/<cache set>/congested_write_threshold_us

Also try these (I think bcache/cache is a symlink to /sys/fs/bcache/<cache set>)

echo 10000000 > /sys/block/bcache0/bcache/cache/congested_read_threshold_us 
echo 10000000 > /sys/block/bcache0/bcache/cache/congested_write_threshold_us


Try tuning journal_delay_ms: 
  /sys/fs/bcache/<cset-uuid>/journal_delay_ms
    Journal writes will delay for up to this many milliseconds, unless a 
    cache flush happens sooner. Defaults to 100.
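  For example (the value here is only an example to experiment with):
    echo 10 > /sys/fs/bcache/<cset-uuid>/journal_delay_ms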

3b: Hacking bcache code:

I just noticed that journal_delay_ms says "unless a cache flush happens 
sooner" but cache flushes can be re-ordered so flushing the journal when 
REQ_OP_FLUSH comes through may not be useful, especially if there is a 
high volume of flushes coming down the pipe because the flushes could kill 
the NVMe's cache---and maybe the 1.5ms ping is actual flash latency.  It
would flush data and journal.

Maybe there should be a cachedev_noflush sysfs option for those with some 
kind of power-loss protection on their SSDs.  It looks like this is 
handled in request.c when these functions call bch_journal_meta():

	1053: static void cached_dev_nodata(struct closure *cl)
	1263: static void flash_dev_nodata(struct closure *cl)

Coly can you comment about journal flush semantics with respect to 
performance vs correctness and crash safety?

Adriano, as a test, you could change this line in search_alloc() in 
request.c:

	- s->iop.flush_journal    = op_is_flush(bio->bi_opf);
	+ s->iop.flush_journal    = 0;

and see how performance changes.

Someone correct me if I'm wrong, but I don't think flush_journal=0 will 
affect correctness unless there is a crash.  If that /is/ the performance 
problem then it would narrow the scope of this discussion.

4. I wonder if your 1.5ms `ioping` stats scale with CPU clock speed: can 
   you set your CPU governor to run at full clock speed and then slowest 
   clock speed to see if it is a CPU limit somewhere as we expect?

   You can do `grep MHz /proc/cpuinfo` to see the active rate to make sure 
   the governor did its job.  

   If it scales with CPU then something in bcache is working too hard.  
   Maybe garbage collection?  Other devs would need to chime in here to 
   steer the troubleshooting if that is the case.
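
   For example (assuming the cpupower tool is available; governor names
   depend on the cpufreq driver in use):

	cpupower frequency-set -g performance    # full clock speed
	cpupower frequency-set -g powersave      # slowest clock speed
	grep MHz /proc/cpuinfo                   # confirm the active rate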


5. I'm not sure if garbage collection is the issue, but you might try 
   Mingzhe's dynamic incremental gc patch:
	https://www.spinics.net/lists/linux-bcache/msg11185.html

6. Try dm-cache and see if its IO latency is similar to bcache: If it is 
   about the same then that would indicate an issue in the block layer 
   somewhere outside of bcache.  If dm-cache is better, then that confirms 
   a bcache issue.


> The cache was configured directly on one of the NVMe partitions (in this 
> case, the first partition). I did several tests using fio and ioping, 
> testing on a partition on the NVMe device, without partition and 
> directly on the raw block, on a first partition, on the second, with or 
> without configuring bcache. I did all this to remove any doubt as to the 
> method. The results of tests performed directly on the hardware device, 
> without going through bcache are always fast and similar.
> 
> But tests in bcache are always slower. If you use writethrough, of 
> course, it gets much worse, because the performance is equal to the raw 
> spinning disk.
> 
> Using writeback improves a lot, but still doesn't use the full speed of 
> NVMe (honestly, much less than full speed).

Indeed, I hope this can be fixed!  A 20x improvement in bcache would 
be awesome.
 
> But I've also noticed that there is a limit on writing sequential data, 
> which is a little more than half of the maximum write rate shown in 
> direct tests by the NVMe device.

For sync, async, or both?

> Processing doesn't seem to be going up like the tests.

What do you mean "processing" ?

-Eric



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Bcache in writes direct with fsync. Are IOPS limited?
  2022-05-18  1:22   ` Eric Wheeler
@ 2022-05-23 14:07     ` Coly Li
  2022-05-26 19:15       ` Eric Wheeler
  2022-05-23 18:36     ` [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync) Eric Wheeler
       [not found]     ` <681726005.1812841.1653564986700@mail.yahoo.com>
  2 siblings, 1 reply; 37+ messages in thread
From: Coly Li @ 2022-05-23 14:07 UTC (permalink / raw)
  To: Eric Wheeler, Adriano Silva; +Cc: Bcache Linux, Matthias Ferdinand

On 5/18/22 9:22 AM, Eric Wheeler wrote:
> On Tue, 10 May 2022, Adriano Silva wrote:
>> I'm trying to set up a flash disk NVMe as a disk cache for two or three
>> isolated (I will use 2TB disks, but in these tests I used a 1TB one)
>> spinning disks that I have on a Linux 5.4.174 (Proxmox node).
> Coly has been adding quite a few optimizations over the years.  You might
> try a new kernel and see if that helps.  More below.


Yes, the latest stable kernel is preferred. A Linux 5.4 based kernel is 
stable enough for bcache, but it is still better to use the latest stable 
kernel.


>> I'm using a NVMe (960GB datacenter devices with tantalum capacitors) as
>> a cache.
>> [...]
>>
>> But when I do the same test on bcache writeback, the performance drops a
>> lot. Of course, it's better than the performance of spinning disks, but
>> much worse than when accessed directly from the NVMe device hardware.
>>
>> [...]
>> As we can see, the same test done on the bcache0 device only got 1548
>> IOPS and that yielded only 6.3 MB/s.
> Well done on the benchmarking!  I always thought our new NVMes performed
> slower than expected but hadn't gotten around to investigating.
>
>> I've noticed in several tests, varying the amount of jobs or increasing
>> the size of the blocks, that the larger the size of the blocks, the more
>> I approximate the performance of the physical device to the bcache
>> device.
> You said "blocks" but did you mean bucket size (make-bcache -b) or block
> size (make-bcache -w) ?
>
> If larger buckets makes it slower than that actually surprises me: bigger
> buckets means less metadata and better sequential writeback to the
> spinning disks (though you hadn't yet hit writeback to spinning disks in
> your stats).  Maybe you already tried, but varying the bucket size might
> help.  Try graphing bucket size (powers of 2) against IOPS, maybe there is
> a "sweet spot"?
>
> Be aware that 4k blocks (so-called "4Kn") is unsafe for the cache device,
> unless Coly has patched that.  Make sure your `blockdev --getss` reports
> 512 for your NVMe!
>
> Hi Coly,
>
> Some time ago you ordered an an SSD to test the 4k cache issue, has that
> been fixed?  I've kept an eye out for the patch but not sure if it was released.


Yes, I got an Intel P3700 PCIe SSD (borrowed from a hardware vendor) to 
look into the 4Kn unaligned I/O issue. The new situation is that the 
current kernel does the sector-size alignment check quite early in the 
bio layer: if the LBA is not sector-size aligned, the bio is rejected 
there, and the underlying driver never gets a chance to see it. So for 
now, an unaligned LBA for a 4Kn device cannot reach the bcache code, 
that is to say, the originally reported condition won't happen anymore.

After this observation I stopped my investigation into unaligned 
sector-size I/O on 4Kn devices, and returned the P3700 PCIe SSD to the 
hardware vendor.


> You have a really great test rig setup with NVMes for stress
> testing bcache. Can you replicate Adriano's `ioping` numbers below?


I tried a similar operation; yes, it should be a bit slower than raw 
device access, but it should not be as slow as that...

Here is my fio single thread fsync performance number,

job0: (groupid=0, jobs=1): err= 0: pid=3370: Mon May 23 16:17:05 2022
   write: IOPS=20.9k, BW=81.8MiB/s (85.8MB/s)(17.3GiB/216718msec); 0 zone resets
    bw (  KiB/s): min=75904, max=86872, per=100.00%, avg=83814.21, stdev=1321.04, samples=433
    iops        : min=18976, max=21718, avg=20953.56, stdev=330.27, samples=433
   lat (usec)   : 2=0.01%, 10=0.01%, 20=97.34%, 50=1.71%, 100=0.47%
   lat (usec)   : 250=0.42%, 500=0.01%, 750=0.01%, 1000=0.02%
   lat (msec)   : 2=0.02%, 4=0.01%

Most of the write I/Os here finish within 20us; compared to that, 100-250us 
is too slow, which is beyond my expectation. Something must not be working 
properly.



>> With ioping it is also possible to notice a limitation, as the latency
>> of the bcache0 device is around 1.5ms, while in the case of the raw
>> device (a partition of NVMe), the same test is only 82.1us.
>>
>> root@pve-20:~# ioping -c10 /dev/bcache0 -D -Y -WWW -s4k
>> 4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=1 time=1.52 ms (warmup)
>> 4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=2 time=1.60 ms
>> 4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=3 time=1.55 ms
>>
>> root@pve-20:~# ioping -c10 /dev/nvme0n1p2 -D -Y -WWW -s4k
>> 4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=1 time=81.2 us (warmup)
>> 4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=2 time=82.7 us
>> 4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=3 time=82.4 us
> Wow, almost 20x higher latency, sounds convincing that something is wrong.
>
> A few things to try:
>
> 1. Try ioping without -Y.  How does it compare?
>
> 2. Maybe this is an inter-socket latency issue.  Is your server
>     multi-socket?  If so, then as a first pass you could set the kernel
>     cmdline `isolcpus` for testing to limit all processes to a single
>     socket where the NVMe is connected (see `lscpu`).  Check `hwloc-ls`
>     or your motherboard manual to see how the NVMe port is wired to your
>     CPUs.
>
>     If that helps then fine tune with `numactl -cN ioping` and
>     /proc/irq/<n>/smp_affinity_list (and `grep nvme /proc/interrupts`) to
>     make sure your NVMe's are locked to IRQs on the same socket.

Wow, this is too slow...


Here is my performance number,

  # ./ioping -c10 /dev/bcache0 -D -Y -WWW -s4k
4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=1 time=144.3 us (warmup)
4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=2 time=84.1 us
4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=3 time=71.8 us
4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=4 time=68.9 us
4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=5 time=69.8 us
4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=6 time=68.7 us
4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=7 time=68.8 us
4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=8 time=70.3 us
4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=9 time=68.8 us
4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=10 time=68.5 us

  # ./ioping -c10 /dev/bcache0 -D -WWW -s4k
4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=1 time=127.8 us (warmup)
4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=2 time=67.8 us
4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=3 time=60.3 us
4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=4 time=46.9 us
4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=5 time=52.6 us
4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=6 time=43.8 us
4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=7 time=52.7 us
4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=8 time=44.3 us
4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=9 time=52.0 us
4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=10 time=44.6 us

1.5ms is really far from my expectation, there must be something wrong....


[snipped]

> Someone correct me if I'm wrong, but I don't think flush_journal=0 will
> affect correctness unless there is a crash.  If that /is/ the performance
> problem then it would narrow the scope of this discussion.
>
> 4. I wonder if your 1.5ms `ioping` stats scale with CPU clock speed: can
>     you set your CPU governor to run at full clock speed and then slowest
>     clock speed to see if it is a CPU limit somewhere as we expect?
>
>     You can do `grep MHz /proc/cpuinfo` to see the active rate to make sure
>     the governor did its job.
>
>     If it scales with CPU then something in bcache is working too hard.
>     Maybe garbage collection?  Other devs would need to chime in here to
>     steer the troubleshooting if that is the case.

Maybe the system memory is small?  1.5ms is too slow; I cannot imagine how 
it can be that slow...


>
> 5. I'm not sure if garbage collection is the issue, but you might try
>     Mingzhe's dynamic incremental gc patch:
> 	https://www.spinics.net/lists/linux-bcache/msg11185.html
>
> 6. Try dm-cache and see if its IO latency is similar to bcache: If it is
>     about the same then that would indicate an issue in the block layer
>     somewhere outside of bcache.  If dm-cache is better, then that confirms
>     a bcache issue.

Great idea.


>
>> The cache was configured directly on one of the NVMe partitions (in this
>> case, the first partition). I did several tests using fio and ioping,
>> testing on a partition on the NVMe device, without partition and
>> directly on the raw block, on a first partition, on the second, with or
>> without configuring bcache. I did all this to remove any doubt as to the
>> method. The results of tests performed directly on the hardware device,
>> without going through bcache are always fast and similar.


What is the performance number on the whole NVMe disk, without a 
partition?  In case the partition's start LBA is not perfectly aligned to 
some boundary...

Can you share the hardware configuration and the NVMe SSD spec? Maybe I 
can find a similar one around my location and give it a try if I am 
lucky.


Thanks.


Coly Li




^ permalink raw reply	[flat|nested] 37+ messages in thread

* [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
  2022-05-18  1:22   ` Eric Wheeler
  2022-05-23 14:07     ` Coly Li
@ 2022-05-23 18:36     ` Eric Wheeler
  2022-05-24  5:34       ` Christoph Hellwig
       [not found]     ` <681726005.1812841.1653564986700@mail.yahoo.com>
  2 siblings, 1 reply; 37+ messages in thread
From: Eric Wheeler @ 2022-05-23 18:36 UTC (permalink / raw)
  To: Coly Li; +Cc: Adriano Silva, Bcache Linux, Matthias Ferdinand, linux-block

[-- Attachment #1: Type: text/plain, Size: 6042 bytes --]

On Tue, 17 May 2022, Eric Wheeler wrote:
>   /sys/fs/bcache/<cset-uuid>/journal_delay_ms
>     Journal writes will delay for up to this many milliseconds, unless a 
>     cache flush happens sooner. Defaults to 100.
> 
> I just noticed that journal_delay_ms says "unless a cache flush happens 
> sooner" but cache flushes can be re-ordered so flushing the journal when 
> REQ_OP_FLUSH comes through may not be useful, especially if there is a 
> high volume of flushes coming down the pipe because the flushes could kill 
> the NVMe's cache---and maybe the 1.5ms ping is actual flash latency.  It
> would flush data and journal.
> 
> Maybe there should be a cachedev_noflush sysfs option for those with some 
> kind of power-loss protection on their SSDs.  It looks like this is 
> handled in request.c when these functions call bch_journal_meta():
> 
> 	1053: static void cached_dev_nodata(struct closure *cl)
> 	1263: static void flash_dev_nodata(struct closure *cl)
> 
> Coly can you comment about journal flush semantics with respect to 
> performance vs correctness and crash safety?
> 
> Adriano, as a test, you could change this line in search_alloc() in 
> request.c:
> 
> 	- s->iop.flush_journal    = op_is_flush(bio->bi_opf);
> 	+ s->iop.flush_journal    = 0;
> 
> and see how performance changes.

Hi Coly, all:

Can you think of any reason that forcing iop.flush_journal=0 for bcache 
devices backed by a non-volatile cache would be unsafe?

If it is safe, then three new sysctl flags to optionally drop flushes 
would increase overall bcache performance by avoiding controller flushes, 
especially on the spinning disks.  These would of course default to 0:

  - noflush_journal - no flush on journal writes
  - noflush_cache   - no flush on normal cache IO writes
  - noflush_bdev    - no flush on normal bdev IO writes

What do you think?

From Coly's iopings:

>  # ./ioping -c10 /dev/bcache0 -D -Y -WWW -s4k
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=1 time=144.3 us (warmup)
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=2 time=84.1 us
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=3 time=71.8 us
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=4 time=68.9 us
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=5 time=69.8 us
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=6 time=68.7 us
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=7 time=68.8 us
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=8 time=70.3 us
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=9 time=68.8 us
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=10 time=68.5 us
 
^ Average is 71.1 us.

>  # ./ioping -c10 /dev/bcache0 -D -WWW -s4k
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=1 time=127.8 us (warmup)
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=2 time=67.8 us
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=3 time=60.3 us
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=4 time=46.9 us
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=5 time=52.6 us
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=6 time=43.8 us
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=7 time=52.7 us
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=8 time=44.3 us
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=9 time=52.0 us
> 4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=10 time=44.6 us

^ Average is 51.7 us.

Dropping sync write flushes provides a 27% reduction in SSD latency 
((71.1 - 51.7) / 71.1 ~= 27%)!


--
Eric Wheeler



> 
> Someone correct me if I'm wrong, but I don't think flush_journal=0 will 
> affect correctness unless there is a crash.  If that /is/ the performance 
> problem then it would narrow the scope of this discussion.
> 
> 4. I wonder if your 1.5ms `ioping` stats scale with CPU clock speed: can 
>    you set your CPU governor to run at full clock speed and then slowest 
>    clock speed to see if it is a CPU limit somewhere as we expect?
> 
>    You can do `grep MHz /proc/cpuinfo` to see the active rate to make sure 
>    the governor did its job.  
> 
>    If it scales with CPU then something in bcache is working too hard.  
>    Maybe garbage collection?  Other devs would need to chime in here to 
>    steer the troubleshooting if that is the case.
> 
> 
> 5. I'm not sure if garbage collection is the issue, but you might try 
>    Mingzhe's dynamic incremental gc patch:
> 	https://www.spinics.net/lists/linux-bcache/msg11185.html
> 
> 6. Try dm-cache and see if its IO latency is similar to bcache: If it is 
>    about the same then that would indicate an issue in the block layer 
>    somewhere outside of bcache.  If dm-cache is better, then that confirms 
>    a bcache issue.
> 
> 
> > The cache was configured directly on one of the NVMe partitions (in this 
> > case, the first partition). I did several tests using fio and ioping, 
> > testing on a partition on the NVMe device, without partition and 
> > directly on the raw block, on a first partition, on the second, with or 
> > without configuring bcache. I did all this to remove any doubt as to the 
> > method. The results of tests performed directly on the hardware device, 
> > without going through bcache are always fast and similar.
> > 
> > But tests in bcache are always slower. If you use writethrough, of 
> > course, it gets much worse, because the performance is equal to the raw 
> > spinning disk.
> > 
> > Using writeback improves a lot, but still doesn't use the full speed of 
> > NVMe (honestly, much less than full speed).
> 
> Indeed, I hope this can be fixed!  A 20x improvement in bcache would 
> be awesome.
>  
> > But I've also noticed that there is a limit on writing sequential data, 
> > which is a little more than half of the maximum write rate shown in 
> > direct tests by the NVMe device.
> 
> For sync, async, or both?
> 
> > Processing doesn't seem to be going up like the tests.
> 
> What do you mean "processing" ?
> 
> -Eric
> 
> 
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
  2022-05-23 18:36     ` [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync) Eric Wheeler
@ 2022-05-24  5:34       ` Christoph Hellwig
  2022-05-24 20:14         ` Eric Wheeler
  0 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2022-05-24  5:34 UTC (permalink / raw)
  To: Eric Wheeler
  Cc: Coly Li, Adriano Silva, Bcache Linux, Matthias Ferdinand, linux-block

... wait.

Can someone explain what this is all about?  Devices with power fail
protection will advertise that (using VWC flag in NVMe for example) and
we will never send flushes.  So anything that explicitly disables
flushes will generally cause data corruption.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
  2022-05-24  5:34       ` Christoph Hellwig
@ 2022-05-24 20:14         ` Eric Wheeler
  2022-05-24 20:34           ` Keith Busch
  2022-05-25  5:17           ` Christoph Hellwig
  0 siblings, 2 replies; 37+ messages in thread
From: Eric Wheeler @ 2022-05-24 20:14 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Coly Li, Adriano Silva, Bcache Linux, Matthias Ferdinand, linux-block

Hi Christoph,

On Mon, 23 May 2022, Christoph Hellwig wrote:
> ... wait.
> 
> Can someone explain what this is all about?  Devices with power fail 
> protection will advertise that (using VWC flag in NVMe for example) and 
> we will never send flushes. So anything that explicitly disables flushed 
> will generally cause data corruption.

Adriano was getting 1.5ms sync-write ioping's to an NVMe through bcache 
(instead of the expected ~70us), so perhaps the NVMe flushes were killing 
performance if every write was also forcing an erase cycle.

The suggestion was to disable flushes in bcache as a troubleshooting step 
to see if that solved the problem, but with the warning that it could be 
unsafe.

Questions:

1. If a user knows their disks have a non-volatile cache then is it safe 
   to drop flushes?

2. If not, then under what circumstances is it unsafe with a non-volatile 
   cache?
  
3. Since the block layer won't send flushes when the hardware reports that 
   the cache is non-volatile, then how do you query the device to make 
   sure it is reporting correctly?  For NVMe you can get VWC as:
	nvme id-ctrl -H /dev/nvme0 |grep -A1 vwc
   
   ...but how do you query a block device (like a RAID LUN) to make sure 
   it is reporting a non-volatile cache correctly?

--
Eric Wheeler



> 
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
  2022-05-24 20:14         ` Eric Wheeler
@ 2022-05-24 20:34           ` Keith Busch
  2022-05-24 21:34             ` Eric Wheeler
  2022-05-25  5:17           ` Christoph Hellwig
  1 sibling, 1 reply; 37+ messages in thread
From: Keith Busch @ 2022-05-24 20:34 UTC (permalink / raw)
  To: Eric Wheeler
  Cc: Christoph Hellwig, Coly Li, Adriano Silva, Bcache Linux,
	Matthias Ferdinand, linux-block

On Tue, May 24, 2022 at 01:14:18PM -0700, Eric Wheeler wrote:
> Hi Christoph,
> 
> On Mon, 23 May 2022, Christoph Hellwig wrote:
> > ... wait.
> > 
> > Can someone explain what this is all about?  Devices with power fail 
> > protection will advertise that (using VWC flag in NVMe for example) and 
> > we will never send flushes. So anything that explicitly disables flushed 
> > will generally cause data corruption.
> 
> Adriano was getting 1.5ms sync-write ioping's to an NVMe through bcache 
> (instead of the expected ~70us), so perhaps the NVMe flushes were killing 
> performance if every write was also forcing an erase cycle.
> 
> The suggestion was to disable flushes in bcache as a troubleshooting step 
> to see if that solved the problem, but with the warning that it could be 
> unsafe.
> 
> Questions:
> 
> 1. If a user knows their disks have a non-volatile cache then is it safe 
>    to drop flushes?
> 
> 2. If not, then under what circumstances is it unsafe with a non-volatile 
>    cache?
>   
> 3. Since the block layer wont send flushes when the hardware reports that 
>    the cache is non-volatile, then how do you query the device to make 
>    sure it is reporting correctly?  For NVMe you can get VWC as:
> 	nvme id-ctrl -H /dev/nvme0 |grep -A1 vwc
>    
>    ...but how do you query a block device (like a RAID LUN) to make sure 
>    it is reporting a non-volatile cache correctly?

You can check the queue attribute, /sys/block/<disk>/queue/write_cache. If the
value is "write through", then the device is reporting it doesn't have a
volatile cache. If it is "write back", then it has a volatile cache.
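
For example, this prints the current setting for every block device on the
system (sketch; device names will vary):

	grep -H . /sys/block/*/queue/write_cache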

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
  2022-05-24 20:34           ` Keith Busch
@ 2022-05-24 21:34             ` Eric Wheeler
  2022-05-25  5:20               ` Christoph Hellwig
  0 siblings, 1 reply; 37+ messages in thread
From: Eric Wheeler @ 2022-05-24 21:34 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Coly Li, Adriano Silva, Bcache Linux,
	Matthias Ferdinand, linux-block

On Tue, 24 May 2022, Keith Busch wrote:
> On Tue, May 24, 2022 at 01:14:18PM -0700, Eric Wheeler wrote:
> > Hi Christoph,
> > 
> > On Mon, 23 May 2022, Christoph Hellwig wrote:
> > > ... wait.
> > > 
> > > Can someone explain what this is all about?  Devices with power fail 
> > > protection will advertise that (using VWC flag in NVMe for example) and 
> > > we will never send flushes. So anything that explicitly disables flushed 
> > > will generally cause data corruption.
> > 
> > Adriano was getting 1.5ms sync-write ioping's to an NVMe through bcache 
> > (instead of the expected ~70us), so perhaps the NVMe flushes were killing 
> > performance if every write was also forcing an erase cycle.
> > 
> > The suggestion was to disable flushes in bcache as a troubleshooting step 
> > to see if that solved the problem, but with the warning that it could be 
> > unsafe.
> > 
> > Questions:
> > 
> > 1. If a user knows their disks have a non-volatile cache then is it safe 
> >    to drop flushes?
> > 
> > 2. If not, then under what circumstances is it unsafe with a non-volatile 
> >    cache?
> >   
> > 3. Since the block layer wont send flushes when the hardware reports that 
> >    the cache is non-volatile, then how do you query the device to make 
> >    sure it is reporting correctly?  For NVMe you can get VWC as:
> > 	nvme id-ctrl -H /dev/nvme0 |grep -A1 vwc
> >    
> >    ...but how do you query a block device (like a RAID LUN) to make sure 
> >    it is reporting a non-volatile cache correctly?
> 
> You can check the queue attribute, /sys/block/<disk>/queue/write_cache. If the
> value is "write through", then the device is reporting it doesn't have a
> volatile cache. If it is "write back", then it has a volatile cache.
 
Thanks, Keith!  

Is this flag influenced at all when /sys/block/sdX/queue/scheduler is set 
to "none", or does the write_cache flag operate independently of the 
selected scheduler?

Does the block layer stop sending flushes at the first device in the stack 
that is set to "write back"?  For example, if a device mapper target is 
writeback will it strip flushes on the way to the backing device?

This confirms what I have suspected all along: We have an LSI MegaRAID 
SAS-3516 where the write policy is "write back" in the LUN, but the cache 
is flagged in Linux as write-through:

	]# cat /sys/block/sdb/queue/write_cache 
	write through

I guess this is the correct place to adjust that behavior!


--
Eric Wheeler


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
  2022-05-24 20:14         ` Eric Wheeler
  2022-05-24 20:34           ` Keith Busch
@ 2022-05-25  5:17           ` Christoph Hellwig
  1 sibling, 0 replies; 37+ messages in thread
From: Christoph Hellwig @ 2022-05-25  5:17 UTC (permalink / raw)
  To: Eric Wheeler
  Cc: Christoph Hellwig, Coly Li, Adriano Silva, Bcache Linux,
	Matthias Ferdinand, linux-block

On Tue, May 24, 2022 at 01:14:18PM -0700, Eric Wheeler wrote:
> Adriano was getting 1.5ms sync-write ioping's to an NVMe through bcache 
> (instead of the expected ~70us), so perhaps the NVMe flushes were killing 
> performance if every write was also forcing an erase cycle.

This sounds very typical of a low end consumer grade NVMe SSD, yes.

> The suggestion was to disable flushes in bcache as a troubleshooting step 
> to see if that solved the problem, but with the warning that it could be 
> unsafe.

If you want to disable cache flushing (despite this being unsafe!) you can
do this for every block device:

	echo "write through" > /sys/block/XXX/queue/write_cache

> Questions:
> 
> 1. If a user knows their disks have a non-volatile cache then is it safe 
>    to drop flushes?

It is, but in that case the disk will not advertise a write cache, and
the flushes will not make it past the submit_bio and never reach the
driver.

> 3. Since the block layer wont send flushes when the hardware reports that 
>    the cache is non-volatile, then how do you query the device to make 
>    sure it is reporting correctly?  For NVMe you can get VWC as:
> 	nvme id-ctrl -H /dev/nvme0 |grep -A1 vwc
>    
>    ...but how do you query a block device (like a RAID LUN) to make sure 
>    it is reporting a non-volatile cache correctly?

cat /sys/block/XXX/queue/write_cache

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
  2022-05-24 21:34             ` Eric Wheeler
@ 2022-05-25  5:20               ` Christoph Hellwig
  2022-05-25 18:44                 ` Eric Wheeler
  2022-05-28  1:52                 ` Eric Wheeler
  0 siblings, 2 replies; 37+ messages in thread
From: Christoph Hellwig @ 2022-05-25  5:20 UTC (permalink / raw)
  To: Eric Wheeler
  Cc: Keith Busch, Christoph Hellwig, Coly Li, Adriano Silva,
	Bcache Linux, Matthias Ferdinand, linux-block

On Tue, May 24, 2022 at 02:34:23PM -0700, Eric Wheeler wrote:
> Is this flag influced at all when /sys/block/sdX/queue/scheduler is set 
> to "none", or does the write_cache flag operate independently of the 
> selected scheduler?

This is completely independent from the scheduler.

> Does the block layer stop sending flushes at the first device in the stack 
> that is set to "write back"?  For example, if a device mapper target is 
> writeback will it strip flushes on the way to the backing device?

This is up to the stacking driver.  dm and md tend to pass through flushes
where needed.

> This confirms what I have suspected all along: We have an LSI MegaRAID 
> SAS-3516 where the write policy is "write back" in the LUN, but the cache 
> is flagged in Linux as write-through:
> 
> 	]# cat /sys/block/sdb/queue/write_cache 
> 	write through
> 
> I guess this is the correct place to adjust that behavior!

MegaRAID has had all kinds of unsafe policies in the past unfortunately.
I'm not even sure all of them could pass through flushes properly if we
asked them to :(

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
  2022-05-25  5:20               ` Christoph Hellwig
@ 2022-05-25 18:44                 ` Eric Wheeler
  2022-05-26  9:06                   ` Christoph Hellwig
  2022-05-28  1:52                 ` Eric Wheeler
  1 sibling, 1 reply; 37+ messages in thread
From: Eric Wheeler @ 2022-05-25 18:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, Coly Li, Adriano Silva, Bcache Linux,
	Matthias Ferdinand, linux-block

On Tue, 24 May 2022, Christoph Hellwig wrote:
> On Tue, May 24, 2022 at 02:34:23PM -0700, Eric Wheeler wrote:
> > Is this flag influced at all when /sys/block/sdX/queue/scheduler is set 
> > to "none", or does the write_cache flag operate independently of the 
> > selected scheduler?
> 
> This is up to the stacking driver.  dm and tend to pass through flushes
> where needed.
> 
> > This confirms what I have suspected all along: We have an LSI MegaRAID 
> > SAS-3516 where the write policy is "write back" in the LUN, but the cache 
> > is flagged in Linux as write-through:
> > 
> > 	]# cat /sys/block/sdb/queue/write_cache 
> > 	write through
> > 
> > I guess this is the correct place to adjust that behavior!
> 
> MegaRAID has had all kinds of unsafe policies in the past unfortunately.
> I'm not even sure all of them could pass through flushes properly if we
> asked them to :(

Thanks for the feedback, great info!

In your experience, which SAS/SATA RAID controllers are best behaved in 
terms of policies and reporting things like io_opt and 
writeback/writethrough to the kernel?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
  2022-05-25 18:44                 ` Eric Wheeler
@ 2022-05-26  9:06                   ` Christoph Hellwig
  0 siblings, 0 replies; 37+ messages in thread
From: Christoph Hellwig @ 2022-05-26  9:06 UTC (permalink / raw)
  To: Eric Wheeler
  Cc: Christoph Hellwig, Keith Busch, Coly Li, Adriano Silva,
	Bcache Linux, Matthias Ferdinand, linux-block

On Wed, May 25, 2022 at 11:44:01AM -0700, Eric Wheeler wrote:
> In your experience, which SAS/SATA RAID controllers are best behaved in 
> terms of policies and reporting things like io_opt and 
> writeback/writethrough to the kernel?

I never had actually good experiences with any of them.  That being
said I also haven't used one for years.  For SAS or SATA attachd to
expanders setups I've mostly used the mpt2/3 family of controllers
which are doing okay.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Bcache in writes direct with fsync. Are IOPS limited?
  2022-05-23 14:07     ` Coly Li
@ 2022-05-26 19:15       ` Eric Wheeler
  2022-05-27 17:28         ` colyli
  0 siblings, 1 reply; 37+ messages in thread
From: Eric Wheeler @ 2022-05-26 19:15 UTC (permalink / raw)
  To: Coly Li; +Cc: Adriano Silva, Bcache Linux, Matthias Ferdinand

On Mon, 23 May 2022, Coly Li wrote:
> On 5/18/22 9:22 AM, Eric Wheeler wrote:
> > Some time ago you ordered an an SSD to test the 4k cache issue, has that
> > been fixed?  I've kept an eye out for the patch but not sure if it was
> > released.
> 
> Yes, I got the Intel P3700 PCIe SSD to fix the 4Kn unaligned I/O issue
> (borrowed from a hardware vendor). The new situation is that the current
> kernel does the sector-size alignment check quite early in the bio layer:
> if the LBA is not sector-size aligned, the bio is rejected in the bio code
> and the underlying driver doesn't get a chance to see it anymore. So for now,
> unaligned LBAs for a 4Kn device cannot reach the bcache code, that is to say,
> the originally reported condition won't happen now.

The issue is not with unaligned 4k IOs hitting /dev/bcache0 because you
are right, the bio layer will reject those before even getting to
bcache:

The issue is that bcache itself sometimes issues metadata or
journal requests from _inside_ bcache that are not 4k aligned.  When
this happens the bio layer rejects the request from bcache (not from
whatever is above bcache).

Correct me if I misunderstood what you meant here, maybe it really was 
fixed.  Here is your response from that old thread that pointed at 
unaligned key access where you said "Wow, the above lines are very 
informative, thanks!"

bcache: check_4k_alignment() KEY_OFFSET(&w->key) is not 4KB aligned:  15725385535
  https://www.spinics.net/lists/linux-bcache/msg06076.html

In that thread Kent sent a quick top-post asking "have you checked extent 
merging?"
	https://www.spinics.net/lists/linux-bcache/msg06077.html

> And after this observation, I stopped my investigation on the unaligned sector
> size I/O on 4Kn device, and returned the P3700 PCIe SSD to the hardware
> vendor.

Hmm, sorry that it wasn't reproduced.  I hope I'm wrong, but if bcache is 
generating the 4k-unaligned requests against the cache meta then this bug 
might still be floating around for "4Kn" cache users.
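
If anyone with a "4Kn" cache wants to check for this, a rough sketch (cache 
device name assumed, run while bcache is busy) is to watch for writes issued 
to the cache device whose sector offset or length is not a multiple of 8 
(8 x 512B = 4KiB):

	blktrace -d /dev/nvme0n1 -o - | blkparse -i - | \
		awk '$6 == "D" && $7 ~ /W/ && ($8 % 8 != 0 || $10 % 8 != 0)'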

-Eric

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Bcache in writes direct with fsync. Are IOPS limited?
       [not found]     ` <681726005.1812841.1653564986700@mail.yahoo.com>
@ 2022-05-26 20:20       ` Adriano Silva
  2022-05-26 20:28       ` Eric Wheeler
  1 sibling, 0 replies; 37+ messages in thread
From: Adriano Silva @ 2022-05-26 20:20 UTC (permalink / raw)
  To: Eric Wheeler, Bcache Linux, Matthias Ferdinand, Coly Li

Hi People,

Thanks for answering.

This is an enterprise NVMe device with a Power Loss Protection system. It has a non-volatile cache.

Before purchasing these enterprise devices, I did tests with consumer NVMe. Consumer device performance is acceptable only for hardware-cached writes. On the contrary, in tests on consumer devices with fio passing parameters for direct and synchronous writing (--direct=1 --fsync=1 --rw=randwrite --bs=4K --numjobs=1 --iodepth=1), the performance is very low. So today I'm using enterprise NVMe with tantalum capacitors, which makes the cache non-volatile and performs much better when writing directly to the hardware. But the performance issue only occurs when the write is directed to the bcache device.

Here is the information about my hardware that you asked for (Eric), plus some additional information to try to help.

root@pve-20:/# blockdev --getss /dev/nvme0n1
512
root@pve-20:/# blockdev --report /dev/nvme0n1
RO    RA   SSZ   BSZ   StartSec            Size   Device
rw   256   512  4096          0    960197124096   /dev/nvme0n1
root@pve-20:/# blockdev --getioopt /dev/nvme0n1
512
root@pve-20:/# blockdev --getiomin /dev/nvme0n1
512
root@pve-20:/# blockdev --getpbsz /dev/nvme0n1
512
root@pve-20:/# blockdev --getmaxsect /dev/nvme0n1
256
root@pve-20:/# blockdev --getbsz /dev/nvme0n1
4096
root@pve-20:/# blockdev --getsz /dev/nvme0n1
1875385008
root@pve-20:/# blockdev --getra /dev/nvme0n1
256
root@pve-20:/# blockdev --getfra /dev/nvme0n1
256
root@pve-20:/# blockdev --getdiscardzeroes /dev/nvme0n1
0
root@pve-20:/# blockdev --getalignoff /dev/nvme0n1
0

root@pve-20:~# nvme id-ctrl -H /dev/nvme0n1 |grep -A1 vwc
vwc       : 0
  [0:0] : 0    Volatile Write Cache Not Present
root@pve-20:~#


root@pve-20:~# nvme id-ctrl /dev/nvme0n1
NVME Identify Controller:
vid       : 0x1c5c
ssvid     : 0x1c5c
sn        : EI6............................D2Q   
mn        : HFS960GD0MEE-5410A                      
fr        : 40033A00
rab       : 1
ieee      : ace42e
cmic      : 0
mdts      : 5
cntlid    : 0
ver       : 10200
rtd3r     : 90f560
rtd3e     : ea60
oaes      : 0
ctratt    : 0
rrls      : 0
oacs      : 0x6
acl       : 3
aerl      : 3
frmw      : 0xf
lpa       : 0x2
elpe      : 254
npss      : 2
avscc     : 0x1
apsta     : 0
wctemp    : 353
cctemp    : 361
mtfa      : 0
hmpre     : 0
hmmin     : 0
tnvmcap   : 0
unvmcap   : 0
rpmbs     : 0
edstt     : 2
dsto      : 0
fwug      : 0
kas       : 0
hctma     : 0
mntmt     : 0
mxtmt     : 0
sanicap   : 0
hmminds   : 0
hmmaxd    : 0
nsetidmax : 0
anatt     : 0
anacap    : 0
anagrpmax : 0
nanagrpid : 0
sqes      : 0x66
cqes      : 0x44
maxcmd    : 0
nn        : 1
oncs      : 0x14
fuses     : 0
fna       : 0x4
vwc       : 0
awun      : 255
awupf     : 0
nvscc     : 1
nwpc      : 0
acwu      : 0
sgls      : 0
mnan      : 0
subnqn    :
ioccsz    : 0
iorcsz    : 0
icdoff    : 0
ctrattr   : 0
msdbd     : 0
ps    0 : mp:7.39W operational enlat:1 exlat:1 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:2.02W active_power:4.02W
ps    1 : mp:6.82W operational enlat:1 exlat:1 rrt:1 rrl:1
          rwt:1 rwl:1 idle_power:2.02W active_power:2.02W
ps    2 : mp:4.95W operational enlat:1 exlat:1 rrt:2 rrl:2
          rwt:2 rwl:2 idle_power:2.02W active_power:2.02W
root@pve-20:~#

root@pve-20:~# nvme id-ns /dev/nvme0n1
NVME Identify Namespace 1:
nsze    : 0x6fc81ab0
ncap    : 0x6fc81ab0
nuse    : 0x6fc81ab0
nsfeat  : 0
nlbaf   : 0
flbas   : 0x10
mc      : 0
dpc     : 0
dps     : 0
nmic    : 0
rescap  : 0
fpi     : 0
dlfeat  : 0
nawun   : 0
nawupf  : 0
nacwu   : 0
nabsn   : 0
nabo    : 0
nabspf  : 0
noiob   : 0
nvmcap  : 0
nsattr    : 0
nvmsetid: 0
anagrpid: 0
endgid  : 0
nguid   : 00000000000000000000000000000000
eui64   : ace42e610000189f
lbaf  0 : ms:0   lbads:9  rp:0 (in use)
root@pve-20:~#

If anyone needs any more information about the hardware, please ask.

An interesting thing to note is that when I test using fio with --bs=512, the direct hardware performance is horrible (~1MB/s).

root@pve-20:/# fio --filename=/dev/nvme0n1p2 --direct=1 --fsync=1 --rw=randwrite --bs=512 --numjobs=1 --iodepth=1 --runtime=5 --time_based --group_reporting --name=journal-test --ioengine=libaio
journal-test: (g=0): rw=randwrite, bs=(R) 512B-512B, (W) 512B-512B, (T) 512B-512B, ioengine=libaio, iodepth=1
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=1047KiB/s][w=2095 IOPS][eta 00m:00s]
journal-test: (groupid=0, jobs=1): err= 0: pid=1715926: Mon May 23 14:05:28 2022
  write: IOPS=2087, BW=1044KiB/s (1069kB/s)(5220KiB/5001msec); 0 zone resets
    slat (nsec): min=3338, max=90998, avg=12760.92, stdev=3377.45
    clat (usec): min=32, max=945, avg=453.85, stdev=27.03
     lat (usec): min=46, max=953, avg=467.16, stdev=27.79
    clat percentiles (usec):
     |  1.00th=[  404],  5.00th=[  420], 10.00th=[  429], 20.00th=[  433],
     | 30.00th=[  437], 40.00th=[  453], 50.00th=[  465], 60.00th=[  465],
     | 70.00th=[  469], 80.00th=[  469], 90.00th=[  474], 95.00th=[  474],
     | 99.00th=[  494], 99.50th=[  502], 99.90th=[  848], 99.95th=[  889],
     | 99.99th=[  914]
   bw (  KiB/s): min= 1033, max= 1056, per=100.00%, avg=1044.22, stdev= 9.56, samples=9
   iops        : min= 2066, max= 2112, avg=2088.67, stdev=19.14, samples=9
  lat (usec)   : 50=0.03%, 100=0.01%, 500=99.38%, 750=0.44%, 1000=0.14%
  fsync/fdatasync/sync_file_range:
    sync (nsec): min=74, max=578, avg=279.19, stdev=45.25
    sync percentiles (nsec):
     |  1.00th=[  151],  5.00th=[  179], 10.00th=[  235], 20.00th=[  249],
     | 30.00th=[  255], 40.00th=[  278], 50.00th=[  294], 60.00th=[  298],
     | 70.00th=[  314], 80.00th=[  314], 90.00th=[  330], 95.00th=[  334],
     | 99.00th=[  346], 99.50th=[  350], 99.90th=[  374], 99.95th=[  386],
     | 99.99th=[  498]
  cpu          : usr=3.40%, sys=5.38%, ctx=10439, majf=0, minf=12
  IO depths    : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,10439,0,10438 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1044KiB/s (1069kB/s), 1044KiB/s-1044KiB/s (1069kB/s-1069kB/s), io=5220KiB (5345kB), run=5001-5001msec

Disk stats (read/write):
  nvme0n1: ios=58/10171, merge=0/0, ticks=10/4559, in_queue=0, util=97.64%

But the same test directly on the hardware with fio passing the parameter --bs=4K, the performance completely changes, for the better (~130MB/s).

root@pve-20:/# fio --filename=/dev/nvme0n1p2 --direct=1 --fsync=1 --rw=randwrite --bs=4K --numjobs=1 --iodepth=1 --runtime=5 --time_based --group_reporting --name=journal-test --ioengine=libaio
journal-test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=125MiB/s][w=31.9k IOPS][eta 00m:00s]
journal-test: (groupid=0, jobs=1): err= 0: pid=1725642: Mon May 23 14:13:50 2022
  write: IOPS=31.9k, BW=124MiB/s (131MB/s)(623MiB/5001msec); 0 zone resets
    slat (nsec): min=2942, max=87863, avg=3222.02, stdev=1233.34
    clat (nsec): min=865, max=1238.6k, avg=25283.31, stdev=24400.58
     lat (usec): min=24, max=1243, avg=28.63, stdev=24.45
    clat percentiles (usec):
     |  1.00th=[   23],  5.00th=[   23], 10.00th=[   23], 20.00th=[   23],
     | 30.00th=[   24], 40.00th=[   24], 50.00th=[   24], 60.00th=[   25],
     | 70.00th=[   26], 80.00th=[   26], 90.00th=[   26], 95.00th=[   29],
     | 99.00th=[   35], 99.50th=[   41], 99.90th=[  652], 99.95th=[  725],
     | 99.99th=[  766]
   bw (  KiB/s): min=125696, max=129008, per=99.98%, avg=127456.33, stdev=1087.63, samples=9
   iops        : min=31424, max=32252, avg=31864.00, stdev=271.99, samples=9
  lat (nsec)   : 1000=0.01%
  lat (usec)   : 2=0.01%, 20=0.01%, 50=99.59%, 100=0.24%, 250=0.01%
  lat (usec)   : 500=0.02%, 750=0.10%, 1000=0.02%
  lat (msec)   : 2=0.01%
  fsync/fdatasync/sync_file_range:
    sync (nsec): min=43, max=435, avg=68.51, stdev=10.83
    sync percentiles (nsec):
     |  1.00th=[   59],  5.00th=[   60], 10.00th=[   61], 20.00th=[   63],
     | 30.00th=[   64], 40.00th=[   65], 50.00th=[   66], 60.00th=[   67],
     | 70.00th=[   70], 80.00th=[   73], 90.00th=[   77], 95.00th=[   80],
     | 99.00th=[  122], 99.50th=[  147], 99.90th=[  177], 99.95th=[  189],
     | 99.99th=[  251]
  cpu          : usr=10.72%, sys=19.54%, ctx=159367, majf=0, minf=11
  IO depths    : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,159384,0,159383 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=124MiB/s (131MB/s), 124MiB/s-124MiB/s (131MB/s-131MB/s), io=623MiB (653MB), run=5001-5001msec

Disk stats (read/write):
  nvme0n1: ios=58/155935, merge=0/0, ticks=10/3823, in_queue=0, util=98.26%

Does anything justify this difference?

Maybe that's why, when I create the bcache with the -w 4K option, the performance improves. Not as much as I'd like, but it gets better.

I also noticed that when I use the --bs=4K parameter (or even larger blocks) together with the --ioengine=libaio parameter in the direct test on the hardware, the performance improves a lot, even doubling the speed in the case of 4K blocks. Without --ioengine=libaio, direct hardware is somewhere around 15K IOPS at 60.2 MB/s, but using this library it goes to 32K IOPS and 130MB/s.

That's why I have standardized using this parameter (--ioengine=libaio) in tests.

As for the bucket size, I read that it would be better to use the hardware device's erase block size. However, I have already tried to find this information by querying the device and by asking the manufacturer, but without success. So I have no idea which bucket size would be best, but from my tests the default of 512KB seems to be adequate.

Responding to Coly, I did tests using fio to write directly to the NVMe block device (/dev/nvme0n1), without going through any partition. Performance is always slightly better on hardware when writing directly to the block device without a partition, but the difference is minimal. This difference also seems to be reflected in bcache, but it is also very small (insignificant).

I've already noticed that, as I increase the number of jobs, the performance of the bcache0 device improves a lot, getting close to the performance of the tests done directly on the hardware.
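
For reference, the kind of sweep I mean, with the same flags as the earlier tests and only --numjobs varying (illustrative only):

root@pve-20:~# for j in 1 2 4 8 16; do echo "numjobs=$j"; fio --filename=/dev/bcache0 --direct=1 --fsync=1 --rw=randwrite --bs=4K --numjobs=$j --iodepth=1 --runtime=10 --time_based --group_reporting --name=job-sweep --ioengine=libaio | grep 'IOPS='; done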

Eric, perhaps it is not such a simple task to recompile the kernel with the suggested change. I'm working with Proxmox 6.4. I'm not sure, but I think their kernel may carry some adaptations. It is based on kernel 5.4, which is what they support.

Also, following Coly's suggestion, I'll try to run tests with kernel version 5.15 to see if that solves it. Would this version be good enough? It's just that, as I said above, since I'm using Proxmox, I'm afraid to change the kernel version they provide.

Eric, to be clear, the hardware I'm using has only 1 processor socket.

I'm trying to test with another identical computer (the same motherboard, the same processor, the same NVMe), with the difference that it only has 12GB of RAM, while the first has 48GB. It is an HP Z400 Workstation with an Intel Xeon X5680 six-core processor (12 threads) and DDR3 1333MHz 10600E memory (an old computer). On the second computer, I put a newer version of the distribution that uses a kernel based on version 5.15. I am now comparing the performance of the two computers in the lab.

On this second computer I had worse performance than on the first one (practically half the performance with bcache), despite the performance of the tests done directly on the NVMe being identical.

I tried going back to the same OS version on the first computer to keep the exact same scenario on both machines so I could first compare the two, and I tried to keep the exact same software configuration. However, nothing changed. Could the lower RAM be what makes the performance worse on the second?

I noticed a difference in behavior on the second computer compared to the first in dstat. While the first computer doesn't seem to touch the backing device at all, the second computer shows something a little different: although it doesn't write data to the backing disk, it does report IO activity on it. Strange, no?

Let's look at the dstat of the first computer:

--dsk/sdb---dsk/nvme0n1-dsk/bcache0 ---io/sdb----io/nvme0n1--io/bcache0 -net/total- ---load-avg--- --total-cpu-usage-- ---system-- ----system---- async
 read  writ: read  writ: read  writ| read  writ: read  writ: read  writ| recv  send| 1m   5m  15m |usr sys idl wai stl| int   csw |     time     | #aio
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |6953B 7515B|0.13 0.26 0.26|  0   0  99   0   0| 399   634 |25-05 09:41:42|   0
   0  8192B:4096B 2328k:   0  1168k|   0  2.00 :1.00   586 :   0   587 |9150B 2724B|0.13 0.26 0.26|  2   2  96   0   0|1093  3267 |25-05 09:41:43|   1B
   0     0 :   0    58M:   0    29M|   0     0 :   0  14.8k:   0  14.7k|  14k 9282B|0.13 0.26 0.26|  1   3  94   2   0|  16k   67k|25-05 09:41:44|   1B
   0     0 :   0    58M:   0    29M|   0     0 :   0  14.9k:   0  14.8k|  10k 8992B|0.13 0.26 0.26|  1   3  93   2   0|  16k   69k|25-05 09:41:45|   1B
   0     0 :   0    58M:   0    29M|   0     0 :   0  14.9k:   0  14.8k|7281B 4651B|0.13 0.26 0.26|  1   3  92   4   0|  16k   67k|25-05 09:41:46|   1B
   0     0 :   0    59M:   0    30M|   0     0 :   0  15.2k:   0  15.1k|7849B 4729B|0.20 0.28 0.27|  1   4  94   2   0|  16k   69k|25-05 09:41:47|   1B
   0     0 :   0    57M:   0    28M|   0     0 :   0  14.4k:   0  14.4k|  11k 8584B|0.20 0.28 0.27|  1   3  94   2   0|  15k   65k|25-05 09:41:48|   0
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |4086B 7720B|0.20 0.28 0.27|  0   0 100   0   0| 274   332 |25-05 09:41:49|   0

Note that on this first computer, the writes and IOs of the backing device (sdb) stay at zero, while the NVMe device IOs track the bcache0 device IOs at ~14.8K.

Let's see the dstat now on the second computer:

--dsk/sdd---dsk/nvme0n1-dsk/bcache0 ---io/sdd----io/nvme0n1--io/bcache0 -net/total- ---load-avg--- --total-cpu-usage-- ---system-- ----system---- async
 read  writ: read  writ: read  writ| read  writ: read  writ: read  writ| recv  send| 1m   5m  15m |usr sys idl wai stl| int   csw |     time     | #aio
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |9254B 3301B|0.15 0.19 0.11|  1   2  97   0   0| 360   318 |26-05 06:27:15|   0
   0  8192B:4096B   19M:   0  9600k|   0  2402 :1.00  4816 :   0  4801 |8826B 3619B|0.15 0.19 0.11|  0   1  98   0   0|8115    27k|26-05 06:27:16|   1B
   0     0 :   0    21M:   0    11M|   0  2737 :   0  5492 :   0  5474 |4051B 2552B|0.15 0.19 0.11|  0   2  97   1   0|9212    31k|26-05 06:27:17|   1B
   0     0 :   0    23M:   0    11M|   0  2890 :   0  5801 :   0  5781 |4816B 2492B|0.15 0.19 0.11|  1   2  96   2   0|9976    34k|26-05 06:27:18|   1B
   0     0 :   0    23M:   0    11M|   0  2935 :   0  5888 :   0  5870 |4450B 2552B|0.22 0.21 0.12|  0   2  96   2   0|9937    33k|26-05 06:27:19|   1B
   0     0 :   0    22M:   0    11M|   0  2777 :   0  5575 :   0  5553 |8644B 1614B|0.22 0.21 0.12|  0   2  98   0   0|9416    31k|26-05 06:27:20|   1B
   0     0 :   0  2096k:   0  1040k|   0   260 :   0   523 :   0   519 |  10k 8760B|0.22 0.21 0.12|  0   1  99   0   0|1246  3157 |26-05 06:27:21|   0
   0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |4083B 2990B|0.22 0.21 0.12|  0   0 100   0   0| 390   369 |26-05 06:27:22|   0

In this case, with exactly the same command, we got a very different result. While writes to the backing device (sdd) do not happen (this is correct), we noticed that IOs occur on both the NVMe device and the backing device (I think this is wrong), but at a much lower rate now, around 5.6K on NVMe and 2.8K on the backing device. It leaves the impression that although it is not writing anything to the sdd device, it is sending some request to the backing device for every two IO operations performed on the cache device, and that would be delaying the response. Could it be something like this?

It is important to point out that writeback mode is on, obviously, and that the sequential cutoff is at zero, but I tried default values or high values and there were no changes. I also tried changing congested_write_threshold_us and congested_read_threshold_us, also with no change in results.

The only thing I noticed different between the configurations of the two computers was btree_cache_size, which on the first is much larger (7.7M) while on the second it is only 768K. But I don't know whether this parameter is configurable and whether it could explain the difference.
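
For reference, that value comes from the cache set's sysfs directory; something like this should show it on both machines (cache set UUID path assumed):

root@pve-20:~# grep -H . /sys/fs/bcache/*/btree_cache_size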

Disabling Intel's Turbo Boost technology through the BIOS appears to have no effect.

And we will continue our tests comparing the two computers, including testing the two kernel versions. If anyone else has ideas, thanks!

On Tuesday, May 17, 2022, 22:23:09 BRT, Eric Wheeler <bcache@lists.ewheeler.net> wrote: 





On Tue, 10 May 2022, Adriano Silva wrote:
> I'm trying to set up a flash disk NVMe as a disk cache for two or three 
> isolated (I will use 2TB disks, but in these tests I used a 1TB one) 
> spinning disks that I have on a Linux 5.4.174 (Proxmox node).

Coly has been adding quite a few optimizations over the years.  You might 
try a new kernel and see if that helps.  More below.

> I'm using a NVMe (960GB datacenter devices with tantalum capacitors) as 
> a cache.
> [...]
>
> But when I do the same test on bcache writeback, the performance drops a 
> lot. Of course, it's better than the performance of spinning disks, but 
> much worse than when accessed directly from the NVMe device hardware.
>
> [...]
> As we can see, the same test done on the bcache0 device only got 1548 
> IOPS and that yielded only 6.3 KB/s.

Well done on the benchmarking!  I always thought our new NVMes performed 
slower than expected but hadn't gotten around to investigating. 

> I've noticed in several tests, varying the amount of jobs or increasing 
> the size of the blocks, that the larger the size of the blocks, the more 
> I approximate the performance of the physical device to the bcache 
> device.

You said "blocks" but did you mean bucket size (make-bcache -b) or block 
size (make-bcache -w) ?

If larger buckets makes it slower than that actually surprises me: bigger 
buckets means less metadata and better sequential writeback to the 
spinning disks (though you hadn't yet hit writeback to spinning disks in 
your stats).  Maybe you already tried, but varying the bucket size might 
help.  Try graphing bucket size (powers of 2) against IOPS, maybe there is 
a "sweet spot"?

Be aware that 4k blocks (so-called "4Kn") is unsafe for the cache device, 
unless Coly has patched that.  Make sure your `blockdev --getss` reports 
512 for your NVMe!

Hi Coly,

Some time ago you ordered an an SSD to test the 4k cache issue, has that 
been fixed?  I've kept an eye out for the patch but not sure if it was released.

You have a really great test rig setup with NVMes for stress
testing bcache. Can you replicate Adriano's `ioping` numbers below?

> With ioping it is also possible to notice a limitation, as the latency 
> of the bcache0 device is around 1.5ms, while in the case of the raw 
> device (a partition of NVMe), the same test is only 82.1us.
> 
> root@pve-20:~# ioping -c10 /dev/bcache0 -D -Y -WWW -s4k
> 4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=1 time=1.52 ms (warmup)
> 4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=2 time=1.60 ms
> 4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=3 time=1.55 ms
>
> root@pve-20:~# ioping -c10 /dev/nvme0n1p2 -D -Y -WWW -s4k
> 4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=1 time=81.2 us (warmup)
> 4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=2 time=82.7 us
> 4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=3 time=82.4 us

Wow, almost 20x higher latency, sounds convincing that something is wrong.

A few things to try:

1. Try ioping without -Y.  How does it compare?

2. Maybe this is an inter-socket latency issue.  Is your server 
  multi-socket?  If so, then as a first pass you could set the kernel 
  cmdline `isolcpus` for testing to limit all processes to a single 
  socket where the NVMe is connected (see `lscpu`).  Check `hwloc-ls`
  or your motherboard manual to see how the NVMe port is wired to your
  CPUs.

  If that helps then fine tune with `numactl -cN ioping` and 
  /proc/irq/<n>/smp_affinity_list (and `grep nvme /proc/interrupts`) to 
  make sure your NVMe's are locked to IRQs on the same socket.

3a. sysfs:

> # echo 0 > /sys/block/bcache0/bcache/sequential_cutoff

good.

> # echo 0 > /sys/fs/bcache/<cache set>/congested_read_threshold_us
> # echo 0 > /sys/fs/bcache/<cache set>/congested_write_threshold_us

Also try these (I think bcache/cache is a symlink to /sys/fs/bcache/<cache set>)

echo 10000000 > /sys/block/bcache0/bcache/cache/congested_read_threshold_us 
echo 10000000 > /sys/block/bcache0/bcache/cache/congested_write_threshold_us


Try tuning journal_delay_ms: 
  /sys/fs/bcache/<cset-uuid>/journal_delay_ms
    Journal writes will delay for up to this many milliseconds, unless a 
    cache flush happens sooner. Defaults to 100.

3b: Hacking bcache code:

I just noticed that journal_delay_ms says "unless a cache flush happens 
sooner" but cache flushes can be re-ordered so flushing the journal when 
REQ_OP_FLUSH comes through may not be useful, especially if there is a 
high volume of flushes coming down the pipe because the flushes could kill 
the NVMe's cache---and maybe the 1.5ms ping is actual flash latency.  It
would flush data and journal.

Maybe there should be a cachedev_noflush sysfs option for those with some 
kind of power-loss protection on their SSDs.  It looks like this is 
handled in request.c when these functions call bch_journal_meta():

    1053: static void cached_dev_nodata(struct closure *cl)
    1263: static void flash_dev_nodata(struct closure *cl)

Coly can you comment about journal flush semantics with respect to 
performance vs correctness and crash safety?

Adriano, as a test, you could change this line in search_alloc() in 
request.c:

    - s->iop.flush_journal    = op_is_flush(bio->bi_opf);
    + s->iop.flush_journal    = 0;

and see how performance changes.

Someone correct me if I'm wrong, but I don't think flush_journal=0 will 
affect correctness unless there is a crash.  If that /is/ the performance 
problem then it would narrow the scope of this discussion.

4. I wonder if your 1.5ms `ioping` stats scale with CPU clock speed: can 
  you set your CPU governor to run at full clock speed and then slowest 
  clock speed to see if it is a CPU limit somewhere as we expect?

  You can do `grep MHz /proc/cpuinfo` to see the active rate to make sure 
  the governor did its job.  

  If it scales with CPU then something in bcache is working too hard.  
  Maybe garbage collection?  Other devs would need to chime in here to 
  steer the troubleshooting if that is the case.


5. I'm not sure if garbage collection is the issue, but you might try 
  Mingzhe's dynamic incremental gc patch:
    https://www.spinics.net/lists/linux-bcache/msg11185.html

6. Try dm-cache and see if its IO latency is similar to bcache: If it is 
  about the same then that would indicate an issue in the block layer 
  somewhere outside of bcache.  If dm-cache is better, then that confirms 
  a bcache issue.


> The cache was configured directly on one of the NVMe partitions (in this 
> case, the first partition). I did several tests using fio and ioping, 
> testing on a partition on the NVMe device, without partition and 
> directly on the raw block, on a first partition, on the second, with or 
> without configuring bcache. I did all this to remove any doubt as to the 
> method. The results of tests performed directly on the hardware device, 
> without going through bcache are always fast and similar.
> 
> But tests in bcache are always slower. If you use writethrough, of 
> course, it gets much worse, because the performance is equal to the raw 
> spinning disk.
> 
> Using writeback improves a lot, but still doesn't use the full speed of 
> NVMe (honestly, much less than full speed).

Indeed, I hope this can be fixed!  A 20x improvement in bcache would 
be awesome.

> But I've also noticed that there is a limit on writing sequential data, 
> which is a little more than half of the maximum write rate shown in 
> direct tests by the NVMe device.

For sync, async, or both?


> Processing doesn't seem to be going up like the tests.


What do you mean "processing" ?

-Eric


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Bcache in writes direct with fsync. Are IOPS limited?
       [not found]     ` <681726005.1812841.1653564986700@mail.yahoo.com>
  2022-05-26 20:20       ` Bcache in writes direct with fsync. Are IOPS limited? Adriano Silva
@ 2022-05-26 20:28       ` Eric Wheeler
  2022-05-27  4:07         ` Adriano Silva
  1 sibling, 1 reply; 37+ messages in thread
From: Eric Wheeler @ 2022-05-26 20:28 UTC (permalink / raw)
  To: Adriano Silva; +Cc: Coly Li, Bcache Linux, Matthias Ferdinand


On Thu, 26 May 2022, Adriano Silva wrote:
> This is a enterprise NVMe device with Power Loss Protection system. It 
> has a non-volatile cache.
> 
> Before purchasing these enterprise devices, I did tests with consumer 
> NVMe. Consumer device performance is acceptable only on hardware cached 
> writes. But on the contrary on consumer devices in tests with fio 
> passing parameters for direct and synchronous writing (--direct=1 
> --fsync=1 --rw=randwrite --bs=4K --numjobs=1 --iodepth= 1) the 
> performance is very low. So today I'm using enterprise NVME with 
> tantalum capacitors which makes the cache non-volatile and performs much 
> better when written directly to the hardware. But the performance issue 
> is only occurring when the write is directed to the bcache device.
> 
> Here is information from my Hardware you asked for (Eric), plus some 
> additional information to try to help.
> 
> root@pve-20:/# blockdev --getss /dev/nvme0n1
> 512
> root@pve-20:/# blockdev --report /dev/nvme0n1
> RO    RA   SSZ   BSZ   StartSec            Size   Device
> rw   256   512  4096          0    960197124096   /dev/nvme0n1

> root@pve-20:~# nvme id-ctrl -H /dev/nvme0n1 |grep -A1 vwc
> vwc       : 0
>   [0:0] : 0    Volatile Write Cache Not Present

Please confirm that this says "write back":

]# cat /sys/block/nvme0n1/queue/write_cache 

Try this to report _all_ blockdevs as write-through, so the kernel stops 
sending flushes, and see if it affects performance (warning: after this 
command, power loss is unsafe for any device whose cache is actually 
volatile):

]# for i in /sys/block/*/queue/write_cache; do echo 'write through' > $i; done
 
> An interesting thing to note is that when I test using fio with 
> --bs=512, the direct hardware performance is horrible (~1MB/s).

I think you know this already, but for CYA:

   WARNING: THESE ARE DESTRUCTIVE WRITES, DO NOT USE ON PRODUCTION DATA!

Please post `ioping` stats for each server you are testing (some of these 
you may have already posted, but if you can place them inline of this same 
response it would be helpful so we don't need to dig into old emails).

]# blkdiscard /dev/nvme0n1

]# ioping -c10 /dev/nvme0n1 -D -Y -WWW -s512
]# ioping -c10 /dev/nvme0n1 -D -WWW -s512

]# ioping -c10 /dev/nvme0n1 -D -Y -WWW -s4k
]# ioping -c10 /dev/nvme0n1 -D -WWW -s4k

Next, lets rule out backing-device interference by creating a dummy
mapper device that has 128mb of ramdisk for persistent meta storage
(superblock, LVM, etc) but presents as a 1TB volume in size; writes
beyond 128mb are dropped:

	modprobe brd rd_size=$((128*1024))

	]# cat << EOT | dmsetup create zero
	0 262144 linear /dev/ram0 0
	262144 2147483648 zero
	EOT
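
As a sanity check, the new target should report roughly 1 TiB (the 1 TiB
zero target plus the 128 MiB ramdisk):

]# blockdev --getsize64 /dev/mapper/zero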

Then use that as your backing device:

	]# blkdiscard /dev/nvme0n1
	]# make-bcache -w 512 -B /dev/mapper/zero -C /dev/nvme0n1 --writeback

]# ioping -c10 /dev/bcache0 -D -Y -WWW -s512
]# ioping -c10 /dev/bcache0 -D -WWW -s512

]# ioping -c10 /dev/bcache0 -D -Y -WWW -s4k
]# ioping -c10 /dev/bcache0 -D -WWW -s4k

Test again with -w 4096:
	]# blkdiscard /dev/nvme0n1
	]# make-bcache -w 4096 -B /dev/mapper/zero -C /dev/nvme0n1 --writeback

]# ioping -c10 /dev/bcache0 -D -Y -WWW -s4k
]# ioping -c10 /dev/bcache0 -D -WWW -s4k

# These should error with -w 4096 because 512 is too small:

]# ioping -c10 /dev/bcache0 -D -Y -WWW -s512
]# ioping -c10 /dev/bcache0 -D -WWW -s512

> root@pve-20:/# fio --filename=/dev/nvme0n1p2 --direct=1 --fsync=1 --rw=randwrite --bs=512 --numjobs=1 --iodepth=1 --runtime=5 --time_based --group_reporting --name=journal-test --ioengine=libaio
>   write: IOPS=2087, BW=1044KiB/s (1069kB/s)(5220KiB/5001msec); 0 zone resets
>          ^^^^^^^^^ 
> But the same test directly on the hardware with fio passing the
> parameter --bs=4K, the performance completely changes, for the better
> (~130MB/s).
>
> root@pve-20:/# fio --filename=/dev/nvme0n1p2 --direct=1 --fsync=1 --rw=randwrite --bs=4K --numjobs=1 --iodepth=1 --runtime=5 --time_based --group_reporting --name=journal-test --ioengine=libaio
>   write: IOPS=31.9k, BW=124MiB/s (131MB/s)(623MiB/5001msec); 0 zone resets
>          ^^^^^^^^^^
> Does anything justify this difference?

I think you may have discovered the problem and the `ioping`s above
might confirm that.

IOPS are a better metric here, not MB/sec, because smaller IOs always
mean less bandwidth at the same IOPS, and RTT is a factor.  However, the
512-byte IOPS are ~16x lower than the 4k IOPS, when simple bandwidth
scaling (512/4096 = 1/8) would, if anything, predict more IOPS, not
fewer, so something else is going on.

The hardware is probably addressed 4k internally "4Kn" (with even larger 
erase pages that the FTL manages).  Sending it a bunch of 512-byte IOs may 
trigger a read-modify-write operation on the flash controller and is 
(probably) spinning CPU cycles on the flash controller itself. A firmware 
upgrade on the NVMe might help if they have addressed this.

This is speculation, but assuming that internally the flash uses 4k 
sectors, it is doing something like this (pseudo code):

	1. new_data = fetch_from_pcie()
	2. rmw = read_sector(LBA)
	3. memcpy(rmw+offset, new_data, 512)
	4. queue_write_to_flash(rmw, LBA)

> Maybe that's why when I create bcache with the -w=4K option the 
> performance improves. Not as much as I'd like, but it gets better.
> [...] 
> The buckets, I read that it would be better to put the hardware device 
> erase block size. However, I have already tried to find this information 
> by reading the device, also with the manufacturer, but without success. 
> So I have no idea which bucket size would be best, but from my tests, 
> the default of 512KB seems to be adequate.

It might be worth testing power-of-2 bucket sizes to see what works best
for your workload.  Note that `fio --rw=randwrite` may not be
representative of your "real" workload so randwrite could be a good
place to start, but bench your real workload against bucket sizes to see
what works best.
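
Something like this could sweep power-of-2 bucket sizes against the same 
ioping test (rough, untested sketch: it destroys data on /dev/nvme0n1, 
assumes the /dev/mapper/zero backing device from above, and assumes nothing 
auto-registers the devices):

	for b in 128k 256k 512k 1M 2M; do
		blkdiscard /dev/nvme0n1
		wipefs -a /dev/mapper/zero
		make-bcache -b $b -w 4096 -B /dev/mapper/zero -C /dev/nvme0n1 --writeback
		echo /dev/nvme0n1     > /sys/fs/bcache/register 2>/dev/null
		echo /dev/mapper/zero > /sys/fs/bcache/register 2>/dev/null
		udevadm settle
		echo "bucket=$b"
		ioping -c10 /dev/bcache0 -D -Y -WWW -s4k
		echo 1 > /sys/block/bcache0/bcache/stop          # tear down before the next size
		for u in /sys/fs/bcache/*-*/unregister; do echo 1 > "$u"; done
		udevadm settle
	done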

> Eric, perhaps it is not such a simple task to recompile the Kernel with 
> the suggested change. I'm working with Proxmox 6.4. I'm not sure, but I 
> think the Kernel may have some adaptation. It is based on Kernel 5.4, 
> which it is approved for.

Keith and Christoph corrected me; as noted above, this does the same 
thing, so no need to hack on the kernel to change flush behavior:

	echo 'write through' > /sys/block/<DEV>/queue/write_cache

> Also listening to Coly's suggestion, I'll try to perform tests with the 
> Kernel version 5.15 to see if it can solve. Would this version be good 
> enough? It's just that, as I said above, as I'm using Proxmox, I'm 
> afraid to change the Kernel version they provide.

I'm guessing proxmox doesn't care too much about the kernel version as
long as the modules you use are built.  Just copy your existing .config
(usually /boot/config-<version>) as
kernel-source-dir/.config and run `make oldconfig` (or `make menuconfig`
and save+exit, which is what I usually do).
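
On a Debian-based distro like Proxmox that could look roughly like this
(sketch; source directory and package list assumed):

	apt install build-essential bison flex bc libncurses-dev libssl-dev libelf-dev dwarves
	cd linux-5.15.x                       # unpacked kernel source
	cp /boot/config-$(uname -r) .config   # start from the running Proxmox config
	make olddefconfig                     # or `make oldconfig` / `make menuconfig`
	make -j$(nproc) bindeb-pkg            # builds installable .deb packages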

> Eric, to be clear, the hardware I'm using has only 1 processor socket.

Ok, so not a cacheline bounce issue.

> I'm trying to test with another identical computer (the same 
> motherboard, the same processor, the same NVMe, with the difference that 
> it only has 12GB of RAM, the first having 48GB). It is an HP Z400 
> Workstation with an Intel Xeon X5680 sixcore processor (12 threads), 
> DDR3 1333MHz 10600E (old computer).

Is this second server still a single-socket?

> On the second computer, I put a newer version of the distribution that 
> uses Kernel based on version 5.15. I am now comparing the performance of 
> the two computers in the lab.
> 
> On this second computer I had worse performance than the first one 
> (practically half the performance with bcache), despite the performance 
> of the tests done directly in NVME being identical.
> 
> I tried going back to the same OS version on the first computer to try 
> and keep the exact same scenario on both computers so I could first 
> compare the two. I try to keep the exact same software configuration. 
> However, there were no changes. Is it the low RAM that makes the 
> performance worse in the second?
 
The amount of memory isn't an issue, but CPU clock speed or memory speed 
might be.  If server-2 has 2x sockets then make sure NVMe interrupts hit the 
socket where it is attached.  Could be a PCIe version thing, but I 
don't think you are saturating the PCIe link.
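
A quick way to compare the two boxes on that front (sketch; use the PCI
address that sysfs reports for your NVMe):

	]# lspci -vv -s $(basename $(readlink -f /sys/block/nvme0n1/device/device)) | grep -E 'LnkCap|LnkSta'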

> I noticed a difference in behavior on the second computer compared to 
> the first in dstat. While the first computer doesn't seem to touch the 
> backup device at all, the second computer signals something a little 
> different, as although it doesn't write data to the backup disk, it does 
> signal IO movement. Strange no?
> 
> Let's look at the dstat of the first computer:
> 
> --dsk/sdb---dsk/nvme0n1-dsk/bcache0 ---io/sdb----io/nvme0n1--io/bcache0 -net/total- ---load-avg--- --total-cpu-usage-- ---system-- ----system---- async
>  read  writ: read  writ: read  writ| read  writ: read  writ: read  writ| recv  send| 1m   5m  15m |usr sys idl wai stl| int   csw |     time     | #aio
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |6953B 7515B|0.13 0.26 0.26|  0   0  99   0   0| 399   634 |25-05 09:41:42|   0
>    0  8192B:4096B 2328k:   0  1168k|   0  2.00 :1.00   586 :   0   587 |9150B 2724B|0.13 0.26 0.26|  2   2  96   0   0|1093  3267 |25-05 09:41:43|   1B
>    0     0 :   0    58M:   0    29M|   0     0 :   0  14.8k:   0  14.7k|  14k 9282B|0.13 0.26 0.26|  1   3  94   2   0|  16k   67k|25-05 09:41:44|   1B
>    0     0 :   0    58M:   0    29M|   0     0 :   0  14.9k:   0  14.8k|  10k 8992B|0.13 0.26 0.26|  1   3  93   2   0|  16k   69k|25-05 09:41:45|   1B
>    0     0 :   0    58M:   0    29M|   0     0 :   0  14.9k:   0  14.8k|7281B 4651B|0.13 0.26 0.26|  1   3  92   4   0|  16k   67k|25-05 09:41:46|   1B
>    0     0 :   0    59M:   0    30M|   0     0 :   0  15.2k:   0  15.1k|7849B 4729B|0.20 0.28 0.27|  1   4  94   2   0|  16k   69k|25-05 09:41:47|   1B
>    0     0 :   0    57M:   0    28M|   0     0 :   0  14.4k:   0  14.4k|  11k 8584B|0.20 0.28 0.27|  1   3  94   2   0|  15k   65k|25-05 09:41:48|   0
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |4086B 7720B|0.20 0.28 0.27|  0   0 100   0   0| 274   332 |25-05 09:41:49|   0
> 
> Note that on this first computer, the writings and IOs of the backing 
> device (sdb) remain motionless. While NVMe device IOs track bcache0 
> device IOs at ~14.8K
> 
> Let's see the dstat now on the second computer:
> 
> --dsk/sdd---dsk/nvme0n1-dsk/bcache0 ---io/sdd----io/nvme0n1--io/bcache0 -net/total- ---load-avg--- --total-cpu-usage-- ---system-- ----system---- async
>  read  writ: read  writ: read  writ| read  writ: read  writ: read  writ| recv  send| 1m   5m  15m |usr sys idl wai stl| int   csw |     time     | #aio
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |9254B 3301B|0.15 0.19 0.11|  1   2  97   0   0| 360   318 |26-05 06:27:15|   0
>    0  8192B:4096B   19M:   0  9600k|   0  2402 :1.00  4816 :   0  4801 |8826B 3619B|0.15 0.19 0.11|  0   1  98   0   0|8115    27k|26-05 06:27:16|   1B
>    0     0 :   0    21M:   0    11M|   0  2737 :   0  5492 :   0  5474 |4051B 2552B|0.15 0.19 0.11|  0   2  97   1   0|9212    31k|26-05 06:27:17|   1B
>    0     0 :   0    23M:   0    11M|   0  2890 :   0  5801 :   0  5781 |4816B 2492B|0.15 0.19 0.11|  1   2  96   2   0|9976    34k|26-05 06:27:18|   1B
>    0     0 :   0    23M:   0    11M|   0  2935 :   0  5888 :   0  5870 |4450B 2552B|0.22 0.21 0.12|  0   2  96   2   0|9937    33k|26-05 06:27:19|   1B
>    0     0 :   0    22M:   0    11M|   0  2777 :   0  5575 :   0  5553 |8644B 1614B|0.22 0.21 0.12|  0   2  98   0   0|9416    31k|26-05 06:27:20|   1B
>    0     0 :   0  2096k:   0  1040k|   0   260 :   0   523 :   0   519 |  10k 8760B|0.22 0.21 0.12|  0   1  99   0   0|1246  3157 |26-05 06:27:21|   0
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |4083B 2990B|0.22 0.21 0.12|  0   0 100   0   0| 390   369 |26-05 06:27:22|   0
 
> In this case, with exactly the same command, we got a very different 
> result. While writes to the backing device (sdd) do not happen (this is 
> correct), we noticed that IOs occur on both the NVMe device and the 
> backing device (I think this is wrong), but at a much lower rate now, 
> around 5.6K on NVMe and 2.8K on the backing device. It leaves the 
> impression that although it is not writing anything to the sdd device, 
> it is sending some signal to the backing device for every two IO 
> operations performed on the cache device. And that would be delaying 
> the response. Could it be something like this?

I think in newer kernels that bcache is more aggressive at writeback. 
Using /dev/mapper/zero as above will help rule out backing device 
interference.  Also make sure you have the sysfs flags turned to encourage 
it to write to SSD and not bypass:

	echo 0 > /sys/block/bcache0/bcache/sequential_cutoff
	echo 10000000 > /sys/block/bcache0/bcache/cache/congested_read_threshold_us 
	echo 10000000 > /sys/block/bcache0/bcache/cache/congested_write_threshold_us
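
If in doubt whether writes are being bypassed to the backing device, the 
bypass counters should show it (a quick check, assuming the usual bcache 
sysfs layout):

	grep . /sys/block/bcache0/bcache/stats_total/*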

> It is important to point out that the writeback mode is on, obviously, 
> and that the sequential cutoff is at zero, but I tried to put default 
> values or high values and there were no changes. I also tried changing 
> congested_write_threshold_us and congested_read_threshold_us, also with 
> no change in results.

Try this too: 
	echo 300 > /sys/block/bcache0/bcache/writeback_delay

and make sure bcache is in writeback mode (echo writeback > 
/sys/block/bcache0/bcache/cache_mode) in case that was not configured on 
server-2.
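
A quick way to confirm both servers are actually running with the same 
settings before re-testing (same sysfs paths as above; cache_mode prints 
the active mode in brackets):

	cat /sys/block/bcache0/bcache/cache_mode
	cat /sys/block/bcache0/bcache/sequential_cutoff
	cat /sys/block/bcache0/bcache/writeback_delay
	cat /sys/block/bcache0/bcache/cache/congested_read_threshold_us
	cat /sys/block/bcache0/bcache/cache/congested_write_threshold_us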

-Eric

> The only thing I noticed different between the configurations of the two 
> computers was btree_cache_size, which on the first is much larger (7.7M) 
> while on the second it is only 768K. But I don't know if this 
> parameter is configurable or whether it could explain the difference.
> 
> Disabling Intel's Turbo Boost technology through the BIOS appears to 
> have no effect.
> 
> And we will continue our tests comparing the two computers, including 
> testing the two kernel versions. If anyone else has ideas, thanks!


> 
> On Tuesday, May 17, 2022, at 22:23:09 BRT, Eric Wheeler <bcache@lists.ewheeler.net> wrote: 
> 
> On Tue, 10 May 2022, Adriano Silva wrote:
> > I'm trying to set up a flash disk NVMe as a disk cache for two or three 
> > isolated (I will use 2TB disks, but in these tests I used a 1TB one) 
> > spinning disks that I have on a Linux 5.4.174 (Proxmox node).
> 
> Coly has been adding quite a few optimizations over the years.  You might 
> try a new kernel and see if that helps.  More below.
> 
> > I'm using a NVMe (960GB datacenter devices with tantalum capacitors) as 
> > a cache.
> > [...]
> >
> > But when I do the same test on bcache writeback, the performance drops a 
> > lot. Of course, it's better than the performance of spinning disks, but 
> > much worse than when accessed directly from the NVMe device hardware.
> >
> > [...]
> > As we can see, the same test done on the bcache0 device only got 1548 
> > IOPS and that yielded only 6.3 KB/s.
> 
> Well done on the benchmarking!  I always thought our new NVMes performed 
> slower than expected but hadn't gotten around to investigating. 
> 
> > I've noticed in several tests, varying the amount of jobs or increasing 
> > the size of the blocks, that the larger the size of the blocks, the more 
> > I approximate the performance of the physical device to the bcache 
> > device.
> 
> You said "blocks" but did you mean bucket size (make-bcache -b) or block 
> size (make-bcache -w) ?
> 
> If larger buckets make it slower, then that actually surprises me: bigger 
> buckets mean less metadata and better sequential writeback to the 
> spinning disks (though you hadn't yet hit writeback to spinning disks in 
> your stats).  Maybe you already tried, but varying the bucket size might 
> help.  Try graphing bucket size (powers of 2) against IOPS, maybe there is 
> a "sweet spot"?
> 
> Be aware that 4k blocks (so-called "4Kn") are unsafe for the cache device, 
> unless Coly has patched that.  Make sure your `blockdev --getss` reports 
> 512 for your NVMe!
> 
> Hi Coly,
> 
> Some time ago you ordered an SSD to test the 4k cache issue; has that 
> been fixed?  I've kept an eye out for the patch but am not sure if it was released.
> 
> You have a really great test rig setup with NVMes for stress
> testing bcache. Can you replicate Adriano's `ioping` numbers below?
> 
> > With ioping it is also possible to notice a limitation, as the latency 
> > of the bcache0 device is around 1.5ms, while in the case of the raw 
> > device (a partition of NVMe), the same test is only 82.1us.
> > 
> > root@pve-20:~# ioping -c10 /dev/bcache0 -D -Y -WWW -s4k
> > 4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=1 time=1.52 ms (warmup)
> > 4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=2 time=1.60 ms
> > 4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=3 time=1.55 ms
> >
> > root@pve-20:~# ioping -c10 /dev/nvme0n1p2 -D -Y -WWW -s4k
> > 4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=1 time=81.2 us (warmup)
> > 4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=2 time=82.7 us
> > 4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=3 time=82.4 us
> 
> Wow, almost 20x higher latency, sounds convincing that something is wrong.
> 
> A few things to try:
> 
> 1. Try ioping without -Y.  How does it compare?
> 
> 2. Maybe this is an inter-socket latency issue.  Is your server 
>   multi-socket?  If so, then as a first pass you could set the kernel 
>   cmdline `isolcpus` for testing to limit all processes to a single 
>   socket where the NVMe is connected (see `lscpu`).  Check `hwloc-ls`
>   or your motherboard manual to see how the NVMe port is wired to your
>   CPUs.
> 
>   If that helps then fine tune with `numactl -cN ioping` and 
>   /proc/irq/<n>/smp_affinity_list (and `grep nvme /proc/interrupts`) to 
>   make sure your NVMe's are locked to IRQs on the same socket.
> 
> 3a. sysfs:
> 
> > # echo 0 > /sys/block/bcache0/bcache/sequential_cutoff
> 
> good.
> 
> > # echo 0 > /sys/fs/bcache/<cache set>/congested_read_threshold_us
> > # echo 0 > /sys/fs/bcache/<cache set>/congested_write_threshold_us
> 
> Also try these (I think bcache/cache is a symlink to /sys/fs/bcache/<cache set>)
> 
> echo 10000000 > /sys/block/bcache0/bcache/cache/congested_read_threshold_us 
> echo 10000000 > /sys/block/bcache0/bcache/cache/congested_write_threshold_us
> 
> 
> Try tuning journal_delay_ms: 
>   /sys/fs/bcache/<cset-uuid>/journal_delay_ms
>     Journal writes will delay for up to this many milliseconds, unless a 
>     cache flush happens sooner. Defaults to 100.
> 
> 3b: Hacking bcache code:
> 
> I just noticed that journal_delay_ms says "unless a cache flush happens 
> sooner" but cache flushes can be re-ordered so flushing the journal when 
> REQ_OP_FLUSH comes through may not be useful, especially if there is a 
> high volume of flushes coming down the pipe because the flushes could kill 
> the NVMe's cache---and maybe the 1.5ms ping is actual flash latency.  It
> would flush data and journal.
> 
> Maybe there should be a cachedev_noflush sysfs option for those with some 
> kind of power-loss protection on their SSDs.  It looks like this is 
> handled in request.c when these functions call bch_journal_meta():
> 
>     1053: static void cached_dev_nodata(struct closure *cl)
>     1263: static void flash_dev_nodata(struct closure *cl)
> 
> Coly can you comment about journal flush semantics with respect to 
> performance vs correctness and crash safety?
> 
> Adriano, as a test, you could change this line in search_alloc() in 
> request.c:
> 
>     - s->iop.flush_journal    = op_is_flush(bio->bi_opf);
>     + s->iop.flush_journal    = 0;
> 
> and see how performance changes.
> 
> Someone correct me if I'm wrong, but I don't think flush_journal=0 will 
> affect correctness unless there is a crash.  If that /is/ the performance 
> problem then it would narrow the scope of this discussion.
> 
> 4. I wonder if your 1.5ms `ioping` stats scale with CPU clock speed: can 
>   you set your CPU governor to run at full clock speed and then slowest 
>   clock speed to see if it is a CPU limit somewhere as we expect?
> 
>   You can do `grep MHz /proc/cpuinfo` to see the active rate to make sure 
>   the governor did its job.  
> 
>   If it scales with CPU then something in bcache is working too hard.  
>   Maybe garbage collection?  Other devs would need to chime in here to 
>   steer the troubleshooting if that is the case.
> 
> 
> 5. I'm not sure if garbage collection is the issue, but you might try 
>   Mingzhe's dynamic incremental gc patch:
>     https://www.spinics.net/lists/linux-bcache/msg11185.html
> 
> 6. Try dm-cache and see if its IO latency is similar to bcache: If it is 
>   about the same then that would indicate an issue in the block layer 
>   somewhere outside of bcache.  If dm-cache is better, then that confirms 
>   a bcache issue.
> 
> 
> > The cache was configured directly on one of the NVMe partitions (in this 
> > case, the first partition). I did several tests using fio and ioping, 
> > testing on a partition on the NVMe device, without partition and 
> > directly on the raw block, on a first partition, on the second, with or 
> > without configuring bcache. I did all this to remove any doubt as to the 
> > method. The results of tests performed directly on the hardware device, 
> > without going through bcache are always fast and similar.
> > 
> > But tests in bcache are always slower. If you use writethrough, of 
> > course, it gets much worse, because the performance is equal to the raw 
> > spinning disk.
> > 
> > Using writeback improves a lot, but still doesn't use the full speed of 
> > NVMe (honestly, much less than full speed).
> 
> Indeed, I hope this can be fixed!  A 20x improvement in bcache would 
> be awesome.
> 
> > But I've also noticed that there is a limit on writing sequential data, 
> > which is a little more than half of the maximum write rate shown in 
> > direct tests by the NVMe device.
> 
> For sync, async, or both?
> 
> 
> > Processing doesn't seem to be going up like the tests.
> 
> 
> What do you mean "processing" ?
> 
> -Eric
> 
> 
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Bcache in writes direct with fsync. Are IOPS limited?
  2022-05-26 20:28       ` Eric Wheeler
@ 2022-05-27  4:07         ` Adriano Silva
  2022-05-28  1:27           ` Eric Wheeler
  0 siblings, 1 reply; 37+ messages in thread
From: Adriano Silva @ 2022-05-27  4:07 UTC (permalink / raw)
  To: Eric Wheeler, Coly Li, Bcache Linux, Matthias Ferdinand

> Please confirm that this says "write back":

> ]# cat /sys/block/nvme0n1/queue/write_cache

No, this says "write through"


> ]# for i in /sys/block/*/queue/write_cache; do echo 'write back' > $i; done
Done!

I can say that after setting 'write back' on all devices, the performance of the direct tests on the NVMe hardware gets much worse. You can see this below.

What I noticed during the test was that, right after running blkdiscard on the first server, the command took a long time, I think more than 1 minute, to return. After it returned, the ioping latency was much higher than what I am used to. So I powered the server off and on again to discard and test once more, and I noticed that it improved, as I show below.

From my understanding of the tests, it is clear that the performance of direct writes to the NVMe hardware on the two servers is very similar, perhaps exactly the same. Also on the NVMe, when writing 512 bytes at a time, the latency starts well but gets worse after a few write operations; this doesn't happen when writing 4K, which always performs better.

In all scenarios, when 'write back' is set in /sys/block/nvme0n1/queue/write_cache, performance is severely degraded.

Also in all scenarios, when synchronization is required (parameter -Y), the performance is slightly worse.

But between the servers, there is no difference in bcache when the backing device is in RAM.

>I think in newer kernels that bcache is more aggressive at writeback.
>Using /dev/mapper/zero as above will help rule out backing device
>interference.  Also make sure you have the sysfs flags turned to encourage
>it to write to SSD and not bypass

I actually went back to the previous kernel version (5.4) after I noticed that the newer one did not improve performance. Today, both servers are on version 5.4.


Just below is the result right after the blkdiscard that took a long time.

=========
On the first server

root@pve-20:~# cat /sys/block/nvme0n1/queue/write_cache
write through
root@pve-20:~# blkdiscard /dev/nvme0n1
root@pve-20:~# ioping -c10 /dev/nvme0n1 -D -Y -WWW -s512
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=544.6 us (warmup)
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=388.1 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=1.44 ms
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=656.8 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=1.71 ms
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=1.83 ms
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=702.2 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=582.1 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=1.15 ms
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=1.07 ms

--- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
9 requests completed in 9.54 ms, 4.50 KiB written, 943 iops, 471.9 KiB/s
generated 10 requests in 9.00 s, 5 KiB, 1 iops, 568 B/s
min/avg/max/mdev = 388.1 us / 1.06 ms / 1.83 ms / 487.4 us
root@pve-20:~#
root@pve-20:~# ioping -c10 /dev/nvme0n1 -D -WWW -s512
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=1.28 ms (warmup)
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=678.8 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=725.3 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=1.25 ms
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=794.1 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=493.1 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=1.10 ms
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=1.06 ms
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=971.8 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=1.11 ms

--- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
9 requests completed in 8.19 ms, 4.50 KiB written, 1.10 k iops, 549.2 KiB/s
generated 10 requests in 9.00 s, 5 KiB, 1 iops, 568 B/s
min/avg/max/mdev = 493.1 us / 910.3 us / 1.25 ms / 235.1 us
root@pve-20:~#
root@pve-20:~# ioping -c10 /dev/nvme0n1 -D -Y -WWW -s4K
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=471.0 us (warmup)
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=1.06 ms
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=1.17 ms
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=1.29 ms
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=830.5 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=1.31 ms
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=1.40 ms (slow)
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=195.0 us (fast)
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=841.2 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=1.22 ms

--- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
9 requests completed in 9.32 ms, 36 KiB written, 965 iops, 3.77 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 195.0 us / 1.04 ms / 1.40 ms / 352.0 us
root@pve-20:~#
root@pve-20:~# ioping -c10 /dev/nvme0n1 -D -WWW -s4K
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=645.2 us (warmup)
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=1.20 ms
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=1.41 ms
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=1.39 ms
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=978.4 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=75.8 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=68.6 us (fast)
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=74.0 us (fast)
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=73.7 us (fast)
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=67.0 us (fast)

--- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
9 requests completed in 5.34 ms, 36 KiB written, 1.68 k iops, 6.58 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 67.0 us / 593.7 us / 1.41 ms / 595.1 us
root@pve-20:~#

==========
Here, below, are the results after I power-cycled the first server and tested again:

root@pve-20:~# blkdiscard /dev/nvme0n1
root@pve-20:~# ioping -c10 /dev/nvme0n1 -D -Y -WWW -s512
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=68.4 us (warmup)
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=76.5 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=67.0 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=60.1 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=463.9 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=471.4 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=505.1 us (slow)
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=501.0 us (slow)
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=486.3 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=520.4 us (slow)

--- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
9 requests completed in 3.15 ms, 4.50 KiB written, 2.85 k iops, 1.39 MiB/s
generated 10 requests in 9.00 s, 5 KiB, 1 iops, 568 B/s
min/avg/max/mdev = 60.1 us / 350.2 us / 520.4 us / 200.3 us
root@pve-20:~# ioping -c10 /dev/nvme0n1 -D -WWW -s512
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=460.8 us (warmup)
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=507.5 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=514.9 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=505.8 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=500.3 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=503.3 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=506.9 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=499.4 us (fast)
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=500.1 us (fast)
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=502.4 us

--- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
9 requests completed in 4.54 ms, 4.50 KiB written, 1.98 k iops, 991.0 KiB/s
generated 10 requests in 9.00 s, 5 KiB, 1 iops, 568 B/s
min/avg/max/mdev = 499.4 us / 504.5 us / 514.9 us / 4.64 us
root@pve-20:~#
root@pve-20:~# ioping -c10 /dev/nvme0n1 -D -Y -WWW -s4K
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=56.7 us (warmup)
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=81.7 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=60.0 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=78.0 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=75.1 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=79.7 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=91.2 us (slow)
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=76.6 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=79.0 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=87.1 us

--- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
9 requests completed in 708.4 us, 36 KiB written, 12.7 k iops, 49.6 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 60.0 us / 78.7 us / 91.2 us / 8.20 us
root@pve-20:~#
root@pve-20:~# ioping -c10 /dev/nvme0n1 -D -WWW -s4K
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=86.6 us (warmup)
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=72.7 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=60.5 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=70.5 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=72.7 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=60.2 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=83.5 us (slow)
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=60.4 us (fast)
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=86.0 us (slow)
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=61.2 us (fast)

--- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
9 requests completed in 627.7 us, 36 KiB written, 14.3 k iops, 56.0 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 60.2 us / 69.7 us / 86.0 us / 9.49 us
root@pve-20:~#

======= 
On the second server...
Here, blkdiscard didn't take long, and the first result was the one below:

root@pve-21:~# cat /sys/block/nvme0n1/queue/write_cache
write through
root@pve-21:~# blkdiscard /dev/nvme0n1
root@pve-21:~# ioping -c10 /dev/nvme0n1 -D -Y -WWW -s512
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=60.7 us (warmup)
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=71.9 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=77.4 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=61.2 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=468.2 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=497.0 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=491.8 us (slow)
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=490.6 us (slow)
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=494.4 us (slow)
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=493.9 us (slow)

--- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
9 requests completed in 3.15 ms, 4.50 KiB written, 2.86 k iops, 1.40 MiB/s
generated 10 requests in 9.00 s, 5 KiB, 1 iops, 568 B/s
min/avg/max/mdev = 61.2 us / 349.6 us / 497.0 us / 197.8 us
root@pve-21:~# ioping -c10 /dev/nvme0n1 -D -WWW -s512
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=494.5 us (warmup)
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=490.6 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=490.3 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=489.8 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=492.3 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=488.1 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=496.0 us (slow)
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=492.1 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=493.0 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=508.0 us (slow)

--- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
9 requests completed in 4.44 ms, 4.50 KiB written, 2.03 k iops, 1013.5 KiB/s
generated 10 requests in 9.00 s, 5 KiB, 1 iops, 568 B/s
min/avg/max/mdev = 488.1 us / 493.3 us / 508.0 us / 5.60 us
root@pve-21:~# ioping -c10 /dev/nvme0n1 -D -Y -WWW -s4K
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=84.9 us (warmup)
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=75.7 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=76.5 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=76.0 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=77.6 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=78.8 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=84.2 us (slow)
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=85.0 us (slow)
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=79.3 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=97.1 us (slow)

--- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
9 requests completed in 730.1 us, 36 KiB written, 12.3 k iops, 48.1 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 75.7 us / 81.1 us / 97.1 us / 6.48 us
root@pve-21:~# ioping -c10 /dev/nvme0n1 -D -WWW -s4K
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=80.8 us (warmup)
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=77.7 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=70.9 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=69.1 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=72.0 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=68.3 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=71.7 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=86.7 us (slow)
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=93.2 us (slow)
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=64.8 us (fast)

--- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
9 requests completed in 674.3 us, 36 KiB written, 13.3 k iops, 52.1 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 64.8 us / 74.9 us / 93.2 us / 8.79 us
root@pve-21:~#

========== 
After switching to write back and then going back to write through.
On the first server...

root@pve-20:~# for i in /sys/block/*/queue/write_cache; do echo 'write back' > $i; done
root@pve-20:~# ioping -c10 /dev/nvme0n1 -D -Y -WWW -s512
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=2.31 ms (warmup)
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=2.37 ms
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=2.40 ms
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=2.45 ms
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=2.57 ms
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=2.46 ms
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=2.57 ms (slow)
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=2.56 ms
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=2.38 ms (fast)
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=2.48 ms

--- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
9 requests completed in 22.2 ms, 4.50 KiB written, 404 iops, 202.4 KiB/s
generated 10 requests in 9.00 s, 5 KiB, 1 iops, 568 B/s
min/avg/max/mdev = 2.37 ms / 2.47 ms / 2.57 ms / 75.2 us
root@pve-20:~#
root@pve-20:~# ioping -c10 /dev/nvme0n1 -D -WWW -s512
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=1.16 ms (warmup)
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=1.15 ms
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=1.14 ms
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=1.15 ms
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=1.17 ms
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=1.15 ms
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=1.13 ms (fast)
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=1.14 ms
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=1.22 ms (slow)
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=1.20 ms

--- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
9 requests completed in 10.5 ms, 4.50 KiB written, 860 iops, 430.1 KiB/s
generated 10 requests in 9.00 s, 5 KiB, 1 iops, 568 B/s
min/avg/max/mdev = 1.13 ms / 1.16 ms / 1.22 ms / 27.6 us
root@pve-20:~#
root@pve-20:~# ioping -c10 /dev/nvme0n1 -D -Y -WWW -s4K
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=2.03 ms (warmup)
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=2.04 ms
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=2.07 ms
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=2.07 ms
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=2.05 ms
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=2.02 ms
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=2.05 ms
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=2.09 ms (slow)
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=2.04 ms
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=1.99 ms (fast)

--- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
9 requests completed in 18.4 ms, 36 KiB written, 489 iops, 1.91 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 1.99 ms / 2.04 ms / 2.09 ms / 29.0 us
root@pve-20:~#
root@pve-20:~# ioping -c10 /dev/nvme0n1 -D -WWW -s4K
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=703.4 us (warmup)
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=725.1 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=724.8 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=705.7 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=733.1 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=697.6 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=690.2 us (fast)
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=688.4 us (fast)
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=689.5 us (fast)
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=671.7 us (fast)

--- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
9 requests completed in 6.33 ms, 36 KiB written, 1.42 k iops, 5.56 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 671.7 us / 702.9 us / 733.1 us / 19.6 us
root@pve-20:~#
root@pve-20:~# for i in /sys/block/*/queue/write_cache; do echo 'write through' > $i; done
root@pve-20:~# ioping -c10 /dev/nvme0n1 -D -WWW -s4K
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=82.6 us (warmup)
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=89.3 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=61.7 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=74.0 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=89.4 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=62.5 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=74.1 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=81.3 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=78.1 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=84.3 us

--- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
9 requests completed in 694.9 us, 36 KiB written, 13.0 k iops, 50.6 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 61.7 us / 77.2 us / 89.4 us / 9.67 us
root@pve-20:~#

=============
On the second server...

root@pve-21:~# for i in /sys/block/*/queue/write_cache; do echo 'write back' > $i; done
root@pve-21:~# blkdiscard /dev/nvme0n1
root@pve-21:~# ioping -c10 /dev/nvme0n1 -D -Y -WWW -s512
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=1.83 ms (warmup)
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=2.39 ms
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=2.40 ms
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=2.21 ms
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=2.44 ms
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=2.34 ms
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=2.34 ms
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=2.42 ms
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=2.22 ms (fast)
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=2.20 ms (fast)

--- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
9 requests completed in 21.0 ms, 4.50 KiB written, 429 iops, 214.7 KiB/s
generated 10 requests in 9.00 s, 5 KiB, 1 iops, 568 B/s
min/avg/max/mdev = 2.20 ms / 2.33 ms / 2.44 ms / 88.9 us
root@pve-21:~# ioping -c10 /dev/nvme0n1 -D -WWW -s512
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=1.12 ms (warmup)
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=663.6 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=1.12 ms
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=1.11 ms
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=1.11 ms
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=1.16 ms
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=1.18 ms (slow)
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=1.11 ms
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=1.16 ms
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=1.17 ms (slow)

--- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
9 requests completed in 9.78 ms, 4.50 KiB written, 920 iops, 460.2 KiB/s
generated 10 requests in 9.00 s, 5 KiB, 1 iops, 568 B/s
min/avg/max/mdev = 663.6 us / 1.09 ms / 1.18 ms / 151.9 us
root@pve-21:~# ioping -c10 /dev/nvme0n1 -D -Y -WWW -s4K
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=1.85 ms (warmup)
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=1.81 ms
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=1.82 ms
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=1.82 ms
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=2.01 ms
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=1.99 ms
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=1.98 ms
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=1.95 ms
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=1.83 ms
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=1.82 ms (fast)

--- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
9 requests completed in 17.0 ms, 36 KiB written, 528 iops, 2.06 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 1.81 ms / 1.89 ms / 2.01 ms / 82.3 us
root@pve-21:~# ioping -c10 /dev/nvme0n1 -D -WWW -s4K
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=673.1 us (warmup)
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=667.1 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=688.2 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=653.1 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=661.5 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=663.3 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=698.0 us (slow)
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=663.7 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=708.6 us (slow)
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=677.2 us

--- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
9 requests completed in 6.08 ms, 36 KiB written, 1.48 k iops, 5.78 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 653.1 us / 675.6 us / 708.6 us / 17.7 us
root@pve-21:~#
root@pve-21:~# for i in /sys/block/*/queue/write_cache; do echo 'write through' > $i; done
root@pve-21:~# ioping -c10 /dev/nvme0n1 -D -WWW -s4K
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=85.3 us (warmup)
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=79.8 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=74.7 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=85.6 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=66.8 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=92.2 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=73.5 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=65.0 us (fast)
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=73.0 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=73.2 us

--- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
9 requests completed in 683.7 us, 36 KiB written, 13.2 k iops, 51.4 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 65.0 us / 76.0 us / 92.2 us / 8.17 us
root@pve-21:~#

As you can see from the tests, write performance on the NVMe hardware is horrible when /sys/block/*/queue/write_cache is set to 'write back'.
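
For reference, a one-liner to list the current setting of every device before 
and after switching modes (same sysfs files as in the loop above):

	for i in /sys/block/*/queue/write_cache; do echo "$i: $(cat $i)"; done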


=====
Let's go...

root@pve-21:~# modprobe brd rd_size=$((128*1024))
root@pve-21:~# cat << EOT | dmsetup create zero
    0 262144 linear /dev/ram0 0
    262144 2147483648 zero
> EOT
root@pve-21:~# blkdiscard /dev/nvme0n1
root@pve-21:~# make-bcache -w 512 -B /dev/mapper/zero -C /dev/nvme0n1 --writeback
UUID:            563eaa85-43e9-491b-8c1f-f1b94a8f97c8
Set UUID:        0dcec849-9ee9-41a9-b220-b1923e93cdb1
version:        0
nbuckets:        1831430
block_size:        1
bucket_size:        1024
nr_in_set:        1
nr_this_dev:        0
first_bucket:        1
UUID:            acdd0f18-4198-43dd-847a-087058d80c25
Set UUID:        0dcec849-9ee9-41a9-b220-b1923e93cdb1
version:        1
block_size:        1
data_offset:        16
root@pve-21:~#
root@pve-21:~# ioping -c10 /dev/bcache0 -D -Y -WWW -s512
512 B >>> /dev/bcache0 (block device 1.00 TiB): request=1 time=3.04 ms (warmup)
512 B >>> /dev/bcache0 (block device 1.00 TiB): request=2 time=1.98 ms
512 B >>> /dev/bcache0 (block device 1.00 TiB): request=3 time=1.88 ms
512 B >>> /dev/bcache0 (block device 1.00 TiB): request=4 time=1.95 ms
512 B >>> /dev/bcache0 (block device 1.00 TiB): request=5 time=1.78 ms
512 B >>> /dev/bcache0 (block device 1.00 TiB): request=6 time=1.92 ms
512 B >>> /dev/bcache0 (block device 1.00 TiB): request=7 time=1.87 ms
512 B >>> /dev/bcache0 (block device 1.00 TiB): request=8 time=1.87 ms
512 B >>> /dev/bcache0 (block device 1.00 TiB): request=9 time=1.87 ms
512 B >>> /dev/bcache0 (block device 1.00 TiB): request=10 time=1.83 ms

--- /dev/bcache0 (block device 1.00 TiB) ioping statistics ---
9 requests completed in 17.0 ms, 4.50 KiB written, 530 iops, 265.2 KiB/s
generated 10 requests in 9.00 s, 5 KiB, 1 iops, 568 B/s
min/avg/max/mdev = 1.78 ms / 1.89 ms / 1.98 ms / 57.4 us
root@pve-21:~#
root@pve-21:~# ioping -c10 /dev/bcache0 -D -WWW -s512
512 B >>> /dev/bcache0 (block device 1.00 TiB): request=1 time=1.12 ms (warmup)
512 B >>> /dev/bcache0 (block device 1.00 TiB): request=2 time=1.01 ms
512 B >>> /dev/bcache0 (block device 1.00 TiB): request=3 time=1.00 ms
512 B >>> /dev/bcache0 (block device 1.00 TiB): request=4 time=1.05 ms
512 B >>> /dev/bcache0 (block device 1.00 TiB): request=5 time=1.04 ms
512 B >>> /dev/bcache0 (block device 1.00 TiB): request=6 time=1.04 ms
512 B >>> /dev/bcache0 (block device 1.00 TiB): request=7 time=996.5 us (fast)
512 B >>> /dev/bcache0 (block device 1.00 TiB): request=8 time=1.01 ms
512 B >>> /dev/bcache0 (block device 1.00 TiB): request=9 time=994.3 us (fast)
512 B >>> /dev/bcache0 (block device 1.00 TiB): request=10 time=976.5 us (fast)

--- /dev/bcache0 (block device 1.00 TiB) ioping statistics ---
9 requests completed in 9.11 ms, 4.50 KiB written, 987 iops, 493.9 KiB/s
generated 10 requests in 9.00 s, 5 KiB, 1 iops, 568 B/s
min/avg/max/mdev = 976.5 us / 1.01 ms / 1.05 ms / 22.5 us
root@pve-21:~#
root@pve-21:~# ioping -c10 /dev/bcache0 -D -Y -WWW -s4K
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=1 time=1.43 ms (warmup)
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=2 time=1.39 ms
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=3 time=1.38 ms
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=4 time=1.40 ms
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=5 time=1.40 ms
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=6 time=1.43 ms
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=7 time=1.39 ms
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=8 time=1.42 ms
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=9 time=1.41 ms
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=10 time=1.40 ms

--- /dev/bcache0 (block device 1.00 TiB) ioping statistics ---
9 requests completed in 12.6 ms, 36 KiB written, 713 iops, 2.79 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 1.38 ms / 1.40 ms / 1.43 ms / 13.1 us
root@pve-21:~# ioping -c10 /dev/bcache0 -D -WWW -s4K
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=1 time=676.0 us (warmup)
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=2 time=638.0 us
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=3 time=659.5 us
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=4 time=650.2 us
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=5 time=644.0 us
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=6 time=644.4 us
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=7 time=652.1 us
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=8 time=641.8 us
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=9 time=658.0 us
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=10 time=642.7 us

--- /dev/bcache0 (block device 1.00 TiB) ioping statistics ---
9 requests completed in 5.83 ms, 36 KiB written, 1.54 k iops, 6.03 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 638.0 us / 647.9 us / 659.5 us / 7.06 us
root@pve-21:~#

=========
Now, bcache -w 4k

root@pve-21:~# blkdiscard /dev/nvme0n1
root@pve-21:~# make-bcache -w 4096 -B /dev/mapper/zero -C /dev/nvme0n1 --writeback
UUID:            c955591f-21af-467d-b26a-5ff567af2001
Set UUID:        8c477796-88ab-4b20-990a-cef8b2df040a
version:        0
nbuckets:        1831430
block_size:        8
bucket_size:        1024
nr_in_set:        1
nr_this_dev:        0
first_bucket:        1
UUID:            ea89b843-a019-4464-8da5-377ba44f0e6b
Set UUID:        8c477796-88ab-4b20-990a-cef8b2df040a
version:        1
block_size:        8
data_offset:        16
root@pve-21:~#
root@pve-21:~# ioping -c10 /dev/bcache0 -D -Y -WWW -s512
ioping: request failed: Invalid argument
root@pve-21:~# ioping -c10 /dev/bcache0 -D -Y -WWW -s4K
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=1 time=2.93 ms (warmup)
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=2 time=313.2 us
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=3 time=274.1 us
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=4 time=296.4 us
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=5 time=247.2 us
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=6 time=227.4 us
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=7 time=224.6 us (fast)
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=8 time=253.8 us
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=9 time=235.3 us
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=10 time=197.6 us (fast)

--- /dev/bcache0 (block device 1.00 TiB) ioping statistics ---
9 requests completed in 2.27 ms, 36 KiB written, 3.96 k iops, 15.5 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 197.6 us / 252.2 us / 313.2 us / 34.7 us
root@pve-21:~# ioping -c10 /dev/bcache0 -D -WWW -s4K
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=1 time=262.8 us (warmup)
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=2 time=255.9 us
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=3 time=239.9 us
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=4 time=228.8 us
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=5 time=252.3 us
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=6 time=237.1 us
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=7 time=237.5 us
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=8 time=232.3 us
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=9 time=243.3 us
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=10 time=232.7 us

--- /dev/bcache0 (block device 1.00 TiB) ioping statistics ---
9 requests completed in 2.16 ms, 36 KiB written, 4.17 k iops, 16.3 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 228.8 us / 240.0 us / 255.9 us / 8.62 us


=========
On the first server

root@pve-20:~# modprobe brd rd_size=$((128*1024))
root@pve-20:~# cat << EOT | dmsetup create zero
>     0 262144 linear /dev/ram0 0
>     262144 2147483648 zero
> EOT
root@pve-20:~# blkdiscard /dev/nvme0n1
root@pve-20:~# make-bcache -w 512 -B /dev/mapper/zero -C /dev/nvme0n1 --writeback
UUID:            f82f76a1-8f41-4a0a-9213-f4632fa372a4
Set UUID:        d6ba5557-3055-4151-bd91-05db6a668ba7
version:        0
nbuckets:        1831430
block_size:        1
bucket_size:        1024
nr_in_set:        1
nr_this_dev:        0
first_bucket:        1
UUID:            5c3d1795-c484-4611-881f-bc991642aa76
Set UUID:        d6ba5557-3055-4151-bd91-05db6a668ba7
version:        1
block_size:        1
data_offset:        16
root@pve-20:~# ls /sys/fs/bcache/
d6ba5557-3055-4151-bd91-05db6a668ba7/ register
pendings_cleanup                      register_quiet
root@pve-20:~# ioping -c10 /dev/bcache0 -D -Y -WWW -s512
512 B >>> /dev/bcache0 (block device 1.00 TiB): request=1 time=3.05 ms (warmup)
512 B >>> /dev/bcache0 (block device 1.00 TiB): request=2 time=1.98 ms
512 B >>> /dev/bcache0 (block device 1.00 TiB): request=3 time=1.99 ms
512 B >>> /dev/bcache0 (block device 1.00 TiB): request=4 time=1.94 ms
512 B >>> /dev/bcache0 (block device 1.00 TiB): request=5 time=1.88 ms
512 B >>> /dev/bcache0 (block device 1.00 TiB): request=6 time=1.77 ms
512 B >>> /dev/bcache0 (block device 1.00 TiB): request=7 time=1.82 ms
512 B >>> /dev/bcache0 (block device 1.00 TiB): request=8 time=1.86 ms
512 B >>> /dev/bcache0 (block device 1.00 TiB): request=9 time=1.99 ms (slow)
512 B >>> /dev/bcache0 (block device 1.00 TiB): request=10 time=1.82 ms

--- /dev/bcache0 (block device 1.00 TiB) ioping statistics ---
9 requests completed in 17.1 ms, 4.50 KiB written, 527 iops, 263.9 KiB/s
generated 10 requests in 9.00 s, 5 KiB, 1 iops, 568 B/s
min/avg/max/mdev = 1.77 ms / 1.89 ms / 1.99 ms / 76.6 us
root@pve-20:~# ioping -c10 /dev/bcache0 -D -WWW -s512
512 B >>> /dev/bcache0 (block device 1.00 TiB): request=1 time=1.05 ms (warmup)
512 B >>> /dev/bcache0 (block device 1.00 TiB): request=2 time=1.07 ms
512 B >>> /dev/bcache0 (block device 1.00 TiB): request=3 time=1.04 ms
512 B >>> /dev/bcache0 (block device 1.00 TiB): request=4 time=1.01 ms
512 B >>> /dev/bcache0 (block device 1.00 TiB): request=5 time=1.11 ms
512 B >>> /dev/bcache0 (block device 1.00 TiB): request=6 time=1.06 ms
512 B >>> /dev/bcache0 (block device 1.00 TiB): request=7 time=1.03 ms
512 B >>> /dev/bcache0 (block device 1.00 TiB): request=8 time=1.06 ms
512 B >>> /dev/bcache0 (block device 1.00 TiB): request=9 time=1.04 ms
512 B >>> /dev/bcache0 (block device 1.00 TiB): request=10 time=1.06 ms

--- /dev/bcache0 (block device 1.00 TiB) ioping statistics ---
9 requests completed in 9.49 ms, 4.50 KiB written, 948 iops, 474.1 KiB/s
generated 10 requests in 9.00 s, 5 KiB, 1 iops, 568 B/s
min/avg/max/mdev = 1.01 ms / 1.05 ms / 1.11 ms / 26.3 us
root@pve-20:~# ioping -c10 /dev/bcache0 -D -Y -WWW -s4K
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=1 time=1.47 ms (warmup)
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=2 time=1.57 ms
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=3 time=1.57 ms
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=4 time=1.52 ms
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=5 time=1.11 ms
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=6 time=1.02 ms
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=7 time=1.03 ms (fast)
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=8 time=1.04 ms (fast)
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=9 time=1.45 ms
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=10 time=1.45 ms

--- /dev/bcache0 (block device 1.00 TiB) ioping statistics ---
9 requests completed in 11.8 ms, 36 KiB written, 765 iops, 2.99 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 1.02 ms / 1.31 ms / 1.57 ms / 232.7 us
root@pve-20:~# ioping -c10 /dev/bcache0 -D -WWW -s4K
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=1 time=249.7 us (warmup)
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=2 time=671.3 us
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=3 time=663.0 us
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=4 time=655.3 us
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=5 time=664.0 us
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=6 time=693.7 us
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=7 time=610.5 us (fast)
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=8 time=217.8 us (fast)
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=9 time=223.0 us (fast)
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=10 time=219.7 us (fast)

--- /dev/bcache0 (block device 1.00 TiB) ioping statistics ---
9 requests completed in 4.62 ms, 36 KiB written, 1.95 k iops, 7.61 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 217.8 us / 513.1 us / 693.7 us / 208.2 us
root@pve-20:~#

root@pve-20:~# blkdiscard /dev/nvme0n1
root@pve-20:~# make-bcache -w 4096 -B /dev/mapper/zero -C /dev/nvme0n1 --writeback
UUID:            c0252cdb-6a3b-43c1-8c86-3f679dd61d06
Set UUID:        e56ca07c-4b1a-4bea-8bd4-2cabb60cb4f0
version:        0
nbuckets:        1831430
block_size:        8
bucket_size:        1024
nr_in_set:        1
nr_this_dev:        0
first_bucket:        1
UUID:            2c501dde-dd04-4294-9e35-8f3b57fdd75d
Set UUID:        e56ca07c-4b1a-4bea-8bd4-2cabb60cb4f0
version:        1
block_size:        8
data_offset:        16
root@pve-20:~# ioping -c10 /dev/bcache0 -D -Y -WWW -s512
ioping: request failed: Invalid argument
root@pve-20:~# ioping -c10 /dev/bcache0 -D -WWW -s512
ioping: request failed: Invalid argument
root@pve-20:~# ioping -c10 /dev/bcache0 -D -Y -WWW -s4K
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=1 time=2.91 ms (warmup)
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=2 time=227.9 us
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=3 time=353.8 us
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=4 time=193.2 us
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=5 time=189.0 us
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=6 time=340.3 us
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=7 time=259.8 us
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=8 time=254.9 us
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=9 time=285.3 us
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=10 time=282.7 us

--- /dev/bcache0 (block device 1.00 TiB) ioping statistics ---
9 requests completed in 2.39 ms, 36 KiB written, 3.77 k iops, 14.7 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 189.0 us / 265.2 us / 353.8 us / 54.5 us
root@pve-20:~# ioping -c10 /dev/bcache0 -D -WWW -s4K
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=1 time=276.3 us (warmup)
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=2 time=224.6 us
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=3 time=226.8 us
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=4 time=240.1 us
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=5 time=237.4 us
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=6 time=231.6 us
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=7 time=238.1 us
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=8 time=199.1 us (fast)
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=9 time=240.4 us (slow)
4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=10 time=280.5 us (slow)

--- /dev/bcache0 (block device 1.00 TiB) ioping statistics ---
9 requests completed in 2.12 ms, 36 KiB written, 4.25 k iops, 16.6 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 199.1 us / 235.4 us / 280.5 us / 20.0 us
root@pve-20:~#


In addition to what was requested, I decided to add these results to help:

root@pve-20:~# ioping -c5 /dev/mapper/zero -D -Y -WWW -s512
512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=1 time=14.4 us (warmup)
512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=2 time=19.9 us
512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=3 time=23.4 us
512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=4 time=17.5 us
512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=5 time=19.4 us

--- /dev/mapper/zero (block device 1.00 TiB) ioping statistics ---
4 requests completed in 80.2 us, 2 KiB written, 49.9 k iops, 24.4 MiB/s
generated 5 requests in 4.00 s, 2.50 KiB, 1 iops, 639 B/s

min/avg/max/mdev = 17.5 us / 20.0 us / 23.4 us / 2.12 us

root@pve-20:~# ioping -c5 /dev/mapper/zero -D -WWW -s512
512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=1 time=14.4 us (warmup)
512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=2 time=13.0 us
512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=3 time=18.8 us
512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=4 time=17.4 us
512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=5 time=18.8 us

--- /dev/mapper/zero (block device 1.00 TiB) ioping statistics ---
4 requests completed in 67.9 us, 2 KiB written, 58.9 k iops, 28.8 MiB/s
generated 5 requests in 4.00 s, 2.50 KiB, 1 iops, 639 B/s
min/avg/max/mdev = 13.0 us / 17.0 us / 18.8 us / 2.38 us
root@pve-20:~# ioping -c5 /dev/mapper/zero -D -Y -WWW -s4K
4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=1 time=24.6 us (warmup)
4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=2 time=27.2 us
4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=3 time=21.1 us
4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=4 time=17.0 us
4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=5 time=22.8 us

--- /dev/mapper/zero (block device 1.00 TiB) ioping statistics ---
4 requests completed in 88.0 us, 16 KiB written, 45.4 k iops, 177.5 MiB/s
generated 5 requests in 4.00 s, 20 KiB, 1 iops, 5.00 KiB/s
min/avg/max/mdev = 17.0 us / 22.0 us / 27.2 us / 3.65 us
root@pve-20:~# ioping -c5 /dev/mapper/zero -D -WWW -s4K
4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=1 time=22.9 us (warmup)
4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=2 time=15.7 us
4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=3 time=21.5 us
4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=4 time=21.1 us
4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=5 time=24.3 us

--- /dev/mapper/zero (block device 1.00 TiB) ioping statistics ---
4 requests completed in 82.6 us, 16 KiB written, 48.4 k iops, 189.2 MiB/s
generated 5 requests in 4.00 s, 20 KiB, 1 iops, 5.00 KiB/s
min/avg/max/mdev = 15.7 us / 20.6 us / 24.3 us / 3.09 us
root@pve-20:~#

root@pve-20:~# blkdiscard /dev/nvme0n1
root@pve-20:~# ioping -c10 /dev/nvme0n1 -D -Y -WWW -s4K
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=82.7 us (warmup)
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=78.6 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=63.2 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=72.4 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=75.4 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=82.4 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=71.9 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=84.8 us (slow)
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=95.6 us (slow)
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=84.9 us

--- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
9 requests completed in 709.1 us, 36 KiB written, 12.7 k iops, 49.6 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 63.2 us / 78.8 us / 95.6 us / 8.89 us
root@pve-20:~# ioping -c10 /dev/nvme0n1 -D -WWW -s4K
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=68.3 us (warmup)
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=70.3 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=81.4 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=81.9 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=83.0 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=91.7 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=71.1 us (fast)
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=87.9 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=81.2 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=60.4 us (fast)

--- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
9 requests completed in 708.9 us, 36 KiB written, 12.7 k iops, 49.6 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 60.4 us / 78.8 us / 91.7 us / 9.18 us
root@pve-20:~# ioping -c10 /dev/nvme0n1 -D -Y -WWW -s512
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=59.2 us (warmup)
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=63.6 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=64.8 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=63.4 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=516.1 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=502.1 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=510.5 us (slow)
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=502.9 us (slow)
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=496.3 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=505.5 us (slow)

--- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
9 requests completed in 3.23 ms, 4.50 KiB written, 2.79 k iops, 1.36 MiB/s
generated 10 requests in 9.00 s, 5 KiB, 1 iops, 568 B/s
min/avg/max/mdev = 63.4 us / 358.4 us / 516.1 us / 208.2 us
root@pve-20:~# ioping -c10 /dev/nvme0n1 -D -WWW -s512
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=491.5 us (warmup)
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=496.1 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=506.9 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=510.7 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=503.2 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=501.4 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=498.8 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=510.4 us (slow)
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=502.4 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=501.3 us

--- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
9 requests completed in 4.53 ms, 4.50 KiB written, 1.99 k iops, 993.1 KiB/s
generated 10 requests in 9.00 s, 5 KiB, 1 iops, 568 B/s
min/avg/max/mdev = 496.1 us / 503.5 us / 510.7 us / 4.70 us
root@pve-20:~#


root@pve-21:~# ioping -c10 /dev/mapper/zero -D -Y -WWW -s512
512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=1 time=13.4 us (warmup)
512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=2 time=22.6 us
512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=3 time=15.3 us
512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=4 time=26.1 us
512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=5 time=15.2 us
512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=6 time=20.8 us
512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=7 time=24.9 us
512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=8 time=15.2 us (fast)
512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=9 time=15.2 us (fast)
512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=10 time=15.7 us (fast)

--- /dev/mapper/zero (block device 1.00 TiB) ioping statistics ---
9 requests completed in 171.0 us, 4.50 KiB written, 52.6 k iops, 25.7 MiB/s
generated 10 requests in 9.00 s, 5 KiB, 1 iops, 568 B/s
min/avg/max/mdev = 15.2 us / 19.0 us / 26.1 us / 4.34 us
root@pve-21:~# ioping -c10 /dev/mapper/zero -D -WWW -s512
512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=1 time=14.3 us (warmup)
512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=2 time=22.4 us
512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=3 time=25.9 us
512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=4 time=14.8 us
512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=5 time=24.8 us
512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=6 time=24.6 us
512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=7 time=13.7 us (fast)
512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=8 time=18.2 us
512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=9 time=15.4 us
512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=10 time=15.2 us

--- /dev/mapper/zero (block device 1.00 TiB) ioping statistics ---
9 requests completed in 174.9 us, 4.50 KiB written, 51.5 k iops, 25.1 MiB/s
generated 10 requests in 9.00 s, 5 KiB, 1 iops, 568 B/s
min/avg/max/mdev = 13.7 us / 19.4 us / 25.9 us / 4.67 us
root@pve-21:~# ioping -c10 /dev/mapper/zero -D -Y -WWW -s4K
4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=1 time=22.3 us (warmup)
4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=2 time=17.3 us
4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=3 time=26.0 us
4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=4 time=27.0 us
4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=5 time=15.7 us
4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=6 time=18.1 us
4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=7 time=17.8 us
4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=8 time=16.9 us
4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=9 time=15.4 us (fast)
4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=10 time=15.5 us (fast)

--- /dev/mapper/zero (block device 1.00 TiB) ioping statistics ---
9 requests completed in 169.7 us, 36 KiB written, 53.0 k iops, 207.2 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 15.4 us / 18.9 us / 27.0 us / 4.21 us
root@pve-21:~# ioping -c10 /dev/mapper/zero -D -WWW -s4K
4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=1 time=22.4 us (warmup)
4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=2 time=15.3 us
4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=3 time=26.1 us
4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=4 time=15.0 us
4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=5 time=15.0 us
4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=6 time=17.8 us
4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=7 time=15.3 us (fast)
4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=8 time=15.3 us (fast)
4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=9 time=15.0 us (fast)
4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=10 time=14.9 us (fast)

--- /dev/mapper/zero (block device 1.00 TiB) ioping statistics ---
9 requests completed in 149.6 us, 36 KiB written, 60.2 k iops, 235.0 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 14.9 us / 16.6 us / 26.1 us / 3.47 us
root@pve-21:~#
root@pve-21:~# blkdiscard /dev/nvme0n1
root@pve-21:~# ioping -c5 /dev/nvme0n1 -D -Y -WWW -s512
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=461.1 us (warmup)
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=476.4 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=479.3 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=480.2 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=480.9 us

--- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
4 requests completed in 1.92 ms, 2 KiB written, 2.09 k iops, 1.02 MiB/s
generated 5 requests in 4.00 s, 2.50 KiB, 1 iops, 639 B/s
min/avg/max/mdev = 476.4 us / 479.2 us / 480.9 us / 1.73 us
root@pve-21:~# ioping -c5 /dev/nvme0n1 -D -WWW -s512
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=456.1 us (warmup)
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=423.0 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=424.8 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=433.3 us
512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=446.3 us

--- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
4 requests completed in 1.73 ms, 2 KiB written, 2.31 k iops, 1.13 MiB/s
generated 5 requests in 4.00 s, 2.50 KiB, 1 iops, 639 B/s
min/avg/max/mdev = 423.0 us / 431.9 us / 446.3 us / 9.23 us
root@pve-21:~# ioping -c5 /dev/nvme0n1 -D -Y -WWW -s4K
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=88.9 us (warmup)
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=79.8 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=70.9 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=94.3 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=72.8 us

--- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
4 requests completed in 317.7 us, 16 KiB written, 12.6 k iops, 49.2 MiB/s
generated 5 requests in 4.00 s, 20 KiB, 1 iops, 5.00 KiB/s
min/avg/max/mdev = 70.9 us / 79.4 us / 94.3 us / 9.20 us
root@pve-21:~# ioping -c5 /dev/nvme0n1 -D -WWW -s4K
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=86.4 us (warmup)
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=119.0 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=66.1 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=72.4 us
4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=73.1 us

--- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
4 requests completed in 330.6 us, 16 KiB written, 12.1 k iops, 47.3 MiB/s
generated 5 requests in 4.00 s, 20 KiB, 1 iops, 5.00 KiB/s
min/avg/max/mdev = 66.1 us / 82.7 us / 119.0 us / 21.2 us
root@pve-21:~#

On Thursday, 26 May 2022 17:28:36 BRT, Eric Wheeler <bcache@lists.ewheeler.net> wrote: 

On Thu, 26 May 2022, Adriano Silva wrote:
> This is an enterprise NVMe device with a Power Loss Protection system. It 
> has a non-volatile cache.
> 
> Before purchasing these enterprise devices, I did tests with consumer 
> NVMe. Consumer device performance is acceptable only on hardware cached 
> writes. But on the contrary on consumer devices in tests with fio 
> passing parameters for direct and synchronous writing (--direct=1 
> --fsync=1 --rw=randwrite --bs=4K --numjobs=1 --iodepth=1) the 
> performance is very low. So today I'm using enterprise NVME with 
> tantalum capacitors which makes the cache non-volatile and performs much 
> better when written directly to the hardware. But the performance issue 
> is only occurring when the write is directed to the bcache device.
> 
> Here is information from my Hardware you asked for (Eric), plus some 
> additional information to try to help.
> 
> root@pve-20:/# blockdev --getss /dev/nvme0n1
> 512
> root@pve-20:/# blockdev --report /dev/nvme0n1
> RO    RA   SSZ   BSZ   StartSec            Size   Device
> rw   256   512  4096          0    960197124096   /dev/nvme0n1

> root@pve-20:~# nvme id-ctrl -H /dev/nvme0n1 |grep -A1 vwc
> vwc       : 0
>   [0:0] : 0    Volatile Write Cache Not Present

Please confirm that this says "write back":

]# cat /sys/block/nvme0n1/queue/write_cache 

Try this to set _all_ blockdevs to write-back and see if it affects
performance (warning: power loss is unsafe for non-volatile caches after 
this command):

]# for i in /sys/block/*/queue/write_cache; do echo 'write back' > $i; done

> An interesting thing to note is that when I test using fio with 
> --bs=512, the direct hardware performance is horrible (~1MB/s).

I think you know this already, but for CYA:

  WARNING: THESE ARE DESTRUCTIVE WRITES, DO NOT USE ON PRODUCTION DATA!

Please post `ioping` stats for each server you are testing (some of these 
you may have already posted, but if you can place them inline of this same 
response it would be helpful so we don't need to dig into old emails).

]# blkdiscard /dev/nvme0n1

]# ioping -c10 /dev/nvme0n1 -D -Y -WWW -s512
]# ioping -c10 /dev/nvme0n1 -D -WWW -s512

]# ioping -c10 /dev/nvme0n1 -D -Y -WWW -s4k
]# ioping -c10 /dev/nvme0n1 -D -WWW -s4k

Next, lets rule out backing-device interference by creating a dummy
mapper device that has 128mb of ramdisk for persistent meta storage
(superblock, LVM, etc) but presents as a 1TB volume in size; writes
beyond 128mb are dropped:

    ]# modprobe brd rd_size=$((128*1024))

    ]# cat << EOT | dmsetup create zero
    0 262144 linear /dev/ram0 0
    262144 2147483648 zero
    EOT
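
(For reference: the table above is in 512-byte sectors, so the first 262144
sectors (128 MiB) are backed by the ramdisk and the remaining 2147483648
sectors (1 TiB) hit the "zero" target, which discards writes.  A quick
sanity check:)

    ]# dmsetup table zero
    ]# blockdev --getsize64 /dev/mapper/zero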

Then use that as your backing device:

    ]# blkdiscard /dev/nvme0n1
    ]# make-bcache -w 512 -B /dev/mapper/zero -C /dev/nvme0n1 --writeback

]# ioping -c10 /dev/bcache0 -D -Y -WWW -s512
]# ioping -c10 /dev/bcache0 -D -WWW -s512

]# ioping -c10 /dev/bcache0 -D -Y -WWW -s4k
]# ioping -c10 /dev/bcache0 -D -WWW -s4k

Test again with -w 4096:
    ]# blkdiscard /dev/nvme0n1
    ]# make-bcache -w 4096 -B /dev/mapper/zero -C /dev/nvme0n1 --writeback

]# ioping -c10 /dev/bcache0 -D -Y -WWW -s4k
]# ioping -c10 /dev/bcache0 -D -WWW -s4k

# These should error with -w 4096 because 512 is too small:

]# ioping -c10 /dev/bcache0 -D -Y -WWW -s512
]# ioping -c10 /dev/bcache0 -D -WWW -s512

> root@pve-20:/# fio --filename=/dev/nvme0n1p2 --direct=1 --fsync=1 --rw=randwrite --bs=512 --numjobs=1 --iodepth=1 --runtime=5 --time_based --group_reporting --name=journal-test --ioengine=libaio
>   write: IOPS=2087, BW=1044KiB/s (1069kB/s)(5220KiB/5001msec); 0 zone resets
>          ^^^^^^^^^ 
> But the same test directly on the hardware with fio passing the
> parameter --bs=4K, the performance completely changes, for the better
> (~130MB/s).
>
> root@pve-20:/# fio --filename=/dev/nvme0n1p2 --direct=1 --fsync=1 --rw=randwrite --bs=4K --numjobs=1 --iodepth=1 --runtime=5 --time_based --group_reporting --name=journal-test --ioengine=libaio
>   write: IOPS=31.9k, BW=124MiB/s (131MB/s)(623MiB/5001msec); 0 zone resets
>          ^^^^^^^^^^
> Does anything justify this difference?

I think you may have discovered the problem and the `ioping`s above
might confirm that.

IOPS are a better metric here, not MB/sec, because smaller IOs will
always yield less bandwidth simply by being smaller, and RTT is a
factor.  However, IOPS are ~16x lower than the expected 8x difference
(512/4096=1/8) so something else is going on. 

The hardware is probably addressed 4k internally "4Kn" (with even larger 
erase pages that the FTL manages).  Sending it a bunch of 512-byte IOs may 
trigger a read-modify-write operation on the flash controller and is 
(probably) spinning CPU cycles on the flash controller itself. A firmware 
upgrade on the NVMe might help if they have addressed this.

This is speculation, but assuming that internally the flash uses 4k 
sectors, it is doing something like this (pseudo code):

    1. new_data = fetch_from_pcie()
    2. rmw = read_sector(LBA)
    3. memcpy(rmw+offset, new_data, 512)
    4. queue_write_to_flash(rmw, LBA)

> Maybe that's why when I create bcache with the -w=4K option the 
> performance improves. Not as much as I'd like, but it gets better.
> [...] 
> The buckets, I read that it would be better to put the hardware device 
> erase block size. However, I have already tried to find this information 
> by reading the device, also with the manufacturer, but without success. 
> So I have no idea which bucket size would be best, but from my tests, 
> the default of 512KB seems to be adequate.

It might be worth testing power-of-2 bucket sizes to see what works best
for your workload.  Note that `fio --rw=randwrite` may not be
representative of your "real" workload so randwrite could be a good
place to start, but bench your real workload against bucket sizes to see
what works best.
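
A rough sketch of such a sweep (illustrative only, reusing the
/dev/mapper/zero backing device from above; swap the fio job for something
closer to your real workload):

    ]# for b in 128k 256k 512k 1M 2M; do
           echo "=== bucket size $b ==="
           blkdiscard /dev/nvme0n1
           wipefs -a /dev/mapper/zero
           make-bcache -w 512 -b $b -B /dev/mapper/zero -C /dev/nvme0n1 --writeback
           udevadm settle; sleep 2
           echo 0 > /sys/block/bcache0/bcache/sequential_cutoff
           fio --filename=/dev/bcache0 --direct=1 --fsync=1 --rw=randwrite \
               --bs=4K --numjobs=1 --iodepth=1 --runtime=30 --time_based \
               --group_reporting --name=bucket-$b --ioengine=libaio
           echo 1 > /sys/block/bcache0/bcache/stop    # stop the bcache device
           echo 1 > /sys/fs/bcache/*/stop             # unregister the cache set
           udevadm settle; sleep 2
       done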

> Eric, perhaps it is not such a simple task to recompile the Kernel with 
> the suggested change. I'm working with Proxmox 6.4. I'm not sure, but I 
> think the Kernel may have some adaptation. It is based on Kernel 5.4, 
> which it is approved for.

Keith and Christoph corrected me; as noted above, this does the same 
thing, so no need to hack on the kernel to change flush behavior:

    echo 'write back' > /sys/block/<DEV>/queue/write_cache

> Also listening to Coly's suggestion, I'll try to perform tests with the 
> Kernel version 5.15 to see if it can solve. Would this version be good 
> enough? It's just that, as I said above, as I'm using Proxmox, I'm 
> afraid to change the Kernel version they provide.

I'm guessing proxmox doesn't care too much about the kernel version as
long as the modules you use are built.  Just copy your existing .config
(usually /boot/config-<version>) as
kernel-source-dir/.config and run `make oldconfig` (or `make menuconfig`
and save+exit, which is what I usually do).
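
Roughly something like this (a sketch; the source path and version are just
placeholders, and on Proxmox/Debian `make bindeb-pkg` produces installable
.deb packages):

    ]# cd /usr/src/linux-5.15.x            # wherever the kernel source lives
    ]# cp /boot/config-$(uname -r) .config
    ]# make oldconfig                      # or `make menuconfig`, save + exit
    ]# make -j$(nproc) bindeb-pkg
    ]# dpkg -i ../linux-image-*.deb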

> Eric, to be clear, the hardware I'm using has only 1 processor socket.

Ok, so not a cacheline bounce issue.

> I'm trying to test with another identical computer (the same 
> motherboard, the same processor, the same NVMe, with the difference that 
> it only has 12GB of RAM, the first having 48GB). It is an HP Z400 
> Workstation with an Intel Xeon X5680 sixcore processor (12 threads), 
> DDR3 1333MHz 10600E (old computer).

Is this second server still a single-socket?

> On the second computer, I put a newer version of the distribution that 
> uses Kernel based on version 5.15. I am now comparing the performance of 
> the two computers in the lab.
> 
> On this second computer I had worse performance than the first one 
> (practically half the performance with bcache), despite the performance 
> of the tests done directly in NVME being identical.
> 
> I tried going back to the same OS version on the first computer to try 
> and keep the exact same scenario on both computers so I could first 
> compare the two. I try to keep the exact same software configuration. 
> However, there were no changes. Is it the low RAM that makes the 
> performance worse in the second?

The amount of memory isn't an issue, but CPU clock speed or memory speed 
might.  If server-2 has 2x sockets then make sure NVMe interrupts hit the 
socket where it is attached.  Could be a PCIe version thing, but I 
don't think you are saturating the PCIe link.
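
A quick way to check (the IRQ number <n> comes from the grep output):

    ]# lscpu | grep -E 'Socket|NUMA'         # sockets / NUMA nodes
    ]# grep nvme /proc/interrupts            # which CPUs service the NVMe IRQs
    ]# cat /proc/irq/<n>/smp_affinity_list   # affinity of one of those IRQs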

> I noticed a difference in behavior on the second computer compared to 
> the first in dstat. While the first computer doesn't seem to touch the 
> backing device at all, the second computer signals something a little 
> different, as although it doesn't write data to the backing disk, it does 
> signal IO movement. Strange, no?
> 
> Let's look at the dstat of the first computer:
> 
> --dsk/sdb---dsk/nvme0n1-dsk/bcache0 ---io/sdb----io/nvme0n1--io/bcache0 -net/total- ---load-avg--- --total-cpu-usage-- ---system-- ----system---- async
>  read  writ: read  writ: read  writ| read  writ: read  writ: read  writ| recv  send| 1m   5m  15m |usr sys idl wai stl| int   csw |     time     | #aio
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |6953B 7515B|0.13 0.26 0.26|  0   0  99   0   0| 399   634 |25-05 09:41:42|   0
>    0  8192B:4096B 2328k:   0  1168k|   0  2.00 :1.00   586 :   0   587 |9150B 2724B|0.13 0.26 0.26|  2   2  96   0   0|1093  3267 |25-05 09:41:43|   1B
>    0     0 :   0    58M:   0    29M|   0     0 :   0  14.8k:   0  14.7k|  14k 9282B|0.13 0.26 0.26|  1   3  94   2   0|  16k   67k|25-05 09:41:44|   1B
>    0     0 :   0    58M:   0    29M|   0     0 :   0  14.9k:   0  14.8k|  10k 8992B|0.13 0.26 0.26|  1   3  93   2   0|  16k   69k|25-05 09:41:45|   1B
>    0     0 :   0    58M:   0    29M|   0     0 :   0  14.9k:   0  14.8k|7281B 4651B|0.13 0.26 0.26|  1   3  92   4   0|  16k   67k|25-05 09:41:46|   1B
>    0     0 :   0    59M:   0    30M|   0     0 :   0  15.2k:   0  15.1k|7849B 4729B|0.20 0.28 0.27|  1   4  94   2   0|  16k   69k|25-05 09:41:47|   1B
>    0     0 :   0    57M:   0    28M|   0     0 :   0  14.4k:   0  14.4k|  11k 8584B|0.20 0.28 0.27|  1   3  94   2   0|  15k   65k|25-05 09:41:48|   0
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |4086B 7720B|0.20 0.28 0.27|  0   0 100   0   0| 274   332 |25-05 09:41:49|   0
> 
> Note that on this first computer, the writings and IOs of the backing 
> device (sdb) remain motionless. While NVMe device IOs track bcache0 
> device IOs at ~14.8K
> 
> Let's see the dstat now on the second computer:
> 
> --dsk/sdd---dsk/nvme0n1-dsk/bcache0 ---io/sdd----io/nvme0n1--io/bcache0 -net/total- ---load-avg--- --total-cpu-usage-- ---system-- ----system---- async
>  read  writ: read  writ: read  writ| read  writ: read  writ: read  writ| recv  send| 1m   5m  15m |usr sys idl wai stl| int   csw |     time     | #aio
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |9254B 3301B|0.15 0.19 0.11|  1   2  97   0   0| 360   318 |26-05 06:27:15|   0
>    0  8192B:4096B   19M:   0  9600k|   0  2402 :1.00  4816 :   0  4801 |8826B 3619B|0.15 0.19 0.11|  0   1  98   0   0|8115    27k|26-05 06:27:16|   1B
>    0     0 :   0    21M:   0    11M|   0  2737 :   0  5492 :   0  5474 |4051B 2552B|0.15 0.19 0.11|  0   2  97   1   0|9212    31k|26-05 06:27:17|   1B
>    0     0 :   0    23M:   0    11M|   0  2890 :   0  5801 :   0  5781 |4816B 2492B|0.15 0.19 0.11|  1   2  96   2   0|9976    34k|26-05 06:27:18|   1B
>    0     0 :   0    23M:   0    11M|   0  2935 :   0  5888 :   0  5870 |4450B 2552B|0.22 0.21 0.12|  0   2  96   2   0|9937    33k|26-05 06:27:19|   1B
>    0     0 :   0    22M:   0    11M|   0  2777 :   0  5575 :   0  5553 |8644B 1614B|0.22 0.21 0.12|  0   2  98   0   0|9416    31k|26-05 06:27:20|   1B
>    0     0 :   0  2096k:   0  1040k|   0   260 :   0   523 :   0   519 |  10k 8760B|0.22 0.21 0.12|  0   1  99   0   0|1246  3157 |26-05 06:27:21|   0
>    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |4083B 2990B|0.22 0.21 0.12|  0   0 100   0   0| 390   369 |26-05 06:27:22|   0

> In this case, with exactly the same command, we got a very different 
> result. While writes to the backing device (sdd) do not happen (this is 
> correct), we noticed that IOs occur on both the NVMe device and the 
> backing device (I think this is wrong), but at a much lower rate now, 
> around 5.6K on NVMe and 2.8K on the backing device. It leaves the 
> impression that although it is not writing anything to the sdd device, it 
> is sending some signal to the backing device for every two IO operations 
> performed on the cache device. And that could be delaying the response. 
> Could it be something like this?

I think in newer kernels that bcache is more aggressive at writeback. 
Using /dev/mapper/zero as above will help rule out backing device 
interference.  Also make sure you have the sysfs flags turned to encourage 
it to write to SSD and not bypass:

    echo 0 > /sys/block/bcache0/bcache/sequential_cutoff
    echo 10000000 > /sys/block/bcache0/bcache/cache/congested_read_threshold_us 
    echo 10000000 > /sys/block/bcache0/bcache/cache/congested_write_threshold_us

> It is important to point out that the writeback mode is on, obviously, 
> and that the sequential cutoff is at zero, but I tried to put default 
> values or high values and there were no changes. I also tried changing 
> congested_write_threshold_us and congested_read_threshold_us, also with 
> no result changes.

Try this too: 
    echo 300 > /sys/block/bcache0/bcache/writeback_delay

and make sure bcache is in writeback (echo writeback > 
/sys/block/bcache0/bcache/cache_mode) in case that was not configured on 
server2.


-Eric

> The only thing I noticed different between the configurations of the two 
> computers was btree_cache_size, which on the first is much larger (7.7M) 
> m while on the second it is only 768K. But I don't know if this 
> parameter is configurable and if it could justify the difference.
> 
> Disabling Intel's Turbo Boost technology through the BIOS appears to 
> have no effect.
> 
> And we will continue our tests comparing the two computers, including to 
> test the two versions of the Kernel. If anyone else has ideas, thanks!


> 
> On Tuesday, 17 May 2022 22:23:09 BRT, Eric Wheeler <bcache@lists.ewheeler.net> wrote: 
> 
> 
> 
> 
> 
> On Tue, 10 May 2022, Adriano Silva wrote:
> > I'm trying to set up a flash disk NVMe as a disk cache for two or three 
> > isolated (I will use 2TB disks, but in these tests I used a 1TB one) 
> > spinning disks that I have on a Linux 5.4.174 (Proxmox node).
> 
> Coly has been adding quite a few optimizations over the years.  You might 
> try a new kernel and see if that helps.  More below.
> 
> > I'm using a NVMe (960GB datacenter devices with tantalum capacitors) as 
> > a cache.
> > [...]
> >
> > But when I do the same test on bcache writeback, the performance drops a 
> > lot. Of course, it's better than the performance of spinning disks, but 
> > much worse than when accessed directly from the NVMe device hardware.
> >
> > [...]
> > As we can see, the same test done on the bcache0 device only got 1548 
> > IOPS and that yielded only 6.3 KB/s.
> 
> Well done on the benchmarking!  I always thought our new NVMes performed 
> slower than expected but hadn't gotten around to investigating. 
> 
> > I've noticed in several tests, varying the amount of jobs or increasing 
> > the size of the blocks, that the larger the size of the blocks, the more 
> > I approximate the performance of the physical device to the bcache 
> > device.
> 
> You said "blocks" but did you mean bucket size (make-bcache -b) or block 
> size (make-bcache -w) ?
> 
> If larger buckets makes it slower than that actually surprises me: bigger 
> buckets means less metadata and better sequential writeback to the 
> spinning disks (though you hadn't yet hit writeback to spinning disks in 
> your stats).  Maybe you already tried, but varying the bucket size might 
> help.  Try graphing bucket size (powers of 2) against IOPS, maybe there is 
> a "sweet spot"?
> 
> Be aware that 4k blocks (so-called "4Kn") is unsafe for the cache device, 
> unless Coly has patched that.  Make sure your `blockdev --getss` reports 
> 512 for your NVMe!
> 
> Hi Coly,
> 
> > Some time ago you ordered an SSD to test the 4k cache issue, has that 
> been fixed?  I've kept an eye out for the patch but not sure if it was released.
> 
> You have a really great test rig setup with NVMes for stress
> testing bcache. Can you replicate Adriano's `ioping` numbers below?
> 
> > With ioping it is also possible to notice a limitation, as the latency 
> > of the bcache0 device is around 1.5ms, while in the case of the raw 
> > device (a partition of NVMe), the same test is only 82.1us.
> > 
> > root@pve-20:~# ioping -c10 /dev/bcache0 -D -Y -WWW -s4k
> > 4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=1 time=1.52 ms (warmup)
> > 4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=2 time=1.60 ms
> > 4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=3 time=1.55 ms
> >
> > root@pve-20:~# ioping -c10 /dev/nvme0n1p2 -D -Y -WWW -s4k
> > 4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=1 time=81.2 us (warmup)
> > 4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=2 time=82.7 us
> > 4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=3 time=82.4 us
> 
> Wow, almost 20x higher latency, sounds convincing that something is wrong.
> 
> A few things to try:
> 
> 1. Try ioping without -Y.  How does it compare?
> 
> 2. Maybe this is an inter-socket latency issue.  Is your server 
>   multi-socket?  If so, then as a first pass you could set the kernel 
>   cmdline `isolcpus` for testing to limit all processes to a single 
>   socket where the NVMe is connected (see `lscpu`).  Check `hwloc-ls`
>   or your motherboard manual to see how the NVMe port is wired to your
>   CPUs.
> 
>   If that helps then fine tune with `numactl -cN ioping` and 
>   /proc/irq/<n>/smp_affinity_list (and `grep nvme /proc/interrupts`) to 
>   make sure your NVMe's are locked to IRQs on the same socket.
> 
> 3a. sysfs:
> 
> > # echo 0 > /sys/block/bcache0/bcache/sequential_cutoff
> 
> good.
> 
> > # echo 0 > /sys/fs/bcache/<cache set>/congested_read_threshold_us
> > # echo 0 > /sys/fs/bcache/<cache set>/congested_write_threshold_us
> 
> Also try these (I think bcache/cache is a symlink to /sys/fs/bcache/<cache set>)
> 
> echo 10000000 > /sys/block/bcache0/bcache/cache/congested_read_threshold_us 
> echo 10000000 > /sys/block/bcache0/bcache/cache/congested_write_threshold_us
> 
> 
> Try tuning journal_delay_ms: 
>   /sys/fs/bcache/<cset-uuid>/journal_delay_ms
>     Journal writes will delay for up to this many milliseconds, unless a 
>     cache flush happens sooner. Defaults to 100.
> 
> 3b: Hacking bcache code:
> 
> I just noticed that journal_delay_ms says "unless a cache flush happens 
> sooner" but cache flushes can be re-ordered so flushing the journal when 
> REQ_OP_FLUSH comes through may not be useful, especially if there is a 
> high volume of flushes coming down the pipe because the flushes could kill 
> the NVMe's cache---and maybe the 1.5ms ping is actual flash latency.  It
> would flush data and journal.
> 
> Maybe there should be a cachedev_noflush sysfs option for those with some 
> kind of power-loss protection on their SSDs.  It looks like this is 
> handled in request.c when these functions call bch_journal_meta():
> 
>     1053: static void cached_dev_nodata(struct closure *cl)
>     1263: static void flash_dev_nodata(struct closure *cl)
> 
> Coly can you comment about journal flush semantics with respect to 
> performance vs correctness and crash safety?
> 
> Adriano, as a test, you could change this line in search_alloc() in 
> request.c:
> 
>     - s->iop.flush_journal    = op_is_flush(bio->bi_opf);
>     + s->iop.flush_journal    = 0;
> 
> and see how performance changes.
> 
> Someone correct me if I'm wrong, but I don't think flush_journal=0 will 
> affect correctness unless there is a crash.  If that /is/ the performance 
> problem then it would narrow the scope of this discussion.
> 
> 4. I wonder if your 1.5ms `ioping` stats scale with CPU clock speed: can 
>   you set your CPU governor to run at full clock speed and then slowest 
>   clock speed to see if it is a CPU limit somewhere as we expect?
> 
>   You can do `grep MHz /proc/cpuinfo` to see the active rate to make sure 
>   the governor did its job.  
> 
>   If it scales with CPU then something in bcache is working too hard.  
>   Maybe garbage collection?  Other devs would need to chime in here to 
>   steer the troubleshooting if that is the case.
> 
> 
> 5. I'm not sure if garbage collection is the issue, but you might try 
>   Mingzhe's dynamic incremental gc patch:
>     https://www.spinics.net/lists/linux-bcache/msg11185.html
> 
> 6. Try dm-cache and see if its IO latency is similar to bcache: If it is 
>   about the same then that would indicate an issue in the block layer 
>   somewhere outside of bcache.  If dm-cache is better, then that confirms 
>   a bcache issue.
> 
> 
> > The cache was configured directly on one of the NVMe partitions (in this 
> > case, the first partition). I did several tests using fio and ioping, 
> > testing on a partition on the NVMe device, without partition and 
> > directly on the raw block, on a first partition, on the second, with or 
> > without configuring bcache. I did all this to remove any doubt as to the 
> > method. The results of tests performed directly on the hardware device, 
> > without going through bcache are always fast and similar.
> > 
> > But tests in bcache are always slower. If you use writethrough, of 
> > course, it gets much worse, because the performance is equal to the raw 
> > spinning disk.
> > 
> > Using writeback improves a lot, but still doesn't use the full speed of 
> > NVMe (honestly, much less than full speed).
> 
> Indeed, I hope this can be fixed!  A 20x improvement in bcache would 
> be awesome.
> 
> > But I've also noticed that there is a limit on writing sequential data, 
> > which is a little more than half of the maximum write rate shown in 
> > direct tests by the NVMe device.
> 
> For sync, async, or both?
> 
> 
> > Processing doesn't seem to be going up like the tests.
> 
> 
> What do you mean "processing" ?
> 
> -Eric
> 
> 
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Bcache in writes direct with fsync. Are IOPS limited?
  2022-05-26 19:15       ` Eric Wheeler
@ 2022-05-27 17:28         ` colyli
  2022-05-28  0:58           ` Eric Wheeler
  0 siblings, 1 reply; 37+ messages in thread
From: colyli @ 2022-05-27 17:28 UTC (permalink / raw)
  To: Eric Wheeler; +Cc: Adriano Silva, Bcache Linux, Matthias Ferdinand

On 2022-05-27 03:15, Eric Wheeler wrote:
> On Mon, 23 May 2022, Coly Li wrote:
>> On 5/18/22 9:22 AM, Eric Wheeler wrote:
>> > Some time ago you ordered an SSD to test the 4k cache issue, has that
>> > been fixed?  I've kept an eye out for the patch but not sure if it was
>> > released.
>> 
>> Yes, I got the Intel P3700 PCIe SSD to fix the 4Kn unaligned I/O issue
>> (borrowed from a hardware vendor). The new situation is, the current kernel
>> does the sector size alignment checking quite early in the bio layer; if the
>> LBA is not sector size aligned, it is rejected in the bio code, and the
>> underlying driver doesn't have a chance to see the bio anymore. So for now,
>> an unaligned LBA for a 4Kn device cannot reach the bcache code, that's to
>> say, the original reported condition won't happen now.
> 
> The issue is not with unaligned 4k IOs hitting /dev/bcache0 because you
> are right, the bio layer will reject those before even getting to
> bcache:
> 
> The issue is that the bcache cache metadata sometimes makes metadata or
> journal requests from _inside_ bcache that are not 4k aligned.  When
> this happens the bio layer rejects the request from bcache (not from
> whatever is above bcache).
> 
> Correct me if I misunderstood what you meant here, maybe it really was
> fixed.  Here is your response from that old thread that pointed at
> unaligned key access where you said "Wow, the above lines are very
> informative, thanks!"
> 

It was not fixed, at least I didn't do it on purpose. Maybe it was avoided
by other fixes, e.g. the oversize bkey fix. But I don't have evidence the
issue was fixed.

> bcache: check_4k_alignment() KEY_OFFSET(&w->key) is not 4KB aligned: 15725385535
>   https://www.spinics.net/lists/linux-bcache/msg06076.html
> 
> In that thread Kent sent a quick top-post asking "have you checked extent 
> merging?"
> 	https://www.spinics.net/lists/linux-bcache/msg06077.html
> 

It embarrassed me that I received your informative debug information, and I
glared very hard at the code for quite a long time, but didn't have any clue
how such a problem could happen in the extent-related code.

Since you reported the issue and I believe you, I will keep my eyes on the
non-aligned 4Kn issue for bcache internal I/O. Hope someday an idea may
suddenly come to me to point out where the problem is, and fix it.


>> And after this observation, I stopped my investigation on the unaligned 
>> sector size I/O on 4Kn device, and returned the P3700 PCIe SSD to the 
>> hardware vendor.
> 
> Hmm, sorry that it wasn't reproduced.  I hope I'm wrong, but if bcache is 
> generating the 4k-unaligned requests against the cache meta then this bug 
> might still be floating around for "4Kn" cache users.
> 

I don't think you were wrong, you are someone I believe :-) It just needs
time and luck...

Coly Li

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Bcache in writes direct with fsync. Are IOPS limited?
  2022-05-27 17:28         ` colyli
@ 2022-05-28  0:58           ` Eric Wheeler
  0 siblings, 0 replies; 37+ messages in thread
From: Eric Wheeler @ 2022-05-28  0:58 UTC (permalink / raw)
  To: colyli; +Cc: Adriano Silva, Bcache Linux, Matthias Ferdinand

[-- Attachment #1: Type: text/plain, Size: 3375 bytes --]

On Sat, 28 May 2022, colyli wrote:
> On 2022-05-27 03:15, Eric Wheeler wrote:
> > On Mon, 23 May 2022, Coly Li wrote:
> >> On 5/18/22 9:22 AM, Eric Wheeler wrote:
> >> > Some time ago you ordered an SSD to test the 4k cache issue, has that
> >> > been fixed?  I've kept an eye out for the patch but not sure if it was
> >> > released.
> >> 
> >> Yes, I got the Intel P3700 PCIe SSD to fix the 4Kn unaligned I/O issue
> >> (borrowed from a hardware vendor). The new situation is, the current kernel
> >> does the sector size alignment checking quite early in the bio layer; if the
> >> LBA is not sector size aligned, it is rejected in the bio code, and the
> >> underlying driver doesn't have a chance to see the bio anymore. So for now,
> >> an unaligned LBA for a 4Kn device cannot reach the bcache code, that's to
> >> say, the original reported condition won't happen now.
> > 
> > The issue is not with unaligned 4k IOs hitting /dev/bcache0 because you
> > are right, the bio layer will reject those before even getting to
> > bcache:
> > 
> > The issue is that the bcache cache metadata sometimes makes metadata or
> > journal requests from _inside_ bcache that are not 4k aligned.  When
> > this happens the bio layer rejects the request from bcache (not from
> > whatever is above bcache).
> > 
> > Correct me if I misunderstood what you meant here, maybe it really was
> > fixed.  Here is your response from that old thread that pointed at
> > unaligned key access where you said "Wow, the above lines are very
> > informative, thanks!"
> > 
> 
> It was not fixed, at least I didn't do it on purpose. Maybe it was avoided
> by other fixes, e.g. the oversize bkey fix. But I don't have evidence the
> issue was fixed.
> 
> > bcache: check_4k_alignment() KEY_OFFSET(&w->key) is not 4KB aligned: 15725385535
> >   https://www.spinics.net/lists/linux-bcache/msg06076.html
> > 
> > In that thread Kent sent a quick top-post asking "have you checked extent
> > merging?"
> >  https://www.spinics.net/lists/linux-bcache/msg06077.html
> > 
> 
> It embarrassed me that I received your informative debug information, and I
> glared very hard at the code for quite a long time, but didn't have any clue
> how such a problem could happen in the extent-related code.

You do great work on bcache, I appreciate everything you do.  No need to 
be embarrassed, this is just a hard bug to pin down!

> Since you reported the issue and I believe you, I will keep my eyes on the
> non-aligned 4Kn issue for bcache internal I/O. Hope someday an idea may
> suddenly come to me to point out where the problem is, and fix it.

You might try this for testing:

1. Format with -w 4096

2. Add some WARN_ONCE's around metadata and journal IO operations and run 
   it through your stress test to see what turns up.  The -w 4096 will 
   guarantee that all userspace IOs are 4k aligned, and then if any WARN's 
   trigger then they are suspect.  Even on 512-byte cache deployments we 
   should target 4k-aligned meta IOs hitting the SSD cache.  This would
   fix 2 things:

      a. It will guarantee that all journal/meta IOs are aligned to 4k for 
         4k cache users.

      b. Fix Adriano's performance issues since for at least his Hynix 
         SSD because 512b IOs are ~6x high latency than 4k IOs on his 
         system.

--
Eric Wheeler


> 
> Coly Li
> 
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Bcache in writes direct with fsync. Are IOPS limited?
  2022-05-27  4:07         ` Adriano Silva
@ 2022-05-28  1:27           ` Eric Wheeler
  2022-05-28  7:22             ` Matthias Ferdinand
  0 siblings, 1 reply; 37+ messages in thread
From: Eric Wheeler @ 2022-05-28  1:27 UTC (permalink / raw)
  To: Adriano Silva; +Cc: Coly Li, Bcache Linux, Matthias Ferdinand

[-- Attachment #1: Type: text/plain, Size: 86032 bytes --]

On Fri, 27 May 2022, Adriano Silva wrote:
> > Please confirm that this says "write back":
> 
> > ]# cat /sys/block/nvme0n1/queue/write_cache
> 
> No, this says "write through"
> 
> 
> > ]# for i in /sys/block/*/queue/write_cache; do echo 'write back' > $i; done
> Done!
> 
> I can say that after the write back command for all devices, the 
> performance of direct tests on NVME hardware gets much worse. Below you 
> can see this.

I wonder what is going on there!  I tried the same thing on my system and 
'write through' is faster for me, too, so it would be worth investigating.

Try this too in case the scheduler is getting in the way, but I don't 
think that's it at this point:

	echo none > /sys/block/nvme0n1/queue/scheduler
 
> What I realized doing the test was that, right after doing the blkdiscard 
> on the first server, the command took a long time, I think more than 1 
> minute, to return. After that, when doing the ioping, the latency was much 
> higher than what I'm used to. So I turned the server off and on again to 
> discard again and test. I noticed that it improved, as I show below.

Hmm, a server power cycle decreases IO latency?  Maybe it reset something 
in the NVMe embedded CPU that drives the FTL...

You could try changing the PCIe payload size to 512 or 256 if your BIOS 
has such a setting, but I would expect that to be slower...but maybe 
faster for 512b IO's?  Not sure but you could try it:
	https://www.techarp.com/bios-guide/pci-e-maximum-payload-size/
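
You can check what the link is currently negotiated to with something like
this (substitute your NVMe's PCI address from the first command):

	]# lspci | grep -i 'non-volatile'
	]# lspci -vv -s <bus:dev.fn> | grep -Ei 'MaxPayload|MaxReadReq'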

> From my understanding of the tests, it was clear that the performance of 
> direct writes to NVME hardware on the two servers is very similar. 
> Perhaps exactly the same. Also in NVME, when writing 512 Bytes at a 
> time, the latency starts well but gets worse after a few write 
> operations, which doesn't happen when writing 4K which always has better 
> performance.
> 
> In all scenarios, when using write cache on 
> /sys/block/nvme0n1/queue/write_cache, performance is severely degraded.
> 
> Also in all scenarios, when synchronization is required (parameter -Y), 
> the performance is slightly worse.

Ultimately it is clear that your NVMe doesn't like 512b requests, they 
have a ~6x higher RTT:

    512b:
        root@pve-21:~# ioping -c5 /dev/nvme0n1 -D -WWW -s512
        min/avg/max/mdev = 476.4 us / 479.2 us / 480.9 us / 1.73 us
                                      ^^^^^

        root@pve-21:~# ioping -c5 /dev/nvme0n1 -D -Y -WWW -s512
        min/avg/max/mdev = 476.4 us / 479.2 us / 480.9 us / 1.73 us
                                      ^^^^^
    4k:
        root@pve-21:~# ioping -c5 /dev/nvme0n1 -D -Y -WWW -s4K
        min/avg/max/mdev = 70.9 us / 79.4 us / 94.3 us / 9.20 us
                                     ^^^^

        root@pve-21:~# ioping -c5 /dev/nvme0n1 -D -WWW -s4K
        min/avg/max/mdev = 66.1 us / 82.7 us / 119.0 us / 21.2 us
                                     ^^^^

... so unfortunately this is a hardware issue that bcache can't easily fix.

You could format everything with -w 4096 if it will work for your 
application.  It is safe to run bcache with -w 4096, we have for 8 years 
now.
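
For example (a sketch; substitute your real backing disk for /dev/sdb):

	]# make-bcache -w 4096 -B /dev/sdb -C /dev/nvme0n1 --writeback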

However, keep your hardware sector size of the cache device (blockdev 
--getss /dev/nvme0n1) at 512b like it is now because there might be a 4k 
cache device alignment bug floating around.

If the bug is still out there, then it is caused by a possibility that 
bcache may send 512b-aligned IOs to 4k-aligned cachedevs.  In your case 
the only downside to that is lower performance, but if you were using a 
"4Kn" cache device then it may not be safe.

The only way to get a 4Kn cache device is to purchase an NVMe with 4Kn or 
use vendor tools to re-format it that way, so it's not going to happen by 
accident!  As far as I have seen, 4Kn backing devices are fine with 
-w 4096; the possible problem is just in the cache device.


I'm not sure what else to suggest at this point.  If you really need 512b 
IOs then you might try a few different NVMe vendors to benchmark and 
ioping them with 512b sectors to see how they respond and I would be 
curious to know what you find!

--
Eric Wheeler

> But between servers, there is no difference in bcache when the backing 
> device is in RAM.
> 
> >I think in newer kernels that bcache is more aggressive at writeback. 
> >Using /dev/mapper/zero as above will help rule out backing device 
> >interference.  Also make sure you have the sysfs flags turned to 
> >encourage it to write to SSD and not bypass
> 
> I actually went back to using the previous Kernel version (5.4) after I 
> noticed that it wouldn't have improved performance. Today, both servers 
> have version 5.4.
> 
> 
> Just below the result right after the blkdiscard that took a long time.
> 
> =========
> In first server
> 
> root@pve-20:~# cat /sys/block/nvme0n1/queue/write_cache
> write through
> root@pve-20:~# blkdiscard /dev/nvme0n1
> root@pve-20:~# ioping -c10 /dev/nvme0n1 -D -Y -WWW -s512
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=544.6 us (warmup)
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=388.1 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=1.44 ms
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=656.8 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=1.71 ms
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=1.83 ms
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=702.2 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=582.1 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=1.15 ms
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=1.07 ms
> 
> --- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
> 9 requests completed in 9.54 ms, 4.50 KiB written, 943 iops, 471.9 KiB/s
> generated 10 requests in 9.00 s, 5 KiB, 1 iops, 568 B/s
> min/avg/max/mdev = 388.1 us / 1.06 ms / 1.83 ms / 487.4 us
> root@pve-20:~#
> root@pve-20:~# ioping -c10 /dev/nvme0n1 -D -WWW -s512
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=1.28 ms (warmup)
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=678.8 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=725.3 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=1.25 ms
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=794.1 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=493.1 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=1.10 ms
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=1.06 ms
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=971.8 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=1.11 ms
> 
> --- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
> 9 requests completed in 8.19 ms, 4.50 KiB written, 1.10 k iops, 549.2 KiB/s
> generated 10 requests in 9.00 s, 5 KiB, 1 iops, 568 B/s
> min/avg/max/mdev = 493.1 us / 910.3 us / 1.25 ms / 235.1 us
> root@pve-20:~#
> root@pve-20:~# ioping -c10 /dev/nvme0n1 -D -Y -WWW -s4K
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=471.0 us (warmup)
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=1.06 ms
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=1.17 ms
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=1.29 ms
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=830.5 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=1.31 ms
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=1.40 ms (slow)
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=195.0 us (fast)
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=841.2 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=1.22 ms
> 
> --- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
> 9 requests completed in 9.32 ms, 36 KiB written, 965 iops, 3.77 MiB/s
> generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
> min/avg/max/mdev = 195.0 us / 1.04 ms / 1.40 ms / 352.0 us
> root@pve-20:~#
> root@pve-20:~# ioping -c10 /dev/nvme0n1 -D -WWW -s4K
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=645.2 us (warmup)
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=1.20 ms
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=1.41 ms
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=1.39 ms
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=978.4 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=75.8 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=68.6 us (fast)
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=74.0 us (fast)
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=73.7 us (fast)
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=67.0 us (fast)
> 
> --- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
> 9 requests completed in 5.34 ms, 36 KiB written, 1.68 k iops, 6.58 MiB/s
> generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
> min/avg/max/mdev = 67.0 us / 593.7 us / 1.41 ms / 595.1 us
> root@pve-20:~#
> 
> ==========
> Here, below the results after I shut down the first server and test again:
> 
> root@pve-20:~# blkdiscard /dev/nvme0n1
> root@pve-20:~# ioping -c10 /dev/nvme0n1 -D -Y -WWW -s512
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=68.4 us (warmup)
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=76.5 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=67.0 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=60.1 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=463.9 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=471.4 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=505.1 us (slow)
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=501.0 us (slow)
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=486.3 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=520.4 us (slow)
> 
> --- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
> 9 requests completed in 3.15 ms, 4.50 KiB written, 2.85 k iops, 1.39 MiB/s
> generated 10 requests in 9.00 s, 5 KiB, 1 iops, 568 B/s
> min/avg/max/mdev = 60.1 us / 350.2 us / 520.4 us / 200.3 us
> root@pve-20:~# ioping -c10 /dev/nvme0n1 -D -WWW -s512
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=460.8 us (warmup)
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=507.5 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=514.9 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=505.8 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=500.3 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=503.3 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=506.9 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=499.4 us (fast)
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=500.1 us (fast)
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=502.4 us
> 
> --- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
> 9 requests completed in 4.54 ms, 4.50 KiB written, 1.98 k iops, 991.0 KiB/s
> generated 10 requests in 9.00 s, 5 KiB, 1 iops, 568 B/s
> min/avg/max/mdev = 499.4 us / 504.5 us / 514.9 us / 4.64 us
> root@pve-20:~#
> root@pve-20:~# ioping -c10 /dev/nvme0n1 -D -Y -WWW -s4K
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=56.7 us (warmup)
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=81.7 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=60.0 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=78.0 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=75.1 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=79.7 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=91.2 us (slow)
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=76.6 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=79.0 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=87.1 us
> 
> --- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
> 9 requests completed in 708.4 us, 36 KiB written, 12.7 k iops, 49.6 MiB/s
> generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
> min/avg/max/mdev = 60.0 us / 78.7 us / 91.2 us / 8.20 us
> root@pve-20:~#
> root@pve-20:~# ioping -c10 /dev/nvme0n1 -D -WWW -s4K
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=86.6 us (warmup)
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=72.7 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=60.5 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=70.5 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=72.7 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=60.2 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=83.5 us (slow)
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=60.4 us (fast)
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=86.0 us (slow)
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=61.2 us (fast)
> 
> --- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
> 9 requests completed in 627.7 us, 36 KiB written, 14.3 k iops, 56.0 MiB/s
> generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
> min/avg/max/mdev = 60.2 us / 69.7 us / 86.0 us / 9.49 us
> root@pve-20:~#
> 
> ======= 
> On the second server...
> On the second server, blkdiscard didn't take long and the first result was the one below:
> 
> root@pve-21:~# cat /sys/block/nvme0n1/queue/write_cache
> write through
> root@pve-21:~# blkdiscard /dev/nvme0n1
> root@pve-21:~# ioping -c10 /dev/nvme0n1 -D -Y -WWW -s512
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=60.7 us (warmup)
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=71.9 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=77.4 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=61.2 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=468.2 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=497.0 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=491.8 us (slow)
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=490.6 us (slow)
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=494.4 us (slow)
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=493.9 us (slow)
> 
> --- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
> 9 requests completed in 3.15 ms, 4.50 KiB written, 2.86 k iops, 1.40 MiB/s
> generated 10 requests in 9.00 s, 5 KiB, 1 iops, 568 B/s
> min/avg/max/mdev = 61.2 us / 349.6 us / 497.0 us / 197.8 us
> root@pve-21:~# ioping -c10 /dev/nvme0n1 -D -WWW -s512
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=494.5 us (warmup)
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=490.6 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=490.3 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=489.8 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=492.3 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=488.1 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=496.0 us (slow)
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=492.1 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=493.0 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=508.0 us (slow)
> 
> --- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
> 9 requests completed in 4.44 ms, 4.50 KiB written, 2.03 k iops, 1013.5 KiB/s
> generated 10 requests in 9.00 s, 5 KiB, 1 iops, 568 B/s
> min/avg/max/mdev = 488.1 us / 493.3 us / 508.0 us / 5.60 us
> root@pve-21:~# ioping -c10 /dev/nvme0n1 -D -Y -WWW -s4K
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=84.9 us (warmup)
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=75.7 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=76.5 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=76.0 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=77.6 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=78.8 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=84.2 us (slow)
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=85.0 us (slow)
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=79.3 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=97.1 us (slow)
> 
> --- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
> 9 requests completed in 730.1 us, 36 KiB written, 12.3 k iops, 48.1 MiB/s
> generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
> min/avg/max/mdev = 75.7 us / 81.1 us / 97.1 us / 6.48 us
> root@pve-21:~# ioping -c10 /dev/nvme0n1 -D -WWW -s4K
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=80.8 us (warmup)
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=77.7 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=70.9 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=69.1 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=72.0 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=68.3 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=71.7 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=86.7 us (slow)
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=93.2 us (slow)
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=64.8 us (fast)
> 
> --- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
> 9 requests completed in 674.3 us, 36 KiB written, 13.3 k iops, 52.1 MiB/s
> generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
> min/avg/max/mdev = 64.8 us / 74.9 us / 93.2 us / 8.79 us
> root@pve-21:~#
> 
> ========== 
> After switching to write back and then going back to write through.
> In first server...
> 
> root@pve-20:~# for i in /sys/block/*/queue/write_cache; do echo 'write back' > $i; done
> root@pve-20:~# ioping -c10 /dev/nvme0n1 -D -Y -WWW -s512
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=2.31 ms (warmup)
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=2.37 ms
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=2.40 ms
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=2.45 ms
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=2.57 ms
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=2.46 ms
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=2.57 ms (slow)
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=2.56 ms
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=2.38 ms (fast)
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=2.48 ms
> 
> --- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
> 9 requests completed in 22.2 ms, 4.50 KiB written, 404 iops, 202.4 KiB/s
> generated 10 requests in 9.00 s, 5 KiB, 1 iops, 568 B/s
> min/avg/max/mdev = 2.37 ms / 2.47 ms / 2.57 ms / 75.2 us
> root@pve-20:~#
> root@pve-20:~# ioping -c10 /dev/nvme0n1 -D -WWW -s512
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=1.16 ms (warmup)
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=1.15 ms
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=1.14 ms
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=1.15 ms
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=1.17 ms
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=1.15 ms
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=1.13 ms (fast)
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=1.14 ms
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=1.22 ms (slow)
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=1.20 ms
> 
> --- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
> 9 requests completed in 10.5 ms, 4.50 KiB written, 860 iops, 430.1 KiB/s
> generated 10 requests in 9.00 s, 5 KiB, 1 iops, 568 B/s
> min/avg/max/mdev = 1.13 ms / 1.16 ms / 1.22 ms / 27.6 us
> root@pve-20:~#
> root@pve-20:~# ioping -c10 /dev/nvme0n1 -D -Y -WWW -s4K
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=2.03 ms (warmup)
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=2.04 ms
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=2.07 ms
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=2.07 ms
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=2.05 ms
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=2.02 ms
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=2.05 ms
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=2.09 ms (slow)
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=2.04 ms
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=1.99 ms (fast)
> 
> --- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
> 9 requests completed in 18.4 ms, 36 KiB written, 489 iops, 1.91 MiB/s
> generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
> min/avg/max/mdev = 1.99 ms / 2.04 ms / 2.09 ms / 29.0 us
> root@pve-20:~#
> root@pve-20:~# ioping -c10 /dev/nvme0n1 -D -WWW -s4K
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=703.4 us (warmup)
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=725.1 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=724.8 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=705.7 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=733.1 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=697.6 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=690.2 us (fast)
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=688.4 us (fast)
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=689.5 us (fast)
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=671.7 us (fast)
> 
> --- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
> 9 requests completed in 6.33 ms, 36 KiB written, 1.42 k iops, 5.56 MiB/s
> generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
> min/avg/max/mdev = 671.7 us / 702.9 us / 733.1 us / 19.6 us
> root@pve-20:~#
> root@pve-20:~# for i in /sys/block/*/queue/write_cache; do echo 'write through' > $i; done
> root@pve-20:~# ioping -c10 /dev/nvme0n1 -D -WWW -s4K
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=82.6 us (warmup)
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=89.3 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=61.7 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=74.0 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=89.4 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=62.5 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=74.1 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=81.3 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=78.1 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=84.3 us
> 
> --- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
> 9 requests completed in 694.9 us, 36 KiB written, 13.0 k iops, 50.6 MiB/s
> generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
> min/avg/max/mdev = 61.7 us / 77.2 us / 89.4 us / 9.67 us
> root@pve-20:~#
> 
> =============
> On the second server...
> 
> root@pve-21:~# for i in /sys/block/*/queue/write_cache; do echo 'write back' > $i; done
> root@pve-21:~# blkdiscard /dev/nvme0n1
> root@pve-21:~# ioping -c10 /dev/nvme0n1 -D -Y -WWW -s512
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=1.83 ms (warmup)
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=2.39 ms
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=2.40 ms
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=2.21 ms
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=2.44 ms
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=2.34 ms
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=2.34 ms
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=2.42 ms
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=2.22 ms (fast)
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=2.20 ms (fast)
> 
> --- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
> 9 requests completed in 21.0 ms, 4.50 KiB written, 429 iops, 214.7 KiB/s
> generated 10 requests in 9.00 s, 5 KiB, 1 iops, 568 B/s
> min/avg/max/mdev = 2.20 ms / 2.33 ms / 2.44 ms / 88.9 us
> root@pve-21:~# ioping -c10 /dev/nvme0n1 -D -WWW -s512
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=1.12 ms (warmup)
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=663.6 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=1.12 ms
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=1.11 ms
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=1.11 ms
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=1.16 ms
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=1.18 ms (slow)
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=1.11 ms
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=1.16 ms
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=1.17 ms (slow)
> 
> --- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
> 9 requests completed in 9.78 ms, 4.50 KiB written, 920 iops, 460.2 KiB/s
> generated 10 requests in 9.00 s, 5 KiB, 1 iops, 568 B/s
> min/avg/max/mdev = 663.6 us / 1.09 ms / 1.18 ms / 151.9 us
> root@pve-21:~# ioping -c10 /dev/nvme0n1 -D -Y -WWW -s4K
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=1.85 ms (warmup)
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=1.81 ms
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=1.82 ms
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=1.82 ms
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=2.01 ms
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=1.99 ms
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=1.98 ms
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=1.95 ms
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=1.83 ms
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=1.82 ms (fast)
> 
> --- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
> 9 requests completed in 17.0 ms, 36 KiB written, 528 iops, 2.06 MiB/s
> generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
> min/avg/max/mdev = 1.81 ms / 1.89 ms / 2.01 ms / 82.3 us
> root@pve-21:~# ioping -c10 /dev/nvme0n1 -D -WWW -s4K
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=673.1 us (warmup)
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=667.1 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=688.2 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=653.1 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=661.5 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=663.3 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=698.0 us (slow)
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=663.7 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=708.6 us (slow)
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=677.2 us
> 
> --- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
> 9 requests completed in 6.08 ms, 36 KiB written, 1.48 k iops, 5.78 MiB/s
> generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
> min/avg/max/mdev = 653.1 us / 675.6 us / 708.6 us / 17.7 us
> root@pve-21:~#
> root@pve-21:~# for i in /sys/block/*/queue/write_cache; do echo 'write through' > $i; done
> root@pve-21:~# ioping -c10 /dev/nvme0n1 -D -WWW -s4K
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=85.3 us (warmup)
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=79.8 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=74.7 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=85.6 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=66.8 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=92.2 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=73.5 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=65.0 us (fast)
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=73.0 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=73.2 us
> 
> --- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
> 9 requests completed in 683.7 us, 36 KiB written, 13.2 k iops, 51.4 MiB/s
> generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
> min/avg/max/mdev = 65.0 us / 76.0 us / 92.2 us / 8.17 us
> root@pve-21:~#
> 
> As you can see from the tests, write performance on the NVMe hardware is horrible when /sys/block/*/queue/write_cache is set to 'write back'.
> 
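> (A side note, not part of the runs above: a quick way to check the current
> setting on every block device at once is a simple grep over sysfs.)
> 
>     grep . /sys/block/*/queue/write_cache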
> 
> =====
> Let's go...
> 
> root@pve-21:~# modprobe brd rd_size=$((128*1024))
> root@pve-21:~# cat << EOT | dmsetup create zero
>     0 262144 linear /dev/ram0 0
>     262144 2147483648 zero
> > EOT
> root@pve-21:~# blkdiscard /dev/nvme0n1
> root@pve-21:~# make-bcache -w 512 -B /dev/mapper/zero -C /dev/nvme0n1 --writeback
> UUID:            563eaa85-43e9-491b-8c1f-f1b94a8f97c8
> Set UUID:        0dcec849-9ee9-41a9-b220-b1923e93cdb1
> version:        0
> nbuckets:        1831430
> block_size:        1
> bucket_size:        1024
> nr_in_set:        1
> nr_this_dev:        0
> first_bucket:        1
> UUID:            acdd0f18-4198-43dd-847a-087058d80c25
> Set UUID:        0dcec849-9ee9-41a9-b220-b1923e93cdb1
> version:        1
> block_size:        1
> data_offset:        16
> root@pve-21:~#
> root@pve-21:~# ioping -c10 /dev/bcache0 -D -Y -WWW -s512
> 512 B >>> /dev/bcache0 (block device 1.00 TiB): request=1 time=3.04 ms (warmup)
> 512 B >>> /dev/bcache0 (block device 1.00 TiB): request=2 time=1.98 ms
> 512 B >>> /dev/bcache0 (block device 1.00 TiB): request=3 time=1.88 ms
> 512 B >>> /dev/bcache0 (block device 1.00 TiB): request=4 time=1.95 ms
> 512 B >>> /dev/bcache0 (block device 1.00 TiB): request=5 time=1.78 ms
> 512 B >>> /dev/bcache0 (block device 1.00 TiB): request=6 time=1.92 ms
> 512 B >>> /dev/bcache0 (block device 1.00 TiB): request=7 time=1.87 ms
> 512 B >>> /dev/bcache0 (block device 1.00 TiB): request=8 time=1.87 ms
> 512 B >>> /dev/bcache0 (block device 1.00 TiB): request=9 time=1.87 ms
> 512 B >>> /dev/bcache0 (block device 1.00 TiB): request=10 time=1.83 ms
> 
> --- /dev/bcache0 (block device 1.00 TiB) ioping statistics ---
> 9 requests completed in 17.0 ms, 4.50 KiB written, 530 iops, 265.2 KiB/s
> generated 10 requests in 9.00 s, 5 KiB, 1 iops, 568 B/s
> min/avg/max/mdev = 1.78 ms / 1.89 ms / 1.98 ms / 57.4 us
> root@pve-21:~#
> root@pve-21:~# ioping -c10 /dev/bcache0 -D -WWW -s512
> 512 B >>> /dev/bcache0 (block device 1.00 TiB): request=1 time=1.12 ms (warmup)
> 512 B >>> /dev/bcache0 (block device 1.00 TiB): request=2 time=1.01 ms
> 512 B >>> /dev/bcache0 (block device 1.00 TiB): request=3 time=1.00 ms
> 512 B >>> /dev/bcache0 (block device 1.00 TiB): request=4 time=1.05 ms
> 512 B >>> /dev/bcache0 (block device 1.00 TiB): request=5 time=1.04 ms
> 512 B >>> /dev/bcache0 (block device 1.00 TiB): request=6 time=1.04 ms
> 512 B >>> /dev/bcache0 (block device 1.00 TiB): request=7 time=996.5 us (fast)
> 512 B >>> /dev/bcache0 (block device 1.00 TiB): request=8 time=1.01 ms
> 512 B >>> /dev/bcache0 (block device 1.00 TiB): request=9 time=994.3 us (fast)
> 512 B >>> /dev/bcache0 (block device 1.00 TiB): request=10 time=976.5 us (fast)
> 
> --- /dev/bcache0 (block device 1.00 TiB) ioping statistics ---
> 9 requests completed in 9.11 ms, 4.50 KiB written, 987 iops, 493.9 KiB/s
> generated 10 requests in 9.00 s, 5 KiB, 1 iops, 568 B/s
> min/avg/max/mdev = 976.5 us / 1.01 ms / 1.05 ms / 22.5 us
> root@pve-21:~#
> root@pve-21:~# ioping -c10 /dev/bcache0 -D -Y -WWW -s4K
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=1 time=1.43 ms (warmup)
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=2 time=1.39 ms
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=3 time=1.38 ms
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=4 time=1.40 ms
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=5 time=1.40 ms
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=6 time=1.43 ms
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=7 time=1.39 ms
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=8 time=1.42 ms
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=9 time=1.41 ms
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=10 time=1.40 ms
> 
> --- /dev/bcache0 (block device 1.00 TiB) ioping statistics ---
> 9 requests completed in 12.6 ms, 36 KiB written, 713 iops, 2.79 MiB/s
> generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
> min/avg/max/mdev = 1.38 ms / 1.40 ms / 1.43 ms / 13.1 us
> root@pve-21:~# ioping -c10 /dev/bcache0 -D -WWW -s4K
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=1 time=676.0 us (warmup)
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=2 time=638.0 us
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=3 time=659.5 us
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=4 time=650.2 us
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=5 time=644.0 us
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=6 time=644.4 us
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=7 time=652.1 us
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=8 time=641.8 us
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=9 time=658.0 us
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=10 time=642.7 us
> 
> --- /dev/bcache0 (block device 1.00 TiB) ioping statistics ---
> 9 requests completed in 5.83 ms, 36 KiB written, 1.54 k iops, 6.03 MiB/s
> generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
> min/avg/max/mdev = 638.0 us / 647.9 us / 659.5 us / 7.06 us
> root@pve-21:~#
> 
> =========
> Now, bcache -w 4k
> 
> root@pve-21:~# blkdiscard /dev/nvme0n1
> root@pve-21:~# make-bcache -w 4096 -B /dev/mapper/zero -C /dev/nvme0n1 --writeback
> UUID:            c955591f-21af-467d-b26a-5ff567af2001
> Set UUID:        8c477796-88ab-4b20-990a-cef8b2df040a
> version:        0
> nbuckets:        1831430
> block_size:        8
> bucket_size:        1024
> nr_in_set:        1
> nr_this_dev:        0
> first_bucket:        1
> UUID:            ea89b843-a019-4464-8da5-377ba44f0e6b
> Set UUID:        8c477796-88ab-4b20-990a-cef8b2df040a
> version:        1
> block_size:        8
> data_offset:        16
> root@pve-21:~#
> root@pve-21:~# ioping -c10 /dev/bcache0 -D -Y -WWW -s512
> ioping: request failed: Invalid argument
> root@pve-21:~# ioping -c10 /dev/bcache0 -D -Y -WWW -s4K
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=1 time=2.93 ms (warmup)
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=2 time=313.2 us
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=3 time=274.1 us
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=4 time=296.4 us
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=5 time=247.2 us
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=6 time=227.4 us
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=7 time=224.6 us (fast)
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=8 time=253.8 us
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=9 time=235.3 us
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=10 time=197.6 us (fast)
> 
> --- /dev/bcache0 (block device 1.00 TiB) ioping statistics ---
> 9 requests completed in 2.27 ms, 36 KiB written, 3.96 k iops, 15.5 MiB/s
> generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
> min/avg/max/mdev = 197.6 us / 252.2 us / 313.2 us / 34.7 us
> root@pve-21:~# ioping -c10 /dev/bcache0 -D -WWW -s4K
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=1 time=262.8 us (warmup)
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=2 time=255.9 us
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=3 time=239.9 us
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=4 time=228.8 us
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=5 time=252.3 us
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=6 time=237.1 us
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=7 time=237.5 us
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=8 time=232.3 us
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=9 time=243.3 us
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=10 time=232.7 us
> 
> --- /dev/bcache0 (block device 1.00 TiB) ioping statistics ---
> 9 requests completed in 2.16 ms, 36 KiB written, 4.17 k iops, 16.3 MiB/s
> generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
> min/avg/max/mdev = 228.8 us / 240.0 us / 255.9 us / 8.62 us
> 
> 
> =========
> On the first server
> 
> root@pve-20:~# modprobe brd rd_size=$((128*1024))
> root@pve-20:~# cat << EOT | dmsetup create zero
> >     0 262144 linear /dev/ram0 0
> >     262144 2147483648 zero
> > EOT
> root@pve-20:~# blkdiscard /dev/nvme0n1
> root@pve-20:~# make-bcache -w 512 -B /dev/mapper/zero -C /dev/nvme0n1 --writeback
> UUID:            f82f76a1-8f41-4a0a-9213-f4632fa372a4
> Set UUID:        d6ba5557-3055-4151-bd91-05db6a668ba7
> version:        0
> nbuckets:        1831430
> block_size:        1
> bucket_size:        1024
> nr_in_set:        1
> nr_this_dev:        0
> first_bucket:        1
> UUID:            5c3d1795-c484-4611-881f-bc991642aa76
> Set UUID:        d6ba5557-3055-4151-bd91-05db6a668ba7
> version:        1
> block_size:        1
> data_offset:        16
> root@pve-20:~# ls /sys/fs/bcache/
> d6ba5557-3055-4151-bd91-05db6a668ba7/ register
> pendings_cleanup                      register_quiet
> root@pve-20:~# ioping -c10 /dev/bcache0 -D -Y -WWW -s512
> 512 B >>> /dev/bcache0 (block device 1.00 TiB): request=1 time=3.05 ms (warmup)
> 512 B >>> /dev/bcache0 (block device 1.00 TiB): request=2 time=1.98 ms
> 512 B >>> /dev/bcache0 (block device 1.00 TiB): request=3 time=1.99 ms
> 512 B >>> /dev/bcache0 (block device 1.00 TiB): request=4 time=1.94 ms
> 512 B >>> /dev/bcache0 (block device 1.00 TiB): request=5 time=1.88 ms
> 512 B >>> /dev/bcache0 (block device 1.00 TiB): request=6 time=1.77 ms
> 512 B >>> /dev/bcache0 (block device 1.00 TiB): request=7 time=1.82 ms
> 512 B >>> /dev/bcache0 (block device 1.00 TiB): request=8 time=1.86 ms
> 512 B >>> /dev/bcache0 (block device 1.00 TiB): request=9 time=1.99 ms (slow)
> 512 B >>> /dev/bcache0 (block device 1.00 TiB): request=10 time=1.82 ms
> 
> --- /dev/bcache0 (block device 1.00 TiB) ioping statistics ---
> 9 requests completed in 17.1 ms, 4.50 KiB written, 527 iops, 263.9 KiB/s
> generated 10 requests in 9.00 s, 5 KiB, 1 iops, 568 B/s
> min/avg/max/mdev = 1.77 ms / 1.89 ms / 1.99 ms / 76.6 us
> root@pve-20:~# ioping -c10 /dev/bcache0 -D -WWW -s512
> 512 B >>> /dev/bcache0 (block device 1.00 TiB): request=1 time=1.05 ms (warmup)
> 512 B >>> /dev/bcache0 (block device 1.00 TiB): request=2 time=1.07 ms
> 512 B >>> /dev/bcache0 (block device 1.00 TiB): request=3 time=1.04 ms
> 512 B >>> /dev/bcache0 (block device 1.00 TiB): request=4 time=1.01 ms
> 512 B >>> /dev/bcache0 (block device 1.00 TiB): request=5 time=1.11 ms
> 512 B >>> /dev/bcache0 (block device 1.00 TiB): request=6 time=1.06 ms
> 512 B >>> /dev/bcache0 (block device 1.00 TiB): request=7 time=1.03 ms
> 512 B >>> /dev/bcache0 (block device 1.00 TiB): request=8 time=1.06 ms
> 512 B >>> /dev/bcache0 (block device 1.00 TiB): request=9 time=1.04 ms
> 512 B >>> /dev/bcache0 (block device 1.00 TiB): request=10 time=1.06 ms
> 
> --- /dev/bcache0 (block device 1.00 TiB) ioping statistics ---
> 9 requests completed in 9.49 ms, 4.50 KiB written, 948 iops, 474.1 KiB/s
> generated 10 requests in 9.00 s, 5 KiB, 1 iops, 568 B/s
> min/avg/max/mdev = 1.01 ms / 1.05 ms / 1.11 ms / 26.3 us
> root@pve-20:~# ioping -c10 /dev/bcache0 -D -Y -WWW -s4K
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=1 time=1.47 ms (warmup)
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=2 time=1.57 ms
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=3 time=1.57 ms
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=4 time=1.52 ms
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=5 time=1.11 ms
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=6 time=1.02 ms
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=7 time=1.03 ms (fast)
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=8 time=1.04 ms (fast)
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=9 time=1.45 ms
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=10 time=1.45 ms
> 
> --- /dev/bcache0 (block device 1.00 TiB) ioping statistics ---
> 9 requests completed in 11.8 ms, 36 KiB written, 765 iops, 2.99 MiB/s
> generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
> min/avg/max/mdev = 1.02 ms / 1.31 ms / 1.57 ms / 232.7 us
> root@pve-20:~# ioping -c10 /dev/bcache0 -D -WWW -s4K
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=1 time=249.7 us (warmup)
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=2 time=671.3 us
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=3 time=663.0 us
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=4 time=655.3 us
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=5 time=664.0 us
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=6 time=693.7 us
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=7 time=610.5 us (fast)
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=8 time=217.8 us (fast)
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=9 time=223.0 us (fast)
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=10 time=219.7 us (fast)
> 
> --- /dev/bcache0 (block device 1.00 TiB) ioping statistics ---
> 9 requests completed in 4.62 ms, 36 KiB written, 1.95 k iops, 7.61 MiB/s
> generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
> min/avg/max/mdev = 217.8 us / 513.1 us / 693.7 us / 208.2 us
> root@pve-20:~#
> 
> root@pve-20:~# blkdiscard /dev/nvme0n1
> root@pve-20:~# make-bcache -w 4096 -B /dev/mapper/zero -C /dev/nvme0n1 --writeback
> UUID:            c0252cdb-6a3b-43c1-8c86-3f679dd61d06
> Set UUID:        e56ca07c-4b1a-4bea-8bd4-2cabb60cb4f0
> version:        0
> nbuckets:        1831430
> block_size:        8
> bucket_size:        1024
> nr_in_set:        1
> nr_this_dev:        0
> first_bucket:        1
> UUID:            2c501dde-dd04-4294-9e35-8f3b57fdd75d
> Set UUID:        e56ca07c-4b1a-4bea-8bd4-2cabb60cb4f0
> version:        1
> block_size:        8
> data_offset:        16
> root@pve-20:~# ioping -c10 /dev/bcache0 -D -Y -WWW -s512
> ioping: request failed: Invalid argument
> root@pve-20:~# ioping -c10 /dev/bcache0 -D -WWW -s512
> ioping: request failed: Invalid argument
> root@pve-20:~# ioping -c10 /dev/bcache0 -D -Y -WWW -s4K
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=1 time=2.91 ms (warmup)
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=2 time=227.9 us
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=3 time=353.8 us
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=4 time=193.2 us
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=5 time=189.0 us
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=6 time=340.3 us
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=7 time=259.8 us
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=8 time=254.9 us
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=9 time=285.3 us
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=10 time=282.7 us
> 
> --- /dev/bcache0 (block device 1.00 TiB) ioping statistics ---
> 9 requests completed in 2.39 ms, 36 KiB written, 3.77 k iops, 14.7 MiB/s
> generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
> min/avg/max/mdev = 189.0 us / 265.2 us / 353.8 us / 54.5 us
> root@pve-20:~# ioping -c10 /dev/bcache0 -D -WWW -s4K
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=1 time=276.3 us (warmup)
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=2 time=224.6 us
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=3 time=226.8 us
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=4 time=240.1 us
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=5 time=237.4 us
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=6 time=231.6 us
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=7 time=238.1 us
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=8 time=199.1 us (fast)
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=9 time=240.4 us (slow)
> 4 KiB >>> /dev/bcache0 (block device 1.00 TiB): request=10 time=280.5 us (slow)
> 
> --- /dev/bcache0 (block device 1.00 TiB) ioping statistics ---
> 9 requests completed in 2.12 ms, 36 KiB written, 4.25 k iops, 16.6 MiB/s
> generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
> min/avg/max/mdev = 199.1 us / 235.4 us / 280.5 us / 20.0 us
> root@pve-20:~#
> 
> 
> In addition to what was requested, I decided to add these results to help:
> 
> root@pve-20:~# ioping -c5 /dev/mapper/zero -D -Y -WWW -s512
> 512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=1 time=14.4 us (warmup)
> 512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=2 time=19.9 us
> 512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=3 time=23.4 us
> 512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=4 time=17.5 us
> 512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=5 time=19.4 us
> 
> --- /dev/mapper/zero (block device 1.00 TiB) ioping statistics ---
> 4 requests completed in 80.2 us, 2 KiB written, 49.9 k iops, 24.4 MiB/s
> generated 5 requests in 4.00 s, 2.50 KiB, 1 iops, 639 B/s
> 
> min/avg/max/mdev = 17.5 us / 20.0 us / 23.4 us / 2.12 us
> 
> root@pve-20:~# ioping -c5 /dev/mapper/zero -D -WWW -s512
> 512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=1 time=14.4 us (warmup)
> 512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=2 time=13.0 us
> 512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=3 time=18.8 us
> 512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=4 time=17.4 us
> 512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=5 time=18.8 us
> 
> --- /dev/mapper/zero (block device 1.00 TiB) ioping statistics ---
> 4 requests completed in 67.9 us, 2 KiB written, 58.9 k iops, 28.8 MiB/s
> generated 5 requests in 4.00 s, 2.50 KiB, 1 iops, 639 B/s
> min/avg/max/mdev = 13.0 us / 17.0 us / 18.8 us / 2.38 us
> root@pve-20:~# ioping -c5 /dev/mapper/zero -D -Y -WWW -s4K
> 4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=1 time=24.6 us (warmup)
> 4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=2 time=27.2 us
> 4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=3 time=21.1 us
> 4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=4 time=17.0 us
> 4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=5 time=22.8 us
> 
> --- /dev/mapper/zero (block device 1.00 TiB) ioping statistics ---
> 4 requests completed in 88.0 us, 16 KiB written, 45.4 k iops, 177.5 MiB/s
> generated 5 requests in 4.00 s, 20 KiB, 1 iops, 5.00 KiB/s
> min/avg/max/mdev = 17.0 us / 22.0 us / 27.2 us / 3.65 us
> root@pve-20:~# ioping -c5 /dev/mapper/zero -D -WWW -s4K
> 4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=1 time=22.9 us (warmup)
> 4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=2 time=15.7 us
> 4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=3 time=21.5 us
> 4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=4 time=21.1 us
> 4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=5 time=24.3 us
> 
> --- /dev/mapper/zero (block device 1.00 TiB) ioping statistics ---
> 4 requests completed in 82.6 us, 16 KiB written, 48.4 k iops, 189.2 MiB/s
> generated 5 requests in 4.00 s, 20 KiB, 1 iops, 5.00 KiB/s
> min/avg/max/mdev = 15.7 us / 20.6 us / 24.3 us / 3.09 us
> root@pve-20:~#
> 
> root@pve-20:~# blkdiscard /dev/nvme0n1
> root@pve-20:~# ioping -c10 /dev/nvme0n1 -D -Y -WWW -s4K
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=82.7 us (warmup)
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=78.6 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=63.2 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=72.4 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=75.4 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=82.4 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=71.9 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=84.8 us (slow)
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=95.6 us (slow)
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=84.9 us
> 
> --- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
> 9 requests completed in 709.1 us, 36 KiB written, 12.7 k iops, 49.6 MiB/s
> generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
> min/avg/max/mdev = 63.2 us / 78.8 us / 95.6 us / 8.89 us
> root@pve-20:~# ioping -c10 /dev/nvme0n1 -D -WWW -s4K
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=68.3 us (warmup)
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=70.3 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=81.4 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=81.9 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=83.0 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=91.7 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=71.1 us (fast)
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=87.9 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=81.2 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=60.4 us (fast)
> 
> --- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
> 9 requests completed in 708.9 us, 36 KiB written, 12.7 k iops, 49.6 MiB/s
> generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
> min/avg/max/mdev = 60.4 us / 78.8 us / 91.7 us / 9.18 us
> root@pve-20:~# ioping -c10 /dev/nvme0n1 -D -Y -WWW -s512
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=59.2 us (warmup)
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=63.6 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=64.8 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=63.4 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=516.1 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=502.1 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=510.5 us (slow)
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=502.9 us (slow)
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=496.3 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=505.5 us (slow)
> 
> --- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
> 9 requests completed in 3.23 ms, 4.50 KiB written, 2.79 k iops, 1.36 MiB/s
> generated 10 requests in 9.00 s, 5 KiB, 1 iops, 568 B/s
> min/avg/max/mdev = 63.4 us / 358.4 us / 516.1 us / 208.2 us
> root@pve-20:~# ioping -c10 /dev/nvme0n1 -D -WWW -s512
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=491.5 us (warmup)
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=496.1 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=506.9 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=510.7 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=503.2 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=6 time=501.4 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=7 time=498.8 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=8 time=510.4 us (slow)
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=9 time=502.4 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=10 time=501.3 us
> 
> --- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
> 9 requests completed in 4.53 ms, 4.50 KiB written, 1.99 k iops, 993.1 KiB/s
> generated 10 requests in 9.00 s, 5 KiB, 1 iops, 568 B/s
> min/avg/max/mdev = 496.1 us / 503.5 us / 510.7 us / 4.70 us
> root@pve-20:~#
> 
> 
> root@pve-21:~# ioping -c10 /dev/mapper/zero -D -Y -WWW -s512
> 512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=1 time=13.4 us (warmup)
> 512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=2 time=22.6 us
> 512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=3 time=15.3 us
> 512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=4 time=26.1 us
> 512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=5 time=15.2 us
> 512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=6 time=20.8 us
> 512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=7 time=24.9 us
> 512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=8 time=15.2 us (fast)
> 512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=9 time=15.2 us (fast)
> 512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=10 time=15.7 us (fast)
> 
> --- /dev/mapper/zero (block device 1.00 TiB) ioping statistics ---
> 9 requests completed in 171.0 us, 4.50 KiB written, 52.6 k iops, 25.7 MiB/s
> generated 10 requests in 9.00 s, 5 KiB, 1 iops, 568 B/s
> min/avg/max/mdev = 15.2 us / 19.0 us / 26.1 us / 4.34 us
> root@pve-21:~# ioping -c10 /dev/mapper/zero -D -WWW -s512
> 512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=1 time=14.3 us (warmup)
> 512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=2 time=22.4 us
> 512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=3 time=25.9 us
> 512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=4 time=14.8 us
> 512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=5 time=24.8 us
> 512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=6 time=24.6 us
> 512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=7 time=13.7 us (fast)
> 512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=8 time=18.2 us
> 512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=9 time=15.4 us
> 512 B >>> /dev/mapper/zero (block device 1.00 TiB): request=10 time=15.2 us
> 
> --- /dev/mapper/zero (block device 1.00 TiB) ioping statistics ---
> 9 requests completed in 174.9 us, 4.50 KiB written, 51.5 k iops, 25.1 MiB/s
> generated 10 requests in 9.00 s, 5 KiB, 1 iops, 568 B/s
> min/avg/max/mdev = 13.7 us / 19.4 us / 25.9 us / 4.67 us
> root@pve-21:~# ioping -c10 /dev/mapper/zero -D -Y -WWW -s4K
> 4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=1 time=22.3 us (warmup)
> 4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=2 time=17.3 us
> 4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=3 time=26.0 us
> 4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=4 time=27.0 us
> 4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=5 time=15.7 us
> 4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=6 time=18.1 us
> 4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=7 time=17.8 us
> 4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=8 time=16.9 us
> 4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=9 time=15.4 us (fast)
> 4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=10 time=15.5 us (fast)
> 
> --- /dev/mapper/zero (block device 1.00 TiB) ioping statistics ---
> 9 requests completed in 169.7 us, 36 KiB written, 53.0 k iops, 207.2 MiB/s
> generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
> min/avg/max/mdev = 15.4 us / 18.9 us / 27.0 us / 4.21 us
> root@pve-21:~# ioping -c10 /dev/mapper/zero -D -WWW -s4K
> 4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=1 time=22.4 us (warmup)
> 4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=2 time=15.3 us
> 4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=3 time=26.1 us
> 4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=4 time=15.0 us
> 4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=5 time=15.0 us
> 4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=6 time=17.8 us
> 4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=7 time=15.3 us (fast)
> 4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=8 time=15.3 us (fast)
> 4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=9 time=15.0 us (fast)
> 4 KiB >>> /dev/mapper/zero (block device 1.00 TiB): request=10 time=14.9 us (fast)
> 
> --- /dev/mapper/zero (block device 1.00 TiB) ioping statistics ---
> 9 requests completed in 149.6 us, 36 KiB written, 60.2 k iops, 235.0 MiB/s
> generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
> min/avg/max/mdev = 14.9 us / 16.6 us / 26.1 us / 3.47 us
> root@pve-21:~#
> root@pve-21:~# blkdiscard /dev/nvme0n1
> root@pve-21:~# ioping -c5 /dev/nvme0n1 -D -Y -WWW -s512
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=461.1 us (warmup)
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=476.4 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=479.3 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=480.2 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=480.9 us
> 
> --- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
> 4 requests completed in 1.92 ms, 2 KiB written, 2.09 k iops, 1.02 MiB/s
> generated 5 requests in 4.00 s, 2.50 KiB, 1 iops, 639 B/s
> min/avg/max/mdev = 476.4 us / 479.2 us / 480.9 us / 1.73 us
> root@pve-21:~# ioping -c5 /dev/nvme0n1 -D -WWW -s512
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=456.1 us (warmup)
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=423.0 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=424.8 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=433.3 us
> 512 B >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=446.3 us
> 
> --- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
> 4 requests completed in 1.73 ms, 2 KiB written, 2.31 k iops, 1.13 MiB/s
> generated 5 requests in 4.00 s, 2.50 KiB, 1 iops, 639 B/s
> min/avg/max/mdev = 423.0 us / 431.9 us / 446.3 us / 9.23 us
> root@pve-21:~# ioping -c5 /dev/nvme0n1 -D -Y -WWW -s4K
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=88.9 us (warmup)
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=79.8 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=70.9 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=94.3 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=72.8 us
> 
> --- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
> 4 requests completed in 317.7 us, 16 KiB written, 12.6 k iops, 49.2 MiB/s
> generated 5 requests in 4.00 s, 20 KiB, 1 iops, 5.00 KiB/s
> min/avg/max/mdev = 70.9 us / 79.4 us / 94.3 us / 9.20 us
> root@pve-21:~# ioping -c5 /dev/nvme0n1 -D -WWW -s4K
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=1 time=86.4 us (warmup)
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=2 time=119.0 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=3 time=66.1 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=4 time=72.4 us
> 4 KiB >>> /dev/nvme0n1 (block device 894.3 GiB): request=5 time=73.1 us
> 
> --- /dev/nvme0n1 (block device 894.3 GiB) ioping statistics ---
> 4 requests completed in 330.6 us, 16 KiB written, 12.1 k iops, 47.3 MiB/s
> generated 5 requests in 4.00 s, 20 KiB, 1 iops, 5.00 KiB/s
> min/avg/max/mdev = 66.1 us / 82.7 us / 119.0 us / 21.2 us
> root@pve-21:~#
> 
> On Thursday, May 26, 2022 at 17:28:36 BRT, Eric Wheeler <bcache@lists.ewheeler.net> wrote:
> 
> On Thu, 26 May 2022, Adriano Silva wrote:
> > This is an enterprise NVMe device with a power-loss-protection system. It
> > has a non-volatile cache.
> > 
> > Before purchasing these enterprise devices, I did tests with consumer
> > NVMe. Consumer device performance is acceptable only for hardware-cached
> > writes. By contrast, on consumer devices, fio tests with direct and
> > synchronous writes (--direct=1 --fsync=1 --rw=randwrite --bs=4K
> > --numjobs=1 --iodepth=1) show very low performance. So today I'm using
> > enterprise NVMe with tantalum capacitors, which makes the cache
> > non-volatile and performs much better when written to directly. But the
> > performance issue only occurs when the write is directed to the bcache
> > device.
> > 
> > Here is information from my Hardware you asked for (Eric), plus some 
> > additional information to try to help.
> > 
> > root@pve-20:/# blockdev --getss /dev/nvme0n1
> > 512
> > root@pve-20:/# blockdev --report /dev/nvme0n1
> > RO    RA   SSZ   BSZ   StartSec            Size   Device
> > rw   256   512  4096          0    960197124096   /dev/nvme0n1
> 
> > root@pve-20:~# nvme id-ctrl -H /dev/nvme0n1 |grep -A1 vwc
> > vwc       : 0
> >   [0:0] : 0    Volatile Write Cache Not Present
> 
> Please confirm that this says "write back":
> 
> ]# cat /sys/block/nvme0n1/queue/write_cache 
> 
> Try this to set _all_ blockdevs to write-back and see if it affects
> performance (warning: power loss is unsafe for non-volatile caches after 
> this command):
> 
> ]# for i in /sys/block/*/queue/write_cache; do echo 'write back' > $i; done
> 
> > An interesting thing to note is that when I test using fio with 
> > --bs=512, the direct hardware performance is horrible (~1MB/s).
> 
> I think you know this already, but for CYA:
> 
>   WARNING: THESE ARE DESTRUCTIVE WRITES, DO NOT USE ON PRODUCTION DATA!
> 
> Please post `ioping` stats for each server you are testing (some of these 
> you may have already posted, but if you can place them inline of this same 
> response it would be helpful so we don't need to dig into old emails).
> 
> ]# blkdiscard /dev/nvme0n1
> 
> ]# ioping -c10 /dev/nvme0n1 -D -Y -WWW -s512
> ]# ioping -c10 /dev/nvme0n1 -D -WWW -s512
> 
> ]# ioping -c10 /dev/nvme0n1 -D -Y -WWW -s4k
> ]# ioping -c10 /dev/nvme0n1 -D -WWW -s4k
> 
> Next, let's rule out backing-device interference by creating a dummy
> mapper device that has 128mb of ramdisk for persistent meta storage
> (superblock, LVM, etc) but presents as a 1TB volume in size; writes
> beyond 128mb are dropped:
> 
>     modprobe brd rd_size=$((128*1024))
> 
>     ]# cat << EOT | dmsetup create zero
>     0 262144 linear /dev/ram0 0
>     262144 2147483648 zero
>     EOT
> 
> Then use that as your backing device:
> 
>     ]# blkdiscard /dev/nvme0n1
>     ]# make-bcache -w 512 -B /dev/mapper/zero -C /dev/nvme0n1 --writeback
> 
> ]# ioping -c10 /dev/bcache0 -D -Y -WWW -s512
> ]# ioping -c10 /dev/bcache0 -D -WWW -s512
> 
> ]# ioping -c10 /dev/bcache0 -D -Y -WWW -s4k
> ]# ioping -c10 /dev/bcache0 -D -WWW -s4k
> 
> Test again with -w 4096:
>     ]# blkdiscard /dev/nvme0n1
>     ]# make-bcache -w 4096 -B /dev/mapper/zero -C /dev/nvme0n1 --writeback
> 
> ]# ioping -c10 /dev/bcache0 -D -Y -WWW -s4k
> ]# ioping -c10 /dev/bcache0 -D -WWW -s4k
> 
> # These should error with -w 4096 because 512 is too small:
> 
> ]# ioping -c10 /dev/bcache0 -D -Y -WWW -s512
> ]# ioping -c10 /dev/bcache0 -D -WWW -s512
> 
> > root@pve-20:/# fio --filename=/dev/nvme0n1p2 --direct=1 --fsync=1 --rw=randwrite --bs=512 --numjobs=1 --iodepth=1 --runtime=5 --time_based --group_reporting --name=journal-test --ioengine=libaio
> >   write: IOPS=2087, BW=1044KiB/s (1069kB/s)(5220KiB/5001msec); 0 zone resets
> >          ^^^^^^^^^ 
> > But the same test directly on the hardware with fio passing the
> > parameter --bs=4K, the performance completely changes, for the better
> > (~130MB/s).
> >
> > root@pve-20:/# fio --filename=/dev/nvme0n1p2 --direct=1 --fsync=1 --rw=randwrite --bs=4K --numjobs=1 --iodepth=1 --runtime=5 --time_based --group_reporting --name=journal-test --ioengine=libaio
> >   write: IOPS=31.9k, BW=124MiB/s (131MB/s)(623MiB/5001msec); 0 zone resets
> >          ^^^^^^^^^^
> > Does anything justify this difference?
> 
> I think you may have discovered the problem and the `ioping`s above
> might confirm that.
> 
> IOPS are a better metric here than MB/sec, because smaller IOs always
> mean lower bandwidth and RTT is a factor.  However, IOPS are ~16x lower
> than the expected 8x difference (512/4096 = 1/8), so something else is
> going on.
> 
> The hardware is probably addressed 4k internally "4Kn" (with even larger 
> erase pages that the FTL manages).  Sending it a bunch of 512-byte IOs may 
> trigger a read-modify-write operation on the flash controller and is 
> (probably) spinning CPU cycles on the flash controller itself. A firmware 
> upgrade on the NVMe might help if they have addressed this.
> 
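> If you want to check what firmware the drive runs, and whether it exposes a
> native 4K LBA format, nvme-cli can show that.  This is only a suggestion,
> and reformatting with `nvme format` would be destructive:
> 
>     ]# nvme id-ns /dev/nvme0n1 -H | grep -A2 'LBA Format'
>     ]# nvme fw-log /dev/nvme0n1
> 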
> This is speculation, but assuming that internally the flash uses 4k
> sectors, it is doing something like this (pseudo code):
> 
>     1. new_data = fetch_from_pcie()
>     2. rmw = read_sector(LBA)
>     3. memcpy(rmw+offset, new_data, 512)
>     4. queue_write_to_flash(rmw, LBA)
> 
> > Maybe that's why when I create bcache with the -w=4K option the 
> > performance improves. Not as much as I'd like, but it gets better.
> > [...] 
> > The buckets, I read that it would be better to put the hardware device 
> > erase block size. However, I have already tried to find this information 
> > by reading the device, also with the manufacturer, but without success. 
> > So I have no idea which bucket size would be best, but from my tests, 
> > the default of 512KB seems to be adequate.
> 
> It might be worth testing power-of-2 bucket sizes to see what works best
> for your workload.  Note that `fio --rw=randwrite` may not be
> representative of your "real" workload, so randwrite is only a starting
> point; benchmark your real workload against the different bucket sizes.
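> 
> Something like the sketch below could be used to sweep bucket sizes (a
> rough, untested outline; it assumes the /dev/mapper/zero backing device
> from above, that nothing else uses the NVMe, and it is destructive):
> 
>     for b in 128k 256k 512k 1M 2M; do
>         echo 1 > /sys/block/bcache0/bcache/stop 2>/dev/null
>         echo 1 > /sys/fs/bcache/*/unregister 2>/dev/null
>         sleep 2
>         blkdiscard /dev/nvme0n1
>         make-bcache -w 4096 --bucket $b -B /dev/mapper/zero -C /dev/nvme0n1 --writeback
>         sleep 2
>         fio --filename=/dev/bcache0 --direct=1 --fsync=1 --rw=randwrite --bs=4K \
>             --numjobs=1 --iodepth=1 --runtime=30 --time_based \
>             --name=bucket-$b --ioengine=libaio
>     done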
> 
> > Eric, perhaps it is not such a simple task to recompile the Kernel with 
> > the suggested change. I'm working with Proxmox 6.4. I'm not sure, but I 
> > think the Kernel may have some adaptation. It is based on Kernel 5.4, 
> > which it is approved for.
> 
> Keith and Christoph corrected me; as noted above, this does the same 
> thing, so no need to hack on the kernel to change flush behavior:
> 
>     echo 'write back' > /sys/block/<DEV>/queue/write_cache
> 
> > Also listening to Coly's suggestion, I'll try to perform tests with the 
> > Kernel version 5.15 to see if it can solve. Would this version be good 
> > enough? It's just that, as I said above, as I'm using Proxmox, I'm 
> > afraid to change the Kernel version they provide.
> 
> I'm guessing proxmox doesn't care too much about the kernel version as
> long as the modules you use are built.  Just copy your existing .config
> (usually /boot/config-<version>) as
> kernel-source-dir/.config and run `make oldconfig` (or `make menuconfig`
> and save+exit, which is what I usually do).
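> 
> Roughly (the paths and version below are just an example, adjust for
> whatever 5.15 source tree you end up using):
> 
>     cp /boot/config-$(uname -r) /path/to/linux-5.15/.config
>     cd /path/to/linux-5.15
>     make oldconfig
>     make -j$(nproc) bindeb-pkg    # builds .deb packages on Debian/Proxmox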
> 
> > Eric, to be clear, the hardware I'm using has only 1 processor socket.
> 
> Ok, so not a cacheline bounce issue.
> 
> > I'm trying to test with another identical computer (the same 
> > motherboard, the same processor, the same NVMe, with the difference that 
> > it only has 12GB of RAM, the first having 48GB). It is an HP Z400 
> > Workstation with an Intel Xeon X5680 sixcore processor (12 threads), 
> > DDR3 1333MHz 10600E (old computer).
> 
> Is this second server still a single-socket?
> 
> > On the second computer, I put a newer version of the distribution that 
> > uses Kernel based on version 5.15. I am now comparing the performance of 
> > the two computers in the lab.
> > 
> > On this second computer I had worse performance than the first one 
> > (practically half the performance with bcache), despite the performance 
> > of the tests done directly in NVME being identical.
> > 
> > I tried going back to the same OS version on the first computer to try 
> > and keep the exact same scenario on both computers so I could first 
> > compare the two. I try to keep the exact same software configuration. 
> > However, there were no changes. Is it the low RAM that makes the 
> > performance worse in the second?
> 
> The amount of memory isn't an issue, but CPU clock speed or memory speed
> might be.  If server-2 has two sockets, then make sure NVMe interrupts hit the
> socket where it is attached.  Could be a PCIe version thing, but I 
> don't think you are saturating the PCIe link.
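> 
> A quick way to check which NUMA node the NVMe sits on and where its
> interrupts are landing (plain sysfs/procfs paths, nothing bcache-specific):
> 
>     ]# cat /sys/class/nvme/nvme0/device/numa_node
>     ]# grep -i nvme /proc/interrupts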
> 
> > I noticed a difference in behavior on the second computer compared to 
> > the first in dstat. While the first computer doesn't seem to touch the 
> > backup device at all, the second computer signals something a little 
> > different, as although it doesn't write data to the backup disk, it does 
> > signal IO movement. Strange no?
> > 
> > Let's look at the dstat of the first computer:
> > 
> > --dsk/sdb---dsk/nvme0n1-dsk/bcache0 ---io/sdb----io/nvme0n1--io/bcache0 -net/total- ---load-avg--- --total-cpu-usage-- ---system-- ----system---- async
> >  read  writ: read  writ: read  writ| read  writ: read  writ: read  writ| recv  send| 1m   5m  15m |usr sys idl wai stl| int   csw |     time     | #aio
> >    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |6953B 7515B|0.13 0.26 0.26|  0   0  99   0   0| 399   634 |25-05 09:41:42|   0
> >    0  8192B:4096B 2328k:   0  1168k|   0  2.00 :1.00   586 :   0   587 |9150B 2724B|0.13 0.26 0.26|  2   2  96   0   0|1093  3267 |25-05 09:41:43|   1B
> >    0     0 :   0    58M:   0    29M|   0     0 :   0  14.8k:   0  14.7k|  14k 9282B|0.13 0.26 0.26|  1   3  94   2   0|  16k   67k|25-05 09:41:44|   1B
> >    0     0 :   0    58M:   0    29M|   0     0 :   0  14.9k:   0  14.8k|  10k 8992B|0.13 0.26 0.26|  1   3  93   2   0|  16k   69k|25-05 09:41:45|   1B
> >    0     0 :   0    58M:   0    29M|   0     0 :   0  14.9k:   0  14.8k|7281B 4651B|0.13 0.26 0.26|  1   3  92   4   0|  16k   67k|25-05 09:41:46|   1B
> >    0     0 :   0    59M:   0    30M|   0     0 :   0  15.2k:   0  15.1k|7849B 4729B|0.20 0.28 0.27|  1   4  94   2   0|  16k   69k|25-05 09:41:47|   1B
> >    0     0 :   0    57M:   0    28M|   0     0 :   0  14.4k:   0  14.4k|  11k 8584B|0.20 0.28 0.27|  1   3  94   2   0|  15k   65k|25-05 09:41:48|   0
> >    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |4086B 7720B|0.20 0.28 0.27|  0   0 100   0   0| 274   332 |25-05 09:41:49|   0
> > 
> > Note that on this first computer, writes and IOs on the backing device
> > (sdb) stay at zero, while the NVMe device IOs track the bcache0 device
> > IOs at ~14.8K.
> > 
> > Let's see the dstat now on the second computer:
> > 
> > --dsk/sdd---dsk/nvme0n1-dsk/bcache0 ---io/sdd----io/nvme0n1--io/bcache0 -net/total- ---load-avg--- --total-cpu-usage-- ---system-- ----system---- async
> >  read  writ: read  writ: read  writ| read  writ: read  writ: read  writ| recv  send| 1m   5m  15m |usr sys idl wai stl| int   csw |     time     | #aio
> >    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |9254B 3301B|0.15 0.19 0.11|  1   2  97   0   0| 360   318 |26-05 06:27:15|   0
> >    0  8192B:4096B   19M:   0  9600k|   0  2402 :1.00  4816 :   0  4801 |8826B 3619B|0.15 0.19 0.11|  0   1  98   0   0|8115    27k|26-05 06:27:16|   1B
> >    0     0 :   0    21M:   0    11M|   0  2737 :   0  5492 :   0  5474 |4051B 2552B|0.15 0.19 0.11|  0   2  97   1   0|9212    31k|26-05 06:27:17|   1B
> >    0     0 :   0    23M:   0    11M|   0  2890 :   0  5801 :   0  5781 |4816B 2492B|0.15 0.19 0.11|  1   2  96   2   0|9976    34k|26-05 06:27:18|   1B
> >    0     0 :   0    23M:   0    11M|   0  2935 :   0  5888 :   0  5870 |4450B 2552B|0.22 0.21 0.12|  0   2  96   2   0|9937    33k|26-05 06:27:19|   1B
> >    0     0 :   0    22M:   0    11M|   0  2777 :   0  5575 :   0  5553 |8644B 1614B|0.22 0.21 0.12|  0   2  98   0   0|9416    31k|26-05 06:27:20|   1B
> >    0     0 :   0  2096k:   0  1040k|   0   260 :   0   523 :   0   519 |  10k 8760B|0.22 0.21 0.12|  0   1  99   0   0|1246  3157 |26-05 06:27:21|   0
> >    0     0 :   0     0 :   0     0 |   0     0 :   0     0 :   0     0 |4083B 2990B|0.22 0.21 0.12|  0   0 100   0   0| 390   369 |26-05 06:27:22|   0
> 
> > In this case, with exactly the same command, we got a very different 
> > result. While writes to the backing device (sdd) do not happen (this is 
> > correct), we noticed that IOs occur on both the NVMe device and the 
> > backing device (i think this is wrong), but at a much lower rate now, 
> > around 5.6K on NVMe and 2.8K on the backing device. It leaves the 
> > impression that although it is not writing anything to sdd device, it is 
> > sending some signal to the backing device in each two IO operations that 
> > is performed with the cache device. And that would be delaying the 
> > answer. Could it be something like this?
> 
> I think in newer kernels that bcache is more aggressive at writeback. 
> Using /dev/mapper/zero as above will help rule out backing device 
> interference.  Also make sure you have the sysfs flags turned to encourage 
> it to write to SSD and not bypass:
> 
>     echo 0 > /sys/block/bcache0/bcache/sequential_cutoff
>     echo 10000000 > /sys/block/bcache0/bcache/cache/congested_read_threshold_us 
>     echo 10000000 > /sys/block/bcache0/bcache/cache/congested_write_threshold_us
> 
> > It is important to point out that the writeback mode is on, obviously, 
> > and that the sequential cutoff is at zero, but I tried to put default 
> > values or high values and there were no changes. I also tried changing 
> > congested_write_threshold_us and congested_read_threshold_us, also with 
> > no result changes.
> 
> Try this too: 
>     echo 300 > /sys/block/bcache0/bcache/writeback_delay
> 
> and make sure bcache is in writeback (echo writeback > 
> /sys/block/bcache0/bcache0/cache_mode) in case that was not configured on 
> server2.
> 
> 
> -Eric
> 
> > The only thing I noticed different between the configurations of the two 
> > computers was btree_cache_size, which on the first is much larger (7.7M) 
> > m while on the second it is only 768K. But I don't know if this 
> > parameter is configurable and if it could justify the difference.
> > 
> > Disabling Intel's Turbo Boost technology through the BIOS appears to 
> > have no effect.
> > 
> > And we will continue our tests comparing the two computers, including to 
> > test the two versions of the Kernel. If anyone else has ideas, thanks!
> 
> 
> > 
> > On Tuesday, May 17, 2022 at 22:23:09 BRT, Eric Wheeler <bcache@lists.ewheeler.net> wrote: 
> > 
> > 
> > 
> > 
> > 
> > On Tue, 10 May 2022, Adriano Silva wrote:
> > > I'm trying to set up a flash disk NVMe as a disk cache for two or three 
> > > isolated (I will use 2TB disks, but in these tests I used a 1TB one) 
> > > spinning disks that I have on a Linux 5.4.174 (Proxmox node).
> > 
> > Coly has been adding quite a few optimizations over the years.  You might 
> > try a new kernel and see if that helps.  More below.
> > 
> > > I'm using a NVMe (960GB datacenter devices with tantalum capacitors) as 
> > > a cache.
> > > [...]
> > >
> > > But when I do the same test on bcache writeback, the performance drops a 
> > > lot. Of course, it's better than the performance of spinning disks, but 
> > > much worse than when accessed directly from the NVMe device hardware.
> > >
> > > [...]
> > > As we can see, the same test done on the bcache0 device only got 1548 
> > > IOPS and that yielded only 6.3 KB/s.
> > 
> > Well done on the benchmarking!  I always thought our new NVMes performed 
> > slower than expected but hadn't gotten around to investigating. 
> > 
> > > I've noticed in several tests, varying the amount of jobs or increasing 
> > > the size of the blocks, that the larger the size of the blocks, the more 
> > > I approximate the performance of the physical device to the bcache 
> > > device.
> > 
> > You said "blocks" but did you mean bucket size (make-bcache -b) or block 
> > size (make-bcache -w) ?
> > 
> > If larger buckets make it slower, then that actually surprises me: bigger 
> > buckets means less metadata and better sequential writeback to the 
> > spinning disks (though you hadn't yet hit writeback to spinning disks in 
> > your stats).  Maybe you already tried, but varying the bucket size might 
> > help.  Try graphing bucket size (powers of 2) against IOPS, maybe there is 
> > a "sweet spot"?
> > 
> > Be aware that 4k blocks (so-called "4Kn") is unsafe for the cache device, 
> > unless Coly has patched that.  Make sure your `blockdev --getss` reports 
> > 512 for your NVMe!
> > 
> > Hi Coly,
> > 
> > Some time ago you ordered an SSD to test the 4k cache issue; has that 
> > been fixed?  I've kept an eye out for the patch but not sure if it was released.
> > 
> > You have a really great test rig setup with NVMes for stress
> > testing bcache. Can you replicate Adriano's `ioping` numbers below?
> > 
> > > With ioping it is also possible to notice a limitation, as the latency 
> > > of the bcache0 device is around 1.5ms, while in the case of the raw 
> > > device (a partition of NVMe), the same test is only 82.1us.
> > > 
> > > root@pve-20:~# ioping -c10 /dev/bcache0 -D -Y -WWW -s4k
> > > 4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=1 time=1.52 ms (warmup)
> > > 4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=2 time=1.60 ms
> > > 4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=3 time=1.55 ms
> > >
> > > root@pve-20:~# ioping -c10 /dev/nvme0n1p2 -D -Y -WWW -s4k
> > > 4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=1 time=81.2 us (warmup)
> > > 4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=2 time=82.7 us
> > > 4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=3 time=82.4 us
> > 
> > Wow, almost 20x higher latency, sounds convincing that something is wrong.
> > 
> > A few things to try:
> > 
> > 1. Try ioping without -Y.  How does it compare?
> > 
> > 2. Maybe this is an inter-socket latency issue.  Is your server 
> >   multi-socket?  If so, then as a first pass you could set the kernel 
> >   cmdline `isolcpus` for testing to limit all processes to a single 
> >   socket where the NVMe is connected (see `lscpu`).  Check `hwloc-ls`
> >   or your motherboard manual to see how the NVMe port is wired to your
> >   CPUs.
> > 
> >   If that helps then fine tune with `numactl -cN ioping` and 
> >   /proc/irq/<n>/smp_affinity_list (and `grep nvme /proc/interrupts`) to 
> >   make sure your NVMe's are locked to IRQs on the same socket.
> > 
> > 3a. sysfs:
> > 
> > > # echo 0 > /sys/block/bcache0/bcache/sequential_cutoff
> > 
> > good.
> > 
> > > # echo 0 > /sys/fs/bcache/<cache set>/congested_read_threshold_us
> > > # echo 0 > /sys/fs/bcache/<cache set>/congested_write_threshold_us
> > 
> > Also try these (I think bcache/cache is a symlink to /sys/fs/bcache/<cache set>)
> > 
> > echo 10000000 > /sys/block/bcache0/bcache/cache/congested_read_threshold_us 
> > echo 10000000 > /sys/block/bcache0/bcache/cache/congested_write_threshold_us
> > 
> > 
> > Try tuning journal_delay_ms: 
> >   /sys/fs/bcache/<cset-uuid>/journal_delay_ms
> >     Journal writes will delay for up to this many milliseconds, unless a 
> >     cache flush happens sooner. Defaults to 100.
> > 
> > 3b: Hacking bcache code:
> > 
> > I just noticed that journal_delay_ms says "unless a cache flush happens 
> > sooner" but cache flushes can be re-ordered so flushing the journal when 
> > REQ_OP_FLUSH comes through may not be useful, especially if there is a 
> > high volume of flushes coming down the pipe because the flushes could kill 
> > the NVMe's cache---and maybe the 1.5ms ping is actual flash latency.  It
> > would flush data and journal.
> > 
> > Maybe there should be a cachedev_noflush sysfs option for those with some 
> > kind of power-loss protection of their SSDs.  It looks like this is 
> > handled in request.c when these functions call bch_journal_meta():
> > 
> >     1053: static void cached_dev_nodata(struct closure *cl)
> >     1263: static void flash_dev_nodata(struct closure *cl)
> > 
> > Coly can you comment about journal flush semantics with respect to 
> > performance vs correctness and crash safety?
> > 
> > Adriano, as a test, you could change this line in search_alloc() in 
> > request.c:
> > 
> >     - s->iop.flush_journal    = op_is_flush(bio->bi_opf);
> >     + s->iop.flush_journal    = 0;
> > 
> > and see how performance changes.
> > 
> > Someone correct me if I'm wrong, but I don't think flush_journal=0 will 
> > affect correctness unless there is a crash.  If that /is/ the performance 
> > problem then it would narrow the scope of this discussion.
> > 
> > 4. I wonder if your 1.5ms `ioping` stats scale with CPU clock speed: can 
> >   you set your CPU governor to run at full clock speed and then slowest 
> >   clock speed to see if it is a CPU limit somewhere as we expect?
> > 
> >   You can do `grep MHz /proc/cpuinfo` to see the active rate to make sure 
> >   the governor did its job.  
> > 
> >   If it scales with CPU then something in bcache is working too hard.  
> >   Maybe garbage collection?  Other devs would need to chime in here to 
> >   steer the troubleshooting if that is the case.
> > 
> > 
> > 5. I'm not sure if garbage collection is the issue, but you might try 
> >   Mingzhe's dynamic incremental gc patch:
> >     https://www.spinics.net/lists/linux-bcache/msg11185.html
> > 
> > 6. Try dm-cache and see if its IO latency is similar to bcache: If it is 
> >   about the same then that would indicate an issue in the block layer 
> >   somewhere outside of bcache.  If dm-cache is better, then that confirms 
> >   a bcache issue.
> > 
> > 
> > > The cache was configured directly on one of the NVMe partitions (in this 
> > > case, the first partition). I did several tests using fio and ioping, 
> > > testing on a partition on the NVMe device, without partition and 
> > > directly on the raw block, on a first partition, on the second, with or 
> > > without configuring bcache. I did all this to remove any doubt as to the 
> > > method. The results of tests performed directly on the hardware device, 
> > > without going through bcache are always fast and similar.
> > > 
> > > But tests in bcache are always slower. If you use writethrough, of 
> > > course, it gets much worse, because the performance is equal to the raw 
> > > spinning disk.
> > > 
> > > Using writeback improves a lot, but still doesn't use the full speed of 
> > > NVMe (honestly, much less than full speed).
> > 
> > Indeed, I hope this can be fixed!  A 20x improvement in bcache would 
> > be awesome.
> > 
> > > But I've also noticed that there is a limit on writing sequential data, 
> > > which is a little more than half of the maximum write rate shown in 
> > > direct tests by the NVMe device.
> > 
> > For sync, async, or both?
> > 
> > 
> > > Processing doesn't seem to be going up like the tests.
> > 
> > 
> > What do you mean "processing" ?
> > 
> > -Eric
> > 
> > 
> > 
> 
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
  2022-05-25  5:20               ` Christoph Hellwig
  2022-05-25 18:44                 ` Eric Wheeler
@ 2022-05-28  1:52                 ` Eric Wheeler
  2022-05-28  3:57                   ` Keith Busch
  2022-05-28  4:59                   ` Christoph Hellwig
  1 sibling, 2 replies; 37+ messages in thread
From: Eric Wheeler @ 2022-05-28  1:52 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, Coly Li, Adriano Silva, Bcache Linux,
	Matthias Ferdinand, linux-block

[-- Attachment #1: Type: text/plain, Size: 2609 bytes --]

On Tue, 24 May 2022, Christoph Hellwig wrote:
> On Tue, May 24, 2022 at 02:34:23PM -0700, Eric Wheeler wrote:
> > Is this flag influced at all when /sys/block/sdX/queue/scheduler is set 
> > to "none", or does the write_cache flag operate independently of the 
> > selected scheduler?
> 
> This is completely independent from the scheduler.
> 
> > Does the block layer stop sending flushes at the first device in the stack 
> > that is set to "write back"?  For example, if a device mapper target is 
> > writeback will it strip flushes on the way to the backing device?
> 
> This is up to the stacking driver.  dm and md tend to pass through flushes
> where needed.
> 
> > This confirms what I have suspected all along: We have an LSI MegaRAID 
> > SAS-3516 where the write policy is "write back" in the LUN, but the cache 
> > is flagged in Linux as write-through:
> > 
> > 	]# cat /sys/block/sdb/queue/write_cache 
> > 	write through

Hi Keith, Christoph:

Adriano who started this thread (cc'ed) reported that setting 
queue/write_cache to "write back" provides much higher latency on his NVMe 
than "write through"; I tested a system here and found the same thing.

Here is Adriano's summary:

        # cat /sys/block/nvme0n1/queue/write_cache
        write through
        # ioping -c10 /dev/nvme0n1 -D -Y -WWW -s4K
        ...
        min/avg/max/mdev = 60.0 us / 78.7 us / 91.2 us / 8.20 us
                                     ^^^^ ^^

        # for i in /sys/block/*/queue/write_cache; do echo 'write back' > $i; done
        # ioping -c10 /dev/nvme0n1 -D -Y -WWW -s4K
        ...
        min/avg/max/mdev = 1.81 ms / 1.89 ms / 2.01 ms / 82.3 us
                                     ^^^^ ^^

Interestingly, Adriano's latency is 24.01x higher and ours is 23.97x
higher (see below).  These 24x numbers seem too similar to be a
coincidence on such different configurations.  He's running Linux 5.4
and we are on 4.19.

Is this expected?


More info:

The stack where I verified the behavior Adriano reported is slightly
different, NVMe's are under md RAID1 with LVM on top, so latency is
higher, but still basically the same high latency difference with
writeback enabled:

	]# cat /sys/block/nvme[01]n1/queue/write_cache
	write through
	write through
	]# ionice -c1 -n1 ioping -c10 /dev/ssd/ssd-test -D -s4k -WWW -Y
	...
	min/avg/max/mdev = 119.1 us / 754.9 us / 2.67 ms / 1.02 ms


	]# cat /sys/block/nvme[01]n1/queue/write_cache
	write back
	write back
	]# ionice -c1 -n1 ioping -c10 /dev/ssd/ssd-test -D -s4k -WWW -Y
	...
	min/avg/max/mdev = 113.4 us / 18.1 ms / 29.2 ms / 9.53 ms


--
Eric Wheeler

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
  2022-05-28  1:52                 ` Eric Wheeler
@ 2022-05-28  3:57                   ` Keith Busch
  2022-05-28  4:59                   ` Christoph Hellwig
  1 sibling, 0 replies; 37+ messages in thread
From: Keith Busch @ 2022-05-28  3:57 UTC (permalink / raw)
  To: Eric Wheeler
  Cc: Christoph Hellwig, Coly Li, Adriano Silva, Bcache Linux,
	Matthias Ferdinand, linux-block

On Fri, May 27, 2022 at 06:52:22PM -0700, Eric Wheeler wrote:
> Hi Keith, Christoph:
> 
> Adriano who started this thread (cc'ed) reported that setting 
> queue/write_cache to "write back" provides much higher latency on his NVMe 
> than "write through"; I tested a system here and found the same thing.
> 
> Here is Adriano's summary:
> 
>         # cat /sys/block/nvme0n1/queue/write_cache
>         write through
>         # ioping -c10 /dev/nvme0n1 -D -Y -WWW -s4K
>         ...
>         min/avg/max/mdev = 60.0 us / 78.7 us / 91.2 us / 8.20 us
>                                      ^^^^ ^^
> 
>         # for i in /sys/block/*/queue/write_cache; do echo 'write back' > $i; done
>         # ioping -c10 /dev/nvme0n1 -D -Y -WWW -s4K
>         ...
>         min/avg/max/mdev = 1.81 ms / 1.89 ms / 2.01 ms / 82.3 us
>                                      ^^^^ ^^

With the "write back" setting, I find that the writes dispatched from ioping
will have the force-unit-access bit set in the commands, so it is expected to
take longer.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
  2022-05-28  1:52                 ` Eric Wheeler
  2022-05-28  3:57                   ` Keith Busch
@ 2022-05-28  4:59                   ` Christoph Hellwig
  2022-05-28 12:57                     ` Adriano Silva
  1 sibling, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2022-05-28  4:59 UTC (permalink / raw)
  To: Eric Wheeler
  Cc: Christoph Hellwig, Keith Busch, Coly Li, Adriano Silva,
	Bcache Linux, Matthias Ferdinand, linux-block

On Fri, May 27, 2022 at 06:52:22PM -0700, Eric Wheeler wrote:
> Adriano who started this thread (cc'ed) reported that setting 
> queue/write_cache to "write back" provides much higher latency on his NVMe 
> than "write through"; I tested a system here and found the same thing.
>
> [...]
>
> Is this expected?

Once you do that, the block layer ignores all flushes and FUA bits, so
yes it is going to be a lot faster.  But also completely unsafe because
it does not provide any data durability guarantees.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Bcache in writes direct with fsync. Are IOPS limited?
  2022-05-28  1:27           ` Eric Wheeler
@ 2022-05-28  7:22             ` Matthias Ferdinand
  2022-05-28 12:09               ` Adriano Silva
  0 siblings, 1 reply; 37+ messages in thread
From: Matthias Ferdinand @ 2022-05-28  7:22 UTC (permalink / raw)
  To: Eric Wheeler; +Cc: Adriano Silva, Coly Li, Bcache Linux

On Fri, May 27, 2022 at 06:27:53PM -0700, Eric Wheeler wrote:
> > I can say that the performance of tests after the write back command for 
> > all devices greatly worsens the performance of direct tests on NVME 
> > hardware. Below you can see this.
> 
> I wonder what is going on there!  I tried the same thing on my system and 
> 'write through' is faster for me, too, so it would be worth investigating.

In a Ceph context, it seems not unusual to disable the SSD write-back cache
and see much improved performance (or the other way round: to see
surprisingly low performance with the write-back cache enabled):

    https://yourcmc.ru/wiki/Ceph_performance#Drive_cache_is_slowing_you_down

Disk controllers seem to interpret FLUSH CACHE / FUA differently.
If bcache would set FUA for cache device writes while running fio
directly on the nvme device would not, that might explain the timing
difference.
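
One way to test that theory would be to turn off the drive's volatile
write cache and repeat the fio/ioping runs. Illustrative commands only,
not verified on this particular hardware:

    # SATA/SAS disk:
    hdparm -W 0 /dev/sdX

    # NVMe (feature 0x06 = Volatile Write Cache, only if the drive
    # reports having one):
    nvme get-feature /dev/nvme0n1 -f 0x06 -H
    nvme set-feature /dev/nvme0n1 -f 0x06 -v 0

If the FUA/flush latency then drops toward the plain-write numbers, that
would point at the drive's cache handling rather than the kernel.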

Regards
Matthias

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Bcache in writes direct with fsync. Are IOPS limited?
  2022-05-28  7:22             ` Matthias Ferdinand
@ 2022-05-28 12:09               ` Adriano Silva
  0 siblings, 0 replies; 37+ messages in thread
From: Adriano Silva @ 2022-05-28 12:09 UTC (permalink / raw)
  To: Eric Wheeler, Matthias Ferdinand; +Cc: Coly Li, Bcache Linux

Thank you, Eric, Matthias, Coly.


> Disk controllers seem to interpret FLUSH CACHE / FUA differently.
> If bcache would set FUA for cache device writes while running fio
> directly on the nvme device would not, that might explain the timing
> difference.

Matthias, thanks a lot for helping!

I believe this test was not aimed at the Ceph context. Although my ultimate goal is to run Ceph (you are right about that), Ceph is still off. Turning Ceph on will be my next step, after getting a solid cached-device setup. These direct, synchronized disk tests are useful for Ceph, but they can also give an idea of how things will work for other applications, such as an Oracle database engine, PostgreSQL, or other database engines.

On the other hand, I believe this result comes from the fact that an enterprise NVMe with PLP (Power Loss Protection) is very fast for direct writes, faster than one would expect from the OS caching mechanisms. If I'm not mistaken, the test was about the OS caching mechanism.

Eric,

I don't see a big problem in creating the bcache device with -w 4096. But there might be situations where performance degrades when something tries to write in 512-byte units, as you said. Could that be a concern in a production environment?

Anyway, even using -w 4096, the performance was still way below the native NVMe performance. Is this because of the metadata headers?

I noticed one thing via the dstat tool (it seems useful for watching the data flow and the I/O operations to the devices in real time):

Each 4K block written to bcache results in a 16K write to the cache device (NVMe). This seems to indicate that bcache writes an extra 12KB (three times the size of the 4K block) of header, metadata, or mapping information for each 4K block written. Is that right?

If this is correct, it might explain why I still only get 1/4 of the NVMe's performance when writing 4KB blocks, even when I format bcache with -w 4096: if for every 4KB block I write to bcache it needs to write 4X that amount of data to the cache device, then obviously I'm only going to get 25% of the hardware performance.

Is that it?
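
For reference, the dstat invocation behind these observations was
something like the following (illustrative; the device names are the ones
from these tests), run while the same fio 4K random-write test was
running against /dev/bcache0 in another terminal:

    dstat -D sdb,nvme0n1,bcache0 -d -r -n -l -c -y -t --aio

Seeing the dsk/nvme0n1 write column run at roughly 4x the dsk/bcache0
column is what suggests the extra ~12KB per 4KB write.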

Another thing that intrigues me now is the difference in bcache performance between one server and the other. I believe it must be some configuration issue, because the hardware is identical, but I can't imagine which one. I even matched the memory on both machines, and even the SATA positions of the disks, so there is no difference. Even so, the second machine insists on delivering half the performance of the first, just on the cache.

And again in dstat I can verify that zero bytes are written to or read from the backing device while 4K blocks are being written to the bcache device and the NVMe hardware, which is correct, I think. But at the same time, dstat indicates that I/O operations are taking place on the backing device. This does not occur on the first server, only on the second. It seems clear to me that this behavior is halving the performance on the second server. But why? Why are there I/O operations destined for the backing device with "zero" bytes written or read? What kind of I/O operation writes or reads zero bytes? And why would they occur?
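
One way to see what those zero-byte operations actually are would be to
trace the backing device while the test runs. A sketch, assuming blktrace
is installed (sdd is the backing disk on the second server):

    blktrace -d /dev/sdd -o - | blkparse -i -

Cache-flush requests carry no data, so they would show up there as 0-byte
requests, typically with an 'F' in the RWBS column. If that is what
appears, it would mean the second server is forwarding a flush to the
backing device alongside the writes to the cache device.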

This is one more avenue of research.

If anyone has an idea, I'd appreciate it.

Thank you all!



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
  2022-05-28  4:59                   ` Christoph Hellwig
@ 2022-05-28 12:57                     ` Adriano Silva
  2022-05-29  3:18                       ` Keith Busch
  0 siblings, 1 reply; 37+ messages in thread
From: Adriano Silva @ 2022-05-28 12:57 UTC (permalink / raw)
  To: Christoph Hellwig, Bcache Linux, Matthias Ferdinand, Coly Li,
	Eric Wheeler, Keith Busch

Dear Christoph,

> Once you do that, the block layer ignores all flushes and FUA bits, so
> yes it is going to be a lot faster.  But also completely unsafe because
> it does not provide any data durability guarantees.

Sorry, but wouldn't it be the other way around? Or did I really not understand your answer?

Sorry, I don't know anything about kernel code, but wouldn't it be the other way around?

It may just be that I'm not understanding. And it's likely that I'm not, because you understand much more about this; I'm new to this subject and know very little about it, almost nothing.

But it's just that I've read the opposite about it.

 Isn't "write through" to provide more secure writes?

I also see that "write back" would be meant to be faster. No?

But I understand that when I do a write with direct ioping (-D) and forced sync (-Y), an enterprise NVMe device with PLP (Power Loss Protection) like mine should perform very well: in theory, the OS sends the writes to the hardware with an instruction to bypass the volatile cache (correct?), but the NVMe device can still put the data in its local cache and give the OS an immediate completion saying the data has been written, because it knows its local cache is a safe place for that (in theory).

On the other hand, the question of why writing is slow when "write back" is enabled is intriguing. Could it be the software stack involved in doing the write back? I don't know.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
  2022-05-28 12:57                     ` Adriano Silva
@ 2022-05-29  3:18                       ` Keith Busch
  2022-05-31 19:42                         ` Eric Wheeler
       [not found]                         ` <2064546094.2440522.1653825057164@mail.yahoo.com>
  0 siblings, 2 replies; 37+ messages in thread
From: Keith Busch @ 2022-05-29  3:18 UTC (permalink / raw)
  To: Adriano Silva
  Cc: Christoph Hellwig, Bcache Linux, Matthias Ferdinand, Coly Li,
	Eric Wheeler

On Sat, May 28, 2022 at 12:57:26PM +0000, Adriano Silva wrote:
> Dear Christoph,
> 
> > Once you do that, the block layer ignores all flushes and FUA bits, so
> > yes it is going to be a lot faster.  But also completely unsafe because
> > it does not provide any data durability guarantees.
> 
> Sorry, but wouldn't it be the other way around? Or did I really not understand your answer?
> 
> Sorry, I don't know anything about kernel code, but wouldn't it be the other way around?
> 
> It's just that, I may not be understanding. And it's likely that I'm not, because you understand more about this, I'm new to this subject. I know very little about it, or almost nothing.
> 
> But it's just that I've read the opposite about it.
> 
>  Isn't "write through" to provide more secure writes?
> 
> I also see that "write back" would be meant to be faster. No?

The sysfs "write_cache" attribute just controls what the kernel does. It
doesn't change any hardware settings.

In "write back" mode, a sync write will have FUA set, which will generally be
slower than a write without FUA. In "write through" mode, the kernel doesn't
set FUA so the data may not be durable after the completion if the controller
is using a volatile write cache.
 
> But I understand that when I do a write with direct ioping (-D) and with forced sync (-Y), then an enterprise NVME device with PLP (Power Loss Protection) like mine here should perform very well because in theory, the messages are sent to the hardware by the OS with an instruction for the Hardware to ignore the cache (correct?), but the NVME device will still put it in its local cache and give an immediate response to the OS saying that the data has been written, because he knows his local cache is a safe place for this (in theory).

If the device's power-loss protected memory is considered non-volatile, then it
shouldn't be reporting a volatile write cache, and it may complete commands
once the write data reaches its non-volatile cache. It can treat flush and FUA
as no-ops.
 
> On the other hand, answering why writing is slow when "write back" is activated is intriguing. Could it be the software logic stack involved to do the Write Back? I don't know.

Yeah, the software stack will issue flushes and FUA in "write back" mode.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
  2022-05-29  3:18                       ` Keith Busch
@ 2022-05-31 19:42                         ` Eric Wheeler
  2022-05-31 20:22                           ` Keith Busch
       [not found]                         ` <2064546094.2440522.1653825057164@mail.yahoo.com>
  1 sibling, 1 reply; 37+ messages in thread
From: Eric Wheeler @ 2022-05-31 19:42 UTC (permalink / raw)
  To: Keith Busch
  Cc: Adriano Silva, Christoph Hellwig, Bcache Linux,
	Matthias Ferdinand, Coly Li

[-- Attachment #1: Type: text/plain, Size: 3276 bytes --]

On Sat, 28 May 2022, Keith Busch wrote:
> On Sat, May 28, 2022 at 12:57:26PM +0000, Adriano Silva wrote:
> > Dear Christoph,
> > 
> > > Once you do that, the block layer ignores all flushes and FUA bits, so
> > > yes it is going to be a lot faster.  But also completely unsafe because
> > > it does not provide any data durability guarantees.
> > 
> > Sorry, but wouldn't it be the other way around? Or did I really not 
> > understand your answer?
> > 
> > Sorry, I don't know anything about kernel code, but wouldn't it be the 
> > other way around?
> > 
> > It's just that, I may not be understanding. And it's likely that I'm 
> > not, because you understand more about this, I'm new to this subject. 
> > I know very little about it, or almost nothing.
> > 
> > But it's just that I've read the opposite about it.
> > 
> >  Isn't "write through" to provide more secure writes?
> > 
> > I also see that "write back" would be meant to be faster. No?
> 
> The sysfs "write_cache" attribute just controls what the kernel does. It
> doesn't change any hardware settings.
> 
> In "write back" mode, a sync write will have FUA set, which will generally be
> slower than a write without FUA. In "write through" mode, the kernel doesn't
> set FUA so the data may not be durable after the completion if the controller
> is using a volatile write cache.

Something seems wrong here: Typically on a RAID controller LUN 
configuration "writeback" means that the non-volatile cache is active so 
"write back caching" is enabled.

According to https://www.kernel.org/doc/Documentation/block/queue-sysfs.txt:

	"When read, this file will display whether the device has write
	back caching enabled or not. It will return "write back" for the former
	case, and "write through" for the latter."

If my text mailer could underline, then I would underline this from the 
documentation: "whether the device has write back caching enabled or not"

Is there a good explanation for why the kernel setting is exactly 
_opposite_ of the controller setting?

> > But I understand that when I do a write with direct ioping (-D) and 
> > with forced sync (-Y), then an enterprise NVME device with PLP (Power 
> > Loss Protection) like mine here should perform very well because in 
> > theory, the messages are sent to the hardware by the OS with an 
> > instruction for the Hardware to ignore the cache (correct?), but the 
> > NVME device will still put it in its local cache and give an immediate 
> > response to the OS saying that the data has been written, because he 
> > knows his local cache is a safe place for this (in theory).
> 
> If the device's power-loss protected memory is considered non-volatile, then it
> shouldn't be reporting a volatile write cache, and it may complete commands
> once the write data reaches its non-volatile cache. It can treat flush and FUA
> as no-ops.
>  
> > On the other hand, answering why writing is slow when "write back" is 
> > activated is intriguing. Could it be the software logic stack involved 
> > to do the Write Back? I don't know.
> 
> Yeah, the software stack will issue flushes and FUA in "write back" 
> mode.

If this setting really is intended to be backwards from industry 
vernacular then perhaps it is a documentation bug...

-Eric

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
  2022-05-31 19:42                         ` Eric Wheeler
@ 2022-05-31 20:22                           ` Keith Busch
  2022-05-31 23:04                             ` Eric Wheeler
  0 siblings, 1 reply; 37+ messages in thread
From: Keith Busch @ 2022-05-31 20:22 UTC (permalink / raw)
  To: Eric Wheeler
  Cc: Adriano Silva, Christoph Hellwig, Bcache Linux,
	Matthias Ferdinand, Coly Li

On Tue, May 31, 2022 at 12:42:49PM -0700, Eric Wheeler wrote:
> On Sat, 28 May 2022, Keith Busch wrote:
> > On Sat, May 28, 2022 at 12:57:26PM +0000, Adriano Silva wrote:
> > > Dear Christoph,
> > > 
> > > > Once you do that, the block layer ignores all flushes and FUA bits, so
> > > > yes it is going to be a lot faster.  But also completely unsafe because
> > > > it does not provide any data durability guarantees.
> > > 
> > > Sorry, but wouldn't it be the other way around? Or did I really not 
> > > understand your answer?
> > > 
> > > Sorry, I don't know anything about kernel code, but wouldn't it be the 
> > > other way around?
> > > 
> > > It's just that, I may not be understanding. And it's likely that I'm 
> > > not, because you understand more about this, I'm new to this subject. 
> > > I know very little about it, or almost nothing.
> > > 
> > > But it's just that I've read the opposite about it.
> > > 
> > >  Isn't "write through" to provide more secure writes?
> > > 
> > > I also see that "write back" would be meant to be faster. No?
> > 
> > The sysfs "write_cache" attribute just controls what the kernel does. It
> > doesn't change any hardware settings.
> > 
> > In "write back" mode, a sync write will have FUA set, which will generally be
> > slower than a write without FUA. In "write through" mode, the kernel doesn't
> > set FUA so the data may not be durable after the completion if the controller
> > is using a volatile write cache.
> 
> Something seems wrong here: Typically on a RAID controller LUN 
> configuration "writeback" means that the non-volatile cache is active so 
> "write back caching" is enabled.
> 
> According to https://www.kernel.org/doc/Documentation/block/queue-sysfs.txt:
> 
> 	"When read, this file will display whether the device has write
> 	back caching enabled or not. It will return "write back" for the former
> 	case, and "write through" for the latter."
> 
> If my text mailer would underline then I would underline this from the 
> documentation: "whether the device has write back caching enabled or not"

Maybe this is confusing because we let the user change the kernel's behavior
regardless of how the storage device is configured?
 
> Is there a good explanation for why the kernel setting is exactly 
> _opposite_ of the controller setting?

By default, the drivers should have the correct setting reported for their
devices, not the opposite. The user can override the sysfs attribute to the
opposite setting though, so it's not necessarily an accurate report of what the
device has actually enabled.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
  2022-05-31 20:22                           ` Keith Busch
@ 2022-05-31 23:04                             ` Eric Wheeler
  2022-06-01  0:36                               ` Keith Busch
  0 siblings, 1 reply; 37+ messages in thread
From: Eric Wheeler @ 2022-05-31 23:04 UTC (permalink / raw)
  To: Keith Busch
  Cc: Adriano Silva, Christoph Hellwig, Bcache Linux,
	Matthias Ferdinand, Coly Li

[-- Attachment #1: Type: text/plain, Size: 4407 bytes --]

On Tue, 31 May 2022, Keith Busch wrote:
> On Tue, May 31, 2022 at 12:42:49PM -0700, Eric Wheeler wrote:
> > On Sat, 28 May 2022, Keith Busch wrote:
> > > On Sat, May 28, 2022 at 12:57:26PM +0000, Adriano Silva wrote:
> > > > Dear Christoph,
> > > > 
> > > > > Once you do that, the block layer ignores all flushes and FUA bits, so
> > > > > yes it is going to be a lot faster.  But also completely unsafe because
> > > > > it does not provide any data durability guarantees.
> > > > 
> > > > Sorry, but wouldn't it be the other way around? Or did I really not 
> > > > understand your answer?
> > > > 
> > > > Sorry, I don't know anything about kernel code, but wouldn't it be the 
> > > > other way around?
> > > > 
> > > > It's just that, I may not be understanding. And it's likely that I'm 
> > > > not, because you understand more about this, I'm new to this subject. 
> > > > I know very little about it, or almost nothing.
> > > > 
> > > > But it's just that I've read the opposite about it.
> > > > 
> > > >  Isn't "write through" to provide more secure writes?
> > > > 
> > > > I also see that "write back" would be meant to be faster. No?
> > > 
> > > The sysfs "write_cache" attribute just controls what the kernel does. It
> > > doesn't change any hardware settings.
> > > 
> > > In "write back" mode, a sync write will have FUA set, which will generally be
> > > slower than a write without FUA. In "write through" mode, the kernel doesn't
> > > set FUA so the data may not be durable after the completion if the controller
> > > is using a volatile write cache.
> > 
> > Something seems wrong here: Typically on a RAID controller LUN 
> > configuration "writeback" means that the non-volatile cache is active so 
> > "write back caching" is enabled.
> > 
> > According to https://www.kernel.org/doc/Documentation/block/queue-sysfs.txt:
> > 
> > 	"When read, this file will display whether the device has write
> > 	back caching enabled or not. It will return "write back" for the former
> > 	case, and "write through" for the latter."
> > 
> > If my text mailer would underline then I would underline this from the 
> > documentation: "whether the device has write back caching enabled or not"
> 
> Maybe this is confusing because we let the user change the kernel's behavior
> regardless of how the storage device is configured?

This is important to keep: not all controllers properly report the LUN's 
cache state, so overrides are necessary in real life... but that's not what 
we're hoping to address:
  
> > Is there a good explanation for why the kernel setting is exactly 
> > _opposite_ of the controller setting?
> 
> By default, the drivers should have the correct setting reported for their
> devices, not the opposite. The user can override the sysfs attribute to the
> opposite setting though, so it's not necessarily an accurate report of what the
> device has actually enabled.

Lets assume for the moment that drivers always set this flag correctly 
because that isn't really the issue here: This is a discussion of 
terminology.

What I mean is that the very term "write-through" means to write _through_ 
the cache and block until completion to persistent storage, whereas, 
"write-back" means to return completion to the OS before IOs reach 
persistent storage.

...or at least this is the terminology that the RAID card manufacturers 
have used for decades.  I actually checked Wikipedia (as a zeitgeist 
reference, not as an authority) just in case I've been mistaken all these 
years as to the spirit of the meaning and it aligns with what I'm trying 
to express here:

  * Write-through: write is done synchronously both to the cache and to 
    the backing store.

  * Write-back (also called write-behind): initially, writing is done only 
    to the cache. The write to the backing store is postponed until the 
    modified content is about to be replaced by another cache block.
  [ https://en.wikipedia.org/wiki/Cache_(computing)#Writing_policies ]


So the kernel's notion of "write through" meaning "Drop FLUSH/FUA" sounds 
like the industry meaning of "write-back" as defined above; conversely, 
the kernel's notion of "write back" sounds like the industry definition of 
"write-through"

Is there a well-meaning rationale for the kernel's concept of "write 
through" to be different than what end users have been conditioned to 
understand?

--
Eric Wheeler

 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
  2022-05-31 23:04                             ` Eric Wheeler
@ 2022-06-01  0:36                               ` Keith Busch
  2022-06-01 18:48                                 ` Eric Wheeler
  0 siblings, 1 reply; 37+ messages in thread
From: Keith Busch @ 2022-06-01  0:36 UTC (permalink / raw)
  To: Eric Wheeler
  Cc: Adriano Silva, Christoph Hellwig, Bcache Linux,
	Matthias Ferdinand, Coly Li

On Tue, May 31, 2022 at 04:04:12PM -0700, Eric Wheeler wrote:
> 
>   * Write-through: write is done synchronously both to the cache and to 
>     the backing store.
> 
>   * Write-back (also called write-behind): initially, writing is done only 
>     to the cache. The write to the backing store is postponed until the 
>     modified content is about to be replaced by another cache block.
>   [ https://en.wikipedia.org/wiki/Cache_(computing)#Writing_policies ]
> 
> 
> So the kernel's notion of "write through" meaning "Drop FLUSH/FUA" sounds 
> like the industry meaning of "write-back" as defined above; conversely, 
> the kernel's notion of "write back" sounds like the industry definition of 
> "write-through"
> 
> Is there a well-meaning rationale for the kernel's concept of "write 
> through" to be different than what end users have been conditioned to 
> understand?

I think we all agree what "write through" vs "write back" mean. I'm just not
sure what's the source of the disconnect with the kernel's behavior.

  A "write through" device persists data before completing a write operation.

  Flush/FUA says to write data to persistence before completing the operation.

You don't need both. Flush/FUA should be a no-op to a "write through" device
because the data is synchronously committed to the backing store automatically.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
  2022-06-01  0:36                               ` Keith Busch
@ 2022-06-01 18:48                                 ` Eric Wheeler
  0 siblings, 0 replies; 37+ messages in thread
From: Eric Wheeler @ 2022-06-01 18:48 UTC (permalink / raw)
  To: Keith Busch
  Cc: Adriano Silva, Christoph Hellwig, Bcache Linux,
	Matthias Ferdinand, Coly Li

On Tue, 31 May 2022, Keith Busch wrote:
> On Tue, May 31, 2022 at 04:04:12PM -0700, Eric Wheeler wrote:
> > 
> >   * Write-through: write is done synchronously both to the cache and to 
> >     the backing store.
> > 
> >   * Write-back (also called write-behind): initially, writing is done only 
> >     to the cache. The write to the backing store is postponed until the 
> >     modified content is about to be replaced by another cache block.
> >   [ https://en.wikipedia.org/wiki/Cache_(computing)#Writing_policies ]
> > 
> > 
> > So the kernel's notion of "write through" meaning "Drop FLUSH/FUA" sounds 
> > like the industry meaning of "write-back" as defined above; conversely, 
> > the kernel's notion of "write back" sounds like the industry definition of 
> > "write-through"
> > 
> > Is there a well-meaning rationale for the kernel's concept of "write 
> > through" to be different than what end users have been conditioned to 
> > understand?
> 
> I think we all agree what "write through" vs "write back" mean. I'm just not
> sure what's the source of the disconnect with the kernel's behavior.
> 
>   A "write through" device persists data before completing a write operation.
> 
>   Flush/FUA says to write data to persistence before completing the operation.
> 
> You don't need both. Flush/FUA should be a no-op to a "write through" device
> because the data is synchronously committed to the backing store automatically.

Ok, I think I'm starting to understand the rationale, thank you for your 
patience while I've come to wrap my head around it. So, using a RAID 
controller cache as an example:

1. A RAID controller with a _non-volatile_ "writeback" cache (from the 
   controller's perspective, ie, _with_ battery) is a "write through"  
   device as far as the kernel is concerned because the controller will 
   return the write as complete as soon as it is in the persistent cache.

2. A RAID controller with a _volatile_ "writeback" cache (from the 
   controller's perspective, ie _without_ battery) is a "write back"  
   device as far as the kernel is concerned because the controller will 
   return the write as complete as soon as it is in the cache, but the 
   cache is not persistent!  So in that case flush/FUA is necessary.

I think it is rare someone would configure a RAID controller is as 
writeback (in the controller) when the cache is volatile (ie, without 
battery), but it is an interesting way to disect this to understand the 
rationale around value choices for the `queue/write_cache` flag in sysfs.
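
Concretely, for case 1 the kernel-side flag should end up reading like
this, and if a driver ever mis-detects the controller it can be
overridden by hand (illustrative):

	]# cat /sys/block/sdb/queue/write_cache
	write through
	]# echo 'write through' > /sys/block/sdb/queue/write_cache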

So please correct me here if I'm wrong: theoretically, a RAID controller 
with a volatile writeback cache is "safe" in terms of any flush/FUA 
behavior, assuming the controller respects those ops in writeback mode.  
For example, ext4's journal is probably consistent after a crash, even if 
2GB of cached data might be lost (assuming FUA and not FLUSH is being 
used for meta, I don't actually know ext4's implementation there).


I would guess that most end users are going to expect queue/write_cache to 
match their RAID controller's naming convention.  If they see "write 
through" when they know their controller is in writeback w/battery then 
they might reasonably expect the flag to show "write back", too.  If they 
then force it to "write back" then they loose the performance benefit.

Given that, and considering end users that configure raid controllers do 
not commonly understand the flush/FUA intricacies and what really 
constitutes "write back" vs "write through" from the kernel's perspective, 
then perhaps it would be a good idea to add more documentation around 
write_cache here:

  https://www.kernel.org/doc/Documentation/block/queue-sysfs.txt

What do you think?


--
Eric Wheeler




^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
       [not found]                           ` <YpTKfHHWz27Qugi+@kbusch-mbp.dhcp.thefacebook.com>
@ 2022-06-01 19:27                             ` Adriano Silva
  2022-06-01 21:11                               ` Eric Wheeler
  0 siblings, 1 reply; 37+ messages in thread
From: Adriano Silva @ 2022-06-01 19:27 UTC (permalink / raw)
  To: Keith Busch, Eric Wheeler, Matthias Ferdinand, Bcache Linux,
	Coly Li, Christoph Hellwig, linux-block

Thank you,

I don't know if my NVMe devices use a 4K LBA format. I don't think so. They are all the same model and manufacturer. I know that they work with 512-byte blocks, but their latency is very high when processing blocks of that size.

However, in all the tests I do with 4K blocks, the results are much better, so I always use 4K blocks. In real life I don't think I'll use blocks smaller than 4K.

> You can remove the kernel interpretation using passthrough commands. Here's an
> example comparing with and without FUA assuming a 512b logical block format:
> 
>   # echo "" | nvme write /dev/nvme0n1 --block-count=7 --data-size=4k --force-unit-access --latency
>   # echo "" | nvme write /dev/nvme0n1 --block-count=7 --data-size=4k --latency
> 
> if you have a 4k LBA format, use "--block-count=0".
> 
> And you may want to run each of the above several times to get an average since
> other factors can affect the reported latency.

I created a bash script that executes the two commands you suggested repeatedly over a period of 10 seconds, to get a more meaningful average. The result is the following:

root@pve-21:~# for i in /sys/block/*/queue/write_cache; do echo 'write back' > $i; done
root@pve-21:~# cat /sys/block/nvme0n1/queue/write_cache
write back
root@pve-21:~# ./nvme_write.sh
Total: 10 seconds, 3027 tests. Latency (us) : min: 29  /  avr: 37   /  max: 98
root@pve-21:~# ./nvme_write.sh --force-unit-access
Total: 10 seconds, 2985 tests. Latency (us) : min: 29  /  avr: 37   /  max: 111
root@pve-21:~#
root@pve-21:~# ./nvme_write.sh --force-unit-access --block-count=0
Total: 10 seconds, 2556 tests. Latency (us) : min: 404  /  avr: 428   /  max: 492
root@pve-21:~# ./nvme_write.sh --block-count=0
Total: 10 seconds, 2521 tests. Latency (us) : min: 403  /  avr: 428   /  max: 496
root@pve-21:~#
root@pve-21:~#
root@pve-21:~# for i in /sys/block/*/queue/write_cache; do echo 'write through' > $i; done
root@pve-21:~# cat /sys/block/nvme0n1/queue/write_cache
write through
root@pve-21:~# ./nvme_write.sh
Total: 10 seconds, 2988 tests. Latency (us) : min: 29  /  avr: 37   /  max: 114
root@pve-21:~# ./nvme_write.sh --force-unit-access
Total: 10 seconds, 2926 tests. Latency (us) : min: 29  /  avr: 36   /  max: 71
root@pve-21:~#
root@pve-21:~# ./nvme_write.sh --force-unit-access --block-count=0
Total: 10 seconds, 2456 tests. Latency (us) : min: 31  /  avr: 428   /  max: 496
root@pve-21:~# ./nvme_write.sh --block-count=0
Total: 10 seconds, 2627 tests. Latency (us) : min: 402  /  avr: 428   /  max: 509

Well, as we can see above, in almost 3k tests run over a period of ten seconds with each of the commands, I got even better results than I had already gotten with ioping. I ran isolated commands as well, but I decided to write a bash script so I could execute many commands in a short period of time and take an average. We can see an average of about 37us in every situation. Very low!
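
For reference, here is a rough idea of what a wrapper like nvme_write.sh
might look like. This is a hypothetical sketch (the real script is not
pasted here); it times each call from userspace instead of parsing the
nvme --latency output:

    #!/bin/bash
    # Hypothetical sketch: loop "nvme write" for 10 seconds and report
    # min/avr/max latency in microseconds.  Flags on the command line
    # (--force-unit-access, --block-count=..., etc.) are passed through.
    DEV=/dev/nvme0n1                      # assumed device
    end=$(( $(date +%s) + 10 ))
    count=0; sum=0; min=; max=
    while [ "$(date +%s)" -lt "$end" ]; do
            t0=$(date +%s%N)
            echo "" | nvme write "$DEV" --data-size=4k "$@" >/dev/null
            t1=$(date +%s%N)
            us=$(( (t1 - t0) / 1000 ))
            [ -z "$min" ] || [ "$us" -lt "$min" ] && min=$us
            [ -z "$max" ] || [ "$us" -gt "$max" ] && max=$us
            sum=$(( sum + us )); count=$(( count + 1 ))
    done
    echo "Total: 10 seconds, $count tests. Latency (us) : min: $min  /  avr: $(( sum / count ))   /  max: $max"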

However, when using the suggested --block-count=0 option, the latency is much higher in every situation, around 428us.

But as we can see, using the nvme command the latency is always the same in every scenario, with or without --force-unit-access; the only difference shows up with the --block-count=0 form, which is intended for devices with a 4K LBA format rather than these.

What do you think?

Thanks,


On Monday, May 30, 2022 at 10:45:37 BRT, Keith Busch <kbusch@kernel.org> wrote: 





On Sun, May 29, 2022 at 11:50:57AM +0000, Adriano Silva wrote:

> So why the slowness? Is it just the time spent in kernel code to set FUA and Flush Cache bits on writes that would cause all this latency increment (84us to 1.89ms) ?


I don't think the kernel's handling accounts for that great of a difference. I
think the difference is probably on the controller side.

The NVMe spec says that a Write command with FUA set:

"the controller shall write that data and metadata, if any, to non-volatile
media before indicating command completion."

So if the memory is non-volatile, it can complete the command without writing
to the backing media. It can also commit the data to the backing media if it
wants to before completing the command, but that's implementation specific
details.

You can remove the kernel interpretation using passthrough commands. Here's an
example comparing with and without FUA assuming a 512b logical block format:

  # echo "" | nvme write /dev/nvme0n1 --block-count=7 --data-size=4k --force-unit-access --latency
  # echo "" | nvme write /dev/nvme0n1 --block-count=7 --data-size=4k --latency

If you have a 4k LBA format, use "--block-count=0".

And you may want to run each of the above several times to get an average since
other factors can affect the reported latency.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
  2022-06-01 19:27                             ` Adriano Silva
@ 2022-06-01 21:11                               ` Eric Wheeler
  2022-06-02  5:26                                 ` Christoph Hellwig
  0 siblings, 1 reply; 37+ messages in thread
From: Eric Wheeler @ 2022-06-01 21:11 UTC (permalink / raw)
  To: Adriano Silva
  Cc: Keith Busch, Matthias Ferdinand, Bcache Linux, Coly Li,
	Christoph Hellwig, linux-block

[-- Attachment #1: Type: text/plain, Size: 5493 bytes --]

On Wed, 1 Jun 2022, Adriano Silva wrote:
> I don't know if my NVME's devices are 4K LBA. I do not think so. They 
> are all the same model and manufacturer. I know that they work with 
> blocks of 512 Bytes, but that their latency is very high when processing 
> blocks of this size.

Ok, it should be safe in terms of the possible bcache bug I was referring 
to if it supports 512b IOs.

> However, in all the tests I do with them with 4K blocks, the result is 
> much better. So I always use 4K blocks. Because in real life I don't 
> think I'll use blocks smaller than 4K.

Makes sense, format with -w 4k.  There is probably some CPU benefit to 
having page-aligned IOs, too.
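
For example, something along these lines (illustrative only; device names
are the ones from earlier in the thread, and the bucket size is just a
starting point to vary):

    make-bcache -w 4k -b 512k --writeback -C /dev/nvme0n1p1 -B /dev/sdb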

> > You can remove the kernel interpretation using passthrough commands. Here's an
> > example comparing with and without FUA assuming a 512b logical block format:
> > 
> >   # echo "" | nvme write /dev/nvme0n1 --block-count=7 --data-size=4k --force-unit-access --latency
> >   # echo "" | nvme write /dev/nvme0n1 --block-count=7 --data-size=4k --latency
> > 
> > if you have a 4k LBA format, use "--block-count=0".
> > 
> > And you may want to run each of the above several times to get an average since
> > other factors can affect the reported latency.
> 
> I created a bash script capable of executing the two commands you 
> suggested to me in a period of 10 seconds in a row, to get some more 
> acceptable average. The result is the following:
> 
> root@pve-21:~# for i in /sys/block/*/queue/write_cache; do echo 'write back' > $i; done
> root@pve-21:~# cat /sys/block/nvme0n1/queue/write_cache
> write back
> root@pve-21:~# ./nvme_write.sh
> Total: 10 seconds, 3027 tests. Latency (us) : min: 29  /  avr: 37   /  max: 98
> root@pve-21:~# ./nvme_write.sh --force-unit-access
> Total: 10 seconds, 2985 tests. Latency (us) : min: 29  /  avr: 37   /  max: 111
> root@pve-21:~#
> root@pve-21:~# ./nvme_write.sh --force-unit-access --block-count=0
> Total: 10 seconds, 2556 tests. Latency (us) : min: 404  /  avr: 428   /  max: 492
> root@pve-21:~# ./nvme_write.sh --block-count=0
> Total: 10 seconds, 2521 tests. Latency (us) : min: 403  /  avr: 428   /  max: 496
> root@pve-21:~#
> root@pve-21:~#
> root@pve-21:~# for i in /sys/block/*/queue/write_cache; do echo 'write through' > $i; done
> root@pve-21:~# cat /sys/block/nvme0n1/queue/write_cache
> write through
> root@pve-21:~# ./nvme_write.sh
> Total: 10 seconds, 2988 tests. Latency (us) : min: 29  /  avr: 37   /  max: 114
> root@pve-21:~# ./nvme_write.sh --force-unit-access
> Total: 10 seconds, 2926 tests. Latency (us) : min: 29  /  avr: 36   /  max: 71
> root@pve-21:~#
> root@pve-21:~# ./nvme_write.sh --force-unit-access --block-count=0
> Total: 10 seconds, 2456 tests. Latency (us) : min: 31  /  avr: 428   /  max: 496
> root@pve-21:~# ./nvme_write.sh --block-count=0
> Total: 10 seconds, 2627 tests. Latency (us) : min: 402  /  avr: 428   /  max: 509
> 
> Well, as we can see above, in almost 3k tests run over a period of ten
> seconds with each of the commands, I got even better results than I had
> already gotten with ioping. I did tests with isolated commands as well, but
> I decided to write a bash script so I could execute many commands in a
> short period of time and take an average. And we can see an average of
> about 37us in every situation. Very low!
> 
> However, when using the suggested --block-count=0 option, the latency is
> much higher in every situation, around 428us.
> 
> But as we can see, using the nvme command the latency is always the same in
> every scenario, with or without --force-unit-access; the only difference is
> between the two --block-count variants, i.e. the one intended for a 512b LBA
> format versus the one intended for a 4k LBA format.
> 
> What do you think?

It looks like the NVMe works well except in 512b situations.  It's 
interesting that --force-unit-access doesn't increase the latency: Perhaps 
the NVMe ignores sync flags since it knows it has a non-volatile cache.

-Eric
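
(Aside: Adriano's nvme_write.sh was not posted, so the following is only a
hypothetical sketch of such a wrapper: it loops one of the passthrough writes
for ten seconds and reports min/avg/max latency. It assumes nvme-cli is
installed, that /dev/nvme0n1 is a scratch device -- the write overwrites the
first LBAs -- and that the --latency output line ends in "<N> us"; adjust the
awk parsing to your nvme-cli version.)

#!/bin/bash
# nvme_write.sh (hypothetical sketch): loop an nvme passthrough write for
# ~10 seconds and report min/avg/max of the latency printed by --latency.
# Extra flags (e.g. --force-unit-access) are appended via "$@"; if you pass
# --block-count=0, drop the hard-coded --block-count=7 below.
DEV=/dev/nvme0n1
END=$(( $(date +%s) + 10 ))
count=0; sum=0; min=""; max=""

while [ "$(date +%s)" -lt "$END" ]; do
    # Grab the microsecond value from the "latency: ... <N> us" line.
    lat=$(echo "" | nvme write "$DEV" --block-count=7 --data-size=4k \
              --latency "$@" 2>&1 | awk '/latency/ {print $(NF-1)}')
    [ -z "$lat" ] && continue
    count=$((count + 1)); sum=$((sum + lat))
    if [ -z "$min" ] || [ "$lat" -lt "$min" ]; then min=$lat; fi
    if [ -z "$max" ] || [ "$lat" -gt "$max" ]; then max=$lat; fi
done

if [ "$count" -gt 0 ]; then
    echo "Total: 10 seconds, $count tests. Latency (us) : min: $min  /  avr: $((sum / count))   /  max: $max"
fi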

> 
> Thanks,
> 
> 
> On Monday, May 30, 2022, at 10:45:37 BRT, Keith Busch <kbusch@kernel.org> wrote: 
> 
> On Sun, May 29, 2022 at 11:50:57AM +0000, Adriano Silva wrote:
> 
> > So why the slowness? Is it just the time spent in kernel code to set 
> > FUA and Flush Cache bits on writes that would cause all this latency 
> > increase (84us to 1.89ms)?
> 
> 
> I don't think the kernel's handling accounts for that great of a difference. I
> think the difference is probably on the controller side.
> 
> The NVMe spec says that a Write command with FUA set:
> 
> "the controller shall write that data and metadata, if any, to non-volatile
> media before indicating command completion."
> 
> So if the memory is non-volatile, it can complete the command without writing
> to the backing media. It can also commit the data to the backing media if it
> wants to before completing the command, but those are implementation-specific
> details.
> 
> You can remove the kernel interpretation using passthrough commands. Here's an
> example comparing with and without FUA assuming a 512b logical block format:
> 
>   # echo "" | nvme write /dev/nvme0n1 --block-count=7 --data-size=4k --force-unit-access --latency
>   # echo "" | nvme write /dev/nvme0n1 --block-count=7 --data-size=4k --latency
> 
> If you have a 4k LBA format, use "--block-count=0".
> 
> And you may want to run each of the above several times to get an average since
> other factors can affect the reported latency.
> 


* Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
  2022-06-01 21:11                               ` Eric Wheeler
@ 2022-06-02  5:26                                 ` Christoph Hellwig
  0 siblings, 0 replies; 37+ messages in thread
From: Christoph Hellwig @ 2022-06-02  5:26 UTC (permalink / raw)
  To: Eric Wheeler
  Cc: Adriano Silva, Keith Busch, Matthias Ferdinand, Bcache Linux,
	Coly Li, Christoph Hellwig, linux-block

On Wed, Jun 01, 2022 at 02:11:35PM -0700, Eric Wheeler wrote:
> It looks like the NVMe works well except in 512b situations.  It's 
> interesting that --force-unit-access doesn't increase the latency: Perhaps 
> the NVMe ignores sync flags since it knows it has a non-volatile cache.

NVMe (and other interface) SSDs generally come in two flavors:

 - consumer ones have a volatile write cache and FUA/Flush has a lot of
   overhead
 - enterprise ones with the grossly misnamed "power loss protection"
   feature have a non-volatile write cache and FUA/Flush has no overhead
   at all

If this is an enterprise drive the behavior is expected.  If on the
other hand it is a cheap consumer drive, chances are it just lies; there
have been a few instances of that.
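
(For reference, a quick way to check which flavor a given NVMe drive claims to
be, assuming nvme-cli is installed and the drive in question is nvme0:)

  # nvme id-ctrl /dev/nvme0 | grep -i vwc
  # cat /sys/block/nvme0n1/queue/write_cache

If the VWC bit is clear, the drive reports no volatile write cache and the
kernel defaults the queue to "write through"; if it is set, the queue shows
"write back" and Flush/FUA actually have work to do. (Note that the sysfs knob
was toggled by hand earlier in this thread, so the id-ctrl field is the more
reliable indication here.)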


end of thread

Thread overview: 37+ messages
     [not found] <958894243.922478.1652201375900.ref@mail.yahoo.com>
2022-05-10 16:49 ` Bcache in writes direct with fsync. Are IOPS limited? Adriano Silva
2022-05-11  6:20   ` Matthias Ferdinand
2022-05-11 12:58     ` Adriano Silva
2022-05-11 21:21       ` Matthias Ferdinand
2022-05-18  1:22   ` Eric Wheeler
2022-05-23 14:07     ` Coly Li
2022-05-26 19:15       ` Eric Wheeler
2022-05-27 17:28         ` colyli
2022-05-28  0:58           ` Eric Wheeler
2022-05-23 18:36     ` [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync) Eric Wheeler
2022-05-24  5:34       ` Christoph Hellwig
2022-05-24 20:14         ` Eric Wheeler
2022-05-24 20:34           ` Keith Busch
2022-05-24 21:34             ` Eric Wheeler
2022-05-25  5:20               ` Christoph Hellwig
2022-05-25 18:44                 ` Eric Wheeler
2022-05-26  9:06                   ` Christoph Hellwig
2022-05-28  1:52                 ` Eric Wheeler
2022-05-28  3:57                   ` Keith Busch
2022-05-28  4:59                   ` Christoph Hellwig
2022-05-28 12:57                     ` Adriano Silva
2022-05-29  3:18                       ` Keith Busch
2022-05-31 19:42                         ` Eric Wheeler
2022-05-31 20:22                           ` Keith Busch
2022-05-31 23:04                             ` Eric Wheeler
2022-06-01  0:36                               ` Keith Busch
2022-06-01 18:48                                 ` Eric Wheeler
     [not found]                         ` <2064546094.2440522.1653825057164@mail.yahoo.com>
     [not found]                           ` <YpTKfHHWz27Qugi+@kbusch-mbp.dhcp.thefacebook.com>
2022-06-01 19:27                             ` Adriano Silva
2022-06-01 21:11                               ` Eric Wheeler
2022-06-02  5:26                                 ` Christoph Hellwig
2022-05-25  5:17           ` Christoph Hellwig
     [not found]     ` <681726005.1812841.1653564986700@mail.yahoo.com>
2022-05-26 20:20       ` Bcache in writes direct with fsync. Are IOPS limited? Adriano Silva
2022-05-26 20:28       ` Eric Wheeler
2022-05-27  4:07         ` Adriano Silva
2022-05-28  1:27           ` Eric Wheeler
2022-05-28  7:22             ` Matthias Ferdinand
2022-05-28 12:09               ` Adriano Silva
