From: Eric Wheeler <bcache@lists.ewheeler.net>
To: Adriano Silva <adriano_da_silva@yahoo.com.br>
Cc: Keith Busch <kbusch@kernel.org>,
	Matthias Ferdinand <bcache@mfedv.net>,
	Bcache Linux <linux-bcache@vger.kernel.org>,
	Coly Li <colyli@suse.de>, Christoph Hellwig <hch@infradead.org>,
	"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>
Subject: Re: [RFC] Add sysctl option to drop disk flushes in bcache? (was: Bcache in writes direct with fsync)
Date: Wed, 1 Jun 2022 14:11:35 -0700 (PDT)	[thread overview]
Message-ID: <8a95d4f-b263-5231-537d-b1f88fdd5090@ewheeler.net> (raw)
In-Reply-To: <1295433800.3263424.1654111657911@mail.yahoo.com>


On Wed, 1 Jun 2022, Adriano Silva wrote:
> I don't know if my NVMe devices use a 4K LBA format. I don't think so. 
> They are all the same model and manufacturer. I know they accept 
> 512-byte blocks, but their latency is very high when processing blocks 
> of that size.

Ok, since they accept 512b IOs they should be safe from the possible 
bcache bug I was referring to.
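
If you want to confirm the LBA format, the logical block size is visible 
from sysfs or nvme-cli (the device name below is only an example, adjust 
to yours):

  # cat /sys/block/nvme0n1/queue/logical_block_size
  # nvme id-ns /dev/nvme0n1 --human-readable | grep "LBA Format"

The LBA format line marked "(in use)" is the active one.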

> However, in all the tests I run with 4K blocks the results are much 
> better, so I always use 4K blocks. In real life I don't expect to use 
> blocks smaller than 4K anyway.

Makes sense, format with -w 4k.  There is probably some CPU benefit to 
having page-aligned IOs, too.
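
For example, something along these lines when creating the bcache devices 
(a sketch only -- the device names are placeholders; creating the cache 
and backing device in one invocation applies the 4k block size to both):

  # make-bcache -B /dev/sdX -C /dev/nvme0n1 -w 4k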

> > You can remove the kernel interpretation using passthrough commands. Here's an
> > example comparing with and without FUA assuming a 512b logical block format:
> > 
> >   # echo "" | nvme write /dev/nvme0n1 --block-count=7 --data-size=4k --force-unit-access --latency
> >   # echo "" | nvme write /dev/nvme0n1 --block-count=7 --data-size=4k --latency
> > 
> > If you have a 4k LBA format, use "--block-count=0".
> > 
> > And you may want to run each of the above several times to get an average since
> > other factors can affect the reported latency.
> 
> I wrote a bash script that runs each of the two commands you suggested 
> repeatedly for 10 seconds, to get a more representative average. The 
> results are the following:
> 
> root@pve-21:~# for i in /sys/block/*/queue/write_cache; do echo 'write back' > $i; done
> root@pve-21:~# cat /sys/block/nvme0n1/queue/write_cache
> write back
> root@pve-21:~# ./nvme_write.sh
> Total: 10 seconds, 3027 tests. Latency (us) : min: 29  /  avr: 37   /  max: 98
> root@pve-21:~# ./nvme_write.sh --force-unit-access
> Total: 10 seconds, 2985 tests. Latency (us) : min: 29  /  avr: 37   /  max: 111
> root@pve-21:~#
> root@pve-21:~# ./nvme_write.sh --force-unit-access --block-count=0
> Total: 10 seconds, 2556 tests. Latency (us) : min: 404  /  avr: 428   /  max: 492
> root@pve-21:~# ./nvme_write.sh --block-count=0
> Total: 10 seconds, 2521 tests. Latency (us) : min: 403  /  avr: 428   /  max: 496
> root@pve-21:~#
> root@pve-21:~#
> root@pve-21:~# for i in /sys/block/*/queue/write_cache; do echo 'write through' > $i; done
> root@pve-21:~# cat /sys/block/nvme0n1/queue/write_cache
> write through
> root@pve-21:~# ./nvme_write.sh
> Total: 10 seconds, 2988 tests. Latency (us) : min: 29  /  avr: 37   /  max: 114
> root@pve-21:~# ./nvme_write.sh --force-unit-access
> Total: 10 seconds, 2926 tests. Latency (us) : min: 29  /  avr: 36   /  max: 71
> root@pve-21:~#
> root@pve-21:~# ./nvme_write.sh --force-unit-access --block-count=0
> Total: 10 seconds, 2456 tests. Latency (us) : min: 31  /  avr: 428   /  max: 496
> root@pve-21:~# ./nvme_write.sh --block-count=0
> Total: 10 seconds, 2627 tests. Latency (us) : min: 402  /  avr: 428   /  max: 509
> 
> Well, as we can see above, with almost 3k runs of each command over a 
> period of ten seconds, I got even better results than I had with 
> ioping. I also ran the commands in isolation, but wrote the bash script 
> so I could execute many commands in a short time and average them. The 
> average is about 37us in every case. Very low!
> 
> However, with the suggested --block-count=0 variant the latency is much 
> higher in every case, around 428us.
> 
> But as we can see, with the nvme command the latency is the same 
> whether or not --force-unit-access is used; the only difference is 
> between the 4K write and the --block-count=0 variant meant for 
> 4K-LBA-formatted devices (which mine apparently are not).
> 
> What do you think?

It looks like the NVMe performs well except for single 512b writes.  It's 
interesting that --force-unit-access doesn't increase the latency: perhaps 
the controller ignores the FUA/flush flags because it knows its write 
cache is non-volatile.
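
For reference, since your nvme_write.sh wasn't posted, here is a rough 
sketch of what such a timing loop might look like (it assumes a 512b LBA 
format with a 4KiB write by default, GNU date for microsecond timestamps, 
and that extra flags such as --force-unit-access are simply appended to 
the nvme write command):

  #!/bin/bash
  # Rough sketch (hypothetical; not the script used above): hammer
  # 'nvme write' for 10 seconds and report min/avg/max latency in us.
  dev=/dev/nvme0n1
  end=$((SECONDS + 10))
  count=0; total=0; min=; max=
  while [ "$SECONDS" -lt "$end" ]; do
      t0=$(date +%s%6N)                              # microseconds
      echo "" | nvme write "$dev" --block-count=7 --data-size=4k \
          "$@" > /dev/null
      t1=$(date +%s%6N)
      lat=$((t1 - t0))
      total=$((total + lat)); count=$((count + 1))
      if [ -z "$min" ] || [ "$lat" -lt "$min" ]; then min=$lat; fi
      if [ -z "$max" ] || [ "$lat" -gt "$max" ]; then max=$lat; fi
  done
  echo "Total: 10 seconds, $count tests." \
       "Latency (us) : min: $min  /  avr: $((total / count))  /  max: $max"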

-Eric

> 
> Thanks,
> 
> 
> On Monday, May 30, 2022 at 10:45:37 BRT, Keith Busch <kbusch@kernel.org> wrote: 
> 
> On Sun, May 29, 2022 at 11:50:57AM +0000, Adriano Silva wrote:
> 
> > So why the slowness? Is it just the time spent in kernel code to set 
> > FUA and Flush Cache bits on writes that would cause all this latency 
> > increment (84us to 1.89ms) ?
> 
> 
> I don't think the kernel's handling accounts for that great of a difference. I
> think the difference is probably on the controller side.
> 
> The NVMe spec says that a Write command with FUA set:
> 
> "the controller shall write that data and metadata, if any, to non-volatile
> media before indicating command completion."
> 
> So if the memory is non-volatile, it can complete the command without writing
> to the backing media. It can also commit the data to the backing media before
> completing the command if it wants to, but that's an implementation-specific
> detail.
> 
> You can remove the kernel interpretation using passthrough commands. Here's an
> example comparing with and without FUA assuming a 512b logical block format:
> 
>   # echo "" | nvme write /dev/nvme0n1 --block-count=7 --data-size=4k --force-unit-access --latency
>   # echo "" | nvme write /dev/nvme0n1 --block-count=7 --data-size=4k --latency
> 
> If you have a 4k LBA format, use "--block-count=0".
> 
> And you may want to run each of the above several times to get an average since
> other factors can affect the reported latency.
> 

