From: Naohiro Aota <Naohiro.Aota@wdc.com>
To: Roman Mamedov <rm@romanrm.net>
Cc: "linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>,
	"wangyugui@e16-tech.com" <wangyugui@e16-tech.com>,
	"hch@lst.de" <hch@lst.de>, "clm@meta.com" <clm@meta.com>
Subject: Re: Re: [PATCH 0/2] btrfs: disable inline checksum for multi-dev striped FS
Date: Mon, 22 Jan 2024 07:17:43 +0000	[thread overview]
Message-ID: <irc2v7zqrpbkeehhysq7fccwmguujnkrktknl3d23t2ecwope6@o62qzd4yyxt2> (raw)
In-Reply-To: <20240118141231.5166cdd7@nvm>

On Thu, Jan 18, 2024 at 02:12:31PM +0500, Roman Mamedov wrote:
> On Thu, 18 Jan 2024 17:54:49 +0900
> Naohiro Aota <naohiro.aota@wdc.com> wrote:
>
> > There was a report of write performance regression on 6.5-rc4 on RAID0
> > (4 devices) btrfs [1]. Then, I reported that BTRFS_FS_CSUM_IMPL_FAST
> > and doing the checksum inline can be bad for performance on RAID0
> > setup [2].
> >
> > [1] https://lore.kernel.org/linux-btrfs/20230731152223.4EFB.409509F4@e16-tech.com/
> > [2] https://lore.kernel.org/linux-btrfs/p3vo3g7pqn664mhmdhlotu5dzcna6vjtcoc2hb2lsgo2fwct7k@xzaxclba5tae/
> >
> > While inlining the fast checksum is good for a single (or two) device
> > setup, it is not fast enough for multi-device striped writing.
>
> Personal opinion: it is a very awkward criterion for enabling or disabling the
> inline mode. There can be a RAID0 of SATA HDDs/SSDs that will be slower than a
> single PCI-E 4.0 NVMe SSD. In [1] the inline mode slashes the performance from
> 4 GB/sec to 1.5 GB/sec. A single modern SSD is capable of up to 6 GB/sec.
>
> Secondly, less often, there can be a hardware RAID which presents itself to the
> OS as a single device, but is also very fast.
>
> Sure, basing such a decision on anything else, such as a benchmark of the
> actual block device, may not be as feasible.
>
> > So, this series first introduces fs_devices->inline_csum_mode and its
> > sysfs interface to tweak the inline csum behavior (auto/on/off). Then,
> > it disables inline checksum when it finds a block group striping writes
> > across multiple devices.
>
> Has it been determined what improvement enabling the inline mode brings at all,
> and in which setups? Maybe just disable it by default and provide this tweak
> option?

Note: as mentioned by David, I'm going to say "sync checksum" instead of
"inline checksum".

I'm going to list the benchmark results here.

Sync checksum was introduced by this patch:

https://lore.kernel.org/linux-btrfs/20230503070615.1029820-2-hch@lst.de/

The benchmark described in the patch originates from this email by Chris:

https://lore.kernel.org/linux-btrfs/eb544c31-7a74-d503-83f0-4dc226917d1a@meta.com/

* Device: Intel Optane

* workqueue checksum (Unpatched):
  write: IOPS=3316, BW=3316MiB/s (3477MB/s)(200GiB/61757msec); 0 zone resets

* Sync checksum (synchronous CRCs):
  write: IOPS=4882, BW=4882MiB/s (5119MB/s)(200GiB/41948msec); 0 zone resets

Christoph also ran the same benchmark in KVM on consumer drives and got a
similar result. Furthermore, even with "non-accelerated crc32 code", "the
workqueue offload only looked better for really large writes, and then only
marginally."

https://lore.kernel.org/linux-btrfs/20230325081341.GB7353@lst.de/

Then, Wang Yugui reported a regression on both a SINGLE setup and a RAID0 setup.

https://lore.kernel.org/linux-btrfs/20230811222321.2AD2.409509F4@e16-tech.com/

* CPU: E5 2680 v2, two NUMA nodes
* RAM: 192G
* Device: NVMe SSD PCIe3 x4
* Btrfs profile: data=raid0, metadata=raid1
  - all PCIe3 NVMe SSDs are connected to one NVMe HBA on one NUMA node.

* workqueue checksum: RAID0:
  WRITE: bw=3858MiB/s (4045MB/s)
  WRITE: bw=3781MiB/s (3965MB/s)
* sync checksum: RAID0:
  WRITE: bw=1311MiB/s (1375MB/s)
  WRITE: bw=1435MiB/s (1504MB/s)

* workqueue checksum: SINGLE:
  WRITE: bw=3004MiB/s (3149MB/s)
  WRITE: bw=2851MiB/s (2990MB/s)
* sync checksum: SINGLE:
  WRITE: bw=1337MiB/s (1402MB/s)
  WRITE: bw=1413MiB/s (1481MB/s)

So, the workqueue (old) method is far better on his machine.

After a while, I reported that workqueue checksum is better than sync
checksum in a 6-SSD RAID0 case.

https://lore.kernel.org/linux-btrfs/p3vo3g7pqn664mhmdhlotu5dzcna6vjtcoc2hb2lsgo2fwct7k@xzaxclba5tae/

* CPU: Intel(R) Xeon(R) Platinum 8260 CPU, two NUMA nodes, 96 CPUs
* RAM: 1024G

On a 6-SSD RAID0:
* workqueue checksum:
  WRITE: bw=2106MiB/s (2208MB/s), 2106MiB/s-2106MiB/s (2208MB/s-2208MB/s), io=760GiB (816GB), run=369705-369705msec
* sync checksum:
  WRITE: bw=1391MiB/s (1458MB/s), 1391MiB/s-1391MiB/s (1458MB/s-1458MB/s), io=499GiB (536GB), run=367312-367312msec

Or, even with a 1-SSD setup (still the RAID0 profile):
* workqueue checksum:
  WRITE: bw=437MiB/s (459MB/s), 437MiB/s-437MiB/s (459MB/s-459MB/s), io=299GiB (321GB), run=699787-699787msec
* sync checksum:
  WRITE: bw=442MiB/s (464MB/s), 442MiB/s-442MiB/s (464MB/s-464MB/s), io=302GiB (324GB), run=698553-698553msec

Like Wang Yugui, I got better performance with the workqueue checksum.

I also tested it on an emulated fast device (null_blk with irqmode=0)
today. The device was formatted with the default profile.

* CPU: Intel(R) Xeon(R) Platinum 8260 CPU, two NUMA nodes, 96 CPUs
* RAM: 1024G
* Device: null_blk with irqmode=0, use_per_node_hctx=1, memory_backed=1, size=512000 (512GB)
* Btrfs profile: data=single, metadata=dup
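
For reference, a null_blk device like this can be created roughly as
follows via null_blk's configfs interface. This is only a sketch: the
attribute values mirror the parameters listed above (size is given in MB
there), and the device/mount paths are just examples.

modprobe null_blk nr_devices=0            # no default device
mount -t configfs none /sys/kernel/config # if not already mounted
mkdir /sys/kernel/config/nullb/nullb0
cd /sys/kernel/config/nullb/nullb0
echo 0 > irqmode            # complete I/Os immediately
echo 1 > use_per_node_hctx
echo 1 > memory_backed
echo 512000 > size          # in MB
echo 1 > power              # bring up /dev/nullb0 (name may vary by kernel)
mkfs.btrfs -f /dev/nullb0   # default profile: data=single, metadata=dup
mount /dev/nullb0 /mnt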

I ran the following fio command, with this series applied so that the
checksum mode can be tweaked.

fio --group_reporting --eta=always --eta-interval=30s --eta-newline=30s \
    --rw=write \
    --direct=0 --ioengine=psync \
    --filesize=${filesize} \
    --blocksize=1m \
    --end_fsync=1 \
    --directory=/mnt \
    --name=writer --numjobs=${numjobs}

I tried several setups, but I could not get better performance with sync
checksum. Examples are shown below.

With numjobs=96, filesize=2GB
* workqueue checksum: (writing "off" to the newly added sysfs file)
  WRITE: bw=1776MiB/s (1862MB/s), 1776MiB/s-1776MiB/s (1862MB/s-1862MB/s), io=192GiB (206GB), run=110733-110733msec
* sync checksum:      (writing "on" to the sysfs file)
  WRITE: bw=1037MiB/s (1088MB/s), 1037MiB/s-1037MiB/s (1088MB/s-1088MB/s), io=192GiB (206GB), run=189550-189550msec
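
(For concreteness, "writing off/on to the sysfs file" means switching the
mode between runs roughly as below, assuming the knob from patch 1 shows
up as "inline_csum_mode" in the per-filesystem sysfs directory; see patch
1 for the exact attribute name and path.)

UUID=$(findmnt -no UUID /mnt)
echo off  > /sys/fs/btrfs/$UUID/inline_csum_mode   # workqueue checksum
echo on   > /sys/fs/btrfs/$UUID/inline_csum_mode   # sync checksum
echo auto > /sys/fs/btrfs/$UUID/inline_csum_mode   # let btrfs decide (default)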

With numjobs=368, filesize=512MB
* workqueue checksum:
  WRITE: bw=1726MiB/s (1810MB/s), 1726MiB/s-1726MiB/s (1810MB/s-1810MB/s), io=192GiB (206GB), run=113902-113902msec
* sync checksum
  WRITE: bw=1221MiB/s (1280MB/s), 1221MiB/s-1221MiB/s (1280MB/s-1280MB/s), io=192GiB (206GB), run=161060-161060msec

Also, I ran a similar experiment on a different machine, which has 32 CPUs
and 128 GB of RAM. Since it has less RAM, the filesize is also smaller than
above. And, again, workqueue checksum is slightly better.

* workqueue checksum:
  WRITE: bw=298MiB/s (313MB/s), 298MiB/s-298MiB/s (313MB/s-313MB/s), io=32.0GiB (34.4GB), run=109883-109883msec
* sync checksum
  WRITE: bw=275MiB/s (288MB/s), 275MiB/s-275MiB/s (288MB/s-288MB/s), io=32.0GiB (34.4GB), run=119169-119169msec


When I started writing this reply, I thought the proper criterion might be
the number of CPUs, or some balance of the number of CPUs vs. disks. But
now, as I could not get "sync checksum" to be better on any setup, I'm
getting puzzled. Is "sync checksum" really still effective? Maybe it is
good on a machine with fewer CPUs (~4?) and a single device?

In addition, there is also an upcoming change on the workqueue side, which
changes the maximum number of working jobs and is especially relevant for
NUMA machines.

https://lore.kernel.org/all/20240113002911.406791-1-tj@kernel.org/

Anyway, we need more benchmark results to compare "sync checksum" and
"workqueue checksum".

>
> > Naohiro Aota (2):
> >   btrfs: introduce inline_csum_mode to tweak inline checksum behavior
> >   btrfs: detect multi-dev stripe and disable automatic inline checksum
> >
> >  fs/btrfs/bio.c     | 14 ++++++++++++--
> >  fs/btrfs/sysfs.c   | 39 +++++++++++++++++++++++++++++++++++++++
> >  fs/btrfs/volumes.c | 20 ++++++++++++++++++++
> >  fs/btrfs/volumes.h | 21 +++++++++++++++++++++
> >  4 files changed, 92 insertions(+), 2 deletions(-)
> >
>
>
> --
> With respect,
> Roman


Thread overview: 17+ messages
2024-01-18  8:54 [PATCH 0/2] btrfs: disable inline checksum for multi-dev striped FS Naohiro Aota
2024-01-18  8:54 ` [PATCH 1/2] btrfs: introduce inline_csum_mode to tweak inline checksum behavior Naohiro Aota
2024-01-18  8:54 ` [PATCH 2/2] btrfs: detect multi-dev stripe and disable automatic inline checksum Naohiro Aota
2024-01-19 15:29   ` Johannes Thumshirn
2024-01-22  8:02     ` Naohiro Aota
2024-01-22 21:11     ` David Sterba
2024-01-18  9:12 ` [PATCH 0/2] btrfs: disable inline checksum for multi-dev striped FS Roman Mamedov
2024-01-19 15:49   ` David Sterba
2024-01-22 15:31     ` Naohiro Aota
2024-01-22  7:17   ` Naohiro Aota [this message]
2024-01-19 15:30 ` Johannes Thumshirn
2024-01-19 16:01 ` David Sterba
2024-01-22 15:12   ` Naohiro Aota
2024-01-22 21:19     ` David Sterba
2024-01-24  0:19 ` Wang Yugui
2024-01-29 12:56   ` Wang Yugui
2024-01-30  1:38     ` Naohiro Aota
