From: Naohiro Aota <Naohiro.Aota@wdc.com>
To: Roman Mamedov <rm@romanrm.net>
Cc: "linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>,
"wangyugui@e16-tech.com" <wangyugui@e16-tech.com>,
"hch@lst.de" <hch@lst.de>, "clm@meta.com" <clm@meta.com>
Subject: Re: Re: [PATCH 0/2] btrfs: disable inline checksum for multi-dev striped FS
Date: Mon, 22 Jan 2024 07:17:43 +0000 [thread overview]
Message-ID: <irc2v7zqrpbkeehhysq7fccwmguujnkrktknl3d23t2ecwope6@o62qzd4yyxt2> (raw)
In-Reply-To: <20240118141231.5166cdd7@nvm>
On Thu, Jan 18, 2024 at 02:12:31PM +0500, Roman Mamedov wrote:
> On Thu, 18 Jan 2024 17:54:49 +0900
> Naohiro Aota <naohiro.aota@wdc.com> wrote:
>
> > There was a report of write performance regression on 6.5-rc4 on RAID0
> > (4 devices) btrfs [1]. Then, I reported that BTRFS_FS_CSUM_IMPL_FAST
> > and doing the checksum inline can be bad for performance on RAID0
> > setup [2].
> >
> > [1] https://lore.kernel.org/linux-btrfs/20230731152223.4EFB.409509F4@e16-tech.com/
> > [2] https://lore.kernel.org/linux-btrfs/p3vo3g7pqn664mhmdhlotu5dzcna6vjtcoc2hb2lsgo2fwct7k@xzaxclba5tae/
> >
> > While inlining the fast checksum is good for a single (or two) device
> > setup, it is not fast enough for multi-device striped writing.
>
> Personal opinion: this is a very awkward criterion for enabling or disabling
> the inline mode. There can be a RAID0 of SATA HDDs/SSDs that is slower than a
> single PCI-E 4.0 NVMe SSD. In [1] the inline mode slashes the performance from
> 4 GB/sec to 1.5 GB/sec. A single modern SSD is capable of up to 6 GB/sec.
>
> Secondly, less often, there can be a hardware RAID which presents itself to the
> OS as a single device, but is also very fast.
>
> Sure, basing such a decision on anything else, such as a benchmark of the
> actual block device, may not be as feasible.
>
> > So, this series first introduces fs_devices->inline_csum_mode and its
> > sysfs interface to tweak the inline csum behavior (auto/on/off). Then,
> > it disables inline checksum when it finds a block group striping writes
> > across multiple devices.
>
> Has it been determined what improvement enabling the inline mode brings at all,
> and in which setups? Maybe just disable it by default and provide this tweak
> option?
Note: as mentioned by David, I'm going to say "sync checksum" instead of
"inline checksum".
Let me list the benchmark results here.
The sync checksum was introduced by this patch:
https://lore.kernel.org/linux-btrfs/20230503070615.1029820-2-hch@lst.de/
The benchmark described in that patch originated from this email by Chris:
https://lore.kernel.org/linux-btrfs/eb544c31-7a74-d503-83f0-4dc226917d1a@meta.com/
* Device: Intel Optane
* workqueue checksum (Unpatched):
write: IOPS=3316, BW=3316MiB/s (3477MB/s)(200GiB/61757msec); 0 zone resets
* Sync checksum (synchronous CRCs):
write: IOPS=4882, BW=4882MiB/s (5119MB/s)(200GiB/41948msec); 0 zone resets
Christoph also did the same in a KVM guest on consumer drives and got a
similar result. Furthermore, even with "non-accelerated crc32 code", "the
workqueue offload only looked better for really large writes, and then only
marginally."
https://lore.kernel.org/linux-btrfs/20230325081341.GB7353@lst.de/
Then, Wang Yugui reported a regression on both SINGLE and RAID0 setups.
https://lore.kernel.org/linux-btrfs/20230811222321.2AD2.409509F4@e16-tech.com/
* CPU: E5 2680 v2, two NUMA nodes
* RAM: 192G
* Device: NVMe SSD PCIe3 x4
* Btrfs profile: data=raid0, metadata=raid1
- all PCIe3 NVMe SSD are connected to one NVMe HBA/one numa node.
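For reference, a data=raid0/metadata=raid1 filesystem like the one benchmarked
can be created roughly like this (a sketch; the four NVMe device names are
hypothetical, as the exact device count is in the original report):

```shell
# Stripe data across all devices (raid0), mirror metadata (raid1).
mkfs.btrfs -f -d raid0 -m raid1 \
    /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
# Mounting any member device mounts the whole multi-device filesystem.
mount /dev/nvme0n1 /mnt
```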
* workqueue checksum: RAID0:
WRITE: bw=3858MiB/s (4045MB/s)
WRITE: bw=3781MiB/s (3965MB/s)
* sync checksum: RAID0:
WRITE: bw=1311MiB/s (1375MB/s)
WRITE: bw=1435MiB/s (1504MB/s)
* workqueue checksum: SINGLE:
WRITE: bw=3004MiB/s (3149MB/s)
WRITE: bw=2851MiB/s (2990MB/s)
* sync checksum: SINGLE:
WRITE: bw=1337MiB/s (1402MB/s)
WRITE: bw=1413MiB/s (1481MB/s)
So, the workqueue (old) method is way better on his machine.
After a while, I reported that workqueue checksum is better than sync checksum
in a 6-SSD RAID0 case.
https://lore.kernel.org/linux-btrfs/p3vo3g7pqn664mhmdhlotu5dzcna6vjtcoc2hb2lsgo2fwct7k@xzaxclba5tae/
* CPU: Intel(R) Xeon(R) Platinum 8260 CPU, two NUMA nodes, 96 CPUs
* RAM: 1024G
On 6 SSDs RAID0
* workqueue checksum:
WRITE: bw=2106MiB/s (2208MB/s), 2106MiB/s-2106MiB/s (2208MB/s-2208MB/s), io=760GiB (816GB), run=369705-369705msec
* sync checksum:
WRITE: bw=1391MiB/s (1458MB/s), 1391MiB/s-1391MiB/s (1458MB/s-1458MB/s), io=499GiB (536GB), run=367312-367312msec
Or, even with a 1-SSD setup (still a RAID0 profile):
* workqueue checksum:
WRITE: bw=437MiB/s (459MB/s), 437MiB/s-437MiB/s (459MB/s-459MB/s), io=299GiB (321GB), run=699787-699787msec
* sync checksum:
WRITE: bw=442MiB/s (464MB/s), 442MiB/s-442MiB/s (464MB/s-464MB/s), io=302GiB (324GB), run=698553-698553msec
Like Wang Yugui, I got better performance with the workqueue checksum.
I also tested it on an emulated fast device (null_blk with irqmode=0)
today. The device is formatted with the default profile.
* CPU: Intel(R) Xeon(R) Platinum 8260 CPU, two NUMA nodes, 96 CPUs
* RAM: 1024G
* Device: null_blk with irqmode=0, use_per_node_hctx=1, memory_backed=1, size=512000 (512GB)
* Btrfs profile: data=single, metadata=dup
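A rough recipe for this null_blk setup (a sketch; `gb=512` approximates the
512000 MB size, and parameter availability should be checked against your
kernel's null_blk documentation):

```shell
# Emulate a fast, memory-backed block device that completes I/O
# synchronously without interrupts (irqmode=0).
modprobe null_blk nr_devices=1 irqmode=0 use_per_node_hctx=1 \
    memory_backed=1 gb=512
mkfs.btrfs -f /dev/nullb0   # default profile: data=single, metadata=dup
mount /dev/nullb0 /mnt
```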
I ran this fio command with this series applied to tweak the checksum mode.
fio --group_reporting --eta=always --eta-interval=30s --eta-newline=30s \
--rw=write \
--direct=0 --ioengine=psync \
--filesize=${filesize} \
--blocksize=1m \
--end_fsync=1 \
--directory=/mnt \
--name=writer --numjobs=${numjobs}
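The checksum mode was toggled through the sysfs knob added by this series. A
sketch of the workflow (the attribute name and values are the ones proposed in
patch 1; the exact path under /sys/fs/btrfs/ may differ in the final version):

```shell
# Hypothetical knob from this series; <UUID> is the filesystem UUID.
UUID=$(findmnt -no UUID /mnt)
echo off  > /sys/fs/btrfs/$UUID/inline_csum_mode   # force workqueue checksum
echo on   > /sys/fs/btrfs/$UUID/inline_csum_mode   # force sync checksum
echo auto > /sys/fs/btrfs/$UUID/inline_csum_mode   # let btrfs decide
```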
I tried several setups, but I could not get better performance with sync
checksum. Examples are shown below.
With numjobs=96, filesize=2GB
* workqueue checksum (writing "off" to the newly added sysfs file):
WRITE: bw=1776MiB/s (1862MB/s), 1776MiB/s-1776MiB/s (1862MB/s-1862MB/s), io=192GiB (206GB), run=110733-110733msec
* sync checksum (writing "on" to the sysfs file):
WRITE: bw=1037MiB/s (1088MB/s), 1037MiB/s-1037MiB/s (1088MB/s-1088MB/s), io=192GiB (206GB), run=189550-189550msec
With numjobs=368, filesize=512MB
* workqueue checksum:
WRITE: bw=1726MiB/s (1810MB/s), 1726MiB/s-1726MiB/s (1810MB/s-1810MB/s), io=192GiB (206GB), run=113902-113902msec
* sync checksum
WRITE: bw=1221MiB/s (1280MB/s), 1221MiB/s-1221MiB/s (1280MB/s-1280MB/s), io=192GiB (206GB), run=161060-161060msec
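As a sanity check, fio's reported bandwidth is simply total io divided by
runtime; reproducing the numjobs=368 figures from io=192GiB and the run times:

```shell
# bandwidth = io / runtime; io = 192 GiB = 192*1024 MiB,
# runtimes taken from the fio output above (ms converted to s)
awk 'BEGIN{printf "workqueue: %.0f MiB/s\n", 192*1024/113.902}'  # 1726 MiB/s
awk 'BEGIN{printf "sync:      %.0f MiB/s\n", 192*1024/161.060}'  # 1221 MiB/s
```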
Also, I ran a similar experiment on a different machine, which has 32 CPUs
and 128 GB RAM. Since it has less RAM, the filesize is also smaller than
above. And, again, workqueue checksum is slightly better.
* workqueue checksum:
WRITE: bw=298MiB/s (313MB/s), 298MiB/s-298MiB/s (313MB/s-313MB/s), io=32.0GiB (34.4GB), run=109883-109883msec
* sync checksum
WRITE: bw=275MiB/s (288MB/s), 275MiB/s-275MiB/s (288MB/s-288MB/s), io=32.0GiB (34.4GB), run=119169-119169msec
When I started writing this reply, I thought the proper criterion might be the
number of CPUs, or some balance of the number of CPUs vs. disks. But now, as I
could not get "sync checksum" to be better on any setup, I'm getting puzzled.
Is "sync checksum" really still effective? Maybe it's only good on a machine
with few CPUs (~4?) and a single device?
In addition, there is an upcoming change on the workqueue side, which changes
the max number of concurrent workers and is especially relevant on NUMA
machines.
https://lore.kernel.org/all/20240113002911.406791-1-tj@kernel.org/
Anyway, we need more benchmark results to see the effect of "sync checksum"
and "workqueue checksum".
>
> > Naohiro Aota (2):
> > btrfs: introduce inline_csum_mode to tweak inline checksum behavior
> > btrfs: detect multi-dev stripe and disable automatic inline checksum
> >
> > fs/btrfs/bio.c | 14 ++++++++++++--
> > fs/btrfs/sysfs.c | 39 +++++++++++++++++++++++++++++++++++++++
> > fs/btrfs/volumes.c | 20 ++++++++++++++++++++
> > fs/btrfs/volumes.h | 21 +++++++++++++++++++++
> > 4 files changed, 92 insertions(+), 2 deletions(-)
> >
>
>
> --
> With respect,
> Roman
Thread overview: 17+ messages
2024-01-18 8:54 [PATCH 0/2] btrfs: disable inline checksum for multi-dev striped FS Naohiro Aota
2024-01-18 8:54 ` [PATCH 1/2] btrfs: introduce inline_csum_mode to tweak inline checksum behavior Naohiro Aota
2024-01-18 8:54 ` [PATCH 2/2] btrfs: detect multi-dev stripe and disable automatic inline checksum Naohiro Aota
2024-01-19 15:29 ` Johannes Thumshirn
2024-01-22 8:02 ` Naohiro Aota
2024-01-22 21:11 ` David Sterba
2024-01-18 9:12 ` [PATCH 0/2] btrfs: disable inline checksum for multi-dev striped FS Roman Mamedov
2024-01-19 15:49 ` David Sterba
2024-01-22 15:31 ` Naohiro Aota
2024-01-22 7:17 ` Naohiro Aota [this message]
2024-01-19 15:30 ` Johannes Thumshirn
2024-01-19 16:01 ` David Sterba
2024-01-22 15:12 ` Naohiro Aota
2024-01-22 21:19 ` David Sterba
2024-01-24 0:19 ` Wang Yugui
2024-01-29 12:56 ` Wang Yugui
2024-01-30 1:38 ` Naohiro Aota