From: Philipp Falk <philipp.falk@thinkparq.com>
To: linux-fsdevel@vger.kernel.org
Subject: Throughput drop and high CPU load on fast NVMe drives
Date: Tue, 22 Jun 2021 19:15:58 +0200 [thread overview]
Message-ID: <YNIaztBNK+I5w44w@xps13> (raw)
We are facing a performance issue on XFS and other filesystems running on
fast NVMe drives when reading large amounts of data through the page cache
with fio.
Streaming read performance starts off near the NVMe hardware limit until roughly a system-memory's worth of data has been read. Performance then drops to around half the hardware limit and CPU load increases significantly. Using perf, we were able to establish that most of the CPU time is spent spinning in native_queued_spin_lock_slowpath:
-   58,93%  58,92%  fio  [kernel.kallsyms]  [k] native_queued_spin_lock_slowpath
   - 45,72% __libc_read
        entry_SYSCALL_64_after_hwframe
        do_syscall_64
        ksys_read
        vfs_read
        new_sync_read
        xfs_file_read_iter
        xfs_file_buffered_aio_read
      - generic_file_read_iter
         - 45,72% ondemand_readahead
            - __do_page_cache_readahead
               - 34,64% __alloc_pages_nodemask
                  - 34,34% __alloc_pages_slowpath
                     - 34,33% try_to_free_pages
                          do_try_to_free_pages
                        - shrink_node
                           - 34,33% shrink_lruvec
                              - shrink_inactive_list
                                 - 28,22% shrink_page_list
                                    - 28,10% __remove_mapping
                                       - 28,10% _raw_spin_lock_irqsave
                                            native_queued_spin_lock_slowpath
                                 + 6,10% _raw_spin_lock_irq
               + 11,09% read_pages
When direct I/O is used, hardware-level read throughput is sustained for the entire experiment and CPU load stays low; threads sit in D state (waiting on I/O) most of the time.
Very similar results are described around halfway through this article [1].
Is this a known issue with the page cache under high-throughput I/O? Is there any tuning that can be applied to get around the CPU bottleneck? We have tried disabling readahead on the drives, which led to very poor throughput (roughly a 90% drop). Various other scheduler-related tuning was tried as well, but the results were always similar.
Experiment setup can be found below. I am happy to provide more detail if
required. If this is the wrong place to post this, please kindly let me
know.
Best regards
- Philipp
[1] https://tanelpoder.com/posts/11m-iops-with-10-ssds-on-amd-threadripper-pro-workstation/

Experiment setup:
CPU: 2x Intel(R) Xeon(R) Platinum 8352Y 2.2 GHz, 32c/64t each, 512GB memory
NVMe: 16x 1.6TB, 8 per NUMA node
FS: one XFS per disk, but reproducible on ext4 and ZFS
Kernel: Linux 5.3 (SLES), but reproducible on 5.12 (SUSE Tumbleweed)
NVMe scheduler: both "none" and "mq-deadline", very similar results
fio: 4 threads per NVMe drive, 20GiB of data per thread, ioengine=sync
Sustained read throughput, direct=1: ~52 GiB/s (~3.2 GiB/s per disk)
Sustained read throughput, direct=0: ~25 GiB/s (~1.5 GiB/s per disk)
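As a quick sanity check on the aggregate figures (assuming all 16 drives contribute equally):

```shell
# 16 drives at ~3.2 GiB/s each in the direct=1 case:
awk 'BEGIN { printf "%.1f GiB/s\n", 16 * 3.2 }'   # → 51.2 GiB/s
# 16 drives at ~1.5 GiB/s each in the direct=0 case:
awk 'BEGIN { printf "%.1f GiB/s\n", 16 * 1.5 }'   # → 24.0 GiB/s
```

Both match the measured ~52 GiB/s and ~25 GiB/s totals to within rounding.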
Thread overview: 3 messages
2021-06-22 17:15 Philipp Falk [this message]
2021-06-22 17:36 ` Throughput drop and high CPU load on fast NVMe drives Matthew Wilcox
2021-06-23 11:33 ` Philipp Falk