* Throughput drop and high CPU load on fast NVMe drives
@ 2021-06-22 17:15 Philipp Falk
From: Philipp Falk @ 2021-06-22 17:15 UTC (permalink / raw)
  To: linux-fsdevel

We are facing a performance issue on XFS and other filesystems running on
fast NVMe drives when reading large amounts of data through the page cache
with fio.

Streaming read performance starts off near the NVMe hardware limit and
stays there until roughly a full system memory's worth of data has been
read. Performance then drops to around half the hardware limit and CPU
load increases significantly. Using perf, we were able to establish that
most of the CPU load is caused by spin lock contention in
native_queued_spin_lock_slowpath:

-   58,93%    58,92%  fio              [kernel.kallsyms]         [k] native_queued_spin_lock_slowpath
     45,72% __libc_read
        entry_SYSCALL_64_after_hwframe
        do_syscall_64
        ksys_read
        vfs_read
        new_sync_read
        xfs_file_read_iter
        xfs_file_buffered_aio_read
      - generic_file_read_iter
         - 45,72% ondemand_readahead
            - __do_page_cache_readahead
               - 34,64% __alloc_pages_nodemask
                  - 34,34% __alloc_pages_slowpath
                     - 34,33% try_to_free_pages
                          do_try_to_free_pages
                        - shrink_node
                           - 34,33% shrink_lruvec
                              - shrink_inactive_list
                                 - 28,22% shrink_page_list
                                    - 28,10% __remove_mapping
                                       - 28,10% _raw_spin_lock_irqsave
                                            native_queued_spin_lock_slowpath
                                 + 6,10% _raw_spin_lock_irq
               + 11,09% read_pages

When direct I/O is used, hardware-level read throughput is sustained
during the entire experiment and CPU load stays low. Threads stay in D
state most of the time.
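
For reference, direct=1 with ioengine=sync boils down to roughly the
following access pattern at the syscall level (a minimal, hypothetical
sketch with a placeholder file argument, not fio's actual internals):

/*
 * Minimal sketch of a direct=1 streaming read: open with O_DIRECT and
 * issue large, block-aligned reads from an aligned buffer, bypassing
 * the page cache.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        const size_t bufsz = 1 << 20;           /* 1 MiB per read */
        void *buf;
        ssize_t n;
        int fd;

        if (argc < 2) {
                fprintf(stderr, "usage: %s <file>\n", argv[0]);
                return 1;
        }

        fd = open(argv[1], O_RDONLY | O_DIRECT);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* O_DIRECT needs an aligned buffer; 4096 covers common block sizes. */
        if (posix_memalign(&buf, 4096, bufsz)) {
                fprintf(stderr, "posix_memalign failed\n");
                return 1;
        }

        /* Stream the file; data goes straight from the device to buf. */
        while ((n = read(fd, buf, bufsz)) > 0)
                ;
        if (n < 0)
                perror("read");

        free(buf);
        close(fd);
        return 0;
}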

Very similar results are described around half-way through this article
[1].

Is this a known issue with the page cache and high-throughput I/O? Is
there any tuning that can be applied to get around the CPU bottleneck? We
have tried disabling readahead on the drives, which led to very poor
throughput (a drop of roughly 90%). Various other scheduler-related tuning
was tried as well, but the results were always similar.

Experiment setup can be found below. I am happy to provide more detail if
required. If this is the wrong place to post this, please kindly let me
know.

Best regards
- Philipp

[1] https://tanelpoder.com/posts/11m-iops-with-10-ssds-on-amd-threadripper-pro-workstation/

Experiment setup:

CPU: 2x Intel(R) Xeon(R) Platinum 8352Y 2.2 GHz, 32c/64t each, 512GB memory
NVMe: 16x 1.6TB, 8 per NUMA node
FS: one XFS per disk, but reproducible on ext4 and ZFS
Kernel: Linux 5.3 (SLES), but reproducible on 5.12 (SUSE Tumbleweed)
NVMe scheduler: both "none" and "mq-deadline", very similar results
fio: 4 threads per NVMe drive, 20 GiB of data per thread, ioengine=sync
  Sustained read throughput, direct=1: ~52 GiB/s (~3.2 GiB/s per disk)
  Sustained read throughput, direct=0: ~25 GiB/s (~1.5 GiB/s per disk)


* Re: Throughput drop and high CPU load on fast NVMe drives
@ 2021-06-22 17:36 Matthew Wilcox
From: Matthew Wilcox @ 2021-06-22 17:36 UTC (permalink / raw)
  To: Philipp Falk; +Cc: linux-fsdevel

On Tue, Jun 22, 2021 at 07:15:58PM +0200, Philipp Falk wrote:
> We are facing a performance issue on XFS and other filesystems running on
> fast NVMe drives when reading large amounts of data through the page cache
> with fio.
> 
> Streaming read performance starts off near the NVMe hardware limit and
> stays there until roughly a full system memory's worth of data has been
> read. Performance then drops to around half the hardware limit and CPU
> load increases significantly. Using perf, we were able to establish that
> most of the CPU load is caused by spin lock contention in
> native_queued_spin_lock_slowpath:
[...]
> When direct I/O is used, hardware-level read throughput is sustained
> during the entire experiment and CPU load stays low. Threads stay in D
> state most of the time.
> 
> Very similar results are described around half-way through this article
> [1].
> 
> Is this a known issue with the page cache and high-throughput I/O? Is
> there any tuning that can be applied to get around the CPU bottleneck? We
> have tried disabling readahead on the drives, which led to very poor
> throughput (a drop of roughly 90%). Various other scheduler-related
> tuning was tried as well, but the results were always similar.

Yes, this is a known issue.  Here's what's happening:

 - The machine hits its low memory watermarks and starts trying to
   reclaim.  There's one kswapd per node, so both nodes go to work
   trying to reclaim memory (each kswapd tries to handle the memory
   attached to its node)
 - But all the memory is allocated to the same file, so both kswapd
   instances try to remove the pages from the same file, and necessarily
   contend on the same spinlock.
 - The process trying to stream the file is also trying to acquire this
   spinlock in order to add its newly-allocated pages to the file.

What you can do is force the page cache to only allocate memory from the
local node.  That means this workload will only use half the memory in
the machine, but it's a streaming workload, so that shouldn't matter?

The only problem is, I'm not sure what the user interface is to make
that happen.  Here's what it looks like inside the kernel:

        /* mm/filemap.c: __page_cache_alloc() */
        if (cpuset_do_page_mem_spread()) {
                unsigned int cpuset_mems_cookie;
                do {
                        cpuset_mems_cookie = read_mems_allowed_begin();
                        n = cpuset_mem_spread_node();
                        page = __alloc_pages_node(n, gfp, 0);
                } while (!page && read_mems_allowed_retry(cpuset_mems_cookie));

                return page;
        }
        return alloc_pages(gfp, 0);

so it's something to do with cpusets?
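
One candidate userspace interface (untested sketch, and it assumes that
the page cache pages allocated on the reader's behalf honour the task's
memory policy rather than only the cpuset spread logic above) is to bind
the benchmark processes to the local node with libnuma; running them
under numactl --cpunodebind=N --membind=N should be the command-line
equivalent:

/*
 * Untested sketch: bind this task and its children (e.g. the fio
 * processes) to the CPUs and memory of node 0, on the assumption that
 * page cache allocations done on their behalf follow the task's memory
 * policy.  Build with -lnuma.
 */
#include <numa.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        struct bitmask *nodes;

        if (argc < 2) {
                fprintf(stderr, "usage: %s <command> [args...]\n", argv[0]);
                return 1;
        }
        if (numa_available() < 0) {
                fprintf(stderr, "NUMA not supported on this system\n");
                return 1;
        }

        nodes = numa_parse_nodestring("0");     /* local node only */
        if (!nodes) {
                fprintf(stderr, "bad node string\n");
                return 1;
        }
        numa_bind(nodes);                       /* restrict CPUs and memory */
        numa_bitmask_free(nodes);

        execvp(argv[1], argv + 1);              /* run the reader under the binding */
        perror("execvp");
        return 1;
}

The cpuset route would presumably be the cpuset controller's cpuset.mems
(plus cpuset.memory_spread_page) files, but I haven't verified that for
this path.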



* Re: Throughput drop and high CPU load on fast NVMe drives
@ 2021-06-23 11:33 Philipp Falk
From: Philipp Falk @ 2021-06-23 11:33 UTC (permalink / raw)
  To: linux-fsdevel

* Matthew Wilcox <willy@infradead.org> [210622 19:37]:
> Yes, this is a known issue.  Here's what's happening:
>
>  - The machine hits its low memory watermarks and starts trying to
>    reclaim.  There's one kswapd per node, so both nodes go to work
>    trying to reclaim memory (each kswapd tries to handle the memory
>    attached to its node)
>  - But all the memory is allocated to the same file, so both kswapd
>    instances try to remove the pages from the same file, and necessarily
>    contend on the same spinlock.
>  - The process trying to stream the file is also trying to acquire this
>    spinlock in order to add its newly-allocated pages to the file.
>

Thank you for the detailed explanation. In this benchmark scenario, every
thread (4 per NVMe drive) uses its own file, so there are reads from 64
files in flight at the same time. The individual files are only 20 GiB in
size, so the kswapd instances must be handling memory allocated to
multiple files at once, right?

But both kswapd instances are probably still contending for the same
spinlocks on several of those files, then.

> What you can do is force the page cache to only allocate memory from the
> local node.  That means this workload will only use half the memory in
> the machine, but it's a streaming workload, so that shouldn't matter?
>
> The only problem is, I'm not sure what the user interface is to make
> that happen.  Here's what it looks like inside the kernel:

I repeated the benchmark and bound the fio threads to the NUMA nodes their
respective disks are connected to. I also forced memory allocation to be
local to those NUMA nodes and confirmed that cache allocation really only
happens in half of the memory when only the threads on one NUMA node run.
I am not sure whether that is enough to ensure that only one kswapd is
actively trying to remove pages.

In both cases (threads running on only one NUMA node, and NUMA-bound
threads running on both nodes) the throughput drop occurred once half of
the memory, respectively all of it, was exhausted.

Does that mean that it isn't the two kswapd threads contending for the
locks, but rather the process itself and the local kswapd? Is there
anything else we could do to improve the situation?

- Philipp
