From: Yu Zhao <firstname.lastname@example.org>
To: Dave Chinner <email@example.com>
Cc: Jens Axboe <firstname.lastname@example.org>,
	SeongJae Park <email@example.com>,
	Linux-MM <firstname.lastname@example.org>,
	Andi Kleen <email@example.com>,
	Andrew Morton <firstname.lastname@example.org>,
	Benjamin Manes <email@example.com>,
	Dave Hansen <firstname.lastname@example.org>,
	Hillf Danton <email@example.com>,
	Johannes Weiner <firstname.lastname@example.org>,
	Jonathan Corbet <email@example.com>,
	Joonsoo Kim <firstname.lastname@example.org>,
	Matthew Wilcox <email@example.com>,
	Mel Gorman <firstname.lastname@example.org>,
	Miaohe Lin <email@example.com>,
	Michael Larabel <firstname.lastname@example.org>,
	Michal Hocko <email@example.com>,
	Michel Lespinasse <firstname.lastname@example.org>,
	Rik van Riel <email@example.com>,
	Roman Gushchin <firstname.lastname@example.org>,
	Rong Chen <email@example.com>,
	SeongJae Park <firstname.lastname@example.org>,
	Tim Chen <email@example.com>,
	Vlastimil Babka <firstname.lastname@example.org>,
	Yang Shi <email@example.com>,
	Ying Huang <firstname.lastname@example.org>,
	Zi Yan <email@example.com>,
	linux-kernel <firstname.lastname@example.org>,
	email@example.com,
	Kernel Page Reclaim v2 <firstname.lastname@example.org>
Subject: Re: [PATCH v2 00/16] Multigenerational LRU Framework
Date: Wed, 14 Apr 2021 04:00:05 -0600
Message-ID: <YHa9Ja6e17f2LeKA@google.com>
In-Reply-To: <CAOUHufa5id9mmjud-UQd4agLCtmDypdNDStkxgoQxsUoh8Qcsg@mail.gmail.com>

On Wed, Apr 14, 2021 at 01:16:52AM -0600, Yu Zhao wrote:
> On Tue, Apr 13, 2021 at 10:50 PM Dave Chinner <email@example.com> wrote:
> >
> > On Tue, Apr 13, 2021 at 09:40:12PM -0600, Yu Zhao wrote:
> > > On Tue, Apr 13, 2021 at 5:14 PM Dave Chinner <firstname.lastname@example.org> wrote:
> > > > On Tue, Apr 13, 2021 at 10:13:24AM -0600, Jens Axboe wrote:
> > > > > On 4/13/21 1:51 AM, SeongJae Park wrote:
> > > > > > From: SeongJae Park <email@example.com>
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > Very interesting work, thank you for sharing this :)
> > > > > >
> > > > > > On Tue, 13 Apr 2021 00:56:17 -0600 Yu Zhao <firstname.lastname@example.org> wrote:
> > > > > >
> > > > > >> What's new in v2
> > > > > >> ================
> > > > > >> Special thanks to Jens Axboe for reporting a regression in buffered
> > > > > >> I/O and helping test the fix.
> > > > > >
> > > > > > Is the discussion open? If so, could you please give me a link?
> > > > >
> > > > > I wasn't on the initial post (or any of the lists it was posted to), but
> > > > > it's on the google page reclaim list. Not sure if that is public or not.
> > > > >
> > > > > tldr is that I was pretty excited about this work, as buffered IO tends
> > > > > to suck (a lot) for high throughput applications. My test case was
> > > > > pretty simple:
> > > > >
> > > > > Randomly read a fast device, using 4k buffered IO, and watch what
> > > > > happens when the page cache gets filled up. For this particular test,
> > > > > we'll initially be doing 2.1GB/sec of IO, and then drop to 1.5-1.6GB/sec
> > > > > with kswapd using a lot of CPU trying to keep up. That's mainline
> > > > > behavior.
> > > >
> > > > I see this exact same behaviour here, too, but I RCA'd it to
> > > > contention between the inode and memory reclaim for the mapping
> > > > structure that indexes the page cache. Basically the mapping tree
> > > > lock is the contention point here - you can either be adding pages
> > > > to the mapping during IO, or memory reclaim can be removing pages
> > > > from the mapping, but we can't do both at once.
> > > >
> > > > So we end up with kswapd spinning on the mapping tree lock like so
> > > > when doing 1.6GB/s in 4kB buffered IO:
> > > >
> > > > -   20.06%     0.00%  [kernel]  [k] kswapd
> > > >    - 20.06% kswapd
> > > >       - 20.05% balance_pgdat
> > > >          - 20.03% shrink_node
> > > >             - 19.92% shrink_lruvec
> > > >                - 19.91% shrink_inactive_list
> > > >                   - 19.22% shrink_page_list
> > > >                      - 17.51% __remove_mapping
> > > >                         - 14.16% _raw_spin_lock_irqsave
> > > >                            - 14.14% do_raw_spin_lock
> > > >                                 __pv_queued_spin_lock_slowpath
> > > >                         - 1.56% __delete_from_page_cache
> > > >                              0.63% xas_store
> > > >                         - 0.78% _raw_spin_unlock_irqrestore
> > > >                            - 0.69% do_raw_spin_unlock
> > > >                                 __raw_callee_save___pv_queued_spin_unlock
> > > >                      - 0.82% free_unref_page_list
> > > >                         - 0.72% free_unref_page_commit
> > > >                              0.57% free_pcppages_bulk
> > > >
> > > > And these are the processes consuming CPU:
> > > >
> > > >    5171 root      20   0 1442496   5696   1284 R  99.7   0.0   1:07.78 fio
> > > >    1150 root      20   0       0      0      0 S  47.4   0.0   0:22.70 kswapd1
> > > >    1146 root      20   0       0      0      0 S  44.0   0.0   0:21.85 kswapd0
> > > >    1152 root      20   0       0      0      0 S  39.7   0.0   0:18.28 kswapd3
> > > >    1151 root      20   0       0      0      0 S  15.2   0.0   0:12.14 kswapd2
> > > >
> > > > i.e. when memory reclaim kicks in, the read process has 20% less
> > > > time with exclusive access to the mapping tree to insert new pages.
> > > > Hence buffered read performance goes down quite substantially when
> > > > memory reclaim kicks in, and this really has nothing to do with the
> > > > memory reclaim LRU scanning algorithm.
> > > >
> > > > I can actually get this machine to pin those 5 processes to 100% CPU
> > > > under certain conditions. Each process is spinning all that extra
> > > > time on the mapping tree lock, and performance degrades further.
> > > > Changing the LRU reclaim algorithm won't fix this - the workload is
> > > > solidly bound by the exclusive nature of the mapping tree lock and
> > > > the number of tasks trying to obtain it exclusively...
> > > >
> > > > > The initial posting of this patchset did no better, in fact it did a bit
> > > > > worse. Performance dropped to the same levels and kswapd was using as
> > > > > much CPU as before, but on top of that we also got excessive swapping.
> > > > > Not at a high rate, but 5-10MB/sec continually.
> > > > >
> > > > > I had some back and forths with Yu Zhao and tested a few new revisions,
> > > > > and the current series does much better in this regard. Performance
> > > > > still dips a bit when page cache fills, but not nearly as much, and
> > > > > kswapd is using less CPU than before.
> > > >
> > > > Profiles would be interesting, because it sounds to me like reclaim
> > > > *might* be batching page cache removal better (e.g. fewer, larger
> > > > batches) and so spending less time contending on the mapping tree
> > > > lock...
> > > >
> > > > IOWs, I suspect this result might actually be a result of less lock
> > > > contention due to a change in batch processing characteristics of
> > > > the new algorithm rather than it being a "better" algorithm...
> > >
> > > I appreciate the profile. But there is no batching in
> > > __remove_mapping() -- it locks the mapping for each page, and
> > > therefore the lock contention penalizes the mainline and this patchset
> > > equally. It looks worse on your system because the four kswapd threads
> > > from different nodes were working on the same file.
> >
> > I think you misunderstand exactly what I mean by "batching" here.
> > I'm not talking about doing multiple pieces of work under a single
> > lock. What I mean is that the overall amount of work done in a
> > single reclaim scan (i.e a "reclaim batch") is packaged differently.
> >
> > We already batch up page reclaim via building a page list and then
> > passing it to shrink_page_list() to process the batch of pages in a
> > single pass. Each page in this page list batch then calls
> > remove_mapping() to pull the page from the LRU, we have a run of
> > contention between the foreground read() thread and the background
> > kswapd.
> >
> > If the size or nature of the pages in the batch passed to
> > shrink_page_list() changes, then the amount of time a reclaim batch
> > is going to put pressure on the mapping tree lock will also change.
> > That's the "change in batching behaviour" I'm referring to here. I
> > haven't read through the patchset to determine if you change the
> > shrink_page_list() algorithm, but it likely changes what is passed
> > to be reclaimed and that in turn changes the locking patterns that
> > fall out of shrink_page_list...
>
> Ok, if we are talking about the size of the batch passed to
> shrink_page_list(), both the mainline and this patchset cap it at
> SWAP_CLUSTER_MAX, which is 32. There are corner cases, but when
> running fio/io_uring, it's safe to say both use 32.
>
> > > And kswapd is only one of two paths that could affect the performance.
> > > The kernel context of the test process is where the improvement mainly
> > > comes from.
> > >
> > > I also suspect you were testing a file much larger than your memory
> > > size. If so, sorry to tell you that a file only a few times larger,
> > > e.g. twice, would be worse.
> > >
> > > Here is my take:
> > >
> > > Claim
> > > -----
> > > This patchset is a "better" algorithm. (Technically it's not an
> > > algorithm, it's a feedback loop.)
> > >
> > > Theoretical basis
> > > -----------------
> > > An open-loop control (the mainline) can only be better if the margin
> > > of error in its prediction of the future events is less than that from
> > > the trial-and-error of a closed-loop control (this patchset). For
> > > simple machines, it surely can.
> > > For page reclaim, AFAIK, it can't.
> > >
> > > A typical example: when randomly accessing a (not infinitely) large
> > > file via buffered io long enough, we're bound to hit the same blocks
> > > multiple times. Should we activate the pages containing those blocks,
> > > i.e., to move them to the active lru list? No.
> > >
> > > RCA
> > > ---
> > > For the fio/io_uring benchmark, the "No" is the key.
> > >
> > > The mainline activates pages accessed multiple times. This is done in
> > > the buffered io access path by mark_page_accessed(), and it takes the
> > > lru lock, which is contended under memory pressure. This contention
> > > slows down both the access path and kswapd. But kswapd is not the
> > > problem here because we are measuring the io_uring process, not kswapd.
> > >
> > > For this patchset, there are no activations since the refault rates of
> > > pages accessed multiple times are similar to those accessed only once
> > > -- activations will only be done to pages from tiers with higher
> > > refault rates.
> > >
> > > If you wish to debunk
> > > ---------------------
> >
> > Nope, it's your job to convince us that it works, not the other way
> > around. It's up to you to prove that your assertions are correct,
> > not for us to prove they are false.
>
> Just trying to keep people motivated, my homework is my own.
>
> > > git fetch https://linux-mm.googlesource.com/page-reclaim refs/changes/73/1173/1
> > >
> > > CONFIG_LRU_GEN=y
> > > CONFIG_LRU_GEN_ENABLED=y
> > >
> > > Run your benchmarks
> > >
> > > Profiles (200G mem + 400G file)
> > > -------------------------------
> > > A quick test from Jens' fio/io_uring:
> > >
> > > -rc7
> > >     13.30%  io_uring  xas_load
> > >     13.22%  io_uring  _copy_to_iter
> > >     12.30%  io_uring  __add_to_page_cache_locked
> > >      7.43%  io_uring  clear_page_erms
> > >      4.18%  io_uring  filemap_get_read_batch
> > >      3.54%  io_uring  get_page_from_freelist
> > >      2.98%  io_uring  ***native_queued_spin_lock_slowpath***
> > >      1.61%  io_uring  page_cache_ra_unbounded
> > >      1.16%  io_uring  xas_start
> > >      1.08%  io_uring  filemap_read
> > >      1.07%  io_uring  ***__activate_page***
> > >
> > > lru lock: 2.98% (lru addition + activation)
> > > activation: 1.07%
> > >
> > > -rc7 + this patchset
> > >     14.44%  io_uring  xas_load
> > >     14.14%  io_uring  _copy_to_iter
> > >     11.15%  io_uring  __add_to_page_cache_locked
> > >      6.56%  io_uring  clear_page_erms
> > >      4.44%  io_uring  filemap_get_read_batch
> > >      2.14%  io_uring  get_page_from_freelist
> > >      1.32%  io_uring  page_cache_ra_unbounded
> > >      1.20%  io_uring  psi_group_change
> > >      1.18%  io_uring  filemap_read
> > >      1.09%  io_uring  ****native_queued_spin_lock_slowpath****
> > >      1.08%  io_uring  do_mpage_readpage
> > >
> > > lru lock: 1.09% (lru addition only)
> >
> > All this tells us is that there was *less contention on the mapping
> > tree lock*. It does not tell us why there was less contention.
> >
> > You've handily omitted the kswapd profile, which is really the one
> > of interest to the discussion here - how did the memory reclaim CPU
> > usage profile also change at the same time?
>
> Well, let me attach them. Suffix -1 is the mainline, -2 is the patchset.
>
> mainline
>   57.65%  kswapd0  __remove_mapping
> this patchset
>   61.61%  kswapd0  __remove_mapping
>
> As I said, the mapping lock contention penalizes both heavily.
> Its percentage is even higher with the patchset, because it has less
> overhead. I'm trying to explain "the less overhead" part: it's the
> activations that make the mainline worse.
>
> mainline
>   6.53%  kswapd0  shrink_active_list
> this patchset
>   0
>
> From the io_uring context:
> mainline
>   2.53%  io_uring  mark_page_accessed
> this patchset
>   0.52%  io_uring  mark_page_accessed
>
> mark_page_accessed() moves pages accessed multiple times to the active
> lru list. Then shrink_active_list() moves them back to the inactive
> list. All for nothing.
>
> I don't want to paste everything here -- they'd clutter. Please see
> all the detailed profiles in the attachment. Let me know if their
> formats are not to your liking. I still have the raw perf.data.
>
> > > And I plan to reach out to other communities, e.g., PostgreSQL, to
> > > benchmark the patchset. I heard they have been complaining about the
> > > buffered io performance under memory pressure. Any other benchmarks
> > > you'd suggest?
> > >
> > > BTW, you might find another surprise in how less frequently slab
> > > shrinkers are called under memory pressure, because this patchset is a
> > > lot better at finding pages to reclaim and therefore doesn't overkill
> > > slabs.
> >
> > That's actually very likely to be a Bad Thing and cause unexpected
> > performance and OOM based regressions. When the machine finally runs
> > out of page cache it can easily reclaim, it's going to get stuck
> > with long tail latencies reclaiming huge slab caches as they've had
> > no substantial ongoing pressure put on them to keep them in balance
> > with the overall memory pressure the system is under...
>
> Well. It does use the existing equation. That is if it scans X% of
> pages, then it scans X% of slab objects. But 1) it often finds pages
> to reclaim at a lower X% 2) the pages it reclaims are less likely to
> refault. So the side effect is the overall slab objects it scans also
> reduce.
> I do see your point but don't see any options, at the moment.

I apologize for the spam. Apparently the attachment in my previous email
didn't reach everybody. I hope this works:

  git clone https://linux-mm.googlesource.com/benchmarks

The repo contains profiles collected when running fio/io_uring:

  mainline: kswapd-1.txt kswapd-1.svg io_uring-1.txt io_uring-1.svg
  patched:  kswapd-2.txt kswapd-2.svg io_uring-2.txt io_uring-2.svg

Thanks.
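
For anyone who would rather diff the mainline and patched numbers than
eyeball them, below is a rough sketch. It is not part of the patchset or
the benchmarks repo. It assumes each profile line has the shape
"<percent>% <comm> <symbol>", as in the excerpts quoted in this thread;
the actual *.txt files in the repo may carry extra columns or headers, in
which case the regex needs adjusting. The helper names (parse_profile,
diff_profiles) are made up for illustration, and the sample data is
copied from the figures quoted above.

```python
import re

# Assumed line shape: "<percent>% <comm> <symbol>", as quoted in this thread.
LINE = re.compile(r"^\s*(\d+(?:\.\d+)?)%\s+(\S+)\s+(\S+)\s*$")


def parse_profile(text):
    """Map (comm, symbol) -> percent for every line that matches LINE."""
    profile = {}
    for line in text.splitlines():
        # Drop the *** emphasis markers used in the quoted excerpts.
        m = LINE.match(line.replace("*", ""))
        if m:
            profile[(m.group(2), m.group(3))] = float(m.group(1))
    return profile


def diff_profiles(before, after):
    """Return {key: (before, after, delta)}; a missing entry counts as 0.0."""
    result = {}
    for key in set(before) | set(after):
        b = before.get(key, 0.0)
        a = after.get(key, 0.0)
        result[key] = (b, a, round(a - b, 2))
    return result


# Sample data: the kswapd and io_uring figures quoted earlier in this email.
MAINLINE = """
57.65% kswapd0 __remove_mapping
6.53% kswapd0 shrink_active_list
2.53% io_uring mark_page_accessed
"""
PATCHED = """
61.61% kswapd0 __remove_mapping
0.52% io_uring mark_page_accessed
"""

if __name__ == "__main__":
    delta = diff_profiles(parse_profile(MAINLINE), parse_profile(PATCHED))
    # Print the biggest reductions first.
    for (comm, sym), (b, a, d) in sorted(delta.items(), key=lambda kv: kv[1][2]):
        print(f"{comm:10s} {sym:24s} {b:6.2f}% -> {a:6.2f}% ({d:+.2f})")
```

Sorting by delta puts shrink_active_list and mark_page_accessed at the
top, which are the two activation-related entries discussed above.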