From: Ryan Roberts <ryan.roberts@arm.com>
To: Yu Zhao <yuzhao@google.com>, "Yin, Fengwei" <fengwei.yin@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Matthew Wilcox <willy@infradead.org>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	David Hildenbrand <david@redhat.com>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will@kernel.org>,
	Anshuman Khandual <anshuman.khandual@arm.com>,
	Yang Shi <shy828301@gmail.com>,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory
Date: Tue, 4 Jul 2023 16:36:16 +0100
Message-ID: <467afd30-c85a-8b9d-97b9-a9ef9d0983af@arm.com>
In-Reply-To: <CAOUHufYsOdywAJMxdh6W-=uLykD=7JrUwgBvUJWvfWJeQ5XxnA@mail.gmail.com>

On 04/07/2023 08:11, Yu Zhao wrote:
> On Tue, Jul 4, 2023 at 12:22 AM Yin, Fengwei <fengwei.yin@intel.com> wrote:
>>
>> On 7/4/2023 10:18 AM, Yu Zhao wrote:
>>> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> Hi All,
>>>>
>>>> This is v2 of a series to implement variable order, large folios for anonymous
>>>> memory. The objective of this is to improve performance by allocating larger
>>>> chunks of memory during anonymous page faults. See [1] for background.
>>>
>>> Thanks for the quick response!
>>>
>>>> I've significantly reworked and simplified the patch set based on comments from
>>>> Yu Zhao (thanks for all your feedback!). I've also renamed the feature to
>>>> VARIABLE_THP, on Yu's advice.
>>>>
>>>> The last patch is for arm64 to explicitly override the default
>>>> arch_wants_pte_order() and is intended as an example. If this series is accepted
>>>> I suggest taking the first 4 patches through the mm tree and the arm64 change
>>>> could be handled through the arm64 tree separately. Neither has any build
>>>> dependency on the other.
>>>>
>>>> The one area where I haven't followed Yu's advice is in the determination of the
>>>> size of folio to use. It was suggested that I have a single preferred large
>>>> order, and if it doesn't fit in the VMA (due to exceeding VMA bounds, or there
>>>> being existing overlapping populated PTEs, etc) then fallback immediately to
>>>> order-0. It turned out that this approach caused a performance regression in the
>>>> Speedometer benchmark.
>>>
>>> I suppose it's regression against the v1, not the unpatched kernel.
>> From the performance data Ryan shared, it's against unpatched kernel:
>>
>> Speedometer 2.0:
>>
>> | kernel                         |   runs_per_min |
>> |:-------------------------------|---------------:|
>> | baseline-4k                    |           0.0% |
>> | anonfolio-lkml-v1              |           0.7% |
>> | anonfolio-lkml-v2-simple-order |          -0.9% |
>> | anonfolio-lkml-v2              |           0.5% |
>
> I see. Thanks.
>
> A couple of questions:
> 1. Do we have a stddev?

| kernel                    |   mean_abs |   std_abs |   mean_rel |   std_rel |
|:------------------------- |-----------:|----------:|-----------:|----------:|
| baseline-4k               |      117.4 |       0.8 |       0.0% |      0.7% |
| anonfolio-v1              |      118.2 |       1   |       0.7% |      0.9% |
| anonfolio-v2-simple-order |      116.4 |       1.1 |      -0.9% |      0.9% |
| anonfolio-v2              |      118   |       1.2 |       0.5% |      1.0% |

This is with 3 runs per reboot across 5 reboots, with the first run after each
reboot trimmed (it's always a bit slower, I assume due to cold page cache). So
10 data points per kernel in total.

I've rerun the test multiple times and see similar results each time.

I've also run anonfolio-v2 with Kconfig FLEXIBLE_THP=disabled and in this case I
see the same performance as baseline-4k.

> 2. Do we have a theory why it regressed?

I have a woolly hypothesis; I think Chromium is doing mmap/munmap in ways that
mean when we fault, order-4 is often too big to fit in the VMA. So we fall back
to order-0.
I guess this is happening so often for this workload that the cost of
doing the checks and fallback is outweighing the benefit of the memory that does
end up with order-4 folios.

I've sampled the memory in each bucket (once per second) while running and it's
roughly:

64K: 25%
32K: 15%
16K: 15%
4K:  45%

32K and 16K obviously fold into the 4K bucket with anonfolio-v2-simple-order.
But potentially, I suspect there is a lot of mmap/munmap for the smaller sizes
and the 64K contents are more static - that's just a guess though.

> Assuming no bugs, I don't see how a real regression could happen --
> falling back to order-0 isn't different from the original behavior.
> Ryan, could you `perf record` and `cat /proc/vmstat` and share them?

I can, but it will have to be a bit later in the week. I'll do some more test
runs overnight so we have a larger number of runs - hopefully that might tell us
that this is noise to a certain extent.

I'd still like to hear a clear technical argument for why the bin-packing
approach is not the correct one!

Thanks,
Ryan
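
For readers following the thread, the two order-selection policies being
compared can be modelled roughly as below. This is only a userspace sketch, not
code from the series: the function names, the 4K PAGE_SIZE, the set of
populated page addresses and the minimum large order of 2 (chosen because the
sampled buckets above list no 8K entry) are all assumptions made for
illustration.

```python
PAGE_SIZE = 4096  # assumption: 4K base pages, as in the baseline-4k kernel


def range_fits(addr, order, vma_start, vma_end, populated):
    """Check whether the naturally aligned block of 2**order pages that
    contains addr lies inside [vma_start, vma_end) and overlaps no page
    that is already populated."""
    size = PAGE_SIZE << order
    start = addr & ~(size - 1)  # align down to the block size
    end = start + size
    if start < vma_start or end > vma_end:
        return False
    return not any(start <= page < end for page in populated)


def pick_order_simple(addr, vma_start, vma_end, populated, preferred=4):
    """Hypothetical 'simple-order' policy: try the single preferred order,
    otherwise fall back immediately to order-0."""
    if range_fits(addr, preferred, vma_start, vma_end, populated):
        return preferred
    return 0


def pick_order_stepdown(addr, vma_start, vma_end, populated,
                        preferred=4, min_large_order=2):
    """Hypothetical step-down policy: try each order from the preferred one
    down to min_large_order before giving up and using order-0."""
    for order in range(preferred, min_large_order - 1, -1):
        if range_fits(addr, order, vma_start, vma_end, populated):
            return order
    return 0


if __name__ == "__main__":
    # A 32K VMA: too small for a 64K (order-4) folio.
    vma_start, vma_end = 0x10000, 0x18000
    fault = 0x12000
    print(pick_order_simple(fault, vma_start, vma_end, set()))
    print(pick_order_stepdown(fault, vma_start, vma_end, set()))
```

In this model a fault in a 32K VMA gets an order-0 page under the simple policy
but an order-3 folio under the step-down policy, which is the kind of case the
32K and 16K buckets above would correspond to if the hypothesis holds.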