From: Ryan Roberts <ryan.roberts@arm.com> To: David Hildenbrand <david@redhat.com>, Matthew Wilcox <willy@infradead.org> Cc: "Huang, Ying" <ying.huang@intel.com>, Andrew Morton <akpm@linux-foundation.org>, "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>, Yin Fengwei <fengwei.yin@intel.com>, Yu Zhao <yuzhao@google.com>, Catalin Marinas <catalin.marinas@arm.com>, Will Deacon <will@kernel.org>, Anshuman Khandual <anshuman.khandual@arm.com>, Yang Shi <shy828301@gmail.com>, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance Date: Fri, 7 Jul 2023 16:13:52 +0100 [thread overview] Message-ID: <9dd036a8-9ba3-0cc4-b791-cb3178237728@arm.com> (raw) In-Reply-To: <1e406f04-78ef-6573-e1f1-f0d0e0d5246a@redhat.com> On 07/07/2023 15:07, David Hildenbrand wrote: > On 07.07.23 15:57, Matthew Wilcox wrote: >> On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote: >>> On 07.07.23 11:52, Ryan Roberts wrote: >>>> On 07/07/2023 09:01, Huang, Ying wrote: >>>>> Although we can use smaller page order for FLEXIBLE_THP, it's hard to >>>>> avoid internal fragmentation completely. So, I think that finally we >>>>> will need to provide a mechanism for the users to opt out, e.g., >>>>> something like "always madvise never" via >>>>> /sys/kernel/mm/transparent_hugepage/enabled. I'm not sure whether it's >>>>> a good idea to reuse the existing interface of THP. >>>> >>>> I wouldn't want to tie this to the existing interface, simply because that >>>> implies that we would want to follow the "always" and "madvise" advice too; >>>> That >>>> means that on a thp=madvise system (which is certainly the case for android and >>>> other client systems) we would have to disable large anon folios for VMAs that >>>> haven't explicitly opted in. That breaks the intention that this should be an >>>> invisible performance boost. I think it's important to set the policy for >>>> use of >>> >>> It will never ever be a completely invisible performance boost, just like >>> ordinary THP. >>> >>> Using the exact same existing toggle is the right thing to do. If someone >>> specify "never" or "madvise", then do exactly that. >>> >>> It might make sense to have more modes or additional toggles, but >>> "madvise=never" means no memory waste. >> >> I hate the existing mechanisms. They are an abdication of our >> responsibility, and an attempt to blame the user (be it the sysadmin >> or the programmer) of our code for using it wrongly. We should not >> replicate this mistake. > > I don't agree regarding the programmer responsibility. In some cases the > programmer really doesn't want to get more memory populated than requested -- > and knows exactly why setting MADV_NOHUGEPAGE is the right thing to do. > > Regarding the madvise=never/madvise/always (sys admin decision), memory waste > (and nailing down bugs or working around them in customer setups) have been very > good reasons to let the admin have a word. > >> >> Our code should be auto-tuning. I posted a long, detailed outline here: >> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/ >> > > Well, "auto-tuning" also should be perfect for everybody, but once reality > strikes you know it isn't. > > If people don't feel like using THP, let them have a word. The "madvise" config > option is probably more controversial. But the "always vs. never" absolutely > makes sense to me. > >>> I remember I raised it already in the past, but you *absolutely* have to >>> respect the MADV_NOHUGEPAGE flag. There is user space out there (for >>> example, userfaultfd) that doesn't want the kernel to populate any >>> additional page tables. So if you have to respect that already, then also >>> respect MADV_HUGEPAGE, simple. >> >> Possibly having uffd enabled on a VMA should disable using large folios, > > There are cases where we enable uffd *after* already touching memory (postcopy > live migration in QEMU being the famous example). That doesn't fly. > >> I can get behind that. But the notion that userspace knows what it's >> doing ... hahaha. Just ignore the madvise flags. Userspace doesn't >> know what it's doing. > > If user space sets MADV_NOHUGEPAGE, it exactly knows what it is doing ... in > some cases. And these include cases I care about messing with sparse VM memory :) > > I have strong opinions against populating more than required when user space set > MADV_NOHUGEPAGE. I can see your point about honouring MADV_NOHUGEPAGE, so think that it is reasonable to fallback to allocating an order-0 page in a VMA that has it set. The app has gone out of its way to explicitly set it, after all. I think the correct behaviour for the global thp controls (cmdline and sysfs) are less obvious though. I could get on board with disabling large anon folios globally when thp="never". But for other situations, I would prefer to keep large anon folios enabled (treat "madvise" as "always"), with the argument that their order is much smaller than traditional THP and therefore the internal fragmentation is significantly reduced. I really don't want to end up with user space ever having to opt-in (with MADV_HUGEPAGE) to see the benefits of large anon folios. I still feel that it would be better for the thp and large anon folio controls to be independent though - what's the argument for tying them together? >
WARNING: multiple messages have this Message-ID (diff)
From: Ryan Roberts <ryan.roberts@arm.com> To: David Hildenbrand <david@redhat.com>, Matthew Wilcox <willy@infradead.org> Cc: "Huang, Ying" <ying.huang@intel.com>, Andrew Morton <akpm@linux-foundation.org>, "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>, Yin Fengwei <fengwei.yin@intel.com>, Yu Zhao <yuzhao@google.com>, Catalin Marinas <catalin.marinas@arm.com>, Will Deacon <will@kernel.org>, Anshuman Khandual <anshuman.khandual@arm.com>, Yang Shi <shy828301@gmail.com>, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance Date: Fri, 7 Jul 2023 16:13:52 +0100 [thread overview] Message-ID: <9dd036a8-9ba3-0cc4-b791-cb3178237728@arm.com> (raw) In-Reply-To: <1e406f04-78ef-6573-e1f1-f0d0e0d5246a@redhat.com> On 07/07/2023 15:07, David Hildenbrand wrote: > On 07.07.23 15:57, Matthew Wilcox wrote: >> On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote: >>> On 07.07.23 11:52, Ryan Roberts wrote: >>>> On 07/07/2023 09:01, Huang, Ying wrote: >>>>> Although we can use smaller page order for FLEXIBLE_THP, it's hard to >>>>> avoid internal fragmentation completely. So, I think that finally we >>>>> will need to provide a mechanism for the users to opt out, e.g., >>>>> something like "always madvise never" via >>>>> /sys/kernel/mm/transparent_hugepage/enabled. I'm not sure whether it's >>>>> a good idea to reuse the existing interface of THP. >>>> >>>> I wouldn't want to tie this to the existing interface, simply because that >>>> implies that we would want to follow the "always" and "madvise" advice too; >>>> That >>>> means that on a thp=madvise system (which is certainly the case for android and >>>> other client systems) we would have to disable large anon folios for VMAs that >>>> haven't explicitly opted in. That breaks the intention that this should be an >>>> invisible performance boost. I think it's important to set the policy for >>>> use of >>> >>> It will never ever be a completely invisible performance boost, just like >>> ordinary THP. >>> >>> Using the exact same existing toggle is the right thing to do. If someone >>> specify "never" or "madvise", then do exactly that. >>> >>> It might make sense to have more modes or additional toggles, but >>> "madvise=never" means no memory waste. >> >> I hate the existing mechanisms. They are an abdication of our >> responsibility, and an attempt to blame the user (be it the sysadmin >> or the programmer) of our code for using it wrongly. We should not >> replicate this mistake. > > I don't agree regarding the programmer responsibility. In some cases the > programmer really doesn't want to get more memory populated than requested -- > and knows exactly why setting MADV_NOHUGEPAGE is the right thing to do. > > Regarding the madvise=never/madvise/always (sys admin decision), memory waste > (and nailing down bugs or working around them in customer setups) have been very > good reasons to let the admin have a word. > >> >> Our code should be auto-tuning. I posted a long, detailed outline here: >> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/ >> > > Well, "auto-tuning" also should be perfect for everybody, but once reality > strikes you know it isn't. > > If people don't feel like using THP, let them have a word. The "madvise" config > option is probably more controversial. But the "always vs. never" absolutely > makes sense to me. > >>> I remember I raised it already in the past, but you *absolutely* have to >>> respect the MADV_NOHUGEPAGE flag. There is user space out there (for >>> example, userfaultfd) that doesn't want the kernel to populate any >>> additional page tables. So if you have to respect that already, then also >>> respect MADV_HUGEPAGE, simple. >> >> Possibly having uffd enabled on a VMA should disable using large folios, > > There are cases where we enable uffd *after* already touching memory (postcopy > live migration in QEMU being the famous example). That doesn't fly. > >> I can get behind that. But the notion that userspace knows what it's >> doing ... hahaha. Just ignore the madvise flags. Userspace doesn't >> know what it's doing. > > If user space sets MADV_NOHUGEPAGE, it exactly knows what it is doing ... in > some cases. And these include cases I care about messing with sparse VM memory :) > > I have strong opinions against populating more than required when user space set > MADV_NOHUGEPAGE. I can see your point about honouring MADV_NOHUGEPAGE, so think that it is reasonable to fallback to allocating an order-0 page in a VMA that has it set. The app has gone out of its way to explicitly set it, after all. I think the correct behaviour for the global thp controls (cmdline and sysfs) are less obvious though. I could get on board with disabling large anon folios globally when thp="never". But for other situations, I would prefer to keep large anon folios enabled (treat "madvise" as "always"), with the argument that their order is much smaller than traditional THP and therefore the internal fragmentation is significantly reduced. I really don't want to end up with user space ever having to opt-in (with MADV_HUGEPAGE) to see the benefits of large anon folios. I still feel that it would be better for the thp and large anon folio controls to be independent though - what's the argument for tying them together? > _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
next prev parent reply other threads:[~2023-07-07 15:14 UTC|newest] Thread overview: 167+ messages / expand[flat|nested] mbox.gz Atom feed top 2023-07-03 13:53 [PATCH v2 0/5] variable-order, large folios for anonymous memory Ryan Roberts 2023-07-03 13:53 ` Ryan Roberts 2023-07-03 13:53 ` [PATCH v2 1/5] mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap() Ryan Roberts 2023-07-03 13:53 ` Ryan Roberts 2023-07-03 19:05 ` Yu Zhao 2023-07-03 19:05 ` Yu Zhao 2023-07-04 2:13 ` Yin, Fengwei 2023-07-04 2:13 ` Yin, Fengwei 2023-07-04 11:19 ` Ryan Roberts 2023-07-04 11:19 ` Ryan Roberts 2023-07-04 2:14 ` Yin, Fengwei 2023-07-04 2:14 ` Yin, Fengwei 2023-07-03 13:53 ` [PATCH v2 2/5] mm: Allow deferred splitting of arbitrary large anon folios Ryan Roberts 2023-07-03 13:53 ` Ryan Roberts 2023-07-07 8:21 ` Huang, Ying 2023-07-07 8:21 ` Huang, Ying 2023-07-07 9:39 ` Ryan Roberts 2023-07-07 9:42 ` Ryan Roberts 2023-07-07 9:42 ` Ryan Roberts 2023-07-10 5:37 ` Huang, Ying 2023-07-10 5:37 ` Huang, Ying 2023-07-10 8:29 ` Ryan Roberts 2023-07-10 8:29 ` Ryan Roberts 2023-07-10 9:01 ` Huang, Ying 2023-07-10 9:01 ` Huang, Ying 2023-07-10 9:39 ` Ryan Roberts 2023-07-10 9:39 ` Ryan Roberts 2023-07-11 1:56 ` Huang, Ying 2023-07-11 1:56 ` Huang, Ying 2023-07-03 13:53 ` [PATCH v2 3/5] mm: Default implementation of arch_wants_pte_order() Ryan Roberts 2023-07-03 13:53 ` Ryan Roberts 2023-07-03 19:50 ` Yu Zhao 2023-07-03 19:50 ` Yu Zhao 2023-07-04 13:20 ` Ryan Roberts 2023-07-04 13:20 ` Ryan Roberts 2023-07-05 2:07 ` Yu Zhao 2023-07-05 2:07 ` Yu Zhao 2023-07-05 9:11 ` Ryan Roberts 2023-07-05 9:11 ` Ryan Roberts 2023-07-05 17:24 ` Yu Zhao 2023-07-05 17:24 ` Yu Zhao 2023-07-05 18:01 ` Ryan Roberts 2023-07-05 18:01 ` Ryan Roberts 2023-07-06 19:33 ` Matthew Wilcox 2023-07-06 19:33 ` Matthew Wilcox 2023-07-07 10:00 ` Ryan Roberts 2023-07-07 10:00 ` Ryan Roberts 2023-07-04 2:22 ` Yin, Fengwei 2023-07-04 2:22 ` Yin, Fengwei 2023-07-04 3:02 ` Yu Zhao 2023-07-04 3:02 ` Yu Zhao 2023-07-04 3:59 ` Yu Zhao 2023-07-04 3:59 ` Yu Zhao 2023-07-04 5:22 ` Yin, Fengwei 2023-07-04 5:22 ` Yin, Fengwei 2023-07-04 5:42 ` Yu Zhao 2023-07-04 5:42 ` Yu Zhao 2023-07-04 12:36 ` Ryan Roberts 2023-07-04 12:36 ` Ryan Roberts 2023-07-04 13:23 ` Ryan Roberts 2023-07-04 13:23 ` Ryan Roberts 2023-07-05 1:40 ` Yu Zhao 2023-07-05 1:40 ` Yu Zhao 2023-07-05 1:23 ` Yu Zhao 2023-07-05 1:23 ` Yu Zhao 2023-07-05 2:18 ` Yin Fengwei 2023-07-05 2:18 ` Yin Fengwei 2023-07-03 13:53 ` [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance Ryan Roberts 2023-07-03 13:53 ` Ryan Roberts 2023-07-03 15:51 ` kernel test robot 2023-07-03 15:51 ` kernel test robot 2023-07-03 16:01 ` kernel test robot 2023-07-03 16:01 ` kernel test robot 2023-07-04 1:35 ` Yu Zhao 2023-07-04 1:35 ` Yu Zhao 2023-07-04 14:08 ` Ryan Roberts 2023-07-04 14:08 ` Ryan Roberts 2023-07-04 23:47 ` Yu Zhao 2023-07-04 23:47 ` Yu Zhao 2023-07-04 3:45 ` Yin, Fengwei 2023-07-04 3:45 ` Yin, Fengwei 2023-07-04 14:20 ` Ryan Roberts 2023-07-04 14:20 ` Ryan Roberts 2023-07-04 23:35 ` Yin Fengwei 2023-07-04 23:57 ` Matthew Wilcox 2023-07-04 23:57 ` Matthew Wilcox 2023-07-05 9:54 ` Ryan Roberts 2023-07-05 9:54 ` Ryan Roberts 2023-07-05 12:08 ` Matthew Wilcox 2023-07-05 12:08 ` Matthew Wilcox 2023-07-07 8:01 ` Huang, Ying 2023-07-07 8:01 ` Huang, Ying 2023-07-07 9:52 ` Ryan Roberts 2023-07-07 9:52 ` Ryan Roberts 2023-07-07 11:29 ` David Hildenbrand 2023-07-07 11:29 ` David Hildenbrand 2023-07-07 13:57 ` Matthew Wilcox 2023-07-07 13:57 ` Matthew Wilcox 2023-07-07 14:07 ` David Hildenbrand 2023-07-07 14:07 ` David Hildenbrand 2023-07-07 15:13 ` Ryan Roberts [this message] 2023-07-07 15:13 ` Ryan Roberts 2023-07-07 16:06 ` David Hildenbrand 2023-07-07 16:06 ` David Hildenbrand 2023-07-07 16:22 ` Ryan Roberts 2023-07-07 16:22 ` Ryan Roberts 2023-07-07 19:06 ` David Hildenbrand 2023-07-07 19:06 ` David Hildenbrand 2023-07-10 8:41 ` Ryan Roberts 2023-07-10 8:41 ` Ryan Roberts 2023-07-10 3:03 ` Huang, Ying 2023-07-10 3:03 ` Huang, Ying 2023-07-10 8:55 ` Ryan Roberts 2023-07-10 8:55 ` Ryan Roberts 2023-07-10 9:18 ` Huang, Ying 2023-07-10 9:18 ` Huang, Ying 2023-07-10 9:25 ` Ryan Roberts 2023-07-10 9:25 ` Ryan Roberts 2023-07-11 0:48 ` Huang, Ying 2023-07-11 0:48 ` Huang, Ying 2023-07-10 2:49 ` Huang, Ying 2023-07-10 2:49 ` Huang, Ying 2023-07-03 13:53 ` [PATCH v2 5/5] arm64: mm: Override arch_wants_pte_order() Ryan Roberts 2023-07-03 13:53 ` Ryan Roberts 2023-07-03 20:02 ` Yu Zhao 2023-07-03 20:02 ` Yu Zhao 2023-07-04 2:18 ` [PATCH v2 0/5] variable-order, large folios for anonymous memory Yu Zhao 2023-07-04 2:18 ` Yu Zhao 2023-07-04 6:22 ` Yin, Fengwei 2023-07-04 6:22 ` Yin, Fengwei 2023-07-04 7:11 ` Yu Zhao 2023-07-04 7:11 ` Yu Zhao 2023-07-04 15:36 ` Ryan Roberts 2023-07-04 15:36 ` Ryan Roberts 2023-07-04 23:52 ` Yin Fengwei 2023-07-05 0:21 ` Yu Zhao 2023-07-05 0:21 ` Yu Zhao 2023-07-05 10:16 ` Ryan Roberts 2023-07-05 10:16 ` Ryan Roberts 2023-07-05 19:00 ` Yu Zhao 2023-07-05 19:00 ` Yu Zhao 2023-07-05 19:38 ` David Hildenbrand 2023-07-05 19:38 ` David Hildenbrand 2023-07-06 8:02 ` Ryan Roberts 2023-07-06 8:02 ` Ryan Roberts 2023-07-07 11:40 ` David Hildenbrand 2023-07-07 11:40 ` David Hildenbrand 2023-07-07 13:12 ` Matthew Wilcox 2023-07-07 13:12 ` Matthew Wilcox 2023-07-07 13:24 ` David Hildenbrand 2023-07-07 13:24 ` David Hildenbrand 2023-07-10 10:07 ` Ryan Roberts 2023-07-10 10:07 ` Ryan Roberts 2023-07-10 16:57 ` Matthew Wilcox 2023-07-10 16:57 ` Matthew Wilcox 2023-07-10 16:53 ` Zi Yan 2023-07-10 16:53 ` Zi Yan 2023-07-19 15:49 ` Ryan Roberts 2023-07-19 15:49 ` Ryan Roberts 2023-07-19 16:05 ` Zi Yan 2023-07-19 16:05 ` Zi Yan 2023-07-19 18:37 ` Ryan Roberts 2023-07-19 18:37 ` Ryan Roberts 2023-07-11 21:11 ` Luis Chamberlain 2023-07-11 21:11 ` Luis Chamberlain 2023-07-11 21:59 ` Matthew Wilcox 2023-07-11 21:59 ` Matthew Wilcox
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=9dd036a8-9ba3-0cc4-b791-cb3178237728@arm.com \ --to=ryan.roberts@arm.com \ --cc=akpm@linux-foundation.org \ --cc=anshuman.khandual@arm.com \ --cc=catalin.marinas@arm.com \ --cc=david@redhat.com \ --cc=fengwei.yin@intel.com \ --cc=kirill.shutemov@linux.intel.com \ --cc=linux-arm-kernel@lists.infradead.org \ --cc=linux-kernel@vger.kernel.org \ --cc=linux-mm@kvack.org \ --cc=shy828301@gmail.com \ --cc=will@kernel.org \ --cc=willy@infradead.org \ --cc=ying.huang@intel.com \ --cc=yuzhao@google.com \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: linkBe sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.