All of lore.kernel.org
 help / color / mirror / Atom feed
* [LSF/MM/BPF TOPIC] HGM for hugetlbfs
@ 2023-03-06 19:19 Mike Kravetz
  2023-03-14 15:37 ` James Houghton
  2023-05-24 20:26 ` James Houghton
  0 siblings, 2 replies; 29+ messages in thread
From: Mike Kravetz @ 2023-03-06 19:19 UTC (permalink / raw)
  To: lsf-pc; +Cc: linux-mm, James Houghton, Peter Xu

This is past the deadline, so feel free to ignore.  However, ...

James Houghton has been working on the concept of HugeTLB High Granularity
Mapping (HGM) as discussed here:
https://lore.kernel.org/linux-mm/20230218002819.1486479-1-jthoughton@google.com/

The primary motivation for this work is post-copy live migration of VMs backed
by hugetlb pages via userfaultfd.  A followup use case is more gracefully
handling memory errors/poison on hugetlb pages.

As can be seen by the size of James's patch set, the required changes for
HGM are a bit complex and involved.  This is also complicated the need
choosing a 'mapcount strategy' as the previous scheme used by hugetlb
will no longer work.

A HGM for hugetlbfs session would present the current approach and challenges.  
While much of the work is confined to hugetlb, there is a bit spill over to
other mm areas: specifically page table walking.  A discussion on ways to
move forward with this effort would be appreciated.
-- 
Mike Kravetz


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [LSF/MM/BPF TOPIC] HGM for hugetlbfs
  2023-03-06 19:19 [LSF/MM/BPF TOPIC] HGM for hugetlbfs Mike Kravetz
@ 2023-03-14 15:37 ` James Houghton
  2023-04-12  1:44   ` David Rientjes
  2023-05-24 20:26 ` James Houghton
  1 sibling, 1 reply; 29+ messages in thread
From: James Houghton @ 2023-03-14 15:37 UTC (permalink / raw)
  To: Mike Kravetz; +Cc: lsf-pc, linux-mm, Peter Xu

On Mon, Mar 6, 2023 at 11:19 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> This is past the deadline, so feel free to ignore.  However, ...
>
> James Houghton has been working on the concept of HugeTLB High Granularity
> Mapping (HGM) as discussed here:
> https://lore.kernel.org/linux-mm/20230218002819.1486479-1-jthoughton@google.com/
>
> The primary motivation for this work is post-copy live migration of VMs backed
> by hugetlb pages via userfaultfd.  A followup use case is more gracefully
> handling memory errors/poison on hugetlb pages.
>
> As can be seen by the size of James's patch set, the required changes for
> HGM are a bit complex and involved.  This is also complicated the need
> choosing a 'mapcount strategy' as the previous scheme used by hugetlb
> will no longer work.
>
> A HGM for hugetlbfs session would present the current approach and challenges.
> While much of the work is confined to hugetlb, there is a bit spill over to
> other mm areas: specifically page table walking.  A discussion on ways to
> move forward with this effort would be appreciated.

Thanks for proposing this, Mike.

To hopefully get more interest in this topic, I want to lay out the
reasons that Google uses HugeTLB for VMs today. They are:
- Guaranteed availability of hugepages
- Guaranteed NUMA alignment
- Availability of 1G pages
- HugeTLB vmemmap optimization to save page struct overhead

Until generic mm supports all this, HugeTLB will remain a very
important piece of Linux for us. :)

The main limitation of HugeTLB that I care about is that it can only
map an entire hugepage at once; it can never partially map a hugepage
(like, there is no such thing as a PTE-mapped HugeTLB page). As Mike
said, this makes the following applications impossible:
1. With userfaultfd-based live migration, being able to fetch and
install memory at PAGE_SIZE.
2. Memory poison at PAGE_SIZE.

HugeTLB high-granularity mapping (HGM) is an effort to make #1 and #2
possible with HugeTLB.

#1 and #2 are already possible with generic mm, so this also begs the
question: Can we merge HugeTLB with generic mm? This would certainly
be much more work than HGM, but it removes all those pesky HugeTLB
special cases (though, we still want all those features that HugeTLB
has).

Coming up with a plan to merge HugeTLB with generic mm would be
challenging, and LSFMM might be a good place to have such a
discussion. Not all of HugeTLB would need to be merged. I think some
of the main special cases that should be removed are:
1. hugetlb_fault (fault/GUP special case)
2. page_vma_mapped_walk's special case
3. hugetlb_entry in pagewalk
4. HugeTLB's rmap/mapcount special cases (already working on this!)

As part of this merge/unification, architectures would need to merge
their hugetlb implementations with their generic mm implementations
(for example, moving any special logic from set_huge_pte_at to
set_pte_at).

These are just some initial thoughts; I'm sure many of you have your
own ideas for this.

A discussion about HGM might serve as a jumping-off point for ideas
for how to enhance the generic mm implementation to make the
unification possible.


- James Houghton


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [LSF/MM/BPF TOPIC] HGM for hugetlbfs
  2023-03-14 15:37 ` James Houghton
@ 2023-04-12  1:44   ` David Rientjes
  0 siblings, 0 replies; 29+ messages in thread
From: David Rientjes @ 2023-04-12  1:44 UTC (permalink / raw)
  To: James Houghton, Tom Lendacky, Roth, Michael, Kalra, Ashish
  Cc: Mike Kravetz, lsf-pc, linux-mm, Peter Xu

[-- Attachment #1: Type: text/plain, Size: 3558 bytes --]

On Tue, 14 Mar 2023, James Houghton wrote:

> On Mon, Mar 6, 2023 at 11:19 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> >
> > This is past the deadline, so feel free to ignore.  However, ...
> >
> > James Houghton has been working on the concept of HugeTLB High Granularity
> > Mapping (HGM) as discussed here:
> > https://lore.kernel.org/linux-mm/20230218002819.1486479-1-jthoughton@google.com/
> >
> > The primary motivation for this work is post-copy live migration of VMs backed
> > by hugetlb pages via userfaultfd.  A followup use case is more gracefully
> > handling memory errors/poison on hugetlb pages.
> >
> > As can be seen by the size of James's patch set, the required changes for
> > HGM are a bit complex and involved.  This is also complicated the need
> > choosing a 'mapcount strategy' as the previous scheme used by hugetlb
> > will no longer work.
> >
> > A HGM for hugetlbfs session would present the current approach and challenges.
> > While much of the work is confined to hugetlb, there is a bit spill over to
> > other mm areas: specifically page table walking.  A discussion on ways to
> > move forward with this effort would be appreciated.
> 
> Thanks for proposing this, Mike.
> 
> To hopefully get more interest in this topic, I want to lay out the
> reasons that Google uses HugeTLB for VMs today. They are:
> - Guaranteed availability of hugepages
> - Guaranteed NUMA alignment
> - Availability of 1G pages
> - HugeTLB vmemmap optimization to save page struct overhead
> 
> Until generic mm supports all this, HugeTLB will remain a very
> important piece of Linux for us. :)
> 
> The main limitation of HugeTLB that I care about is that it can only
> map an entire hugepage at once; it can never partially map a hugepage
> (like, there is no such thing as a PTE-mapped HugeTLB page). As Mike
> said, this makes the following applications impossible:
> 1. With userfaultfd-based live migration, being able to fetch and
> install memory at PAGE_SIZE.
> 2. Memory poison at PAGE_SIZE.
> 
> HugeTLB high-granularity mapping (HGM) is an effort to make #1 and #2
> possible with HugeTLB.
> 
> #1 and #2 are already possible with generic mm, so this also begs the
> question: Can we merge HugeTLB with generic mm? This would certainly
> be much more work than HGM, but it removes all those pesky HugeTLB
> special cases (though, we still want all those features that HugeTLB
> has).
> 
> Coming up with a plan to merge HugeTLB with generic mm would be
> challenging, and LSFMM might be a good place to have such a
> discussion. Not all of HugeTLB would need to be merged. I think some
> of the main special cases that should be removed are:
> 1. hugetlb_fault (fault/GUP special case)
> 2. page_vma_mapped_walk's special case
> 3. hugetlb_entry in pagewalk
> 4. HugeTLB's rmap/mapcount special cases (already working on this!)
> 
> As part of this merge/unification, architectures would need to merge
> their hugetlb implementations with their generic mm implementations
> (for example, moving any special logic from set_huge_pte_at to
> set_pte_at).
> 
> These are just some initial thoughts; I'm sure many of you have your
> own ideas for this.
> 
> A discussion about HGM might serve as a jumping-off point for ideas
> for how to enhance the generic mm implementation to make the
> unification possible.
> 

I'd definitely be interested in joining into this discussion, specifically 
for live migration and memory poisoning use cases.

Adding in some folks at AMD as well as this may be useful for SEV-SNP host 
support.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [LSF/MM/BPF TOPIC] HGM for hugetlbfs
  2023-03-06 19:19 [LSF/MM/BPF TOPIC] HGM for hugetlbfs Mike Kravetz
  2023-03-14 15:37 ` James Houghton
@ 2023-05-24 20:26 ` James Houghton
  2023-05-26  3:00   ` David Rientjes
  1 sibling, 1 reply; 29+ messages in thread
From: James Houghton @ 2023-05-24 20:26 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: lsf-pc, linux-mm, Peter Xu, Michal Hocko, Matthew Wilcox,
	David Hildenbrand, David Rientjes, Axel Rasmussen, Jiaqi Yan

On Mon, Mar 6, 2023 at 11:19 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> This is past the deadline, so feel free to ignore.  However, ...
>
> James Houghton has been working on the concept of HugeTLB High Granularity
> Mapping (HGM) as discussed here:
> https://lore.kernel.org/linux-mm/20230218002819.1486479-1-jthoughton@google.com/
>
> The primary motivation for this work is post-copy live migration of VMs backed
> by hugetlb pages via userfaultfd.  A followup use case is more gracefully
> handling memory errors/poison on hugetlb pages.
>
> As can be seen by the size of James's patch set, the required changes for
> HGM are a bit complex and involved.  This is also complicated the need
> choosing a 'mapcount strategy' as the previous scheme used by hugetlb
> will no longer work.
>
> A HGM for hugetlbfs session would present the current approach and challenges.
> While much of the work is confined to hugetlb, there is a bit spill over to
> other mm areas: specifically page table walking.  A discussion on ways to
> move forward with this effort would be appreciated.
> --
> Mike Kravetz

Hi everyone,

If you came to the HGM session at LSF/MM/BPF, thank you! I want to
address some of the feedback I got and restate the importance of HGM,
especially as it relates to handling memory poison.

## Memory poison is a problem

HGM allows us to unmap poison at 4K instead of unmapping the entire
hugetlb page. For applications that use HugeTLB, losing the entire
hugepage can be catastrophic. For example, if a hypervisor is using 1G
pages for guest memory, the VM will lose 1G of its physical address
space, which is catastrophic (even 2M will most likely kill the VM).
If we can limit the poisoning to only 4K, the VM will most likely be
able to recover. This improved recoverability applies to other HugeTLB
users as well, like databases.

## Adding a new filesystem has risks, and unification will take years

Most of the feedback I got from the HGM session was to simply avoid
adding new code to HugeTLB, and instead to make a new device or
filesystem. Creating a new device or filesystem could work, but it
leaves existing HugeTLB users with no answer for memory poison. Users
would need to switch to the new device/filesystem if they want better
hwpoison handling, and it will probably take years for the new
device/filesystem to support all the features that HugeTLB supports
today (so beyond PUD+ mappings, we would need page table sharing, page
struct freeing, and even private mappings/CoW).

If we make a new filesystem and are unable to completely implement the
HugeTLB uapi exactly with that filesystem, we will be stuck unable to
remove HugeTLB.  We would strongly like to avoid coexisting HugeTLB
implementations (similar to cgroup v1 and cgroup v2) if at all
possible.

Instead of making a new filesystem, we could add HugeTLB-like features
tmpfs, such as support for gigantic page allocations (from bootmem or
CMA, like HugeTLB), for example. This path would work to mostly unify
HugeTLB with tmpfs, but existing HugeTLB users will still have to wait
for many years before poison can be handled more efficiently. (And
some users care about things like hugetlb_cgroup!)

## HGM doesn’t hinder future unification

HGM doesn’t add any new special cases into mm code; it takes advantage
of the existing special cases that already exist to support HugeTLB.
HGM also isn’t adding a completely novel feature that can’t be
replicated by THPs: PTE-mapping of THPs is already supported.

HGM solves a problem that HugeTLB users have right now: unnecessarily
large portions of memory are poisoned. Unless we fix HugeTLB itself,
we will have to spend years effectively rewriting HugeTLB and telling
users to switch to the new system that gets built.

Given all this, I think we should continue to move forward with HGM
unless there is another feasible way to solve poisoning for existing
HugeTLB users. Also, I encourage everyone to read the series itself
(it's not all that complicated!).

- James


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [LSF/MM/BPF TOPIC] HGM for hugetlbfs
  2023-05-24 20:26 ` James Houghton
@ 2023-05-26  3:00   ` David Rientjes
       [not found]     ` <20230602172723.GA3941@monkey>
  0 siblings, 1 reply; 29+ messages in thread
From: David Rientjes @ 2023-05-26  3:00 UTC (permalink / raw)
  To: James Houghton, Naoya Horiguchi, Miaohe Lin
  Cc: Mike Kravetz, lsf-pc, linux-mm, Peter Xu, Michal Hocko,
	Matthew Wilcox, David Hildenbrand, Axel Rasmussen, Jiaqi Yan

[-- Attachment #1: Type: text/plain, Size: 4182 bytes --]

On Wed, 24 May 2023, James Houghton wrote:

> Hi everyone,
> 
> If you came to the HGM session at LSF/MM/BPF, thank you!

Thank you, James, for putting together such a detailed discussion and 
soliciting some great feedback.

> I want to
> address some of the feedback I got and restate the importance of HGM,
> especially as it relates to handling memory poison.
> 

Thanks for bringing this up, I think it's a very important use case.  
Adding in Naoya Horiguchi and Miaohe Lin as well.

> ## Memory poison is a problem
> 
> HGM allows us to unmap poison at 4K instead of unmapping the entire
> hugetlb page. For applications that use HugeTLB, losing the entire
> hugepage can be catastrophic. For example, if a hypervisor is using 1G
> pages for guest memory, the VM will lose 1G of its physical address
> space, which is catastrophic (even 2M will most likely kill the VM).
> If we can limit the poisoning to only 4K, the VM will most likely be
> able to recover. This improved recoverability applies to other HugeTLB
> users as well, like databases.
> 

Mike, do you have feedback on how useful this would be, especially for use 
cases beyond what cloud providers would find helpful?

> ## Adding a new filesystem has risks, and unification will take years
> 
> Most of the feedback I got from the HGM session was to simply avoid
> adding new code to HugeTLB, and instead to make a new device or
> filesystem. Creating a new device or filesystem could work, but it
> leaves existing HugeTLB users with no answer for memory poison. Users
> would need to switch to the new device/filesystem if they want better
> hwpoison handling, and it will probably take years for the new
> device/filesystem to support all the features that HugeTLB supports
> today (so beyond PUD+ mappings, we would need page table sharing, page
> struct freeing, and even private mappings/CoW).
> 
> If we make a new filesystem and are unable to completely implement the
> HugeTLB uapi exactly with that filesystem, we will be stuck unable to
> remove HugeTLB.  We would strongly like to avoid coexisting HugeTLB
> implementations (similar to cgroup v1 and cgroup v2) if at all
> possible.
> 
> Instead of making a new filesystem, we could add HugeTLB-like features
> tmpfs, such as support for gigantic page allocations (from bootmem or
> CMA, like HugeTLB), for example. This path would work to mostly unify
> HugeTLB with tmpfs, but existing HugeTLB users will still have to wait
> for many years before poison can be handled more efficiently. (And
> some users care about things like hugetlb_cgroup!)
> 
> ## HGM doesn’t hinder future unification
> 
> HGM doesn’t add any new special cases into mm code; it takes advantage
> of the existing special cases that already exist to support HugeTLB.
> HGM also isn’t adding a completely novel feature that can’t be
> replicated by THPs: PTE-mapping of THPs is already supported.
> 

I think this is important, there are deficiencies that HGM can fully 
address (like the aforementioned smaller granularity page poisoning, as 
well as optimized live migration) while not posing an obstacle for future 
unification if possible.

If not for HGM, it would be great to get alignment on what needs to be 
done so that we can support memory poisoning in smaller sizes for users of 
1GB pages *and* optimized live migration for VMs backed by 1GB pages 
without requiring a full unification of the HugeTLB subsystem with the 
rest of core MM.

While that unification has been discussed for several years, it would be a 
shame if that became a full blocker to address these real deficiencies 
that are actively causing pain.

> HGM solves a problem that HugeTLB users have right now: unnecessarily
> large portions of memory are poisoned. Unless we fix HugeTLB itself,
> we will have to spend years effectively rewriting HugeTLB and telling
> users to switch to the new system that gets built.
> 
> Given all this, I think we should continue to move forward with HGM
> unless there is another feasible way to solve poisoning for existing
> HugeTLB users. Also, I encourage everyone to read the series itself
> (it's not all that complicated!).
> 
> - James
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [LSF/MM/BPF TOPIC] HGM for hugetlbfs
       [not found]     ` <20230602172723.GA3941@monkey>
@ 2023-06-06 22:40       ` David Rientjes
  2023-06-07  7:38         ` David Hildenbrand
  0 siblings, 1 reply; 29+ messages in thread
From: David Rientjes @ 2023-06-06 22:40 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: James Houghton, Naoya Horiguchi, Miaohe Lin, lsf-pc, linux-mm,
	Peter Xu, Michal Hocko, Matthew Wilcox, David Hildenbrand,
	Axel Rasmussen, Jiaqi Yan

On Fri, 2 Jun 2023, Mike Kravetz wrote:

> The benefit of HGM in the case of memory errors is fairly obvious.  As
> mentioned above, when a memory error is encountered on a hugetlb page,
> that entire hugetlb page becomes inaccessible to the application.  Losing,
> 1G or even 2M of data is often catastrophic for an application.  There
> is often no way to recover.  It just makes sense that recovering from
> the loss of 4K of data would generally be easier and more likely to be
> possible.  Today, when Oracle DB encounters a hard memory error on a
> hugetlb page it will shutdown.  Plans are currently in place repair and
> recover from such errors if possible.  Isolating the area of data loss
> to a single 4K page significantly increases the likelihood of repair and
> recovery.
> 
> Today, when a memory error is encountered on a hugetlb page an
> application is 'notified' of the error by a SIGBUS, as well as the
> virtual address of the hugetlb page and it's size.  This makes sense as
> hugetlb pages are accessed by a single page table entry, so you get all
> or nothing.  As mentioned by James above, this is catastrophic for VMs
> as the hypervisor has just been told that 2M or 1G is now inaccessible.
> With HGM, we can isolate such errors to 4K.
> 
> Backing VMs with hugetlb pages is a real use case today.  We are seeing
> memory errors on such hugetlb pages with the result being VM failures.
> One of the advantages of backing VMs with THPs is that they are split in
> the case of memory errors.  HGM would allow similar functionality.

Thanks for this context, Mike, it's very useful.

I think everybody is aligned on the desire to map memory at smaller 
granularities for multiple use cases and it's fairly clear that these use 
cases are critically important to multiple stakeholders.

I think the open question is whether this functionality is supported in 
hugetlbfs (like with HGM) or that there is a hard requirement that we must 
use THP for this support.

I don't think that hugetlbfs is feature frozen, but if there's a strong 
bias toward not merging additional complexity into the subsystem that 
would useful to know.  I personally think the critical use cases described 
above justify the added complexity of HGM to hugetlb and we wouldn't be 
blocked by the long standing (15+ years) desire to mesh hugetlb into the 
core MM subsystem before we can stop the pain associated with memory 
poisoning and live migration.

Are there strong objections to extending hugetlb for this support?


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [LSF/MM/BPF TOPIC] HGM for hugetlbfs
  2023-06-06 22:40       ` David Rientjes
@ 2023-06-07  7:38         ` David Hildenbrand
  2023-06-07  7:51           ` Yosry Ahmed
  2023-06-07 14:40           ` Matthew Wilcox
  0 siblings, 2 replies; 29+ messages in thread
From: David Hildenbrand @ 2023-06-07  7:38 UTC (permalink / raw)
  To: David Rientjes, Mike Kravetz
  Cc: James Houghton, Naoya Horiguchi, Miaohe Lin, lsf-pc, linux-mm,
	Peter Xu, Michal Hocko, Matthew Wilcox, Axel Rasmussen,
	Jiaqi Yan

On 07.06.23 00:40, David Rientjes wrote:
> On Fri, 2 Jun 2023, Mike Kravetz wrote:
> 
>> The benefit of HGM in the case of memory errors is fairly obvious.  As
>> mentioned above, when a memory error is encountered on a hugetlb page,
>> that entire hugetlb page becomes inaccessible to the application.  Losing,
>> 1G or even 2M of data is often catastrophic for an application.  There
>> is often no way to recover.  It just makes sense that recovering from
>> the loss of 4K of data would generally be easier and more likely to be
>> possible.  Today, when Oracle DB encounters a hard memory error on a
>> hugetlb page it will shutdown.  Plans are currently in place repair and
>> recover from such errors if possible.  Isolating the area of data loss
>> to a single 4K page significantly increases the likelihood of repair and
>> recovery.
>>
>> Today, when a memory error is encountered on a hugetlb page an
>> application is 'notified' of the error by a SIGBUS, as well as the
>> virtual address of the hugetlb page and it's size.  This makes sense as
>> hugetlb pages are accessed by a single page table entry, so you get all
>> or nothing.  As mentioned by James above, this is catastrophic for VMs
>> as the hypervisor has just been told that 2M or 1G is now inaccessible.
>> With HGM, we can isolate such errors to 4K.
>>
>> Backing VMs with hugetlb pages is a real use case today.  We are seeing
>> memory errors on such hugetlb pages with the result being VM failures.
>> One of the advantages of backing VMs with THPs is that they are split in
>> the case of memory errors.  HGM would allow similar functionality.
> 
> Thanks for this context, Mike, it's very useful.
> 
> I think everybody is aligned on the desire to map memory at smaller
> granularities for multiple use cases and it's fairly clear that these use
> cases are critically important to multiple stakeholders.
> 
> I think the open question is whether this functionality is supported in
> hugetlbfs (like with HGM) or that there is a hard requirement that we must
> use THP for this support.
> 
> I don't think that hugetlbfs is feature frozen, but if there's a strong
> bias toward not merging additional complexity into the subsystem that
> would useful to know.  I personally think the critical use cases described

At least I, attending that session, thought that it was clear that the 
majority of the people speaking up clearly expressed "no more added 
complexity". So I think there is a clear strong bias, at least from the 
people attending that session.


> above justify the added complexity of HGM to hugetlb and we wouldn't be
> blocked by the long standing (15+ years) desire to mesh hugetlb into the
> core MM subsystem before we can stop the pain associated with memory
> poisoning and live migration.
> 
> Are there strong objections to extending hugetlb for this support?

I don't want to get too involved in this discussion (busy), but I 
absolutely agree on the points that were raised at LSF/MM that

(A) hugetlb is complicated and very special (many things not integrated 
with core-mm, so we need special-casing all over the place). [example: 
what is a pte?]

(B) We added a bunch of complexity in the past that some people 
considered very important (and it was not feature frozen, right? ;) ). 
Looking back, we might just not have done some of that, or done it 
differently/cleaner -- better integrated in the core. (PMD sharing, 
MAP_PRIVATE, a reservation mechanism that still requires preallocation 
because it fails with NUMA/fork, ...)

(C) Unifying hugetlb and the core looks like it's getting more and more 
out of reach, maybe even impossible with all the complexity we added 
over the years (well, and keep adding).

Sure, HGM for the purpose of better hwpoison handling makes sense. But 
hugetlb is probably 20 years old and hwpoison handling probably 13 years 
old. So we managed to get quite far without that optimization.

Absolutely, HGM for better postcopy live migration also makes sense, I 
guess nobody disagrees on that.


But as discussed in that session, maybe we should just start anew and 
implement something that integrates nicely with the core , instead of 
making hugetlb more complicated and even more special.


Now, we all know, nobody wants to do the heavy lifting for that, that's 
why we're discussing how to get in yet another complicated feature.

Maybe we can manage to reduce complexity and integrate some parts nicer 
with core-mm, I don't know.


Don't get me wrong, Mike is the maintainer, I'm just reading along and 
voicing what I observed in the LSF/MM session (well, I mixed in some of 
my own opinion ;) ).

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [LSF/MM/BPF TOPIC] HGM for hugetlbfs
  2023-06-07  7:38         ` David Hildenbrand
@ 2023-06-07  7:51           ` Yosry Ahmed
  2023-06-07  8:13             ` David Hildenbrand
  2023-06-07 14:40           ` Matthew Wilcox
  1 sibling, 1 reply; 29+ messages in thread
From: Yosry Ahmed @ 2023-06-07  7:51 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: David Rientjes, Mike Kravetz, James Houghton, Naoya Horiguchi,
	Miaohe Lin, lsf-pc, linux-mm, Peter Xu, Michal Hocko,
	Matthew Wilcox, Axel Rasmussen, Jiaqi Yan

On Wed, Jun 7, 2023 at 12:38 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 07.06.23 00:40, David Rientjes wrote:
> > On Fri, 2 Jun 2023, Mike Kravetz wrote:
> >
> >> The benefit of HGM in the case of memory errors is fairly obvious.  As
> >> mentioned above, when a memory error is encountered on a hugetlb page,
> >> that entire hugetlb page becomes inaccessible to the application.  Losing,
> >> 1G or even 2M of data is often catastrophic for an application.  There
> >> is often no way to recover.  It just makes sense that recovering from
> >> the loss of 4K of data would generally be easier and more likely to be
> >> possible.  Today, when Oracle DB encounters a hard memory error on a
> >> hugetlb page it will shutdown.  Plans are currently in place repair and
> >> recover from such errors if possible.  Isolating the area of data loss
> >> to a single 4K page significantly increases the likelihood of repair and
> >> recovery.
> >>
> >> Today, when a memory error is encountered on a hugetlb page an
> >> application is 'notified' of the error by a SIGBUS, as well as the
> >> virtual address of the hugetlb page and it's size.  This makes sense as
> >> hugetlb pages are accessed by a single page table entry, so you get all
> >> or nothing.  As mentioned by James above, this is catastrophic for VMs
> >> as the hypervisor has just been told that 2M or 1G is now inaccessible.
> >> With HGM, we can isolate such errors to 4K.
> >>
> >> Backing VMs with hugetlb pages is a real use case today.  We are seeing
> >> memory errors on such hugetlb pages with the result being VM failures.
> >> One of the advantages of backing VMs with THPs is that they are split in
> >> the case of memory errors.  HGM would allow similar functionality.
> >
> > Thanks for this context, Mike, it's very useful.
> >
> > I think everybody is aligned on the desire to map memory at smaller
> > granularities for multiple use cases and it's fairly clear that these use
> > cases are critically important to multiple stakeholders.
> >
> > I think the open question is whether this functionality is supported in
> > hugetlbfs (like with HGM) or that there is a hard requirement that we must
> > use THP for this support.
> >
> > I don't think that hugetlbfs is feature frozen, but if there's a strong
> > bias toward not merging additional complexity into the subsystem that
> > would useful to know.  I personally think the critical use cases described
>
> At least I, attending that session, thought that it was clear that the
> majority of the people speaking up clearly expressed "no more added
> complexity". So I think there is a clear strong bias, at least from the
> people attending that session.
>
>
> > above justify the added complexity of HGM to hugetlb and we wouldn't be
> > blocked by the long standing (15+ years) desire to mesh hugetlb into the
> > core MM subsystem before we can stop the pain associated with memory
> > poisoning and live migration.
> >
> > Are there strong objections to extending hugetlb for this support?
>
> I don't want to get too involved in this discussion (busy), but I
> absolutely agree on the points that were raised at LSF/MM that
>
> (A) hugetlb is complicated and very special (many things not integrated
> with core-mm, so we need special-casing all over the place). [example:
> what is a pte?]
>
> (B) We added a bunch of complexity in the past that some people
> considered very important (and it was not feature frozen, right? ;) ).
> Looking back, we might just not have done some of that, or done it
> differently/cleaner -- better integrated in the core. (PMD sharing,
> MAP_PRIVATE, a reservation mechanism that still requires preallocation
> because it fails with NUMA/fork, ...)
>
> (C) Unifying hugetlb and the core looks like it's getting more and more
> out of reach, maybe even impossible with all the complexity we added
> over the years (well, and keep adding).
>
> Sure, HGM for the purpose of better hwpoison handling makes sense. But
> hugetlb is probably 20 years old and hwpoison handling probably 13 years
> old. So we managed to get quite far without that optimization.
>
> Absolutely, HGM for better postcopy live migration also makes sense, I
> guess nobody disagrees on that.
>
>
> But as discussed in that session, maybe we should just start anew and
> implement something that integrates nicely with the core , instead of
> making hugetlb more complicated and even more special.
>
>
> Now, we all know, nobody wants to do the heavy lifting for that, that's
> why we're discussing how to get in yet another complicated feature.

If nobody wants to do the heavy lifting and unifying hugetlb with core
MM is becoming impossible as you state, then does adding another
feature to hugetlb (that we are all agreeing is useful for multiple
use cases) really making things worse? In other words, if someone
decides tomorrow to do the heavy lifting, how much harder does this
become because of HGM, if any?

I am the farthest away from being an expert here, I am just an
observer here, but if the answer to the above question is "HGM doesn't
actually make it worse" or "HGM only slightly makes things harder",
then I naively think that it's something that we should do, from a
pure cost-benefit analysis.

Again, I don't have a lot of context here, and I understand everyone's
frustration with the current state of hugetlb. Just my 2 cents.

>
> Maybe we can manage to reduce complexity and integrate some parts nicer
> with core-mm, I don't know.
>
>
> Don't get me wrong, Mike is the maintainer, I'm just reading along and
> voicing what I observed in the LSF/MM session (well, I mixed in some of
> my own opinion ;) ).
>
> --
> Cheers,
>
> David / dhildenb
>
>


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [LSF/MM/BPF TOPIC] HGM for hugetlbfs
  2023-06-07  7:51           ` Yosry Ahmed
@ 2023-06-07  8:13             ` David Hildenbrand
  2023-06-07 22:06               ` Mike Kravetz
  0 siblings, 1 reply; 29+ messages in thread
From: David Hildenbrand @ 2023-06-07  8:13 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: David Rientjes, Mike Kravetz, James Houghton, Naoya Horiguchi,
	Miaohe Lin, lsf-pc, linux-mm, Peter Xu, Michal Hocko,
	Matthew Wilcox, Axel Rasmussen, Jiaqi Yan

On 07.06.23 09:51, Yosry Ahmed wrote:
> On Wed, Jun 7, 2023 at 12:38 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 07.06.23 00:40, David Rientjes wrote:
>>> On Fri, 2 Jun 2023, Mike Kravetz wrote:
>>>
>>>> The benefit of HGM in the case of memory errors is fairly obvious.  As
>>>> mentioned above, when a memory error is encountered on a hugetlb page,
>>>> that entire hugetlb page becomes inaccessible to the application.  Losing,
>>>> 1G or even 2M of data is often catastrophic for an application.  There
>>>> is often no way to recover.  It just makes sense that recovering from
>>>> the loss of 4K of data would generally be easier and more likely to be
>>>> possible.  Today, when Oracle DB encounters a hard memory error on a
>>>> hugetlb page it will shutdown.  Plans are currently in place repair and
>>>> recover from such errors if possible.  Isolating the area of data loss
>>>> to a single 4K page significantly increases the likelihood of repair and
>>>> recovery.
>>>>
>>>> Today, when a memory error is encountered on a hugetlb page an
>>>> application is 'notified' of the error by a SIGBUS, as well as the
>>>> virtual address of the hugetlb page and it's size.  This makes sense as
>>>> hugetlb pages are accessed by a single page table entry, so you get all
>>>> or nothing.  As mentioned by James above, this is catastrophic for VMs
>>>> as the hypervisor has just been told that 2M or 1G is now inaccessible.
>>>> With HGM, we can isolate such errors to 4K.
>>>>
>>>> Backing VMs with hugetlb pages is a real use case today.  We are seeing
>>>> memory errors on such hugetlb pages with the result being VM failures.
>>>> One of the advantages of backing VMs with THPs is that they are split in
>>>> the case of memory errors.  HGM would allow similar functionality.
>>>
>>> Thanks for this context, Mike, it's very useful.
>>>
>>> I think everybody is aligned on the desire to map memory at smaller
>>> granularities for multiple use cases and it's fairly clear that these use
>>> cases are critically important to multiple stakeholders.
>>>
>>> I think the open question is whether this functionality is supported in
>>> hugetlbfs (like with HGM) or that there is a hard requirement that we must
>>> use THP for this support.
>>>
>>> I don't think that hugetlbfs is feature frozen, but if there's a strong
>>> bias toward not merging additional complexity into the subsystem that
>>> would useful to know.  I personally think the critical use cases described
>>
>> At least I, attending that session, thought that it was clear that the
>> majority of the people speaking up clearly expressed "no more added
>> complexity". So I think there is a clear strong bias, at least from the
>> people attending that session.
>>
>>
>>> above justify the added complexity of HGM to hugetlb and we wouldn't be
>>> blocked by the long standing (15+ years) desire to mesh hugetlb into the
>>> core MM subsystem before we can stop the pain associated with memory
>>> poisoning and live migration.
>>>
>>> Are there strong objections to extending hugetlb for this support?
>>
>> I don't want to get too involved in this discussion (busy), but I
>> absolutely agree on the points that were raised at LSF/MM that
>>
>> (A) hugetlb is complicated and very special (many things not integrated
>> with core-mm, so we need special-casing all over the place). [example:
>> what is a pte?]
>>
>> (B) We added a bunch of complexity in the past that some people
>> considered very important (and it was not feature frozen, right? ;) ).
>> Looking back, we might just not have done some of that, or done it
>> differently/cleaner -- better integrated in the core. (PMD sharing,
>> MAP_PRIVATE, a reservation mechanism that still requires preallocation
>> because it fails with NUMA/fork, ...)
>>
>> (C) Unifying hugetlb and the core looks like it's getting more and more
>> out of reach, maybe even impossible with all the complexity we added
>> over the years (well, and keep adding).
>>
>> Sure, HGM for the purpose of better hwpoison handling makes sense. But
>> hugetlb is probably 20 years old and hwpoison handling probably 13 years
>> old. So we managed to get quite far without that optimization.
>>
>> Absolutely, HGM for better postcopy live migration also makes sense, I
>> guess nobody disagrees on that.
>>
>>
>> But as discussed in that session, maybe we should just start anew and
>> implement something that integrates nicely with the core , instead of
>> making hugetlb more complicated and even more special.
>>
>>
>> Now, we all know, nobody wants to do the heavy lifting for that, that's
>> why we're discussing how to get in yet another complicated feature.
> 
> If nobody wants to do the heavy lifting and unifying hugetlb with core
> MM is becoming impossible as you state, then does adding another
> feature to hugetlb (that we are all agreeing is useful for multiple
> use cases) really making things worse? In other words, if someone

Well, if we (as a community) reject more complexity and outline an 
alternative of what would be acceptable (rewrite), people that really 
want these new features will *have to* do the heavy lifting.

[and I see many people from employers that might have the capacity to do 
the heavy lifting if really required being involved in the discussion 
around HGM :P ]

> decides tomorrow to do the heavy lifting, how much harder does this
> become because of HGM, if any?
> 
> I am the farthest away from being an expert here, I am just an
> observer here, but if the answer to the above question is "HGM doesn't
> actually make it worse" or "HGM only slightly makes things harder",
> then I naively think that it's something that we should do, from a
> pure cost-benefit analysis.

Well, there is always the "maintainability" aspect, because upstream has 
to maintain whatever complexity gets merged. No matter what, we'll have 
to keep maintaining the current set of hugetlb features until we can 
eventually deprecate it/some in the far, far future.

I, for my part, am happy as long as I can stay away as far as possible 
from hugetlb code. Again, Mike is the maintainer.

What I saw so far regarding HGM does not count as "slightly makes things 
harder".

> 
> Again, I don't have a lot of context here, and I understand everyone's
> frustration with the current state of hugetlb. Just my 2 cents.

The thing is, we all agree that something that hugetlb provides is 
valuable (i.e., pool of huge/large pages that we can map large), just 
that after 20 years there might be better ways of doing it and 
integrating it better with core-mm.

Yes, many people are frustrated with the current state. Adding more 
complexity won't improve things.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [LSF/MM/BPF TOPIC] HGM for hugetlbfs
  2023-06-07  7:38         ` David Hildenbrand
  2023-06-07  7:51           ` Yosry Ahmed
@ 2023-06-07 14:40           ` Matthew Wilcox
  1 sibling, 0 replies; 29+ messages in thread
From: Matthew Wilcox @ 2023-06-07 14:40 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: David Rientjes, Mike Kravetz, James Houghton, Naoya Horiguchi,
	Miaohe Lin, lsf-pc, linux-mm, Peter Xu, Michal Hocko,
	Axel Rasmussen, Jiaqi Yan

On Wed, Jun 07, 2023 at 09:38:35AM +0200, David Hildenbrand wrote:
> I don't want to get too involved in this discussion (busy), but I absolutely
> agree on the points that were raised at LSF/MM that
> 
> (A) hugetlb is complicated and very special (many things not integrated with
> core-mm, so we need special-casing all over the place). [example: what is a
> pte?]

This is somethign that absolutely does need to get fixed.  It's one of
the big sources of complexity and confusion around code that is supposed
to work with both hugetlb & THP.  I understand why hugetlb originally
said "everything is a pte", but THP went a different route, and I think
hugetlb now needs to follow.

Fixing pagewalk.h to not be complete garbage would be a good start.
I can elaborate more along these lines if someone's actually going to
put in the work to do it.

> (B) We added a bunch of complexity in the past that some people considered
> very important (and it was not feature frozen, right? ;) ). Looking back, we
> might just not have done some of that, or done it differently/cleaner --
> better integrated in the core. (PMD sharing, MAP_PRIVATE, a reservation
> mechanism that still requires preallocation because it fails with NUMA/fork,
> ...)

It'd be nice if people engaged seriously with the efforts to move that
functionality into the core.  eg mshare.  Saying "Oh just share the
hugetlb implementation" is not serious engagement, it's an indication
you haven't been paying attention to what the needs are.

I haven't looked at the hugetlb reservation mechanism in enough detail to
be able to understand why people use it, what they actually want, and how
it could be done better in the core.  Maybe somebody else could do that.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [LSF/MM/BPF TOPIC] HGM for hugetlbfs
  2023-06-07  8:13             ` David Hildenbrand
@ 2023-06-07 22:06               ` Mike Kravetz
  2023-06-08  0:02                 ` David Rientjes
  2023-06-08 21:54                 ` [Lsf-pc] " Dan Williams
  0 siblings, 2 replies; 29+ messages in thread
From: Mike Kravetz @ 2023-06-07 22:06 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Yosry Ahmed, David Rientjes, James Houghton, Naoya Horiguchi,
	Miaohe Lin, lsf-pc, linux-mm, Peter Xu, Michal Hocko,
	Matthew Wilcox, Axel Rasmussen, Jiaqi Yan

On 06/07/23 10:13, David Hildenbrand wrote:
> On 07.06.23 09:51, Yosry Ahmed wrote:
> > On Wed, Jun 7, 2023 at 12:38 AM David Hildenbrand <david@redhat.com> wrote:
> > > 
> > > On 07.06.23 00:40, David Rientjes wrote:
> > > > On Fri, 2 Jun 2023, Mike Kravetz wrote:
> > > > 
> > > > > The benefit of HGM in the case of memory errors is fairly obvious.  As
> > > > > mentioned above, when a memory error is encountered on a hugetlb page,
> > > > > that entire hugetlb page becomes inaccessible to the application.  Losing,
> > > > > 1G or even 2M of data is often catastrophic for an application.  There
> > > > > is often no way to recover.  It just makes sense that recovering from
> > > > > the loss of 4K of data would generally be easier and more likely to be
> > > > > possible.  Today, when Oracle DB encounters a hard memory error on a
> > > > > hugetlb page it will shutdown.  Plans are currently in place repair and
> > > > > recover from such errors if possible.  Isolating the area of data loss
> > > > > to a single 4K page significantly increases the likelihood of repair and
> > > > > recovery.
> > > > > 
> > > > > Today, when a memory error is encountered on a hugetlb page an
> > > > > application is 'notified' of the error by a SIGBUS, as well as the
> > > > > virtual address of the hugetlb page and it's size.  This makes sense as
> > > > > hugetlb pages are accessed by a single page table entry, so you get all
> > > > > or nothing.  As mentioned by James above, this is catastrophic for VMs
> > > > > as the hypervisor has just been told that 2M or 1G is now inaccessible.
> > > > > With HGM, we can isolate such errors to 4K.
> > > > > 
> > > > > Backing VMs with hugetlb pages is a real use case today.  We are seeing
> > > > > memory errors on such hugetlb pages with the result being VM failures.
> > > > > One of the advantages of backing VMs with THPs is that they are split in
> > > > > the case of memory errors.  HGM would allow similar functionality.
> > > > 
> > > > Thanks for this context, Mike, it's very useful.
> > > > 
> > > > I think everybody is aligned on the desire to map memory at smaller
> > > > granularities for multiple use cases and it's fairly clear that these use
> > > > cases are critically important to multiple stakeholders.
> > > > 
> > > > I think the open question is whether this functionality is supported in
> > > > hugetlbfs (like with HGM) or that there is a hard requirement that we must
> > > > use THP for this support.
> > > > 
> > > > I don't think that hugetlbfs is feature frozen, but if there's a strong
> > > > bias toward not merging additional complexity into the subsystem that
> > > > would useful to know.  I personally think the critical use cases described
> > > 
> > > At least I, attending that session, thought that it was clear that the
> > > majority of the people speaking up clearly expressed "no more added
> > > complexity". So I think there is a clear strong bias, at least from the
> > > people attending that session.
> > > 
> > > 
> > > > above justify the added complexity of HGM to hugetlb and we wouldn't be
> > > > blocked by the long standing (15+ years) desire to mesh hugetlb into the
> > > > core MM subsystem before we can stop the pain associated with memory
> > > > poisoning and live migration.
> > > > 
> > > > Are there strong objections to extending hugetlb for this support?
> > > 
> > > I don't want to get too involved in this discussion (busy), but I
> > > absolutely agree on the points that were raised at LSF/MM that
> > > 
> > > (A) hugetlb is complicated and very special (many things not integrated
> > > with core-mm, so we need special-casing all over the place). [example:
> > > what is a pte?]
> > > 
> > > (B) We added a bunch of complexity in the past that some people
> > > considered very important (and it was not feature frozen, right? ;) ).
> > > Looking back, we might just not have done some of that, or done it
> > > differently/cleaner -- better integrated in the core. (PMD sharing,
> > > MAP_PRIVATE, a reservation mechanism that still requires preallocation
> > > because it fails with NUMA/fork, ...)
> > > 
> > > (C) Unifying hugetlb and the core looks like it's getting more and more
> > > out of reach, maybe even impossible with all the complexity we added
> > > over the years (well, and keep adding).
> > > 
> > > Sure, HGM for the purpose of better hwpoison handling makes sense. But
> > > hugetlb is probably 20 years old and hwpoison handling probably 13 years
> > > old. So we managed to get quite far without that optimization.
> > > 
> > > Absolutely, HGM for better postcopy live migration also makes sense, I
> > > guess nobody disagrees on that.
> > > 
> > > 
> > > But as discussed in that session, maybe we should just start anew and
> > > implement something that integrates nicely with the core , instead of
> > > making hugetlb more complicated and even more special.
> > > 
> > > 
> > > Now, we all know, nobody wants to do the heavy lifting for that, that's
> > > why we're discussing how to get in yet another complicated feature.
> > 
> > If nobody wants to do the heavy lifting and unifying hugetlb with core
> > MM is becoming impossible as you state, then does adding another
> > feature to hugetlb (that we are all agreeing is useful for multiple
> > use cases) really making things worse? In other words, if someone
> 
> Well, if we (as a community) reject more complexity and outline an
> alternative of what would be acceptable (rewrite), people that really want
> these new features will *have to* do the heavy lifting.
> 
> [and I see many people from employers that might have the capacity to do the
> heavy lifting if really required being involved in the discussion around HGM
> :P ]
> 
> > decides tomorrow to do the heavy lifting, how much harder does this
> > become because of HGM, if any?
> > 
> > I am the farthest away from being an expert here, I am just an
> > observer here, but if the answer to the above question is "HGM doesn't
> > actually make it worse" or "HGM only slightly makes things harder",
> > then I naively think that it's something that we should do, from a
> > pure cost-benefit analysis.
> 
> Well, there is always the "maintainability" aspect, because upstream has to
> maintain whatever complexity gets merged. No matter what, we'll have to keep
> maintaining the current set of hugetlb features until we can eventually
> deprecate it/some in the far, far future.
> 
> I, for my part, am happy as long as I can stay away as far as possible from
> hugetlb code. Again, Mike is the maintainer.

Thanks for the reminder :)

Maintainability is my primary concern with HGM.  That is one of the reasons
I proposed James pitch the topic at LSFMM.  Even though I am the 'maintainer'
changes introduced by HGM will impact others working in mm.

> What I saw so far regarding HGM does not count as "slightly makes things
> harder".
> 
> > Again, I don't have a lot of context here, and I understand everyone's
> > frustration with the current state of hugetlb. Just my 2 cents.
> 
> The thing is, we all agree that something that hugetlb provides is valuable
> (i.e., pool of huge/large pages that we can map large), just that after 20
> years there might be better ways of doing it and integrating it better with
> core-mm.

I am struggling with how to support existing hugetlb users that are running
into issues like memory errors on hugetlb pages today.  And, yes that is a
source of real customer issues.  They are not really happy with the current
design that a single error will take out a 1G page, and their VM or
application.  Moving to THP is not likely as they really want a pre-allocated
pool of 1G pages.  I just don't have a good answer for them.
-- 
Mike Kravetz


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [LSF/MM/BPF TOPIC] HGM for hugetlbfs
  2023-06-07 22:06               ` Mike Kravetz
@ 2023-06-08  0:02                 ` David Rientjes
  2023-06-08  6:34                   ` David Hildenbrand
  2023-06-08 21:54                 ` [Lsf-pc] " Dan Williams
  1 sibling, 1 reply; 29+ messages in thread
From: David Rientjes @ 2023-06-08  0:02 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: David Hildenbrand, Yosry Ahmed, James Houghton, Naoya Horiguchi,
	Miaohe Lin, lsf-pc, linux-mm, Peter Xu, Michal Hocko,
	Matthew Wilcox, Axel Rasmussen, Jiaqi Yan

On Wed, 7 Jun 2023, Mike Kravetz wrote:

> > > > > Are there strong objections to extending hugetlb for this support?
> > > > 
> > > > I don't want to get too involved in this discussion (busy), but I
> > > > absolutely agree on the points that were raised at LSF/MM that
> > > > 
> > > > (A) hugetlb is complicated and very special (many things not integrated
> > > > with core-mm, so we need special-casing all over the place). [example:
> > > > what is a pte?]
> > > > 
> > > > (B) We added a bunch of complexity in the past that some people
> > > > considered very important (and it was not feature frozen, right? ;) ).
> > > > Looking back, we might just not have done some of that, or done it
> > > > differently/cleaner -- better integrated in the core. (PMD sharing,
> > > > MAP_PRIVATE, a reservation mechanism that still requires preallocation
> > > > because it fails with NUMA/fork, ...)
> > > > 
> > > > (C) Unifying hugetlb and the core looks like it's getting more and more
> > > > out of reach, maybe even impossible with all the complexity we added
> > > > over the years (well, and keep adding).
> > > > 
> > > > Sure, HGM for the purpose of better hwpoison handling makes sense. But
> > > > hugetlb is probably 20 years old and hwpoison handling probably 13 years
> > > > old. So we managed to get quite far without that optimization.
> > > > 

Sane handling for memory poisoning and optimizations for live migration 
are both much more important for the real-world 1GB hugetlb user, so it 
doesn't quite have that lengthy of a history.

Unfortuantely, cloud providers receive complaints about both of these from 
customers.  They are one of the most significant causes for poor customer 
experience.

While people have proposed 1GB THP support in the past, it was nacked, in 
part, because of the suggestion to just use existing 1GB support in 
hugetlb instead :)

> > > > Absolutely, HGM for better postcopy live migration also makes sense, I
> > > > guess nobody disagrees on that.
> > > > 
> > > > 
> > > > But as discussed in that session, maybe we should just start anew and
> > > > implement something that integrates nicely with the core , instead of
> > > > making hugetlb more complicated and even more special.
> > > > 

Certainly an ideal would be where we could support everybody's use cases 
in a much more cohesive way with the rest of the core MM.  I'm 
particularly concerned about how long it will take to get to that state 
even if we had kernel developers committed to doing the work.  Even if we 
had a design for this new subsystem that was more tightly coupled with the 
core MM, it would take O(years) to implement, test, extend for other 
architectures, and that's before any existing of users of hugetlb could 
make the changes in the rest of their software stack to support it.

We have no other solution today for 1GB support in Linux, so waiting 
O(years) for this yet-to-be-designed future *is* going to cause 
compounding customer pain in the real world.

> > > > Now, we all know, nobody wants to do the heavy lifting for that, that's
> > > > why we're discussing how to get in yet another complicated feature.
> > > 
> > > If nobody wants to do the heavy lifting and unifying hugetlb with core
> > > MM is becoming impossible as you state, then does adding another
> > > feature to hugetlb (that we are all agreeing is useful for multiple
> > > use cases) really making things worse? In other words, if someone
> > 
> > Well, if we (as a community) reject more complexity and outline an
> > alternative of what would be acceptable (rewrite), people that really want
> > these new features will *have to* do the heavy lifting.
> > 
> > [and I see many people from employers that might have the capacity to do the
> > heavy lifting if really required being involved in the discussion around HGM
> > :P ]
> > 
> > > decides tomorrow to do the heavy lifting, how much harder does this
> > > become because of HGM, if any?
> > > 
> > > I am the farthest away from being an expert here, I am just an
> > > observer here, but if the answer to the above question is "HGM doesn't
> > > actually make it worse" or "HGM only slightly makes things harder",
> > > then I naively think that it's something that we should do, from a
> > > pure cost-benefit analysis.
> > 
> > Well, there is always the "maintainability" aspect, because upstream has to
> > maintain whatever complexity gets merged. No matter what, we'll have to keep
> > maintaining the current set of hugetlb features until we can eventually
> > deprecate it/some in the far, far future.
> > 
> > I, for my part, am happy as long as I can stay away as far as possible from
> > hugetlb code. Again, Mike is the maintainer.
> 
> Thanks for the reminder :)
> 
> Maintainability is my primary concern with HGM.  That is one of the reasons
> I proposed James pitch the topic at LSFMM.  Even though I am the 'maintainer'
> changes introduced by HGM will impact others working in mm.
> 
> > What I saw so far regarding HGM does not count as "slightly makes things
> > harder".
> > 
> > > Again, I don't have a lot of context here, and I understand everyone's
> > > frustration with the current state of hugetlb. Just my 2 cents.
> > 
> > The thing is, we all agree that something that hugetlb provides is valuable
> > (i.e., pool of huge/large pages that we can map large), just that after 20
> > years there might be better ways of doing it and integrating it better with
> > core-mm.
> 
> I am struggling with how to support existing hugetlb users that are running
> into issues like memory errors on hugetlb pages today.  And, yes that is a
> source of real customer issues.  They are not really happy with the current
> design that a single error will take out a 1G page, and their VM or
> application.  Moving to THP is not likely as they really want a pre-allocated
> pool of 1G pages.  I just don't have a good answer for them.

Fully agreed, these customer complaints are a very real and significant 
problem that is actively causing pain today for 1GB users.  That can't be 
understated.  Same for the user who is live migrated because of a 
disruptive software update on the host.

We would very much like a future where the hugetlb subsystem is more 
closely integrated with the core mm just because of subtle bugs that have 
popped up over time in hugetlb, including very complex reservation code.  
We've funded an initiative around hugetlb reliability because of a 
critical dependency on the subsystem as the *only* way to support 1GB 
mappings.

Don't get me wrong: integration with core mm is very beneficial from a 
reliability and maintenance perspective.  I just don't think the right 
solution is to mandate O(years) of work *before* we can possibly stop the 
very real customer pain.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [LSF/MM/BPF TOPIC] HGM for hugetlbfs
  2023-06-08  0:02                 ` David Rientjes
@ 2023-06-08  6:34                   ` David Hildenbrand
  2023-06-08 18:50                     ` Yang Shi
  2023-06-08 20:10                     ` Matthew Wilcox
  0 siblings, 2 replies; 29+ messages in thread
From: David Hildenbrand @ 2023-06-08  6:34 UTC (permalink / raw)
  To: David Rientjes, Mike Kravetz
  Cc: Yosry Ahmed, James Houghton, Naoya Horiguchi, Miaohe Lin, lsf-pc,
	linux-mm, Peter Xu, Michal Hocko, Matthew Wilcox, Axel Rasmussen,
	Jiaqi Yan

On 08.06.23 02:02, David Rientjes wrote:
> On Wed, 7 Jun 2023, Mike Kravetz wrote:
> 
>>>>>> Are there strong objections to extending hugetlb for this support?
>>>>>
>>>>> I don't want to get too involved in this discussion (busy), but I
>>>>> absolutely agree on the points that were raised at LSF/MM that
>>>>>
>>>>> (A) hugetlb is complicated and very special (many things not integrated
>>>>> with core-mm, so we need special-casing all over the place). [example:
>>>>> what is a pte?]
>>>>>
>>>>> (B) We added a bunch of complexity in the past that some people
>>>>> considered very important (and it was not feature frozen, right? ;) ).
>>>>> Looking back, we might just not have done some of that, or done it
>>>>> differently/cleaner -- better integrated in the core. (PMD sharing,
>>>>> MAP_PRIVATE, a reservation mechanism that still requires preallocation
>>>>> because it fails with NUMA/fork, ...)
>>>>>
>>>>> (C) Unifying hugetlb and the core looks like it's getting more and more
>>>>> out of reach, maybe even impossible with all the complexity we added
>>>>> over the years (well, and keep adding).
>>>>>
>>>>> Sure, HGM for the purpose of better hwpoison handling makes sense. But
>>>>> hugetlb is probably 20 years old and hwpoison handling probably 13 years
>>>>> old. So we managed to get quite far without that optimization.
>>>>>
> 
> Sane handling for memory poisoning and optimizations for live migration
> are both much more important for the real-world 1GB hugetlb user, so it
> doesn't quite have that lengthy of a history.
> 
> Unfortuantely, cloud providers receive complaints about both of these from
> customers.  They are one of the most significant causes for poor customer
> experience.
> 
> While people have proposed 1GB THP support in the past, it was nacked, in
> part, because of the suggestion to just use existing 1GB support in
> hugetlb instead :)

Yes, because I still think that the use for "transparent" (for the user) 
nowadays is very limited and not worth the complexity.

IMHO, what you really want is a pool of large pages that (guarantees 
about availability and nodes) and fine control about who gets these 
pages. That's what hugetlb provides.

In contrast to THP, you don't want to allow for
* Partially mmap, mremap, munmap, mprotect them
* Partially sharing then / COW'ing them
* Partially mixing them with other anon pages (MADV_DONTNEED + refault)
* Exclude them from some features KSM/swap
* (swap them out and eventually split them for that)

Because you don't want to get these pages PTE-mapped by the system 
*unless* there is a real reason (HGM, hwpoison) -- you want guarantees. 
Once such a page is PTE-mapped, you only want to collapse in place.

But you don't want special-HGM, you simply want the core to PTE-map them 
like a (file) THP.

IMHO, getting that realized much easier would be if we wouldn't have to 
care about some of the hugetlb complexity I raised (MAP_PRIVATE, PMD 
sharing), but maybe there is a way ...

> 
>>>>> Absolutely, HGM for better postcopy live migration also makes sense, I
>>>>> guess nobody disagrees on that.
>>>>>
>>>>>
>>>>> But as discussed in that session, maybe we should just start anew and
>>>>> implement something that integrates nicely with the core , instead of
>>>>> making hugetlb more complicated and even more special.
>>>>>
> 
> Certainly an ideal would be where we could support everybody's use cases
> in a much more cohesive way with the rest of the core MM.  I'm
> particularly concerned about how long it will take to get to that state
> even if we had kernel developers committed to doing the work.  Even if we
> had a design for this new subsystem that was more tightly coupled with the
> core MM, it would take O(years) to implement, test, extend for other
> architectures, and that's before any existing of users of hugetlb could
> make the changes in the rest of their software stack to support it.

One interesting experiment would be, to just take hugetlb and remove all 
complexity (strip it to it's core: a pooling of large pages without 
special MAP_PRIVATE support, PMD sharing, reservations, ...). Then, see 
how to get core-mm to just treat them like PUD/PMD-mapped folios that 
can get PTE-mapped -- just like we have with FS-level THP.

Maybe we could then factor out what's shared with the old hugetlb 
implementations (e.g., pooling) and have both co-exist (e.g., configured 
at runtime).

The user-space interface for hugetlb would not change (well, except fail 
MAP_PRIVATE for now)

(especially, no messing with anon hugetlb pages)


Again, the spirit would be "teach the core to just treat them like 
folios that can get PTE-mapped" instead of "add HGM to hugetlb". If we 
can achieve that without a hugetlb v2, great. But i think that will be 
harder .... but I might be just wrong.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [LSF/MM/BPF TOPIC] HGM for hugetlbfs
  2023-06-08  6:34                   ` David Hildenbrand
@ 2023-06-08 18:50                     ` Yang Shi
  2023-06-08 21:23                       ` Mike Kravetz
  2023-06-08 20:10                     ` Matthew Wilcox
  1 sibling, 1 reply; 29+ messages in thread
From: Yang Shi @ 2023-06-08 18:50 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: David Rientjes, Mike Kravetz, Yosry Ahmed, James Houghton,
	Naoya Horiguchi, Miaohe Lin, lsf-pc, linux-mm, Peter Xu,
	Michal Hocko, Matthew Wilcox, Axel Rasmussen, Jiaqi Yan

On Wed, Jun 7, 2023 at 11:34 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 08.06.23 02:02, David Rientjes wrote:
> > On Wed, 7 Jun 2023, Mike Kravetz wrote:
> >
> >>>>>> Are there strong objections to extending hugetlb for this support?
> >>>>>
> >>>>> I don't want to get too involved in this discussion (busy), but I
> >>>>> absolutely agree on the points that were raised at LSF/MM that
> >>>>>
> >>>>> (A) hugetlb is complicated and very special (many things not integrated
> >>>>> with core-mm, so we need special-casing all over the place). [example:
> >>>>> what is a pte?]
> >>>>>
> >>>>> (B) We added a bunch of complexity in the past that some people
> >>>>> considered very important (and it was not feature frozen, right? ;) ).
> >>>>> Looking back, we might just not have done some of that, or done it
> >>>>> differently/cleaner -- better integrated in the core. (PMD sharing,
> >>>>> MAP_PRIVATE, a reservation mechanism that still requires preallocation
> >>>>> because it fails with NUMA/fork, ...)
> >>>>>
> >>>>> (C) Unifying hugetlb and the core looks like it's getting more and more
> >>>>> out of reach, maybe even impossible with all the complexity we added
> >>>>> over the years (well, and keep adding).
> >>>>>
> >>>>> Sure, HGM for the purpose of better hwpoison handling makes sense. But
> >>>>> hugetlb is probably 20 years old and hwpoison handling probably 13 years
> >>>>> old. So we managed to get quite far without that optimization.
> >>>>>
> >
> > Sane handling for memory poisoning and optimizations for live migration
> > are both much more important for the real-world 1GB hugetlb user, so it
> > doesn't quite have that lengthy of a history.
> >
> > Unfortuantely, cloud providers receive complaints about both of these from
> > customers.  They are one of the most significant causes for poor customer
> > experience.
> >
> > While people have proposed 1GB THP support in the past, it was nacked, in
> > part, because of the suggestion to just use existing 1GB support in
> > hugetlb instead :)

Yes, but it was before HGM was proposed, we may revisit it.

>
> Yes, because I still think that the use for "transparent" (for the user)
> nowadays is very limited and not worth the complexity.
>
> IMHO, what you really want is a pool of large pages that (guarantees
> about availability and nodes) and fine control about who gets these
> pages. That's what hugetlb provides.

The most concern for 1G THP is the allocation time. But I don't think
it is a no-go for allocating THP from a preallocated pool, for
example, CMA.

>
> In contrast to THP, you don't want to allow for
> * Partially mmap, mremap, munmap, mprotect them
> * Partially sharing then / COW'ing them
> * Partially mixing them with other anon pages (MADV_DONTNEED + refault)

IIRC, QEMU treats hugetlbfs as 2M block size, we should be able to
teach QEMU to treat tmpfs + THP as 2M block size too. I used to have a
patch to make stat.st_blksize return THP size for tmpfs (89fdcd262fd4
mm: shmem: make stat.st_blksize return huge page size if THP is on).
So when the applications are aware of the 2M or 1G page/block size,
hopefully it may help reduce the partial mapping things. But I'm not
an expert on QEMU, I may miss something.

> * Exclude them from some features KSM/swap
> * (swap them out and eventually split them for that)

We have "noswap" mount option for tmpfs now, so swap is not a problem.

But we may lose some features, for example, PMD sharing, hugetlb
cgroup, etc. Not sure whether they are a showstopper or not.

So it sounds easier to have 1G THP than HGM IMHO if I don't miss
something vital.

>
> Because you don't want to get these pages PTE-mapped by the system
> *unless* there is a real reason (HGM, hwpoison) -- you want guarantees.
> Once such a page is PTE-mapped, you only want to collapse in place.
>
> But you don't want special-HGM, you simply want the core to PTE-map them
> like a (file) THP.
>
> IMHO, getting that realized much easier would be if we wouldn't have to
> care about some of the hugetlb complexity I raised (MAP_PRIVATE, PMD
> sharing), but maybe there is a way ...
>
> >
> >>>>> Absolutely, HGM for better postcopy live migration also makes sense, I
> >>>>> guess nobody disagrees on that.
> >>>>>
> >>>>>
> >>>>> But as discussed in that session, maybe we should just start anew and
> >>>>> implement something that integrates nicely with the core , instead of
> >>>>> making hugetlb more complicated and even more special.
> >>>>>
> >
> > Certainly an ideal would be where we could support everybody's use cases
> > in a much more cohesive way with the rest of the core MM.  I'm
> > particularly concerned about how long it will take to get to that state
> > even if we had kernel developers committed to doing the work.  Even if we
> > had a design for this new subsystem that was more tightly coupled with the
> > core MM, it would take O(years) to implement, test, extend for other
> > architectures, and that's before any existing of users of hugetlb could
> > make the changes in the rest of their software stack to support it.
>
> One interesting experiment would be, to just take hugetlb and remove all
> complexity (strip it to it's core: a pooling of large pages without
> special MAP_PRIVATE support, PMD sharing, reservations, ...). Then, see
> how to get core-mm to just treat them like PUD/PMD-mapped folios that
> can get PTE-mapped -- just like we have with FS-level THP.
>
> Maybe we could then factor out what's shared with the old hugetlb
> implementations (e.g., pooling) and have both co-exist (e.g., configured
> at runtime).
>
> The user-space interface for hugetlb would not change (well, except fail
> MAP_PRIVATE for now)
>
> (especially, no messing with anon hugetlb pages)
>
>
> Again, the spirit would be "teach the core to just treat them like
> folios that can get PTE-mapped" instead of "add HGM to hugetlb". If we
> can achieve that without a hugetlb v2, great. But i think that will be
> harder .... but I might be just wrong.
>
> --
> Cheers,
>
> David / dhildenb
>
>


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [LSF/MM/BPF TOPIC] HGM for hugetlbfs
  2023-06-08  6:34                   ` David Hildenbrand
  2023-06-08 18:50                     ` Yang Shi
@ 2023-06-08 20:10                     ` Matthew Wilcox
  2023-06-09  2:59                       ` David Rientjes
  2023-06-13 14:59                       ` Jason Gunthorpe
  1 sibling, 2 replies; 29+ messages in thread
From: Matthew Wilcox @ 2023-06-08 20:10 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: David Rientjes, Mike Kravetz, Yosry Ahmed, James Houghton,
	Naoya Horiguchi, Miaohe Lin, lsf-pc, linux-mm, Peter Xu,
	Michal Hocko, Axel Rasmussen, Jiaqi Yan

On Thu, Jun 08, 2023 at 08:34:10AM +0200, David Hildenbrand wrote:
> On 08.06.23 02:02, David Rientjes wrote:
> > While people have proposed 1GB THP support in the past, it was nacked, in
> > part, because of the suggestion to just use existing 1GB support in
> > hugetlb instead :)
> 
> Yes, because I still think that the use for "transparent" (for the user)
> nowadays is very limited and not worth the complexity.
> 
> IMHO, what you really want is a pool of large pages that (guarantees about
> availability and nodes) and fine control about who gets these pages. That's
> what hugetlb provides.
> 
> In contrast to THP, you don't want to allow for
> * Partially mmap, mremap, munmap, mprotect them
> * Partially sharing then / COW'ing them
> * Partially mixing them with other anon pages (MADV_DONTNEED + refault)
> * Exclude them from some features KSM/swap
> * (swap them out and eventually split them for that)
> 
> Because you don't want to get these pages PTE-mapped by the system *unless*
> there is a real reason (HGM, hwpoison) -- you want guarantees. Once such a
> page is PTE-mapped, you only want to collapse in place.
> 
> But you don't want special-HGM, you simply want the core to PTE-map them
> like a (file) THP.
> 
> IMHO, getting that realized much easier would be if we wouldn't have to care
> about some of the hugetlb complexity I raised (MAP_PRIVATE, PMD sharing),
> but maybe there is a way ...

I favour a more evolutionary than revolutionary approach.  That is,
I think it's acceptable to add new features to hugetlbfs _if_ they're
combined with cleanup work that gets hugetlbfs closer to the main mm.
This is why I harp on things like pagewalk that currently need special
handling for hugetlb -- that's pointless; they should just be treated as
large folios.  GUP handles hugetlb separately too, and I'm not sure why.

That's not to be confused with "hugetlb must change to be more like
the regular mm".  Sometimes both are bad, stupid and wrong, and need to
be changed.  The MM has never had to handle 1GB pages before and, eg,
handling mapcount by iterating over each struct page is not sensible
because that's 16MB of data just to answer folio_mapcount().


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [LSF/MM/BPF TOPIC] HGM for hugetlbfs
  2023-06-08 18:50                     ` Yang Shi
@ 2023-06-08 21:23                       ` Mike Kravetz
  2023-06-09  1:57                         ` Zi Yan
  0 siblings, 1 reply; 29+ messages in thread
From: Mike Kravetz @ 2023-06-08 21:23 UTC (permalink / raw)
  To: Yang Shi
  Cc: David Hildenbrand, David Rientjes, Yosry Ahmed, James Houghton,
	Naoya Horiguchi, Miaohe Lin, lsf-pc, linux-mm, Peter Xu,
	Michal Hocko, Matthew Wilcox, Axel Rasmussen, Jiaqi Yan, Zi Yan

On 06/08/23 11:50, Yang Shi wrote:
> On Wed, Jun 7, 2023 at 11:34 PM David Hildenbrand <david@redhat.com> wrote:
> >
> > On 08.06.23 02:02, David Rientjes wrote:
> > > On Wed, 7 Jun 2023, Mike Kravetz wrote:
> > >
> > >>>>>> Are there strong objections to extending hugetlb for this support?
> > >>>>>
> > >>>>> I don't want to get too involved in this discussion (busy), but I
> > >>>>> absolutely agree on the points that were raised at LSF/MM that
> > >>>>>
> > >>>>> (A) hugetlb is complicated and very special (many things not integrated
> > >>>>> with core-mm, so we need special-casing all over the place). [example:
> > >>>>> what is a pte?]
> > >>>>>
> > >>>>> (B) We added a bunch of complexity in the past that some people
> > >>>>> considered very important (and it was not feature frozen, right? ;) ).
> > >>>>> Looking back, we might just not have done some of that, or done it
> > >>>>> differently/cleaner -- better integrated in the core. (PMD sharing,
> > >>>>> MAP_PRIVATE, a reservation mechanism that still requires preallocation
> > >>>>> because it fails with NUMA/fork, ...)
> > >>>>>
> > >>>>> (C) Unifying hugetlb and the core looks like it's getting more and more
> > >>>>> out of reach, maybe even impossible with all the complexity we added
> > >>>>> over the years (well, and keep adding).
> > >>>>>
> > >>>>> Sure, HGM for the purpose of better hwpoison handling makes sense. But
> > >>>>> hugetlb is probably 20 years old and hwpoison handling probably 13 years
> > >>>>> old. So we managed to get quite far without that optimization.
> > >>>>>
> > >
> > > Sane handling for memory poisoning and optimizations for live migration
> > > are both much more important for the real-world 1GB hugetlb user, so it
> > > doesn't quite have that lengthy of a history.
> > >
> > > Unfortuantely, cloud providers receive complaints about both of these from
> > > customers.  They are one of the most significant causes for poor customer
> > > experience.
> > >
> > > While people have proposed 1GB THP support in the past, it was nacked, in
> > > part, because of the suggestion to just use existing 1GB support in
> > > hugetlb instead :)
> 
> Yes, but it was before HGM was proposed, we may revisit it.
> 

Adding Zi Yan on CC as the person driving 1G THP.

> >
> > Yes, because I still think that the use for "transparent" (for the user)
> > nowadays is very limited and not worth the complexity.
> >
> > IMHO, what you really want is a pool of large pages that (guarantees
> > about availability and nodes) and fine control about who gets these
> > pages. That's what hugetlb provides.
> 
> The most concern for 1G THP is the allocation time. But I don't think
> it is a no-go for allocating THP from a preallocated pool, for
> example, CMA.

I seem to remember Zi trying to use CMA for 1G THP allocations.  However, I
am not sure if using CMA would be sufficient.  IIUC, allocating from CMA could
still require page migrations to put together a 1G contiguous area.  In a pool
as used by hugetlb, 1G pages are pre-allocated and sitting in the pool.  The
downside of such a pool is that the memory can not be used for other purposes
and sits 'idle' if not allocated.

Hate to even bring this up, but there are complaints today about 'allocation
time' of 1GB pages from the hugetlb pool.  This 'allocation time' is actually
the time it takes to clear/zero 1G of memory.  Only reason I mention is
using something like CMA to allocate 1G pages (at fault time) may add
unacceptable latency.

> >
> > In contrast to THP, you don't want to allow for
> > * Partially mmap, mremap, munmap, mprotect them
> > * Partially sharing then / COW'ing them
> > * Partially mixing them with other anon pages (MADV_DONTNEED + refault)
> 
> IIRC, QEMU treats hugetlbfs as 2M block size, we should be able to
> teach QEMU to treat tmpfs + THP as 2M block size too. I used to have a
> patch to make stat.st_blksize return THP size for tmpfs (89fdcd262fd4
> mm: shmem: make stat.st_blksize return huge page size if THP is on).
> So when the applications are aware of the 2M or 1G page/block size,
> hopefully it may help reduce the partial mapping things. But I'm not
> an expert on QEMU, I may miss something.
> 
> > * Exclude them from some features KSM/swap
> > * (swap them out and eventually split them for that)
> 
> We have "noswap" mount option for tmpfs now, so swap is not a problem.
> 
> But we may lose some features, for example, PMD sharing, hugetlb
> cgroup, etc. Not sure whether they are a showstopper or not.
> 
> So it sounds easier to have 1G THP than HGM IMHO if I don't miss
> something vital.

I have always wanted to experiment with having THP use a pre-allocated
pool for huge page allocations.  Of course, this adds the complication
of what to do when the pool is exhausted.

Perhaps Zi has performed such experiments?
-- 
Mike Kravetz


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] HGM for hugetlbfs
  2023-06-07 22:06               ` Mike Kravetz
  2023-06-08  0:02                 ` David Rientjes
@ 2023-06-08 21:54                 ` Dan Williams
  2023-06-08 22:35                   ` Mike Kravetz
  1 sibling, 1 reply; 29+ messages in thread
From: Dan Williams @ 2023-06-08 21:54 UTC (permalink / raw)
  To: Mike Kravetz, David Hildenbrand
  Cc: Miaohe Lin, James Houghton, Naoya Horiguchi, Peter Xu,
	Yosry Ahmed, linux-mm, Michal Hocko, Matthew Wilcox,
	David Rientjes, Axel Rasmussen, lsf-pc, Jiaqi Yan

Mike Kravetz wrote:
> On 06/07/23 10:13, David Hildenbrand wrote:
[..]
> I am struggling with how to support existing hugetlb users that are running
> into issues like memory errors on hugetlb pages today.  And, yes that is a
> source of real customer issues.  They are not really happy with the current
> design that a single error will take out a 1G page, and their VM or
> application.  Moving to THP is not likely as they really want a pre-allocated
> pool of 1G pages.  I just don't have a good answer for them.

Is it the reporting interface, or the fact that the page gets offlined
too quickly? I.e. if the 1GB page was unmapped from userspace per usual
memory-failure, but the application had an opportunity to record what
got clobbered on a smaller granularity and then ask the kernel to repair
the page, would that relieve some pain? Where repair is atomically
writing a full cacheline of zeroes, or copying around the poison to a
new page and returning the old one to broken down and only have the
single 4K page with error quarantined.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] HGM for hugetlbfs
  2023-06-08 21:54                 ` [Lsf-pc] " Dan Williams
@ 2023-06-08 22:35                   ` Mike Kravetz
  2023-06-09  3:36                     ` Dan Williams
  0 siblings, 1 reply; 29+ messages in thread
From: Mike Kravetz @ 2023-06-08 22:35 UTC (permalink / raw)
  To: Dan Williams
  Cc: David Hildenbrand, Miaohe Lin, James Houghton, Naoya Horiguchi,
	Peter Xu, Yosry Ahmed, linux-mm, Michal Hocko, Matthew Wilcox,
	David Rientjes, Axel Rasmussen, lsf-pc, Jiaqi Yan

On 06/08/23 14:54, Dan Williams wrote:
> Mike Kravetz wrote:
> > On 06/07/23 10:13, David Hildenbrand wrote:
> [..]
> > I am struggling with how to support existing hugetlb users that are running
> > into issues like memory errors on hugetlb pages today.  And, yes that is a
> > source of real customer issues.  They are not really happy with the current
> > design that a single error will take out a 1G page, and their VM or
> > application.  Moving to THP is not likely as they really want a pre-allocated
> > pool of 1G pages.  I just don't have a good answer for them.
> 
> Is it the reporting interface, or the fact that the page gets offlined
> too quickly?

Somewhat both.

Reporting says the error starts at the beginning of the huge page with
length of huge page size.  So, actual error is not really isolated.  In
a way, this is 'desired' since hugetlb pages are treated as a single page.

Once a page is marked with poison, we prevent subsequent faults of the page.
Since a hugetlb page is treated as a single page, the 'good data' can
not be accessed as there is no way to fault in smaller pieces (4K pages)
of the page.  Jiaqi Yan actually put together patches to 'read' the good
4K pages within the hugetlb page [1], but we will not always have a file
handle.

[1] https://lore.kernel.org/linux-mm/20230517160948.811355-1-jiaqiyan@google.com/

>              I.e. if the 1GB page was unmapped from userspace per usual
> memory-failure, but the application had an opportunity to record what
> got clobbered on a smaller granularity and then ask the kernel to repair
> the page, would that relieve some pain?

Sounds interesting.

>                                         Where repair is atomically
> writing a full cacheline of zeroes,

Excuse my hardware ignorance ... In this case, I assume writing zeroes
will repair the error on the original memory?  This would then result
in data loss/zeroed, BUT the memory could be accessed without error.
So, the original 1G page could be used by the application (with data
missing of course).

>                                     or copying around the poison to a
> new page and returning the old one to broken down and only have the
> single 4K page with error quarantined.

I suppose we could do that within the kernel, however user space would
have the ability to do this IF it could access the good 4K pages.  That
is essentially what we do with THP pages by splitting and just marking a
single 4K page with poison.  That is the functionality proposed by HGM.

It seems like asking the kernel to 'repair the page' would be a new
hugetlb specific interface.  Or, could there be other users?
-- 
Mike Kravetz


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [LSF/MM/BPF TOPIC] HGM for hugetlbfs
  2023-06-08 21:23                       ` Mike Kravetz
@ 2023-06-09  1:57                         ` Zi Yan
  2023-06-09 15:17                           ` Pasha Tatashin
  2023-06-09 19:57                           ` Matthew Wilcox
  0 siblings, 2 replies; 29+ messages in thread
From: Zi Yan @ 2023-06-09  1:57 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Yang Shi, David Hildenbrand, David Rientjes, Yosry Ahmed,
	James Houghton, Naoya Horiguchi, Miaohe Lin, lsf-pc, linux-mm,
	Peter Xu, Michal Hocko, Matthew Wilcox, Axel Rasmussen,
	Jiaqi Yan

[-- Attachment #1: Type: text/plain, Size: 7733 bytes --]

On 8 Jun 2023, at 17:23, Mike Kravetz wrote:

> On 06/08/23 11:50, Yang Shi wrote:
>> On Wed, Jun 7, 2023 at 11:34 PM David Hildenbrand <david@redhat.com> wrote:
>>>
>>> On 08.06.23 02:02, David Rientjes wrote:
>>>> On Wed, 7 Jun 2023, Mike Kravetz wrote:
>>>>
>>>>>>>>> Are there strong objections to extending hugetlb for this support?
>>>>>>>>
>>>>>>>> I don't want to get too involved in this discussion (busy), but I
>>>>>>>> absolutely agree on the points that were raised at LSF/MM that
>>>>>>>>
>>>>>>>> (A) hugetlb is complicated and very special (many things not integrated
>>>>>>>> with core-mm, so we need special-casing all over the place). [example:
>>>>>>>> what is a pte?]
>>>>>>>>
>>>>>>>> (B) We added a bunch of complexity in the past that some people
>>>>>>>> considered very important (and it was not feature frozen, right? ;) ).
>>>>>>>> Looking back, we might just not have done some of that, or done it
>>>>>>>> differently/cleaner -- better integrated in the core. (PMD sharing,
>>>>>>>> MAP_PRIVATE, a reservation mechanism that still requires preallocation
>>>>>>>> because it fails with NUMA/fork, ...)
>>>>>>>>
>>>>>>>> (C) Unifying hugetlb and the core looks like it's getting more and more
>>>>>>>> out of reach, maybe even impossible with all the complexity we added
>>>>>>>> over the years (well, and keep adding).
>>>>>>>>
>>>>>>>> Sure, HGM for the purpose of better hwpoison handling makes sense. But
>>>>>>>> hugetlb is probably 20 years old and hwpoison handling probably 13 years
>>>>>>>> old. So we managed to get quite far without that optimization.
>>>>>>>>
>>>>
>>>> Sane handling for memory poisoning and optimizations for live migration
>>>> are both much more important for the real-world 1GB hugetlb user, so it
>>>> doesn't quite have that lengthy of a history.
>>>>
>>>> Unfortuantely, cloud providers receive complaints about both of these from
>>>> customers.  They are one of the most significant causes for poor customer
>>>> experience.
>>>>
>>>> While people have proposed 1GB THP support in the past, it was nacked, in
>>>> part, because of the suggestion to just use existing 1GB support in
>>>> hugetlb instead :)
>>
>> Yes, but it was before HGM was proposed, we may revisit it.
>>
>
> Adding Zi Yan on CC as the person driving 1G THP.

Thanks.

I have not attended the LSF/MM, but the points above mostly look valid.
IMHO, if we keep adding new features to hugetlbfs, we might have two
parallel memory systems, replicating each other a lot. Maybe it is the time
to think about how to merge hugetlbfs features back to core mm.

From my understanding, the most desirable user visible feature of hugetlbfs
is that it provides deterministic huge page allocation, since huge pages
are preserved. If we can preserve that, replacing hugetlbfs backend with
THP or even just plain folio should be good enough. Let me know if I miss
any important user visible feature.

On the hugetlbfs backend, PMD sharing, MAP_PRIVATE, reducing struct page
storage all look features core mm might want. Merging these features back
to core mm might be a good first step.

I thought about replacing hugetlbfs backend with THP (with my 1GB THP support),
but find that not all THP features are necessary for hugetlbfs users or
compatible with existing hugetlbfs. For example, hugetlbfs does not need
transparent page split, since user just wants that big page size. And page
split might not get along with reducing struct page storage feature.

In sum, I think we might not need all THP features (page table entry split
and huge page split) to replace hugetlbfs and we might just need to enable
core mm to handle any size folio and hugetlb pages are just folios that
can go as large as 1GB. As a result, hugetlb pages can take advantage of
all core mm features, like hwpoison.

>>>
>>> Yes, because I still think that the use for "transparent" (for the user)
>>> nowadays is very limited and not worth the complexity.
>>>
>>> IMHO, what you really want is a pool of large pages that (guarantees
>>> about availability and nodes) and fine control about who gets these
>>> pages. That's what hugetlb provides.
>>
>> The most concern for 1G THP is the allocation time. But I don't think
>> it is a no-go for allocating THP from a preallocated pool, for
>> example, CMA.
>
> I seem to remember Zi trying to use CMA for 1G THP allocations.  However, I
> am not sure if using CMA would be sufficient.  IIUC, allocating from CMA could
> still require page migrations to put together a 1G contiguous area.  In a pool
> as used by hugetlb, 1G pages are pre-allocated and sitting in the pool.  The
> downside of such a pool is that the memory can not be used for other purposes
> and sits 'idle' if not allocated.

Yes, I tried that. One big issue is that at free time a 1GB THP needs to be freed
back to a CMA pool instead of buddy allocator, but THP can be split and after
split, it is really hard to tell whether a page is from a CMA pool or not.

hugetlb pages does not support page split yet, so the issue might not be
relevant. But if a THP cannot be split freely, is it a still THP? So it comes
back to my question: do we really want 1GB THP or just core mm can handle
any size folios?

>
> Hate to even bring this up, but there are complaints today about 'allocation
> time' of 1GB pages from the hugetlb pool.  This 'allocation time' is actually
> the time it takes to clear/zero 1G of memory.  Only reason I mention is
> using something like CMA to allocate 1G pages (at fault time) may add
> unacceptable latency.

One solution I had in mind is that you could zero these 1GB pages at free
time in a worker thread, so that you do not pay the penalty at page allocation
time. But it would not work if the allocation comes right after a page is
freed.

>
>>>
>>> In contrast to THP, you don't want to allow for
>>> * Partially mmap, mremap, munmap, mprotect them
>>> * Partially sharing then / COW'ing them
>>> * Partially mixing them with other anon pages (MADV_DONTNEED + refault)
>>
>> IIRC, QEMU treats hugetlbfs as 2M block size, we should be able to
>> teach QEMU to treat tmpfs + THP as 2M block size too. I used to have a
>> patch to make stat.st_blksize return THP size for tmpfs (89fdcd262fd4
>> mm: shmem: make stat.st_blksize return huge page size if THP is on).
>> So when the applications are aware of the 2M or 1G page/block size,
>> hopefully it may help reduce the partial mapping things. But I'm not
>> an expert on QEMU, I may miss something.
>>
>>> * Exclude them from some features KSM/swap
>>> * (swap them out and eventually split them for that)
>>
>> We have "noswap" mount option for tmpfs now, so swap is not a problem.
>>
>> But we may lose some features, for example, PMD sharing, hugetlb
>> cgroup, etc. Not sure whether they are a showstopper or not.
>>
>> So it sounds easier to have 1G THP than HGM IMHO if I don't miss
>> something vital.
>
> I have always wanted to experiment with having THP use a pre-allocated
> pool for huge page allocations.  Of course, this adds the complication
> of what to do when the pool is exhausted.
>
> Perhaps Zi has performed such experiments?

Using CMA allocation is a similar experiment, but when CMA pools are
exhausted, 1GB THP allocation will fail. We can try to use compaction to
get more 1GB free pages, but that might take prohibitively long time
and could fail at the end.

At the end, let me ask this again: do we want 1GB THP to replace hugetlb
or enable core mm to handle any size folios and change 1GB hugetlb page
to a 1GB folio?

--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [LSF/MM/BPF TOPIC] HGM for hugetlbfs
  2023-06-08 20:10                     ` Matthew Wilcox
@ 2023-06-09  2:59                       ` David Rientjes
  2023-06-13 14:59                       ` Jason Gunthorpe
  1 sibling, 0 replies; 29+ messages in thread
From: David Rientjes @ 2023-06-09  2:59 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: David Hildenbrand, Mike Kravetz, Yosry Ahmed, James Houghton,
	Naoya Horiguchi, Miaohe Lin, lsf-pc, linux-mm, Peter Xu,
	Michal Hocko, Axel Rasmussen, Jiaqi Yan

On Thu, 8 Jun 2023, Matthew Wilcox wrote:

> On Thu, Jun 08, 2023 at 08:34:10AM +0200, David Hildenbrand wrote:
> > On 08.06.23 02:02, David Rientjes wrote:
> > > While people have proposed 1GB THP support in the past, it was nacked, in
> > > part, because of the suggestion to just use existing 1GB support in
> > > hugetlb instead :)
> > 
> > Yes, because I still think that the use for "transparent" (for the user)
> > nowadays is very limited and not worth the complexity.
> > 
> > IMHO, what you really want is a pool of large pages that (guarantees about
> > availability and nodes) and fine control about who gets these pages. That's
> > what hugetlb provides.
> > 
> > In contrast to THP, you don't want to allow for
> > * Partially mmap, mremap, munmap, mprotect them
> > * Partially sharing then / COW'ing them
> > * Partially mixing them with other anon pages (MADV_DONTNEED + refault)
> > * Exclude them from some features KSM/swap
> > * (swap them out and eventually split them for that)
> > 
> > Because you don't want to get these pages PTE-mapped by the system *unless*
> > there is a real reason (HGM, hwpoison) -- you want guarantees. Once such a
> > page is PTE-mapped, you only want to collapse in place.
> > 
> > But you don't want special-HGM, you simply want the core to PTE-map them
> > like a (file) THP.
> > 
> > IMHO, getting that realized much easier would be if we wouldn't have to care
> > about some of the hugetlb complexity I raised (MAP_PRIVATE, PMD sharing),
> > but maybe there is a way ...
> 
> I favour a more evolutionary than revolutionary approach.  That is,
> I think it's acceptable to add new features to hugetlbfs _if_ they're
> combined with cleanup work that gets hugetlbfs closer to the main mm.
> This is why I harp on things like pagewalk that currently need special
> handling for hugetlb -- that's pointless; they should just be treated as
> large folios.  GUP handles hugetlb separately too, and I'm not sure why.
> 
> That's not to be confused with "hugetlb must change to be more like
> the regular mm".  Sometimes both are bad, stupid and wrong, and need to
> be changed.  The MM has never had to handle 1GB pages before and, eg,
> handling mapcount by iterating over each struct page is not sensible
> because that's 16MB of data just to answer folio_mapcount().
> 

Ok, so I'll latch onto this feedback because I think it's (1) a concrete 
path forward to solve existing real-world pain by adding support to 
hugetlb (to address hwpoison and postcopy live migration latency) and (2) 
an overall and long-awaited improvement in maintainability for the MM 
subsystem.

Nobody on this thread is interested in substantially increasing the 
complexity of hugetlb.  That's true from the standpoint of sheer 
maintainability, but also reliability.  We've been bitten time and time 
again by hugetlb-only reliability issues, which are their own class of 
customer complaints.  These have not only been in hugetlb's reservation 
code.

In fact, from my POV, hugetlb *reliability* is the most important topic 
discussed so far in this thread and that can be substantially improved by 
this evolutionary approach that reduces the "special casing" that is the 
hugetlb subsystem today.  We would very much want a unified way of 
handling page walks, for example.

I don't think anybody here is advocating for making hugetlb more of a 
snowflake :)  Improving hugetlb maintainability *and* reliability is of 
critical importance to us, as is solving memory poisoning and live 
migration latency.  I don't think that one needs to block the other.

So the work to improve hugetlb reliability and maintainability is 
something that can be tractable and we'd definitely like feedback on so 
that we can contribute to it.

I'd very much prefer that this does not get in the way of solving 
real-world problems that HGM addresses just because it's an active source 
of real customer issues today.  Rest assured, making forward progress on 
HGM will not reduce our interest in improving hugetlb maintainability :)

I know that James is very eager to receive code review for the HGM series 
itself from anybody who would be willing to review it.  Is there a way to 
make forward progress on deciding whether HGM (with any code review 
comments addressed) has a path forward?


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] HGM for hugetlbfs
  2023-06-08 22:35                   ` Mike Kravetz
@ 2023-06-09  3:36                     ` Dan Williams
  2023-06-09 20:20                       ` James Houghton
  0 siblings, 1 reply; 29+ messages in thread
From: Dan Williams @ 2023-06-09  3:36 UTC (permalink / raw)
  To: Mike Kravetz, Dan Williams
  Cc: David Hildenbrand, Miaohe Lin, James Houghton, Naoya Horiguchi,
	Peter Xu, Yosry Ahmed, linux-mm, Michal Hocko, Matthew Wilcox,
	David Rientjes, Axel Rasmussen, lsf-pc, Jiaqi Yan, jane.chu

[ add Jane ]

Mike Kravetz wrote:
> On 06/08/23 14:54, Dan Williams wrote:
> > Mike Kravetz wrote:
> > > On 06/07/23 10:13, David Hildenbrand wrote:
> > [..]
> > > I am struggling with how to support existing hugetlb users that are running
> > > into issues like memory errors on hugetlb pages today.  And, yes that is a
> > > source of real customer issues.  They are not really happy with the current
> > > design that a single error will take out a 1G page, and their VM or
> > > application.  Moving to THP is not likely as they really want a pre-allocated
> > > pool of 1G pages.  I just don't have a good answer for them.
> > 
> > Is it the reporting interface, or the fact that the page gets offlined
> > too quickly?
> 
> Somewhat both.
> 
> Reporting says the error starts at the beginning of the huge page with
> length of huge page size.  So, actual error is not really isolated.  In
> a way, this is 'desired' since hugetlb pages are treated as a single page.

On x86 the error reporting is always by cacheline, but it's the
memory-failure code that turns that into a SIGBUS with the sigaction
info indicating failure relative to the page-size. That interface has
been awkward for PMEM as well as Jane can attest.

> Once a page is marked with poison, we prevent subsequent faults of the page.

That makes sense.

> Since a hugetlb page is treated as a single page, the 'good data' can
> not be accessed as there is no way to fault in smaller pieces (4K pages)
> of the page.  Jiaqi Yan actually put together patches to 'read' the good
> 4K pages within the hugetlb page [1], but we will not always have a file
> handle.

That mitigation is also a problem for device-dax that makes hard
guarantees that mappings will always be aligned, mainly to keep the
driver simple.

> 
> [1] https://lore.kernel.org/linux-mm/20230517160948.811355-1-jiaqiyan@google.com/
> 
> >              I.e. if the 1GB page was unmapped from userspace per usual
> > memory-failure, but the application had an opportunity to record what
> > got clobbered on a smaller granularity and then ask the kernel to repair
> > the page, would that relieve some pain?
> 
> Sounds interesting.
> 
> >                                         Where repair is atomically
> > writing a full cacheline of zeroes,
> 
> Excuse my hardware ignorance ... In this case, I assume writing zeroes
> will repair the error on the original memory?  This would then result
> in data loss/zeroed, BUT the memory could be accessed without error.
> So, the original 1G page could be used by the application (with data
> missing of course).

Yes, but it depends. Sometimes poison is a permanent error and no amount
of writing to it can correct the error, sometimes it is transient like a
high energy particle flipped a bit in the cell, and sometime it is
deposited from outside the memory controller like the case when a
poisoned dirty cacheline gets written back.

The majority of the time, outside catastrophic loss of a whole rank,
it's only 64-bytes at a time that has gone bad.

> >                                     or copying around the poison to a
> > new page and returning the old one to broken down and only have the
> > single 4K page with error quarantined.
> 
> I suppose we could do that within the kernel, however user space would
> have the ability to do this IF it could access the good 4K pages.  That
> is essentially what we do with THP pages by splitting and just marking a
> single 4K page with poison.  That is the functionality proposed by HGM.
> 
> It seems like asking the kernel to 'repair the page' would be a new
> hugetlb specific interface.  Or, could there be other users?

I think there are other users for this.

Jane worked on DAX_RECOVERY_WRITE support which is a way for a DIRECT_IO
write on a DAX file (guaranteed to be page aligned) to plumb an
operation to the pmem driver top repair a location that is not mmap'able
due to hardware poison.

However that's fsdax specific. It would be nice to be able to have
SIGBUS handlers that can ask the kernel to overwrite the cacheline and
restore access to the rest of the page. It seems unfortunate to live
with throwing away 1GB - 64-bytes of capacity on the first sign of
trouble.

The nice thing about hugetlb compared to pmem is that you do not need to
repair in place, in case the error is permanent. Conceivably the kernel
could allocate a new page, perform the copy of the good bits on behalf
of the application, and let the page be mapped again. If that copy
encounters poison rinse and repeat until it succeeds or the application
says, "you know what, I think it's dead, thanks anyway".

It's something that has been on the "when there is time pile", but maybe
instead of making hugetlb more complicated this effort goes to make
memory-failure more capable.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [LSF/MM/BPF TOPIC] HGM for hugetlbfs
  2023-06-09  1:57                         ` Zi Yan
@ 2023-06-09 15:17                           ` Pasha Tatashin
  2023-06-09 19:04                             ` Ankur Arora
  2023-06-09 19:57                           ` Matthew Wilcox
  1 sibling, 1 reply; 29+ messages in thread
From: Pasha Tatashin @ 2023-06-09 15:17 UTC (permalink / raw)
  To: Zi Yan
  Cc: Mike Kravetz, Yang Shi, David Hildenbrand, David Rientjes,
	Yosry Ahmed, James Houghton, Naoya Horiguchi, Miaohe Lin, lsf-pc,
	linux-mm, Peter Xu, Michal Hocko, Matthew Wilcox, Axel Rasmussen,
	Jiaqi Yan

> > Hate to even bring this up, but there are complaints today about 'allocation
> > time' of 1GB pages from the hugetlb pool.  This 'allocation time' is actually
> > the time it takes to clear/zero 1G of memory.  Only reason I mention is
> > using something like CMA to allocate 1G pages (at fault time) may add
> > unacceptable latency.
>
> One solution I had in mind is that you could zero these 1GB pages at free
> time in a worker thread, so that you do not pay the penalty at page allocation
> time. But it would not work if the allocation comes right after a page is
> freed.

In addition, there were several proposals to speed zeroing of huge pages:

1. X86 specific: Cannon Matthews proposed "clear 1G pages with
streaming stores on x86" change.
https://lore.kernel.org/linux-mm/20200307010353.172991-1-cannonmatthews@google.com

This speeds up setting up 1G pages by roughly 4 times.

2. X86 specific: Kirill and Andi proposed also proposed a similar
change even earlier:
https://lore.kernel.org/all/1345470757-12005-1-git-send-email-kirill.shutemov@linux.intel.com

3. Arch Generic: Ktasks https://lwn.net/Articles/770826
That allows zeroing HugeTLB pages in Parallel.

4. VM Specific: https://lwn.net/Articles/931933/
Allows to lazyly zero 1G pages in the guest.

I looked through the (1) proposal and did not see any major pushbacks,
I do not see why movnti can't be used specifically for gigantic pages.

Pasha


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [LSF/MM/BPF TOPIC] HGM for hugetlbfs
  2023-06-09 15:17                           ` Pasha Tatashin
@ 2023-06-09 19:04                             ` Ankur Arora
  0 siblings, 0 replies; 29+ messages in thread
From: Ankur Arora @ 2023-06-09 19:04 UTC (permalink / raw)
  To: pasha.tatashin
  Cc: axelrasmussen, david, jiaqiyan, jthoughton, linmiaohe, linux-mm,
	lsf-pc, mhocko, mike.kravetz, naoya.horiguchi, peterx, rientjes,
	shy828301, willy, yosryahmed, ziy, ankur.a.arora

> > > Hate to even bring this up, but there are complaints today about 'allocation
> > > time' of 1GB pages from the hugetlb pool.  This 'allocation time' is actually
> > > the time it takes to clear/zero 1G of memory.  Only reason I mention is
> > > using something like CMA to allocate 1G pages (at fault time) may add
> > > unacceptable latency.
> >
> > One solution I had in mind is that you could zero these 1GB pages at free
> > time in a worker thread, so that you do not pay the penalty at page allocation
> > time. But it would not work if the allocation comes right after a page is
> > freed.
> 
> In addition, there were several proposals to speed zeroing of huge pages:
> 
> 1. X86 specific: Cannon Matthews proposed "clear 1G pages with
> streaming stores on x86" change.
> https://lore.kernel.org/linux-mm/20200307010353.172991-1-cannonmatthews@google.com
> 
> This speeds up setting up 1G pages by roughly 4 times.
> 
> 2. X86 specific: Kirill and Andi proposed also proposed a similar
> change even earlier:
> https://lore.kernel.org/all/1345470757-12005-1-git-send-email-kirill.shutemov@linux.intel.com

Also, this one more recently from me:
  https://lore.kernel.org/all/20220606202109.1306034-1-ankur.a.arora@oracle.com/

Linus had some comments on the overall approach and I had sent out this
as follow-up:
  https://lore.kernel.org/all/20230403052233.1880567-1-ankur.a.arora@oracle.com/

> 3. Arch Generic: Ktasks https://lwn.net/Articles/770826
> That allows zeroing HugeTLB pages in Parallel.
> 
> 4. VM Specific: https://lwn.net/Articles/931933/
> Allows to lazyly zero 1G pages in the guest.
> 
> I looked through the (1) proposal and did not see any major pushbacks,
> I do not see why movnti can't be used specifically for gigantic pages.

AFAICT, the recent concerns are mostly around proper API, and in
encapsulating MOVNTI like primitives such that they can be safely
used.


Ankur


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [LSF/MM/BPF TOPIC] HGM for hugetlbfs
  2023-06-09  1:57                         ` Zi Yan
  2023-06-09 15:17                           ` Pasha Tatashin
@ 2023-06-09 19:57                           ` Matthew Wilcox
  1 sibling, 0 replies; 29+ messages in thread
From: Matthew Wilcox @ 2023-06-09 19:57 UTC (permalink / raw)
  To: Zi Yan
  Cc: Mike Kravetz, Yang Shi, David Hildenbrand, David Rientjes,
	Yosry Ahmed, James Houghton, Naoya Horiguchi, Miaohe Lin, lsf-pc,
	linux-mm, Peter Xu, Michal Hocko, Axel Rasmussen, Jiaqi Yan

On Thu, Jun 08, 2023 at 09:57:34PM -0400, Zi Yan wrote:
> On the hugetlbfs backend, PMD sharing, MAP_PRIVATE, reducing struct page
> storage all look features core mm might want. Merging these features back
> to core mm might be a good first step.
> 
> I thought about replacing hugetlbfs backend with THP (with my 1GB THP support),
> but find that not all THP features are necessary for hugetlbfs users or
> compatible with existing hugetlbfs. For example, hugetlbfs does not need
> transparent page split, since user just wants that big page size. And page
> split might not get along with reducing struct page storage feature.

But with HGM, we actually do want to split the page because part of it
has hit a hwpoison event.  What these customers don't need is support
for misaligned mappings or partial mappings.  If they map a 1GB page,
they do it 1GB aligned and in multiples of 1GB.  And they tell us in
advance that's what they're doing.

> In sum, I think we might not need all THP features (page table entry split
> and huge page split) to replace hugetlbfs and we might just need to enable
> core mm to handle any size folio and hugetlb pages are just folios that
> can go as large as 1GB. As a result, hugetlb pages can take advantage of
> all core mm features, like hwpoison.

Yes, this is more or less in line with my work.  And yet there are still
problems to solve:

 - mapcount (discussed elsewhere in the thread)
 - page cache index scaling (Sid is working on this)
 - page table sharing (mshare)
 - reserved memory

> > I seem to remember Zi trying to use CMA for 1G THP allocations.  However, I
> > am not sure if using CMA would be sufficient.  IIUC, allocating from CMA could
> > still require page migrations to put together a 1G contiguous area.  In a pool
> > as used by hugetlb, 1G pages are pre-allocated and sitting in the pool.  The
> > downside of such a pool is that the memory can not be used for other purposes
> > and sits 'idle' if not allocated.
> 
> Yes, I tried that. One big issue is that at free time a 1GB THP needs to be freed
> back to a CMA pool instead of buddy allocator, but THP can be split and after
> split, it is really hard to tell whether a page is from a CMA pool or not.
> 
> hugetlb pages does not support page split yet, so the issue might not be
> relevant. But if a THP cannot be split freely, is it a still THP? So it comes
> back to my question: do we really want 1GB THP or just core mm can handle
> any size folios?

We definitely want the core MM to be able to handle folios of arbitrary
size.  There are a pile of places still to fix, eg if you map a
misaligned 1GB page, you can see N PTEs followed by 511 PMDs followed by
512-N PTEs.  There are a lot of places that assume pmd_page() returns
both a head page and the precise page, and those will need to be fixed.
There's a reason I limit page cache to PMD_ORDER today.

> > Hate to even bring this up, but there are complaints today about 'allocation
> > time' of 1GB pages from the hugetlb pool.  This 'allocation time' is actually
> > the time it takes to clear/zero 1G of memory.  Only reason I mention is
> > using something like CMA to allocate 1G pages (at fault time) may add
> > unacceptable latency.
> 
> One solution I had in mind is that you could zero these 1GB pages at free
> time in a worker thread, so that you do not pay the penalty at page allocation
> time. But it would not work if the allocation comes right after a page is
> freed.

It rather goes against the principle of the user should pay the cost.
If we got the zeroing for free, that'd be one thing, but it feels like
we're robbing Peter (of CPU time) to pay Paul.

> At the end, let me ask this again: do we want 1GB THP to replace hugetlb
> or enable core mm to handle any size folios and change 1GB hugetlb page
> to a 1GB folio?

I don't see this as an either-or.  The core MM needs to be enhanced to
handle arbitrary sized folios, but the hugetlbfs interface needs to be
kept around for ever.  What we need from a maintainability point of view
is removing how special hugetlbfs is.




^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] HGM for hugetlbfs
  2023-06-09  3:36                     ` Dan Williams
@ 2023-06-09 20:20                       ` James Houghton
  2023-06-13 15:17                         ` Jason Gunthorpe
  0 siblings, 1 reply; 29+ messages in thread
From: James Houghton @ 2023-06-09 20:20 UTC (permalink / raw)
  To: Dan Williams
  Cc: Mike Kravetz, David Hildenbrand, Miaohe Lin, Naoya Horiguchi,
	Peter Xu, Yosry Ahmed, linux-mm, Michal Hocko, Matthew Wilcox,
	David Rientjes, Axel Rasmussen, lsf-pc, Jiaqi Yan, jane.chu

On Thu, Jun 8, 2023 at 8:36 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> [ add Jane ]
>
> Mike Kravetz wrote:
> > On 06/08/23 14:54, Dan Williams wrote:
> > > Mike Kravetz wrote:
> > > > On 06/07/23 10:13, David Hildenbrand wrote:
> > > [..]
> > > > I am struggling with how to support existing hugetlb users that are running
> > > > into issues like memory errors on hugetlb pages today.  And, yes that is a
> > > > source of real customer issues.  They are not really happy with the current
> > > > design that a single error will take out a 1G page, and their VM or
> > > > application.  Moving to THP is not likely as they really want a pre-allocated
> > > > pool of 1G pages.  I just don't have a good answer for them.
> > >
> > > Is it the reporting interface, or the fact that the page gets offlined
> > > too quickly?
> >
> > Somewhat both.
> >
> > Reporting says the error starts at the beginning of the huge page with
> > length of huge page size.  So, actual error is not really isolated.  In
> > a way, this is 'desired' since hugetlb pages are treated as a single page.
>
> On x86 the error reporting is always by cacheline, but it's the
> memory-failure code that turns that into a SIGBUS with the sigaction
> info indicating failure relative to the page-size. That interface has
> been awkward for PMEM as well as Jane can attest.
>
> > Once a page is marked with poison, we prevent subsequent faults of the page.
>
> That makes sense.
>
> > Since a hugetlb page is treated as a single page, the 'good data' can
> > not be accessed as there is no way to fault in smaller pieces (4K pages)
> > of the page.  Jiaqi Yan actually put together patches to 'read' the good
> > 4K pages within the hugetlb page [1], but we will not always have a file
> > handle.
>
> That mitigation is also a problem for device-dax that makes hard
> guarantees that mappings will always be aligned, mainly to keep the
> driver simple.
>
> >
> > [1] https://lore.kernel.org/linux-mm/20230517160948.811355-1-jiaqiyan@google.com/
> >
> > >              I.e. if the 1GB page was unmapped from userspace per usual
> > > memory-failure, but the application had an opportunity to record what
> > > got clobbered on a smaller granularity and then ask the kernel to repair
> > > the page, would that relieve some pain?
> >
> > Sounds interesting.
> >
> > >                                         Where repair is atomically
> > > writing a full cacheline of zeroes,
> >
> > Excuse my hardware ignorance ... In this case, I assume writing zeroes
> > will repair the error on the original memory?  This would then result
> > in data loss/zeroed, BUT the memory could be accessed without error.
> > So, the original 1G page could be used by the application (with data
> > missing of course).
>
> Yes, but it depends. Sometimes poison is a permanent error and no amount
> of writing to it can correct the error, sometimes it is transient like a
> high energy particle flipped a bit in the cell, and sometime it is
> deposited from outside the memory controller like the case when a
> poisoned dirty cacheline gets written back.
>
> The majority of the time, outside catastrophic loss of a whole rank,
> it's only 64-bytes at a time that has gone bad.
>
> > >                                     or copying around the poison to a
> > > new page and returning the old one to broken down and only have the
> > > single 4K page with error quarantined.
> >
> > I suppose we could do that within the kernel, however user space would
> > have the ability to do this IF it could access the good 4K pages.  That
> > is essentially what we do with THP pages by splitting and just marking a
> > single 4K page with poison.  That is the functionality proposed by HGM.
> >
> > It seems like asking the kernel to 'repair the page' would be a new
> > hugetlb specific interface.  Or, could there be other users?
>
> I think there are other users for this.
>
> Jane worked on DAX_RECOVERY_WRITE support which is a way for a DIRECT_IO
> write on a DAX file (guaranteed to be page aligned) to plumb an
> operation to the pmem driver top repair a location that is not mmap'able
> due to hardware poison.
>
> However that's fsdax specific. It would be nice to be able to have
> SIGBUS handlers that can ask the kernel to overwrite the cacheline and
> restore access to the rest of the page. It seems unfortunate to live
> with throwing away 1GB - 64-bytes of capacity on the first sign of
> trouble.
>
> The nice thing about hugetlb compared to pmem is that you do not need to
> repair in place, in case the error is permanent. Conceivably the kernel
> could allocate a new page, perform the copy of the good bits on behalf
> of the application, and let the page be mapped again. If that copy
> encounters poison rinse and repeat until it succeeds or the application
> says, "you know what, I think it's dead, thanks anyway".

I'm not sure if this is compatible with what we need for VMs. We can't
overwrite/zero guest memory unless the guest were somehow enlightened,
which we can't guarantee. We can't allow the guest to keep triggering
memory errors -- i.e., we have to unmap the memory at least from the
EPT (ideally by unmapping it from the userspace page tables).

So, we could:
1. Do what HGM does and have the kernel unmap the 4K page in the
userspace page tables.
2. On-the-fly change the VMA for our hugepage to not be HugeTLB
anymore, and re-map all the good 4K pages.
3. Tell userspace that it must change its mapping from HugeTLB to
something else, and move the good 4K pages into the new mapping.

(2) feels like more complexity than (1). If a user created a
MAP_HUGETLB mapping and now it isn't HugeTLB, that feels wrong.

(3) today isn't possible, but with Jiaqi's improvement to hugetlbfs
read() it becomes possible. We'll need to have an extra 1G of memory
while we are doing this copying/recovery, and it isn't transparent at
all.

(3) is additionally painful when considering live migration. We have
to keep the 4K page unmapped after the migration (to keep it poisoned
from the guest's perspective), but the page is no longer *actually*
poisoned on the host. To get the memory we need to back our
fake-poisoned pages with tmpfs, we would need to free our 1G page.
Getting that page back later isn't trivial.

So (1) still seems like the most natural solution, so the question
becomes: how exactly do we implement 4K unmapping? And that brings us
back to the main question about how HGM should be implemented in
general.

>
> It's something that has been on the "when there is time pile", but maybe
> instead of making hugetlb more complicated this effort goes to make
> memory-failure more capable.

I like this line of thinking, but as I see it right now, we still need
something like HGM -- maybe I'm wrong. :)

- James


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [LSF/MM/BPF TOPIC] HGM for hugetlbfs
  2023-06-08 20:10                     ` Matthew Wilcox
  2023-06-09  2:59                       ` David Rientjes
@ 2023-06-13 14:59                       ` Jason Gunthorpe
  2023-06-13 15:15                         ` David Hildenbrand
  1 sibling, 1 reply; 29+ messages in thread
From: Jason Gunthorpe @ 2023-06-13 14:59 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: David Hildenbrand, David Rientjes, Mike Kravetz, Yosry Ahmed,
	James Houghton, Naoya Horiguchi, Miaohe Lin, lsf-pc, linux-mm,
	Peter Xu, Michal Hocko, Axel Rasmussen, Jiaqi Yan

On Thu, Jun 08, 2023 at 09:10:15PM +0100, Matthew Wilcox wrote:
> On Thu, Jun 08, 2023 at 08:34:10AM +0200, David Hildenbrand wrote:
> > On 08.06.23 02:02, David Rientjes wrote:
> > > While people have proposed 1GB THP support in the past, it was nacked, in
> > > part, because of the suggestion to just use existing 1GB support in
> > > hugetlb instead :)
> > 
> > Yes, because I still think that the use for "transparent" (for the user)
> > nowadays is very limited and not worth the complexity.
> > 
> > IMHO, what you really want is a pool of large pages that (guarantees about
> > availability and nodes) and fine control about who gets these pages. That's
> > what hugetlb provides.
> > 
> > In contrast to THP, you don't want to allow for
> > * Partially mmap, mremap, munmap, mprotect them
> > * Partially sharing then / COW'ing them
> > * Partially mixing them with other anon pages (MADV_DONTNEED + refault)
> > * Exclude them from some features KSM/swap
> > * (swap them out and eventually split them for that)
> > 
> > Because you don't want to get these pages PTE-mapped by the system *unless*
> > there is a real reason (HGM, hwpoison) -- you want guarantees. Once such a
> > page is PTE-mapped, you only want to collapse in place.
> > 
> > But you don't want special-HGM, you simply want the core to PTE-map them
> > like a (file) THP.
> > 
> > IMHO, getting that realized much easier would be if we wouldn't have to care
> > about some of the hugetlb complexity I raised (MAP_PRIVATE, PMD sharing),
> > but maybe there is a way ...
> 
> I favour a more evolutionary than revolutionary approach.  That is,
> I think it's acceptable to add new features to hugetlbfs _if_ they're
> combined with cleanup work that gets hugetlbfs closer to the main mm.
> This is why I harp on things like pagewalk that currently need special
> handling for hugetlb -- that's pointless; they should just be treated as
> large folios.  GUP handles hugetlb separately too, and I'm not sure why.

Yes, this echo's my feelings too.

Making all the special core-mm cases around hugetlb even more
complicated with HGM seems like a non-starter.

We need to get to a point where the core-mm handles all the PTE
programming and supports arbitary order folios in the page tables
uniformly for everyone.

hugetlb is just a special high order folio provider.

Get rid of all the special PTE formats, unique arch code, and special
code in gup.c/pagewalkers/etc that supports hugetlbfs.

I think the general path to do that is to make the core-mm and all the
hugetlb supporting arches support a core-code path for working with
high order folios in page tables.

Maybe this is demo'd & tested with a temporary/simplified hugetlbfs
uAPI. When the core MM and all the arches are ready you switch
hugetlbfs to use the new core API and deleted all the page walk
special cases.

From there you can then teach the core code to do all the splitting
and whatever that you want.

Jason


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [LSF/MM/BPF TOPIC] HGM for hugetlbfs
  2023-06-13 14:59                       ` Jason Gunthorpe
@ 2023-06-13 15:15                         ` David Hildenbrand
  2023-06-13 15:45                           ` Peter Xu
  0 siblings, 1 reply; 29+ messages in thread
From: David Hildenbrand @ 2023-06-13 15:15 UTC (permalink / raw)
  To: Jason Gunthorpe, Matthew Wilcox
  Cc: David Rientjes, Mike Kravetz, Yosry Ahmed, James Houghton,
	Naoya Horiguchi, Miaohe Lin, lsf-pc, linux-mm, Peter Xu,
	Michal Hocko, Axel Rasmussen, Jiaqi Yan

On 13.06.23 16:59, Jason Gunthorpe wrote:
> On Thu, Jun 08, 2023 at 09:10:15PM +0100, Matthew Wilcox wrote:
>> On Thu, Jun 08, 2023 at 08:34:10AM +0200, David Hildenbrand wrote:
>>> On 08.06.23 02:02, David Rientjes wrote:
>>>> While people have proposed 1GB THP support in the past, it was nacked, in
>>>> part, because of the suggestion to just use existing 1GB support in
>>>> hugetlb instead :)
>>>
>>> Yes, because I still think that the use for "transparent" (for the user)
>>> nowadays is very limited and not worth the complexity.
>>>
>>> IMHO, what you really want is a pool of large pages that (guarantees about
>>> availability and nodes) and fine control about who gets these pages. That's
>>> what hugetlb provides.
>>>
>>> In contrast to THP, you don't want to allow for
>>> * Partially mmap, mremap, munmap, mprotect them
>>> * Partially sharing then / COW'ing them
>>> * Partially mixing them with other anon pages (MADV_DONTNEED + refault)
>>> * Exclude them from some features KSM/swap
>>> * (swap them out and eventually split them for that)
>>>
>>> Because you don't want to get these pages PTE-mapped by the system *unless*
>>> there is a real reason (HGM, hwpoison) -- you want guarantees. Once such a
>>> page is PTE-mapped, you only want to collapse in place.
>>>
>>> But you don't want special-HGM, you simply want the core to PTE-map them
>>> like a (file) THP.
>>>
>>> IMHO, getting that realized much easier would be if we wouldn't have to care
>>> about some of the hugetlb complexity I raised (MAP_PRIVATE, PMD sharing),
>>> but maybe there is a way ...
>>
>> I favour a more evolutionary than revolutionary approach.  That is,
>> I think it's acceptable to add new features to hugetlbfs _if_ they're
>> combined with cleanup work that gets hugetlbfs closer to the main mm.
>> This is why I harp on things like pagewalk that currently need special
>> handling for hugetlb -- that's pointless; they should just be treated as
>> large folios.  GUP handles hugetlb separately too, and I'm not sure why.
> 
> Yes, this echo's my feelings too.
> 
> Making all the special core-mm cases around hugetlb even more
> complicated with HGM seems like a non-starter.
> 
> We need to get to a point where the core-mm handles all the PTE
> programming and supports arbitary order folios in the page tables
> uniformly for everyone.
> 
> hugetlb is just a special high order folio provider.
> 
> Get rid of all the special PTE formats, unique arch code, and special
> code in gup.c/pagewalkers/etc that supports hugetlbfs.
> 
> I think the general path to do that is to make the core-mm and all the
> hugetlb supporting arches support a core-code path for working with
> high order folios in page tables.
> 
> Maybe this is demo'd & tested with a temporary/simplified hugetlbfs
> uAPI. When the core MM and all the arches are ready you switch
> hugetlbfs to use the new core API and deleted all the page walk
> special cases.
> 
>  From there you can then teach the core code to do all the splitting
> and whatever that you want.

Yes, that's my hope.

As I said, some existing oddities like PMD sharing (VM use-cases don't 
really require that) and MAP_PRIVATE handling (again, VMs don't really 
require that) could make the conversion more problematic ... IMHO

So maybe we should really factor out the core hugetlb pooling logic and 
write a simplified v2 implementation that integrates nicely with the VM 
without all of these oddities.

We can then either port some of these oddities step by step from v1 to 
v2 or replace them by something better (for example: if we really want 
MAP_PRIVATE, then just do it like with any other file and use ordinary 
anon (THP) ).

One day, we can then just switch to v2 and remove v1. If we manage 
without any uABI changes, great.

Doing all the conversion in-place could turn out extremely painful and 
take much longer ... but I might be just taught otherwise.

As you say, hugetlb should just be a special folio provider ...

We can discuss tomorrow.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] HGM for hugetlbfs
  2023-06-09 20:20                       ` James Houghton
@ 2023-06-13 15:17                         ` Jason Gunthorpe
  0 siblings, 0 replies; 29+ messages in thread
From: Jason Gunthorpe @ 2023-06-13 15:17 UTC (permalink / raw)
  To: James Houghton
  Cc: Dan Williams, Mike Kravetz, David Hildenbrand, Miaohe Lin,
	Naoya Horiguchi, Peter Xu, Yosry Ahmed, linux-mm, Michal Hocko,
	Matthew Wilcox, David Rientjes, Axel Rasmussen, lsf-pc,
	Jiaqi Yan, jane.chu

On Fri, Jun 09, 2023 at 01:20:19PM -0700, James Houghton wrote:

> So, we could:
> 1. Do what HGM does and have the kernel unmap the 4K page in the
> userspace page tables.
> 2. On-the-fly change the VMA for our hugepage to not be HugeTLB
> anymore, and re-map all the good 4K pages.
> 3. Tell userspace that it must change its mapping from HugeTLB to
> something else, and move the good 4K pages into the new mapping.
 
> (2) feels like more complexity than (1). If a user created a
> MAP_HUGETLB mapping and now it isn't HugeTLB, that feels wrong.
> 
> (3) today isn't possible, but with Jiaqi's improvement to hugetlbfs
> read() it becomes possible. We'll need to have an extra 1G of memory
> while we are doing this copying/recovery, and it isn't transparent at
> all.

It is transparent to the VM, it just has a longer EPT fault response
time if the VM touches that range.

> (3) is additionally painful when considering live migration. We have
> to keep the 4K page unmapped after the migration (to keep it poisoned
> from the guest's perspective), but the page is no longer *actually*
> poisoned on the host. To get the memory we need to back our
> fake-poisoned pages with tmpfs, we would need to free our 1G page.
> Getting that page back later isn't trivial.

Why does this change with #1?

As David says you can't transparently "fix" the page, so when you
migrate a VM with unavailable pages it must migrate those unavailable
pages too, regardless if the kernel made them unavailable or
userspace did.

So, regardless, you end up with a VM that has holes in its address
map.

I guess if the hole is created from a PTE map of a 1G hugetlbfs it is
easier to "heal" back to a full 1G map, but this healing could also be
done by copying.

It seems to me the main value of the kernel-side approach is that it
eliminates the copies and makes the time the 1G page would be
unavailable to the guest shorter.

> So (1) still seems like the most natural solution, so the question
> becomes: how exactly do we implement 4K unmapping? And that brings us
> back to the main question about how HGM should be implemented in
> general.

IMHO if you can do it in userspace with a copy you can solve your
urgent customer need and then have more time to do the big kernel
rework required to optimize it with kernel support.

Jason


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [LSF/MM/BPF TOPIC] HGM for hugetlbfs
  2023-06-13 15:15                         ` David Hildenbrand
@ 2023-06-13 15:45                           ` Peter Xu
  0 siblings, 0 replies; 29+ messages in thread
From: Peter Xu @ 2023-06-13 15:45 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Jason Gunthorpe, Matthew Wilcox, David Rientjes, Mike Kravetz,
	Yosry Ahmed, James Houghton, Naoya Horiguchi, Miaohe Lin, lsf-pc,
	linux-mm, Michal Hocko, Axel Rasmussen, Jiaqi Yan

On Tue, Jun 13, 2023 at 05:15:36PM +0200, David Hildenbrand wrote:
> Doing all the conversion in-place could turn out extremely painful and take
> much longer ... but I might be just taught otherwise.

IMHO we should start with the attempt of in-place conversion always, and we
should provide good reasonings to justify every single point of a change to
v1: either a design flaw that we must change, or when impossible to convert
without breaking v1.

The "pain" is already there.  IMHO any new v2 proposal should be able to
list out all the "pain"s (it'll be a vain if we "silently" carry over a
pain point, so a detailed summary of the problem along with reasonings to
change, what is the right things to do, can be half way already of the
whole effort..), and (2) justify the "pain"s are gone in the new design.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2023-06-13 15:45 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-06 19:19 [LSF/MM/BPF TOPIC] HGM for hugetlbfs Mike Kravetz
2023-03-14 15:37 ` James Houghton
2023-04-12  1:44   ` David Rientjes
2023-05-24 20:26 ` James Houghton
2023-05-26  3:00   ` David Rientjes
     [not found]     ` <20230602172723.GA3941@monkey>
2023-06-06 22:40       ` David Rientjes
2023-06-07  7:38         ` David Hildenbrand
2023-06-07  7:51           ` Yosry Ahmed
2023-06-07  8:13             ` David Hildenbrand
2023-06-07 22:06               ` Mike Kravetz
2023-06-08  0:02                 ` David Rientjes
2023-06-08  6:34                   ` David Hildenbrand
2023-06-08 18:50                     ` Yang Shi
2023-06-08 21:23                       ` Mike Kravetz
2023-06-09  1:57                         ` Zi Yan
2023-06-09 15:17                           ` Pasha Tatashin
2023-06-09 19:04                             ` Ankur Arora
2023-06-09 19:57                           ` Matthew Wilcox
2023-06-08 20:10                     ` Matthew Wilcox
2023-06-09  2:59                       ` David Rientjes
2023-06-13 14:59                       ` Jason Gunthorpe
2023-06-13 15:15                         ` David Hildenbrand
2023-06-13 15:45                           ` Peter Xu
2023-06-08 21:54                 ` [Lsf-pc] " Dan Williams
2023-06-08 22:35                   ` Mike Kravetz
2023-06-09  3:36                     ` Dan Williams
2023-06-09 20:20                       ` James Houghton
2023-06-13 15:17                         ` Jason Gunthorpe
2023-06-07 14:40           ` Matthew Wilcox

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.