linux-kernel.vger.kernel.org archive mirror
* [PATCH] mm: hugetlb: support get/set_policy for hugetlb_vm_ops
@ 2022-10-12  8:15 Albert Huang
  2022-10-12 19:45 ` Hugh Dickins
  2022-10-17  8:44 ` David Hildenbrand
  0 siblings, 2 replies; 17+ messages in thread
From: Albert Huang @ 2022-10-12  8:15 UTC (permalink / raw)
  To: songmuchun
  Cc: huangjie.albert, Mike Kravetz, Andrew Morton, linux-mm, linux-kernel

From: "huangjie.albert" <huangjie.albert@bytedance.com>

Implement these two functions so that a mempolicy can be attached to
the inode of the hugetlb file. This ensures that the mempolicy of all
processes sharing this huge page file is consistent.

Example scenario where huge pages are shared:
to keep a VM's memory usage within node0, qemu's mempolicy is bound to
node0; but if another process (such as virtiofsd) shares memory with
the VM and the page fault is triggered by virtiofsd, the allocated
memory may instead go to node1, depending on virtiofsd's own policy.

Signed-off-by: huangjie.albert <huangjie.albert@bytedance.com>
---
 mm/hugetlb.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0ad53ad98e74..ed7599821655 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4678,6 +4678,24 @@ static vm_fault_t hugetlb_vm_op_fault(struct vm_fault *vmf)
 	return 0;
 }
 
+#ifdef CONFIG_NUMA
+int hugetlb_vm_op_set_policy(struct vm_area_struct *vma, struct mempolicy *mpol)
+{
+	struct inode *inode = file_inode(vma->vm_file);
+
+	return mpol_set_shared_policy(&HUGETLBFS_I(inode)->policy, vma, mpol);
+}
+
+struct mempolicy *hugetlb_vm_op_get_policy(struct vm_area_struct *vma, unsigned long addr)
+{
+	struct inode *inode = file_inode(vma->vm_file);
+	pgoff_t index;
+
+	index = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
+	return mpol_shared_policy_lookup(&HUGETLBFS_I(inode)->policy, index);
+}
+#endif
+
 /*
  * When a new function is introduced to vm_operations_struct and added
  * to hugetlb_vm_ops, please consider adding the function to shm_vm_ops.
@@ -4691,6 +4709,10 @@ const struct vm_operations_struct hugetlb_vm_ops = {
 	.close = hugetlb_vm_op_close,
 	.may_split = hugetlb_vm_op_split,
 	.pagesize = hugetlb_vm_op_pagesize,
+#ifdef CONFIG_NUMA
+	.set_policy = hugetlb_vm_op_set_policy,
+	.get_policy = hugetlb_vm_op_get_policy,
+#endif
 };
 
 static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm: hugetlb: support get/set_policy for hugetlb_vm_ops
  2022-10-12  8:15 [PATCH] mm: hugetlb: support get/set_policy for hugetlb_vm_ops Albert Huang
@ 2022-10-12 19:45 ` Hugh Dickins
  2022-10-14 16:56   ` Mike Kravetz
  2022-10-19  9:33   ` [PATCH] mm: hugetlb: support get/set_policy for hugetlb_vm_ops 黄杰
  2022-10-17  8:44 ` David Hildenbrand
  1 sibling, 2 replies; 17+ messages in thread
From: Hugh Dickins @ 2022-10-12 19:45 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Albert Huang, Muchun Song, Andi Kleen, Andrew Morton, linux-mm,
	linux-kernel

On Wed, 12 Oct 2022, Albert Huang wrote:

> From: "huangjie.albert" <huangjie.albert@bytedance.com>
> 
> implement these two functions so that we can set the mempolicy to
> the inode of the hugetlb file. This ensures that the mempolicy of
> all processes sharing this huge page file is consistent.
> 
> In some scenarios where huge pages are shared:
> if we need to limit the memory usage of vm within node0, so I set qemu's
> mempilciy bind to node0, but if there is a process (such as virtiofsd)
> shared memory with the vm, in this case. If the page fault is triggered
> by virtiofsd, the allocated memory may go to node1 which  depends on
> virtiofsd.
> 
> Signed-off-by: huangjie.albert <huangjie.albert@bytedance.com>

Aha!  Congratulations for noticing, after all this time.  hugetlbfs
contains various little pieces of code that pretend to be supporting
shared NUMA mempolicy, but in fact there was nothing connecting it up.

It will be for Mike to decide, but personally I oppose adding
shared NUMA mempolicy support to hugetlbfs, after eighteen years.

The thing is, it will change the behaviour of NUMA on hugetlbfs:
in ways that would have been sensible way back then, yes; but surely
those who have invested in NUMA and hugetlbfs have developed other
ways of administering it successfully, without shared NUMA mempolicy.

At the least, I would expect some tests to break (I could easily be
wrong), and there's a chance that some app or tool would break too.

I have carried the reverse of Albert's patch for a long time, stripping
out the pretence of shared NUMA mempolicy support from hugetlbfs: I
wanted that, so that I could work on modifying the tmpfs implementation,
without having to worry about other users.

Mike, if you would prefer to see my patch stripping out the pretence,
let us know: it has never been a priority to send in, but I can update
it to 6.1-rc1 if you'd like to see it.  (Once upon a time, it removed
all need for struct hugetlbfs_inode_info, but nowadays that's still
required for the memfd seals.)

Whether Albert's patch is complete and correct, I haven't begun to think
about: I am not saying it isn't, but shared NUMA mempolicy adds another
dimension of complexity, and need for support, that I think hugetlbfs
would be better off continuing to survive without.

Hugh

> ---
>  mm/hugetlb.c | 22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 0ad53ad98e74..ed7599821655 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4678,6 +4678,24 @@ static vm_fault_t hugetlb_vm_op_fault(struct vm_fault *vmf)
>  	return 0;
>  }
>  
> +#ifdef CONFIG_NUMA
> +int hugetlb_vm_op_set_policy(struct vm_area_struct *vma, struct mempolicy *mpol)
> +{
> +	struct inode *inode = file_inode(vma->vm_file);
> +
> +	return mpol_set_shared_policy(&HUGETLBFS_I(inode)->policy, vma, mpol);
> +}
> +
> +struct mempolicy *hugetlb_vm_op_get_policy(struct vm_area_struct *vma, unsigned long addr)
> +{
> +	struct inode *inode = file_inode(vma->vm_file);
> +	pgoff_t index;
> +
> +	index = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> +	return mpol_shared_policy_lookup(&HUGETLBFS_I(inode)->policy, index);
> +}
> +#endif
> +
>  /*
>   * When a new function is introduced to vm_operations_struct and added
>   * to hugetlb_vm_ops, please consider adding the function to shm_vm_ops.
> @@ -4691,6 +4709,10 @@ const struct vm_operations_struct hugetlb_vm_ops = {
>  	.close = hugetlb_vm_op_close,
>  	.may_split = hugetlb_vm_op_split,
>  	.pagesize = hugetlb_vm_op_pagesize,
> +#ifdef CONFIG_NUMA
> +	.set_policy = hugetlb_vm_op_set_policy,
> +	.get_policy = hugetlb_vm_op_get_policy,
> +#endif
>  };
>  
>  static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
> -- 
> 2.31.1

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm: hugetlb: support get/set_policy for hugetlb_vm_ops
  2022-10-12 19:45 ` Hugh Dickins
@ 2022-10-14 16:56   ` Mike Kravetz
  2022-10-17  3:35     ` [External] " 黄杰
  2022-10-19  9:29     ` [PATCH v2] mm: hugetlb: support for shared memory policy Albert Huang
  2022-10-19  9:33   ` [PATCH] mm: hugetlb: support get/set_policy for hugetlb_vm_ops 黄杰
  1 sibling, 2 replies; 17+ messages in thread
From: Mike Kravetz @ 2022-10-14 16:56 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Albert Huang, Muchun Song, Andi Kleen, Andrew Morton, linux-mm,
	linux-kernel

On 10/12/22 12:45, Hugh Dickins wrote:
> On Wed, 12 Oct 2022, Albert Huang wrote:
> 
> > From: "huangjie.albert" <huangjie.albert@bytedance.com>
> > 
> > implement these two functions so that we can set the mempolicy to
> > the inode of the hugetlb file. This ensures that the mempolicy of
> > all processes sharing this huge page file is consistent.
> > 
> > In some scenarios where huge pages are shared:
> > if we need to limit the memory usage of vm within node0, so I set qemu's
> > mempilciy bind to node0, but if there is a process (such as virtiofsd)
> > shared memory with the vm, in this case. If the page fault is triggered
> > by virtiofsd, the allocated memory may go to node1 which  depends on
> > virtiofsd.
> > 
> > Signed-off-by: huangjie.albert <huangjie.albert@bytedance.com>

Thanks for the patch Albert, and thank you Hugh for the comments!

> Aha!  Congratulations for noticing, after all this time.  hugetlbfs
> contains various little pieces of code that pretend to be supporting
> shared NUMA mempolicy, but in fact there was nothing connecting it up.

I actually had to look this up to verify it was not supported.  However, the
documentation is fairly clear.
From admin-guide/mm/numa_memory_policy.rst.

"As of 2.6.22, only shared memory segments, created by shmget() or
 mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy.  When shared
 policy support was added to Linux, the associated data structures were
 added to hugetlbfs shmem segments.  At the time, hugetlbfs did not
 support allocation at fault time--a.k.a lazy allocation--so hugetlbfs
 shmem segments were never "hooked up" to the shared policy support.
 Although hugetlbfs segments now support lazy allocation, their support
 for shared policy has not been completed."

It is somewhat embarrassing that this has been known for so long and
nothing has changed.
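
For reference, what the supported shared-policy case looks like from
user space, as a minimal illustrative sketch (node numbers are made up,
error handling is omitted, link with -lnuma); the policy set by one
attacher is stored on the shared object, so any task that later faults
pages of the segment in obeys it:

#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <numaif.h>

int main(void)
{
        size_t len = 8UL << 20;                 /* 8 MiB segment */
        int id = shmget(IPC_PRIVATE, len, IPC_CREAT | 0600);
        char *p = shmat(id, NULL, 0);
        unsigned long nodemask = 1UL << 0;      /* node 0 only */

        /* The policy is recorded on the shared object itself, so any
         * other process that attaches this segment and faults pages in
         * will also allocate from node 0. */
        if (mbind(p, len, MPOL_BIND, &nodemask, 8 * sizeof(nodemask), 0))
                perror("mbind");

        memset(p, 0, len);              /* fault pages in under the policy */
        return 0;
}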

> It will be for Mike to decide, but personally I oppose adding
> shared NUMA mempolicy support to hugetlbfs, after eighteen years.
> 
> The thing is, it will change the behaviour of NUMA on hugetlbfs:
> in ways that would have been sensible way back then, yes; but surely
> those who have invested in NUMA and hugetlbfs have developed other
> ways of administering it successfully, without shared NUMA mempolicy.
> 
> At the least, I would expect some tests to break (I could easily be
> wrong), and there's a chance that some app or tool would break too.
> 
> I have carried the reverse of Albert's patch for a long time, stripping
> out the pretence of shared NUMA mempolicy support from hugetlbfs: I
> wanted that, so that I could work on modifying the tmpfs implementation,
> without having to worry about other users.
> 
> Mike, if you would prefer to see my patch stripping out the pretence,
> let us know: it has never been a priority to send in, but I can update
> it to 6.1-rc1 if you'd like to see it.  (Once upon a time, it removed
> all need for struct hugetlbfs_inode_info, but nowadays that's still
> required for the memfd seals.)
> 
> Whether Albert's patch is complete and correct, I haven't begun to think
> about: I am not saying it isn't, but shared NUMA mempolicy adds another
> dimension of complexity, and need for support, that I think hugetlbfs
> would be better off continuing to survive without.

To be honest, I have not looked into the complexities of shared NUMA
mempolicy and exactly what is required for its support.  With my limited
knowledge, it appears that this patch adds some type of support for shared
policy, but it may not provide all support mentioned in the documentation.

At the very least, this patch should also update documentation to state
what type of support is provided.

Albert, can you look into what would be required for full support?  I can take
a look as well but have some other higher-priority tasks to work on first.

TBH, I like Hugh's idea of removing the 'pretence of shared policy support'.
We are currently wasting memory carrying around extra unused fields in
hugetlbfs_inode_info. :(
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [External] Re: [PATCH] mm: hugetlb: support get/set_policy for hugetlb_vm_ops
  2022-10-14 16:56   ` Mike Kravetz
@ 2022-10-17  3:35     ` 黄杰
  2022-10-19  9:29     ` [PATCH v2] mm: hugetlb: support for shared memory policy Albert Huang
  1 sibling, 0 replies; 17+ messages in thread
From: 黄杰 @ 2022-10-17  3:35 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Hugh Dickins, Muchun Song, Andi Kleen, Andrew Morton, linux-mm,
	linux-kernel

Mike Kravetz <mike.kravetz@oracle.com> wrote on Sat, Oct 15, 2022 at 00:56:
>
> On 10/12/22 12:45, Hugh Dickins wrote:
> > On Wed, 12 Oct 2022, Albert Huang wrote:
> >
> > > From: "huangjie.albert" <huangjie.albert@bytedance.com>
> > >
> > > implement these two functions so that we can set the mempolicy to
> > > the inode of the hugetlb file. This ensures that the mempolicy of
> > > all processes sharing this huge page file is consistent.
> > >
> > > In some scenarios where huge pages are shared:
> > > if we need to limit the memory usage of vm within node0, so I set qemu's
> > > mempilciy bind to node0, but if there is a process (such as virtiofsd)
> > > shared memory with the vm, in this case. If the page fault is triggered
> > > by virtiofsd, the allocated memory may go to node1 which  depends on
> > > virtiofsd.
> > >
> > > Signed-off-by: huangjie.albert <huangjie.albert@bytedance.com>
>
> Thanks for the patch Albert, and thank you Hugh for the comments!
>
> > Aha!  Congratulations for noticing, after all this time.  hugetlbfs
> > contains various little pieces of code that pretend to be supporting
> > shared NUMA mempolicy, but in fact there was nothing connecting it up.
>
> I actually had to look this up to verify it was not supported.  However, the
> documentation is fairly clear.
> From admin-guide/mm/numa_memory_policy.rst.
>
> "As of 2.6.22, only shared memory segments, created by shmget() or
>  mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy.  When shared
>  policy support was added to Linux, the associated data structures were
>  added to hugetlbfs shmem segments.  At the time, hugetlbfs did not
>  support allocation at fault time--a.k.a lazy allocation--so hugetlbfs
>  shmem segments were never "hooked up" to the shared policy support.
>  Although hugetlbfs segments now support lazy allocation, their support
>  for shared policy has not been completed."
>
> It is somewhat embarrassing that this has been known for so long and
> nothing has changed.
>
> > It will be for Mike to decide, but personally I oppose adding
> > shared NUMA mempolicy support to hugetlbfs, after eighteen years.
> >
> > The thing is, it will change the behaviour of NUMA on hugetlbfs:
> > in ways that would have been sensible way back then, yes; but surely
> > those who have invested in NUMA and hugetlbfs have developed other
> > ways of administering it successfully, without shared NUMA mempolicy.
> >
> > At the least, I would expect some tests to break (I could easily be
> > wrong), and there's a chance that some app or tool would break too.
> >
> > I have carried the reverse of Albert's patch for a long time, stripping
> > out the pretence of shared NUMA mempolicy support from hugetlbfs: I
> > wanted that, so that I could work on modifying the tmpfs implementation,
> > without having to worry about other users.
> >
> > Mike, if you would prefer to see my patch stripping out the pretence,
> > let us know: it has never been a priority to send in, but I can update
> > it to 6.1-rc1 if you'd like to see it.  (Once upon a time, it removed
> > all need for struct hugetlbfs_inode_info, but nowadays that's still
> > required for the memfd seals.)
> >
> > Whether Albert's patch is complete and correct, I haven't begun to think
> > about: I am not saying it isn't, but shared NUMA mempolicy adds another
> > dimension of complexity, and need for support, that I think hugetlbfs
> > would be better off continuing to survive without.
>
> To be honest, I have not looked into the complexities of shared NUMA
> mempolicy and exactly what is required for it's support.  With my limited
> knowledge, it appears that this patch adds some type of support for shared
> policy, but it may not provide all support mentioned in the documentation.
>
> At the very least, this patch should also update documentation to state
> what type of support is provided.
>
> Albert, can you look into what would be required for full support?  I can take
> a look as well but have some other higher priority tasks to work first.
>

I'd be glad to take this on; let me think about it.

> TBH, I like Hugh's idea of removing the 'pretence of shared policy support'.
> We are currently wasting memory carrying around extra unused fields in
> hugetlbfs_inode_info. :(
> --
> Mike Kravetz

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm: hugetlb: support get/set_policy for hugetlb_vm_ops
  2022-10-12  8:15 [PATCH] mm: hugetlb: support get/set_policy for hugetlb_vm_ops Albert Huang
  2022-10-12 19:45 ` Hugh Dickins
@ 2022-10-17  8:44 ` David Hildenbrand
  2022-10-17  9:48   ` [External] " 黄杰
  1 sibling, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2022-10-17  8:44 UTC (permalink / raw)
  To: Albert Huang, songmuchun
  Cc: Mike Kravetz, Andrew Morton, linux-mm, linux-kernel

On 12.10.22 10:15, Albert Huang wrote:
> From: "huangjie.albert" <huangjie.albert@bytedance.com>
> 
> implement these two functions so that we can set the mempolicy to
> the inode of the hugetlb file. This ensures that the mempolicy of
> all processes sharing this huge page file is consistent.
> 
> In some scenarios where huge pages are shared:
> if we need to limit the memory usage of vm within node0, so I set qemu's
> mempilciy bind to node0, but if there is a process (such as virtiofsd)
> shared memory with the vm, in this case. If the page fault is triggered
> by virtiofsd, the allocated memory may go to node1 which  depends on
> virtiofsd.
> 

Any VM that uses hugetlb should be preallocating memory. For example, 
this is the expected default under QEMU when using huge pages.

Once preallocation does the right thing regarding NUMA policy, there is 
no need to worry about it in other sub-processes.
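
As an illustrative sketch only (the path, size and node are made up,
error handling is omitted, and MADV_POPULATE_WRITE needs Linux 5.14+),
preallocating under the caller's mempolicy amounts to something like:

#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <numaif.h>

/* Map a hugetlbfs-backed file and fault every page in up front, so the
 * pages are allocated under *this* process's mempolicy (bound to node 0
 * here) rather than under whichever process touches them first. */
int preallocate_guest_memory(void)
{
        size_t len = 1UL << 30;                 /* 1 GiB, hypothetical */
        unsigned long nodemask = 1UL << 0;      /* node 0 */
        int fd = open("/dev/hugepages/guest-mem", O_CREAT | O_RDWR, 0600);
        void *p;

        ftruncate(fd, len);
        p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        mbind(p, len, MPOL_BIND, &nodemask, 8 * sizeof(nodemask), 0);

        /* MADV_POPULATE_WRITE populates the whole range right now. */
        return madvise(p, len, MADV_POPULATE_WRITE);
}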

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [External] Re: [PATCH] mm: hugetlb: support get/set_policy for hugetlb_vm_ops
  2022-10-17  8:44 ` David Hildenbrand
@ 2022-10-17  9:48   ` 黄杰
  2022-10-17 11:33     ` David Hildenbrand
  0 siblings, 1 reply; 17+ messages in thread
From: 黄杰 @ 2022-10-17  9:48 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: songmuchun, Mike Kravetz, Andrew Morton, linux-mm, linux-kernel

David Hildenbrand <david@redhat.com> wrote on Mon, Oct 17, 2022 at 16:44:
>
> On 12.10.22 10:15, Albert Huang wrote:
> > From: "huangjie.albert" <huangjie.albert@bytedance.com>
> >
> > implement these two functions so that we can set the mempolicy to
> > the inode of the hugetlb file. This ensures that the mempolicy of
> > all processes sharing this huge page file is consistent.
> >
> > In some scenarios where huge pages are shared:
> > if we need to limit the memory usage of vm within node0, so I set qemu's
> > mempilciy bind to node0, but if there is a process (such as virtiofsd)
> > shared memory with the vm, in this case. If the page fault is triggered
> > by virtiofsd, the allocated memory may go to node1 which  depends on
> > virtiofsd.
> >
>
> Any VM that uses hugetlb should be preallocating memory. For example,
> this is the expected default under QEMU when using huge pages.
>
> Once preallocation does the right thing regarding NUMA policy, there is
> no need to worry about it in other sub-processes.
>

Hi David,

Thanks for the reminder.

Yes, you are absolutely right: the pre-allocation mechanism does solve
this problem. However, some scenarios would rather not use
pre-allocation, for example those that are sensitive to virtual machine
startup time or that need high memory utilization. The on-demand
allocation mechanism may be better there, so the key point is to find a
way to support a shared policy.

Thanks,

Albert

> --
> Thanks,
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [External] Re: [PATCH] mm: hugetlb: support get/set_policy for hugetlb_vm_ops
  2022-10-17  9:48   ` [External] " 黄杰
@ 2022-10-17 11:33     ` David Hildenbrand
  2022-10-17 11:46       ` 黄杰
  2022-10-17 17:59       ` [External] " Mike Kravetz
  0 siblings, 2 replies; 17+ messages in thread
From: David Hildenbrand @ 2022-10-17 11:33 UTC (permalink / raw)
  To: 黄杰
  Cc: songmuchun, Mike Kravetz, Andrew Morton, linux-mm, linux-kernel

On 17.10.22 11:48, 黄杰 wrote:
> David Hildenbrand <david@redhat.com> 于2022年10月17日周一 16:44写道:
>>
>> On 12.10.22 10:15, Albert Huang wrote:
>>> From: "huangjie.albert" <huangjie.albert@bytedance.com>
>>>
>>> implement these two functions so that we can set the mempolicy to
>>> the inode of the hugetlb file. This ensures that the mempolicy of
>>> all processes sharing this huge page file is consistent.
>>>
>>> In some scenarios where huge pages are shared:
>>> if we need to limit the memory usage of vm within node0, so I set qemu's
>>> mempilciy bind to node0, but if there is a process (such as virtiofsd)
>>> shared memory with the vm, in this case. If the page fault is triggered
>>> by virtiofsd, the allocated memory may go to node1 which  depends on
>>> virtiofsd.
>>>
>>
>> Any VM that uses hugetlb should be preallocating memory. For example,
>> this is the expected default under QEMU when using huge pages.
>>
>> Once preallocation does the right thing regarding NUMA policy, there is
>> no need to worry about it in other sub-processes.
>>
> 
> Hi, David
> thanks for your reminder
> 
> Yes, you are absolutely right, However, the pre-allocation mechanism
> does solve this problem.
> However, some scenarios do not like to use the pre-allocation mechanism, such as
> scenarios that are sensitive to virtual machine startup time, or
> scenarios that require
> high memory utilization. The on-demand allocation mechanism may be better,
> so the key point is to find a way support for shared policy。

Using hugetlb -- with a fixed pool size -- without preallocation is like 
playing with fire. Hugetlb reservation makes one believe that on-demand 
allocation is going to work, but there are various scenarios where that 
can go seriously wrong, and you can run out of huge pages.

If you're using hugetlb as memory backend for a VM without 
preallocation, you really have to be very careful. I can only advise 
against doing that.


Also: why does another process read/write *first* to a guest physical 
memory location before the OS running inside the VM even initialized 
that memory? That sounds very wrong. What am I missing?

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [External] Re: [PATCH] mm: hugetlb: support get/set_policy for hugetlb_vm_ops
  2022-10-17 11:33     ` David Hildenbrand
@ 2022-10-17 11:46       ` 黄杰
  2022-10-17 12:00         ` David Hildenbrand
  2022-10-17 17:59       ` [External] " Mike Kravetz
  1 sibling, 1 reply; 17+ messages in thread
From: 黄杰 @ 2022-10-17 11:46 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: songmuchun, Mike Kravetz, Andrew Morton, linux-mm, linux-kernel

David Hildenbrand <david@redhat.com> wrote on Mon, Oct 17, 2022 at 19:33:
>
> On 17.10.22 11:48, 黄杰 wrote:
> > David Hildenbrand <david@redhat.com> 于2022年10月17日周一 16:44写道:
> >>
> >> On 12.10.22 10:15, Albert Huang wrote:
> >>> From: "huangjie.albert" <huangjie.albert@bytedance.com>
> >>>
> >>> implement these two functions so that we can set the mempolicy to
> >>> the inode of the hugetlb file. This ensures that the mempolicy of
> >>> all processes sharing this huge page file is consistent.
> >>>
> >>> In some scenarios where huge pages are shared:
> >>> if we need to limit the memory usage of vm within node0, so I set qemu's
> >>> mempilciy bind to node0, but if there is a process (such as virtiofsd)
> >>> shared memory with the vm, in this case. If the page fault is triggered
> >>> by virtiofsd, the allocated memory may go to node1 which  depends on
> >>> virtiofsd.
> >>>
> >>
> >> Any VM that uses hugetlb should be preallocating memory. For example,
> >> this is the expected default under QEMU when using huge pages.
> >>
> >> Once preallocation does the right thing regarding NUMA policy, there is
> >> no need to worry about it in other sub-processes.
> >>
> >
> > Hi, David
> > thanks for your reminder
> >
> > Yes, you are absolutely right, However, the pre-allocation mechanism
> > does solve this problem.
> > However, some scenarios do not like to use the pre-allocation mechanism, such as
> > scenarios that are sensitive to virtual machine startup time, or
> > scenarios that require
> > high memory utilization. The on-demand allocation mechanism may be better,
> > so the key point is to find a way support for shared policy。
>
> Using hugetlb -- with a fixed pool size -- without preallocation is like
> playing with fire. Hugetlb reservation makes one believe that on-demand
> allocation is going to work, but there are various scenarios where that
> can go seriously wrong, and you can run out of huge pages.
>
> If you're using hugetlb as memory backend for a VM without
> preallocation, you really have to be very careful. I can only advise
> against doing that.
>
>
> Also: why does another process read/write *first* to a guest physical
> memory location before the OS running inside the VM even initialized
> that memory? That sounds very wrong. What am I missing?
>

For example: the virtio ring buffer. For an avail descriptor, the guest
kernel only hands an address to the backend and does not actually touch
that memory itself first.

Thanks.

> --
> Thanks,
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [External] Re: [PATCH] mm: hugetlb: support get/set_policy for hugetlb_vm_ops
  2022-10-17 11:46       ` 黄杰
@ 2022-10-17 12:00         ` David Hildenbrand
  2022-10-18  9:27           ` 黄杰
  0 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2022-10-17 12:00 UTC (permalink / raw)
  To: 黄杰
  Cc: songmuchun, Mike Kravetz, Andrew Morton, linux-mm, linux-kernel

On 17.10.22 13:46, 黄杰 wrote:
> David Hildenbrand <david@redhat.com> 于2022年10月17日周一 19:33写道:
>>
>> On 17.10.22 11:48, 黄杰 wrote:
>>> David Hildenbrand <david@redhat.com> 于2022年10月17日周一 16:44写道:
>>>>
>>>> On 12.10.22 10:15, Albert Huang wrote:
>>>>> From: "huangjie.albert" <huangjie.albert@bytedance.com>
>>>>>
>>>>> implement these two functions so that we can set the mempolicy to
>>>>> the inode of the hugetlb file. This ensures that the mempolicy of
>>>>> all processes sharing this huge page file is consistent.
>>>>>
>>>>> In some scenarios where huge pages are shared:
>>>>> if we need to limit the memory usage of vm within node0, so I set qemu's
>>>>> mempilciy bind to node0, but if there is a process (such as virtiofsd)
>>>>> shared memory with the vm, in this case. If the page fault is triggered
>>>>> by virtiofsd, the allocated memory may go to node1 which  depends on
>>>>> virtiofsd.
>>>>>
>>>>
>>>> Any VM that uses hugetlb should be preallocating memory. For example,
>>>> this is the expected default under QEMU when using huge pages.
>>>>
>>>> Once preallocation does the right thing regarding NUMA policy, there is
>>>> no need to worry about it in other sub-processes.
>>>>
>>>
>>> Hi, David
>>> thanks for your reminder
>>>
>>> Yes, you are absolutely right, However, the pre-allocation mechanism
>>> does solve this problem.
>>> However, some scenarios do not like to use the pre-allocation mechanism, such as
>>> scenarios that are sensitive to virtual machine startup time, or
>>> scenarios that require
>>> high memory utilization. The on-demand allocation mechanism may be better,
>>> so the key point is to find a way support for shared policy。
>>
>> Using hugetlb -- with a fixed pool size -- without preallocation is like
>> playing with fire. Hugetlb reservation makes one believe that on-demand
>> allocation is going to work, but there are various scenarios where that
>> can go seriously wrong, and you can run out of huge pages.
>>
>> If you're using hugetlb as memory backend for a VM without
>> preallocation, you really have to be very careful. I can only advise
>> against doing that.
>>
>>
>> Also: why does another process read/write *first* to a guest physical
>> memory location before the OS running inside the VM even initialized
>> that memory? That sounds very wrong. What am I missing?
>>
> 
> for example : virtio ring buffer.
> For the avial descriptor, the guest kernel only gives an address to
> the backend,
> and does not actually access the memory.

Okay, thanks. So we're essentially providing uninitialized memory to a 
device? Hm, that implies that the device might have access to memory 
that was previously used by someone else ... not sure how to feel about 
that, but maybe this is just the way of doing things.

The "easy" user-space fix would be to simply mbind() similarly in the
other processes where we mmap(). Has that option been explored?
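
Roughly, as an illustrative sketch (fd and len would come from the
existing setup, node 0 is assumed):

#include <stddef.h>
#include <sys/mman.h>
#include <numaif.h>

/* Per-process fix: every process that mmap()s the shared hugetlbfs file
 * applies the same binding before touching any page. */
static void *map_shared_mem_bound(int fd, size_t len)
{
        unsigned long nodemask = 1UL << 0;      /* same node as the VM */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        if (p != MAP_FAILED)
                mbind(p, len, MPOL_BIND, &nodemask, 8 * sizeof(nodemask), 0);
        return p;
}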

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [External] Re: [PATCH] mm: hugetlb: support get/set_policy for hugetlb_vm_ops
  2022-10-17 11:33     ` David Hildenbrand
  2022-10-17 11:46       ` 黄杰
@ 2022-10-17 17:59       ` Mike Kravetz
  2022-10-18  9:24         ` 黄杰
  1 sibling, 1 reply; 17+ messages in thread
From: Mike Kravetz @ 2022-10-17 17:59 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: 黄杰, songmuchun, Andrew Morton, linux-mm, linux-kernel

On 10/17/22 13:33, David Hildenbrand wrote:
> On 17.10.22 11:48, 黄杰 wrote:
> > David Hildenbrand <david@redhat.com> 于2022年10月17日周一 16:44写道:
> > > 
> > > On 12.10.22 10:15, Albert Huang wrote:
> > > > From: "huangjie.albert" <huangjie.albert@bytedance.com>
> > > > 
> > > > implement these two functions so that we can set the mempolicy to
> > > > the inode of the hugetlb file. This ensures that the mempolicy of
> > > > all processes sharing this huge page file is consistent.
> > > > 
> > > > In some scenarios where huge pages are shared:
> > > > if we need to limit the memory usage of vm within node0, so I set qemu's
> > > > mempilciy bind to node0, but if there is a process (such as virtiofsd)
> > > > shared memory with the vm, in this case. If the page fault is triggered
> > > > by virtiofsd, the allocated memory may go to node1 which  depends on
> > > > virtiofsd.
> > > > 
> > > 
> > > Any VM that uses hugetlb should be preallocating memory. For example,
> > > this is the expected default under QEMU when using huge pages.
> > > 
> > > Once preallocation does the right thing regarding NUMA policy, there is
> > > no need to worry about it in other sub-processes.
> > > 
> > 
> > Hi, David
> > thanks for your reminder
> > 
> > Yes, you are absolutely right, However, the pre-allocation mechanism
> > does solve this problem.
> > However, some scenarios do not like to use the pre-allocation mechanism, such as
> > scenarios that are sensitive to virtual machine startup time, or
> > scenarios that require
> > high memory utilization. The on-demand allocation mechanism may be better,
> > so the key point is to find a way support for shared policy。
> 
> Using hugetlb -- with a fixed pool size -- without preallocation is like
> playing with fire. Hugetlb reservation makes one believe that on-demand
> allocation is going to work, but there are various scenarios where that can
> go seriously wrong, and you can run out of huge pages.

I absolutely agree with this cautionary note.

hugetlb reservations guarantee that a sufficient number of huge pages exist.
However, there is no guarantee that those pages are on any specific node
associated with a numa policy.  Therefore, an 'on demand' allocation could
fail, resulting in SIGBUS being sent to the faulting process.
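
As an illustrative sketch (assumes 2 MB huge pages and node 0; the
sysfs counter below is per node, unlike the global reservation):

#include <stdio.h>

/* How many free huge pages does node 0 have right now?  A reservation
 * against the global pool says nothing about this number. */
static long node0_free_hugepages(void)
{
        long n = -1;
        FILE *f = fopen("/sys/devices/system/node/node0/hugepages/"
                        "hugepages-2048kB/free_hugepages", "r");

        if (f) {
                if (fscanf(f, "%ld", &n) != 1)
                        n = -1;
                fclose(f);
        }
        return n;
}
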
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm: hugetlb: support get/set_policy for hugetlb_vm_ops
  2022-10-17 17:59       ` [External] " Mike Kravetz
@ 2022-10-18  9:24         ` 黄杰
  0 siblings, 0 replies; 17+ messages in thread
From: 黄杰 @ 2022-10-18  9:24 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: David Hildenbrand, Muchun Song, Andrew Morton, linux-mm, linux-kernel

Mike Kravetz <mike.kravetz@oracle.com> wrote on Tue, Oct 18, 2022 at 01:59:
>
> On 10/17/22 13:33, David Hildenbrand wrote:
> > On 17.10.22 11:48, 黄杰 wrote:
> > > David Hildenbrand <david@redhat.com> 于2022年10月17日周一 16:44写道:
> > > >
> > > > On 12.10.22 10:15, Albert Huang wrote:
> > > > > From: "huangjie.albert" <huangjie.albert@bytedance.com>
> > > > >
> > > > > implement these two functions so that we can set the mempolicy to
> > > > > the inode of the hugetlb file. This ensures that the mempolicy of
> > > > > all processes sharing this huge page file is consistent.
> > > > >
> > > > > In some scenarios where huge pages are shared:
> > > > > if we need to limit the memory usage of vm within node0, so I set qemu's
> > > > > mempilciy bind to node0, but if there is a process (such as virtiofsd)
> > > > > shared memory with the vm, in this case. If the page fault is triggered
> > > > > by virtiofsd, the allocated memory may go to node1 which  depends on
> > > > > virtiofsd.
> > > > >
> > > >
> > > > Any VM that uses hugetlb should be preallocating memory. For example,
> > > > this is the expected default under QEMU when using huge pages.
> > > >
> > > > Once preallocation does the right thing regarding NUMA policy, there is
> > > > no need to worry about it in other sub-processes.
> > > >
> > >
> > > Hi, David
> > > thanks for your reminder
> > >
> > > Yes, you are absolutely right, However, the pre-allocation mechanism
> > > does solve this problem.
> > > However, some scenarios do not like to use the pre-allocation mechanism, such as
> > > scenarios that are sensitive to virtual machine startup time, or
> > > scenarios that require
> > > high memory utilization. The on-demand allocation mechanism may be better,
> > > so the key point is to find a way support for shared policy。
> >
> > Using hugetlb -- with a fixed pool size -- without preallocation is like
> > playing with fire. Hugetlb reservation makes one believe that on-demand
> > allocation is going to work, but there are various scenarios where that can
> > go seriously wrong, and you can run out of huge pages.
>
> I absolutely agree with this cautionary note.
>
> hugetlb reservations guarantee that a sufficient number of huge pages exist.
> However, there is no guarantee that those pages are on any specific node
> associated with a numa policy.  Therefore, an 'on demand' allocation could
> fail resulting in SIGBUS being set to the faulting process.
> -

Yes, supporting this for on-demand allocation requires adding quite a
bit of extra code. I have thought about it, but I don't currently have
anything that is suitable for submitting to the community.


> Mike Kravetz

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm: hugetlb: support get/set_policy for hugetlb_vm_ops
  2022-10-17 12:00         ` David Hildenbrand
@ 2022-10-18  9:27           ` 黄杰
  2022-10-18  9:35             ` David Hildenbrand
  0 siblings, 1 reply; 17+ messages in thread
From: 黄杰 @ 2022-10-18  9:27 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Muchun Song, Mike Kravetz, Andrew Morton, linux-mm, linux-kernel

David Hildenbrand <david@redhat.com> wrote on Mon, Oct 17, 2022 at 20:00:
>
> On 17.10.22 13:46, 黄杰 wrote:
> > David Hildenbrand <david@redhat.com> 于2022年10月17日周一 19:33写道:
> >>
> >> On 17.10.22 11:48, 黄杰 wrote:
> >>> David Hildenbrand <david@redhat.com> 于2022年10月17日周一 16:44写道:
> >>>>
> >>>> On 12.10.22 10:15, Albert Huang wrote:
> >>>>> From: "huangjie.albert" <huangjie.albert@bytedance.com>
> >>>>>
> >>>>> implement these two functions so that we can set the mempolicy to
> >>>>> the inode of the hugetlb file. This ensures that the mempolicy of
> >>>>> all processes sharing this huge page file is consistent.
> >>>>>
> >>>>> In some scenarios where huge pages are shared:
> >>>>> if we need to limit the memory usage of vm within node0, so I set qemu's
> >>>>> mempilciy bind to node0, but if there is a process (such as virtiofsd)
> >>>>> shared memory with the vm, in this case. If the page fault is triggered
> >>>>> by virtiofsd, the allocated memory may go to node1 which  depends on
> >>>>> virtiofsd.
> >>>>>
> >>>>
> >>>> Any VM that uses hugetlb should be preallocating memory. For example,
> >>>> this is the expected default under QEMU when using huge pages.
> >>>>
> >>>> Once preallocation does the right thing regarding NUMA policy, there is
> >>>> no need to worry about it in other sub-processes.
> >>>>
> >>>
> >>> Hi, David
> >>> thanks for your reminder
> >>>
> >>> Yes, you are absolutely right, However, the pre-allocation mechanism
> >>> does solve this problem.
> >>> However, some scenarios do not like to use the pre-allocation mechanism, such as
> >>> scenarios that are sensitive to virtual machine startup time, or
> >>> scenarios that require
> >>> high memory utilization. The on-demand allocation mechanism may be better,
> >>> so the key point is to find a way support for shared policy。
> >>
> >> Using hugetlb -- with a fixed pool size -- without preallocation is like
> >> playing with fire. Hugetlb reservation makes one believe that on-demand
> >> allocation is going to work, but there are various scenarios where that
> >> can go seriously wrong, and you can run out of huge pages.
> >>
> >> If you're using hugetlb as memory backend for a VM without
> >> preallocation, you really have to be very careful. I can only advise
> >> against doing that.
> >>
> >>
> >> Also: why does another process read/write *first* to a guest physical
> >> memory location before the OS running inside the VM even initialized
> >> that memory? That sounds very wrong. What am I missing?
> >>
> >
> > for example : virtio ring buffer.
> > For the avial descriptor, the guest kernel only gives an address to
> > the backend,
> > and does not actually access the memory.
>
> Okay, thanks. So we're essentially providing uninitialized memory to a
> device? Hm, that implies that the device might have access to memory
> that was previously used by someone else ... not sure how to feel about
> that, but maybe this is just the way of doing things.
>
> The "easy" user-space fix would be to simply similarly mbind() in  the
> other processes where we mmap(). Has that option been explored?

That would also solve the problem, but it requires changing every
process that shares memory with the VM, so it can't be done once and
for all.

>
> --
> Thanks,
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm: hugetlb: support get/set_policy for hugetlb_vm_ops
  2022-10-18  9:27           ` 黄杰
@ 2022-10-18  9:35             ` David Hildenbrand
  0 siblings, 0 replies; 17+ messages in thread
From: David Hildenbrand @ 2022-10-18  9:35 UTC (permalink / raw)
  To: 黄杰
  Cc: Muchun Song, Mike Kravetz, Andrew Morton, linux-mm, linux-kernel

On 18.10.22 11:27, 黄杰 wrote:
> David Hildenbrand <david@redhat.com> 于2022年10月17日周一 20:00写道:
>>
>> On 17.10.22 13:46, 黄杰 wrote:
>>> David Hildenbrand <david@redhat.com> 于2022年10月17日周一 19:33写道:
>>>>
>>>> On 17.10.22 11:48, 黄杰 wrote:
>>>>> David Hildenbrand <david@redhat.com> 于2022年10月17日周一 16:44写道:
>>>>>>
>>>>>> On 12.10.22 10:15, Albert Huang wrote:
>>>>>>> From: "huangjie.albert" <huangjie.albert@bytedance.com>
>>>>>>>
>>>>>>> implement these two functions so that we can set the mempolicy to
>>>>>>> the inode of the hugetlb file. This ensures that the mempolicy of
>>>>>>> all processes sharing this huge page file is consistent.
>>>>>>>
>>>>>>> In some scenarios where huge pages are shared:
>>>>>>> if we need to limit the memory usage of vm within node0, so I set qemu's
>>>>>>> mempilciy bind to node0, but if there is a process (such as virtiofsd)
>>>>>>> shared memory with the vm, in this case. If the page fault is triggered
>>>>>>> by virtiofsd, the allocated memory may go to node1 which  depends on
>>>>>>> virtiofsd.
>>>>>>>
>>>>>>
>>>>>> Any VM that uses hugetlb should be preallocating memory. For example,
>>>>>> this is the expected default under QEMU when using huge pages.
>>>>>>
>>>>>> Once preallocation does the right thing regarding NUMA policy, there is
>>>>>> no need to worry about it in other sub-processes.
>>>>>>
>>>>>
>>>>> Hi, David
>>>>> thanks for your reminder
>>>>>
>>>>> Yes, you are absolutely right, However, the pre-allocation mechanism
>>>>> does solve this problem.
>>>>> However, some scenarios do not like to use the pre-allocation mechanism, such as
>>>>> scenarios that are sensitive to virtual machine startup time, or
>>>>> scenarios that require
>>>>> high memory utilization. The on-demand allocation mechanism may be better,
>>>>> so the key point is to find a way support for shared policy。
>>>>
>>>> Using hugetlb -- with a fixed pool size -- without preallocation is like
>>>> playing with fire. Hugetlb reservation makes one believe that on-demand
>>>> allocation is going to work, but there are various scenarios where that
>>>> can go seriously wrong, and you can run out of huge pages.
>>>>
>>>> If you're using hugetlb as memory backend for a VM without
>>>> preallocation, you really have to be very careful. I can only advise
>>>> against doing that.
>>>>
>>>>
>>>> Also: why does another process read/write *first* to a guest physical
>>>> memory location before the OS running inside the VM even initialized
>>>> that memory? That sounds very wrong. What am I missing?
>>>>
>>>
>>> for example : virtio ring buffer.
>>> For the avial descriptor, the guest kernel only gives an address to
>>> the backend,
>>> and does not actually access the memory.
>>
>> Okay, thanks. So we're essentially providing uninitialized memory to a
>> device? Hm, that implies that the device might have access to memory
>> that was previously used by someone else ... not sure how to feel about
>> that, but maybe this is just the way of doing things.
>>
>> The "easy" user-space fix would be to simply similarly mbind() in  the
>> other processes where we mmap(). Has that option been explored?
> 
> This can also solve the problem temporarily, but we need to change all
> processes that share memory with it, so it can't be done once and for
> all

How many processes that really care are we talking about? Usually,
extending the vhost-user protocol and the relevant libraries should be
good enough.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v2] mm: hugetlb: support for shared memory policy
  2022-10-14 16:56   ` Mike Kravetz
  2022-10-17  3:35     ` [External] " 黄杰
@ 2022-10-19  9:29     ` Albert Huang
  2022-10-19 11:49       ` Aneesh Kumar K V
  1 sibling, 1 reply; 17+ messages in thread
From: Albert Huang @ 2022-10-19  9:29 UTC (permalink / raw)
  To: mike.kravetz
  Cc: huangjie.albert, Jonathan Corbet, Muchun Song, Andrew Morton,
	Aneesh Kumar K.V, linux-doc, linux-kernel, linux-mm

From: "huangjie.albert" <huangjie.albert@bytedance.com>

Implement get/set_policy for hugetlb_vm_ops to support shared policy.
This ensures that the mempolicy of all processes sharing this huge page
file is consistent.

Example scenario where huge pages are shared:
to keep a VM's memory usage within node0, qemu's mempolicy is bound to
node0; but if another process (such as virtiofsd) shares memory with
the VM and the page fault is triggered by virtiofsd, the allocated
memory may instead go to node1, depending on virtiofsd's own policy.
Although qemu's memory preallocation avoids this issue, it
significantly increases the creation time of the VM (a few seconds,
depending on memory size).

After hugetlb_vm_ops (set/get_policy) is hooked up:
shared memory segments created by shmget() with the SHM_HUGETLB flag,
as well as mappings created with mmap(MAP_SHARED|MAP_HUGETLB), also
support shared policy.

v1->v2:
1. hugetlb shares the memory policy only when the vma has the VM_SHARED flag.
2. update the documentation.

Signed-off-by: huangjie.albert <huangjie.albert@bytedance.com>
---
 .../admin-guide/mm/numa_memory_policy.rst     | 20 +++++++++------
 mm/hugetlb.c                                  | 25 +++++++++++++++++++
 2 files changed, 37 insertions(+), 8 deletions(-)

diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index 5a6afecbb0d0..5672a6c2d2ef 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -133,14 +133,18 @@ Shared Policy
 	the object share the policy, and all pages allocated for the
 	shared object, by any task, will obey the shared policy.
 
-	As of 2.6.22, only shared memory segments, created by shmget() or
-	mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy.  When shared
-	policy support was added to Linux, the associated data structures were
-	added to hugetlbfs shmem segments.  At the time, hugetlbfs did not
-	support allocation at fault time--a.k.a lazy allocation--so hugetlbfs
-	shmem segments were never "hooked up" to the shared policy support.
-	Although hugetlbfs segments now support lazy allocation, their support
-	for shared policy has not been completed.
+	As of 2.6.22, only shared memory segments, created by shmget() without
+	SHM_HUGETLB flag or mmap(MAP_ANONYMOUS|MAP_SHARED) without MAP_HUGETLB
+	flag, support shared policy. When shared policy support was added to Linux,
+	the associated data structures were added to hugetlbfs shmem segments.
+	At the time, hugetlbfs did not support allocation at fault time--a.k.a
+	lazy allocation--so hugetlbfs shmem segments were never "hooked up" to
+	the shared policy support. Although hugetlbfs segments now support lazy
+	allocation, their support for shared policy has not been completed.
+
+	after we hooked up hugetlb_vm_ops(set/get_policy):
+	both the shared memory segments created by shmget() with SHM_HUGETLB flag
+	and mmap(MAP_SHARED|MAP_HUGETLB), also support shared policy.
 
 	As mentioned above in :ref:`VMA policies <vma_policy>` section,
 	allocations of page cache pages for regular files mmap()ed
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 87d875e5e0a9..fc7038931832 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4632,6 +4632,27 @@ static vm_fault_t hugetlb_vm_op_fault(struct vm_fault *vmf)
 	return 0;
 }
 
+#ifdef CONFIG_NUMA
+int hugetlb_vm_op_set_policy(struct vm_area_struct *vma, struct mempolicy *mpol)
+{
+	struct inode *inode = file_inode(vma->vm_file);
+
+	if (!(vma->vm_flags & VM_SHARED))
+		return 0;
+
+	return mpol_set_shared_policy(&HUGETLBFS_I(inode)->policy, vma, mpol);
+}
+
+struct mempolicy *hugetlb_vm_op_get_policy(struct vm_area_struct *vma, unsigned long addr)
+{
+	struct inode *inode = file_inode(vma->vm_file);
+	pgoff_t index;
+
+	index = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
+	return mpol_shared_policy_lookup(&HUGETLBFS_I(inode)->policy, index);
+}
+#endif
+
 /*
  * When a new function is introduced to vm_operations_struct and added
  * to hugetlb_vm_ops, please consider adding the function to shm_vm_ops.
@@ -4645,6 +4666,10 @@ const struct vm_operations_struct hugetlb_vm_ops = {
 	.close = hugetlb_vm_op_close,
 	.may_split = hugetlb_vm_op_split,
 	.pagesize = hugetlb_vm_op_pagesize,
+#ifdef CONFIG_NUMA
+	.set_policy = hugetlb_vm_op_set_policy,
+	.get_policy = hugetlb_vm_op_get_policy,
+#endif
 };
 
 static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
-- 
2.37.0 (Apple Git-136)


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm: hugetlb: support get/set_policy for hugetlb_vm_ops
  2022-10-12 19:45 ` Hugh Dickins
  2022-10-14 16:56   ` Mike Kravetz
@ 2022-10-19  9:33   ` 黄杰
  2022-10-23 20:16     ` Hugh Dickins
  1 sibling, 1 reply; 17+ messages in thread
From: 黄杰 @ 2022-10-19  9:33 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Mike Kravetz, Muchun Song, Andi Kleen, Andrew Morton, linux-mm,
	linux-kernel

Hugh Dickins <hughd@google.com> wrote on Thu, Oct 13, 2022 at 03:45:
>
> On Wed, 12 Oct 2022, Albert Huang wrote:
>
> > From: "huangjie.albert" <huangjie.albert@bytedance.com>
> >
> > implement these two functions so that we can set the mempolicy to
> > the inode of the hugetlb file. This ensures that the mempolicy of
> > all processes sharing this huge page file is consistent.
> >
> > In some scenarios where huge pages are shared:
> > if we need to limit the memory usage of vm within node0, so I set qemu's
> > mempilciy bind to node0, but if there is a process (such as virtiofsd)
> > shared memory with the vm, in this case. If the page fault is triggered
> > by virtiofsd, the allocated memory may go to node1 which  depends on
> > virtiofsd.
> >
> > Signed-off-by: huangjie.albert <huangjie.albert@bytedance.com>
>
> Aha!  Congratulations for noticing, after all this time.  hugetlbfs
> contains various little pieces of code that pretend to be supporting
> shared NUMA mempolicy, but in fact there was nothing connecting it up.
>
> It will be for Mike to decide, but personally I oppose adding
> shared NUMA mempolicy support to hugetlbfs, after eighteen years.
>
> The thing is, it will change the behaviour of NUMA on hugetlbfs:
> in ways that would have been sensible way back then, yes; but surely
> those who have invested in NUMA and hugetlbfs have developed other
> ways of administering it successfully, without shared NUMA mempolicy.
>
> At the least, I would expect some tests to break (I could easily be
> wrong), and there's a chance that some app or tool would break too.

Hi Hugh,

Could you share some of the issues you expect here?

Thanks.

>
> I have carried the reverse of Albert's patch for a long time, stripping
> out the pretence of shared NUMA mempolicy support from hugetlbfs: I
> wanted that, so that I could work on modifying the tmpfs implementation,
> without having to worry about other users.
>
> Mike, if you would prefer to see my patch stripping out the pretence,
> let us know: it has never been a priority to send in, but I can update
> it to 6.1-rc1 if you'd like to see it.  (Once upon a time, it removed
> all need for struct hugetlbfs_inode_info, but nowadays that's still
> required for the memfd seals.)
>
> Whether Albert's patch is complete and correct, I haven't begun to think
> about: I am not saying it isn't, but shared NUMA mempolicy adds another
> dimension of complexity, and need for support, that I think hugetlbfs
> would be better off continuing to survive without.
>
> Hugh
>
> > ---
> >  mm/hugetlb.c | 22 ++++++++++++++++++++++
> >  1 file changed, 22 insertions(+)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 0ad53ad98e74..ed7599821655 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -4678,6 +4678,24 @@ static vm_fault_t hugetlb_vm_op_fault(struct vm_fault *vmf)
> >       return 0;
> >  }
> >
> > +#ifdef CONFIG_NUMA
> > +int hugetlb_vm_op_set_policy(struct vm_area_struct *vma, struct mempolicy *mpol)
> > +{
> > +     struct inode *inode = file_inode(vma->vm_file);
> > +
> > +     return mpol_set_shared_policy(&HUGETLBFS_I(inode)->policy, vma, mpol);
> > +}
> > +
> > +struct mempolicy *hugetlb_vm_op_get_policy(struct vm_area_struct *vma, unsigned long addr)
> > +{
> > +     struct inode *inode = file_inode(vma->vm_file);
> > +     pgoff_t index;
> > +
> > +     index = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> > +     return mpol_shared_policy_lookup(&HUGETLBFS_I(inode)->policy, index);
> > +}
> > +#endif
> > +
> >  /*
> >   * When a new function is introduced to vm_operations_struct and added
> >   * to hugetlb_vm_ops, please consider adding the function to shm_vm_ops.
> > @@ -4691,6 +4709,10 @@ const struct vm_operations_struct hugetlb_vm_ops = {
> >       .close = hugetlb_vm_op_close,
> >       .may_split = hugetlb_vm_op_split,
> >       .pagesize = hugetlb_vm_op_pagesize,
> > +#ifdef CONFIG_NUMA
> > +     .set_policy = hugetlb_vm_op_set_policy,
> > +     .get_policy = hugetlb_vm_op_get_policy,
> > +#endif
> >  };
> >
> >  static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
> > --
> > 2.31.1

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] mm: hugetlb: support for shared memory policy
  2022-10-19  9:29     ` [PATCH v2] mm: hugetlb: support for shared memory policy Albert Huang
@ 2022-10-19 11:49       ` Aneesh Kumar K V
  0 siblings, 0 replies; 17+ messages in thread
From: Aneesh Kumar K V @ 2022-10-19 11:49 UTC (permalink / raw)
  To: Albert Huang, mike.kravetz
  Cc: Jonathan Corbet, Muchun Song, Andrew Morton, linux-doc,
	linux-kernel, linux-mm

On 10/19/22 2:59 PM, Albert Huang wrote:
> From: "huangjie.albert" <huangjie.albert@bytedance.com>
> 
> implement get/set_policy for hugetlb_vm_ops to support the shared policy
> This ensures that the mempolicy of all processes sharing this huge page
> file is consistent.
> 
> In some scenarios where huge pages are shared:
> if we need to limit the memory usage of vm within node0, so I set qemu's
> mempilciy bind to node0, but if there is a process (such as virtiofsd)
> shared memory with the vm, in this case. If the page fault is triggered
> by virtiofsd, the allocated memory may go to node1 which depends on
> virtiofsd. Although we can use the memory prealloc provided by qemu to
> avoid this issue, but this method will significantly increase the
> creation time of the vm(a few seconds, depending on memory size).
> 
> after we hooked up hugetlb_vm_ops(set/get_policy):
> both the shared memory segments created by shmget() with SHM_HUGETLB flag
> and the mmap(MAP_SHARED|MAP_HUGETLB), also support shared policy.
> 
> v1->v2:
> 1、hugetlb share the memory policy when the vma with the VM_SHARED flag.
> 2、update the documentation.
> 
> Signed-off-by: huangjie.albert <huangjie.albert@bytedance.com>
> ---
>  .../admin-guide/mm/numa_memory_policy.rst     | 20 +++++++++------
>  mm/hugetlb.c                                  | 25 +++++++++++++++++++
>  2 files changed, 37 insertions(+), 8 deletions(-)
> 
> diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
> index 5a6afecbb0d0..5672a6c2d2ef 100644
> --- a/Documentation/admin-guide/mm/numa_memory_policy.rst
> +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
> @@ -133,14 +133,18 @@ Shared Policy
>  	the object share the policy, and all pages allocated for the
>  	shared object, by any task, will obey the shared policy.
>  
> -	As of 2.6.22, only shared memory segments, created by shmget() or
> -	mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy.  When shared
> -	policy support was added to Linux, the associated data structures were
> -	added to hugetlbfs shmem segments.  At the time, hugetlbfs did not
> -	support allocation at fault time--a.k.a lazy allocation--so hugetlbfs
> -	shmem segments were never "hooked up" to the shared policy support.
> -	Although hugetlbfs segments now support lazy allocation, their support
> -	for shared policy has not been completed.
> +	As of 2.6.22, only shared memory segments, created by shmget() without
> +	SHM_HUGETLB flag or mmap(MAP_ANONYMOUS|MAP_SHARED) without MAP_HUGETLB
> +	flag, support shared policy. When shared policy support was added to Linux,
> +	the associated data structures were added to hugetlbfs shmem segments.
> +	At the time, hugetlbfs did not support allocation at fault time--a.k.a
> +	lazy allocation--so hugetlbfs shmem segments were never "hooked up" to
> +	the shared policy support. Although hugetlbfs segments now support lazy
> +	allocation, their support for shared policy has not been completed.
> +
> +	after we hooked up hugetlb_vm_ops(set/get_policy):
> +	both the shared memory segments created by shmget() with SHM_HUGETLB flag
> +	and mmap(MAP_SHARED|MAP_HUGETLB), also support shared policy.
>  
>  	As mentioned above in :ref:`VMA policies <vma_policy>` section,
>  	allocations of page cache pages for regular files mmap()ed
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 87d875e5e0a9..fc7038931832 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4632,6 +4632,27 @@ static vm_fault_t hugetlb_vm_op_fault(struct vm_fault *vmf)
>  	return 0;
>  }
>  
> +#ifdef CONFIG_NUMA
> +int hugetlb_vm_op_set_policy(struct vm_area_struct *vma, struct mempolicy *mpol)
> +{
> +	struct inode *inode = file_inode(vma->vm_file);
> +
> +	if (!(vma->vm_flags & VM_SHARED))
> +		return 0;
> +
> +	return mpol_set_shared_policy(&HUGETLBFS_I(inode)->policy, vma, mpol);
> +}
> +
> +struct mempolicy *hugetlb_vm_op_get_policy(struct vm_area_struct *vma, unsigned long addr)
> +{
> +	struct inode *inode = file_inode(vma->vm_file);
> +	pgoff_t index;
> +
> +	index = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> +	return mpol_shared_policy_lookup(&HUGETLBFS_I(inode)->policy, index);
> +}
> +#endif
> +
>  /*
>   * When a new function is introduced to vm_operations_struct and added
>   * to hugetlb_vm_ops, please consider adding the function to shm_vm_ops.
> @@ -4645,6 +4666,10 @@ const struct vm_operations_struct hugetlb_vm_ops = {
>  	.close = hugetlb_vm_op_close,
>  	.may_split = hugetlb_vm_op_split,
>  	.pagesize = hugetlb_vm_op_pagesize,
> +#ifdef CONFIG_NUMA
> +	.set_policy = hugetlb_vm_op_set_policy,
> +	.get_policy = hugetlb_vm_op_get_policy,
> +#endif
>  };
>  
>  static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,


How is the current usage of

/* Set numa allocation policy based on index */
hugetlb_set_vma_policy(&pseudo_vma, inode, index);

enforcing the policy with the current code? Also, if we have
get_policy(), can we remove that usage in hugetlbfs_fallocate() after
this patch? With shared policy we should be able to fetch the policy
via get_vma_policy().

A related question: does shm_pseudo_vma_init() still require that
mempolicy lookup?

-aneesh



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm: hugetlb: support get/set_policy for hugetlb_vm_ops
  2022-10-19  9:33   ` [PATCH] mm: hugetlb: support get/set_policy for hugetlb_vm_ops 黄杰
@ 2022-10-23 20:16     ` Hugh Dickins
  0 siblings, 0 replies; 17+ messages in thread
From: Hugh Dickins @ 2022-10-23 20:16 UTC (permalink / raw)
  To: 黄杰
  Cc: Hugh Dickins, Mike Kravetz, Muchun Song, Andi Kleen,
	Andrew Morton, linux-mm, linux-kernel

On Wed, 19 Oct 2022, 黄杰 wrote:
> Hugh Dickins <hughd@google.com> 于2022年10月13日周四 03:45写道:
> > On Wed, 12 Oct 2022, Albert Huang wrote:
> > > From: "huangjie.albert" <huangjie.albert@bytedance.com>
> > >
> > > implement these two functions so that we can set the mempolicy to
> > > the inode of the hugetlb file. This ensures that the mempolicy of
> > > all processes sharing this huge page file is consistent.
> > >
> > > In some scenarios where huge pages are shared:
> > > if we need to limit the memory usage of vm within node0, so I set qemu's
> > > mempilciy bind to node0, but if there is a process (such as virtiofsd)
> > > shared memory with the vm, in this case. If the page fault is triggered
> > > by virtiofsd, the allocated memory may go to node1 which  depends on
> > > virtiofsd.
> > >
> > > Signed-off-by: huangjie.albert <huangjie.albert@bytedance.com>
> >
> > Aha!  Congratulations for noticing, after all this time.  hugetlbfs
> > contains various little pieces of code that pretend to be supporting
> > shared NUMA mempolicy, but in fact there was nothing connecting it up.
> >
> > It will be for Mike to decide, but personally I oppose adding
> > shared NUMA mempolicy support to hugetlbfs, after eighteen years.
> >
> > The thing is, it will change the behaviour of NUMA on hugetlbfs:
> > in ways that would have been sensible way back then, yes; but surely
> > those who have invested in NUMA and hugetlbfs have developed other
> > ways of administering it successfully, without shared NUMA mempolicy.
> >
> > At the least, I would expect some tests to break (I could easily be
> > wrong), and there's a chance that some app or tool would break too.
> 
> Hi : Hugh
> 
> Can you share some issues here?

Sorry, I don't think I can: precisely because it's been such a relief
to know that hugetlbfs is not in the shared NUMA mempolicy game, I've
given no thought to what issues it might have if it joined the game.

Not much memory is wasted on the unused fields in hugetlbfs_inode_info,
just a few bytes per inode, that aspect doesn't concern me much.

Reference counting of shared mempolicy has certainly been a recurrent
problem in the past (see mpol_needs_cond_ref() etc): stable nowadays
I believe; but whether supporting hugetlbfs would cause new problems
to surface there, I don't know; but whatever, those would just be
bugs to be fixed.

/proc/pid/numa_maps does not represent shared NUMA mempolicies
correctly: not for tmpfs, and would not for hugetlbfs.  I did have
old patches to fix that, but not patches that I'm ever likely to
have time to resurrect and present and push.

My main difficulties in tmpfs were with how to deal correctly and
consistently with non-hugepage-aligned mempolicies when hugepages
are in use.  In the case of hugetlbfs, it would be simpler, since
you're always dealing in hugepages of a known size: I recommend
being as strict as possible, demanding correctly aligned mempolicy
or else EINVAL.  (That may already be enforced, I've not looked.)
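
A minimal sketch of the kind of strictness I mean, using the usual
hugetlb helpers (illustrative only, I haven't checked it against the
current set_policy path):

        /* Sketch: in the hugetlb set_policy hook, refuse any policy
         * range that does not cover whole huge pages. */
        struct hstate *h = hstate_vma(vma);
        unsigned long len = vma->vm_end - vma->vm_start;

        if (!IS_ALIGNED(vma->vm_start, huge_page_size(h)) ||
            !IS_ALIGNED(len, huge_page_size(h)))
                return -EINVAL;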

But my main concern in extending shared NUMA mempolicy to hugetlbfs
is exactly what I already said earlier:

The thing is, it will change the behaviour of NUMA on hugetlbfs:
in ways that would have been sensible way back then, yes; but surely
those who have invested in NUMA and hugetlbfs have developed other
ways of administering it successfully, without shared NUMA mempolicy.

At the least, I would expect some tests to break (I could easily be
wrong), and there's a chance that some app or tool would break too.

It's a risk, and a body of complication, that I would keep away from
myself.  The shared mempolicy rbtree: makes sense, but no madvise()
since has implemented such a tree, to attach its advice to ranges
of the shared object rather than to vma.

Hugh

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2022-10-23 20:16 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-12  8:15 [PATCH] mm: hugetlb: support get/set_policy for hugetlb_vm_ops Albert Huang
2022-10-12 19:45 ` Hugh Dickins
2022-10-14 16:56   ` Mike Kravetz
2022-10-17  3:35     ` [External] " 黄杰
2022-10-19  9:29     ` [PATCH v2] mm: hugetlb: support for shared memory policy Albert Huang
2022-10-19 11:49       ` Aneesh Kumar K V
2022-10-19  9:33   ` [PATCH] mm: hugetlb: support get/set_policy for hugetlb_vm_ops 黄杰
2022-10-23 20:16     ` Hugh Dickins
2022-10-17  8:44 ` David Hildenbrand
2022-10-17  9:48   ` [External] " 黄杰
2022-10-17 11:33     ` David Hildenbrand
2022-10-17 11:46       ` 黄杰
2022-10-17 12:00         ` David Hildenbrand
2022-10-18  9:27           ` 黄杰
2022-10-18  9:35             ` David Hildenbrand
2022-10-17 17:59       ` [External] " Mike Kravetz
2022-10-18  9:24         ` 黄杰
