* [RFC PATCH v3 00/49] 1GB PUD THP support on x86_64
@ 2021-02-24 22:35 Zi Yan
  2021-02-25 11:02 ` David Hildenbrand
  2021-03-02  1:59 ` Roman Gushchin
  0 siblings, 2 replies; 15+ messages in thread
From: Zi Yan @ 2021-02-24 22:35 UTC (permalink / raw)
  To: linux-mm
  Cc: Matthew Wilcox, Kirill A . Shutemov, Roman Gushchin,
	Andrew Morton, Yang Shi, Michal Hocko, John Hubbard,
	Ralph Campbell, David Nellans, Jason Gunthorpe, David Rientjes,
	Vlastimil Babka, David Hildenbrand, Mike Kravetz, Song Liu,
	Zi Yan

From: Zi Yan <ziy@nvidia.com>

Hi all,

I have rebased my 1GB PUD THP support patches on v5.11-mmotm-2021-02-18-18-29
and the code is available at
https://github.com/x-y-z/linux-1gb-thp/tree/1gb_thp_v5.11-mmotm-2021-02-18-18-29
if you want to give it a try. The actual 49 patches are not sent out with this
cover letter. :)

Instead of asking for code review, I would like to discuss the concerns raised
in previous RFCs. I think there are two major ones:

1. 1GB page allocation. The current implementation allocates 1GB pages from
   CMA regions that are reserved at boot time, like hugetlbfs (a sketch of
   this allocation path follows this list). The concern with using CMA is
   that an educated guess is needed to avoid depleting kernel memory in case
   the CMA regions are set too large. Recently David Rientjes proposed using
   process_madvise() for hugepage collapse, which is an alternative [1] but
   might not work for 1GB pages, since there is no way of _allocating_ a 1GB
   page into which to collapse pages. I proposed a similar approach at LSF/MM
   2019, generating physically contiguous memory after pages are allocated
   [2], which is usable for 1GB THPs. This approach does in-place huge page
   promotion and thus does not require page allocation.

2. Large amount of new code to review. Most of the added code is simply
   copied and pasted from the existing PMD THP code. I have tried to reduce
   the new code size by reusing some existing code [3], but did not find a
   good way to reuse the PMD handling code for PUD, which is the major part
   of this patchset. I am all ears if you have any ideas on how to reduce the
   new code size or make code review easier.
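
To make the CMA side of concern 1 concrete, below is a minimal sketch of the
allocation path. It is not the actual patch code: thp_cma and both helper
names are made up for illustration, while cma_declare_contiguous() and
cma_alloc() are the existing kernel APIs (signatures as of v5.11):

#include <linux/cma.h>

/*
 * Boot-time reservation, analogous to hugetlb_cma: carve out a CMA area
 * large enough for nr_gigantic 1GB pages. The area is reserved early,
 * which is why an educated guess of its size is required.
 */
static struct cma *thp_cma;

static int __init thp_cma_reserve(unsigned long nr_gigantic)
{
        return cma_declare_contiguous(0, nr_gigantic * PUD_SIZE, 0, PUD_SIZE,
                                      0, false, "thp_cma", &thp_cma);
}

/*
 * Page-fault-time allocation of one PUD-sized (1GB) page from that area.
 * count and align are in base-page units: 1GB is
 * 1 << (PUD_SHIFT - PAGE_SHIFT) pages on x86_64.
 */
static struct page *alloc_pud_thp_page(void)
{
        return cma_alloc(thp_cma, 1UL << (PUD_SHIFT - PAGE_SHIFT),
                         PUD_SHIFT - PAGE_SHIFT, true /* no_warn */);
}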


Any comment or suggestion is welcome. Thanks.

[1] https://lore.kernel.org/linux-mm/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/
[2] https://lwn.net/Articles/779979/
[3] https://lwn.net/Articles/837928/


Roman Gushchin (2):
  mm: cma: introduce cma_release_nowait()
  mm: hugetlb: don't drop hugetlb_lock around cma_release() call

Zi Yan (47):
  mm: memcg: make memcg huge page split support any order split.
  mm: page_owner: add support for splitting to any order in split
    page_owner.
  mm: thp: add support for split huge page to any lower order pages.
  mm: thp: use single linked list for THP page table page deposit.
  mm: add new helper functions to allocate one PMD page with
    HPAGE_PMD_NR PTE pages.
  mm: thp: add page table deposit/withdraw functions for PUD THP.
  mm: change thp_order and thp_nr as we will have not just PMD THPs.
  mm: thp: add anonymous PUD THP page fault support without enabling it.
  mm: thp: add PUD THP support for copy_huge_pud.
  mm: thp: add PUD THP support to zap_huge_pud.
  fs: proc: add PUD THP kpageflag.
  mm: thp: handling PUD THP reference bit.
  mm: rmap: add mapped/unmapped page order to anonymous page rmap
    functions.
  mm: rmap: add map_order to page_remove_anon_compound_rmap.
  mm: add pud manipulation functions.
  mm: thp: add PUDDoubleMap page flag for PUD- and PMD-mapped pages.
  mm: thp: add pmd_compound_mapcount for PMD mappings in PUD THPs.
  mm: thp: add split_huge_pud() function to split PUD entries.
  mm: thp: handle PMD-mapped PUD THP in split_huge_pmd functions.
  mm: thp: adjust page map counting functions for PMD- and PTE-mapped
    PUD THPs.
  mm: thp: new ttu_flags to split huge pud during try_to_unmap.
  mm: thp: add new checks for zap_huge_pmd.
  mm: thp: add pud split events.
  mm: thp: split pud when adjusting vma ranges.
  mm: thp: handle PUD THP properly at page allocation and deallocation.
  mm: rmap: handle PUD-, PMD- and PTE-mapped PUD THP properly in rmap.
  mm: page_walk: handle PUD after pud entry split.
  mm: thp: use split_huge_page_to_order_to_list for split huge pud page.
  mm: thp: add PUD THP to deferred split list when PUD mapping is gone.
  mm: debug: adapt dump_page to PUD THP.
  mm: thp: PUD THP COW splits PUD page and falls back to PMD page.
  mm: thp: PUD THP follow_p*d_page() support.
  mm: stats: make smap stats understand PUD THPs.
  mm: page_vma_walk: teach it about PMD-mapped PUD THP.
  mm: thp: PUD THP support in try_to_unmap().
  mm: thp: split PUD THPs at page reclaim.
  mm: support PUD THP pagemap support.
  mm: madvise: add page size options to MADV_HUGEPAGE and
    MADV_NOHUGEPAGE.
  mm: vma: add VM_HUGEPAGE_PUD to vm_flags at bit 37.
  mm: thp: add a global knob to enable/disable PUD THPs.
  mm: thp: make PUD THP size public.
  hugetlb: cma: move cma reserve function to cma.c.
  mm: thp: use cma reservation for pud thp allocation.
  mm: thp: enable anonymous PUD THP at page fault path.
  mm: cma: only clear bitmap no freeing pages.
  mm: thp: clear cma bitmap during PUD THP split.
  mm: migrate: split PUD THP if it is going to be migrated.

 .../admin-guide/kernel-parameters.txt         |   2 +-
 Documentation/admin-guide/mm/transhuge.rst    |   1 +
 arch/arm64/mm/hugetlbpage.c                   |   2 +-
 arch/powerpc/mm/hugetlbpage.c                 |   2 +-
 arch/x86/include/asm/pgalloc.h                |  69 ++
 arch/x86/include/asm/pgtable.h                |  26 +
 arch/x86/kernel/setup.c                       |   8 +-
 arch/x86/mm/pgtable.c                         |  38 +
 drivers/base/node.c                           |   2 +
 fs/proc/meminfo.c                             |   2 +
 fs/proc/page.c                                |   2 +
 fs/proc/task_mmu.c                            | 126 ++-
 include/linux/cma.h                           |  20 +
 include/linux/huge_mm.h                       |  92 ++-
 include/linux/hugetlb.h                       |  12 -
 include/linux/llist.h                         |  11 +
 include/linux/memcontrol.h                    |   5 +-
 include/linux/mm.h                            |  53 +-
 include/linux/mm_types.h                      |  13 +-
 include/linux/mmu_notifier.h                  |  13 +
 include/linux/mmzone.h                        |   1 +
 include/linux/page-flags.h                    |  25 +
 include/linux/page_owner.h                    |  10 +-
 include/linux/pgtable.h                       |  34 +
 include/linux/rmap.h                          |  10 +-
 include/linux/vm_event_item.h                 |   7 +
 include/uapi/asm-generic/mman-common.h        |  23 +
 include/uapi/linux/kernel-page-flags.h        |   1 +
 kernel/events/uprobes.c                       |   4 +-
 kernel/fork.c                                 |  10 +-
 mm/cma.c                                      | 226 ++++++
 mm/cma.h                                      |   5 +
 mm/debug.c                                    |   6 +-
 mm/gup.c                                      |  60 +-
 mm/huge_memory.c                              | 748 ++++++++++++++++--
 mm/hugetlb.c                                  | 126 +--
 mm/khugepaged.c                               |  16 +-
 mm/ksm.c                                      |   4 +-
 mm/madvise.c                                  |  17 +-
 mm/memcontrol.c                               |   6 +-
 mm/memory.c                                   |  28 +-
 mm/mempolicy.c                                |  14 +-
 mm/migrate.c                                  |  16 +-
 mm/page_alloc.c                               |  55 +-
 mm/page_owner.c                               |  13 +-
 mm/page_vma_mapped.c                          | 171 +++-
 mm/pagewalk.c                                 |   6 +-
 mm/pgtable-generic.c                          |  49 +-
 mm/rmap.c                                     | 297 +++++--
 mm/swap_slots.c                               |   2 +
 mm/swapfile.c                                 |  11 +-
 mm/userfaultfd.c                              |   2 +-
 mm/util.c                                     |  18 +-
 mm/vmscan.c                                   |  33 +-
 mm/vmstat.c                                   |   8 +
 55 files changed, 2160 insertions(+), 401 deletions(-)

-- 
2.30.0




* Re: [RFC PATCH v3 00/49] 1GB PUD THP support on x86_64
  2021-02-24 22:35 [RFC PATCH v3 00/49] 1GB PUD THP support on x86_64 Zi Yan
@ 2021-02-25 11:02 ` David Hildenbrand
  2021-02-25 22:13   ` Zi Yan
  2021-03-02  1:59 ` Roman Gushchin
  1 sibling, 1 reply; 15+ messages in thread
From: David Hildenbrand @ 2021-02-25 11:02 UTC (permalink / raw)
  To: Zi Yan, linux-mm
  Cc: Matthew Wilcox, Kirill A . Shutemov, Roman Gushchin,
	Andrew Morton, Yang Shi, Michal Hocko, John Hubbard,
	Ralph Campbell, David Nellans, Jason Gunthorpe, David Rientjes,
	Vlastimil Babka, Mike Kravetz, Song Liu

On 24.02.21 23:35, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> Hi all,
> 
> I have rebased my 1GB PUD THP support patches on v5.11-mmotm-2021-02-18-18-29
> and the code is available at
> https://github.com/x-y-z/linux-1gb-thp/tree/1gb_thp_v5.11-mmotm-2021-02-18-18-29
> if you want to give it a try. The actual 49 patches are not sent out with this
> cover letter. :)
> 
> Instead of asking for code review, I would like to discuss the concerns raised
> in previous RFCs. I think there are two major ones:
> 
> 1. 1GB page allocation. The current implementation allocates 1GB pages from CMA
>     regions that are reserved at boot time, like hugetlbfs. The concern with
>     using CMA is that an educated guess is needed to avoid depleting kernel
>     memory in case the CMA regions are set too large. Recently David Rientjes
>     proposed using process_madvise() for hugepage collapse, which is an
>     alternative [1] but might not work for 1GB pages, since there is no way of

I see two core ideas of THP:

1) Transparent to the user: you get a speedup without really caring
*except* having to enable/disable the optimization sometimes manually
(i.e., MADV_HUGEPAGE) - because in corner cases (e.g., userfaultfd),
it's not completely transparent and might have performance impacts.
mprotect(), mmap(MAP_FIXED), mremap() work as expected.

2) Transparent to other subsystems of the kernel: the page size of the 
mapping is in base pages - we can split anytime on demand in case we 
cannot handle THP. In addition, no special requirements: no CMA, no 
movability restrictions, no swappability restrictions, ... most stuff 
works transparently by splitting.

Your current approach messes with 2). Your proposal here messes with 1).

Any kind of explicit placement by the user can silently get reverted any
time. So process_madvise() would really only be useful in cases where a
temporary split might get reverted later on by the OS automatically -
like we have for 2MB THP right now.

So process_madvise() is less likely to help if the system won't try 
collapsing automatically (more below).

>     _allocating_ a 1GB page into which to collapse pages. I proposed a similar
>     approach at LSF/MM 2019, generating physically contiguous memory after pages
>     are allocated [2], which is usable for 1GB THPs. This approach does in-place
>     huge page promotion and thus does not require page allocation.

I like the idea of forming a 1GB THP at a location where already 
consecutive pages allow for it. It can be applied generically - and both 
1) and 2) keep working as expected. Anytime there was a split, we can 
retry forming a THP later.

However, I don't follow how this is actually really feasible at a large
scale. You could only ever collapse into a 1GB THP if you happen to have
1GB of consecutive 2MB THPs / 4k pages already. Sounds to me like this
happens when the stars align.

-- 
Thanks,

David / dhildenb




* Re: [RFC PATCH v3 00/49] 1GB PUD THP support on x86_64
  2021-02-25 11:02 ` David Hildenbrand
@ 2021-02-25 22:13   ` Zi Yan
  2021-03-02  8:55     ` David Hildenbrand
  0 siblings, 1 reply; 15+ messages in thread
From: Zi Yan @ 2021-02-25 22:13 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, Matthew Wilcox, Kirill A . Shutemov, Roman Gushchin,
	Andrew Morton, Yang Shi, Michal Hocko, John Hubbard,
	Ralph Campbell, David Nellans, Jason Gunthorpe, David Rientjes,
	Vlastimil Babka, Mike Kravetz, Song Liu


On 25 Feb 2021, at 6:02, David Hildenbrand wrote:

> On 24.02.21 23:35, Zi Yan wrote:
>> [...]
>
> I see two core ideas of THP:
>
> 1) Transparent to the user: you get a speedup without really caring *except* having to enable/disable the optimization sometimes manually (i.e., MADV_HUGEPAGE) - because in corner cases (e.g., userfaultfd), it's not completely transparent and might have performance impacts. mprotect(), mmap(MAP_FIXED), mremap() work as expected.
>
> 2) Transparent to other subsystems of the kernel: the page size of the mapping is in base pages - we can split anytime on demand in case we cannot handle THP. In addition, no special requirements: no CMA, no movability restrictions, no swappability restrictions, ... most stuff works transparently by splitting.
>
> Your current approach messes with 2). Your proposal here messes with 1).
>
> Any kind of explicit placement by the user can silently get reverted any time. So process_madvise() would really only be useful in cases where a temporary split might get reverted later on by the OS automatically - like we have for 2MB THP right now.
>
> So process_madvise() is less likely to help if the system won't try collapsing automatically (more below).
>> [...]
>
> I like the idea of forming a 1GB THP at a location where already consecutive pages allow for it. It can be applied generically - and both 1) and 2) keep working as expected. Anytime there was a split, we can retry forming a THP later.
>
> However, I don't follow how this is actually really feasible at a large scale. You could only ever collapse into a 1GB THP if you happen to have 1GB of consecutive 2MB THPs / 4k pages already. Sounds to me like this happens when the stars align.

Both the process_madvise() approach and my proposal require page migration to bring back THPs, since, like you said, having consecutive pages ready is extremely rare. IIUC, the process_madvise() approach reuses khugepaged code to collapse huge pages,
namely first allocating a 2MB THP, then copying data over, and finally freeing the old base pages. My proposal would migrate pages within
a virtual address range (>1GB and 1GB-aligned) to get all physical pages contiguous, then promote the resulting 1GB of consecutive
pages to a 1GB THP. No new page allocation is needed.

Both approaches would need user-space invocation, assuming either the application itself wants to get THPs for a specific region or a user-space daemon would do this for a group of applications, instead of waiting for khugepaged to slowly (4096 pages every 10s) scan and do huge page collapse. Users will pay the cost of getting THPs. This also means THPs are not completely transparent to users, but I think it should be fine when users explicitly invoke these two methods to get THPs for better performance.

The difference in my proposal is that it does not need a 1GB THP allocation, so there are no special requirements like using CMA
or increasing MAX_ORDER in the buddy allocator to allow 1GB page allocations. It makes creating THPs with orders > MAX_ORDER possible
without other intrusive changes.
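
For illustration, the user-space side of opting a region in could look like
the sketch below. This is not code from the series: plain MADV_HUGEPAGE is
the existing 2MB-era hint, and the 1GB page size option the series adds to it
(see the mman-common.h change in the diffstat) is only mentioned in the
comment:

#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define GB (1UL << 30)

/* Reserve a 1GB-aligned anonymous region (len should be a multiple of
 * 1GB) and hint it for huge pages. */
static void *map_thp_region(size_t len)
{
        /* over-map by 1GB so an aligned chunk can be carved out */
        char *raw = mmap(NULL, len + GB, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (raw == MAP_FAILED)
                return NULL;

        char *aligned = (char *)(((uintptr_t)raw + GB - 1) & ~(GB - 1));
        if (aligned > raw)                      /* trim the unaligned head */
                munmap(raw, aligned - raw);
        munmap(aligned + len, raw + GB - aligned);      /* trim the tail */

        /* with the series, a 1GB page size option would be passed here */
        madvise(aligned, len, MADV_HUGEPAGE);
        return aligned;
}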


—
Best Regards,
Yan Zi



* Re: [RFC PATCH v3 00/49] 1GB PUD THP support on x86_64
  2021-02-24 22:35 [RFC PATCH v3 00/49] 1GB PUD THP support on x86_64 Zi Yan
  2021-02-25 11:02 ` David Hildenbrand
@ 2021-03-02  1:59 ` Roman Gushchin
  2021-03-04 16:26   ` Zi Yan
  1 sibling, 1 reply; 15+ messages in thread
From: Roman Gushchin @ 2021-03-02  1:59 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, Matthew Wilcox, Kirill A . Shutemov, Andrew Morton,
	Yang Shi, Michal Hocko, John Hubbard, Ralph Campbell,
	David Nellans, Jason Gunthorpe, David Rientjes, Vlastimil Babka,
	David Hildenbrand, Mike Kravetz, Song Liu

On Wed, Feb 24, 2021 at 05:35:36PM -0500, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> Hi all,
> 
> I have rebased my 1GB PUD THP support patches on v5.11-mmotm-2021-02-18-18-29
> and the code is available at
> https://github.com/x-y-z/linux-1gb-thp/tree/1gb_thp_v5.11-mmotm-2021-02-18-18-29
> if you want to give it a try. The actual 49 patches are not sent out with this
> cover letter. :)
> 
> Instead of asking for code review, I would like to discuss the concerns raised
> in previous RFCs. I think there are two major ones:
> 
> 1. 1GB page allocation. The current implementation allocates 1GB pages from CMA
>    regions that are reserved at boot time, like hugetlbfs. The concern with
>    using CMA is that an educated guess is needed to avoid depleting kernel
>    memory in case the CMA regions are set too large. Recently David Rientjes
>    proposed using process_madvise() for hugepage collapse, which is an
>    alternative [1] but might not work for 1GB pages, since there is no way of
>    _allocating_ a 1GB page into which to collapse pages. I proposed a similar
>    approach at LSF/MM 2019, generating physically contiguous memory after pages
>    are allocated [2], which is usable for 1GB THPs. This approach does in-place
>    huge page promotion and thus does not require page allocation.

Well, I don't think there is an alternative to CMA right now. Once the memory has
been almost completely filled at least once, any subsequent activity leading to
substantial slab allocations (e.g. running git gc) will fragment the memory, so
that there is virtually no chance of finding a contiguous GB.

It's possible in theory to reduce fragmentation on the 1GB scale by grouping
non-movable pageblocks, but that seems like a separate project.

Thanks!

> [...]



* Re: [RFC PATCH v3 00/49] 1GB PUD THP support on x86_64
  2021-02-25 22:13   ` Zi Yan
@ 2021-03-02  8:55     ` David Hildenbrand
  2021-03-03 23:42       ` Zi Yan
  0 siblings, 1 reply; 15+ messages in thread
From: David Hildenbrand @ 2021-03-02  8:55 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, Matthew Wilcox, Kirill A . Shutemov, Roman Gushchin,
	Andrew Morton, Yang Shi, Michal Hocko, John Hubbard,
	Ralph Campbell, David Nellans, Jason Gunthorpe, David Rientjes,
	Vlastimil Babka, Mike Kravetz, Song Liu

>>
>> However, I don't follow how this is actually really feasible at a large scale. You could only ever collapse into a 1GB THP if you happen to have 1GB of consecutive 2MB THPs / 4k pages already. Sounds to me like this happens when the stars align.
> 
> Both the process_madvise() approach and my proposal require page migration to bring back THPs, since, like you said, having consecutive pages ready is extremely rare. IIUC, the process_madvise() approach reuses khugepaged code to collapse huge pages,
> namely first allocating a 2MB THP, then copying data over, and finally freeing the old base pages. My proposal would migrate pages within
> a virtual address range (>1GB and 1GB-aligned) to get all physical pages contiguous, then promote the resulting 1GB of consecutive
> pages to a 1GB THP. No new page allocation is needed.

I am missing how we can ever reliably form 1GB pages (esp. after the
system has run for a while) without any kind of fragmentation avoidance /
defragmentation mechanism that is aware of gigantic pages. For THP,
pageblocks+compaction serve that purpose.

> 
> Both approaches would need user-space invocation, assuming either the application itself wants to get THPs for a specific region or a user-space daemon would do this for a group of applications, instead of waiting for khugepaged to slowly (4096 pages every 10s) scan and do huge page collapse. Users will pay the cost of getting THPs. This also means THPs are not completely transparent to users, but I think it should be fine when users explicitly invoke these two methods to get THPs for better performance.

Here is the problem: these *hints* are not persistent. Assume your
system has to swap and has to split the THP + write it to the swap
backend. The gigantic page is lost for that part of the application.
When loading the individual 4k pages out of swap there is no guarantee
that we can form a 1 GB page again - and how should we know that the
application wanted a 1 GB page at that position?

How would the application know that the hint was dropped and that
a) there is no 1GB page anymore, and
b) it would have to re-issue the hint?

Similarly, I am not convinced that the future of khugepaged is in user 
space.

> 
> The difference in my proposal is that it does not need a 1GB THP allocation, so there are no special requirements like using CMA
> or increasing MAX_ORDER in the buddy allocator to allow 1GB page allocations. It makes creating THPs with orders > MAX_ORDER possible
> without other intrusive changes.

Anything that relies on large allocations succeeding purely because
"ZONE_NORMAL memory is usually not fragmented after boot" is broken by
design. That's why we have CMA: it can give guarantees (well, once we
fix all remaining issues :) ).

-- 
Thanks,

David / dhildenb




* Re: [RFC PATCH v3 00/49] 1GB PUD THP support on x86_64
  2021-03-02  8:55     ` David Hildenbrand
@ 2021-03-03 23:42       ` Zi Yan
  2021-03-04  9:26         ` David Hildenbrand
  0 siblings, 1 reply; 15+ messages in thread
From: Zi Yan @ 2021-03-03 23:42 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, Matthew Wilcox, Kirill A . Shutemov, Roman Gushchin,
	Andrew Morton, Yang Shi, Michal Hocko, John Hubbard,
	Ralph Campbell, David Nellans, Jason Gunthorpe, David Rientjes,
	Vlastimil Babka, Mike Kravetz, Song Liu


On 2 Mar 2021, at 3:55, David Hildenbrand wrote:

>>> [...]
>>
>> Both the process_madvise() approach and my proposal require page migration to bring back THPs, since, like you said, having consecutive pages ready is extremely rare. IIUC, the process_madvise() approach reuses khugepaged code to collapse huge pages,
>> namely first allocating a 2MB THP, then copying data over, and finally freeing the old base pages. My proposal would migrate pages within
>> a virtual address range (>1GB and 1GB-aligned) to get all physical pages contiguous, then promote the resulting 1GB of consecutive
>> pages to a 1GB THP. No new page allocation is needed.
>
> I am missing how we can ever reliably form 1GB pages (esp. after the system has run for a while) without any kind of fragmentation avoidance / defragmentation mechanism that is aware of gigantic pages. For THP, pageblocks+compaction serve that purpose.

We may not have anything as reliable as pageblocks+compaction for THP, but we can improve over the existing code after 1GB THP
is supported and used. Otherwise, why bother adding a new mechanism when it has no users?

I did an experiment on my 32GB desktop, as Roman suggested in another email, using as much memory as possible and running
“git gc” on a Linux repo at the same time to fragment memory. I repeated the process three times with three different Linux repos.
I checked all pageblock types with my custom kernel module (https://github.com/x-y-z/kernel-modules) and discovered that
the system still had 11 1GB Movable regions (consecutive pageblocks with the same migratetype, grouped as large as
possible). This means that after heavy memory fragmentation the system is still able to form 11 1GB THPs, which is >30% of the total
possible 1GB THPs. I think that is a reasonably good number, since we are not going to form 1GB THPs for everything running
in the system.

>>
>> Both approaches would need user-space invocation, assuming either the application itself wants to get THPs for a specific region or a user-space daemon would do this for a group of applications, instead of waiting for khugepaged to slowly (4096 pages every 10s) scan and do huge page collapse. Users will pay the cost of getting THPs. This also means THPs are not completely transparent to users, but I think it should be fine when users explicitly invoke these two methods to get THPs for better performance.
>
> Here is the problem: these *hints* are not persistent. Assume your system has to swap and has to split the THP + write it to the swap backend. The gigantic page is lost for that part of the application. When loading the individual 4k pages out of swap there is no guarantee that we can form a 1 GB page again - and how should we know that the application wanted a 1 GB page at that position?

VM_HUGEPAGE will be set for that VMA, and I am planning to add a new field to the VMA to indicate what huge page size we want in
that VMA. As for splitting a 1GB THP due to swapping, that happens to 2MB THPs too. Either khugepaged or a user daemon calling
process_madvise() could recover the 1GB THP.

>
> How would the application know that the hint was dropped and that
> a) there is no 1GB page anymore, and
> b) it would have to re-issue the hint?

I expect a daemon, either khugepaged or a user-space one calling process_madvise(), would rescan the application and re-form 1GB pages.

>
> Similarly, I am not convinced that the future of khugepaged is in user space.

The issue with khugepaged is that it runs at a very slow rate, 4096 pages every 10s, because the kernel does not want to consume
too many CPU resources without knowing the benefit of forming THPs. A user daemon could run at a faster pace to form 2MB or
1GB THPs from the application memory regions where users really want huge pages.
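
As a rough sketch, the inner loop of such a daemon could look like the code
below. This is hypothetical: as of v5.11 process_madvise() only accepts
MADV_COLD and MADV_PAGEOUT, so passing a huge page hint through it assumes
the process_madvise()-based collapse extension discussed above:

#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * One pass of a hypothetical THP-promotion daemon: ask the kernel to
 * collapse/promote huge pages in a target process's hot region. pidfd
 * identifies the target process (see pidfd_open(2)); advice would be a
 * huge page hint such as MADV_HUGEPAGE once (and if) process_madvise()
 * learns to accept it. Needs Linux 5.10+ headers for SYS_process_madvise.
 */
static long promote_region(int pidfd, void *addr, size_t len, int advice)
{
        struct iovec iov = { .iov_base = addr, .iov_len = len };

        return syscall(SYS_process_madvise, pidfd, &iov, 1, advice, 0);
}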

>
>>
>> The difference in my proposal is that it does not need a 1GB THP allocation, so there are no special requirements like using CMA
>> or increasing MAX_ORDER in the buddy allocator to allow 1GB page allocations. It makes creating THPs with orders > MAX_ORDER possible
>> without other intrusive changes.
>
> Anything that relies on large allocations succeeding purely because "ZONE_NORMAL memory is usually not fragmented after boot" is broken by design. That's why we have CMA: it can give guarantees (well, once we fix all remaining issues :) ).

It seems that you are suggesting I should use CMA for 1GB THP allocation, since CMA can give guarantees for large allocations.
Using CMA for 1GB THPs would be a great first step to get them working; then we can replace it with other large allocation
mechanisms later.


—
Best Regards,
Yan Zi



* Re: [RFC PATCH v3 00/49] 1GB PUD THP support on x86_64
  2021-03-03 23:42       ` Zi Yan
@ 2021-03-04  9:26         ` David Hildenbrand
  0 siblings, 0 replies; 15+ messages in thread
From: David Hildenbrand @ 2021-03-04  9:26 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, Matthew Wilcox, Kirill A . Shutemov, Roman Gushchin,
	Andrew Morton, Yang Shi, Michal Hocko, John Hubbard,
	Ralph Campbell, David Nellans, Jason Gunthorpe, David Rientjes,
	Vlastimil Babka, Mike Kravetz, Song Liu

On 04.03.21 00:42, Zi Yan wrote:
> On 2 Mar 2021, at 3:55, David Hildenbrand wrote:
> 
>>> [...]
>>
>> I am missing how we can ever reliably form 1GB pages (esp. after the system has run for a while) without any kind of fragmentation avoidance / defragmentation mechanism that is aware of gigantic pages. For THP, pageblocks+compaction serve that purpose.
> 
> We may not have anything as reliable as pageblocks+compaction for THP, but we can improve over the existing code after 1GB THP
> is supported and used. Otherwise, why bother adding a new mechanism when it has no users?
>
> I did an experiment on my 32GB desktop, as Roman suggested in another email, using as much memory as possible and running
> “git gc” on a Linux repo at the same time to fragment memory. I repeated the process three times with three different Linux repos.
> I checked all pageblock types with my custom kernel module (https://github.com/x-y-z/kernel-modules) and discovered that
> the system still had 11 1GB Movable regions (consecutive pageblocks with the same migratetype, grouped as large as
> possible). This means that after heavy memory fragmentation the system is still able to form 11 1GB THPs, which is >30% of the total
> possible 1GB THPs. I think that is a reasonably good number, since we are not going to form 1GB THPs for everything running
> in the system.
> 

I'm sorry, but I don't think this is a relevant reproducer for 
fragmentation with unmovable allocations.

I feel like repeating myself: Anything that relies on large allocations 
succeeding purely because "ZONE_NORMAL memory is usually not fragmented 
after boot" is broken by design.

If your approach does not include any such mechanism, it's broken by design
and only works in some very limited setups / under very limited
conditions. We don't want anything like that when it severely affects
the code ("49 patches").

>>> [...]
>>
>> Here is the problem: these *hints* are not persistent. Assume your system has to swap and has to split the THP + write it to the swap backend. The gigantic page is lost for that part of the application. When loading the individual 4k pages out of swap there is no guarantee that we can form a 1 GB page again - and how should we know that the application wanted a 1 GB page at that position?
> 
> VM_HUGEPAGE will be set for that VMA, and I am planning to add a new field to the VMA to indicate what huge page size we want in
> that VMA. As for splitting a 1GB THP due to swapping, that happens to 2MB THPs too. Either khugepaged or a user daemon calling
> process_madvise() could recover the 1GB THP.
> 

Sorry, but for any kind of hint like "please collapse this into a 1GB
page", splitting VMAs does not make any sense. Then you can just let
the application mmap(MAP_HUGE ...) that part instead - you also get a
separate VMA and need the mmap lock in write mode.
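
For comparison, that explicit route already exists today via hugetlbfs. A
minimal example, assuming the administrator has reserved 1GB hugetlb pages
and with the MAP_HUGE_1GB encoding defined as a fallback for older headers:

#define _GNU_SOURCE
#include <sys/mman.h>

#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30 << 26)  /* 30 == log2(1GB), 26 == MAP_HUGE_SHIFT */
#endif

/* Explicitly map one 1GB hugetlb page; fails if none are reserved. */
static void *map_one_gigantic_page(void)
{
        void *p = mmap(NULL, 1UL << 30, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                       -1, 0);

        return p == MAP_FAILED ? NULL : p;
}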

Ordinary THP can be recovered quite well because *we have actual 
mechanisms in place that try to form contiguous 2MB (->pageblock) chunks*.

>>
>> How would the application know that the hint was dropped and that
>> a) there is no 1GB page anymore, and
>> b) it would have to re-issue the hint?
> 
> I expect a daemon, either khugepaged or a user-space one calling process_madvise(), would rescan the application and re-form 1GB pages.
> 

From user space? How should it know whether that application has
hugepages enabled/disabled for some regions? How should it know if we
have to special-case uffd?

I repeat: I am not convinced that the future of khugepaged is in user
space. Some minor hints from the application itself might be valuable -
"please collapse this into a THP if possible" - but not more than that,
IMHO, and not across applications.

>>
>> Similarly, I am not convinced that the future of khugepaged is in user space.
> 
> The issue with khugepaged is that it runs at a very slow rate, 4096 pages every 10s, because the kernel does not want to consume
> too many CPU resources without knowing the benefit of forming THPs. A user daemon could run at a faster pace to form 2MB or
> 1GB THPs from the application memory regions where users really want huge pages.
> 

Not sure we really want a daemon. You could just kick khugepaged instead
- for example, to run on a specific process. I think if - at all - it
should be the application that gives additional hints. But that is a
different discussion from 1 GB THP.

>>
>>> [...]
>>
>> Anything that relies on large allocations succeeding purely because "ZONE_NORMAL memory is usually not fragmented after boot" is broken by design. That's why we have CMA: it can give guarantees (well, once we fix all remaining issues :) ).
> 
> It seems that you are suggesting I should use CMA for 1GB THP allocation, since CMA can give guarantees for large allocations.
> Using CMA for 1GB THPs would be a great first step to get them working; then we can replace it with other large allocation
> mechanisms later.

No, as already expressed multiple times, I don't think this is the right 
thing to do.

-- 
Thanks,

David / dhildenb




* Re: [RFC PATCH v3 00/49] 1GB PUD THP support on x86_64
  2021-03-02  1:59 ` Roman Gushchin
@ 2021-03-04 16:26   ` Zi Yan
  2021-03-04 16:45     ` Roman Gushchin
  0 siblings, 1 reply; 15+ messages in thread
From: Zi Yan @ 2021-03-04 16:26 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Matthew Wilcox, Kirill A . Shutemov, Andrew Morton,
	Yang Shi, Michal Hocko, John Hubbard, Ralph Campbell,
	David Nellans, Jason Gunthorpe, David Rientjes, Vlastimil Babka,
	David Hildenbrand, Mike Kravetz, Song Liu


On 1 Mar 2021, at 20:59, Roman Gushchin wrote:

> On Wed, Feb 24, 2021 at 05:35:36PM -0500, Zi Yan wrote:
>> [...]
>
> Well, I don't think there is an alternative to CMA right now. Once the memory has
> been almost completely filled at least once, any subsequent activity leading to
> substantial slab allocations (e.g. running git gc) will fragment the memory, so
> that there is virtually no chance of finding a contiguous GB.
>
> It's possible in theory to reduce fragmentation on the 1GB scale by grouping
> non-movable pageblocks, but that seems like a separate project.

My experiments showed that finding contiguous GBs is possible, but I agree that
CMA is more reliable and that 1GB-scale defragmentation should be a separate project.


—
Best Regards,
Yan Zi



* Re: [RFC PATCH v3 00/49] 1GB PUD THP support on x86_64
  2021-03-04 16:26   ` Zi Yan
@ 2021-03-04 16:45     ` Roman Gushchin
  2021-03-30 17:24       ` Zi Yan
  0 siblings, 1 reply; 15+ messages in thread
From: Roman Gushchin @ 2021-03-04 16:45 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, Matthew Wilcox, Kirill A . Shutemov, Andrew Morton,
	Yang Shi, Michal Hocko, John Hubbard, Ralph Campbell,
	David Nellans, Jason Gunthorpe, David Rientjes, Vlastimil Babka,
	David Hildenbrand, Mike Kravetz, Song Liu

On Thu, Mar 04, 2021 at 11:26:03AM -0500, Zi Yan wrote:
> On 1 Mar 2021, at 20:59, Roman Gushchin wrote:
> 
> > On Wed, Feb 24, 2021 at 05:35:36PM -0500, Zi Yan wrote:
> >> [...]
> >
> > Well, I don't think there is an alternative to CMA right now. Once the memory has
> > been almost completely filled at least once, any subsequent activity leading to
> > substantial slab allocations (e.g. running git gc) will fragment the memory, so
> > that there is virtually no chance of finding a contiguous GB.
> >
> > It's possible in theory to reduce fragmentation on the 1GB scale by grouping
> > non-movable pageblocks, but that seems like a separate project.
> 
> My experiments showed that finding contiguous GBs is possible, but I agree that
> CMA is more reliable and that 1GB-scale defragmentation should be a separate project.

I actually ran a large-scale experiment (on tens of thousands of machines) over the last
several months. It was about hugetlbfs 1GB pages, but the allocation mechanism is the same.

My goal was to allocate a relatively small number of 1GB pages (<20% of the total memory).
Without CMA the chances reach 0% very fast after reboot, and even manual manipulations
like shutting down all workloads, dropping caches, calling sync, triggering compaction, etc.
do not help much. Sometimes you can allocate maybe 1-2 pages, but that's about it.

Even with CMA we had to fix a number of additional problems (like sub-optimal placement
of CMA areas, 2MB THP migration, and some ext4 and btrfs page migration issues) to get
a reasonable success rate of about 95-99%. And it's not 100% anyway.

The problem with artificial tests is that you're likely experimenting on a freshly
rebooted machine which isn't/wasn't doing much. It's a bad model of the real memory
state of a production server.
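
For reference, the hugetlbfs side of such a setup uses the existing
hugetlb_cma= boot parameter; the runtime half is just raising nr_hugepages,
e.g. via this minimal helper (the sysfs path is the x86_64 1GB one; counts
and sizes are only examples, error handling minimal):

#include <stdio.h>

/*
 * After booting with hugetlb_cma=<size> on the kernel command line,
 * 1GB hugetlb pages are allocated from that CMA area by raising
 * nr_hugepages for the 1GB hugepage size.
 */
static int set_gigantic_pages(int nr)
{
        FILE *f = fopen("/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages", "w");

        if (!f)
                return -1;
        fprintf(f, "%d\n", nr);
        return fclose(f);
}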



* Re: [RFC PATCH v3 00/49] 1GB PUD THP support on x86_64
  2021-03-04 16:45     ` Roman Gushchin
@ 2021-03-30 17:24       ` Zi Yan
  2021-03-30 18:02         ` Roman Gushchin
  0 siblings, 1 reply; 15+ messages in thread
From: Zi Yan @ 2021-03-30 17:24 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Matthew Wilcox, Kirill A . Shutemov, Andrew Morton,
	Yang Shi, Michal Hocko, John Hubbard, Ralph Campbell,
	David Nellans, Jason Gunthorpe, David Rientjes, Vlastimil Babka,
	David Hildenbrand, Mike Kravetz, Song Liu


Hi Roman,


On 4 Mar 2021, at 11:45, Roman Gushchin wrote:

> On Thu, Mar 04, 2021 at 11:26:03AM -0500, Zi Yan wrote:
>> On 1 Mar 2021, at 20:59, Roman Gushchin wrote:
>>
>>> [...]
>>
>> My experiments showed that finding contiguous GBs is possible, but I agree that
>> CMA is more reliable and that 1GB-scale defragmentation should be a separate project.
>
> I actually ran a large-scale experiment (on tens of thousands of machines) over the last
> several months. It was about hugetlbfs 1GB pages, but the allocation mechanism is the same.

Thanks for the information. I finally have time to come back to this. Do you mind sharing
the total memory of these machines? I want to have some idea of the scale of this issue to
make sure I reproduce it on a proper machine. Are you trying to get <20% of 10s of GBs,
100s of GBs, or TBs of memory?

>
> My goal was to allocate a relatively small number of 1GB pages (<20% of the total memory).
> Without CMA the chances reach 0% very fast after reboot, and even manual manipulations
> like shutting down all workloads, dropping caches, calling sync, triggering compaction, etc.
> do not help much. Sometimes you can allocate maybe 1-2 pages, but that's about it.

Is there a way of replicating such an environment with publicly available software?
I really want to understand the root cause and am willing to find a possible solution.
It would be much easier if I could reproduce this locally.

>
> Even with CMA we had to fix a number of additional problems (like sub-optimal placement
> of CMA areas, 2MB THP migration, and some ext4 and btrfs page migration issues) to get
> a reasonable success rate of about 95-99%. And it's not 100% anyway.
>
> The problem with artificial tests is that you're likely experimenting on a freshly
> rebooted machine which isn't/wasn't doing much. It's a bad model of the real memory
> state of a production server.

Yes, I agree that my experiment is not representative. Can you provide more information
on what application behavior(s) lead to this memory fragmentation? I guess it is
because non-movable pages are spread across the entire physical memory space. Is there
a quick reproducer for that?

Thanks.


—
Best Regards,
Yan Zi



* Re: [RFC PATCH v3 00/49] 1GB PUD THP support on x86_64
  2021-03-30 17:24       ` Zi Yan
@ 2021-03-30 18:02         ` Roman Gushchin
  2021-03-31  2:04           ` Zi Yan
  2021-03-31  3:09           ` Matthew Wilcox
  0 siblings, 2 replies; 15+ messages in thread
From: Roman Gushchin @ 2021-03-30 18:02 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, Matthew Wilcox, Kirill A . Shutemov, Andrew Morton,
	Yang Shi, Michal Hocko, John Hubbard, Ralph Campbell,
	David Nellans, Jason Gunthorpe, David Rientjes, Vlastimil Babka,
	David Hildenbrand, Mike Kravetz, Song Liu

On Tue, Mar 30, 2021 at 01:24:14PM -0400, Zi Yan wrote:
> Hi Roman,
> 
> 
> On 4 Mar 2021, at 11:45, Roman Gushchin wrote:
> 
> > On Thu, Mar 04, 2021 at 11:26:03AM -0500, Zi Yan wrote:
> >> [...]
> >
> > I actually ran a large-scale experiment (on tens of thousands of machines) over the last
> > several months. It was about hugetlbfs 1GB pages, but the allocation mechanism is the same.
> 
> Thanks for the information. I finally have time to come back to this. Do you mind sharing
> the total memory of these machines? I want to have some idea of the scale of this issue to
> make sure I reproduce it on a proper machine. Are you trying to get <20% of 10s of GBs,
> 100s of GBs, or TBs of memory?

There are different configurations, but in general they are in the 100s of GBs or smaller.

> 
> >
> > My goal was to allocate a relatively small number of 1GB pages (<20% of the total memory).
> > Without CMA the chances reach 0% very fast after reboot, and even manual manipulations
> > like shutting down all workloads, dropping caches, calling sync, compaction, etc. do not
> > help much. Sometimes you can allocate maybe 1-2 pages, but that's about it.
> 
> Is there a way of replicating such an environment with publicly available software?
> I really want to understand the root cause and am willing to find a possible solution.
> It would be much easier if I could reproduce this locally.

There is nothing fb-specific: once the memory is filled with anon/pagecache, any subsequent
allocations of non-movable memory (slabs, percpu, etc.) will fragment the memory. There
is a pageblock mechanism which prevents fragmentation on the 2MB scale, but nothing prevents
fragmentation on the 1GB scale. It's just a matter of runtime (and the number of mm operations).

> 
> >
> > Even with CMA we had to fix a number of additional problems (like sub-optimal placement
> > of CMA areas, 2MB THP migration, some ext4 and btrfs page migration issues) to get
> > a reasonable success rate of about 95-99%. And it's not 100% anyway.
> >
> > The problem with artificial tests is that you're likely experimenting on a freshly
> > rebooted machine which isn't/wasn't doing much. It's a bad model of the real memory
> > state of a production server.
> 
> Yes, I agree that my experiment is not representative. Can you provide more information
> on what application behavior(s) lead to this memory fragmentation? I guess it is
> because non-movable pages are spread across the entire physical memory space. Is there
> a quick reproducer for that?

I have a simple C program which is able to fragment the memory; you can play with it:
https://github.com/rgushchin/fragm

But as I said, basically any load which is actively using the whole memory
will fragment it.
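
For illustration, a crude fragmenter along these lines can work (a sketch only, not
the program linked above; the sizes and the pipe trick for forcing non-movable
kernel allocations are made up):

/*
 * Fill most of RAM with anonymous pages, then punch 2MB holes in a
 * strided pattern while leaking pipes, so that non-movable kernel
 * allocations land scattered across the freed pageblocks.
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>

#define CHUNK (2UL << 20)               /* 2MB: one pageblock on x86_64 */

int main(void)
{
        size_t total = 8UL << 30;       /* adjust to most of your RAM */
        char *mem = mmap(NULL, total, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (mem == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        /* Touch every page so the kernel actually allocates them. */
        for (size_t off = 0; off < total; off += 4096)
                mem[off] = 1;

        for (size_t off = 0; off < total; off += 2 * CHUNK) {
                int fds[2];

                /* Free every other pageblock... */
                madvise(mem + off, CHUNK, MADV_DONTNEED);
                /* ...and pin some kernel memory (fd table entries,
                 * pipe buffers); fails harmlessly past the fd limit. */
                (void)!pipe(fds);
        }

        pause();                        /* hold the remaining pages */
        return 0;
}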

Thanks!


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v3 00/49] 1GB PUD THP support on x86_64
  2021-03-30 18:02         ` Roman Gushchin
@ 2021-03-31  2:04           ` Zi Yan
  2021-03-31  3:09           ` Matthew Wilcox
  1 sibling, 0 replies; 15+ messages in thread
From: Zi Yan @ 2021-03-31  2:04 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Matthew Wilcox, Kirill A . Shutemov, Andrew Morton,
	Yang Shi, Michal Hocko, John Hubbard, Ralph Campbell,
	David Nellans, Jason Gunthorpe, David Rientjes, Vlastimil Babka,
	David Hildenbrand, Mike Kravetz, Song Liu

[-- Attachment #1: Type: text/plain, Size: 5952 bytes --]

On 30 Mar 2021, at 14:02, Roman Gushchin wrote:

> On Tue, Mar 30, 2021 at 01:24:14PM -0400, Zi Yan wrote:
>> Hi Roman,
>>
>>
>> On 4 Mar 2021, at 11:45, Roman Gushchin wrote:
>>
>>> On Thu, Mar 04, 2021 at 11:26:03AM -0500, Zi Yan wrote:
>>>> On 1 Mar 2021, at 20:59, Roman Gushchin wrote:
>>>>
>>>>> On Wed, Feb 24, 2021 at 05:35:36PM -0500, Zi Yan wrote:
>>>>>> From: Zi Yan <ziy@nvidia.com>
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I have rebased my 1GB PUD THP support patches on v5.11-mmotm-2021-02-18-18-29
>>>>>> and the code is available at
>>>>>> https://github.com/x-y-z/linux-1gb-thp/tree/1gb_thp_v5.11-mmotm-2021-02-18-18-29
>>>>>> if you want to give it a try. The actual 49 patches are not sent out with this
>>>>>> cover letter. :)
>>>>>>
>>>>>> Instead of asking for code review, I would like to discuss on the concerns I got
>>>>>> from previous RFCs. I think there are two major ones:
>>>>>>
>>>>>> 1. 1GB page allocation. Current implementation allocates 1GB pages from CMA
>>>>>>    regions that are reserved at boot time like hugetlbfs. The concerns on
>>>>>>    using CMA is that an educated guess is needed to avoid depleting kernel
>>>>>>    memory in case CMA regions are set too large. Recently David Rientjes
>>>>>>    proposes to use process_madvise() for hugepage collapse, which is an
>>>>>>    alternative [1] but might not work for 1GB pages, since there is no way of
>>>>>>    _allocating_ a 1GB page to which collapse pages. I proposed a similar
>>>>>>    approach at LSF/MM 2019, generating physically contiguous memory after pages
>>>>>>    are allocated [2], which is usable for 1GB THPs. This approach does in-place
>>>>>>    huge page promotion thus does not require page allocation.
>>>>>
>>>>> Well, I don't think there is an alternative to CMA as of now. Once the memory has
>>>>> been almost filled at least once, any subsequent activity leading to substantial slab
>>>>> allocations (e.g. running git gc) will fragment the memory, so that there is
>>>>> virtually no chance of finding a contiguous GB.
>>>>>
>>>>> It's possible in theory to reduce the fragmentation on the 1GB scale by grouping
>>>>> non-movable pageblocks, but that seems like a separate project.
>>>>
>>>> My experiments showed that finding contiguous GBs is possible, but I agree that
>>>> CMA is more reliable and 1GB-scale defragmentation should be a separate project.
>>>
>>> I actually ran a large-scale experiment (on tens of thousands of machines) over the last
>>> several months. It was about hugetlbfs 1GB pages, but the allocation mechanism is the same.
>>
>> Thanks for the information. I finally have time to come back to this. Do you mind sharing
>> the total memory of these machines? I want to have some idea of the scale of this issue to
>> make sure I reproduce it on a proper machine. Are you trying to get <20% of 10s of GBs,
>> 100s of GBs, or TBs of memory?
>
> There are different configurations, but in general they are in the 100s of GBs or smaller.
>
>>
>>>
>>> My goal was to allocate a relatively small number of 1GB pages (<20% of the total memory).
>>> Without CMA the chances reach 0% very fast after reboot, and even manual manipulations
>>> like shutting down all workloads, dropping caches, calling sync, compaction, etc. do not
>>> help much. Sometimes you can allocate maybe 1-2 pages, but that's about it.
>>
>> Is there a way of replicating such an environment with publicly available software?
>> I really want to understand the root cause and am willing to find a possible solution.
>> It would be much easier if I could reproduce this locally.
>
> There is nothing fb-specific: once the memory is filled with anon/pagecache, any subsequent
> allocations of non-movable memory (slabs, percpu, etc.) will fragment the memory. There
> is a pageblock mechanism which prevents fragmentation on the 2MB scale, but nothing prevents
> fragmentation on the 1GB scale. It's just a matter of runtime (and the number of mm operations).
>
>>
>>>
>>> Even with CMA we had to fix a number of additional problems (like sub-optimal placement
>>> of CMA areas, 2MB THP migration, some ext4 and btrfs page migration issues) to get
>>> a reasonable success rate of about 95-99%. And it's not 100% anyway.
>>>
>>> The problem with artificial tests is that you're likely experimenting on a freshly
>>> rebooted machine which isn't/wasn't doing much. It's a bad model of the real memory
>>> state of a production server.
>>
>> Yes, I agree that my experiment is not representative. Can you provide more information
>> on what application behavior(s) lead to this memory fragmentation? I guess it is
>> because non-movable pages are spread across the entire physical memory space. Is there
>> a quick reproducer for that?
>
> I have a simple C program which is able to fragment the memory; you can play with it:
> https://github.com/rgushchin/fragm
>
> But as I said, basically any load which is actively using the whole memory
> will fragment it.

With your simple program, I am able to fragment the memory to the point where
it is impossible to allocate/generate 1GB pages. I will look into this.

Thanks.

—
Best Regards,
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v3 00/49] 1GB PUD THP support on x86_64
  2021-03-30 18:02         ` Roman Gushchin
  2021-03-31  2:04           ` Zi Yan
@ 2021-03-31  3:09           ` Matthew Wilcox
  2021-03-31  3:32             ` Roman Gushchin
  1 sibling, 1 reply; 15+ messages in thread
From: Matthew Wilcox @ 2021-03-31  3:09 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Zi Yan, linux-mm, Kirill A . Shutemov, Andrew Morton, Yang Shi,
	Michal Hocko, John Hubbard, Ralph Campbell, David Nellans,
	Jason Gunthorpe, David Rientjes, Vlastimil Babka,
	David Hildenbrand, Mike Kravetz, Song Liu

On Tue, Mar 30, 2021 at 11:02:07AM -0700, Roman Gushchin wrote:
> On Tue, Mar 30, 2021 at 01:24:14PM -0400, Zi Yan wrote:
> > On 4 Mar 2021, at 11:45, Roman Gushchin wrote:
> > > I actually ran a large-scale experiment (on tens of thousands of machines) over the last
> > > several months. It was about hugetlbfs 1GB pages, but the allocation mechanism is the same.
> > 
> > Thanks for the information. I finally have time to come back to this. Do you mind sharing
> > the total memory of these machines? I want to have some idea of the scale of this issue to
> > make sure I reproduce it on a proper machine. Are you trying to get <20% of 10s of GBs,
> > 100s of GBs, or TBs of memory?
> 
> There are different configurations, but in general they are in the 100s of GBs or smaller.

Are you using ZONE_MOVABLE?  Seeing /proc/buddyinfo from one of these
machines might be illuminating.

> > 
> > >
> > > My goal was to allocate a relatively small number of 1GB pages (<20% of the total memory).
> > > Without CMA the chances reach 0% very fast after reboot, and even manual manipulations
> > > like shutting down all workloads, dropping caches, calling sync, compaction, etc. do not
> > > help much. Sometimes you can allocate maybe 1-2 pages, but that's about it.
> > 
> > Is there a way of replicating such an environment with publicly available software?
> > I really want to understand the root cause and am willing to find a possible solution.
> > It would be much easier if I could reproduce this locally.
> 
> There is nothing fb-specific: once the memory is filled with anon/pagecache, any subsequent
> allocations of non-movable memory (slabs, percpu, etc.) will fragment the memory. There
> is a pageblock mechanism which prevents fragmentation on the 2MB scale, but nothing prevents
> fragmentation on the 1GB scale. It's just a matter of runtime (and the number of mm operations).

I think this is an area where the buddy allocator could be improved.
Of course, it knows nothing of larger page orders (which needs to be
fixed), but in general, I would like it to do a better job of segregating
movable and unmovable allocations.

Let's take a machine with 100GB of memory as an example.  Ideally,
unmovable allocations would start at 4GB (assuming below 4GB is
ZONE_DMA32).  Movable allocations can allocate anywhere in memory, but
should avoid being "near" unmovable allocations.  Perhaps they start
at 5GB.  When unmovable allocations get up to 5GB, we should first exert
a bit of pressure to shrink the unmovable allocations (looking at you,
dcache), but eventually we'll need to grow the unmovable allocations
above 5GB and we should move, say, all the pages between 5GB and 5GB+1MB.
If this unmovable allocation was just temporary, we get a reassembled
1MB page.  If it was permanent, we now have 1MB of memory to soak up
the next few allocations.

The model I'm thinking of here is that we have a "line" in memory that
divides movable and unmovable allocations.  It can move up, but there
has to be significant memory pressure to do so.
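
A toy model of that policy (a user-space sketch; the names, the 1MB step from the
example above, and the thresholds are all hypothetical, not kernel code):

/* Movable/unmovable "line": unmovable allocations live below it, and
 * the line only advances under pressure, after first trying to shrink
 * unmovable users and then evacuating the movable pages in its way. */
#include <stdbool.h>
#include <stdio.h>

#define GB   (1UL << 30)
#define STEP (1UL << 20)                        /* advance 1MB at a time */

static unsigned long line = 5 * GB;             /* movable above, unmovable below */
static unsigned long used = 5 * GB - 2 * STEP;  /* next free unmovable byte */

static bool shrink_unmovable(void)              /* dcache etc.; stubbed out here,
                                                 * a real shrink would lower 'used' */
{
        return false;
}

static void evacuate_movable(unsigned long start, unsigned long len)
{
        printf("migrating movable pages out of [0x%lx, 0x%lx)\n",
               start, start + len);
}

static unsigned long alloc_unmovable(unsigned long size)
{
        while (used + size > line) {
                if (shrink_unmovable())
                        continue;
                evacuate_movable(line, STEP);   /* push the line up */
                line += STEP;
        }
        used += size;
        return used - size;
}

int main(void)
{
        for (int i = 0; i < 4; i++)
                printf("unmovable allocation at 0x%lx\n",
                       alloc_unmovable(STEP));
        return 0;
}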



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v3 00/49] 1GB PUD THP support on x86_64
  2021-03-31  3:09           ` Matthew Wilcox
@ 2021-03-31  3:32             ` Roman Gushchin
  2021-03-31 14:48               ` Zi Yan
  0 siblings, 1 reply; 15+ messages in thread
From: Roman Gushchin @ 2021-03-31  3:32 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Zi Yan, linux-mm, Kirill A . Shutemov, Andrew Morton, Yang Shi,
	Michal Hocko, John Hubbard, Ralph Campbell, David Nellans,
	Jason Gunthorpe, David Rientjes, Vlastimil Babka,
	David Hildenbrand, Mike Kravetz, Song Liu

On Wed, Mar 31, 2021 at 04:09:35AM +0100, Matthew Wilcox wrote:
> On Tue, Mar 30, 2021 at 11:02:07AM -0700, Roman Gushchin wrote:
> > On Tue, Mar 30, 2021 at 01:24:14PM -0400, Zi Yan wrote:
> > > On 4 Mar 2021, at 11:45, Roman Gushchin wrote:
> > > > I actually ran a large-scale experiment (on tens of thousands of machines) over the last
> > > > several months. It was about hugetlbfs 1GB pages, but the allocation mechanism is the same.
> > > 
> > > Thanks for the information. I finally have time to come back to this. Do you mind sharing
> > > the total memory of these machines? I want to have some idea of the scale of this issue to
> > > make sure I reproduce it on a proper machine. Are you trying to get <20% of 10s of GBs,
> > > 100s of GBs, or TBs of memory?
> > 
> > There are different configurations, but in general they are in the 100s of GBs or smaller.
> 
> Are you using ZONE_MOVABLE?  Seeing /proc/buddyinfo from one of these
> machines might be illuminating.

No, I'm using pre-allocated CMA areas, and it works fine.
Buddyinfo stops at order 10, right?
How is it helpful with fragmentation on the 1GB scale?
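
For reference, a minimal sketch of parsing what buddyinfo does expose (with 4KB
pages, order 10 is only 4MB blocks, so it indeed says little about 1GB contiguity):

#include <stdio.h>

int main(void)
{
        FILE *f = fopen("/proc/buddyinfo", "r");
        char line[512];

        if (!f) {
                perror("/proc/buddyinfo");
                return 1;
        }
        while (fgets(line, sizeof(line), f)) {
                char node[32], zone[32];
                long c[11];
                int top = -1;

                /* "Node 0, zone   Normal  <free counts for orders 0..10>" */
                if (sscanf(line,
                           "Node %31[^,], zone %31s %ld %ld %ld %ld %ld %ld %ld %ld %ld %ld %ld",
                           node, zone, &c[0], &c[1], &c[2], &c[3], &c[4],
                           &c[5], &c[6], &c[7], &c[8], &c[9], &c[10]) != 13)
                        continue;
                for (int o = 10; o >= 0; o--)
                        if (c[o]) {
                                top = o;
                                break;
                        }
                printf("node %s zone %-8s largest free order: %d\n",
                       node, zone, top);
        }
        fclose(f);
        return 0;
}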

> 
> > > 
> > > >
> > > > My goal was to allocate a relatively small number of 1GB pages (<20% of the total memory).
> > > > Without CMA the chances reach 0% very fast after reboot, and even manual manipulations
> > > > like shutting down all workloads, dropping caches, calling sync, compaction, etc. do not
> > > > help much. Sometimes you can allocate maybe 1-2 pages, but that's about it.
> > > 
> > > Is there a way of replicating such an environment with publicly available software?
> > > I really want to understand the root cause and am willing to find a possible solution.
> > > It would be much easier if I could reproduce this locally.
> > 
> > There is nothing fb-specific: once the memory is filled with anon/pagecache, any subsequent
> > allocations of non-movable memory (slabs, percpu, etc.) will fragment the memory. There
> > is a pageblock mechanism which prevents fragmentation on the 2MB scale, but nothing prevents
> > fragmentation on the 1GB scale. It's just a matter of runtime (and the number of mm operations).
> 
> I think this is an area where the buddy allocator could be improved.
> Of course, it knows nothing of larger page orders (which needs to be
> fixed), but in general, I would like it to do a better job of segregating
> movable and unmovable allocations.
> 
> Let's take a machine with 100GB of memory as an example.  Ideally,
> unmovable allocations would start at 4GB (assuming below 4GB is
> ZONE_DMA32).  Movable allocations can allocate anywhere in memory, but
> should avoid being "near" unmovable allocations.  Perhaps they start
> at 5GB.  When unmovable allocations get up to 5GB, we should first exert
> a bit of pressure to shrink the unmovable allocations (looking at you,
> dcache), but eventually we'll need to grow the unmovable allocations
> above 5GB and we should move, say, all the pages between 5GB and 5GB+1MB.
> If this unmovable allocation was just temporary, we get a reassembled
> 1MB page.  If it was permanent, we now have 1MB of memory to soak up
> the next few allocations.
> 
> The model I'm thinking of here is that we have a "line" in memory that
> divides movable and unmovable allocations.  It can move up, but there
> has to be significant memory pressure to do so.
> 

I agree. My idea (which I need to find some time to try) was to hack the pageblock
code so that if we convert a block to non-movable, we convert the entire GB around
it. In this case, all unmovable memory will likely fit into a few 1GB chunks,
leaving all other chunks movable. But from a security point of view it would
be less desirable, I guess. What do you think about it?
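
A toy model of that conversion rule (a user-space sketch with made-up names, not
the actual pageblock code):

/* 2MB pageblocks, grouped into 1GB-aligned chunks: converting any one
 * block to unmovable converts the whole chunk it belongs to, so that
 * unmovable memory is concentrated in a few chunks. */
#include <stdio.h>

enum mtype { MOVABLE, UNMOVABLE };

#define PB_PER_GB 512                   /* 1GB / 2MB */
#define NR_PB     (4 * PB_PER_GB)       /* model 4GB of memory */

static enum mtype pb[NR_PB];            /* all MOVABLE initially */

static void convert_unmovable(int block)
{
        int start = (block / PB_PER_GB) * PB_PER_GB;

        for (int i = 0; i < PB_PER_GB; i++)
                pb[start + i] = UNMOVABLE;
}

int main(void)
{
        int dirty = 0;

        convert_unmovable(700);         /* one conversion, anywhere */

        for (int g = 0; g < NR_PB / PB_PER_GB; g++)
                if (pb[g * PB_PER_GB] == UNMOVABLE)
                        dirty++;
        /* The single conversion dirties exactly one 1GB chunk;
         * the other three stay fully movable. */
        printf("%d of 4 GB chunks unmovable\n", dirty);
        return 0;
}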

Thanks!


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v3 00/49] 1GB PUD THP support on x86_64
  2021-03-31  3:32             ` Roman Gushchin
@ 2021-03-31 14:48               ` Zi Yan
  0 siblings, 0 replies; 15+ messages in thread
From: Zi Yan @ 2021-03-31 14:48 UTC (permalink / raw)
  To: Roman Gushchin, Matthew Wilcox
  Cc: linux-mm, Kirill A . Shutemov, Andrew Morton, Yang Shi,
	Michal Hocko, John Hubbard, Ralph Campbell, David Nellans,
	Jason Gunthorpe, David Rientjes, Vlastimil Babka,
	David Hildenbrand, Mike Kravetz, Song Liu

[-- Attachment #1: Type: text/plain, Size: 5370 bytes --]

On 30 Mar 2021, at 23:32, Roman Gushchin wrote:

> On Wed, Mar 31, 2021 at 04:09:35AM +0100, Matthew Wilcox wrote:
>> On Tue, Mar 30, 2021 at 11:02:07AM -0700, Roman Gushchin wrote:
>>> On Tue, Mar 30, 2021 at 01:24:14PM -0400, Zi Yan wrote:
>>>> On 4 Mar 2021, at 11:45, Roman Gushchin wrote:
>>>>> I actually ran a large-scale experiment (on tens of thousands of machines) over the last
>>>>> several months. It was about hugetlbfs 1GB pages, but the allocation mechanism is the same.
>>>>
>>>> Thanks for the information. I finally have time to come back to this. Do you mind sharing
>>>> the total memory of these machines? I want to have some idea of the scale of this issue to
>>>> make sure I reproduce it on a proper machine. Are you trying to get <20% of 10s of GBs,
>>>> 100s of GBs, or TBs of memory?
>>>
>>> There are different configurations, but in general they are in the 100s of GBs or smaller.
>>
>> Are you using ZONE_MOVABLE?  Seeing /proc/buddyinfo from one of these
>> machines might be illuminating.
>
> No, I'm using pre-allocated CMA areas, and it works fine.
> Buddyinfo stops at order 10, right?
> How is it helpful with fragmentation on the 1GB scale?
>
>>
>>>>
>>>>>
>>>>> My goal was to allocate a relatively small number of 1GB pages (<20% of the total memory).
>>>>> Without CMA the chances reach 0% very fast after reboot, and even manual manipulations
>>>>> like shutting down all workloads, dropping caches, calling sync, compaction, etc. do not
>>>>> help much. Sometimes you can allocate maybe 1-2 pages, but that's about it.
>>>>
>>>> Is there a way of replicating such an environment with publicly available software?
>>>> I really want to understand the root cause and am willing to find a possible solution.
>>>> It would be much easier if I could reproduce this locally.
>>>
>>> There is nothing fb-specific: once the memory is filled with anon/pagecache, any subsequent
>>> allocations of non-movable memory (slabs, percpu, etc.) will fragment the memory. There
>>> is a pageblock mechanism which prevents fragmentation on the 2MB scale, but nothing prevents
>>> fragmentation on the 1GB scale. It's just a matter of runtime (and the number of mm operations).
>>
>> I think this is an area where the buddy allocator could be improved.
>> Of course, it knows nothing of larger page orders (which needs to be
>> fixed), but in general, I would like it to do a better job of segregating
>> movable and unmovable allocations.
>>
>> Let's take a machine with 100GB of memory as an example.  Ideally,
>> unmovable allocations would start at 4GB (assuming below 4GB is
>> ZONE_DMA32).  Movable allocations can allocate anywhere in memory, but
>> should avoid being "near" unmovable allocations.  Perhaps they start
>> at 5GB.  When unmovable allocations get up to 5GB, we should first exert
>> a bit of pressure to shrink the unmovable allocations (looking at you,
>> dcache), but eventually we'll need to grow the unmovable allocations
>> above 5GB and we should move, say, all the pages between 5GB and 5GB+1MB.
>> If this unmovable allocation was just temporary, we get a reassembled
>> 1MB page.  If it was permanent, we now have 1MB of memory to soak up
>> the next few allocations.
>>
>> The model I'm thinking of here is that we have a "line" in memory that
>> divides movable and unmovable allocations.  It can move up, but there
>> has to be significant memory pressure to do so.

Hi Roman and Matthew,

David Hildenbrand proposed an idea similar to Matthew’s, ZONE_PREFER_MOVABLE,
which prefers movable allocations and serves as the fallback for unmovable allocations
when ZONE_NORMAL is full [1]. The ZONE_PREFER_MOVABLE size can also be changed
dynamically on demand.

My concerns for the ideas like this are:

1. Some long-lived unmovable pages might pin the boundary between movable and
unmovable, so the part reserved for unmovable allocations might only ever grow.
Would something like creating a lot of dentries by walking the file system and
keeping the last visited file open make this happen? (A sketch of such a
reproducer follows the second concern below.)

2. The cost of pushing the boundary. Unless both movable and unmovable allocations
are growing towards the boundary, the kernel will need to migrate movable pages to move
the boundary. That would create noticeable latency for the unmovable allocations that
need to move the boundary, right?
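
A rough sketch of the reproducer mentioned in concern 1 (illustrative only; whether
the pinned dentries/inodes would really anchor the boundary is exactly the open
question):

/* Walk a file system to populate the dentry/inode caches, keeping the
 * last visited regular file open so its dentry chain stays pinned. */
#include <dirent.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int last_fd = -1;

static void walk(const char *dir)
{
        DIR *d = opendir(dir);
        struct dirent *de;
        char path[PATH_MAX];

        if (!d)
                return;
        while ((de = readdir(d))) {
                if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, ".."))
                        continue;
                snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
                if (de->d_type == DT_DIR) {
                        walk(path);             /* recurse: more dentries */
                } else if (de->d_type == DT_REG) {
                        if (last_fd >= 0)
                                close(last_fd); /* only pin the newest file */
                        last_fd = open(path, O_RDONLY);
                }
        }
        closedir(d);
}

int main(int argc, char **argv)
{
        walk(argc > 1 ? argv[1] : "/usr");
        fprintf(stderr, "done, last fd = %d; holding it open\n", last_fd);
        pause();
        return 0;
}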

>
> I agree. My idea (which I need to find some time to try) was to hack the pageblock
> code so that if we convert a block to non-movable, we convert the entire GB around
> it. In this case, all unmovable memory will likely fit into a few 1GB chunks,
> leaving all other chunks movable. But from a security point of view it would
> be less desirable, I guess. What do you think about it?

I like this idea and have thought about it too. It could reuse the existing fragmentation
avoidance mechanism and does not have my concerns above. But there will still
be some work to prevent more than one 1GB pageblock from being converted after
we add support for a 1GB pageblock size; otherwise, memory can still be fragmented
by unmovable pages when multiple 1GB pageblocks are converted to unmovable ones,
assuming movable allocations can fall back to unmovable pageblocks.


[1] https://lore.kernel.org/linux-mm/6135d2c5-2a74-6ca8-4b3b-8ceb25c0d4b1@redhat.com/

—
Best Regards,
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2021-03-31 14:48 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-24 22:35 [RFC PATCH v3 00/49] 1GB PUD THP support on x86_64 Zi Yan
2021-02-25 11:02 ` David Hildenbrand
2021-02-25 22:13   ` Zi Yan
2021-03-02  8:55     ` David Hildenbrand
2021-03-03 23:42       ` Zi Yan
2021-03-04  9:26         ` David Hildenbrand
2021-03-02  1:59 ` Roman Gushchin
2021-03-04 16:26   ` Zi Yan
2021-03-04 16:45     ` Roman Gushchin
2021-03-30 17:24       ` Zi Yan
2021-03-30 18:02         ` Roman Gushchin
2021-03-31  2:04           ` Zi Yan
2021-03-31  3:09           ` Matthew Wilcox
2021-03-31  3:32             ` Roman Gushchin
2021-03-31 14:48               ` Zi Yan
