* [PATCH v2 0/5] variable-order, large folios for anonymous memory
From: Ryan Roberts @ 2023-07-03 13:53 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei,
	David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi
  Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm

Hi All,

This is v2 of a series to implement variable-order, large folios for
anonymous memory. The objective is to improve performance by allocating
larger chunks of memory during anonymous page faults. See [1] for
background.

I've significantly reworked and simplified the patch set based on comments
from Yu Zhao (thanks for all your feedback!). I've also renamed the
feature to FLEXIBLE_THP, on Yu's advice.

The last patch is for arm64 to explicitly override the default
arch_wants_pte_order() and is intended as an example. If this series is
accepted, I suggest taking the first 4 patches through the mm tree and
handling the arm64 change separately through the arm64 tree. Neither has
any build dependency on the other.

The one area where I haven't followed Yu's advice is in determining the
size of folio to use. It was suggested that I have a single preferred
large order, and if it doesn't fit in the VMA (due to exceeding VMA
bounds, or there being existing overlapping populated PTEs, etc.) then
fall back immediately to order-0. It turned out that this approach caused
a performance regression in the Speedometer benchmark. With my v1 patch,
there were significant quantities of memory which could not be placed in
the 64K bucket and were instead allocated into the 32K and 16K buckets.
With the proposed simplification, that memory ended up in the 4K bucket,
so page faults increased by 2.75x compared to the v1 patch (although,
thanks to the 64K bucket, this number is still somewhat lower than the
baseline).
So instead, I continue to calculate a folio order that is somewhere
between the preferred order and 0 (see below for more details).

The patches are based on top of v6.4 plus Matthew Wilcox's set_ptes()
series [2], which is a hard dependency. I have a branch at [3].

Changes since v1 [1]
--------------------

  - removed changes to arch-dependent vma_alloc_zeroed_movable_folio()
  - replaced with arch-independent alloc_anon_folio()
      - follows THP allocation approach
  - no longer retry with intermediate orders if allocation fails
      - fall back directly to order-0
  - remove folio_add_new_anon_rmap_range() patch
      - instead add its new functionality to folio_add_new_anon_rmap()
  - remove batch-zap pte mappings optimization patch
      - remove enabler folio_remove_rmap_range() patch too
      - these offer a real perf improvement, so will submit separately
  - simplify Kconfig
      - single FLEXIBLE_THP option, which is independent of arch
      - depends on TRANSPARENT_HUGEPAGE
      - when enabled, default to max anon folio size of 64K unless arch
        explicitly overrides
  - simplify changes to do_anonymous_page():
      - no more retry loop

Performance
-----------

The results below cover 3 benchmarks: kernel compilation with 8 jobs,
kernel compilation with 80 jobs, and Speedometer 2.0 (a JavaScript
benchmark running in Chromium). All cases run on Ampere Altra with 1 NUMA
node enabled, Ubuntu 22.04 and an XFS filesystem. Each benchmark is
repeated 15 times over 5 reboots and averaged.

'anonfolio-lkml-v1' is the v1 patch set at [1]. 'anonfolio-lkml-v2' is
this v2 patch set. 'anonfolio-lkml-v2-simple-order' is anonfolio-lkml-v2
but with the order-selection simplification that Yu Zhao suggested - I'm
trying to justify here why I did not follow that advice.
Kernel compilation with 8 jobs:

| kernel                         |   real-time |   kern-time |   user-time |
|:-------------------------------|------------:|------------:|------------:|
| baseline-4k                    |        0.0% |        0.0% |        0.0% |
| anonfolio-lkml-v1              |       -5.3% |      -42.9% |       -0.6% |
| anonfolio-lkml-v2-simple-order |       -4.4% |      -36.5% |       -0.4% |
| anonfolio-lkml-v2              |       -4.8% |      -38.6% |       -0.6% |

We can see that the simple-order approach is responsible for a 0.4%
real-time regression.

Kernel compilation with 80 jobs:

| kernel                         |   real-time |   kern-time |   user-time |
|:-------------------------------|------------:|------------:|------------:|
| baseline-4k                    |        0.0% |        0.0% |        0.0% |
| anonfolio-lkml-v1              |       -4.6% |      -45.7% |        1.4% |
| anonfolio-lkml-v2-simple-order |       -4.7% |      -40.2% |       -0.1% |
| anonfolio-lkml-v2              |       -5.0% |      -42.6% |       -0.3% |

simple-order costs 0.3% here. v2 actually performs better than v1 because
it fixes the v1 user-time regression.

Speedometer 2.0:

| kernel                         |   runs_per_min |
|:-------------------------------|---------------:|
| baseline-4k                    |           0.0% |
| anonfolio-lkml-v1              |           0.7% |
| anonfolio-lkml-v2-simple-order |          -0.9% |
| anonfolio-lkml-v2              |           0.5% |

simple-order regresses performance by 0.9% vs the baseline, for a total
negative swing of 1.6% vs v1. This is fixed by keeping the more complex
order-selection mechanism from v1.

The remaining (kernel-time) performance gap between v1 and v2 for the
above benchmarks is due to the removal of the "batch zap" patch in v2.
Adding that back in recovers the performance. I intend to submit it as a
separate series once this series is accepted.
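For illustration only, the fallback policy argued for above (start at the
preferred order and step down toward order-0, rather than jumping straight
to order-0 when the preferred order doesn't fit) can be sketched in
userspace C as follows. The names `block_fits` and `pick_order` are
hypothetical, not functions from this series, and this sketch only models
the VMA-bounds constraint, not overlapping populated PTEs:

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_SHIFT 12 /* assume 4K base pages */

/* Does the naturally-aligned block of (1 << order) pages covering addr
 * lie entirely within [start, end)? */
static bool block_fits(unsigned long addr, int order,
		       unsigned long start, unsigned long end)
{
	unsigned long size = 1UL << (order + PAGE_SHIFT);
	unsigned long blk_start = addr & ~(size - 1);

	return blk_start >= start && blk_start + size <= end;
}

/* Step down from the preferred order instead of falling straight to 0. */
static int pick_order(unsigned long addr, int preferred,
		      unsigned long start, unsigned long end)
{
	int order;

	for (order = preferred; order > 0; order--) {
		if (block_fits(addr, order, start, end))
			break;
	}
	return order;
}
```

With this policy a fault near the edge of a VMA still gets an
intermediate-order folio (e.g. 16K) rather than dropping all the way to
4K, which is the behaviour the Speedometer numbers above depend on.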
[1] https://lore.kernel.org/linux-mm/20230626171430.3167004-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/linux-mm/20230315051444.3229621-1-willy@infradead.org/
[3] https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/anonfolio-lkml_v2

Thanks,
Ryan

Ryan Roberts (5):
  mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap()
  mm: Allow deferred splitting of arbitrary large anon folios
  mm: Default implementation of arch_wants_pte_order()
  mm: FLEXIBLE_THP for improved performance
  arm64: mm: Override arch_wants_pte_order()

 arch/arm64/Kconfig               |  12 +++
 arch/arm64/include/asm/pgtable.h |   4 +
 arch/arm64/mm/mmu.c              |   8 ++
 include/linux/pgtable.h          |  13 +++
 mm/Kconfig                       |  10 ++
 mm/memory.c                      | 168 ++++++++++++++++++++++++++++---
 mm/rmap.c                        |  28 ++++--
 7 files changed, 222 insertions(+), 21 deletions(-)

--
2.25.1
* [PATCH v2 1/5] mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap()
From: Ryan Roberts @ 2023-07-03 13:53 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei,
	David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi
  Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm

In preparation for FLEXIBLE_THP support, improve
folio_add_new_anon_rmap() to allow a non-pmd-mappable, large folio to be
passed to it. In this case, all contained pages are accounted using the
"small" pages scheme.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 mm/rmap.c | 26 +++++++++++++++++++-------
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 1d8369549424..82ef5ba363d1 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1278,31 +1278,43 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
  * This means the inc-and-test can be bypassed.
  * The folio does not have to be locked.
  *
- * If the folio is large, it is accounted as a THP. As the folio
+ * If the folio is pmd-mappable, it is accounted as a THP. As the folio
  * is new, it's assumed to be mapped exclusively by a single process.
  */
 void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
 		unsigned long address)
 {
-	int nr;
+	int nr = folio_nr_pages(folio);
+	int i;
+	struct page *page;

-	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
+	VM_BUG_ON_VMA(address < vma->vm_start ||
+			address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
 	__folio_set_swapbacked(folio);

-	if (likely(!folio_test_pmd_mappable(folio))) {
+	if (!folio_test_large(folio)) {
 		/* increment count (starts at -1) */
 		atomic_set(&folio->_mapcount, 0);
-		nr = 1;
+		__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
+	} else if (!folio_test_pmd_mappable(folio)) {
+		/* increment count (starts at 0) */
+		atomic_set(&folio->_nr_pages_mapped, nr);
+
+		page = &folio->page;
+		for (i = 0; i < nr; i++, page++, address += PAGE_SIZE) {
+			/* increment count (starts at -1) */
+			atomic_set(&page->_mapcount, 0);
+			__page_set_anon_rmap(folio, page, vma, address, 1);
+		}
 	} else {
 		/* increment count (starts at -1) */
 		atomic_set(&folio->_entire_mapcount, 0);
 		atomic_set(&folio->_nr_pages_mapped, COMPOUND_MAPPED);
-		nr = folio_nr_pages(folio);
 		__lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr);
+		__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
 	}

 	__lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr);
-	__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
 }

 /**
--
2.25.1
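The patch above turns the old two-way split (small vs. THP) into three
accounting paths. As a rough userspace model of that branch structure only
(the names `account_path`, `RMAP_*` and `PMD_NR` are illustrative, not the
kernel's API; PMD_NR assumes 4K base pages and a 2M PMD):

```c
#include <assert.h>

#define PMD_NR 512 /* base pages per PMD-sized folio (assumed geometry) */

/* The three paths folio_add_new_anon_rmap() now chooses between. */
enum rmap_path {
	RMAP_SMALL,     /* order-0: single _mapcount */
	RMAP_LARGE_PTE, /* large but not pmd-mappable: per-page _mapcount,
	                   like the order-0 scheme, plus _nr_pages_mapped */
	RMAP_THP,       /* pmd-mappable: _entire_mapcount + COMPOUND_MAPPED */
};

static enum rmap_path account_path(int nr_pages)
{
	if (nr_pages == 1)      /* !folio_test_large() */
		return RMAP_SMALL;
	if (nr_pages < PMD_NR)  /* large, !folio_test_pmd_mappable() */
		return RMAP_LARGE_PTE;
	return RMAP_THP;
}
```

The middle path is the new one: each page in the folio gets its own
_mapcount, so partial unmap and deferred split keep working as they do for
order-0 pages.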
* Re: [PATCH v2 1/5] mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap()
From: Yu Zhao @ 2023-07-03 19:05 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei,
	David Hildenbrand, Catalin Marinas, Will Deacon, Anshuman Khandual,
	Yang Shi, linux-arm-kernel, linux-kernel, linux-mm

On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> In preparation for FLEXIBLE_THP support, improve
> folio_add_new_anon_rmap() to allow a non-pmd-mappable, large folio to be
> passed to it. In this case, all contained pages are accounted using the
> "small" pages scheme.

Nit: In this case, all *subpages* are accounted using the *order-0
folio* (or base page) scheme.

> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>

Reviewed-by: Yu Zhao <yuzhao@google.com>

>  mm/rmap.c | 26 +++++++++++++++++++-------
>  1 file changed, 19 insertions(+), 7 deletions(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 1d8369549424..82ef5ba363d1 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1278,31 +1278,43 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
>  * This means the inc-and-test can be bypassed.
>  * The folio does not have to be locked.
>  *
> - * If the folio is large, it is accounted as a THP. As the folio
> + * If the folio is pmd-mappable, it is accounted as a THP. As the folio
>  * is new, it's assumed to be mapped exclusively by a single process.
>  */
> void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
> 		unsigned long address)
> {
> -	int nr;
> +	int nr = folio_nr_pages(folio);
> +	int i;
> +	struct page *page;
>
> -	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
> +	VM_BUG_ON_VMA(address < vma->vm_start ||
> +			address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
> 	__folio_set_swapbacked(folio);
>
> -	if (likely(!folio_test_pmd_mappable(folio))) {
> +	if (!folio_test_large(folio)) {
> 		/* increment count (starts at -1) */
> 		atomic_set(&folio->_mapcount, 0);
> -		nr = 1;
> +		__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
> +	} else if (!folio_test_pmd_mappable(folio)) {
> +		/* increment count (starts at 0) */
> +		atomic_set(&folio->_nr_pages_mapped, nr);
> +
> +		page = &folio->page;
> +		for (i = 0; i < nr; i++, page++, address += PAGE_SIZE) {
> +			/* increment count (starts at -1) */
> +			atomic_set(&page->_mapcount, 0);
> +			__page_set_anon_rmap(folio, page, vma, address, 1);
> +		}

Nit: use folio_page(), e.g.,

	} else if (!folio_test_pmd_mappable(folio)) {
		int i;

		for (i = 0; i < nr; i++) {
			struct page *page = folio_page(folio, i);

			/* increment count (starts at -1) */
			atomic_set(&page->_mapcount, 0);
			__page_set_anon_rmap(folio, page, vma,
					address + PAGE_SIZE * i, 1);
		}
		/* increment count (starts at 0) */
		atomic_set(&folio->_nr_pages_mapped, nr);
	} else {

> 	} else {
> 		/* increment count (starts at -1) */
> 		atomic_set(&folio->_entire_mapcount, 0);
> 		atomic_set(&folio->_nr_pages_mapped, COMPOUND_MAPPED);
> -		nr = folio_nr_pages(folio);
> 		__lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr);
> +		__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
> 	}
>
> 	__lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr);
> -	__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
> }
* Re: [PATCH v2 1/5] mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap()
From: Yin, Fengwei @ 2023-07-04 2:13 UTC (permalink / raw)
  To: Yu Zhao, Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov,
	David Hildenbrand, Catalin Marinas, Will Deacon, Anshuman Khandual,
	Yang Shi, linux-arm-kernel, linux-kernel, linux-mm

On 7/4/2023 3:05 AM, Yu Zhao wrote:
> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> In preparation for FLEXIBLE_THP support, improve
>> folio_add_new_anon_rmap() to allow a non-pmd-mappable, large folio to be
>> passed to it. In this case, all contained pages are accounted using the
>> "small" pages scheme.
>
> Nit: In this case, all *subpages* are accounted using the *order-0
> folio* (or base page) scheme.
Matthew suggested not to use subpage with folio. Using page with folio:
https://lore.kernel.org/linux-mm/Y9qiS%2FIxZOMx62t6@casper.infradead.org/

>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>
> Reviewed-by: Yu Zhao <yuzhao@google.com>
>
>>  mm/rmap.c | 26 +++++++++++++++++++-------
>>  1 file changed, 19 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 1d8369549424..82ef5ba363d1 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1278,31 +1278,43 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
>>  * This means the inc-and-test can be bypassed.
>>  * The folio does not have to be locked.
>>  *
>> - * If the folio is large, it is accounted as a THP. As the folio
>> + * If the folio is pmd-mappable, it is accounted as a THP. As the folio
>>  * is new, it's assumed to be mapped exclusively by a single process.
>>  */
>> void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
>> 		unsigned long address)
>> {
>> -	int nr;
>> +	int nr = folio_nr_pages(folio);
>> +	int i;
>> +	struct page *page;
>>
>> -	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
>> +	VM_BUG_ON_VMA(address < vma->vm_start ||
>> +			address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
>> 	__folio_set_swapbacked(folio);
>>
>> -	if (likely(!folio_test_pmd_mappable(folio))) {
>> +	if (!folio_test_large(folio)) {
>> 		/* increment count (starts at -1) */
>> 		atomic_set(&folio->_mapcount, 0);
>> -		nr = 1;
>> +		__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
>> +	} else if (!folio_test_pmd_mappable(folio)) {
>> +		/* increment count (starts at 0) */
>> +		atomic_set(&folio->_nr_pages_mapped, nr);
>> +
>> +		page = &folio->page;
>> +		for (i = 0; i < nr; i++, page++, address += PAGE_SIZE) {
>> +			/* increment count (starts at -1) */
>> +			atomic_set(&page->_mapcount, 0);
>> +			__page_set_anon_rmap(folio, page, vma, address, 1);
>> +		}
>
> Nit: use folio_page(), e.g.,
>
> 	} else if (!folio_test_pmd_mappable(folio)) {
> 		int i;
>
> 		for (i = 0; i < nr; i++) {
> 			struct page *page = folio_page(folio, i);
>
> 			/* increment count (starts at -1) */
> 			atomic_set(&page->_mapcount, 0);
> 			__page_set_anon_rmap(folio, page, vma,
> 					address + PAGE_SIZE * i, 1);
> 		}
> 		/* increment count (starts at 0) */
> 		atomic_set(&folio->_nr_pages_mapped, nr);
> 	} else {
>
>> 	} else {
>> 		/* increment count (starts at -1) */
>> 		atomic_set(&folio->_entire_mapcount, 0);
>> 		atomic_set(&folio->_nr_pages_mapped, COMPOUND_MAPPED);
>> -		nr = folio_nr_pages(folio);
>> 		__lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr);
>> +		__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
>> 	}
>>
>> 	__lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr);
>> -	__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
>> }
* Re: [PATCH v2 1/5] mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap() @ 2023-07-04 2:13 ` Yin, Fengwei 0 siblings, 0 replies; 167+ messages in thread From: Yin, Fengwei @ 2023-07-04 2:13 UTC (permalink / raw) To: Yu Zhao, Ryan Roberts Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, David Hildenbrand, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm On 7/4/2023 3:05 AM, Yu Zhao wrote: > On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote: >> >> In preparation for FLEXIBLE_THP support, improve >> folio_add_new_anon_rmap() to allow a non-pmd-mappable, large folio to be >> passed to it. In this case, all contained pages are accounted using the >> "small" pages scheme. > > Nit: In this case, all *subpages* are accounted using the *order-0 > folio* (or base page) scheme. Matthew suggested not to use subpage with folio. Using page with folio: https://lore.kernel.org/linux-mm/Y9qiS%2FIxZOMx62t6@casper.infradead.org/ > >> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> > > Reviewed-by: Yu Zhao <yuzhao@google.com> > >> mm/rmap.c | 26 +++++++++++++++++++------- >> 1 file changed, 19 insertions(+), 7 deletions(-) >> >> diff --git a/mm/rmap.c b/mm/rmap.c >> index 1d8369549424..82ef5ba363d1 100644 >> --- a/mm/rmap.c >> +++ b/mm/rmap.c >> @@ -1278,31 +1278,43 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma, >> * This means the inc-and-test can be bypassed. >> * The folio does not have to be locked. >> * >> - * If the folio is large, it is accounted as a THP. As the folio >> + * If the folio is pmd-mappable, it is accounted as a THP. As the folio >> * is new, it's assumed to be mapped exclusively by a single process. 
>> */ >> void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma, >> unsigned long address) >> { >> - int nr; >> + int nr = folio_nr_pages(folio); >> + int i; >> + struct page *page; >> >> - VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma); >> + VM_BUG_ON_VMA(address < vma->vm_start || >> + address + (nr << PAGE_SHIFT) > vma->vm_end, vma); >> __folio_set_swapbacked(folio); >> >> - if (likely(!folio_test_pmd_mappable(folio))) { >> + if (!folio_test_large(folio)) { >> /* increment count (starts at -1) */ >> atomic_set(&folio->_mapcount, 0); >> - nr = 1; >> + __page_set_anon_rmap(folio, &folio->page, vma, address, 1); >> + } else if (!folio_test_pmd_mappable(folio)) { >> + /* increment count (starts at 0) */ >> + atomic_set(&folio->_nr_pages_mapped, nr); >> + >> + page = &folio->page; >> + for (i = 0; i < nr; i++, page++, address += PAGE_SIZE) { >> + /* increment count (starts at -1) */ >> + atomic_set(&page->_mapcount, 0); >> + __page_set_anon_rmap(folio, page, vma, address, 1); >> + } > > Nit: use folio_page(), e.g., > > } else if (!folio_test_pmd_mappable(folio)) { > int i; > > for (i = 0; i < nr; i++) { > struct page *page = folio_page(folio, i); > > /* increment count (starts at -1) */ > atomic_set(&page->_mapcount, 0); > __page_set_anon_rmap(folio, page, vma, address + PAGE_SIZE * i, 1); > } > /* increment count (starts at 0) */ > atomic_set(&folio->_nr_pages_mapped, nr); > } else { > >> } else { >> /* increment count (starts at -1) */ >> atomic_set(&folio->_entire_mapcount, 0); >> atomic_set(&folio->_nr_pages_mapped, COMPOUND_MAPPED); >> - nr = folio_nr_pages(folio); >> __lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr); >> + __page_set_anon_rmap(folio, &folio->page, vma, address, 1); >> } >> >> __lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr); >> - __page_set_anon_rmap(folio, &folio->page, vma, address, 1); >> } _______________________________________________ linux-arm-kernel mailing list 
linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
* Re: [PATCH v2 1/5] mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap() 2023-07-04 2:13 ` Yin, Fengwei @ 2023-07-04 11:19 ` Ryan Roberts -1 siblings, 0 replies; 167+ messages in thread From: Ryan Roberts @ 2023-07-04 11:19 UTC (permalink / raw) To: Yin, Fengwei, Yu Zhao Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, David Hildenbrand, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm On 04/07/2023 03:13, Yin, Fengwei wrote: > > > On 7/4/2023 3:05 AM, Yu Zhao wrote: >> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote: >>> >>> In preparation for FLEXIBLE_THP support, improve >>> folio_add_new_anon_rmap() to allow a non-pmd-mappable, large folio to be >>> passed to it. In this case, all contained pages are accounted using the >>> "small" pages scheme. >> >> Nit: In this case, all *subpages* are accounted using the *order-0 >> folio* (or base page) scheme. > Matthew suggested not to use subpage with folio. Using page with folio: > https://lore.kernel.org/linux-mm/Y9qiS%2FIxZOMx62t6@casper.infradead.org/ OK, I'll change this to "In this case, all contained pages are accounted using the *order-0 folio* (or base page) scheme." > >> >>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> >> >> Reviewed-by: Yu Zhao <yuzhao@google.com> Thanks! >> >>> mm/rmap.c | 26 +++++++++++++++++++------- >>> 1 file changed, 19 insertions(+), 7 deletions(-) >>> >>> diff --git a/mm/rmap.c b/mm/rmap.c >>> index 1d8369549424..82ef5ba363d1 100644 >>> --- a/mm/rmap.c >>> +++ b/mm/rmap.c >>> @@ -1278,31 +1278,43 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma, >>> * This means the inc-and-test can be bypassed. >>> * The folio does not have to be locked. >>> * >>> - * If the folio is large, it is accounted as a THP. As the folio >>> + * If the folio is pmd-mappable, it is accounted as a THP. As the folio >>> * is new, it's assumed to be mapped exclusively by a single process. 
>>> */ >>> void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma, >>> unsigned long address) >>> { >>> - int nr; >>> + int nr = folio_nr_pages(folio); >>> + int i; >>> + struct page *page; >>> >>> - VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma); >>> + VM_BUG_ON_VMA(address < vma->vm_start || >>> + address + (nr << PAGE_SHIFT) > vma->vm_end, vma); >>> __folio_set_swapbacked(folio); >>> >>> - if (likely(!folio_test_pmd_mappable(folio))) { >>> + if (!folio_test_large(folio)) { >>> /* increment count (starts at -1) */ >>> atomic_set(&folio->_mapcount, 0); >>> - nr = 1; >>> + __page_set_anon_rmap(folio, &folio->page, vma, address, 1); >>> + } else if (!folio_test_pmd_mappable(folio)) { >>> + /* increment count (starts at 0) */ >>> + atomic_set(&folio->_nr_pages_mapped, nr); >>> + >>> + page = &folio->page; >>> + for (i = 0; i < nr; i++, page++, address += PAGE_SIZE) { >>> + /* increment count (starts at -1) */ >>> + atomic_set(&page->_mapcount, 0); >>> + __page_set_anon_rmap(folio, page, vma, address, 1); >>> + } >> >> Nit: use folio_page(), e.g., Yep, will change for v3. 
>> >> } else if (!folio_test_pmd_mappable(folio)) { >> int i; >> >> for (i = 0; i < nr; i++) { >> struct page *page = folio_page(folio, i); >> >> /* increment count (starts at -1) */ >> atomic_set(&page->_mapcount, 0); >> __page_set_anon_rmap(folio, page, vma, address + PAGE_SIZE * i, 1); >> } >> /* increment count (starts at 0) */ >> atomic_set(&folio->_nr_pages_mapped, nr); >> } else { >> >>> } else { >>> /* increment count (starts at -1) */ >>> atomic_set(&folio->_entire_mapcount, 0); >>> atomic_set(&folio->_nr_pages_mapped, COMPOUND_MAPPED); >>> - nr = folio_nr_pages(folio); >>> __lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr); >>> + __page_set_anon_rmap(folio, &folio->page, vma, address, 1); >>> } >>> >>> __lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr); >>> - __page_set_anon_rmap(folio, &folio->page, vma, address, 1); >>> } ^ permalink raw reply [flat|nested] 167+ messages in thread
* Re: [PATCH v2 1/5] mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap() 2023-07-03 13:53 ` Ryan Roberts @ 2023-07-04 2:14 ` Yin, Fengwei -1 siblings, 0 replies; 167+ messages in thread From: Yin, Fengwei @ 2023-07-04 2:14 UTC (permalink / raw) To: Ryan Roberts, Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi Cc: linux-arm-kernel, linux-kernel, linux-mm On 7/3/2023 9:53 PM, Ryan Roberts wrote: > In preparation for FLEXIBLE_THP support, improve > folio_add_new_anon_rmap() to allow a non-pmd-mappable, large folio to be > passed to it. In this case, all contained pages are accounted using the > "small" pages scheme. > > Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Yin, Fengwei <fengwei.yin@intel.com> > --- > mm/rmap.c | 26 +++++++++++++++++++------- > 1 file changed, 19 insertions(+), 7 deletions(-) > > diff --git a/mm/rmap.c b/mm/rmap.c > index 1d8369549424..82ef5ba363d1 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -1278,31 +1278,43 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma, > * This means the inc-and-test can be bypassed. > * The folio does not have to be locked. > * > - * If the folio is large, it is accounted as a THP. As the folio > + * If the folio is pmd-mappable, it is accounted as a THP. As the folio > * is new, it's assumed to be mapped exclusively by a single process. 
> */ > void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma, > unsigned long address) > { > - int nr; > + int nr = folio_nr_pages(folio); > + int i; > + struct page *page; > > - VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma); > + VM_BUG_ON_VMA(address < vma->vm_start || > + address + (nr << PAGE_SHIFT) > vma->vm_end, vma); > __folio_set_swapbacked(folio); > > - if (likely(!folio_test_pmd_mappable(folio))) { > + if (!folio_test_large(folio)) { > /* increment count (starts at -1) */ > atomic_set(&folio->_mapcount, 0); > - nr = 1; > + __page_set_anon_rmap(folio, &folio->page, vma, address, 1); > + } else if (!folio_test_pmd_mappable(folio)) { > + /* increment count (starts at 0) */ > + atomic_set(&folio->_nr_pages_mapped, nr); > + > + page = &folio->page; > + for (i = 0; i < nr; i++, page++, address += PAGE_SIZE) { > + /* increment count (starts at -1) */ > + atomic_set(&page->_mapcount, 0); > + __page_set_anon_rmap(folio, page, vma, address, 1); > + } > } else { > /* increment count (starts at -1) */ > atomic_set(&folio->_entire_mapcount, 0); > atomic_set(&folio->_nr_pages_mapped, COMPOUND_MAPPED); > - nr = folio_nr_pages(folio); > __lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr); > + __page_set_anon_rmap(folio, &folio->page, vma, address, 1); > } > > __lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr); > - __page_set_anon_rmap(folio, &folio->page, vma, address, 1); > } > > /** ^ permalink raw reply [flat|nested] 167+ messages in thread
* [PATCH v2 2/5] mm: Allow deferred splitting of arbitrary large anon folios 2023-07-03 13:53 ` Ryan Roberts @ 2023-07-03 13:53 ` Ryan Roberts -1 siblings, 0 replies; 167+ messages in thread From: Ryan Roberts @ 2023-07-03 13:53 UTC (permalink / raw) To: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei, David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm With the introduction of large folios for anonymous memory, we would like to be able to split them when they have unmapped subpages, in order to free those unused pages under memory pressure. So remove the artificial requirement that the large folio needed to be at least PMD-sized. Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Yu Zhao <yuzhao@google.com> Reviewed-by: Yin Fengwei <fengwei.yin@intel.com> --- mm/rmap.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/rmap.c b/mm/rmap.c index 82ef5ba363d1..bbcb2308a1c5 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1474,7 +1474,7 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma, * page of the folio is unmapped and at least one page * is still mapped. */ - if (folio_test_pmd_mappable(folio) && folio_test_anon(folio)) + if (folio_test_large(folio) && folio_test_anon(folio)) if (!compound || nr < nr_pmdmapped) deferred_split_folio(folio); } -- 2.25.1 ^ permalink raw reply related [flat|nested] 167+ messages in thread
* Re: [PATCH v2 2/5] mm: Allow deferred splitting of arbitrary large anon folios 2023-07-03 13:53 ` Ryan Roberts @ 2023-07-07 8:21 ` Huang, Ying -1 siblings, 0 replies; 167+ messages in thread From: Huang, Ying @ 2023-07-07 8:21 UTC (permalink / raw) To: Ryan Roberts Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei, David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm Ryan Roberts <ryan.roberts@arm.com> writes: > With the introduction of large folios for anonymous memory, we would > like to be able to split them when they have unmapped subpages, in order > to free those unused pages under memory pressure. So remove the > artificial requirement that the large folio needed to be at least > PMD-sized. > > Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> > Reviewed-by: Yu Zhao <yuzhao@google.com> > Reviewed-by: Yin Fengwei <fengwei.yin@intel.com> > --- > mm/rmap.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/mm/rmap.c b/mm/rmap.c > index 82ef5ba363d1..bbcb2308a1c5 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -1474,7 +1474,7 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma, > * page of the folio is unmapped and at least one page > * is still mapped. > */ > - if (folio_test_pmd_mappable(folio) && folio_test_anon(folio)) > + if (folio_test_large(folio) && folio_test_anon(folio)) > if (!compound || nr < nr_pmdmapped) > deferred_split_folio(folio); > } One possible issue is that even for large folios mapped only in one process, in zap_pte_range(), we will always call deferred_split_folio() unnecessarily before freeing a large folio. Best Regards, Huang, Ying ^ permalink raw reply [flat|nested] 167+ messages in thread
* Re: [PATCH v2 2/5] mm: Allow deferred splitting of arbitrary large anon folios 2023-07-07 8:21 ` Huang, Ying (?) @ 2023-07-07 9:39 ` Ryan Roberts -1 siblings, 0 replies; 167+ messages in thread From: Ryan Roberts @ 2023-07-07 9:39 UTC (permalink / raw) To: linux-arm-kernel On 07/07/2023 09:21, Huang, Ying wrote: > Ryan Roberts <ryan.roberts@arm.com> writes: > >> With the introduction of large folios for anonymous memory, we would >> like to be able to split them when they have unmapped subpages, in order >> to free those unused pages under memory pressure. So remove the >> artificial requirement that the large folio needed to be at least >> PMD-sized. >> >> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> >> Reviewed-by: Yu Zhao <yuzhao@google.com> >> Reviewed-by: Yin Fengwei <fengwei.yin@intel.com> >> --- >> mm/rmap.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/mm/rmap.c b/mm/rmap.c >> index 82ef5ba363d1..bbcb2308a1c5 100644 >> --- a/mm/rmap.c >> +++ b/mm/rmap.c >> @@ -1474,7 +1474,7 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma, >> * page of the folio is unmapped and at least one page >> * is still mapped. >> */ >> - if (folio_test_pmd_mappable(folio) && folio_test_anon(folio)) >> + if (folio_test_large(folio) && folio_test_anon(folio)) >> if (!compound || nr < nr_pmdmapped) >> deferred_split_folio(folio); >> } > > One possible issue is that even for large folios mapped only in one > process, in zap_pte_range(), we will always call deferred_split_folio() > unnecessarily before freeing a large folio. Hi Huang, thanks for reviewing! I have a patch that solves this problem by determining a range of ptes covered by a single folio and doing a "batch zap". This prevents the need to add the folio to the deferred split queue, only to remove it again shortly afterwards. This reduces lock contention and I can measure a performance improvement for the kernel compilation benchmark. See [1]. 
However, I decided to remove it from this patch set on Yu Zhao's advice. We are aiming for the minimal patch set to start with and wanted to focus people on that. I intend to submit it separately later on. [1] https://lore.kernel.org/linux-mm/20230626171430.3167004-8-ryan.roberts@arm.com/ Thanks, Ryan > > Best Regards, > Huang, Ying > > _______________________________________________ > linux-arm-kernel mailing list > linux-arm-kernel@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/linux-arm-kernel _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 167+ messages in thread
* Re: [PATCH v2 2/5] mm: Allow deferred splitting of arbitrary large anon folios 2023-07-07 8:21 ` Huang, Ying @ 2023-07-07 9:42 ` Ryan Roberts -1 siblings, 0 replies; 167+ messages in thread From: Ryan Roberts @ 2023-07-07 9:42 UTC (permalink / raw) To: Huang, Ying Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei, David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm Somehow I managed to reply only to the linux-arm-kernel list on first attempt so resending: On 07/07/2023 09:21, Huang, Ying wrote: > Ryan Roberts <ryan.roberts@arm.com> writes: > >> With the introduction of large folios for anonymous memory, we would >> like to be able to split them when they have unmapped subpages, in order >> to free those unused pages under memory pressure. So remove the >> artificial requirement that the large folio needed to be at least >> PMD-sized. >> >> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> >> Reviewed-by: Yu Zhao <yuzhao@google.com> >> Reviewed-by: Yin Fengwei <fengwei.yin@intel.com> >> --- >> mm/rmap.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/mm/rmap.c b/mm/rmap.c >> index 82ef5ba363d1..bbcb2308a1c5 100644 >> --- a/mm/rmap.c >> +++ b/mm/rmap.c >> @@ -1474,7 +1474,7 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma, >> * page of the folio is unmapped and at least one page >> * is still mapped. >> */ >> - if (folio_test_pmd_mappable(folio) && folio_test_anon(folio)) >> + if (folio_test_large(folio) && folio_test_anon(folio)) >> if (!compound || nr < nr_pmdmapped) >> deferred_split_folio(folio); >> } > > One possible issue is that even for large folios mapped only in one > process, in zap_pte_range(), we will always call deferred_split_folio() > unnecessarily before freeing a large folio. Hi Huang, thanks for reviewing! 
I have a patch that solves this problem by determining a range of ptes covered by a single folio and doing a "batch zap". This prevents the need to add the folio to the deferred split queue, only to remove it again shortly afterwards. This reduces lock contention and I can measure a performance improvement for the kernel compilation benchmark. See [1]. However, I decided to remove it from this patch set on Yu Zhao's advice. We are aiming for the minimal patch set to start with and wanted to focus people on that. I intend to submit it separately later on. [1] https://lore.kernel.org/linux-mm/20230626171430.3167004-8-ryan.roberts@arm.com/ Thanks, Ryan > > Best Regards, > Huang, Ying > > _______________________________________________ > linux-arm-kernel mailing list > linux-arm-kernel@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 167+ messages in thread
* Re: [PATCH v2 2/5] mm: Allow deferred splitting of arbitrary large anon folios 2023-07-07 9:42 ` Ryan Roberts @ 2023-07-10 5:37 ` Huang, Ying -1 siblings, 0 replies; 167+ messages in thread From: Huang, Ying @ 2023-07-10 5:37 UTC (permalink / raw) To: Ryan Roberts Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei, David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm Ryan Roberts <ryan.roberts@arm.com> writes: > Somehow I managed to reply only to the linux-arm-kernel list on first attempt so > resending: > > On 07/07/2023 09:21, Huang, Ying wrote: >> Ryan Roberts <ryan.roberts@arm.com> writes: >> >>> With the introduction of large folios for anonymous memory, we would >>> like to be able to split them when they have unmapped subpages, in order >>> to free those unused pages under memory pressure. So remove the >>> artificial requirement that the large folio needed to be at least >>> PMD-sized. >>> >>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> >>> Reviewed-by: Yu Zhao <yuzhao@google.com> >>> Reviewed-by: Yin Fengwei <fengwei.yin@intel.com> >>> --- >>> mm/rmap.c | 2 +- >>> 1 file changed, 1 insertion(+), 1 deletion(-) >>> >>> diff --git a/mm/rmap.c b/mm/rmap.c >>> index 82ef5ba363d1..bbcb2308a1c5 100644 >>> --- a/mm/rmap.c >>> +++ b/mm/rmap.c >>> @@ -1474,7 +1474,7 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma, >>> * page of the folio is unmapped and at least one page >>> * is still mapped. >>> */ >>> - if (folio_test_pmd_mappable(folio) && folio_test_anon(folio)) >>> + if (folio_test_large(folio) && folio_test_anon(folio)) >>> if (!compound || nr < nr_pmdmapped) >>> deferred_split_folio(folio); >>> } >> >> One possible issue is that even for large folios mapped only in one >> process, in zap_pte_range(), we will always call deferred_split_folio() >> unnecessarily before freeing a large folio. > > Hi Huang, thanks for reviewing! 
> > I have a patch that solves this problem by determining a range of ptes covered > by a single folio and doing a "batch zap". This prevents the need to add the > folio to the deferred split queue, only to remove it again shortly afterwards. > This reduces lock contention and I can measure a performance improvement for the > kernel compilation benchmark. See [1]. > > However, I decided to remove it from this patch set on Yu Zhao's advice. We are > aiming for the minimal patch set to start with and wanted to focus people on > that. I intend to submit it separately later on. > > [1] https://lore.kernel.org/linux-mm/20230626171430.3167004-8-ryan.roberts@arm.com/ Thanks for your information! "batch zap" can solve the problem. And, I agree with Matthew's comments to fix the large folios interaction issues before merging the patches to allocate large folios as in the following email. https://lore.kernel.org/linux-mm/ZKVdUDuwNWDUCWc5@casper.infradead.org/ If so, we don't need to introduce the above problem or a large patchset. Best Regards, Huang, Ying ^ permalink raw reply [flat|nested] 167+ messages in thread
* Re: [PATCH v2 2/5] mm: Allow deferred splitting of arbitrary large anon folios
  2023-07-10  5:37 ` Huang, Ying
@ 2023-07-10  8:29 ` Ryan Roberts
  -1 siblings, 0 replies; 167+ messages in thread
From: Ryan Roberts @ 2023-07-10  8:29 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei,
	David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm

On 10/07/2023 06:37, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
>
>> Somehow I managed to reply only to the linux-arm-kernel list on first attempt so
>> resending:
>>
>> On 07/07/2023 09:21, Huang, Ying wrote:
>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>
>>>> With the introduction of large folios for anonymous memory, we would
>>>> like to be able to split them when they have unmapped subpages, in order
>>>> to free those unused pages under memory pressure. So remove the
>>>> artificial requirement that the large folio needed to be at least
>>>> PMD-sized.
>>>>
>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>> Reviewed-by: Yu Zhao <yuzhao@google.com>
>>>> Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>
>>>> ---
>>>>  mm/rmap.c | 2 +-
>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>> index 82ef5ba363d1..bbcb2308a1c5 100644
>>>> --- a/mm/rmap.c
>>>> +++ b/mm/rmap.c
>>>> @@ -1474,7 +1474,7 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
>>>>  		 * page of the folio is unmapped and at least one page
>>>>  		 * is still mapped.
>>>>  		 */
>>>> -		if (folio_test_pmd_mappable(folio) && folio_test_anon(folio))
>>>> +		if (folio_test_large(folio) && folio_test_anon(folio))
>>>>  			if (!compound || nr < nr_pmdmapped)
>>>>  				deferred_split_folio(folio);
>>>>  }
>>>
>>> One possible issue is that even for large folios mapped only in one
>>> process, in zap_pte_range(), we will always call deferred_split_folio()
>>> unnecessarily before freeing a large folio.
>>
>> Hi Huang, thanks for reviewing!
>>
>> I have a patch that solves this problem by determining a range of ptes covered
>> by a single folio and doing a "batch zap". This prevents the need to add the
>> folio to the deferred split queue, only to remove it again shortly afterwards.
>> This reduces lock contention and I can measure a performance improvement for the
>> kernel compilation benchmark. See [1].
>>
>> However, I decided to remove it from this patch set on Yu Zhao's advice. We are
>> aiming for the minimal patch set to start with and wanted to focus people on
>> that. I intend to submit it separately later on.
>>
>> [1] https://lore.kernel.org/linux-mm/20230626171430.3167004-8-ryan.roberts@arm.com/
>
> Thanks for your information! "batch zap" can solve the problem.
>
> And, I agree with Matthew's comments to fix the large folios interaction
> issues before merging the patches to allocate large folios as in the
> following email.
>
> https://lore.kernel.org/linux-mm/ZKVdUDuwNWDUCWc5@casper.infradead.org/
>
> If so, we don't need to introduce the above problem or a large patchset.

I appreciate Matthew's and others' position about not wanting to merge a minimal
implementation while there are some fundamental features (e.g. compaction) it
doesn't play well with - I'm working to create a definitive list so these items
can be tracked and tackled.

That said, I don't see this "batch zap" patch as an example of this. It's just a
performance enhancement that improves things even further than large anon folios
on their own. I'd rather concentrate on the core changes first then deal with
this type of thing later. Does that work for you?

>
> Best Regards,
> Huang, Ying

^ permalink raw reply	[flat|nested] 167+ messages in thread
* Re: [PATCH v2 2/5] mm: Allow deferred splitting of arbitrary large anon folios
  2023-07-10  8:29 ` Ryan Roberts
@ 2023-07-10  9:01 ` Huang, Ying
  -1 siblings, 0 replies; 167+ messages in thread
From: Huang, Ying @ 2023-07-10  9:01 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei,
	David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm

Ryan Roberts <ryan.roberts@arm.com> writes:

> On 10/07/2023 06:37, Huang, Ying wrote:
>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>
>>> Somehow I managed to reply only to the linux-arm-kernel list on first attempt so
>>> resending:
>>>
>>> On 07/07/2023 09:21, Huang, Ying wrote:
>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>>
>>>>> With the introduction of large folios for anonymous memory, we would
>>>>> like to be able to split them when they have unmapped subpages, in order
>>>>> to free those unused pages under memory pressure. So remove the
>>>>> artificial requirement that the large folio needed to be at least
>>>>> PMD-sized.
>>>>>
>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>> Reviewed-by: Yu Zhao <yuzhao@google.com>
>>>>> Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>
>>>>> ---
>>>>>  mm/rmap.c | 2 +-
>>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>> index 82ef5ba363d1..bbcb2308a1c5 100644
>>>>> --- a/mm/rmap.c
>>>>> +++ b/mm/rmap.c
>>>>> @@ -1474,7 +1474,7 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
>>>>>  		 * page of the folio is unmapped and at least one page
>>>>>  		 * is still mapped.
>>>>>  		 */
>>>>> -		if (folio_test_pmd_mappable(folio) && folio_test_anon(folio))
>>>>> +		if (folio_test_large(folio) && folio_test_anon(folio))
>>>>>  			if (!compound || nr < nr_pmdmapped)
>>>>>  				deferred_split_folio(folio);
>>>>>  }
>>>>
>>>> One possible issue is that even for large folios mapped only in one
>>>> process, in zap_pte_range(), we will always call deferred_split_folio()
>>>> unnecessarily before freeing a large folio.
>>>
>>> Hi Huang, thanks for reviewing!
>>>
>>> I have a patch that solves this problem by determining a range of ptes covered
>>> by a single folio and doing a "batch zap". This prevents the need to add the
>>> folio to the deferred split queue, only to remove it again shortly afterwards.
>>> This reduces lock contention and I can measure a performance improvement for the
>>> kernel compilation benchmark. See [1].
>>>
>>> However, I decided to remove it from this patch set on Yu Zhao's advice. We are
>>> aiming for the minimal patch set to start with and wanted to focus people on
>>> that. I intend to submit it separately later on.
>>>
>>> [1] https://lore.kernel.org/linux-mm/20230626171430.3167004-8-ryan.roberts@arm.com/
>>
>> Thanks for your information! "batch zap" can solve the problem.
>>
>> And, I agree with Matthew's comments to fix the large folios interaction
>> issues before merging the patches to allocate large folios as in the
>> following email.
>>
>> https://lore.kernel.org/linux-mm/ZKVdUDuwNWDUCWc5@casper.infradead.org/
>>
>> If so, we don't need to introduce the above problem or a large patchset.
>
> I appreciate Matthew's and others' position about not wanting to merge a minimal
> implementation while there are some fundamental features (e.g. compaction) it
> doesn't play well with - I'm working to create a definitive list so these items
> can be tracked and tackled.

Good to know this, Thanks!

> That said, I don't see this "batch zap" patch as an example of this. It's just a
> performance enhancement that improves things even further than large anon folios
> on their own. I'd rather concentrate on the core changes first then deal with
> this type of thing later. Does that work for you?

IIUC, allocating large folios upon page fault depends on splitting large
folios in page_remove_rmap() to avoid memory wastage. Splitting large
folios in page_remove_rmap() depends on "batch zap" to avoid performance
regression in zap_pte_range(). So we need them to be done earlier. Or
do I miss something?

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 167+ messages in thread
* Re: [PATCH v2 2/5] mm: Allow deferred splitting of arbitrary large anon folios
  2023-07-10  9:01 ` Huang, Ying
@ 2023-07-10  9:39 ` Ryan Roberts
  -1 siblings, 0 replies; 167+ messages in thread
From: Ryan Roberts @ 2023-07-10  9:39 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei,
	David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm

On 10/07/2023 10:01, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
>
>> On 10/07/2023 06:37, Huang, Ying wrote:
>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>
>>>> Somehow I managed to reply only to the linux-arm-kernel list on first attempt so
>>>> resending:
>>>>
>>>> On 07/07/2023 09:21, Huang, Ying wrote:
>>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>>>
>>>>>> With the introduction of large folios for anonymous memory, we would
>>>>>> like to be able to split them when they have unmapped subpages, in order
>>>>>> to free those unused pages under memory pressure. So remove the
>>>>>> artificial requirement that the large folio needed to be at least
>>>>>> PMD-sized.
>>>>>>
>>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>>> Reviewed-by: Yu Zhao <yuzhao@google.com>
>>>>>> Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>
>>>>>> ---
>>>>>>  mm/rmap.c | 2 +-
>>>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>>> index 82ef5ba363d1..bbcb2308a1c5 100644
>>>>>> --- a/mm/rmap.c
>>>>>> +++ b/mm/rmap.c
>>>>>> @@ -1474,7 +1474,7 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
>>>>>>  		 * page of the folio is unmapped and at least one page
>>>>>>  		 * is still mapped.
>>>>>>  		 */
>>>>>> -		if (folio_test_pmd_mappable(folio) && folio_test_anon(folio))
>>>>>> +		if (folio_test_large(folio) && folio_test_anon(folio))
>>>>>>  			if (!compound || nr < nr_pmdmapped)
>>>>>>  				deferred_split_folio(folio);
>>>>>>  }
>>>>>
>>>>> One possible issue is that even for large folios mapped only in one
>>>>> process, in zap_pte_range(), we will always call deferred_split_folio()
>>>>> unnecessarily before freeing a large folio.
>>>>
>>>> Hi Huang, thanks for reviewing!
>>>>
>>>> I have a patch that solves this problem by determining a range of ptes covered
>>>> by a single folio and doing a "batch zap". This prevents the need to add the
>>>> folio to the deferred split queue, only to remove it again shortly afterwards.
>>>> This reduces lock contention and I can measure a performance improvement for the
>>>> kernel compilation benchmark. See [1].
>>>>
>>>> However, I decided to remove it from this patch set on Yu Zhao's advice. We are
>>>> aiming for the minimal patch set to start with and wanted to focus people on
>>>> that. I intend to submit it separately later on.
>>>>
>>>> [1] https://lore.kernel.org/linux-mm/20230626171430.3167004-8-ryan.roberts@arm.com/
>>>
>>> Thanks for your information! "batch zap" can solve the problem.
>>>
>>> And, I agree with Matthew's comments to fix the large folios interaction
>>> issues before merging the patches to allocate large folios as in the
>>> following email.
>>>
>>> https://lore.kernel.org/linux-mm/ZKVdUDuwNWDUCWc5@casper.infradead.org/
>>>
>>> If so, we don't need to introduce the above problem or a large patchset.
>>
>> I appreciate Matthew's and others' position about not wanting to merge a minimal
>> implementation while there are some fundamental features (e.g. compaction) it
>> doesn't play well with - I'm working to create a definitive list so these items
>> can be tracked and tackled.
>
> Good to know this, Thanks!
>
>> That said, I don't see this "batch zap" patch as an example of this. It's just a
>> performance enhancement that improves things even further than large anon folios
>> on their own. I'd rather concentrate on the core changes first then deal with
>> this type of thing later. Does that work for you?
>
> IIUC, allocating large folios upon page fault depends on splitting large
> folios in page_remove_rmap() to avoid memory wastage. Splitting large
> folios in page_remove_rmap() depends on "batch zap" to avoid performance
> regression in zap_pte_range(). So we need them to be done earlier. Or
> do I miss something?

My point was just that large anon folios improves performance significantly
overall, despite a small perf regression in zap_pte_range(). That regression is
reduced further by a patch from Yin Fengwei to reduce the lock contention [1].
So it doesn't seem urgent to me to get the "batch zap" change in. I'll add it to
my list, then prioritize it against the other stuff.

[1] https://lore.kernel.org/linux-mm/20230429082759.1600796-1-fengwei.yin@intel.com/

>
> Best Regards,
> Huang, Ying

^ permalink raw reply	[flat|nested] 167+ messages in thread
* Re: [PATCH v2 2/5] mm: Allow deferred splitting of arbitrary large anon folios
  2023-07-10  9:39   ` Ryan Roberts
@ 2023-07-11  1:56     ` Huang, Ying
  0 siblings, 0 replies; 167+ messages in thread
From: Huang, Ying @ 2023-07-11  1:56 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei,
	David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm

Ryan Roberts <ryan.roberts@arm.com> writes:

> On 10/07/2023 10:01, Huang, Ying wrote:
>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>
>>> On 10/07/2023 06:37, Huang, Ying wrote:
>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>>
>>>>> Somehow I managed to reply only to the linux-arm-kernel list on first
>>>>> attempt so resending:
>>>>>
>>>>> On 07/07/2023 09:21, Huang, Ying wrote:
>>>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>>>>
>>>>>>> With the introduction of large folios for anonymous memory, we would
>>>>>>> like to be able to split them when they have unmapped subpages, in order
>>>>>>> to free those unused pages under memory pressure. So remove the
>>>>>>> artificial requirement that the large folio needed to be at least
>>>>>>> PMD-sized.
>>>>>>>
>>>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>> Reviewed-by: Yu Zhao <yuzhao@google.com>
>>>>>>> Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>
>>>>>>> ---
>>>>>>>  mm/rmap.c | 2 +-
>>>>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>>>> index 82ef5ba363d1..bbcb2308a1c5 100644
>>>>>>> --- a/mm/rmap.c
>>>>>>> +++ b/mm/rmap.c
>>>>>>> @@ -1474,7 +1474,7 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
>>>>>>>  	 * page of the folio is unmapped and at least one page
>>>>>>>  	 * is still mapped.
>>>>>>>  	 */
>>>>>>> -	if (folio_test_pmd_mappable(folio) && folio_test_anon(folio))
>>>>>>> +	if (folio_test_large(folio) && folio_test_anon(folio))
>>>>>>>  		if (!compound || nr < nr_pmdmapped)
>>>>>>>  			deferred_split_folio(folio);
>>>>>>>  }
>>>>>>
>>>>>> One possible issue is that even for large folios mapped only in one
>>>>>> process, in zap_pte_range(), we will always call deferred_split_folio()
>>>>>> unnecessarily before freeing a large folio.
>>>>>
>>>>> Hi Huang, thanks for reviewing!
>>>>>
>>>>> I have a patch that solves this problem by determining a range of ptes
>>>>> covered by a single folio and doing a "batch zap". This prevents the need
>>>>> to add the folio to the deferred split queue, only to remove it again
>>>>> shortly afterwards. This reduces lock contention and I can measure a
>>>>> performance improvement for the kernel compilation benchmark. See [1].
>>>>>
>>>>> However, I decided to remove it from this patch set on Yu Zhao's advice.
>>>>> We are aiming for the minimal patch set to start with and wanted to focus
>>>>> people on that. I intend to submit it separately later on.
>>>>>
>>>>> [1] https://lore.kernel.org/linux-mm/20230626171430.3167004-8-ryan.roberts@arm.com/
>>>>
>>>> Thanks for the information! "batch zap" can solve the problem.
>>>>
>>>> And, I agree with Matthew's comments in the following email about fixing
>>>> the large folios interaction issues before merging the patches to
>>>> allocate large folios.
>>>>
>>>> https://lore.kernel.org/linux-mm/ZKVdUDuwNWDUCWc5@casper.infradead.org/
>>>>
>>>> If so, we don't need to introduce the above problem or a large patchset.
>>>
>>> I appreciate Matthew's and others' position about not wanting to merge a
>>> minimal implementation while there are some fundamental features (e.g.
>>> compaction) it doesn't play well with - I'm working to create a definitive
>>> list so these items can be tracked and tackled.
>>
>> Good to know this, thanks!
>>
>>> That said, I don't see this "batch zap" patch as an example of this. It's
>>> just a performance enhancement that improves things even further than large
>>> anon folios on their own. I'd rather concentrate on the core changes first,
>>> then deal with this type of thing later. Does that work for you?
>>
>> IIUC, allocating large folios upon page fault depends on splitting large
>> folios in page_remove_rmap() to avoid memory wastage. Splitting large
>> folios in page_remove_rmap() depends on "batch zap" to avoid a performance
>> regression in zap_pte_range(). So we need them to be done earlier. Or did
>> I miss something?

> My point was just that large anon folios improve performance significantly
> overall, despite a small perf regression in zap_pte_range(). That regression
> is reduced further by a patch from Yin Fengwei to reduce the lock contention
> [1]. So it doesn't seem urgent to me to get the "batch zap" change in.

I don't think Fengwei's patch will help much here, because that patch
optimizes the case where the folio isn't on the deferred split queue,
whereas now the folio will be put on the deferred split queue.

And I don't think allocating large folios upon page fault is more urgent.
We should avoid the regression if possible.

> I'll add it to my list, then prioritize it against the other stuff.
>
> [1] https://lore.kernel.org/linux-mm/20230429082759.1600796-1-fengwei.yin@intel.com/

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 167+ messages in thread
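The one-line change under discussion widens the deferred-split condition from PMD-mappable folios to any large anon folio. The sketch below is a toy model of that condition in plain C: `struct folio_model` and both predicate names are invented for illustration and are not kernel API, and the PMD order of 9 assumes 4K base pages.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-in for the kernel's folio state; not real kernel API. */
struct folio_model {
	int order;	/* folio spans 2^order base pages */
	bool anon;	/* anonymous memory? */
};

/* Old condition: only PMD-mappable folios (order-9 with 4K pages) qualify. */
static bool old_wants_deferred_split(const struct folio_model *f)
{
	const int pmd_order = 9;
	return f->order >= pmd_order && f->anon;
}

/* New condition: any large (order > 0) anon folio qualifies. */
static bool new_wants_deferred_split(const struct folio_model *f)
{
	return f->order > 0 && f->anon;
}
```

With the old check, a partially unmapped order-4 (64K with 4K pages) anon folio never qualifies for the deferred split queue, so its unused subpages cannot be reclaimed under memory pressure; with the new check it does, which is exactly what motivates the trade-off Huang Ying raises for zap_pte_range().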
* [PATCH v2 3/5] mm: Default implementation of arch_wants_pte_order()
  2023-07-03 13:53 ` Ryan Roberts
@ 2023-07-03 13:53   ` Ryan Roberts
  0 siblings, 0 replies; 167+ messages in thread
From: Ryan Roberts @ 2023-07-03 13:53 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei,
	David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi
  Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm

arch_wants_pte_order() can be overridden by the arch to return the
preferred folio order for pte-mapped memory. This is useful as some
architectures (e.g. arm64) can coalesce TLB entries when the physical
memory is suitably contiguous.

The first user for this hint will be FLEXIBLE_THP, which aims to
allocate large folios for anonymous memory to reduce page faults and
other per-page operation costs.

Here we add the default implementation of the function, used when the
architecture does not define it, which returns the order corresponding
to 64K.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/pgtable.h | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index a661a17173fa..f7e38598f20b 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -13,6 +13,7 @@
 #include <linux/errno.h>
 #include <asm-generic/pgtable_uffd.h>
 #include <linux/page_table_check.h>
+#include <linux/sizes.h>

 #if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \
	defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS
@@ -336,6 +337,18 @@ static inline bool arch_has_hw_pte_young(void)
 }
 #endif

+#ifndef arch_wants_pte_order
+/*
+ * Returns preferred folio order for pte-mapped memory. Must be in range [0,
+ * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
+ * to be at least order-2.
+ */
+static inline int arch_wants_pte_order(struct vm_area_struct *vma)
+{
+	return ilog2(SZ_64K >> PAGE_SHIFT);
+}
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
				       unsigned long address,
--
2.25.1

^ permalink raw reply related	[flat|nested] 167+ messages in thread
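The default above evaluates to the order whose folio size is 64K for the build's base page size. A standalone sketch of that arithmetic in plain C (`ilog2_u` here is a simplified stand-in for the kernel's ilog2() macro, valid for the power-of-two inputs used):

```c
#include <assert.h>

/* Simplified integer log2 for powers of two; stand-in for the kernel's ilog2(). */
static int ilog2_u(unsigned long v)
{
	int r = 0;
	while (v >>= 1)
		r++;
	return r;
}

/* What the default arch_wants_pte_order() returns for a given PAGE_SHIFT. */
static int default_pte_order(int page_shift)
{
	const unsigned long sz_64k = 64UL * 1024;	/* SZ_64K */
	return ilog2_u(sz_64k >> page_shift);
}
```

For 4K pages (PAGE_SHIFT 12) this gives order 4 (16 pages, the arm64 contpte block), for 16K pages order 2, and for 64K pages order 0 - none of which is the forbidden order-1 called out in the comment.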
* Re: [PATCH v2 3/5] mm: Default implementation of arch_wants_pte_order()
  2023-07-03 13:53   ` Ryan Roberts
@ 2023-07-03 19:50     ` Yu Zhao
  0 siblings, 0 replies; 167+ messages in thread
From: Yu Zhao @ 2023-07-03 19:50 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei,
	David Hildenbrand, Catalin Marinas, Will Deacon, Anshuman Khandual,
	Yang Shi, linux-arm-kernel, linux-kernel, linux-mm

On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> arch_wants_pte_order() can be overridden by the arch to return the
> preferred folio order for pte-mapped memory. This is useful as some
> architectures (e.g. arm64) can coalesce TLB entries when the physical
> memory is suitably contiguous.
>
> The first user for this hint will be FLEXIBLE_THP, which aims to
> allocate large folios for anonymous memory to reduce page faults and
> other per-page operation costs.
>
> Here we add the default implementation of the function, used when the
> architecture does not define it, which returns the order corresponding
> to 64K.

I don't really mind a non-zero default value. But people would ask why
non-zero and why 64KB. Probably you could argue this is the largest size
all known archs support if they have TLB coalescing. For x86, AMD CPUs
would want to override this. I'll leave it to Fengwei to decide
whether Intel wants a different default value.

Also I don't like the vma parameter because it makes
arch_wants_pte_order() a mix of hw preference and vma policy. From my
POV, the function should be only about the former; the latter should
be decided by arch-independent MM code. However, I can live with it if
ARM MM people think this is really what you want. ATM, I'm skeptical
they do.

> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>

After another CPU vendor, e.g., Fengwei, and an ARM MM person, e.g.,
Will, give the green light:
Reviewed-by: Yu Zhao <yuzhao@google.com>

> ---
>  include/linux/pgtable.h | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index a661a17173fa..f7e38598f20b 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -13,6 +13,7 @@
>  #include <linux/errno.h>
>  #include <asm-generic/pgtable_uffd.h>
>  #include <linux/page_table_check.h>
> +#include <linux/sizes.h>
>
>  #if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \
>         defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS
> @@ -336,6 +337,18 @@ static inline bool arch_has_hw_pte_young(void)
>  }
>  #endif
>
> +#ifndef arch_wants_pte_order
> +/*
> + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios

The warning is helpful.

> + * to be at least order-2.
> + */
> +static inline int arch_wants_pte_order(struct vm_area_struct *vma)
> +{
> +       return ilog2(SZ_64K >> PAGE_SHIFT);
> +}
> +#endif
> +
>  #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
>  static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>                                         unsigned long address,

^ permalink raw reply	[flat|nested] 167+ messages in thread
* Re: [PATCH v2 3/5] mm: Default implementation of arch_wants_pte_order()
  2023-07-03 19:50     ` Yu Zhao
@ 2023-07-04 13:20       ` Ryan Roberts
  0 siblings, 0 replies; 167+ messages in thread
From: Ryan Roberts @ 2023-07-04 13:20 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei,
	David Hildenbrand, Catalin Marinas, Will Deacon, Anshuman Khandual,
	Yang Shi, linux-arm-kernel, linux-kernel, linux-mm

On 03/07/2023 20:50, Yu Zhao wrote:
> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> arch_wants_pte_order() can be overridden by the arch to return the
>> preferred folio order for pte-mapped memory. This is useful as some
>> architectures (e.g. arm64) can coalesce TLB entries when the physical
>> memory is suitably contiguous.
>>
>> The first user for this hint will be FLEXIBLE_THP, which aims to
>> allocate large folios for anonymous memory to reduce page faults and
>> other per-page operation costs.
>>
>> Here we add the default implementation of the function, used when the
>> architecture does not define it, which returns the order corresponding
>> to 64K.
>
> I don't really mind a non-zero default value. But people would ask why
> non-zero and why 64KB. Probably you could argue this is the largest size
> all known archs support if they have TLB coalescing. For x86, AMD CPUs
> would want to override this. I'll leave it to Fengwei to decide
> whether Intel wants a different default value.
>
> Also I don't like the vma parameter because it makes
> arch_wants_pte_order() a mix of hw preference and vma policy. From my
> POV, the function should be only about the former; the latter should
> be decided by arch-independent MM code. However, I can live with it if
> ARM MM people think this is really what you want. ATM, I'm skeptical
> they do.

Here's the big picture for what I'm trying to achieve:

- In the common case, I'd like all programs to get a performance bump by
  automatically and transparently using large anon folios - so no explicit
  requirement on the process to opt-in.

- On arm64, in the above case, I'd like the preferred folio size to be 64K;
  from the (admittedly limited) testing I've done that's about where the
  performance knee is and it doesn't appear to increase the memory wastage
  very much. It also has the benefits that for 4K base pages this is the
  contpte size (order-4) so I can take full benefit of contpte mappings
  transparently to the process. And for 16K this is the HPA size (order-2).

- On arm64 when the process has marked the VMA for THP (or when
  transparent_hugepage=always) but the VMA does not meet the requirements
  for a PMD-sized mapping (or we failed to allocate, ...) then I'd like to
  map using contpte. For 4K base pages this is 64K (order-4), for 16K this
  is 2M (order-7) and for 64K this is 2M (order-5). The 64K base page case
  is very important since the PMD size for that base page is 512MB, which
  is almost impossible to allocate in practice.

So one approach would be to define arch_wants_pte_order() as always
returning the contpte size (remove the vma parameter). Then
max_anon_folio_order() in memory.c could do this:

#define MAX_ANON_FOLIO_ORDER_NOTHP	ilog2(SZ_64K >> PAGE_SHIFT);

static inline int max_anon_folio_order(struct vm_area_struct *vma)
{
	int order = arch_wants_pte_order();

	// Fix up default case which returns 0 because PAGE_ALLOC_COSTLY_ORDER
	// can't be used directly in pgtable.h
	order = order ? order : PAGE_ALLOC_COSTLY_ORDER;

	if (hugepage_vma_check(vma, vma->vm_flags, false, true, true))
		return order;
	else
		return min(order, MAX_ANON_FOLIO_ORDER_NOTHP);
}

This moves the SW policy into memory.c and gives you PAGE_ALLOC_COSTLY_ORDER
(or whatever default we decide on) as the default for arches with no
override, and also meets all my goals above.

>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>
> After another CPU vendor, e.g., Fengwei, and an ARM MM person, e.g.,
> Will, give the green light:
> Reviewed-by: Yu Zhao <yuzhao@google.com>
>
>> ---
>>  include/linux/pgtable.h | 13 +++++++++++++
>>  1 file changed, 13 insertions(+)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index a661a17173fa..f7e38598f20b 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -13,6 +13,7 @@
>>  #include <linux/errno.h>
>>  #include <asm-generic/pgtable_uffd.h>
>>  #include <linux/page_table_check.h>
>> +#include <linux/sizes.h>
>>
>>  #if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \
>>         defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS
>> @@ -336,6 +337,18 @@ static inline bool arch_has_hw_pte_young(void)
>>  }
>>  #endif
>>
>> +#ifndef arch_wants_pte_order
>> +/*
>> + * Returns preferred folio order for pte-mapped memory. Must be in range [0,
>> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios
>
> The warning is helpful.
>
>> + * to be at least order-2.
>> + */
>> +static inline int arch_wants_pte_order(struct vm_area_struct *vma)
>> +{
>> +       return ilog2(SZ_64K >> PAGE_SHIFT);
>> +}
>> +#endif
>> +
>>  #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
>>  static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>>                                         unsigned long address,

^ permalink raw reply	[flat|nested] 167+ messages in thread
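The contpte orders Ryan quotes follow directly from order = log2(contpte_size / page_size). A small standalone check of that arithmetic in plain C (illustrative only - the real contpte geometry comes from the arm64 translation table format, not from this helper):

```c
#include <assert.h>

/* Simplified integer log2 for powers of two; stand-in for the kernel's ilog2(). */
static int ilog2_u(unsigned long v)
{
	int r = 0;
	while (v >>= 1)
		r++;
	return r;
}

/* Folio order such that 2^order base pages span one contpte block. */
static int contpte_order(unsigned long contpte_size, unsigned long page_size)
{
	return ilog2_u(contpte_size / page_size);
}
```

64K/4K gives order 4, 2M/16K gives order 7, and 2M/64K gives order 5, matching the figures in the message above.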
* Re: [PATCH v2 3/5] mm: Default implementation of arch_wants_pte_order()
  2023-07-04 13:20       ` Ryan Roberts
@ 2023-07-05  2:07         ` Yu Zhao
  0 siblings, 0 replies; 167+ messages in thread
From: Yu Zhao @ 2023-07-05  2:07 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei,
	David Hildenbrand, Catalin Marinas, Will Deacon, Anshuman Khandual,
	Yang Shi, linux-arm-kernel, linux-kernel, linux-mm

On Tue, Jul 4, 2023 at 7:20 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 03/07/2023 20:50, Yu Zhao wrote:
> > On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> arch_wants_pte_order() can be overridden by the arch to return the
> >> preferred folio order for pte-mapped memory. This is useful as some
> >> architectures (e.g. arm64) can coalesce TLB entries when the physical
> >> memory is suitably contiguous.
> >>
> >> The first user for this hint will be FLEXIBLE_THP, which aims to
> >> allocate large folios for anonymous memory to reduce page faults and
> >> other per-page operation costs.
> >>
> >> Here we add the default implementation of the function, used when the
> >> architecture does not define it, which returns the order corresponding
> >> to 64K.
> >
> > I don't really mind a non-zero default value. But people would ask why
> > non-zero and why 64KB. Probably you could argue this is the largest size
> > all known archs support if they have TLB coalescing. For x86, AMD CPUs
> > would want to override this. I'll leave it to Fengwei to decide
> > whether Intel wants a different default value.
> >
> > Also I don't like the vma parameter because it makes
> > arch_wants_pte_order() a mix of hw preference and vma policy. From my
> > POV, the function should be only about the former; the latter should
> > be decided by arch-independent MM code. However, I can live with it if
> > ARM MM people think this is really what you want. ATM, I'm skeptical
> > they do.
>
> Here's the big picture for what I'm trying to achieve:
>
> - In the common case, I'd like all programs to get a performance bump by
>   automatically and transparently using large anon folios - so no explicit
>   requirement on the process to opt-in.

We all agree on this :)

> - On arm64, in the above case, I'd like the preferred folio size to be 64K;
>   from the (admittedly limited) testing I've done that's about where the
>   performance knee is and it doesn't appear to increase the memory wastage
>   very much. It also has the benefits that for 4K base pages this is the
>   contpte size (order-4) so I can take full benefit of contpte mappings
>   transparently to the process. And for 16K this is the HPA size (order-2).

My highest priority is to get 16KB proven first because it would benefit
both client and server devices. So it may be different from yours but I
don't see any conflict.

> - On arm64 when the process has marked the VMA for THP (or when
>   transparent_hugepage=always) but the VMA does not meet the requirements
>   for a PMD-sized mapping (or we failed to allocate, ...) then I'd like to
>   map using contpte. For 4K base pages this is 64K (order-4), for 16K this
>   is 2M (order-7) and for 64K this is 2M (order-5). The 64K base page case
>   is very important since the PMD size for that base page is 512MB, which
>   is almost impossible to allocate in practice.

Which case (server or client) are you focusing on here? For our client
devices, I can confidently say that 64KB has to be after 16KB, if it
happens at all. For servers in general, I don't know of any major
memory-intensive workloads that are not THP-aware, i.e., I don't think
"VMA does not meet the requirements" is a concern.

^ permalink raw reply	[flat|nested] 167+ messages in thread
* Re: [PATCH v2 3/5] mm: Default implementation of arch_wants_pte_order()
  2023-07-05 2:07 ` Yu Zhao
@ 2023-07-05 9:11 ` Ryan Roberts
  0 siblings, 0 replies; 167+ messages in thread
From: Ryan Roberts @ 2023-07-05 9:11 UTC (permalink / raw)
To: Yu Zhao
Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei, David Hildenbrand, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm

On 05/07/2023 03:07, Yu Zhao wrote:
> On Tue, Jul 4, 2023 at 7:20 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 03/07/2023 20:50, Yu Zhao wrote:
>>> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> arch_wants_pte_order() can be overridden by the arch to return the preferred folio order for pte-mapped memory. This is useful as some architectures (e.g. arm64) can coalesce TLB entries when the physical memory is suitably contiguous.
>>>>
>>>> The first user for this hint will be FLEXIBLE_THP, which aims to allocate large folios for anonymous memory to reduce page faults and other per-page operation costs.
>>>>
>>>> Here we add the default implementation of the function, used when the architecture does not define it, which returns the order corresponding to 64K.
>>>
>>> I don't really mind a non-zero default value. But people would ask why non-zero and why 64KB. Probably you could argue this is the largest size all known archs support if they have TLB coalescing. For x86, AMD CPUs would want to override this. I'll leave it to Fengwei to decide whether Intel wants a different default value.
>>>
>>> Also I don't like the vma parameter because it makes arch_wants_pte_order() a mix of hw preference and vma policy. From my POV, the function should be only about the former; the latter should be decided by arch-independent MM code. However, I can live with it if ARM MM people think this is really what you want. ATM, I'm skeptical they do.
>>
>> Here's the big picture for what I'm trying to achieve:
>>
>> - In the common case, I'd like all programs to get a performance bump by automatically and transparently using large anon folios - so no explicit requirement on the process to opt-in.
>
> We all agree on this :)
>
>> - On arm64, in the above case, I'd like the preferred folio size to be 64K; from the (admittedly limited) testing I've done that's about where the performance knee is and it doesn't appear to increase the memory wastage very much. It also has the benefits that for 4K base pages this is the contpte size (order-4) so I can take full benefit of contpte mappings transparently to the process. And for 16K this is the HPA size (order-2).
>
> My highest priority is to get 16KB proven first because it would benefit both client and server devices. So it may be different from yours but I don't see any conflict.

Do you mean 16K folios on a 4K base page system, or large folios on a 16K base page system? I thought your focus was on speeding up 4K base page client systems but this statement has got me wondering?

>
>> - On arm64 when the process has marked the VMA for THP (or when transparent_hugepage=always) but the VMA does not meet the requirements for a PMD-sized mapping (or we failed to allocate, ...) then I'd like to map using contpte. For 4K base pages this is 64K (order-4), for 16K this is 2M (order-7) and for 64K this is 2M (order-5). The 64K base page case is very important since the PMD size for that base page is 512MB which is almost impossible to allocate in practice.
>
> Which case (server or client) are you focusing on here? For our client devices, I can confidently say that 64KB has to be after 16KB, if it happens at all. For servers in general, I don't know of any major memory-intensive workloads that are not THP-aware, i.e., I don't think "VMA does not meet the requirements" is a concern.

For the 64K base page case, the focus is server. The problem reported by our partner is that the 512M huge page size is too big to reliably allocate and so the faults always fall back to 64K base pages in practice. I would also speculate (happy to be proved wrong) that there are many THP-aware workloads that assume the THP size is 2M. In this case, their VMAs may well be too small to fit a 512M huge page when running on a 64K base page system.

But the TL;DR is that Arm has a partner for which enabling 2M THP on a 64K base page system is a very real requirement. Our intent is that this will be the mechanism we use to enable it.
* Re: [PATCH v2 3/5] mm: Default implementation of arch_wants_pte_order()
  2023-07-05 9:11 ` Ryan Roberts
@ 2023-07-05 17:24 ` Yu Zhao
  0 siblings, 0 replies; 167+ messages in thread
From: Yu Zhao @ 2023-07-05 17:24 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei, David Hildenbrand, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm

On Wed, Jul 5, 2023 at 3:11 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 05/07/2023 03:07, Yu Zhao wrote:
> > On Tue, Jul 4, 2023 at 7:20 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 03/07/2023 20:50, Yu Zhao wrote:
> >>> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>
> >>>> arch_wants_pte_order() can be overridden by the arch to return the preferred folio order for pte-mapped memory. This is useful as some architectures (e.g. arm64) can coalesce TLB entries when the physical memory is suitably contiguous.
> >>>>
> >>>> The first user for this hint will be FLEXIBLE_THP, which aims to allocate large folios for anonymous memory to reduce page faults and other per-page operation costs.
> >>>>
> >>>> Here we add the default implementation of the function, used when the architecture does not define it, which returns the order corresponding to 64K.
> >>>
> >>> I don't really mind a non-zero default value. But people would ask why non-zero and why 64KB. Probably you could argue this is the largest size all known archs support if they have TLB coalescing. For x86, AMD CPUs would want to override this. I'll leave it to Fengwei to decide whether Intel wants a different default value.
> >>>
> >>> Also I don't like the vma parameter because it makes arch_wants_pte_order() a mix of hw preference and vma policy. From my POV, the function should be only about the former; the latter should be decided by arch-independent MM code. However, I can live with it if ARM MM people think this is really what you want. ATM, I'm skeptical they do.
> >>
> >> Here's the big picture for what I'm trying to achieve:
> >>
> >> - In the common case, I'd like all programs to get a performance bump by automatically and transparently using large anon folios - so no explicit requirement on the process to opt-in.
> >
> > We all agree on this :)
> >
> >> - On arm64, in the above case, I'd like the preferred folio size to be 64K; from the (admittedly limited) testing I've done that's about where the performance knee is and it doesn't appear to increase the memory wastage very much. It also has the benefits that for 4K base pages this is the contpte size (order-4) so I can take full benefit of contpte mappings transparently to the process. And for 16K this is the HPA size (order-2).
> >
> > My highest priority is to get 16KB proven first because it would benefit both client and server devices. So it may be different from yours but I don't see any conflict.
>
> Do you mean 16K folios on a 4K base page system

Yes.

> or large folios on a 16K base page system? I thought your focus was on speeding up 4K base page client systems but this statement has got me wondering?

Sorry, I should have said 4x4KB.

> >> - On arm64 when the process has marked the VMA for THP (or when transparent_hugepage=always) but the VMA does not meet the requirements for a PMD-sized mapping (or we failed to allocate, ...) then I'd like to map using contpte. For 4K base pages this is 64K (order-4), for 16K this is 2M (order-7) and for 64K this is 2M (order-5). The 64K base page case is very important since the PMD size for that base page is 512MB which is almost impossible to allocate in practice.
> >
> > Which case (server or client) are you focusing on here? For our client devices, I can confidently say that 64KB has to be after 16KB, if it happens at all. For servers in general, I don't know of any major memory-intensive workloads that are not THP-aware, i.e., I don't think "VMA does not meet the requirements" is a concern.
>
> For the 64K base page case, the focus is server. The problem reported by our partner is that the 512M huge page size is too big to reliably allocate and so the faults always fall back to 64K base pages in practice. I would also speculate (happy to be proved wrong) that there are many THP-aware workloads that assume the THP size is 2M. In this case, their VMAs may well be too small to fit a 512M huge page when running on a 64K base page system.

Interesting. When you have something ready to share, I might be able to try it on our ARM servers as well.

> But the TL;DR is that Arm has a partner for which enabling 2M THP on a 64K base page system is a very real requirement. Our intent is that this will be the mechanism we use to enable it.

Yes, contpte makes more sense for what you described. It'd fit in a lot better in the hugetlb case, but I guess your partner uses anon.
* Re: [PATCH v2 3/5] mm: Default implementation of arch_wants_pte_order()
  2023-07-05 17:24 ` Yu Zhao
@ 2023-07-05 18:01 ` Ryan Roberts
  0 siblings, 0 replies; 167+ messages in thread
From: Ryan Roberts @ 2023-07-05 18:01 UTC (permalink / raw)
To: Yu Zhao
Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei, David Hildenbrand, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm

On 05/07/2023 18:24, Yu Zhao wrote:
> On Wed, Jul 5, 2023 at 3:11 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 05/07/2023 03:07, Yu Zhao wrote:
>>> On Tue, Jul 4, 2023 at 7:20 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> On 03/07/2023 20:50, Yu Zhao wrote:
>>>>> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>
>>>>>> arch_wants_pte_order() can be overridden by the arch to return the preferred folio order for pte-mapped memory. This is useful as some architectures (e.g. arm64) can coalesce TLB entries when the physical memory is suitably contiguous.
>>>>>>
>>>>>> The first user for this hint will be FLEXIBLE_THP, which aims to allocate large folios for anonymous memory to reduce page faults and other per-page operation costs.
>>>>>>
>>>>>> Here we add the default implementation of the function, used when the architecture does not define it, which returns the order corresponding to 64K.
>>>>>
>>>>> I don't really mind a non-zero default value. But people would ask why non-zero and why 64KB. Probably you could argue this is the largest size all known archs support if they have TLB coalescing. For x86, AMD CPUs would want to override this. I'll leave it to Fengwei to decide whether Intel wants a different default value.
>>>>>
>>>>> Also I don't like the vma parameter because it makes arch_wants_pte_order() a mix of hw preference and vma policy. From my POV, the function should be only about the former; the latter should be decided by arch-independent MM code. However, I can live with it if ARM MM people think this is really what you want. ATM, I'm skeptical they do.
>>>>
>>>> Here's the big picture for what I'm trying to achieve:
>>>>
>>>> - In the common case, I'd like all programs to get a performance bump by automatically and transparently using large anon folios - so no explicit requirement on the process to opt-in.
>>>
>>> We all agree on this :)
>>>
>>>> - On arm64, in the above case, I'd like the preferred folio size to be 64K; from the (admittedly limited) testing I've done that's about where the performance knee is and it doesn't appear to increase the memory wastage very much. It also has the benefits that for 4K base pages this is the contpte size (order-4) so I can take full benefit of contpte mappings transparently to the process. And for 16K this is the HPA size (order-2).
>>>
>>> My highest priority is to get 16KB proven first because it would benefit both client and server devices. So it may be different from yours but I don't see any conflict.
>>
>> Do you mean 16K folios on a 4K base page system
>
> Yes.
>
>> or large folios on a 16K base page system? I thought your focus was on speeding up 4K base page client systems but this statement has got me wondering?
>
> Sorry, I should have said 4x4KB.

OK. Be aware that a number of Arm CPUs that support HPA don't have it enabled by default (or at least don't have it enabled in the mode you would want in order to see the best performance with large anon folios). You would need EL3 access to reconfigure it.

>
>>>> - On arm64 when the process has marked the VMA for THP (or when transparent_hugepage=always) but the VMA does not meet the requirements for a PMD-sized mapping (or we failed to allocate, ...) then I'd like to map using contpte. For 4K base pages this is 64K (order-4), for 16K this is 2M (order-7) and for 64K this is 2M (order-5). The 64K base page case is very important since the PMD size for that base page is 512MB which is almost impossible to allocate in practice.
>>>
>>> Which case (server or client) are you focusing on here? For our client devices, I can confidently say that 64KB has to be after 16KB, if it happens at all. For servers in general, I don't know of any major memory-intensive workloads that are not THP-aware, i.e., I don't think "VMA does not meet the requirements" is a concern.
>>
>> For the 64K base page case, the focus is server. The problem reported by our partner is that the 512M huge page size is too big to reliably allocate and so the faults always fall back to 64K base pages in practice. I would also speculate (happy to be proved wrong) that there are many THP-aware workloads that assume the THP size is 2M. In this case, their VMAs may well be too small to fit a 512M huge page when running on a 64K base page system.
>
> Interesting. When you have something ready to share, I might be able to try it on our ARM servers as well.

That would be really helpful. I'm currently updating my branch that collates everything to reflect the review comments in this patch set and the contpte patch set. I'll share it in a couple of weeks.

>
>> But the TL;DR is that Arm has a partner for which enabling 2M THP on a 64K base page system is a very real requirement. Our intent is that this will be the mechanism we use to enable it.
>
> Yes, contpte makes more sense for what you described. It'd fit in a lot better in the hugetlb case, but I guess your partner uses anon.

arm64 already supports contpte for hugetlb, but they need it to work with anon memory using THP.
* Re: [PATCH v2 3/5] mm: Default implementation of arch_wants_pte_order() @ 2023-07-05 18:01 ` Ryan Roberts 0 siblings, 0 replies; 167+ messages in thread From: Ryan Roberts @ 2023-07-05 18:01 UTC (permalink / raw) To: Yu Zhao Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei, David Hildenbrand, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm On 05/07/2023 18:24, Yu Zhao wrote: > On Wed, Jul 5, 2023 at 3:11 AM Ryan Roberts <ryan.roberts@arm.com> wrote: >> >> On 05/07/2023 03:07, Yu Zhao wrote: >>> On Tue, Jul 4, 2023 at 7:20 AM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>> >>>> On 03/07/2023 20:50, Yu Zhao wrote: >>>>> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>>>> >>>>>> arch_wants_pte_order() can be overridden by the arch to return the >>>>>> preferred folio order for pte-mapped memory. This is useful as some >>>>>> architectures (e.g. arm64) can coalesce TLB entries when the physical >>>>>> memory is suitably contiguous. >>>>>> >>>>>> The first user for this hint will be FLEXIBLE_THP, which aims to >>>>>> allocate large folios for anonymous memory to reduce page faults and >>>>>> other per-page operation costs. >>>>>> >>>>>> Here we add the default implementation of the function, used when the >>>>>> architecture does not define it, which returns the order corresponding >>>>>> to 64K. >>>>> >>>>> I don't really mind a non-zero default value. But people would ask why >>>>> non-zero and why 64KB. Probably you could argue this is the large size >>>>> all known archs support if they have TLB coalescing. For x86, AMD CPUs >>>>> would want to override this. I'll leave it to Fengwei to decide >>>>> whether Intel wants a different default value.> >>>>> Also I don't like the vma parameter because it makes >>>>> arch_wants_pte_order() a mix of hw preference and vma policy. 
From my >>>>> POV, the function should be only about the former; the latter should >>>>> be decided by arch-independent MM code. However, I can live with it if >>>>> ARM MM people think this is really what you want. ATM, I'm skeptical >>>>> they do. >>>> >>>> Here's the big picture for what I'm tryng to achieve: >>>> >>>> - In the common case, I'd like all programs to get a performance bump by >>>> automatically and transparently using large anon folios - so no explicit >>>> requirement on the process to opt-in. >>> >>> We all agree on this :) >>> >>>> - On arm64, in the above case, I'd like the preferred folio size to be 64K; >>>> from the (admittedly limitted) testing I've done that's about where the >>>> performance knee is and it doesn't appear to increase the memory wastage very >>>> much. It also has the benefits that for 4K base pages this is the contpte size >>>> (order-4) so I can take full benefit of contpte mappings transparently to the >>>> process. And for 16K this is the HPA size (order-2). >>> >>> My highest priority is to get 16KB proven first because it would >>> benefit both client and server devices. So it may be different from >>> yours but I don't see any conflict. >> >> Do you mean 16K folios on a 4K base page system > > Yes. > >> or large folios on a 16K base >> page system? I thought your focus was on speeding up 4K base page client systems >> but this statement has got me wondering? > > Sorry, I should have said 4x4KB. OK. Be aware that a number of Arm CPUs that support HPA don't have it enabled by default (or at least don't have it enabled in the mode that you would want it to see best performance with large anon folios). You would need EL3 access to reconfigure it. > >>>> - On arm64 when the process has marked the VMA for THP (or when >>>> transparent_hugepage=always) but the VMA does not meet the requirements for a >>>> PMD-sized mapping (or we failed to allocate, ...) then I'd like to map using >>>> contpte. 
For 4K base pages this is 64K (order-4), for 16K this is 2M (order-7) >>>> and for 64K this is 2M (order-5). The 64K base page case is very important since >>>> the PMD size for that base page is 512MB which is almost impossible to allocate >>>> in practice. >>> >>> Which case (server or client) are you focusing on here? For our client >>> devices, I can confidently say that 64KB has to be after 16KB, if it >>> happens at all. For servers in general, I don't know of any major >>> memory-intensive workloads that are not THP-aware, i.e., I don't think >>> "VMA does not meet the requirements" is a concern. >> >> For the 64K base page case, the focus is server. The problem reported by our >> partner is that the 512M huge page size is too big to reliably allocate and so >> the faults always fall back to 64K base pages in practice. I would also speculate >> (happy to be proved wrong) that there are many THP-aware workloads that assume >> the THP size is 2M. In this case, their VMAs may well be too small to fit a 512M >> huge page when running on a 64K base page system. > > Interesting. When you have something ready to share, I might be able > to try it on our ARM servers as well. That would be really helpful. I'm currently updating my branch that collates everything to reflect the review comments in this patch set and the contpte patch set. I'll share it in a couple of weeks. > >> But the TL;DR is that Arm has a partner for which enabling 2M THP on a 64K base >> page system is a very real requirement. Our intent is that this will be the >> mechanism we use to enable it. > > Yes, contpte makes more sense for what you described. It'd fit in a > lot better in the hugetlb case, but I guess your partner uses anon. arm64 already supports contpte for hugetlb, but they need it to work with anon memory using THP.
_______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 167+ messages in thread
* Re: [PATCH v2 3/5] mm: Default implementation of arch_wants_pte_order() 2023-07-05 2:07 ` Yu Zhao @ 2023-07-06 19:33 ` Matthew Wilcox -1 siblings, 0 replies; 167+ messages in thread From: Matthew Wilcox @ 2023-07-06 19:33 UTC (permalink / raw) To: Yu Zhao Cc: Ryan Roberts, Andrew Morton, Kirill A. Shutemov, Yin Fengwei, David Hildenbrand, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm On Tue, Jul 04, 2023 at 08:07:19PM -0600, Yu Zhao wrote: > > - On arm64 when the process has marked the VMA for THP (or when > > transparent_hugepage=always) but the VMA does not meet the requirements for a > > PMD-sized mapping (or we failed to allocate, ...) then I'd like to map using > > contpte. For 4K base pages this is 64K (order-4), for 16K this is 2M (order-7) > > and for 64K this is 2M (order-5). The 64K base page case is very important since > > the PMD size for that base page is 512MB which is almost impossible to allocate > > in practice. > > Which case (server or client) are you focusing on here? For our client > devices, I can confidently say that 64KB has to be after 16KB, if it > happens at all. For servers in general, I don't know of any major > memory-intensive workloads that are not THP-aware, i.e., I don't think > "VMA does not meet the requirements" is a concern. It sounds like you've done some measurements, and I'd like to understand those a bit better. There are a number of factors involved: - A larger page size shrinks the length of the LRU list, so systems which see heavy LRU lock contention benefit more - A larger page size has more internal fragmentation, so we run out of memory and have to do reclaim more often (and maybe workload which used to fit in DRAM now do not) (probably others; i'm not at 100% right now) I think concerns about "allocating lots of order-2 folios makes it harder to allocate order-4 folios" are _probably_ not warranted (without data to prove otherwise). 
All anonymous memory is movable, so our compaction code should be able to create larger order folios. ^ permalink raw reply [flat|nested] 167+ messages in thread
* Re: [PATCH v2 3/5] mm: Default implementation of arch_wants_pte_order() 2023-07-06 19:33 ` Matthew Wilcox @ 2023-07-07 10:00 ` Ryan Roberts -1 siblings, 0 replies; 167+ messages in thread From: Ryan Roberts @ 2023-07-07 10:00 UTC (permalink / raw) To: Matthew Wilcox, Yu Zhao Cc: Andrew Morton, Kirill A. Shutemov, Yin Fengwei, David Hildenbrand, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm On 06/07/2023 20:33, Matthew Wilcox wrote: > On Tue, Jul 04, 2023 at 08:07:19PM -0600, Yu Zhao wrote: >>> - On arm64 when the process has marked the VMA for THP (or when >>> transparent_hugepage=always) but the VMA does not meet the requirements for a >>> PMD-sized mapping (or we failed to allocate, ...) then I'd like to map using >>> contpte. For 4K base pages this is 64K (order-4), for 16K this is 2M (order-7) >>> and for 64K this is 2M (order-5). The 64K base page case is very important since >>> the PMD size for that base page is 512MB which is almost impossible to allocate >>> in practice. >> >> Which case (server or client) are you focusing on here? For our client >> devices, I can confidently say that 64KB has to be after 16KB, if it >> happens at all. For servers in general, I don't know of any major >> memory-intensive workloads that are not THP-aware, i.e., I don't think >> "VMA does not meet the requirements" is a concern. > > It sounds like you've done some measurements, and I'd like to understand > those a bit better. There are a number of factors involved: I'm not sure if that's a question to me or Yu? I haven't personally done any measurements for the 64K base page case. But Arm has a partner that is pushing for this. I'm hoping to see some test results from them posted publicly in the coming weeks. See [1] for more explanation on the rationale. 
[1] https://lore.kernel.org/linux-mm/4d4c45a2-0037-71de-b182-f516fee07e67@arm.com/T/#m8a7c4b71f94224ec3fe6d0a407f48d74c789ba4f > > - A larger page size shrinks the length of the LRU list, so systems > which see heavy LRU lock contention benefit more > - A larger page size has more internal fragmentation, so we run out of > memory and have to do reclaim more often (and maybe workload which > used to fit in DRAM now do not) > (probably others; i'm not at 100% right now) > > I think concerns about "allocating lots of order-2 folios makes it harder > to allocate order-4 folios" are _probably_ not warranted (without data > to prove otherwise). All anonymous memory is movable, so our compaction > code should be able to create larger order folios. > ^ permalink raw reply [flat|nested] 167+ messages in thread
* Re: [PATCH v2 3/5] mm: Default implementation of arch_wants_pte_order() 2023-07-03 13:53 ` Ryan Roberts @ 2023-07-04 2:22 ` Yin, Fengwei -1 siblings, 0 replies; 167+ messages in thread From: Yin, Fengwei @ 2023-07-04 2:22 UTC (permalink / raw) To: Ryan Roberts, Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi Cc: linux-arm-kernel, linux-kernel, linux-mm On 7/3/2023 9:53 PM, Ryan Roberts wrote: > arch_wants_pte_order() can be overridden by the arch to return the > preferred folio order for pte-mapped memory. This is useful as some > architectures (e.g. arm64) can coalesce TLB entries when the physical > memory is suitably contiguous. > > The first user for this hint will be FLEXIBLE_THP, which aims to > allocate large folios for anonymous memory to reduce page faults and > other per-page operation costs. > > Here we add the default implementation of the function, used when the > architecture does not define it, which returns the order corresponding > to 64K. > > Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> > --- > include/linux/pgtable.h | 13 +++++++++++++ > 1 file changed, 13 insertions(+) > > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h > index a661a17173fa..f7e38598f20b 100644 > --- a/include/linux/pgtable.h > +++ b/include/linux/pgtable.h > @@ -13,6 +13,7 @@ > #include <linux/errno.h> > #include <asm-generic/pgtable_uffd.h> > #include <linux/page_table_check.h> > +#include <linux/sizes.h> > > #if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \ > defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS > @@ -336,6 +337,18 @@ static inline bool arch_has_hw_pte_young(void) > } > #endif > > +#ifndef arch_wants_pte_order > +/* > + * Returns preferred folio order for pte-mapped memory. Must be in range [0, > + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios > + * to be at least order-2. 
> + */ > +static inline int arch_wants_pte_order(struct vm_area_struct *vma) > +{ > + return ilog2(SZ_64K >> PAGE_SHIFT); Default value which is not related with any silicon may be: PAGE_ALLOC_COSTLY_ORDER? Also, current pcp list support cache page with order 0...PAGE_ALLOC_COSTLY_ORDER, 9. If the pcp could cover the page, the pressure to zone lock will be reduced by pcp. Regards Yin, Fengwei > +} > +#endif > + > #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR > static inline pte_t ptep_get_and_clear(struct mm_struct *mm, > unsigned long address, ^ permalink raw reply [flat|nested] 167+ messages in thread
* Re: [PATCH v2 3/5] mm: Default implementation of arch_wants_pte_order() 2023-07-04 2:22 ` Yin, Fengwei @ 2023-07-04 3:02 ` Yu Zhao -1 siblings, 0 replies; 167+ messages in thread From: Yu Zhao @ 2023-07-04 3:02 UTC (permalink / raw) To: Yin, Fengwei Cc: Ryan Roberts, Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, David Hildenbrand, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm On Mon, Jul 3, 2023 at 8:23 PM Yin, Fengwei <fengwei.yin@intel.com> wrote: > > > > On 7/3/2023 9:53 PM, Ryan Roberts wrote: > > arch_wants_pte_order() can be overridden by the arch to return the > > preferred folio order for pte-mapped memory. This is useful as some > > architectures (e.g. arm64) can coalesce TLB entries when the physical > > memory is suitably contiguous. > > > > The first user for this hint will be FLEXIBLE_THP, which aims to > > allocate large folios for anonymous memory to reduce page faults and > > other per-page operation costs. > > > > Here we add the default implementation of the function, used when the > > architecture does not define it, which returns the order corresponding > > to 64K. 
> > > > Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> > > --- > > include/linux/pgtable.h | 13 +++++++++++++ > > 1 file changed, 13 insertions(+) > > > > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h > > index a661a17173fa..f7e38598f20b 100644 > > --- a/include/linux/pgtable.h > > +++ b/include/linux/pgtable.h > > @@ -13,6 +13,7 @@ > > #include <linux/errno.h> > > #include <asm-generic/pgtable_uffd.h> > > #include <linux/page_table_check.h> > > +#include <linux/sizes.h> > > > > #if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \ > > defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS > > @@ -336,6 +337,18 @@ static inline bool arch_has_hw_pte_young(void) > > } > > #endif > > > > +#ifndef arch_wants_pte_order > > +/* > > + * Returns preferred folio order for pte-mapped memory. Must be in range [0, > > + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios > > + * to be at least order-2. > > + */ > > +static inline int arch_wants_pte_order(struct vm_area_struct *vma) > > +{ > > + return ilog2(SZ_64K >> PAGE_SHIFT); > Default value which is not related with any silicon may be: PAGE_ALLOC_COSTLY_ORDER? > > Also, current pcp list support cache page with order 0...PAGE_ALLOC_COSTLY_ORDER, 9. > If the pcp could cover the page, the pressure to zone lock will be reduced by pcp. The value of PAGE_ALLOC_COSTLY_ORDER is reasonable but again it's a s/w policy not a h/w preference. Besides, I don't think we can include mmzone.h in pgtable.h. ^ permalink raw reply [flat|nested] 167+ messages in thread
* Re: [PATCH v2 3/5] mm: Default implementation of arch_wants_pte_order() 2023-07-04 3:02 ` Yu Zhao @ 2023-07-04 3:59 ` Yu Zhao -1 siblings, 0 replies; 167+ messages in thread From: Yu Zhao @ 2023-07-04 3:59 UTC (permalink / raw) To: Yin, Fengwei, Ryan Roberts Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, David Hildenbrand, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm On Mon, Jul 3, 2023 at 9:02 PM Yu Zhao <yuzhao@google.com> wrote: > > On Mon, Jul 3, 2023 at 8:23 PM Yin, Fengwei <fengwei.yin@intel.com> wrote: > > > > > > > > On 7/3/2023 9:53 PM, Ryan Roberts wrote: > > > arch_wants_pte_order() can be overridden by the arch to return the > > > preferred folio order for pte-mapped memory. This is useful as some > > > architectures (e.g. arm64) can coalesce TLB entries when the physical > > > memory is suitably contiguous. > > > > > > The first user for this hint will be FLEXIBLE_THP, which aims to > > > allocate large folios for anonymous memory to reduce page faults and > > > other per-page operation costs. > > > > > > Here we add the default implementation of the function, used when the > > > architecture does not define it, which returns the order corresponding > > > to 64K. 
> > > > > > Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> > > > --- > > > include/linux/pgtable.h | 13 +++++++++++++ > > > 1 file changed, 13 insertions(+) > > > > > > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h > > > index a661a17173fa..f7e38598f20b 100644 > > > --- a/include/linux/pgtable.h > > > +++ b/include/linux/pgtable.h > > > @@ -13,6 +13,7 @@ > > > #include <linux/errno.h> > > > #include <asm-generic/pgtable_uffd.h> > > > #include <linux/page_table_check.h> > > > +#include <linux/sizes.h> > > > > > > #if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \ > > > defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS > > > @@ -336,6 +337,18 @@ static inline bool arch_has_hw_pte_young(void) > > > } > > > #endif > > > > > > +#ifndef arch_wants_pte_order > > > +/* > > > + * Returns preferred folio order for pte-mapped memory. Must be in range [0, > > > + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios > > > + * to be at least order-2. > > > + */ > > > +static inline int arch_wants_pte_order(struct vm_area_struct *vma) > > > +{ > > > + return ilog2(SZ_64K >> PAGE_SHIFT); > > Default value which is not related with any silicon may be: PAGE_ALLOC_COSTLY_ORDER? > > > > Also, current pcp list support cache page with order 0...PAGE_ALLOC_COSTLY_ORDER, 9. > > If the pcp could cover the page, the pressure to zone lock will be reduced by pcp. > > The value of PAGE_ALLOC_COSTLY_ORDER is reasonable but again it's a > s/w policy not a h/w preference. Besides, I don't think we can include > mmzone.h in pgtable.h. I think we can make a compromise: 1. change the default implementation of arch_has_hw_pte_young() to return 0, and 2. in memory.c, we can try PAGE_ALLOC_COSTLY_ORDER for archs that don't override arch_has_hw_pte_young(), or if its return value is too large to fit. This should also take care of the regression, right? ^ permalink raw reply [flat|nested] 167+ messages in thread
* Re: [PATCH v2 3/5] mm: Default implementation of arch_wants_pte_order() 2023-07-04 3:59 ` Yu Zhao @ 2023-07-04 5:22 ` Yin, Fengwei -1 siblings, 0 replies; 167+ messages in thread From: Yin, Fengwei @ 2023-07-04 5:22 UTC (permalink / raw) To: Yu Zhao, Ryan Roberts Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, David Hildenbrand, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm On 7/4/2023 11:59 AM, Yu Zhao wrote: > On Mon, Jul 3, 2023 at 9:02 PM Yu Zhao <yuzhao@google.com> wrote: >> >> On Mon, Jul 3, 2023 at 8:23 PM Yin, Fengwei <fengwei.yin@intel.com> wrote: >>> >>> >>> >>> On 7/3/2023 9:53 PM, Ryan Roberts wrote: >>>> arch_wants_pte_order() can be overridden by the arch to return the >>>> preferred folio order for pte-mapped memory. This is useful as some >>>> architectures (e.g. arm64) can coalesce TLB entries when the physical >>>> memory is suitably contiguous. >>>> >>>> The first user for this hint will be FLEXIBLE_THP, which aims to >>>> allocate large folios for anonymous memory to reduce page faults and >>>> other per-page operation costs. >>>> >>>> Here we add the default implementation of the function, used when the >>>> architecture does not define it, which returns the order corresponding >>>> to 64K. 
>>>> >>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> >>>> --- >>>> include/linux/pgtable.h | 13 +++++++++++++ >>>> 1 file changed, 13 insertions(+) >>>> >>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h >>>> index a661a17173fa..f7e38598f20b 100644 >>>> --- a/include/linux/pgtable.h >>>> +++ b/include/linux/pgtable.h >>>> @@ -13,6 +13,7 @@ >>>> #include <linux/errno.h> >>>> #include <asm-generic/pgtable_uffd.h> >>>> #include <linux/page_table_check.h> >>>> +#include <linux/sizes.h> >>>> >>>> #if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \ >>>> defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS >>>> @@ -336,6 +337,18 @@ static inline bool arch_has_hw_pte_young(void) >>>> } >>>> #endif >>>> >>>> +#ifndef arch_wants_pte_order >>>> +/* >>>> + * Returns preferred folio order for pte-mapped memory. Must be in range [0, >>>> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios >>>> + * to be at least order-2. >>>> + */ >>>> +static inline int arch_wants_pte_order(struct vm_area_struct *vma) >>>> +{ >>>> + return ilog2(SZ_64K >> PAGE_SHIFT); >>> Default value which is not related with any silicon may be: PAGE_ALLOC_COSTLY_ORDER? >>> >>> Also, current pcp list support cache page with order 0...PAGE_ALLOC_COSTLY_ORDER, 9. >>> If the pcp could cover the page, the pressure to zone lock will be reduced by pcp. >> >> The value of PAGE_ALLOC_COSTLY_ORDER is reasonable but again it's a >> s/w policy not a h/w preference. Besides, I don't think we can include >> mmzone.h in pgtable.h. > > I think we can make a compromise: > 1. change the default implementation of arch_has_hw_pte_young() to return 0, and > 2. in memory.c, we can try PAGE_ALLOC_COSTLY_ORDER for archs that > don't override arch_has_hw_pte_young(), or if its return value is too > large to fit. Do you mean arch_wants_pte_order()? Yes. This looks good to me. Thanks. 
Regards Yin, Fengwei > This should also take care of the regression, right? ^ permalink raw reply [flat|nested] 167+ messages in thread
* Re: [PATCH v2 3/5] mm: Default implementation of arch_wants_pte_order() 2023-07-04 5:22 ` Yin, Fengwei @ 2023-07-04 5:42 ` Yu Zhao -1 siblings, 0 replies; 167+ messages in thread From: Yu Zhao @ 2023-07-04 5:42 UTC (permalink / raw) To: Yin, Fengwei Cc: Ryan Roberts, Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, David Hildenbrand, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm On Mon, Jul 3, 2023 at 11:23 PM Yin, Fengwei <fengwei.yin@intel.com> wrote: > > > > On 7/4/2023 11:59 AM, Yu Zhao wrote: > > On Mon, Jul 3, 2023 at 9:02 PM Yu Zhao <yuzhao@google.com> wrote: > >> > >> On Mon, Jul 3, 2023 at 8:23 PM Yin, Fengwei <fengwei.yin@intel.com> wrote: > >>> > >>> > >>> > >>> On 7/3/2023 9:53 PM, Ryan Roberts wrote: > >>>> arch_wants_pte_order() can be overridden by the arch to return the > >>>> preferred folio order for pte-mapped memory. This is useful as some > >>>> architectures (e.g. arm64) can coalesce TLB entries when the physical > >>>> memory is suitably contiguous. > >>>> > >>>> The first user for this hint will be FLEXIBLE_THP, which aims to > >>>> allocate large folios for anonymous memory to reduce page faults and > >>>> other per-page operation costs. > >>>> > >>>> Here we add the default implementation of the function, used when the > >>>> architecture does not define it, which returns the order corresponding > >>>> to 64K. 
> >>>> > >>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> > >>>> --- > >>>> include/linux/pgtable.h | 13 +++++++++++++ > >>>> 1 file changed, 13 insertions(+) > >>>> > >>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h > >>>> index a661a17173fa..f7e38598f20b 100644 > >>>> --- a/include/linux/pgtable.h > >>>> +++ b/include/linux/pgtable.h > >>>> @@ -13,6 +13,7 @@ > >>>> #include <linux/errno.h> > >>>> #include <asm-generic/pgtable_uffd.h> > >>>> #include <linux/page_table_check.h> > >>>> +#include <linux/sizes.h> > >>>> > >>>> #if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \ > >>>> defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS > >>>> @@ -336,6 +337,18 @@ static inline bool arch_has_hw_pte_young(void) > >>>> } > >>>> #endif > >>>> > >>>> +#ifndef arch_wants_pte_order > >>>> +/* > >>>> + * Returns preferred folio order for pte-mapped memory. Must be in range [0, > >>>> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios > >>>> + * to be at least order-2. > >>>> + */ > >>>> +static inline int arch_wants_pte_order(struct vm_area_struct *vma) > >>>> +{ > >>>> + return ilog2(SZ_64K >> PAGE_SHIFT); > >>> Default value which is not related with any silicon may be: PAGE_ALLOC_COSTLY_ORDER? > >>> > >>> Also, current pcp list support cache page with order 0...PAGE_ALLOC_COSTLY_ORDER, 9. > >>> If the pcp could cover the page, the pressure to zone lock will be reduced by pcp. > >> > >> The value of PAGE_ALLOC_COSTLY_ORDER is reasonable but again it's a > >> s/w policy not a h/w preference. Besides, I don't think we can include > >> mmzone.h in pgtable.h. > > > > I think we can make a compromise: > > 1. change the default implementation of arch_has_hw_pte_young() to return 0, and > > 2. in memory.c, we can try PAGE_ALLOC_COSTLY_ORDER for archs that > > don't override arch_has_hw_pte_young(), or if its return value is too > > large to fit. > Do you mean arch_wants_pte_order()? Yes. 
This looks good to me. Thanks. Sorry, copied the wrong function from above and pasted without looking... ^ permalink raw reply [flat|nested] 167+ messages in thread
* Re: [PATCH v2 3/5] mm: Default implementation of arch_wants_pte_order() 2023-07-04 3:59 ` Yu Zhao @ 2023-07-04 12:36 ` Ryan Roberts -1 siblings, 0 replies; 167+ messages in thread From: Ryan Roberts @ 2023-07-04 12:36 UTC (permalink / raw) To: Yu Zhao, Yin, Fengwei Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, David Hildenbrand, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm On 04/07/2023 04:59, Yu Zhao wrote: > On Mon, Jul 3, 2023 at 9:02 PM Yu Zhao <yuzhao@google.com> wrote: >> >> On Mon, Jul 3, 2023 at 8:23 PM Yin, Fengwei <fengwei.yin@intel.com> wrote: >>> >>> >>> >>> On 7/3/2023 9:53 PM, Ryan Roberts wrote: >>>> arch_wants_pte_order() can be overridden by the arch to return the >>>> preferred folio order for pte-mapped memory. This is useful as some >>>> architectures (e.g. arm64) can coalesce TLB entries when the physical >>>> memory is suitably contiguous. >>>> >>>> The first user for this hint will be FLEXIBLE_THP, which aims to >>>> allocate large folios for anonymous memory to reduce page faults and >>>> other per-page operation costs. >>>> >>>> Here we add the default implementation of the function, used when the >>>> architecture does not define it, which returns the order corresponding >>>> to 64K. 
>>>> >>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> >>>> --- >>>> include/linux/pgtable.h | 13 +++++++++++++ >>>> 1 file changed, 13 insertions(+) >>>> >>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h >>>> index a661a17173fa..f7e38598f20b 100644 >>>> --- a/include/linux/pgtable.h >>>> +++ b/include/linux/pgtable.h >>>> @@ -13,6 +13,7 @@ >>>> #include <linux/errno.h> >>>> #include <asm-generic/pgtable_uffd.h> >>>> #include <linux/page_table_check.h> >>>> +#include <linux/sizes.h> >>>> >>>> #if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \ >>>> defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS >>>> @@ -336,6 +337,18 @@ static inline bool arch_has_hw_pte_young(void) >>>> } >>>> #endif >>>> >>>> +#ifndef arch_wants_pte_order >>>> +/* >>>> + * Returns preferred folio order for pte-mapped memory. Must be in range [0, >>>> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios >>>> + * to be at least order-2. >>>> + */ >>>> +static inline int arch_wants_pte_order(struct vm_area_struct *vma) >>>> +{ >>>> + return ilog2(SZ_64K >> PAGE_SHIFT); >>> Default value which is not related with any silicon may be: PAGE_ALLOC_COSTLY_ORDER? >>> >>> Also, current pcp list support cache page with order 0...PAGE_ALLOC_COSTLY_ORDER, 9. >>> If the pcp could cover the page, the pressure to zone lock will be reduced by pcp. >> >> The value of PAGE_ALLOC_COSTLY_ORDER is reasonable but again it's a >> s/w policy not a h/w preference. Besides, I don't think we can include >> mmzone.h in pgtable.h. > > I think we can make a compromise: > 1. change the default implementation of arch_has_hw_pte_young() to return 0, and > 2. in memory.c, we can try PAGE_ALLOC_COSTLY_ORDER for archs that > don't override arch_has_hw_pte_young(), or if its return value is too > large to fit. > This should also take care of the regression, right? 
I think you are suggesting that we use 0 as a sentinel which we then translate to PAGE_ALLOC_COSTLY_ORDER? I already have a max_anon_folio_order() function in memory.c (actually it is currently a macro defined as arch_wants_pte_order()). So it would become (I'll talk about the vma concern separately in the thread where you raised it): static inline int max_anon_folio_order(struct vm_area_struct *vma) { int order = arch_wants_pte_order(vma); return order ? order : PAGE_ALLOC_COSTLY_ORDER; } Correct? I don't see how it fixes the regression (assume you're talking about Speedometer) though? On arm64 arch_wants_pte_order() will still be returning order-4. ^ permalink raw reply [flat|nested] 167+ messages in thread
* Re: [PATCH v2 3/5] mm: Default implementation of arch_wants_pte_order() 2023-07-04 12:36 ` Ryan Roberts @ 2023-07-04 13:23 ` Ryan Roberts -1 siblings, 0 replies; 167+ messages in thread From: Ryan Roberts @ 2023-07-04 13:23 UTC (permalink / raw) To: Yu Zhao, Yin, Fengwei Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, David Hildenbrand, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm On 04/07/2023 13:36, Ryan Roberts wrote: > On 04/07/2023 04:59, Yu Zhao wrote: >> On Mon, Jul 3, 2023 at 9:02 PM Yu Zhao <yuzhao@google.com> wrote: >>> >>> On Mon, Jul 3, 2023 at 8:23 PM Yin, Fengwei <fengwei.yin@intel.com> wrote: >>>> >>>> >>>> >>>> On 7/3/2023 9:53 PM, Ryan Roberts wrote: >>>>> arch_wants_pte_order() can be overridden by the arch to return the >>>>> preferred folio order for pte-mapped memory. This is useful as some >>>>> architectures (e.g. arm64) can coalesce TLB entries when the physical >>>>> memory is suitably contiguous. >>>>> >>>>> The first user for this hint will be FLEXIBLE_THP, which aims to >>>>> allocate large folios for anonymous memory to reduce page faults and >>>>> other per-page operation costs. >>>>> >>>>> Here we add the default implementation of the function, used when the >>>>> architecture does not define it, which returns the order corresponding >>>>> to 64K. 
>>>>> >>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> >>>>> --- >>>>> include/linux/pgtable.h | 13 +++++++++++++ >>>>> 1 file changed, 13 insertions(+) >>>>> >>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h >>>>> index a661a17173fa..f7e38598f20b 100644 >>>>> --- a/include/linux/pgtable.h >>>>> +++ b/include/linux/pgtable.h >>>>> @@ -13,6 +13,7 @@ >>>>> #include <linux/errno.h> >>>>> #include <asm-generic/pgtable_uffd.h> >>>>> #include <linux/page_table_check.h> >>>>> +#include <linux/sizes.h> >>>>> >>>>> #if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \ >>>>> defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS >>>>> @@ -336,6 +337,18 @@ static inline bool arch_has_hw_pte_young(void) >>>>> } >>>>> #endif >>>>> >>>>> +#ifndef arch_wants_pte_order >>>>> +/* >>>>> + * Returns preferred folio order for pte-mapped memory. Must be in range [0, >>>>> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios >>>>> + * to be at least order-2. >>>>> + */ >>>>> +static inline int arch_wants_pte_order(struct vm_area_struct *vma) >>>>> +{ >>>>> + return ilog2(SZ_64K >> PAGE_SHIFT); >>>> Default value which is not related with any silicon may be: PAGE_ALLOC_COSTLY_ORDER? >>>> >>>> Also, current pcp list support cache page with order 0...PAGE_ALLOC_COSTLY_ORDER, 9. >>>> If the pcp could cover the page, the pressure to zone lock will be reduced by pcp. >>> >>> The value of PAGE_ALLOC_COSTLY_ORDER is reasonable but again it's a >>> s/w policy not a h/w preference. Besides, I don't think we can include >>> mmzone.h in pgtable.h. >> >> I think we can make a compromise: >> 1. change the default implementation of arch_has_hw_pte_young() to return 0, and >> 2. in memory.c, we can try PAGE_ALLOC_COSTLY_ORDER for archs that >> don't override arch_has_hw_pte_young(), or if its return value is too >> large to fit. >> This should also take care of the regression, right? 
> > I think you are suggesting that we use 0 as a sentinel which we then translate > to PAGE_ALLOC_COSTLY_ORDER? I already have a max_anon_folio_order() function in > memory.c (actually it is currently a macro defined as arch_wants_pte_order()). > > So it would become (I'll talk about the vma concern separately in the thread > where you raised it): > > static inline int max_anon_folio_order(struct vm_area_struct *vma) > { > int order = arch_wants_pte_order(vma); > > return order ? order : PAGE_ALLOC_COSTLY_ORDER; > } > > Correct? Actually, I'm not sure its a good idea to default to a fixed order. If running on an arch with big base pages (e.g. powerpc with 64K pages?), that will soon add up to a big chunk of memory, which could be wasteful? PAGE_ALLOC_COSTLY_ORDER = 3 so with 64K base page, that 512K. Is that a concern? Wouldn't it be better to define this as an absolute size? Or even the min of PAGE_ALLOC_COSTLY_ORDER and an absolute size? > > I don't see how it fixes the regression (assume you're talking about > Speedometer) though? On arm64 arch_wants_pte_order() will still be returning > order-4. > ^ permalink raw reply [flat|nested] 167+ messages in thread
* Re: [PATCH v2 3/5] mm: Default implementation of arch_wants_pte_order() 2023-07-04 13:23 ` Ryan Roberts @ 2023-07-05 1:40 ` Yu Zhao -1 siblings, 0 replies; 167+ messages in thread From: Yu Zhao @ 2023-07-05 1:40 UTC (permalink / raw) To: Ryan Roberts Cc: Yin, Fengwei, Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, David Hildenbrand, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm On Tue, Jul 4, 2023 at 7:23 AM Ryan Roberts <ryan.roberts@arm.com> wrote: > > On 04/07/2023 13:36, Ryan Roberts wrote: > > On 04/07/2023 04:59, Yu Zhao wrote: > >> On Mon, Jul 3, 2023 at 9:02 PM Yu Zhao <yuzhao@google.com> wrote: > >>> > >>> On Mon, Jul 3, 2023 at 8:23 PM Yin, Fengwei <fengwei.yin@intel.com> wrote: > >>>> > >>>> > >>>> > >>>> On 7/3/2023 9:53 PM, Ryan Roberts wrote: > >>>>> arch_wants_pte_order() can be overridden by the arch to return the > >>>>> preferred folio order for pte-mapped memory. This is useful as some > >>>>> architectures (e.g. arm64) can coalesce TLB entries when the physical > >>>>> memory is suitably contiguous. > >>>>> > >>>>> The first user for this hint will be FLEXIBLE_THP, which aims to > >>>>> allocate large folios for anonymous memory to reduce page faults and > >>>>> other per-page operation costs. > >>>>> > >>>>> Here we add the default implementation of the function, used when the > >>>>> architecture does not define it, which returns the order corresponding > >>>>> to 64K. 
> >>>>> > >>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> > >>>>> --- > >>>>> include/linux/pgtable.h | 13 +++++++++++++ > >>>>> 1 file changed, 13 insertions(+) > >>>>> > >>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h > >>>>> index a661a17173fa..f7e38598f20b 100644 > >>>>> --- a/include/linux/pgtable.h > >>>>> +++ b/include/linux/pgtable.h > >>>>> @@ -13,6 +13,7 @@ > >>>>> #include <linux/errno.h> > >>>>> #include <asm-generic/pgtable_uffd.h> > >>>>> #include <linux/page_table_check.h> > >>>>> +#include <linux/sizes.h> > >>>>> > >>>>> #if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \ > >>>>> defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS > >>>>> @@ -336,6 +337,18 @@ static inline bool arch_has_hw_pte_young(void) > >>>>> } > >>>>> #endif > >>>>> > >>>>> +#ifndef arch_wants_pte_order > >>>>> +/* > >>>>> + * Returns preferred folio order for pte-mapped memory. Must be in range [0, > >>>>> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios > >>>>> + * to be at least order-2. > >>>>> + */ > >>>>> +static inline int arch_wants_pte_order(struct vm_area_struct *vma) > >>>>> +{ > >>>>> + return ilog2(SZ_64K >> PAGE_SHIFT); > >>>> Default value which is not related with any silicon may be: PAGE_ALLOC_COSTLY_ORDER? > >>>> > >>>> Also, current pcp list support cache page with order 0...PAGE_ALLOC_COSTLY_ORDER, 9. > >>>> If the pcp could cover the page, the pressure to zone lock will be reduced by pcp. > >>> > >>> The value of PAGE_ALLOC_COSTLY_ORDER is reasonable but again it's a > >>> s/w policy not a h/w preference. Besides, I don't think we can include > >>> mmzone.h in pgtable.h. > >> > >> I think we can make a compromise: > >> 1. change the default implementation of arch_has_hw_pte_young() to return 0, and > >> 2. in memory.c, we can try PAGE_ALLOC_COSTLY_ORDER for archs that > >> don't override arch_has_hw_pte_young(), or if its return value is too > >> large to fit. 
> >> This should also take care of the regression, right? > > > > I think you are suggesting that we use 0 as a sentinel which we then translate > > to PAGE_ALLOC_COSTLY_ORDER? I already have a max_anon_folio_order() function in > > memory.c (actually it is currently a macro defined as arch_wants_pte_order()). > > > > So it would become (I'll talk about the vma concern separately in the thread > > where you raised it): > > > > static inline int max_anon_folio_order(struct vm_area_struct *vma) > > { > > int order = arch_wants_pte_order(vma); > > > > return order ? order : PAGE_ALLOC_COSTLY_ORDER; > > } > > > > Correct? > > Actually, I'm not sure its a good idea to default to a fixed order. If running > on an arch with big base pages (e.g. powerpc with 64K pages?), that will soon > add up to a big chunk of memory, which could be wasteful? > > PAGE_ALLOC_COSTLY_ORDER = 3 so with 64K base page, that 512K. Is that a concern? > Wouldn't it be better to define this as an absolute size? Or even the min of > PAGE_ALLOC_COSTLY_ORDER and an absolute size? For my POV, not at all. POWER can use smaller page sizes if they wanted to -- I don't think they do: at least the distros I use on my POWER9 all have THP=always by default (2MB). ^ permalink raw reply [flat|nested] 167+ messages in thread
* Re: [PATCH v2 3/5] mm: Default implementation of arch_wants_pte_order() @ 2023-07-05 1:40 ` Yu Zhao 0 siblings, 0 replies; 167+ messages in thread From: Yu Zhao @ 2023-07-05 1:40 UTC (permalink / raw) To: Ryan Roberts Cc: Yin, Fengwei, Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, David Hildenbrand, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm On Tue, Jul 4, 2023 at 7:23 AM Ryan Roberts <ryan.roberts@arm.com> wrote: > > On 04/07/2023 13:36, Ryan Roberts wrote: > > On 04/07/2023 04:59, Yu Zhao wrote: > >> On Mon, Jul 3, 2023 at 9:02 PM Yu Zhao <yuzhao@google.com> wrote: > >>> > >>> On Mon, Jul 3, 2023 at 8:23 PM Yin, Fengwei <fengwei.yin@intel.com> wrote: > >>>> > >>>> > >>>> > >>>> On 7/3/2023 9:53 PM, Ryan Roberts wrote: > >>>>> arch_wants_pte_order() can be overridden by the arch to return the > >>>>> preferred folio order for pte-mapped memory. This is useful as some > >>>>> architectures (e.g. arm64) can coalesce TLB entries when the physical > >>>>> memory is suitably contiguous. > >>>>> > >>>>> The first user for this hint will be FLEXIBLE_THP, which aims to > >>>>> allocate large folios for anonymous memory to reduce page faults and > >>>>> other per-page operation costs. > >>>>> > >>>>> Here we add the default implementation of the function, used when the > >>>>> architecture does not define it, which returns the order corresponding > >>>>> to 64K. 
> >>>>> > >>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> > >>>>> --- > >>>>> include/linux/pgtable.h | 13 +++++++++++++ > >>>>> 1 file changed, 13 insertions(+) > >>>>> > >>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h > >>>>> index a661a17173fa..f7e38598f20b 100644 > >>>>> --- a/include/linux/pgtable.h > >>>>> +++ b/include/linux/pgtable.h > >>>>> @@ -13,6 +13,7 @@ > >>>>> #include <linux/errno.h> > >>>>> #include <asm-generic/pgtable_uffd.h> > >>>>> #include <linux/page_table_check.h> > >>>>> +#include <linux/sizes.h> > >>>>> > >>>>> #if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \ > >>>>> defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS > >>>>> @@ -336,6 +337,18 @@ static inline bool arch_has_hw_pte_young(void) > >>>>> } > >>>>> #endif > >>>>> > >>>>> +#ifndef arch_wants_pte_order > >>>>> +/* > >>>>> + * Returns preferred folio order for pte-mapped memory. Must be in range [0, > >>>>> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios > >>>>> + * to be at least order-2. > >>>>> + */ > >>>>> +static inline int arch_wants_pte_order(struct vm_area_struct *vma) > >>>>> +{ > >>>>> + return ilog2(SZ_64K >> PAGE_SHIFT); > >>>> Default value which is not related with any silicon may be: PAGE_ALLOC_COSTLY_ORDER? > >>>> > >>>> Also, current pcp list support cache page with order 0...PAGE_ALLOC_COSTLY_ORDER, 9. > >>>> If the pcp could cover the page, the pressure to zone lock will be reduced by pcp. > >>> > >>> The value of PAGE_ALLOC_COSTLY_ORDER is reasonable but again it's a > >>> s/w policy not a h/w preference. Besides, I don't think we can include > >>> mmzone.h in pgtable.h. > >> > >> I think we can make a compromise: > >> 1. change the default implementation of arch_has_hw_pte_young() to return 0, and > >> 2. in memory.c, we can try PAGE_ALLOC_COSTLY_ORDER for archs that > >> don't override arch_has_hw_pte_young(), or if its return value is too > >> large to fit. 
> >> This should also take care of the regression, right? > > > > I think you are suggesting that we use 0 as a sentinel which we then translate > > to PAGE_ALLOC_COSTLY_ORDER? I already have a max_anon_folio_order() function in > > memory.c (actually it is currently a macro defined as arch_wants_pte_order()). > > > > So it would become (I'll talk about the vma concern separately in the thread > > where you raised it): > > > > static inline int max_anon_folio_order(struct vm_area_struct *vma) > > { > > int order = arch_wants_pte_order(vma); > > > > return order ? order : PAGE_ALLOC_COSTLY_ORDER; > > } > > > > Correct? > > Actually, I'm not sure its a good idea to default to a fixed order. If running > on an arch with big base pages (e.g. powerpc with 64K pages?), that will soon > add up to a big chunk of memory, which could be wasteful? > > PAGE_ALLOC_COSTLY_ORDER = 3 so with 64K base page, that 512K. Is that a concern? > Wouldn't it be better to define this as an absolute size? Or even the min of > PAGE_ALLOC_COSTLY_ORDER and an absolute size? From my POV, not at all. POWER can use smaller page sizes if they wanted to -- I don't think they do: at least the distros I use on my POWER9 all have THP=always by default (2MB). _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 167+ messages in thread
* Re: [PATCH v2 3/5] mm: Default implementation of arch_wants_pte_order() 2023-07-04 12:36 ` Ryan Roberts @ 2023-07-05 1:23 ` Yu Zhao -1 siblings, 0 replies; 167+ messages in thread From: Yu Zhao @ 2023-07-05 1:23 UTC (permalink / raw) To: Ryan Roberts, Yin, Fengwei Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, David Hildenbrand, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm [-- Attachment #1: Type: text/plain, Size: 4742 bytes --] On Tue, Jul 4, 2023 at 6:36 AM Ryan Roberts <ryan.roberts@arm.com> wrote: > > On 04/07/2023 04:59, Yu Zhao wrote: > > On Mon, Jul 3, 2023 at 9:02 PM Yu Zhao <yuzhao@google.com> wrote: > >> > >> On Mon, Jul 3, 2023 at 8:23 PM Yin, Fengwei <fengwei.yin@intel.com> wrote: > >>> > >>> > >>> > >>> On 7/3/2023 9:53 PM, Ryan Roberts wrote: > >>>> arch_wants_pte_order() can be overridden by the arch to return the > >>>> preferred folio order for pte-mapped memory. This is useful as some > >>>> architectures (e.g. arm64) can coalesce TLB entries when the physical > >>>> memory is suitably contiguous. > >>>> > >>>> The first user for this hint will be FLEXIBLE_THP, which aims to > >>>> allocate large folios for anonymous memory to reduce page faults and > >>>> other per-page operation costs. > >>>> > >>>> Here we add the default implementation of the function, used when the > >>>> architecture does not define it, which returns the order corresponding > >>>> to 64K. 
> >>>> > >>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> > >>>> --- > >>>> include/linux/pgtable.h | 13 +++++++++++++ > >>>> 1 file changed, 13 insertions(+) > >>>> > >>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h > >>>> index a661a17173fa..f7e38598f20b 100644 > >>>> --- a/include/linux/pgtable.h > >>>> +++ b/include/linux/pgtable.h > >>>> @@ -13,6 +13,7 @@ > >>>> #include <linux/errno.h> > >>>> #include <asm-generic/pgtable_uffd.h> > >>>> #include <linux/page_table_check.h> > >>>> +#include <linux/sizes.h> > >>>> > >>>> #if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \ > >>>> defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS > >>>> @@ -336,6 +337,18 @@ static inline bool arch_has_hw_pte_young(void) > >>>> } > >>>> #endif > >>>> > >>>> +#ifndef arch_wants_pte_order > >>>> +/* > >>>> + * Returns preferred folio order for pte-mapped memory. Must be in range [0, > >>>> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios > >>>> + * to be at least order-2. > >>>> + */ > >>>> +static inline int arch_wants_pte_order(struct vm_area_struct *vma) > >>>> +{ > >>>> + return ilog2(SZ_64K >> PAGE_SHIFT); > >>> Default value which is not related with any silicon may be: PAGE_ALLOC_COSTLY_ORDER? > >>> > >>> Also, current pcp list support cache page with order 0...PAGE_ALLOC_COSTLY_ORDER, 9. > >>> If the pcp could cover the page, the pressure to zone lock will be reduced by pcp. > >> > >> The value of PAGE_ALLOC_COSTLY_ORDER is reasonable but again it's a > >> s/w policy not a h/w preference. Besides, I don't think we can include > >> mmzone.h in pgtable.h. > > > > I think we can make a compromise: > > 1. change the default implementation of arch_has_hw_pte_young() to return 0, and > > 2. in memory.c, we can try PAGE_ALLOC_COSTLY_ORDER for archs that > > don't override arch_has_hw_pte_young(), or if its return value is too > > large to fit. 
> > This should also take care of the regression, right? > > I think you are suggesting that we use 0 as a sentinel which we then translate > to PAGE_ALLOC_COSTLY_ORDER? I already have a max_anon_folio_order() function in > memory.c (actually it is currently a macro defined as arch_wants_pte_order()). > > So it would become (I'll talk about the vma concern separately in the thread > where you raised it): > > static inline int max_anon_folio_order(struct vm_area_struct *vma) > { > int order = arch_wants_pte_order(vma); > > return order ? order : PAGE_ALLOC_COSTLY_ORDER; > } > > Correct? > > I don't see how it fixes the regression (assume you're talking about > Speedometer) though? On arm64 arch_wants_pte_order() will still be returning > order-4. Here is what I was actually suggesting -- I think the problem was because contpte is a bit too large for that benchmark and for the page allocator too, unfortunately. The following allows one retry (32KB) before fallback to order 0 when using contpte (64KB). There is no retry for HPA (16KB) and other archs. + int preferred = arch_wants_pte_order(vma) ? : PAGE_ALLOC_COSTLY_ORDER; + int orders[] = { + preferred, + preferred > PAGE_ALLOC_COSTLY_ORDER ? PAGE_ALLOC_COSTLY_ORDER : 0, + 0, + }; I'm attaching a patch which fills in the two helpers I left empty here [1]. Would the above work for Intel, Fengwei? (AMD wouldn't need to override arch_wants_pte_order() since PTE coalescing on Zen is also PAGE_ALLOC_COSTLY_ORDER.) 
[1] https://lore.kernel.org/linux-mm/CAOUHufaK82K8Sa35T7z3=gkm4GB0cWD3aqeZF6mYx82v7cOTeA@mail.gmail.com/2-anon_folios.patch [-- Attachment #2: fallback.patch --] [-- Type: application/octet-stream, Size: 1950 bytes --] diff --git a/mm/memory.c b/mm/memory.c index f69fbc251198..c19cbba60d04 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4023,6 +4023,75 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) return ret; } +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages) +{ + int i; + + if (nr_pages == 1) + return vmf_pte_changed(vmf); + + for (i = 0; i < nr_pages; i++) { + if (!pte_none(ptep_get_lockless(vmf->pte + i))) + return true; + } + + return false; +} + +#ifdef CONFIG_FLEXIBLE_THP +static struct folio *alloc_anon_folio(struct vm_fault *vmf) +{ + int i; + unsigned long addr; + struct vm_area_struct *vma = vmf->vma; + int preferred = arch_wants_pte_order(vma) ? : PAGE_ALLOC_COSTLY_ORDER; + int orders[] = { + preferred, + preferred > PAGE_ALLOC_COSTLY_ORDER ? PAGE_ALLOC_COSTLY_ORDER : 0, + 0, + }; + + if (vmf_orig_pte_uffd_wp(vmf)) + goto fallback; + + for (i = 0; orders[i]; i++) { + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]); + if (addr >= vma->vm_start && addr + (PAGE_SIZE << orders[i]) <= vma->vm_end) + break; + } + + if (!orders[i]) + goto fallback; + + vmf->pte = pte_offset_map(vmf->pmd, addr); + + for (; orders[i]; i++) { + if (!vmf_pte_range_changed(vmf, 1 << orders[i])) + break; + } + + pte_unmap(vmf->pte); + vmf->pte = NULL; + + for (; orders[i]; i++) { + struct folio *folio; + gfp_t gfp = vma_thp_gfp_mask(vma); + + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]); + folio = vma_alloc_folio(gfp, orders[i], vma, addr, true); + if (folio) { + clear_huge_page(&folio->page, addr, 1 << orders[i]); + vmf->address = addr; + return folio; + } + } +fallback: + return vma_alloc_zeroed_movable_folio(vma, vmf->address); +} +#else +#define alloc_anon_folio(vmf) vma_alloc_zeroed_movable_folio((vmf)->vma, (vmf)->address) +#endif + /* 
* We enter with non-exclusive mmap_lock (to exclude vma changes, * but allow concurrent faults), and pte mapped but not yet locked. ^ permalink raw reply related [flat|nested] 167+ messages in thread
* Re: [PATCH v2 3/5] mm: Default implementation of arch_wants_pte_order() 2023-07-05 1:23 ` Yu Zhao @ 2023-07-05 2:18 ` Yin Fengwei -1 siblings, 0 replies; 167+ messages in thread From: Yin Fengwei @ 2023-07-05 2:18 UTC (permalink / raw) To: Yu Zhao, Ryan Roberts Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, David Hildenbrand, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm On 7/5/23 09:23, Yu Zhao wrote: > On Tue, Jul 4, 2023 at 6:36 AM Ryan Roberts <ryan.roberts@arm.com> wrote: >> >> On 04/07/2023 04:59, Yu Zhao wrote: >>> On Mon, Jul 3, 2023 at 9:02 PM Yu Zhao <yuzhao@google.com> wrote: >>>> >>>> On Mon, Jul 3, 2023 at 8:23 PM Yin, Fengwei <fengwei.yin@intel.com> wrote: >>>>> >>>>> >>>>> >>>>> On 7/3/2023 9:53 PM, Ryan Roberts wrote: >>>>>> arch_wants_pte_order() can be overridden by the arch to return the >>>>>> preferred folio order for pte-mapped memory. This is useful as some >>>>>> architectures (e.g. arm64) can coalesce TLB entries when the physical >>>>>> memory is suitably contiguous. >>>>>> >>>>>> The first user for this hint will be FLEXIBLE_THP, which aims to >>>>>> allocate large folios for anonymous memory to reduce page faults and >>>>>> other per-page operation costs. >>>>>> >>>>>> Here we add the default implementation of the function, used when the >>>>>> architecture does not define it, which returns the order corresponding >>>>>> to 64K. 
>>>>>> >>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> >>>>>> --- >>>>>> include/linux/pgtable.h | 13 +++++++++++++ >>>>>> 1 file changed, 13 insertions(+) >>>>>> >>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h >>>>>> index a661a17173fa..f7e38598f20b 100644 >>>>>> --- a/include/linux/pgtable.h >>>>>> +++ b/include/linux/pgtable.h >>>>>> @@ -13,6 +13,7 @@ >>>>>> #include <linux/errno.h> >>>>>> #include <asm-generic/pgtable_uffd.h> >>>>>> #include <linux/page_table_check.h> >>>>>> +#include <linux/sizes.h> >>>>>> >>>>>> #if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \ >>>>>> defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS >>>>>> @@ -336,6 +337,18 @@ static inline bool arch_has_hw_pte_young(void) >>>>>> } >>>>>> #endif >>>>>> >>>>>> +#ifndef arch_wants_pte_order >>>>>> +/* >>>>>> + * Returns preferred folio order for pte-mapped memory. Must be in range [0, >>>>>> + * PMD_SHIFT-PAGE_SHIFT) and must not be order-1 since THP requires large folios >>>>>> + * to be at least order-2. >>>>>> + */ >>>>>> +static inline int arch_wants_pte_order(struct vm_area_struct *vma) >>>>>> +{ >>>>>> + return ilog2(SZ_64K >> PAGE_SHIFT); >>>>> Default value which is not related with any silicon may be: PAGE_ALLOC_COSTLY_ORDER? >>>>> >>>>> Also, current pcp list support cache page with order 0...PAGE_ALLOC_COSTLY_ORDER, 9. >>>>> If the pcp could cover the page, the pressure to zone lock will be reduced by pcp. >>>> >>>> The value of PAGE_ALLOC_COSTLY_ORDER is reasonable but again it's a >>>> s/w policy not a h/w preference. Besides, I don't think we can include >>>> mmzone.h in pgtable.h. >>> >>> I think we can make a compromise: >>> 1. change the default implementation of arch_has_hw_pte_young() to return 0, and >>> 2. in memory.c, we can try PAGE_ALLOC_COSTLY_ORDER for archs that >>> don't override arch_has_hw_pte_young(), or if its return value is too >>> large to fit. 
>>> This should also take care of the regression, right? >> >> I think you are suggesting that we use 0 as a sentinel which we then translate >> to PAGE_ALLOC_COSTLY_ORDER? I already have a max_anon_folio_order() function in >> memory.c (actually it is currently a macro defined as arch_wants_pte_order()). >> >> So it would become (I'll talk about the vma concern separately in the thread >> where you raised it): >> >> static inline int max_anon_folio_order(struct vm_area_struct *vma) >> { >> int order = arch_wants_pte_order(vma); >> >> return order ? order : PAGE_ALLOC_COSTLY_ORDER; >> } >> >> Correct? >> >> I don't see how it fixes the regression (assume you're talking about >> Speedometer) though? On arm64 arch_wants_pte_order() will still be returning >> order-4. > > Here is what I was actually suggesting -- I think the problem was > because contpte is a bit too large for that benchmark and for the page > allocator too, unfortunately. The following allows one retry (32KB) > before fallback to order 0 when using contpte (64KB). There is no > retry for HPA (16KB) and other archs. > > + int preferred = arch_wants_pte_order(vma) ? : PAGE_ALLOC_COSTLY_ORDER; > + int orders[] = { > + preferred, > + preferred > PAGE_ALLOC_COSTLY_ORDER ? > PAGE_ALLOC_COSTLY_ORDER : 0, > + 0, > + }; > > I'm attaching a patch which fills in the two helpers I left empty here [1]. > > Would the above work for Intel, Fengwei? PAGE_ALLOC_COSTLY_ORDER is Intel preferred because it fits the most common Intel system. So yes. This works for Intel. Regards Yin, Fengwei > > (AMD wouldn't need to override arch_wants_pte_order() since PTE > coalescing on Zen is also PAGE_ALLOC_COSTLY_ORDER.) > > [1] https://lore.kernel.org/linux-mm/CAOUHufaK82K8Sa35T7z3=gkm4GB0cWD3aqeZF6mYx82v7cOTeA@mail.gmail.com/2-anon_folios.patch ^ permalink raw reply [flat|nested] 167+ messages in thread
* [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance 2023-07-03 13:53 ` Ryan Roberts @ 2023-07-03 13:53 ` Ryan Roberts -1 siblings, 0 replies; 167+ messages in thread From: Ryan Roberts @ 2023-07-03 13:53 UTC (permalink / raw) To: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei, David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm Introduce the FLEXIBLE_THP feature, which allows anonymous memory to be allocated in large folios of a specified order. All pages of the large folio are pte-mapped during the same page fault, significantly reducing the number of page faults. The number of per-page operations (e.g. ref counting, rmap management, lru list management) is also significantly reduced since those ops now become per-folio. The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which defaults to disabled for now; there is a long list of todos to make FLEXIBLE_THP robust with existing features (e.g. compaction, mlock, some madvise ops, etc). These items will be tackled in subsequent patches. When enabled, the preferred folio order is as returned by arch_wants_pte_order(), which may be overridden by the arch as it sees fit. Some architectures (e.g. arm64) can coalesce TLB entries if a contiguous set of ptes map physically contiguous, naturally aligned memory, so this mechanism allows the architecture to optimize as required. If the preferred order can't be used (e.g. because the folio would breach the bounds of the vma, or because ptes in the region are already mapped) then we fall back to a suitable lower order. 
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> --- mm/Kconfig | 10 ++++ mm/memory.c | 168 ++++++++++++++++++++++++++++++++++++++++++++++++---- 2 files changed, 165 insertions(+), 13 deletions(-) diff --git a/mm/Kconfig b/mm/Kconfig index 7672a22647b4..1c06b2c0a24e 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -822,6 +822,16 @@ config READ_ONLY_THP_FOR_FS support of file THPs will be developed in the next few release cycles. +config FLEXIBLE_THP + bool "Flexible order THP" + depends on TRANSPARENT_HUGEPAGE + default n + help + Use large (bigger than order-0) folios to back anonymous memory where + possible, even if the order of the folio is smaller than the PMD + order. This reduces the number of page faults, as well as other + per-page overheads to improve performance for many workloads. + endif # TRANSPARENT_HUGEPAGE # diff --git a/mm/memory.c b/mm/memory.c index fb30f7523550..abe2ea94f3f5 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3001,6 +3001,116 @@ static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf) return 0; } +#ifdef CONFIG_FLEXIBLE_THP +/* + * Allocates, zeros and returns a folio of the requested order for use as + * anonymous memory. + */ +static struct folio *alloc_anon_folio(struct vm_area_struct *vma, + unsigned long addr, int order) +{ + gfp_t gfp; + struct folio *folio; + + if (order == 0) + return vma_alloc_zeroed_movable_folio(vma, addr); + + gfp = vma_thp_gfp_mask(vma); + folio = vma_alloc_folio(gfp, order, vma, addr, true); + if (folio) + clear_huge_page(&folio->page, addr, folio_nr_pages(folio)); + + return folio; +} + +/* + * Preferred folio order to allocate for anonymous memory. + */ +#define max_anon_folio_order(vma) arch_wants_pte_order(vma) +#else +#define alloc_anon_folio(vma, addr, order) \ + vma_alloc_zeroed_movable_folio(vma, addr) +#define max_anon_folio_order(vma) 0 +#endif + +/* + * Returns index of first pte that is not none, or nr if all are none. 
+ */ +static inline int check_ptes_none(pte_t *pte, int nr) +{ + int i; + + for (i = 0; i < nr; i++) { + if (!pte_none(ptep_get(pte++))) + return i; + } + + return nr; +} + +static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order) +{ + /* + * The aim here is to determine what size of folio we should allocate + * for this fault. Factors include: + * - Order must not be higher than `order` upon entry + * - Folio must be naturally aligned within VA space + * - Folio must be fully contained inside one pmd entry + * - Folio must not breach boundaries of vma + * - Folio must not overlap any non-none ptes + * + * Additionally, we do not allow order-1 since this breaks assumptions + * elsewhere in the mm; THP pages must be at least order-2 (since they + * store state up to the 3rd struct page subpage), and these pages must + * be THP in order to correctly use pre-existing THP infrastructure such + * as folio_split(). + * + * Note that the caller may or may not choose to lock the pte. If + * unlocked, the result is racy and the user must re-check any overlap + * with non-none ptes under the lock. + */ + + struct vm_area_struct *vma = vmf->vma; + int nr; + unsigned long addr; + pte_t *pte; + pte_t *first_set = NULL; + int ret; + + order = min(order, PMD_SHIFT - PAGE_SHIFT); + + for (; order > 1; order--) { + nr = 1 << order; + addr = ALIGN_DOWN(vmf->address, nr << PAGE_SHIFT); + pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT); + + /* Check vma bounds. */ + if (addr < vma->vm_start || + addr + (nr << PAGE_SHIFT) > vma->vm_end) + continue; + + /* Ptes covered by order already known to be none. */ + if (pte + nr <= first_set) + break; + + /* Already found set pte in range covered by order. */ + if (pte <= first_set) + continue; + + /* Need to check if all the ptes are none. 
*/ + ret = check_ptes_none(pte, nr); + if (ret == nr) + break; + + first_set = pte + ret; + } + + if (order == 1) + order = 0; + + return order; +} + /* * Handle write page faults for pages that can be reused in the current vma * @@ -3073,7 +3183,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) goto oom; if (is_zero_pfn(pte_pfn(vmf->orig_pte))) { - new_folio = vma_alloc_zeroed_movable_folio(vma, vmf->address); + new_folio = alloc_anon_folio(vma, vmf->address, 0); if (!new_folio) goto oom; } else { @@ -4040,6 +4150,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) struct folio *folio; vm_fault_t ret = 0; pte_t entry; + int order; + int pgcount; + unsigned long addr; /* File mapping without ->vm_ops ? */ if (vma->vm_flags & VM_SHARED) @@ -4081,24 +4194,51 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) pte_unmap_unlock(vmf->pte, vmf->ptl); return handle_userfault(vmf, VM_UFFD_MISSING); } - goto setpte; + if (uffd_wp) + entry = pte_mkuffd_wp(entry); + set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); + + /* No need to invalidate - it was non-present before */ + update_mmu_cache(vma, vmf->address, vmf->pte); + goto unlock; + } + + /* + * If allocating a large folio, determine the biggest suitable order for + * the VMA (e.g. it must not exceed the VMA's bounds, it must not + * overlap with any populated PTEs, etc). We are not under the ptl here + * so we will need to re-check that we are not overlapping any populated + * PTEs once we have the lock. + */ + order = uffd_wp ? 0 : max_anon_folio_order(vma); + if (order > 0) { + vmf->pte = pte_offset_map(vmf->pmd, vmf->address); + order = calc_anon_folio_order_alloc(vmf, order); + pte_unmap(vmf->pte); } - /* Allocate our own private page. */ + /* Allocate our own private folio. 
*/ if (unlikely(anon_vma_prepare(vma))) goto oom; - folio = vma_alloc_zeroed_movable_folio(vma, vmf->address); + folio = alloc_anon_folio(vma, vmf->address, order); + if (!folio && order > 0) { + order = 0; + folio = alloc_anon_folio(vma, vmf->address, order); + } if (!folio) goto oom; + pgcount = 1 << order; + addr = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT); + if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL)) goto oom_free_page; folio_throttle_swaprate(folio, GFP_KERNEL); /* * The memory barrier inside __folio_mark_uptodate makes sure that - * preceding stores to the page contents become visible before - * the set_pte_at() write. + * preceding stores to the folio contents become visible before + * the set_ptes() write. */ __folio_mark_uptodate(folio); @@ -4107,11 +4247,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) if (vma->vm_flags & VM_WRITE) entry = pte_mkwrite(pte_mkdirty(entry)); - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, - &vmf->ptl); + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl); if (vmf_pte_changed(vmf)) { update_mmu_tlb(vma, vmf->address, vmf->pte); goto release; + } else if (order > 0 && check_ptes_none(vmf->pte, pgcount) != pgcount) { + goto release; } ret = check_stable_address_space(vma->vm_mm); @@ -4125,16 +4266,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) return handle_userfault(vmf, VM_UFFD_MISSING); } - inc_mm_counter(vma->vm_mm, MM_ANONPAGES); - folio_add_new_anon_rmap(folio, vma, vmf->address); + folio_ref_add(folio, pgcount - 1); + add_mm_counter(vma->vm_mm, MM_ANONPAGES, pgcount); + folio_add_new_anon_rmap(folio, vma, addr); folio_add_lru_vma(folio, vma); -setpte: + if (uffd_wp) entry = pte_mkuffd_wp(entry); - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); + set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount); /* No need to invalidate - it was non-present before */ - update_mmu_cache(vma, vmf->address, vmf->pte); + 
update_mmu_cache_range(vma, addr, vmf->pte, pgcount); unlock: pte_unmap_unlock(vmf->pte, vmf->ptl); return ret; -- 2.25.1 ^ permalink raw reply related [flat|nested] 167+ messages in thread
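[Editorial note: the order-selection walk in calc_anon_folio_order_alloc() above can be modelled in a small userspace sketch. This is a simplified, hypothetical model: plain integers stand in for live pte_t slots, a slot counts as "none" when it holds 0, the names calc_order/pte_base are invented for illustration, and the kernel's first_set optimisation that avoids rescanning is omitted. The real function operates on the faulting task's page tables and is racy without the ptl, as its comment explains.]

```c
#include <assert.h>

#define PAGE_SHIFT 12
#define PMD_SHIFT  21
#define ALIGN_DOWN(x, a) ((x) & ~((unsigned long)(a) - 1))

/* Model: a pte slot is "none" when it holds 0.
 * Returns index of first non-none slot, or nr if all are none. */
static int check_ptes_none(const unsigned long *pte, int nr)
{
	int i;

	for (i = 0; i < nr; i++)
		if (pte[i] != 0)
			return i;
	return nr;
}

/*
 * Simplified model of calc_anon_folio_order_alloc(): shrink the order
 * until the naturally aligned pte run around `address` fits inside
 * [vm_start, vm_end) and contains only none ptes. Order 1 collapses
 * to 0, matching the rule that THPs must be at least order-2.
 * `ptes` models the pte page; `pte_base` is the VA its slot 0 maps.
 */
static int calc_order(const unsigned long *ptes, unsigned long pte_base,
		      unsigned long address, unsigned long vm_start,
		      unsigned long vm_end, int order)
{
	if (order > PMD_SHIFT - PAGE_SHIFT)
		order = PMD_SHIFT - PAGE_SHIFT;

	for (; order > 1; order--) {
		unsigned long nr = 1UL << order;
		unsigned long addr = ALIGN_DOWN(address, nr << PAGE_SHIFT);
		const unsigned long *pte =
			ptes + ((addr - pte_base) >> PAGE_SHIFT);

		/* Check vma bounds. */
		if (addr < vm_start || addr + (nr << PAGE_SHIFT) > vm_end)
			continue;

		/* All ptes covered by this order must be none. */
		if (check_ptes_none(pte, (int)nr) == (int)nr)
			break;
	}

	return order == 1 ? 0 : order;
}
```

With a 4K page size and a preferred 64K folio (order-4), a fault in a fully unpopulated, well-aligned region selects order-4; each populated pte that intrudes on the aligned run pushes the selection down an order, and anything below order-2 falls back to order-0, mirroring the performance discussion in the cover letter.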
* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance 2023-07-03 13:53 ` Ryan Roberts @ 2023-07-03 15:51 ` kernel test robot 0 siblings, 0 replies; 167+ messages in thread From: kernel test robot @ 2023-07-03 15:51 UTC (permalink / raw) To: Ryan Roberts, Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei, David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi Cc: llvm, oe-kbuild-all, Linux Memory Management List, Ryan Roberts, linux-arm-kernel, linux-kernel Hi Ryan, kernel test robot noticed the following build errors: [auto build test ERROR on arm64/for-next/core] [also build test ERROR on v6.4] [cannot apply to akpm-mm/mm-everything linus/master next-20230703] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Ryan-Roberts/mm-Non-pmd-mappable-large-folios-for-folio_add_new_anon_rmap/20230703-215627 base: https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-next/core patch link: https://lore.kernel.org/r/20230703135330.1865927-5-ryan.roberts%40arm.com patch subject: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance config: um-allyesconfig (https://download.01.org/0day-ci/archive/20230703/202307032325.u93xmWbG-lkp@intel.com/config) compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project.git f28c006a5895fc0e329fe15fead81e37457cb1d1) reproduce: (https://download.01.org/0day-ci/archive/20230703/202307032325.u93xmWbG-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. 
not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202307032325.u93xmWbG-lkp@intel.com/ All errors (new ones prefixed by >>): In file included from mm/memory.c:42: In file included from include/linux/kernel_stat.h:9: In file included from include/linux/interrupt.h:11: In file included from include/linux/hardirq.h:11: In file included from arch/um/include/asm/hardirq.h:5: In file included from include/asm-generic/hardirq.h:17: In file included from include/linux/irq.h:20: In file included from include/linux/io.h:13: In file included from arch/um/include/asm/io.h:24: include/asm-generic/io.h:547:31: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] val = __raw_readb(PCI_IOBASE + addr); ~~~~~~~~~~ ^ include/asm-generic/io.h:560:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] val = __le16_to_cpu((__le16 __force)__raw_readw(PCI_IOBASE + addr)); ~~~~~~~~~~ ^ include/uapi/linux/byteorder/little_endian.h:37:51: note: expanded from macro '__le16_to_cpu' #define __le16_to_cpu(x) ((__force __u16)(__le16)(x)) ^ In file included from mm/memory.c:42: In file included from include/linux/kernel_stat.h:9: In file included from include/linux/interrupt.h:11: In file included from include/linux/hardirq.h:11: In file included from arch/um/include/asm/hardirq.h:5: In file included from include/asm-generic/hardirq.h:17: In file included from include/linux/irq.h:20: In file included from include/linux/io.h:13: In file included from arch/um/include/asm/io.h:24: include/asm-generic/io.h:573:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] val = __le32_to_cpu((__le32 __force)__raw_readl(PCI_IOBASE + addr)); ~~~~~~~~~~ ^ include/uapi/linux/byteorder/little_endian.h:35:51: note: expanded from 
macro '__le32_to_cpu' #define __le32_to_cpu(x) ((__force __u32)(__le32)(x)) ^ In file included from mm/memory.c:42: In file included from include/linux/kernel_stat.h:9: In file included from include/linux/interrupt.h:11: In file included from include/linux/hardirq.h:11: In file included from arch/um/include/asm/hardirq.h:5: In file included from include/asm-generic/hardirq.h:17: In file included from include/linux/irq.h:20: In file included from include/linux/io.h:13: In file included from arch/um/include/asm/io.h:24: include/asm-generic/io.h:584:33: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] __raw_writeb(value, PCI_IOBASE + addr); ~~~~~~~~~~ ^ include/asm-generic/io.h:594:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] __raw_writew((u16 __force)cpu_to_le16(value), PCI_IOBASE + addr); ~~~~~~~~~~ ^ include/asm-generic/io.h:604:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] __raw_writel((u32 __force)cpu_to_le32(value), PCI_IOBASE + addr); ~~~~~~~~~~ ^ include/asm-generic/io.h:692:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] readsb(PCI_IOBASE + addr, buffer, count); ~~~~~~~~~~ ^ include/asm-generic/io.h:700:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] readsw(PCI_IOBASE + addr, buffer, count); ~~~~~~~~~~ ^ include/asm-generic/io.h:708:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] readsl(PCI_IOBASE + addr, buffer, count); ~~~~~~~~~~ ^ include/asm-generic/io.h:717:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] writesb(PCI_IOBASE + addr, buffer, count); ~~~~~~~~~~ ^ include/asm-generic/io.h:726:21: warning: performing 
pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] writesw(PCI_IOBASE + addr, buffer, count); ~~~~~~~~~~ ^ include/asm-generic/io.h:735:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] writesl(PCI_IOBASE + addr, buffer, count); ~~~~~~~~~~ ^ >> mm/memory.c:4271:2: error: implicit declaration of function 'set_ptes' is invalid in C99 [-Werror,-Wimplicit-function-declaration] set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount); ^ mm/memory.c:4271:2: note: did you mean 'set_pte'? arch/um/include/asm/pgtable.h:232:20: note: 'set_pte' declared here static inline void set_pte(pte_t *pteptr, pte_t pteval) ^ >> mm/memory.c:4274:2: error: implicit declaration of function 'update_mmu_cache_range' is invalid in C99 [-Werror,-Wimplicit-function-declaration] update_mmu_cache_range(vma, addr, vmf->pte, pgcount); ^ 12 warnings and 2 errors generated. vim +/set_ptes +4271 mm/memory.c 4135 4136 /* 4137 * We enter with non-exclusive mmap_lock (to exclude vma changes, 4138 * but allow concurrent faults), and pte mapped but not yet locked. 4139 * We return with mmap_lock still held, but pte unmapped and unlocked. 4140 */ 4141 static vm_fault_t do_anonymous_page(struct vm_fault *vmf) 4142 { 4143 bool uffd_wp = vmf_orig_pte_uffd_wp(vmf); 4144 struct vm_area_struct *vma = vmf->vma; 4145 struct folio *folio; 4146 vm_fault_t ret = 0; 4147 pte_t entry; 4148 int order; 4149 int pgcount; 4150 unsigned long addr; 4151 4152 /* File mapping without ->vm_ops ? */ 4153 if (vma->vm_flags & VM_SHARED) 4154 return VM_FAULT_SIGBUS; 4155 4156 /* 4157 * Use pte_alloc() instead of pte_alloc_map(). We can't run 4158 * pte_offset_map() on pmds where a huge pmd might be created 4159 * from a different thread. 4160 * 4161 * pte_alloc_map() is safe to use under mmap_write_lock(mm) or when 4162 * parallel threads are excluded by other means. 4163 * 4164 * Here we only have mmap_read_lock(mm). 
4165 */ 4166 if (pte_alloc(vma->vm_mm, vmf->pmd)) 4167 return VM_FAULT_OOM; 4168 4169 /* See comment in handle_pte_fault() */ 4170 if (unlikely(pmd_trans_unstable(vmf->pmd))) 4171 return 0; 4172 4173 /* Use the zero-page for reads */ 4174 if (!(vmf->flags & FAULT_FLAG_WRITE) && 4175 !mm_forbids_zeropage(vma->vm_mm)) { 4176 entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address), 4177 vma->vm_page_prot)); 4178 vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, 4179 vmf->address, &vmf->ptl); 4180 if (vmf_pte_changed(vmf)) { 4181 update_mmu_tlb(vma, vmf->address, vmf->pte); 4182 goto unlock; 4183 } 4184 ret = check_stable_address_space(vma->vm_mm); 4185 if (ret) 4186 goto unlock; 4187 /* Deliver the page fault to userland, check inside PT lock */ 4188 if (userfaultfd_missing(vma)) { 4189 pte_unmap_unlock(vmf->pte, vmf->ptl); 4190 return handle_userfault(vmf, VM_UFFD_MISSING); 4191 } 4192 if (uffd_wp) 4193 entry = pte_mkuffd_wp(entry); 4194 set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); 4195 4196 /* No need to invalidate - it was non-present before */ 4197 update_mmu_cache(vma, vmf->address, vmf->pte); 4198 goto unlock; 4199 } 4200 4201 /* 4202 * If allocating a large folio, determine the biggest suitable order for 4203 * the VMA (e.g. it must not exceed the VMA's bounds, it must not 4204 * overlap with any populated PTEs, etc). We are not under the ptl here 4205 * so we will need to re-check that we are not overlapping any populated 4206 * PTEs once we have the lock. 4207 */ 4208 order = uffd_wp ? 0 : max_anon_folio_order(vma); 4209 if (order > 0) { 4210 vmf->pte = pte_offset_map(vmf->pmd, vmf->address); 4211 order = calc_anon_folio_order_alloc(vmf, order); 4212 pte_unmap(vmf->pte); 4213 } 4214 4215 /* Allocate our own private folio. 
*/ 4216 if (unlikely(anon_vma_prepare(vma))) 4217 goto oom; 4218 folio = alloc_anon_folio(vma, vmf->address, order); 4219 if (!folio && order > 0) { 4220 order = 0; 4221 folio = alloc_anon_folio(vma, vmf->address, order); 4222 } 4223 if (!folio) 4224 goto oom; 4225 4226 pgcount = 1 << order; 4227 addr = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT); 4228 4229 if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL)) 4230 goto oom_free_page; 4231 folio_throttle_swaprate(folio, GFP_KERNEL); 4232 4233 /* 4234 * The memory barrier inside __folio_mark_uptodate makes sure that 4235 * preceding stores to the folio contents become visible before 4236 * the set_ptes() write. 4237 */ 4238 __folio_mark_uptodate(folio); 4239 4240 entry = mk_pte(&folio->page, vma->vm_page_prot); 4241 entry = pte_sw_mkyoung(entry); 4242 if (vma->vm_flags & VM_WRITE) 4243 entry = pte_mkwrite(pte_mkdirty(entry)); 4244 4245 vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl); 4246 if (vmf_pte_changed(vmf)) { 4247 update_mmu_tlb(vma, vmf->address, vmf->pte); 4248 goto release; 4249 } else if (order > 0 && check_ptes_none(vmf->pte, pgcount) != pgcount) { 4250 goto release; 4251 } 4252 4253 ret = check_stable_address_space(vma->vm_mm); 4254 if (ret) 4255 goto release; 4256 4257 /* Deliver the page fault to userland, check inside PT lock */ 4258 if (userfaultfd_missing(vma)) { 4259 pte_unmap_unlock(vmf->pte, vmf->ptl); 4260 folio_put(folio); 4261 return handle_userfault(vmf, VM_UFFD_MISSING); 4262 } 4263 4264 folio_ref_add(folio, pgcount - 1); 4265 add_mm_counter(vma->vm_mm, MM_ANONPAGES, pgcount); 4266 folio_add_new_anon_rmap(folio, vma, addr); 4267 folio_add_lru_vma(folio, vma); 4268 4269 if (uffd_wp) 4270 entry = pte_mkuffd_wp(entry); > 4271 set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount); 4272 4273 /* No need to invalidate - it was non-present before */ > 4274 update_mmu_cache_range(vma, addr, vmf->pte, pgcount); 4275 unlock: 4276 pte_unmap_unlock(vmf->pte, vmf->ptl); 4277 
return ret; 4278 release: 4279 folio_put(folio); 4280 goto unlock; 4281 oom_free_page: 4282 folio_put(folio); 4283 oom: 4284 return VM_FAULT_OOM; 4285 } 4286 -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki ^ permalink raw reply [flat|nested] 167+ messages in thread
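[Editorial note: both build errors stem from the hard dependency on Matthew Wilcox's set_ptes() prerequisite series noted in the cover letter — arch/um builds without that series provide neither set_ptes() nor update_mmu_cache_range(). Conceptually, set_ptes() is a batched set_pte_at() whose pfn advances one page per slot. The hedged userspace sketch below models that idea only; the names set_ptes_model and the pte encoding are invented for illustration and are not the kernel definition.]

```c
#include <assert.h>

#define PAGE_SHIFT 12

/* Model: a pte is a plain integer encoding pfn << PAGE_SHIFT plus flags. */
typedef unsigned long pte_t;

/*
 * Hypothetical model of a generic set_ptes(): write `nr` consecutive
 * pte slots starting from `entry`, advancing the encoded pfn by one
 * page each step. The real helper goes through set_pte_at() and the
 * arch's pte accessors, which is exactly what arch/um was missing.
 */
static void set_ptes_model(pte_t *ptep, pte_t entry, unsigned int nr)
{
	unsigned int i;

	for (i = 0; i < nr; i++)
		ptep[i] = entry + ((pte_t)i << PAGE_SHIFT);
}
```

This is why the robot suggests set_pte() as the nearest declared symbol: without the prerequisite series, only the single-pte helpers exist, and the series at [2] must be applied first (as stated in the cover letter) for this patch to build on every architecture.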
*/ 4216 if (unlikely(anon_vma_prepare(vma))) 4217 goto oom; 4218 folio = alloc_anon_folio(vma, vmf->address, order); 4219 if (!folio && order > 0) { 4220 order = 0; 4221 folio = alloc_anon_folio(vma, vmf->address, order); 4222 } 4223 if (!folio) 4224 goto oom; 4225 4226 pgcount = 1 << order; 4227 addr = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT); 4228 4229 if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL)) 4230 goto oom_free_page; 4231 folio_throttle_swaprate(folio, GFP_KERNEL); 4232 4233 /* 4234 * The memory barrier inside __folio_mark_uptodate makes sure that 4235 * preceding stores to the folio contents become visible before 4236 * the set_ptes() write. 4237 */ 4238 __folio_mark_uptodate(folio); 4239 4240 entry = mk_pte(&folio->page, vma->vm_page_prot); 4241 entry = pte_sw_mkyoung(entry); 4242 if (vma->vm_flags & VM_WRITE) 4243 entry = pte_mkwrite(pte_mkdirty(entry)); 4244 4245 vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl); 4246 if (vmf_pte_changed(vmf)) { 4247 update_mmu_tlb(vma, vmf->address, vmf->pte); 4248 goto release; 4249 } else if (order > 0 && check_ptes_none(vmf->pte, pgcount) != pgcount) { 4250 goto release; 4251 } 4252 4253 ret = check_stable_address_space(vma->vm_mm); 4254 if (ret) 4255 goto release; 4256 4257 /* Deliver the page fault to userland, check inside PT lock */ 4258 if (userfaultfd_missing(vma)) { 4259 pte_unmap_unlock(vmf->pte, vmf->ptl); 4260 folio_put(folio); 4261 return handle_userfault(vmf, VM_UFFD_MISSING); 4262 } 4263 4264 folio_ref_add(folio, pgcount - 1); 4265 add_mm_counter(vma->vm_mm, MM_ANONPAGES, pgcount); 4266 folio_add_new_anon_rmap(folio, vma, addr); 4267 folio_add_lru_vma(folio, vma); 4268 4269 if (uffd_wp) 4270 entry = pte_mkuffd_wp(entry); > 4271 set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount); 4272 4273 /* No need to invalidate - it was non-present before */ > 4274 update_mmu_cache_range(vma, addr, vmf->pte, pgcount); 4275 unlock: 4276 pte_unmap_unlock(vmf->pte, vmf->ptl); 4277 
return ret; 4278 release: 4279 folio_put(folio); 4280 goto unlock; 4281 oom_free_page: 4282 folio_put(folio); 4283 oom: 4284 return VM_FAULT_OOM; 4285 } 4286 -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance 2023-07-03 13:53 ` Ryan Roberts @ 2023-07-03 16:01 ` kernel test robot -1 siblings, 0 replies; 167+ messages in thread From: kernel test robot @ 2023-07-03 16:01 UTC (permalink / raw) To: Ryan Roberts, Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei, David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi Cc: llvm, oe-kbuild-all, Linux Memory Management List, Ryan Roberts, linux-arm-kernel, linux-kernel Hi Ryan, kernel test robot noticed the following build errors: [auto build test ERROR on arm64/for-next/core] [also build test ERROR on v6.4] [cannot apply to akpm-mm/mm-everything linus/master next-20230703] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Ryan-Roberts/mm-Non-pmd-mappable-large-folios-for-folio_add_new_anon_rmap/20230703-215627 base: https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-next/core patch link: https://lore.kernel.org/r/20230703135330.1865927-5-ryan.roberts%40arm.com patch subject: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance config: um-allnoconfig (https://download.01.org/0day-ci/archive/20230703/202307032330.TguyNttt-lkp@intel.com/config) compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project.git 4a5ac14ee968ff0ad5d2cc1ffa0299048db4c88a) reproduce: (https://download.01.org/0day-ci/archive/20230703/202307032330.TguyNttt-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. 
not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202307032330.TguyNttt-lkp@intel.com/ All errors (new ones prefixed by >>): In file included from mm/memory.c:42: In file included from include/linux/kernel_stat.h:9: In file included from include/linux/interrupt.h:11: In file included from include/linux/hardirq.h:11: In file included from arch/um/include/asm/hardirq.h:5: In file included from include/asm-generic/hardirq.h:17: In file included from include/linux/irq.h:20: In file included from include/linux/io.h:13: In file included from arch/um/include/asm/io.h:24: include/asm-generic/io.h:547:31: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] 547 | val = __raw_readb(PCI_IOBASE + addr); | ~~~~~~~~~~ ^ include/asm-generic/io.h:560:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] 560 | val = __le16_to_cpu((__le16 __force)__raw_readw(PCI_IOBASE + addr)); | ~~~~~~~~~~ ^ include/uapi/linux/byteorder/little_endian.h:37:51: note: expanded from macro '__le16_to_cpu' 37 | #define __le16_to_cpu(x) ((__force __u16)(__le16)(x)) | ^ In file included from mm/memory.c:42: In file included from include/linux/kernel_stat.h:9: In file included from include/linux/interrupt.h:11: In file included from include/linux/hardirq.h:11: In file included from arch/um/include/asm/hardirq.h:5: In file included from include/asm-generic/hardirq.h:17: In file included from include/linux/irq.h:20: In file included from include/linux/io.h:13: In file included from arch/um/include/asm/io.h:24: include/asm-generic/io.h:573:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] 573 | val = __le32_to_cpu((__le32 __force)__raw_readl(PCI_IOBASE + addr)); | ~~~~~~~~~~ ^ 
include/uapi/linux/byteorder/little_endian.h:35:51: note: expanded from macro '__le32_to_cpu' 35 | #define __le32_to_cpu(x) ((__force __u32)(__le32)(x)) | ^ In file included from mm/memory.c:42: In file included from include/linux/kernel_stat.h:9: In file included from include/linux/interrupt.h:11: In file included from include/linux/hardirq.h:11: In file included from arch/um/include/asm/hardirq.h:5: In file included from include/asm-generic/hardirq.h:17: In file included from include/linux/irq.h:20: In file included from include/linux/io.h:13: In file included from arch/um/include/asm/io.h:24: include/asm-generic/io.h:584:33: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] 584 | __raw_writeb(value, PCI_IOBASE + addr); | ~~~~~~~~~~ ^ include/asm-generic/io.h:594:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] 594 | __raw_writew((u16 __force)cpu_to_le16(value), PCI_IOBASE + addr); | ~~~~~~~~~~ ^ include/asm-generic/io.h:604:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] 604 | __raw_writel((u32 __force)cpu_to_le32(value), PCI_IOBASE + addr); | ~~~~~~~~~~ ^ include/asm-generic/io.h:692:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] 692 | readsb(PCI_IOBASE + addr, buffer, count); | ~~~~~~~~~~ ^ include/asm-generic/io.h:700:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] 700 | readsw(PCI_IOBASE + addr, buffer, count); | ~~~~~~~~~~ ^ include/asm-generic/io.h:708:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] 708 | readsl(PCI_IOBASE + addr, buffer, count); | ~~~~~~~~~~ ^ include/asm-generic/io.h:717:21: warning: performing pointer arithmetic on a null pointer has undefined behavior 
[-Wnull-pointer-arithmetic] 717 | writesb(PCI_IOBASE + addr, buffer, count); | ~~~~~~~~~~ ^ include/asm-generic/io.h:726:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] 726 | writesw(PCI_IOBASE + addr, buffer, count); | ~~~~~~~~~~ ^ include/asm-generic/io.h:735:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] 735 | writesl(PCI_IOBASE + addr, buffer, count); | ~~~~~~~~~~ ^ >> mm/memory.c:4271:2: error: call to undeclared function 'set_ptes'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration] 4271 | set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount); | ^ mm/memory.c:4271:2: note: did you mean 'set_pte'? arch/um/include/asm/pgtable.h:232:20: note: 'set_pte' declared here 232 | static inline void set_pte(pte_t *pteptr, pte_t pteval) | ^ >> mm/memory.c:4274:2: error: call to undeclared function 'update_mmu_cache_range'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration] 4274 | update_mmu_cache_range(vma, addr, vmf->pte, pgcount); | ^ 12 warnings and 2 errors generated. vim +/set_ptes +4271 mm/memory.c 4135 4136 /* 4137 * We enter with non-exclusive mmap_lock (to exclude vma changes, 4138 * but allow concurrent faults), and pte mapped but not yet locked. 4139 * We return with mmap_lock still held, but pte unmapped and unlocked. 4140 */ 4141 static vm_fault_t do_anonymous_page(struct vm_fault *vmf) 4142 { 4143 bool uffd_wp = vmf_orig_pte_uffd_wp(vmf); 4144 struct vm_area_struct *vma = vmf->vma; 4145 struct folio *folio; 4146 vm_fault_t ret = 0; 4147 pte_t entry; 4148 int order; 4149 int pgcount; 4150 unsigned long addr; 4151 4152 /* File mapping without ->vm_ops ? */ 4153 if (vma->vm_flags & VM_SHARED) 4154 return VM_FAULT_SIGBUS; 4155 4156 /* 4157 * Use pte_alloc() instead of pte_alloc_map(). 
We can't run 4158 * pte_offset_map() on pmds where a huge pmd might be created 4159 * from a different thread. 4160 * 4161 * pte_alloc_map() is safe to use under mmap_write_lock(mm) or when 4162 * parallel threads are excluded by other means. 4163 * 4164 * Here we only have mmap_read_lock(mm). 4165 */ 4166 if (pte_alloc(vma->vm_mm, vmf->pmd)) 4167 return VM_FAULT_OOM; 4168 4169 /* See comment in handle_pte_fault() */ 4170 if (unlikely(pmd_trans_unstable(vmf->pmd))) 4171 return 0; 4172 4173 /* Use the zero-page for reads */ 4174 if (!(vmf->flags & FAULT_FLAG_WRITE) && 4175 !mm_forbids_zeropage(vma->vm_mm)) { 4176 entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address), 4177 vma->vm_page_prot)); 4178 vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, 4179 vmf->address, &vmf->ptl); 4180 if (vmf_pte_changed(vmf)) { 4181 update_mmu_tlb(vma, vmf->address, vmf->pte); 4182 goto unlock; 4183 } 4184 ret = check_stable_address_space(vma->vm_mm); 4185 if (ret) 4186 goto unlock; 4187 /* Deliver the page fault to userland, check inside PT lock */ 4188 if (userfaultfd_missing(vma)) { 4189 pte_unmap_unlock(vmf->pte, vmf->ptl); 4190 return handle_userfault(vmf, VM_UFFD_MISSING); 4191 } 4192 if (uffd_wp) 4193 entry = pte_mkuffd_wp(entry); 4194 set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); 4195 4196 /* No need to invalidate - it was non-present before */ 4197 update_mmu_cache(vma, vmf->address, vmf->pte); 4198 goto unlock; 4199 } 4200 4201 /* 4202 * If allocating a large folio, determine the biggest suitable order for 4203 * the VMA (e.g. it must not exceed the VMA's bounds, it must not 4204 * overlap with any populated PTEs, etc). We are not under the ptl here 4205 * so we will need to re-check that we are not overlapping any populated 4206 * PTEs once we have the lock. 4207 */ 4208 order = uffd_wp ? 
0 : max_anon_folio_order(vma); 4209 if (order > 0) { 4210 vmf->pte = pte_offset_map(vmf->pmd, vmf->address); 4211 order = calc_anon_folio_order_alloc(vmf, order); 4212 pte_unmap(vmf->pte); 4213 } 4214 4215 /* Allocate our own private folio. */ 4216 if (unlikely(anon_vma_prepare(vma))) 4217 goto oom; 4218 folio = alloc_anon_folio(vma, vmf->address, order); 4219 if (!folio && order > 0) { 4220 order = 0; 4221 folio = alloc_anon_folio(vma, vmf->address, order); 4222 } 4223 if (!folio) 4224 goto oom; 4225 4226 pgcount = 1 << order; 4227 addr = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT); 4228 4229 if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL)) 4230 goto oom_free_page; 4231 folio_throttle_swaprate(folio, GFP_KERNEL); 4232 4233 /* 4234 * The memory barrier inside __folio_mark_uptodate makes sure that 4235 * preceding stores to the folio contents become visible before 4236 * the set_ptes() write. 4237 */ 4238 __folio_mark_uptodate(folio); 4239 4240 entry = mk_pte(&folio->page, vma->vm_page_prot); 4241 entry = pte_sw_mkyoung(entry); 4242 if (vma->vm_flags & VM_WRITE) 4243 entry = pte_mkwrite(pte_mkdirty(entry)); 4244 4245 vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl); 4246 if (vmf_pte_changed(vmf)) { 4247 update_mmu_tlb(vma, vmf->address, vmf->pte); 4248 goto release; 4249 } else if (order > 0 && check_ptes_none(vmf->pte, pgcount) != pgcount) { 4250 goto release; 4251 } 4252 4253 ret = check_stable_address_space(vma->vm_mm); 4254 if (ret) 4255 goto release; 4256 4257 /* Deliver the page fault to userland, check inside PT lock */ 4258 if (userfaultfd_missing(vma)) { 4259 pte_unmap_unlock(vmf->pte, vmf->ptl); 4260 folio_put(folio); 4261 return handle_userfault(vmf, VM_UFFD_MISSING); 4262 } 4263 4264 folio_ref_add(folio, pgcount - 1); 4265 add_mm_counter(vma->vm_mm, MM_ANONPAGES, pgcount); 4266 folio_add_new_anon_rmap(folio, vma, addr); 4267 folio_add_lru_vma(folio, vma); 4268 4269 if (uffd_wp) 4270 entry = pte_mkuffd_wp(entry); > 4271 
set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount); 4272 4273 /* No need to invalidate - it was non-present before */ > 4274 update_mmu_cache_range(vma, addr, vmf->pte, pgcount); 4275 unlock: 4276 pte_unmap_unlock(vmf->pte, vmf->ptl); 4277 return ret; 4278 release: 4279 folio_put(folio); 4280 goto unlock; 4281 oom_free_page: 4282 folio_put(folio); 4283 oom: 4284 return VM_FAULT_OOM; 4285 } 4286 -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki
* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance
  2023-07-03 13:53 ` Ryan Roberts
@ 2023-07-04  1:35 ` Yu Zhao
  -1 siblings, 0 replies; 167+ messages in thread
From: Yu Zhao @ 2023-07-04  1:35 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei,
	David Hildenbrand, Catalin Marinas, Will Deacon, Anshuman Khandual,
	Yang Shi, linux-arm-kernel, linux-kernel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 13988 bytes --]

On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
> allocated in large folios of a specified order. All pages of the large
> folio are pte-mapped during the same page fault, significantly reducing
> the number of page faults. The number of per-page operations (e.g. ref
> counting, rmap management, lru list management) are also significantly
> reduced since those ops now become per-folio.
>
> The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which
> defaults to disabled for now; there is a long list of todos to make
> FLEXIBLE_THP robust with existing features (e.g. compaction, mlock, some
> madvise ops, etc). These items will be tackled in subsequent patches.
>
> When enabled, the preferred folio order is as returned by
> arch_wants_pte_order(), which may be overridden by the arch as it sees
> fit. Some architectures (e.g. arm64) can coalsece TLB entries if a

coalesce

> contiguous set of ptes map physically contigious, naturally aligned

contiguous

> memory, so this mechanism allows the architecture to optimize as
> required.
>
> If the preferred order can't be used (e.g. because the folio would
> breach the bounds of the vma, or because ptes in the region are already
> mapped) then we fall back to a suitable lower order.
> > Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> > --- > mm/Kconfig | 10 ++++ > mm/memory.c | 168 ++++++++++++++++++++++++++++++++++++++++++++++++---- > 2 files changed, 165 insertions(+), 13 deletions(-) > > diff --git a/mm/Kconfig b/mm/Kconfig > index 7672a22647b4..1c06b2c0a24e 100644 > --- a/mm/Kconfig > +++ b/mm/Kconfig > @@ -822,6 +822,16 @@ config READ_ONLY_THP_FOR_FS > support of file THPs will be developed in the next few release > cycles. > > +config FLEXIBLE_THP > + bool "Flexible order THP" > + depends on TRANSPARENT_HUGEPAGE > + default n The default value is already N. > + help > + Use large (bigger than order-0) folios to back anonymous memory where > + possible, even if the order of the folio is smaller than the PMD > + order. This reduces the number of page faults, as well as other > + per-page overheads to improve performance for many workloads. > + > endif # TRANSPARENT_HUGEPAGE > > # > diff --git a/mm/memory.c b/mm/memory.c > index fb30f7523550..abe2ea94f3f5 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -3001,6 +3001,116 @@ static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf) > return 0; > } > > +#ifdef CONFIG_FLEXIBLE_THP > +/* > + * Allocates, zeros and returns a folio of the requested order for use as > + * anonymous memory. > + */ > +static struct folio *alloc_anon_folio(struct vm_area_struct *vma, > + unsigned long addr, int order) > +{ > + gfp_t gfp; > + struct folio *folio; > + > + if (order == 0) > + return vma_alloc_zeroed_movable_folio(vma, addr); > + > + gfp = vma_thp_gfp_mask(vma); > + folio = vma_alloc_folio(gfp, order, vma, addr, true); > + if (folio) > + clear_huge_page(&folio->page, addr, folio_nr_pages(folio)); > + > + return folio; > +} > + > +/* > + * Preferred folio order to allocate for anonymous memory. 
> + */ > +#define max_anon_folio_order(vma) arch_wants_pte_order(vma) > +#else > +#define alloc_anon_folio(vma, addr, order) \ > + vma_alloc_zeroed_movable_folio(vma, addr) > +#define max_anon_folio_order(vma) 0 > +#endif > + > +/* > + * Returns index of first pte that is not none, or nr if all are none. > + */ > +static inline int check_ptes_none(pte_t *pte, int nr) > +{ > + int i; > + > + for (i = 0; i < nr; i++) { > + if (!pte_none(ptep_get(pte++))) > + return i; > + } > + > + return nr; > +} > + > +static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order) > +{ > + /* > + * The aim here is to determine what size of folio we should allocate > + * for this fault. Factors include: > + * - Order must not be higher than `order` upon entry > + * - Folio must be naturally aligned within VA space > + * - Folio must be fully contained inside one pmd entry > + * - Folio must not breach boundaries of vma > + * - Folio must not overlap any non-none ptes > + * > + * Additionally, we do not allow order-1 since this breaks assumptions > + * elsewhere in the mm; THP pages must be at least order-2 (since they > + * store state up to the 3rd struct page subpage), and these pages must > + * be THP in order to correctly use pre-existing THP infrastructure such > + * as folio_split(). > + * > + * Note that the caller may or may not choose to lock the pte. If > + * unlocked, the result is racy and the user must re-check any overlap > + * with non-none ptes under the lock. > + */ > + > + struct vm_area_struct *vma = vmf->vma; > + int nr; > + unsigned long addr; > + pte_t *pte; > + pte_t *first_set = NULL; > + int ret; > + > + order = min(order, PMD_SHIFT - PAGE_SHIFT); > + > + for (; order > 1; order--) { I'm not sure how we can justify this policy. As an initial step, it'd be a lot easier to sell if we only considered the order of arch_wants_pte_order() and the order 0. 
> + nr = 1 << order; > + addr = ALIGN_DOWN(vmf->address, nr << PAGE_SHIFT); > + pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT); > + > + /* Check vma bounds. */ > + if (addr < vma->vm_start || > + addr + (nr << PAGE_SHIFT) > vma->vm_end) > + continue; > + > + /* Ptes covered by order already known to be none. */ > + if (pte + nr <= first_set) > + break; > + > + /* Already found set pte in range covered by order. */ > + if (pte <= first_set) > + continue; > + > + /* Need to check if all the ptes are none. */ > + ret = check_ptes_none(pte, nr); > + if (ret == nr) > + break; > + > + first_set = pte + ret; > + } > + > + if (order == 1) > + order = 0; > + > + return order; > +} Everything above can be simplified into two helpers: vmf_pte_range_changed() and alloc_anon_folio() (or whatever names you prefer). Details below. > /* > * Handle write page faults for pages that can be reused in the current vma > * > @@ -3073,7 +3183,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) > goto oom; > > if (is_zero_pfn(pte_pfn(vmf->orig_pte))) { > - new_folio = vma_alloc_zeroed_movable_folio(vma, vmf->address); > + new_folio = alloc_anon_folio(vma, vmf->address, 0); This seems unnecessary for now. Later on, we could fill in an aligned area with multiple write-protected zero pages during a read fault and then replace them with a large folio here. > if (!new_folio) > goto oom; > } else { > @@ -4040,6 +4150,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) > struct folio *folio; > vm_fault_t ret = 0; > pte_t entry; > + int order; > + int pgcount; > + unsigned long addr; > > /* File mapping without ->vm_ops ? 
*/ > if (vma->vm_flags & VM_SHARED) > @@ -4081,24 +4194,51 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) > pte_unmap_unlock(vmf->pte, vmf->ptl); > return handle_userfault(vmf, VM_UFFD_MISSING); > } > - goto setpte; > + if (uffd_wp) > + entry = pte_mkuffd_wp(entry); > + set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); > + > + /* No need to invalidate - it was non-present before */ > + update_mmu_cache(vma, vmf->address, vmf->pte); > + goto unlock; > + } Not really needed IMO. Details below. === > + /* > + * If allocating a large folio, determine the biggest suitable order for > + * the VMA (e.g. it must not exceed the VMA's bounds, it must not > + * overlap with any populated PTEs, etc). We are not under the ptl here > + * so we will need to re-check that we are not overlapping any populated > + * PTEs once we have the lock. > + */ > + order = uffd_wp ? 0 : max_anon_folio_order(vma); > + if (order > 0) { > + vmf->pte = pte_offset_map(vmf->pmd, vmf->address); > + order = calc_anon_folio_order_alloc(vmf, order); > + pte_unmap(vmf->pte); > } === The section above together with the section below should be wrapped in a helper. > - /* Allocate our own private page. */ > + /* Allocate our own private folio. */ > if (unlikely(anon_vma_prepare(vma))) > goto oom; === > - folio = vma_alloc_zeroed_movable_folio(vma, vmf->address); > + folio = alloc_anon_folio(vma, vmf->address, order); > + if (!folio && order > 0) { > + order = 0; > + folio = alloc_anon_folio(vma, vmf->address, order); > + } === One helper returns a folio of order arch_wants_pte_order(), or order 0 if it fails to allocate that order, e.g., folio = alloc_anon_folio(vmf); And if vmf_orig_pte_uffd_wp(vmf) is true, the helper allocates order 0 regardless of arch_wants_pte_order(). Upon success, it can update vmf->address, since if we run into a race with another PF, we exit the fault handler and retry anyway.
> if (!folio) > goto oom; > > + pgcount = 1 << order; > + addr = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT); As shown above, the helper already updates vmf->address. And mm/ never used pgcount before -- the convention is nr_pages = folio_nr_pages(). > if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL)) > goto oom_free_page; > folio_throttle_swaprate(folio, GFP_KERNEL); > > /* > * The memory barrier inside __folio_mark_uptodate makes sure that > - * preceding stores to the page contents become visible before > - * the set_pte_at() write. > + * preceding stores to the folio contents become visible before > + * the set_ptes() write. We don't have set_ptes() yet. > */ > __folio_mark_uptodate(folio); > > @@ -4107,11 +4247,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) > if (vma->vm_flags & VM_WRITE) > entry = pte_mkwrite(pte_mkdirty(entry)); > > - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, > - &vmf->ptl); > + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl); > if (vmf_pte_changed(vmf)) { > update_mmu_tlb(vma, vmf->address, vmf->pte); > goto release; > + } else if (order > 0 && check_ptes_none(vmf->pte, pgcount) != pgcount) { > + goto release; > } Need new helper: if (vmf_pte_range_changed(vmf, nr_pages)) { for (i = 0; i < nr_pages; i++) update_mmu_tlb(vma, vmf->address + PAGE_SIZE * i, vmf->pte + i); goto release; } (It should be fine to call update_mmu_tlb() even if it's not really necessary.) 
> ret = check_stable_address_space(vma->vm_mm); > @@ -4125,16 +4266,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) > return handle_userfault(vmf, VM_UFFD_MISSING); > } > > - inc_mm_counter(vma->vm_mm, MM_ANONPAGES); > - folio_add_new_anon_rmap(folio, vma, vmf->address); > + folio_ref_add(folio, pgcount - 1); > + add_mm_counter(vma->vm_mm, MM_ANONPAGES, pgcount); > + folio_add_new_anon_rmap(folio, vma, addr); > folio_add_lru_vma(folio, vma); > -setpte: > + > if (uffd_wp) > entry = pte_mkuffd_wp(entry); > - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); > + set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount); We would have to do it one by one for now. > /* No need to invalidate - it was non-present before */ > - update_mmu_cache(vma, vmf->address, vmf->pte); > + update_mmu_cache_range(vma, addr, vmf->pte, pgcount); Ditto. How about this (by moving mk_pte() and its friends here): ... folio_add_lru_vma(folio, vma); for (i = 0; i < nr_pages; i++) { entry = mk_pte(folio_page(folio, i), vma->vm_page_prot); entry = pte_sw_mkyoung(entry); if (vma->vm_flags & VM_WRITE) entry = pte_mkwrite(pte_mkdirty(entry)); setpte: if (uffd_wp) entry = pte_mkuffd_wp(entry); set_pte_at(vma->vm_mm, vmf->address + PAGE_SIZE * i, vmf->pte + i, entry); /* No need to invalidate - it was non-present before */ update_mmu_cache(vma, vmf->address + PAGE_SIZE * i, vmf->pte + i); } > unlock: > pte_unmap_unlock(vmf->pte, vmf->ptl); > return ret; Attaching a small patch in case anything above is not clear. Please take a look. Thanks. 
[-- Attachment #2: anon_folios.patch --] [-- Type: text/x-patch, Size: 2658 bytes --] diff --git a/mm/memory.c b/mm/memory.c index 40a269457c8b..04fdb8529f68 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4063,6 +4063,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) */ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) { + int i = 0; + int nr_pages = 1; bool uffd_wp = vmf_orig_pte_uffd_wp(vmf); struct vm_area_struct *vma = vmf->vma; struct folio *folio; @@ -4107,10 +4109,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) /* Allocate our own private page. */ if (unlikely(anon_vma_prepare(vma))) goto oom; - folio = vma_alloc_zeroed_movable_folio(vma, vmf->address); + folio = alloc_anon_folio(vmf); // updates vmf->address accordingly if (!folio) goto oom; + nr_pages = folio_nr_pages(folio); + if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL)) goto oom_free_page; folio_throttle_swaprate(folio, GFP_KERNEL); @@ -4122,17 +4126,13 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) */ __folio_mark_uptodate(folio); - entry = mk_pte(&folio->page, vma->vm_page_prot); - entry = pte_sw_mkyoung(entry); - if (vma->vm_flags & VM_WRITE) - entry = pte_mkwrite(pte_mkdirty(entry)); - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); if (!vmf->pte) goto release; - if (vmf_pte_changed(vmf)) { - update_mmu_tlb(vma, vmf->address, vmf->pte); + if (vmf_pte_range_changed(vmf, nr_pages)) { + for (i = 0; i < nr_pages; i++) + update_mmu_tlb(vma, vmf->address + PAGE_SIZE * i, vmf->pte + i); goto release; } @@ -4147,16 +4147,24 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) return handle_userfault(vmf, VM_UFFD_MISSING); } - inc_mm_counter(vma->vm_mm, MM_ANONPAGES); + folio_ref_add(folio, nr_pages - 1); + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages); folio_add_new_anon_rmap(folio, vma, vmf->address); folio_add_lru_vma(folio, vma); + + for (i = 0; i < nr_pages; i++) { + entry = mk_pte(folio_page(folio, i), 
vma->vm_page_prot); + entry = pte_sw_mkyoung(entry); + if (vma->vm_flags & VM_WRITE) + entry = pte_mkwrite(pte_mkdirty(entry)); setpte: - if (uffd_wp) - entry = pte_mkuffd_wp(entry); - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); + if (uffd_wp) + entry = pte_mkuffd_wp(entry); + set_pte_at(vma->vm_mm, vmf->address + PAGE_SIZE * i, vmf->pte + i, entry); - /* No need to invalidate - it was non-present before */ - update_mmu_cache(vma, vmf->address, vmf->pte); + /* No need to invalidate - it was non-present before */ + update_mmu_cache(vma, vmf->address + PAGE_SIZE * i, vmf->pte + i); + } unlock: if (vmf->pte) pte_unmap_unlock(vmf->pte, vmf->ptl); ^ permalink raw reply related [flat|nested] 167+ messages in thread
* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance 2023-07-04 1:35 ` Yu Zhao @ 2023-07-04 14:08 ` Ryan Roberts -1 siblings, 0 replies; 167+ messages in thread From: Ryan Roberts @ 2023-07-04 14:08 UTC (permalink / raw) To: Yu Zhao Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei, David Hildenbrand, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm On 04/07/2023 02:35, Yu Zhao wrote: > On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote: >> >> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be >> allocated in large folios of a specified order. All pages of the large >> folio are pte-mapped during the same page fault, significantly reducing >> the number of page faults. The number of per-page operations (e.g. ref >> counting, rmap management lru list management) are also significantly >> reduced since those ops now become per-folio. >> >> The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which >> defaults to disabled for now; there is a long list of todos to make >> FLEXIBLE_THP robust with existing features (e.g. compaction, mlock, some >> madvise ops, etc). These items will be tackled in subsequent patches. >> >> When enabled, the preferred folio order is as returned by >> arch_wants_pte_order(), which may be overridden by the arch as it sees >> fit. Some architectures (e.g. arm64) can coalsece TLB entries if a > > coalesce ACK > >> contiguous set of ptes map physically contigious, naturally aligned > > contiguous ACK > >> memory, so this mechanism allows the architecture to optimize as >> required. >> >> If the preferred order can't be used (e.g. because the folio would >> breach the bounds of the vma, or because ptes in the region are already >> mapped) then we fall back to a suitable lower order. 
* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance @ 2023-07-04 14:08 ` Ryan Roberts 0 siblings, 0 replies; 167+ messages in thread From: Ryan Roberts @ 2023-07-04 14:08 UTC (permalink / raw) To: Yu Zhao Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei, David Hildenbrand, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm On 04/07/2023 02:35, Yu Zhao wrote: > On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote: >> >> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be >> allocated in large folios of a specified order. All pages of the large >> folio are pte-mapped during the same page fault, significantly reducing >> the number of page faults. The number of per-page operations (e.g. ref >> counting, rmap management lru list management) are also significantly >> reduced since those ops now become per-folio. >> >> The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which >> defaults to disabled for now; there is a long list of todos to make >> FLEXIBLE_THP robust with existing features (e.g. compaction, mlock, some >> madvise ops, etc). These items will be tackled in subsequent patches. >> >> When enabled, the preferred folio order is as returned by >> arch_wants_pte_order(), which may be overridden by the arch as it sees >> fit. Some architectures (e.g. arm64) can coalsece TLB entries if a > > coalesce ACK > >> contiguous set of ptes map physically contigious, naturally aligned > > contiguous ACK > >> memory, so this mechanism allows the architecture to optimize as >> required. >> >> If the preferred order can't be used (e.g. because the folio would >> breach the bounds of the vma, or because ptes in the region are already >> mapped) then we fall back to a suitable lower order. 
>> >> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> >> --- >> mm/Kconfig | 10 ++++ >> mm/memory.c | 168 ++++++++++++++++++++++++++++++++++++++++++++++++---- >> 2 files changed, 165 insertions(+), 13 deletions(-) >> >> diff --git a/mm/Kconfig b/mm/Kconfig >> index 7672a22647b4..1c06b2c0a24e 100644 >> --- a/mm/Kconfig >> +++ b/mm/Kconfig >> @@ -822,6 +822,16 @@ config READ_ONLY_THP_FOR_FS >> support of file THPs will be developed in the next few release >> cycles. >> >> +config FLEXIBLE_THP >> + bool "Flexible order THP" >> + depends on TRANSPARENT_HUGEPAGE >> + default n > > The default value is already N. Is there a coding standard for this? Personally I prefer to make it explicit. > >> + help >> + Use large (bigger than order-0) folios to back anonymous memory where >> + possible, even if the order of the folio is smaller than the PMD >> + order. This reduces the number of page faults, as well as other >> + per-page overheads to improve performance for many workloads. >> + >> endif # TRANSPARENT_HUGEPAGE >> >> # >> diff --git a/mm/memory.c b/mm/memory.c >> index fb30f7523550..abe2ea94f3f5 100644 >> --- a/mm/memory.c >> +++ b/mm/memory.c >> @@ -3001,6 +3001,116 @@ static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf) >> return 0; >> } >> >> +#ifdef CONFIG_FLEXIBLE_THP >> +/* >> + * Allocates, zeros and returns a folio of the requested order for use as >> + * anonymous memory. >> + */ >> +static struct folio *alloc_anon_folio(struct vm_area_struct *vma, >> + unsigned long addr, int order) >> +{ >> + gfp_t gfp; >> + struct folio *folio; >> + >> + if (order == 0) >> + return vma_alloc_zeroed_movable_folio(vma, addr); >> + >> + gfp = vma_thp_gfp_mask(vma); >> + folio = vma_alloc_folio(gfp, order, vma, addr, true); >> + if (folio) >> + clear_huge_page(&folio->page, addr, folio_nr_pages(folio)); >> + >> + return folio; >> +} >> + >> +/* >> + * Preferred folio order to allocate for anonymous memory. 
>> + */ >> +#define max_anon_folio_order(vma) arch_wants_pte_order(vma) >> +#else >> +#define alloc_anon_folio(vma, addr, order) \ >> + vma_alloc_zeroed_movable_folio(vma, addr) >> +#define max_anon_folio_order(vma) 0 >> +#endif >> + >> +/* >> + * Returns index of first pte that is not none, or nr if all are none. >> + */ >> +static inline int check_ptes_none(pte_t *pte, int nr) >> +{ >> + int i; >> + >> + for (i = 0; i < nr; i++) { >> + if (!pte_none(ptep_get(pte++))) >> + return i; >> + } >> + >> + return nr; >> +} >> + >> +static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order) >> +{ >> + /* >> + * The aim here is to determine what size of folio we should allocate >> + * for this fault. Factors include: >> + * - Order must not be higher than `order` upon entry >> + * - Folio must be naturally aligned within VA space >> + * - Folio must be fully contained inside one pmd entry >> + * - Folio must not breach boundaries of vma >> + * - Folio must not overlap any non-none ptes >> + * >> + * Additionally, we do not allow order-1 since this breaks assumptions >> + * elsewhere in the mm; THP pages must be at least order-2 (since they >> + * store state up to the 3rd struct page subpage), and these pages must >> + * be THP in order to correctly use pre-existing THP infrastructure such >> + * as folio_split(). >> + * >> + * Note that the caller may or may not choose to lock the pte. If >> + * unlocked, the result is racy and the user must re-check any overlap >> + * with non-none ptes under the lock. >> + */ >> + >> + struct vm_area_struct *vma = vmf->vma; >> + int nr; >> + unsigned long addr; >> + pte_t *pte; >> + pte_t *first_set = NULL; >> + int ret; >> + >> + order = min(order, PMD_SHIFT - PAGE_SHIFT); >> + >> + for (; order > 1; order--) { > > I'm not sure how we can justify this policy. As an initial step, it'd > be a lot easier to sell if we only considered the order of > arch_wants_pte_order() and the order 0. 
My justification is in the cover letter; I see performance regression (vs the unpatched kernel) when using the policy you suggest. This policy performs much better in my tests. (I'll reply directly to your follow-up questions in the cover letter shortly). What are your technical concerns about this approach? It is pretty lightweight (I only touch each PTE once, regardless of the number of loops). If we have strong technical reasons for reverting to the less performant approach then fair enough, but I'd like to hear the rationale first. > >> + nr = 1 << order; >> + addr = ALIGN_DOWN(vmf->address, nr << PAGE_SHIFT); >> + pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT); >> + >> + /* Check vma bounds. */ >> + if (addr < vma->vm_start || >> + addr + (nr << PAGE_SHIFT) > vma->vm_end) >> + continue; >> + >> + /* Ptes covered by order already known to be none. */ >> + if (pte + nr <= first_set) >> + break; >> + >> + /* Already found set pte in range covered by order. */ >> + if (pte <= first_set) >> + continue; >> + >> + /* Need to check if all the ptes are none. */ >> + ret = check_ptes_none(pte, nr); >> + if (ret == nr) >> + break; >> + >> + first_set = pte + ret; >> + } >> + >> + if (order == 1) >> + order = 0; >> + >> + return order; >> +} > > Everything above can be simplified into two helpers: > vmf_pte_range_changed() and alloc_anon_folio() (or whatever names you > prefer). Details below. > >> /* >> * Handle write page faults for pages that can be reused in the current vma >> * >> @@ -3073,7 +3183,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) >> goto oom; >> >> if (is_zero_pfn(pte_pfn(vmf->orig_pte))) { >> - new_folio = vma_alloc_zeroed_movable_folio(vma, vmf->address); >> + new_folio = alloc_anon_folio(vma, vmf->address, 0); > > This seems unnecessary for now. Later on, we could fill in an aligned > area with multiple write-protected zero pages during a read fault and > then replace them with a large folio here. I don't have a strong opinion. 
I thought that it would be neater to use the same API everywhere, but happy to revert. > >> if (!new_folio) >> goto oom; >> } else { >> @@ -4040,6 +4150,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) >> struct folio *folio; >> vm_fault_t ret = 0; >> pte_t entry; >> + int order; >> + int pgcount; >> + unsigned long addr; >> >> /* File mapping without ->vm_ops ? */ >> if (vma->vm_flags & VM_SHARED) >> @@ -4081,24 +4194,51 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) >> pte_unmap_unlock(vmf->pte, vmf->ptl); >> return handle_userfault(vmf, VM_UFFD_MISSING); >> } >> - goto setpte; >> + if (uffd_wp) >> + entry = pte_mkuffd_wp(entry); >> + set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); >> + >> + /* No need to invalidate - it was non-present before */ >> + update_mmu_cache(vma, vmf->address, vmf->pte); >> + goto unlock; >> + } > > Not really needed IMO. Details below. > > === > >> + /* >> + * If allocating a large folio, determine the biggest suitable order for >> + * the VMA (e.g. it must not exceed the VMA's bounds, it must not >> + * overlap with any populated PTEs, etc). We are not under the ptl here >> + * so we will need to re-check that we are not overlapping any populated >> + * PTEs once we have the lock. >> + */ >> + order = uffd_wp ? 0 : max_anon_folio_order(vma); >> + if (order > 0) { >> + vmf->pte = pte_offset_map(vmf->pmd, vmf->address); >> + order = calc_anon_folio_order_alloc(vmf, order); >> + pte_unmap(vmf->pte); >> } > > === > > The section above together with the section below should be wrapped in a helper. > >> - /* Allocate our own private page. */ >> + /* Allocate our own private folio. 
*/ >> if (unlikely(anon_vma_prepare(vma))) >> goto oom; > > === > >> - folio = vma_alloc_zeroed_movable_folio(vma, vmf->address); >> + folio = alloc_anon_folio(vma, vmf->address, order); >> + if (!folio && order > 0) { >> + order = 0; >> + folio = alloc_anon_folio(vma, vmf->address, order); >> + } > > === > > One helper returns a folio of order arch_wants_pte_order(), or order 0 > if it fails to allocate that order, e.g., > > folio = alloc_anon_folio(vmf); > > And if vmf_orig_pte_uffd_wp(vmf) is true, the helper allocates order 0 > regardless of arch_wants_pte_order(). Upon success, it can update > vmf->address, since if we run into a race with another PF, we exit the > fault handler and retry anyway. > >> if (!folio) >> goto oom; >> >> + pgcount = 1 << order; >> + addr = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT); > > As shown above, the helper already updates vmf->address. And mm/ never > used pgcount before -- the convention is nr_pages = folio_nr_pages(). ACK > >> if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL)) >> goto oom_free_page; >> folio_throttle_swaprate(folio, GFP_KERNEL); >> >> /* >> * The memory barrier inside __folio_mark_uptodate makes sure that >> - * preceding stores to the page contents become visible before >> - * the set_pte_at() write. >> + * preceding stores to the folio contents become visible before >> + * the set_ptes() write. > > We don't have set_ptes() yet. 
Indeed, that's why I listed the set_ptes() patch set as a hard dependency ;-) > >> */ >> __folio_mark_uptodate(folio); >> >> @@ -4107,11 +4247,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) >> if (vma->vm_flags & VM_WRITE) >> entry = pte_mkwrite(pte_mkdirty(entry)); >> >> - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, >> - &vmf->ptl); >> + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl); >> if (vmf_pte_changed(vmf)) { >> update_mmu_tlb(vma, vmf->address, vmf->pte); >> goto release; >> + } else if (order > 0 && check_ptes_none(vmf->pte, pgcount) != pgcount) { >> + goto release; >> } > > Need new helper: > > if (vmf_pte_range_changed(vmf, nr_pages)) { > for (i = 0; i < nr_pages; i++) > update_mmu_tlb(vma, vmf->address + PAGE_SIZE * i, vmf->pte + i); > goto release; > } > > (It should be fine to call update_mmu_tlb() even if it's not really necessary.) > >> ret = check_stable_address_space(vma->vm_mm); >> @@ -4125,16 +4266,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) >> return handle_userfault(vmf, VM_UFFD_MISSING); >> } >> >> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES); >> - folio_add_new_anon_rmap(folio, vma, vmf->address); >> + folio_ref_add(folio, pgcount - 1); >> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, pgcount); >> + folio_add_new_anon_rmap(folio, vma, addr); >> folio_add_lru_vma(folio, vma); >> -setpte: >> + >> if (uffd_wp) >> entry = pte_mkuffd_wp(entry); >> - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); >> + set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount); > > We would have to do it one by one for now. > >> /* No need to invalidate - it was non-present before */ >> - update_mmu_cache(vma, vmf->address, vmf->pte); >> + update_mmu_cache_range(vma, addr, vmf->pte, pgcount); > > Ditto. > > How about this (by moving mk_pte() and its friends here): > ... 
> folio_add_lru_vma(folio, vma); > > for (i = 0; i < nr_pages; i++) { > entry = mk_pte(folio_page(folio, i), vma->vm_page_prot); > entry = pte_sw_mkyoung(entry); > if (vma->vm_flags & VM_WRITE) > entry = pte_mkwrite(pte_mkdirty(entry)); > setpte: > if (uffd_wp) > entry = pte_mkuffd_wp(entry); > set_pte_at(vma->vm_mm, vmf->address + PAGE_SIZE * i, > vmf->pte + i, entry); > > /* No need to invalidate - it was non-present before */ > update_mmu_cache(vma, vmf->address + PAGE_SIZE * i, > vmf->pte + i); > } > >> unlock: >> pte_unmap_unlock(vmf->pte, vmf->ptl); >> return ret; > > Attaching a small patch in case anything above is not clear. Please > take a look. Thanks. OK, I'll take a look and rework for v3. _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 167+ messages in thread
* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance 2023-07-04 14:08 ` Ryan Roberts @ 2023-07-04 23:47 ` Yu Zhao -1 siblings, 0 replies; 167+ messages in thread From: Yu Zhao @ 2023-07-04 23:47 UTC (permalink / raw) To: Ryan Roberts Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei, David Hildenbrand, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm On Tue, Jul 4, 2023 at 8:08 AM Ryan Roberts <ryan.roberts@arm.com> wrote: > > On 04/07/2023 02:35, Yu Zhao wrote: > > On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote: > >> > >> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be > >> allocated in large folios of a specified order. All pages of the large > >> folio are pte-mapped during the same page fault, significantly reducing > >> the number of page faults. The number of per-page operations (e.g. ref > >> counting, rmap management lru list management) are also significantly > >> reduced since those ops now become per-folio. > >> > >> The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which > >> defaults to disabled for now; there is a long list of todos to make > >> FLEXIBLE_THP robust with existing features (e.g. compaction, mlock, some > >> madvise ops, etc). These items will be tackled in subsequent patches. > >> > >> When enabled, the preferred folio order is as returned by > >> arch_wants_pte_order(), which may be overridden by the arch as it sees > >> fit. Some architectures (e.g. arm64) can coalsece TLB entries if a > > > > coalesce > > ACK > > > > >> contiguous set of ptes map physically contigious, naturally aligned > > > > contiguous > > ACK > > > > >> memory, so this mechanism allows the architecture to optimize as > >> required. > >> > >> If the preferred order can't be used (e.g. 
because the folio would > >> breach the bounds of the vma, or because ptes in the region are already > >> mapped) then we fall back to a suitable lower order. > >> > >> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> > >> --- > >> mm/Kconfig | 10 ++++ > >> mm/memory.c | 168 ++++++++++++++++++++++++++++++++++++++++++++++++---- > >> 2 files changed, 165 insertions(+), 13 deletions(-) > >> > >> diff --git a/mm/Kconfig b/mm/Kconfig > >> index 7672a22647b4..1c06b2c0a24e 100644 > >> --- a/mm/Kconfig > >> +++ b/mm/Kconfig > >> @@ -822,6 +822,16 @@ config READ_ONLY_THP_FOR_FS > >> support of file THPs will be developed in the next few release > >> cycles. > >> > >> +config FLEXIBLE_THP > >> + bool "Flexible order THP" > >> + depends on TRANSPARENT_HUGEPAGE > >> + default n > > > > The default value is already N. > > Is there a coding standard for this? Personally I prefer to make it explicit. > > > > >> + help > >> + Use large (bigger than order-0) folios to back anonymous memory where > >> + possible, even if the order of the folio is smaller than the PMD > >> + order. This reduces the number of page faults, as well as other > >> + per-page overheads to improve performance for many workloads. > >> + > >> endif # TRANSPARENT_HUGEPAGE > >> > >> # > >> diff --git a/mm/memory.c b/mm/memory.c > >> index fb30f7523550..abe2ea94f3f5 100644 > >> --- a/mm/memory.c > >> +++ b/mm/memory.c > >> @@ -3001,6 +3001,116 @@ static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf) > >> return 0; > >> } > >> > >> +#ifdef CONFIG_FLEXIBLE_THP > >> +/* > >> + * Allocates, zeros and returns a folio of the requested order for use as > >> + * anonymous memory. 
> >> + */ > >> +static struct folio *alloc_anon_folio(struct vm_area_struct *vma, > >> + unsigned long addr, int order) > >> +{ > >> + gfp_t gfp; > >> + struct folio *folio; > >> + > >> + if (order == 0) > >> + return vma_alloc_zeroed_movable_folio(vma, addr); > >> + > >> + gfp = vma_thp_gfp_mask(vma); > >> + folio = vma_alloc_folio(gfp, order, vma, addr, true); > >> + if (folio) > >> + clear_huge_page(&folio->page, addr, folio_nr_pages(folio)); > >> + > >> + return folio; > >> +} > >> + > >> +/* > >> + * Preferred folio order to allocate for anonymous memory. > >> + */ > >> +#define max_anon_folio_order(vma) arch_wants_pte_order(vma) > >> +#else > >> +#define alloc_anon_folio(vma, addr, order) \ > >> + vma_alloc_zeroed_movable_folio(vma, addr) > >> +#define max_anon_folio_order(vma) 0 > >> +#endif > >> + > >> +/* > >> + * Returns index of first pte that is not none, or nr if all are none. > >> + */ > >> +static inline int check_ptes_none(pte_t *pte, int nr) > >> +{ > >> + int i; > >> + > >> + for (i = 0; i < nr; i++) { > >> + if (!pte_none(ptep_get(pte++))) > >> + return i; > >> + } > >> + > >> + return nr; > >> +} > >> + > >> +static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order) > >> +{ > >> + /* > >> + * The aim here is to determine what size of folio we should allocate > >> + * for this fault. Factors include: > >> + * - Order must not be higher than `order` upon entry > >> + * - Folio must be naturally aligned within VA space > >> + * - Folio must be fully contained inside one pmd entry > >> + * - Folio must not breach boundaries of vma > >> + * - Folio must not overlap any non-none ptes > >> + * > >> + * Additionally, we do not allow order-1 since this breaks assumptions > >> + * elsewhere in the mm; THP pages must be at least order-2 (since they > >> + * store state up to the 3rd struct page subpage), and these pages must > >> + * be THP in order to correctly use pre-existing THP infrastructure such > >> + * as folio_split(). 
> >> + * > >> + * Note that the caller may or may not choose to lock the pte. If > >> + * unlocked, the result is racy and the user must re-check any overlap > >> + * with non-none ptes under the lock. > >> + */ > >> + > >> + struct vm_area_struct *vma = vmf->vma; > >> + int nr; > >> + unsigned long addr; > >> + pte_t *pte; > >> + pte_t *first_set = NULL; > >> + int ret; > >> + > >> + order = min(order, PMD_SHIFT - PAGE_SHIFT); > >> + > >> + for (; order > 1; order--) { > > > > I'm not sure how we can justify this policy. As an initial step, it'd > > be a lot easier to sell if we only considered the order of > > arch_wants_pte_order() and the order 0. > > My justification is in the cover letter; I see performance regression (vs the > unpatched kernel) when using the policy you suggest. This policy performs much > better in my tests. (I'll reply directly to your follow up questions in the > cover letter shortly). > > What are your technical concerns about this approach? It is pretty light weight > (I only touch each PTE once, regardless of the number of loops). If we have > strong technical reasons for reverting to the less performant approach then fair > enough, but I'd like to hear the rational first. Yes, mainly from three different angles: 1. The engineering principle: we'd want to separate the mechanical part and the policy part when attacking something large. This way it'd be easier to root cause any regressions if they happen. In our case, assuming the regression is real, it might actually prove my point here: I really don't think the two checks (if a vma range fits and if it does, which is unlikely according to your description, if all 64 PTEs are none) caused the regression. My theory is that 64KB itself caused the regression, but smaller sizes made an improvement. If this is really the case, I'd say the fallback policy masked the real problem, which is that 64KB is too large to begin with. 2. 
The benchmark methodology: I appreciate your effort in doing it, but we also need to consider that the setup is an uncommon scenario. The common scenarios are devices that have been running for weeks without reboots, generally having higher external fragmentation. In addition, client devices are often under memory pressure, which makes fragmentation worse. So we should take the result with a grain of salt, and for that matter, results from right after fresh reboots. 3. The technical concern: an ideal policy would consider all three major factors: the h/w features, userspace behaviors and the page allocator behavior. So far we only have the first one handy. The second one is too challenging, so let's forget about it for now. The third one is why I really don't like this best-fit policy. By falling back to smaller orders, we can waste a limited number of physically contiguous pages on wrong vmas (small vmas only), leading to failures to serve large vmas which otherwise would have a higher overall ROI. This can only be addressed within the page allocator: we need to enlighten it to return the highest order available, i.e., not breaking up any higher orders. I'm not really saying we should never try this fallback policy. I'm just thinking we can leave it for later, probably after we've addressed all the concerns with basic functionality. ^ permalink raw reply [flat|nested] 167+ messages in thread
> >> + * > >> + * Note that the caller may or may not choose to lock the pte. If > >> + * unlocked, the result is racy and the user must re-check any overlap > >> + * with non-none ptes under the lock. > >> + */ > >> + > >> + struct vm_area_struct *vma = vmf->vma; > >> + int nr; > >> + unsigned long addr; > >> + pte_t *pte; > >> + pte_t *first_set = NULL; > >> + int ret; > >> + > >> + order = min(order, PMD_SHIFT - PAGE_SHIFT); > >> + > >> + for (; order > 1; order--) { > > > > I'm not sure how we can justify this policy. As an initial step, it'd > > be a lot easier to sell if we only considered the order of > > arch_wants_pte_order() and the order 0. > > My justification is in the cover letter; I see performance regression (vs the > unpatched kernel) when using the policy you suggest. This policy performs much > better in my tests. (I'll reply directly to your follow up questions in the > cover letter shortly). > > What are your technical concerns about this approach? It is pretty light weight > (I only touch each PTE once, regardless of the number of loops). If we have > strong technical reasons for reverting to the less performant approach then fair > enough, but I'd like to hear the rational first. Yes, mainly from three different angles: 1. The engineering principle: we'd want to separate the mechanical part and the policy part when attacking something large. This way it'd be easier to root cause any regressions if they happen. In our case, assuming the regression is real, it might actually prove my point here: I really don't think the two checks (if a vma range fits and if it does, which is unlikely according to your description, if all 64 PTEs are none) caused the regression. My theory is that 64KB itself caused the regression, but smaller sizes made an improvement. If this is really the case, I'd say the fallback policy masked the real problem, which is that 64KB is too large to begin with. 2. 
The benchmark methodology: I appreciate your effort in doing it, but we also need to consider that the setup is an uncommon scenario. The common scenarios are devices that have been running for weeks without reboots, generally having higher external fragmentation. In addition, for client devices, they are often under memory pressure, which makes fragmentation worse. So we should take the result with a grain of salt, and for that matter, results taken right after fresh reboots. 3. The technical concern: an ideal policy would consider all three major factors: the h/w features, userspace behaviors and the page allocator behavior. So far we only have the first one handy. The second one is too challenging, so let's forget about it for now. The third one is why I really don't like this best-fit policy. By falling back to smaller orders, we can waste a limited number of physically contiguous pages on the wrong vmas (small vmas only), leading to failures to serve large vmas which otherwise would have a higher overall ROI. This can only be addressed within the page allocator: we need to enlighten it to return the highest order available, i.e., not breaking up any higher orders. I'm not really saying we should never try this fallback policy. I'm just thinking we can leave it for later, probably after we've addressed all the concerns with basic functionality. _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 167+ messages in thread
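Yu's third concern — that a best-fit fallback lets small vmas consume scarce high-order blocks which large vmas could otherwise have used at a higher overall ROI — can be illustrated with a deliberately crude userspace counting model. Everything below is hypothetical (a toy pool with made-up names, not the buddy allocator); it only demonstrates the accounting argument, under the assumption that order-0 pages are plentiful while high-order blocks are scarce:

```c
#include <assert.h>

/* Toy pool: a fixed number of free order-4 blocks; order-0 pages
 * are assumed plentiful and are not tracked. */
struct pool { int free_order4; };

/* Best-fit fallback: a fault in a small vma (too small for order 4)
 * still splits a high-order block to obtain its order-2 folio. */
static void small_fault_bestfit(struct pool *p)
{
	if (p->free_order4 > 0)
		p->free_order4--;	/* split a scarce high-order block */
	/* else fall through to an (untracked) order-0 page */
}

/* Preferred-or-zero: the small vma cannot take the preferred order,
 * so it falls straight to order 0 and leaves the pool alone. */
static void small_fault_pref_or_zero(struct pool *p)
{
	(void)p;			/* nothing taken from the pool */
}

/* A fault in a large vma gets order 4 only if a block is left. */
static int large_fault(struct pool *p)
{
	if (p->free_order4 > 0) {
		p->free_order4--;
		return 4;
	}
	return 0;			/* degraded to order-0 */
}
```

Running four small-vma faults followed by large-vma faults under each policy shows the effect: best-fit drains the pool on small vmas, so the later large vmas only get order 0, while preferred-or-zero preserves the pool for them.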
* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance 2023-07-03 13:53 ` Ryan Roberts @ 2023-07-04 3:45 ` Yin, Fengwei -1 siblings, 0 replies; 167+ messages in thread From: Yin, Fengwei @ 2023-07-04 3:45 UTC (permalink / raw) To: Ryan Roberts, Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi Cc: linux-arm-kernel, linux-kernel, linux-mm On 7/3/2023 9:53 PM, Ryan Roberts wrote: > Introduce FLEXIBLE_THP feature, which allows anonymous memory to be THP refers to huge pages, which are 2M in size. These are not huge pages here. But I don't have a good name either. > allocated in large folios of a specified order. All pages of the large > folio are pte-mapped during the same page fault, significantly reducing > the number of page faults. The number of per-page operations (e.g. ref > counting, rmap management lru list management) are also significantly > reduced since those ops now become per-folio. > > The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which > defaults to disabled for now; there is a long list of todos to make > FLEXIBLE_THP robust with existing features (e.g. compaction, mlock, some > madvise ops, etc). These items will be tackled in subsequent patches. > > When enabled, the preferred folio order is as returned by > arch_wants_pte_order(), which may be overridden by the arch as it sees > fit. Some architectures (e.g. arm64) can coalsece TLB entries if a > contiguous set of ptes map physically contigious, naturally aligned > memory, so this mechanism allows the architecture to optimize as > required. > > If the preferred order can't be used (e.g. because the folio would > breach the bounds of the vma, or because ptes in the region are already > mapped) then we fall back to a suitable lower order. 
> > Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> > --- > mm/Kconfig | 10 ++++ > mm/memory.c | 168 ++++++++++++++++++++++++++++++++++++++++++++++++---- > 2 files changed, 165 insertions(+), 13 deletions(-) > > diff --git a/mm/Kconfig b/mm/Kconfig > index 7672a22647b4..1c06b2c0a24e 100644 > --- a/mm/Kconfig > +++ b/mm/Kconfig > @@ -822,6 +822,16 @@ config READ_ONLY_THP_FOR_FS > support of file THPs will be developed in the next few release > cycles. > > +config FLEXIBLE_THP > + bool "Flexible order THP" > + depends on TRANSPARENT_HUGEPAGE > + default n > + help > + Use large (bigger than order-0) folios to back anonymous memory where > + possible, even if the order of the folio is smaller than the PMD > + order. This reduces the number of page faults, as well as other > + per-page overheads to improve performance for many workloads. > + > endif # TRANSPARENT_HUGEPAGE > > # > diff --git a/mm/memory.c b/mm/memory.c > index fb30f7523550..abe2ea94f3f5 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -3001,6 +3001,116 @@ static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf) > return 0; > } > > +#ifdef CONFIG_FLEXIBLE_THP > +/* > + * Allocates, zeros and returns a folio of the requested order for use as > + * anonymous memory. > + */ > +static struct folio *alloc_anon_folio(struct vm_area_struct *vma, > + unsigned long addr, int order) > +{ > + gfp_t gfp; > + struct folio *folio; > + > + if (order == 0) > + return vma_alloc_zeroed_movable_folio(vma, addr); > + > + gfp = vma_thp_gfp_mask(vma); > + folio = vma_alloc_folio(gfp, order, vma, addr, true); > + if (folio) > + clear_huge_page(&folio->page, addr, folio_nr_pages(folio)); > + > + return folio; > +} > + > +/* > + * Preferred folio order to allocate for anonymous memory. 
> + */ > +#define max_anon_folio_order(vma) arch_wants_pte_order(vma) > +#else > +#define alloc_anon_folio(vma, addr, order) \ > + vma_alloc_zeroed_movable_folio(vma, addr) > +#define max_anon_folio_order(vma) 0 > +#endif > + > +/* > + * Returns index of first pte that is not none, or nr if all are none. > + */ > +static inline int check_ptes_none(pte_t *pte, int nr) > +{ > + int i; > + > + for (i = 0; i < nr; i++) { > + if (!pte_none(ptep_get(pte++))) > + return i; > + } > + > + return nr; > +} > + > +static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order) > +{ > + /* > + * The aim here is to determine what size of folio we should allocate > + * for this fault. Factors include: > + * - Order must not be higher than `order` upon entry > + * - Folio must be naturally aligned within VA space > + * - Folio must be fully contained inside one pmd entry > + * - Folio must not breach boundaries of vma > + * - Folio must not overlap any non-none ptes > + * > + * Additionally, we do not allow order-1 since this breaks assumptions > + * elsewhere in the mm; THP pages must be at least order-2 (since they > + * store state up to the 3rd struct page subpage), and these pages must > + * be THP in order to correctly use pre-existing THP infrastructure such > + * as folio_split(). > + * > + * Note that the caller may or may not choose to lock the pte. If > + * unlocked, the result is racy and the user must re-check any overlap > + * with non-none ptes under the lock. > + */ > + > + struct vm_area_struct *vma = vmf->vma; > + int nr; > + unsigned long addr; > + pte_t *pte; > + pte_t *first_set = NULL; > + int ret; > + > + order = min(order, PMD_SHIFT - PAGE_SHIFT); > + > + for (; order > 1; order--) { > + nr = 1 << order; > + addr = ALIGN_DOWN(vmf->address, nr << PAGE_SHIFT); > + pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT); > + > + /* Check vma bounds. 
*/ > + if (addr < vma->vm_start || > + addr + (nr << PAGE_SHIFT) > vma->vm_end) > + continue; > + > + /* Ptes covered by order already known to be none. */ > + if (pte + nr <= first_set) > + break; > + > + /* Already found set pte in range covered by order. */ > + if (pte <= first_set) > + continue; > + > + /* Need to check if all the ptes are none. */ > + ret = check_ptes_none(pte, nr); > + if (ret == nr) > + break; > + > + first_set = pte + ret; > + } > + > + if (order == 1) > + order = 0; > + > + return order; > +} The only logic in the above function that should be kept is the check that the order fits in the vma range. check_ptes_none() is not accurate here because no page table lock is held and a concurrent fault could happen. So maybe just drop the check here? check_ptes_none() is done again after taking the page table lock. We would then pick either the arch-preferred order or order 0. > + > /* > * Handle write page faults for pages that can be reused in the current vma > * > @@ -3073,7 +3183,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) > goto oom; > > if (is_zero_pfn(pte_pfn(vmf->orig_pte))) { > - new_folio = vma_alloc_zeroed_movable_folio(vma, vmf->address); > + new_folio = alloc_anon_folio(vma, vmf->address, 0); > if (!new_folio) > goto oom; > } else { > @@ -4040,6 +4150,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) > struct folio *folio; > vm_fault_t ret = 0; > pte_t entry; > + int order; > + int pgcount; > + unsigned long addr; > > /* File mapping without ->vm_ops ? 
*/ > if (vma->vm_flags & VM_SHARED) > @@ -4081,24 +4194,51 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) > pte_unmap_unlock(vmf->pte, vmf->ptl); > return handle_userfault(vmf, VM_UFFD_MISSING); > } > - goto setpte; > + if (uffd_wp) > + entry = pte_mkuffd_wp(entry); > + set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); > + > + /* No need to invalidate - it was non-present before */ > + update_mmu_cache(vma, vmf->address, vmf->pte); > + goto unlock; > + } > + > + /* > + * If allocating a large folio, determine the biggest suitable order for > + * the VMA (e.g. it must not exceed the VMA's bounds, it must not > + * overlap with any populated PTEs, etc). We are not under the ptl here > + * so we will need to re-check that we are not overlapping any populated > + * PTEs once we have the lock. > + */ > + order = uffd_wp ? 0 : max_anon_folio_order(vma); > + if (order > 0) { > + vmf->pte = pte_offset_map(vmf->pmd, vmf->address); > + order = calc_anon_folio_order_alloc(vmf, order); > + pte_unmap(vmf->pte); > } > > - /* Allocate our own private page. */ > + /* Allocate our own private folio. */ > if (unlikely(anon_vma_prepare(vma))) > goto oom; > - folio = vma_alloc_zeroed_movable_folio(vma, vmf->address); > + folio = alloc_anon_folio(vma, vmf->address, order); > + if (!folio && order > 0) { > + order = 0; > + folio = alloc_anon_folio(vma, vmf->address, order); > + } > if (!folio) > goto oom; > > + pgcount = 1 << order; > + addr = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT); > + > if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL)) > goto oom_free_page; > folio_throttle_swaprate(folio, GFP_KERNEL); > > /* > * The memory barrier inside __folio_mark_uptodate makes sure that > - * preceding stores to the page contents become visible before > - * the set_pte_at() write. > + * preceding stores to the folio contents become visible before > + * the set_ptes() write. 
*/ > __folio_mark_uptodate(folio); > > @@ -4107,11 +4247,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) > if (vma->vm_flags & VM_WRITE) > entry = pte_mkwrite(pte_mkdirty(entry)); > > - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, > - &vmf->ptl); > + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl); > if (vmf_pte_changed(vmf)) { > update_mmu_tlb(vma, vmf->address, vmf->pte); > goto release; > + } else if (order > 0 && check_ptes_none(vmf->pte, pgcount) != pgcount) { This could be the case where we allocated an order-4 page and then find a neighbouring PTE has been filled by a concurrent fault. Should we put the current folio, fall back to order 0 and try again immediately (goto the order-0 allocation instead of returning from this function, which would go through part of the page fault path again)? Regards Yin, Fengwei > + goto release; > } > > ret = check_stable_address_space(vma->vm_mm); > @@ -4125,16 +4266,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) > return handle_userfault(vmf, VM_UFFD_MISSING); > } > > - inc_mm_counter(vma->vm_mm, MM_ANONPAGES); > - folio_add_new_anon_rmap(folio, vma, vmf->address); > + folio_ref_add(folio, pgcount - 1); > + add_mm_counter(vma->vm_mm, MM_ANONPAGES, pgcount); > + folio_add_new_anon_rmap(folio, vma, addr); > folio_add_lru_vma(folio, vma); > -setpte: > + > if (uffd_wp) > entry = pte_mkuffd_wp(entry); > - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); > + set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount); > > /* No need to invalidate - it was non-present before */ > - update_mmu_cache(vma, vmf->address, vmf->pte); > + update_mmu_cache_range(vma, addr, vmf->pte, pgcount); > unlock: > pte_unmap_unlock(vmf->pte, vmf->ptl); > return ret; ^ permalink raw reply [flat|nested] 167+ messages in thread
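For readers following the review, the order-selection walk under discussion can be modelled in plain userspace C. This is a simplified sketch, not the kernel code: a pmd's worth of ptes is a bool array, the `first_set` caching optimisation from the real patch (which avoids rescanning ptes) is omitted for clarity, and all names and sizes are illustrative:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: 4K pages, 512 ptes per pmd (order 9). */
#define PMD_ORDER 9
#define NR_PTES   (1 << PMD_ORDER)

/* pte[base..base+nr) all none (false = pte_none)? */
static bool pte_run_is_none(const bool *pte, int base, int nr)
{
	for (int i = 0; i < nr; i++)
		if (pte[base + i])
			return false;
	return true;
}

/*
 * Walk down from the preferred order until the naturally aligned
 * block around 'index' fits inside [vma_start, vma_end) and covers
 * only none ptes. Order-1 is skipped, matching the patch's rule
 * that THPs must be at least order-2.
 */
static int pick_order(const bool *pte, int index,
		      int vma_start, int vma_end, int order)
{
	if (order > PMD_ORDER)
		order = PMD_ORDER;	/* stay within one pmd */

	for (; order > 1; order--) {
		int nr = 1 << order;
		int base = index & ~(nr - 1);	/* natural alignment */

		if (base < vma_start || base + nr > vma_end)
			continue;		/* would breach the vma */
		if (pte_run_is_none(pte, base, nr))
			return order;
	}
	return 0;				/* order-1 folds to 0 */
}
```

For example, a fault at pte index 5 with a populated pte at index 3 cannot use the order-4 block [0, 16) or the order-3 block [0, 8), but the order-2 block [4, 8) is free, so the walk settles on order 2 — this is exactly the intermediate-order behaviour the simple preferred-or-zero policy would give up on.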
* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance 2023-07-04 3:45 ` Yin, Fengwei @ 2023-07-04 14:20 ` Ryan Roberts -1 siblings, 0 replies; 167+ messages in thread From: Ryan Roberts @ 2023-07-04 14:20 UTC (permalink / raw) To: Yin, Fengwei, Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi Cc: linux-arm-kernel, linux-kernel, linux-mm On 04/07/2023 04:45, Yin, Fengwei wrote: > > On 7/3/2023 9:53 PM, Ryan Roberts wrote: >> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be > THP is for huge page which is 2M size. We are not huge page here. But > I don't have good name either. Is that really true? On arm64 with 16K pages, huge pages are 32M and with 64K base page, they are 512M. So huge pages already have a variable size. And they sometimes get PTE-mapped. So can't we just think of this as an extension of the THP feature? > >> allocated in large folios of a specified order. All pages of the large >> folio are pte-mapped during the same page fault, significantly reducing >> the number of page faults. The number of per-page operations (e.g. ref >> counting, rmap management lru list management) are also significantly >> reduced since those ops now become per-folio. >> >> The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which >> defaults to disabled for now; there is a long list of todos to make >> FLEXIBLE_THP robust with existing features (e.g. compaction, mlock, some >> madvise ops, etc). These items will be tackled in subsequent patches. >> >> When enabled, the preferred folio order is as returned by >> arch_wants_pte_order(), which may be overridden by the arch as it sees >> fit. Some architectures (e.g. arm64) can coalsece TLB entries if a >> contiguous set of ptes map physically contigious, naturally aligned >> memory, so this mechanism allows the architecture to optimize as >> required. 
>> >> If the preferred order can't be used (e.g. because the folio would >> breach the bounds of the vma, or because ptes in the region are already >> mapped) then we fall back to a suitable lower order. >> >> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> >> --- >> mm/Kconfig | 10 ++++ >> mm/memory.c | 168 ++++++++++++++++++++++++++++++++++++++++++++++++---- >> 2 files changed, 165 insertions(+), 13 deletions(-) >> >> diff --git a/mm/Kconfig b/mm/Kconfig >> index 7672a22647b4..1c06b2c0a24e 100644 >> --- a/mm/Kconfig >> +++ b/mm/Kconfig >> @@ -822,6 +822,16 @@ config READ_ONLY_THP_FOR_FS >> support of file THPs will be developed in the next few release >> cycles. >> >> +config FLEXIBLE_THP >> + bool "Flexible order THP" >> + depends on TRANSPARENT_HUGEPAGE >> + default n >> + help >> + Use large (bigger than order-0) folios to back anonymous memory where >> + possible, even if the order of the folio is smaller than the PMD >> + order. This reduces the number of page faults, as well as other >> + per-page overheads to improve performance for many workloads. >> + >> endif # TRANSPARENT_HUGEPAGE >> >> # >> diff --git a/mm/memory.c b/mm/memory.c >> index fb30f7523550..abe2ea94f3f5 100644 >> --- a/mm/memory.c >> +++ b/mm/memory.c >> @@ -3001,6 +3001,116 @@ static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf) >> return 0; >> } >> >> +#ifdef CONFIG_FLEXIBLE_THP >> +/* >> + * Allocates, zeros and returns a folio of the requested order for use as >> + * anonymous memory. 
>> + */ >> +static struct folio *alloc_anon_folio(struct vm_area_struct *vma, >> + unsigned long addr, int order) >> +{ >> + gfp_t gfp; >> + struct folio *folio; >> + >> + if (order == 0) >> + return vma_alloc_zeroed_movable_folio(vma, addr); >> + >> + gfp = vma_thp_gfp_mask(vma); >> + folio = vma_alloc_folio(gfp, order, vma, addr, true); >> + if (folio) >> + clear_huge_page(&folio->page, addr, folio_nr_pages(folio)); >> + >> + return folio; >> +} >> + >> +/* >> + * Preferred folio order to allocate for anonymous memory. >> + */ >> +#define max_anon_folio_order(vma) arch_wants_pte_order(vma) >> +#else >> +#define alloc_anon_folio(vma, addr, order) \ >> + vma_alloc_zeroed_movable_folio(vma, addr) >> +#define max_anon_folio_order(vma) 0 >> +#endif >> + >> +/* >> + * Returns index of first pte that is not none, or nr if all are none. >> + */ >> +static inline int check_ptes_none(pte_t *pte, int nr) >> +{ >> + int i; >> + >> + for (i = 0; i < nr; i++) { >> + if (!pte_none(ptep_get(pte++))) >> + return i; >> + } >> + >> + return nr; >> +} >> + >> +static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order) >> +{ >> + /* >> + * The aim here is to determine what size of folio we should allocate >> + * for this fault. Factors include: >> + * - Order must not be higher than `order` upon entry >> + * - Folio must be naturally aligned within VA space >> + * - Folio must be fully contained inside one pmd entry >> + * - Folio must not breach boundaries of vma >> + * - Folio must not overlap any non-none ptes >> + * >> + * Additionally, we do not allow order-1 since this breaks assumptions >> + * elsewhere in the mm; THP pages must be at least order-2 (since they >> + * store state up to the 3rd struct page subpage), and these pages must >> + * be THP in order to correctly use pre-existing THP infrastructure such >> + * as folio_split(). >> + * >> + * Note that the caller may or may not choose to lock the pte. 
If >> + * unlocked, the result is racy and the user must re-check any overlap >> + * with non-none ptes under the lock. >> + */ >> + >> + struct vm_area_struct *vma = vmf->vma; >> + int nr; >> + unsigned long addr; >> + pte_t *pte; >> + pte_t *first_set = NULL; >> + int ret; >> + >> + order = min(order, PMD_SHIFT - PAGE_SHIFT); >> + >> + for (; order > 1; order--) { >> + nr = 1 << order; >> + addr = ALIGN_DOWN(vmf->address, nr << PAGE_SHIFT); >> + pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT); >> + >> + /* Check vma bounds. */ >> + if (addr < vma->vm_start || >> + addr + (nr << PAGE_SHIFT) > vma->vm_end) >> + continue; >> + >> + /* Ptes covered by order already known to be none. */ >> + if (pte + nr <= first_set) >> + break; >> + >> + /* Already found set pte in range covered by order. */ >> + if (pte <= first_set) >> + continue; >> + >> + /* Need to check if all the ptes are none. */ >> + ret = check_ptes_none(pte, nr); >> + if (ret == nr) >> + break; >> + >> + first_set = pte + ret; >> + } >> + >> + if (order == 1) >> + order = 0; >> + >> + return order; >> +} > The logic in above function should be kept is whether the order fit in vma range. > > check_ptes_none() is not accurate here because no page table lock hold and concurrent > fault could happen. So may just drop the check here? Check_ptes_none() is done after > take the page table lock. I agree it is just an estimate given the lock is not held; the comment at the top says the same. But I don't think we can wait until after the lock is taken to measure this. We can't hold the lock while allocating the folio and we need a guess at what to allocate. If we don't guess here, we will allocate the biggest, then take the lock, see that it doesn't fit, and exit. Then the system will re-fault and we will follow the exact same path - ending up in live lock. > > We pick the arch prefered order or order 0 now. 
> >> + >> /* >> * Handle write page faults for pages that can be reused in the current vma >> * >> @@ -3073,7 +3183,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) >> goto oom; >> >> if (is_zero_pfn(pte_pfn(vmf->orig_pte))) { >> - new_folio = vma_alloc_zeroed_movable_folio(vma, vmf->address); >> + new_folio = alloc_anon_folio(vma, vmf->address, 0); >> if (!new_folio) >> goto oom; >> } else { >> @@ -4040,6 +4150,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) >> struct folio *folio; >> vm_fault_t ret = 0; >> pte_t entry; >> + int order; >> + int pgcount; >> + unsigned long addr; >> >> /* File mapping without ->vm_ops ? */ >> if (vma->vm_flags & VM_SHARED) >> @@ -4081,24 +4194,51 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) >> pte_unmap_unlock(vmf->pte, vmf->ptl); >> return handle_userfault(vmf, VM_UFFD_MISSING); >> } >> - goto setpte; >> + if (uffd_wp) >> + entry = pte_mkuffd_wp(entry); >> + set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); >> + >> + /* No need to invalidate - it was non-present before */ >> + update_mmu_cache(vma, vmf->address, vmf->pte); >> + goto unlock; >> + } >> + >> + /* >> + * If allocating a large folio, determine the biggest suitable order for >> + * the VMA (e.g. it must not exceed the VMA's bounds, it must not >> + * overlap with any populated PTEs, etc). We are not under the ptl here >> + * so we will need to re-check that we are not overlapping any populated >> + * PTEs once we have the lock. >> + */ >> + order = uffd_wp ? 0 : max_anon_folio_order(vma); >> + if (order > 0) { >> + vmf->pte = pte_offset_map(vmf->pmd, vmf->address); >> + order = calc_anon_folio_order_alloc(vmf, order); >> + pte_unmap(vmf->pte); >> } >> >> - /* Allocate our own private page. */ >> + /* Allocate our own private folio. 
*/ >> if (unlikely(anon_vma_prepare(vma))) >> goto oom; >> - folio = vma_alloc_zeroed_movable_folio(vma, vmf->address); >> + folio = alloc_anon_folio(vma, vmf->address, order); >> + if (!folio && order > 0) { >> + order = 0; >> + folio = alloc_anon_folio(vma, vmf->address, order); >> + } >> if (!folio) >> goto oom; >> >> + pgcount = 1 << order; >> + addr = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT); >> + >> if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL)) >> goto oom_free_page; >> folio_throttle_swaprate(folio, GFP_KERNEL); >> >> /* >> * The memory barrier inside __folio_mark_uptodate makes sure that >> - * preceding stores to the page contents become visible before >> - * the set_pte_at() write. >> + * preceding stores to the folio contents become visible before >> + * the set_ptes() write. >> */ >> __folio_mark_uptodate(folio); >> >> @@ -4107,11 +4247,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) >> if (vma->vm_flags & VM_WRITE) >> entry = pte_mkwrite(pte_mkdirty(entry)); >> >> - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, >> - &vmf->ptl); >> + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl); >> if (vmf_pte_changed(vmf)) { >> update_mmu_tlb(vma, vmf->address, vmf->pte); >> goto release; >> + } else if (order > 0 && check_ptes_none(vmf->pte, pgcount) != pgcount) { > This could be the case where we allocated an order-4 page and find a neighbor PTE is > filled by a concurrent fault. Should we put the current folio and fall back to order 0 > and try again immediately (goto the order-0 allocation instead of returning from this > function, which will go through the page fault path again)? That's how it worked in v1, but I had review comments from Yang Shi asking me to re-fault instead. This approach is certainly cleaner from a code point of view. And I expect races of that nature will be rare. 
> > > Regards > Yin, Fengwei > >> + goto release; >> } >> >> ret = check_stable_address_space(vma->vm_mm); >> @@ -4125,16 +4266,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) >> return handle_userfault(vmf, VM_UFFD_MISSING); >> } >> >> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES); >> - folio_add_new_anon_rmap(folio, vma, vmf->address); >> + folio_ref_add(folio, pgcount - 1); >> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, pgcount); >> + folio_add_new_anon_rmap(folio, vma, addr); >> folio_add_lru_vma(folio, vma); >> -setpte: >> + >> if (uffd_wp) >> entry = pte_mkuffd_wp(entry); >> - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); >> + set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount); >> >> /* No need to invalidate - it was non-present before */ >> - update_mmu_cache(vma, vmf->address, vmf->pte); >> + update_mmu_cache_range(vma, addr, vmf->pte, pgcount); >> unlock: >> pte_unmap_unlock(vmf->pte, vmf->ptl); >> return ret; ^ permalink raw reply [flat|nested] 167+ messages in thread
* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance 2023-07-04 14:20 ` Ryan Roberts (?) @ 2023-07-04 23:35 ` Yin Fengwei -1 siblings, 0 replies; 167+ messages in thread From: Yin Fengwei @ 2023-07-04 23:35 UTC (permalink / raw) To: Ryan Roberts, Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi Cc: linux-arm-kernel, linux-kernel, linux-mm On 7/4/23 22:20, Ryan Roberts wrote: > On 04/07/2023 04:45, Yin, Fengwei wrote: >> >> On 7/3/2023 9:53 PM, Ryan Roberts wrote: >>> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be >> THP is for huge page which is 2M size. We are not huge page here. But >> I don't have good name either. > > Is that really true? On arm64 with 16K pages, huge pages are 32M and with 64K > base page, they are 512M. So huge pages already have a variable size. And they > sometimes get PTE-mapped. So can't we just think of this as an extension of the > THP feature? My understanding is that THP has several fixed sizes on different arches. The 32K or 16K sizes which could be picked here are not THP sizes. > >> >>> allocated in large folios of a specified order. All pages of the large >>> folio are pte-mapped during the same page fault, significantly reducing >>> the number of page faults. The number of per-page operations (e.g. ref >>> counting, rmap management lru list management) are also significantly >>> reduced since those ops now become per-folio. >>> >>> The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which >>> defaults to disabled for now; there is a long list of todos to make >>> FLEXIBLE_THP robust with existing features (e.g. compaction, mlock, some >>> madvise ops, etc). These items will be tackled in subsequent patches. >>> >>> When enabled, the preferred folio order is as returned by >>> arch_wants_pte_order(), which may be overridden by the arch as it sees >>> fit. Some architectures (e.g. 
arm64) can coalsece TLB entries if a >>> contiguous set of ptes map physically contigious, naturally aligned >>> memory, so this mechanism allows the architecture to optimize as >>> required. >>> >>> If the preferred order can't be used (e.g. because the folio would >>> breach the bounds of the vma, or because ptes in the region are already >>> mapped) then we fall back to a suitable lower order. >>> >>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> >>> --- >>> mm/Kconfig | 10 ++++ >>> mm/memory.c | 168 ++++++++++++++++++++++++++++++++++++++++++++++++---- >>> 2 files changed, 165 insertions(+), 13 deletions(-) >>> >>> diff --git a/mm/Kconfig b/mm/Kconfig >>> index 7672a22647b4..1c06b2c0a24e 100644 >>> --- a/mm/Kconfig >>> +++ b/mm/Kconfig >>> @@ -822,6 +822,16 @@ config READ_ONLY_THP_FOR_FS >>> support of file THPs will be developed in the next few release >>> cycles. >>> >>> +config FLEXIBLE_THP >>> + bool "Flexible order THP" >>> + depends on TRANSPARENT_HUGEPAGE >>> + default n >>> + help >>> + Use large (bigger than order-0) folios to back anonymous memory where >>> + possible, even if the order of the folio is smaller than the PMD >>> + order. This reduces the number of page faults, as well as other >>> + per-page overheads to improve performance for many workloads. >>> + >>> endif # TRANSPARENT_HUGEPAGE >>> >>> # >>> diff --git a/mm/memory.c b/mm/memory.c >>> index fb30f7523550..abe2ea94f3f5 100644 >>> --- a/mm/memory.c >>> +++ b/mm/memory.c >>> @@ -3001,6 +3001,116 @@ static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf) >>> return 0; >>> } >>> >>> +#ifdef CONFIG_FLEXIBLE_THP >>> +/* >>> + * Allocates, zeros and returns a folio of the requested order for use as >>> + * anonymous memory. 
>>> + */ >>> +static struct folio *alloc_anon_folio(struct vm_area_struct *vma, >>> + unsigned long addr, int order) >>> +{ >>> + gfp_t gfp; >>> + struct folio *folio; >>> + >>> + if (order == 0) >>> + return vma_alloc_zeroed_movable_folio(vma, addr); >>> + >>> + gfp = vma_thp_gfp_mask(vma); >>> + folio = vma_alloc_folio(gfp, order, vma, addr, true); >>> + if (folio) >>> + clear_huge_page(&folio->page, addr, folio_nr_pages(folio)); >>> + >>> + return folio; >>> +} >>> + >>> +/* >>> + * Preferred folio order to allocate for anonymous memory. >>> + */ >>> +#define max_anon_folio_order(vma) arch_wants_pte_order(vma) >>> +#else >>> +#define alloc_anon_folio(vma, addr, order) \ >>> + vma_alloc_zeroed_movable_folio(vma, addr) >>> +#define max_anon_folio_order(vma) 0 >>> +#endif >>> + >>> +/* >>> + * Returns index of first pte that is not none, or nr if all are none. >>> + */ >>> +static inline int check_ptes_none(pte_t *pte, int nr) >>> +{ >>> + int i; >>> + >>> + for (i = 0; i < nr; i++) { >>> + if (!pte_none(ptep_get(pte++))) >>> + return i; >>> + } >>> + >>> + return nr; >>> +} >>> + >>> +static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order) >>> +{ >>> + /* >>> + * The aim here is to determine what size of folio we should allocate >>> + * for this fault. Factors include: >>> + * - Order must not be higher than `order` upon entry >>> + * - Folio must be naturally aligned within VA space >>> + * - Folio must be fully contained inside one pmd entry >>> + * - Folio must not breach boundaries of vma >>> + * - Folio must not overlap any non-none ptes >>> + * >>> + * Additionally, we do not allow order-1 since this breaks assumptions >>> + * elsewhere in the mm; THP pages must be at least order-2 (since they >>> + * store state up to the 3rd struct page subpage), and these pages must >>> + * be THP in order to correctly use pre-existing THP infrastructure such >>> + * as folio_split(). 
>>> + * >>> + * Note that the caller may or may not choose to lock the pte. If >>> + * unlocked, the result is racy and the user must re-check any overlap >>> + * with non-none ptes under the lock. >>> + */ >>> + >>> + struct vm_area_struct *vma = vmf->vma; >>> + int nr; >>> + unsigned long addr; >>> + pte_t *pte; >>> + pte_t *first_set = NULL; >>> + int ret; >>> + >>> + order = min(order, PMD_SHIFT - PAGE_SHIFT); >>> + >>> + for (; order > 1; order--) { >>> + nr = 1 << order; >>> + addr = ALIGN_DOWN(vmf->address, nr << PAGE_SHIFT); >>> + pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT); >>> + >>> + /* Check vma bounds. */ >>> + if (addr < vma->vm_start || >>> + addr + (nr << PAGE_SHIFT) > vma->vm_end) >>> + continue; >>> + >>> + /* Ptes covered by order already known to be none. */ >>> + if (pte + nr <= first_set) >>> + break; >>> + >>> + /* Already found set pte in range covered by order. */ >>> + if (pte <= first_set) >>> + continue; >>> + >>> + /* Need to check if all the ptes are none. */ >>> + ret = check_ptes_none(pte, nr); >>> + if (ret == nr) >>> + break; >>> + >>> + first_set = pte + ret; >>> + } >>> + >>> + if (order == 1) >>> + order = 0; >>> + >>> + return order; >>> +} >> The logic in above function should be kept is whether the order fit in vma range. >> >> check_ptes_none() is not accurate here because no page table lock hold and concurrent >> fault could happen. So may just drop the check here? Check_ptes_none() is done after >> take the page table lock. > > I agree it is just an estimate given the lock is not held; the comment at the > top says the same. But I don't think we can wait until after the lock is taken > to measure this. We can't hold the lock while allocating the folio and we need a > guess at what to allocate. If we don't guess here, we will allocate the biggest, > then take the lock, see that it doesn't fit, and exit. Then the system will > re-fault and we will follow the exact same path - ending up in live lock. 
It will not if we try order0 immediately. But see my comments to the refault. > >> >> We pick the arch prefered order or order 0 now. >> >>> + >>> /* >>> * Handle write page faults for pages that can be reused in the current vma >>> * >>> @@ -3073,7 +3183,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) >>> goto oom; >>> >>> if (is_zero_pfn(pte_pfn(vmf->orig_pte))) { >>> - new_folio = vma_alloc_zeroed_movable_folio(vma, vmf->address); >>> + new_folio = alloc_anon_folio(vma, vmf->address, 0); >>> if (!new_folio) >>> goto oom; >>> } else { >>> @@ -4040,6 +4150,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) >>> struct folio *folio; >>> vm_fault_t ret = 0; >>> pte_t entry; >>> + int order; >>> + int pgcount; >>> + unsigned long addr; >>> >>> /* File mapping without ->vm_ops ? */ >>> if (vma->vm_flags & VM_SHARED) >>> @@ -4081,24 +4194,51 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) >>> pte_unmap_unlock(vmf->pte, vmf->ptl); >>> return handle_userfault(vmf, VM_UFFD_MISSING); >>> } >>> - goto setpte; >>> + if (uffd_wp) >>> + entry = pte_mkuffd_wp(entry); >>> + set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); >>> + >>> + /* No need to invalidate - it was non-present before */ >>> + update_mmu_cache(vma, vmf->address, vmf->pte); >>> + goto unlock; >>> + } >>> + >>> + /* >>> + * If allocating a large folio, determine the biggest suitable order for >>> + * the VMA (e.g. it must not exceed the VMA's bounds, it must not >>> + * overlap with any populated PTEs, etc). We are not under the ptl here >>> + * so we will need to re-check that we are not overlapping any populated >>> + * PTEs once we have the lock. >>> + */ >>> + order = uffd_wp ? 0 : max_anon_folio_order(vma); >>> + if (order > 0) { >>> + vmf->pte = pte_offset_map(vmf->pmd, vmf->address); >>> + order = calc_anon_folio_order_alloc(vmf, order); >>> + pte_unmap(vmf->pte); >>> } >>> >>> - /* Allocate our own private page. */ >>> + /* Allocate our own private folio. 
*/ >>> if (unlikely(anon_vma_prepare(vma))) >>> goto oom; >>> - folio = vma_alloc_zeroed_movable_folio(vma, vmf->address); >>> + folio = alloc_anon_folio(vma, vmf->address, order); >>> + if (!folio && order > 0) { >>> + order = 0; >>> + folio = alloc_anon_folio(vma, vmf->address, order); >>> + } >>> if (!folio) >>> goto oom; >>> >>> + pgcount = 1 << order; >>> + addr = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT); >>> + >>> if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL)) >>> goto oom_free_page; >>> folio_throttle_swaprate(folio, GFP_KERNEL); >>> >>> /* >>> * The memory barrier inside __folio_mark_uptodate makes sure that >>> - * preceding stores to the page contents become visible before >>> - * the set_pte_at() write. >>> + * preceding stores to the folio contents become visible before >>> + * the set_ptes() write. >>> */ >>> __folio_mark_uptodate(folio); >>> >>> @@ -4107,11 +4247,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) >>> if (vma->vm_flags & VM_WRITE) >>> entry = pte_mkwrite(pte_mkdirty(entry)); >>> >>> - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, >>> - &vmf->ptl); >>> + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl); >>> if (vmf_pte_changed(vmf)) { >>> update_mmu_tlb(vma, vmf->address, vmf->pte); >>> goto release; >>> + } else if (order > 0 && check_ptes_none(vmf->pte, pgcount) != pgcount) { >> This could be the case that we allocated order 4 page and find a neighbor PTE is >> filled by concurrent fault. Should we put current folio and fallback to order 0 >> and try again immedately (goto order 0 allocation instead of return from this >> function which will go through some page fault path again)? > > That's how it worked in v1, but I had review comments from Yang Shi asking me to > re-fault instead. This approach is certainly cleaner from a code point of view. > And I expect races of that nature will be rare. I must miss that discussion in v1. My bad. 
I should jump in that discussion. So I will drop my comment here even I still think we should avoid refault. I don't want the comment back and forth. Regards Yin, Fengwei > >> >> >> Regards >> Yin, Fengwei >> >>> + goto release; >>> } >>> >>> ret = check_stable_address_space(vma->vm_mm); >>> @@ -4125,16 +4266,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) >>> return handle_userfault(vmf, VM_UFFD_MISSING); >>> } >>> >>> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES); >>> - folio_add_new_anon_rmap(folio, vma, vmf->address); >>> + folio_ref_add(folio, pgcount - 1); >>> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, pgcount); >>> + folio_add_new_anon_rmap(folio, vma, addr); >>> folio_add_lru_vma(folio, vma); >>> -setpte: >>> + >>> if (uffd_wp) >>> entry = pte_mkuffd_wp(entry); >>> - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); >>> + set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount); >>> >>> /* No need to invalidate - it was non-present before */ >>> - update_mmu_cache(vma, vmf->address, vmf->pte); >>> + update_mmu_cache_range(vma, addr, vmf->pte, pgcount); >>> unlock: >>> pte_unmap_unlock(vmf->pte, vmf->ptl); >>> return ret; > ^ permalink raw reply [flat|nested] 167+ messages in thread
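The retry-vs-refault debate above concerns the order-selection step quoted in the diff. As a rough userspace model of that rule (a folio must fit inside the VMA and cover only non-present PTEs, shrinking toward order 0 otherwise), the sketch below uses a plain int array as a hypothetical stand-in for the page table; the real calc_anon_folio_order_alloc() operates on pte_t entries and re-checks them under the ptl:

```c
#include <stdbool.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Model only: one int per page in the PMD range; nonzero = PTE populated. */
static bool check_ptes_none(const int *pte, int nr)
{
	for (int i = 0; i < nr; i++)
		if (pte[i])
			return false;
	return true;
}

/*
 * Hypothetical stand-in for calc_anon_folio_order_alloc(): shrink the
 * preferred order until an aligned folio lies inside [vma_start, vma_end)
 * and overlaps no populated PTEs; fall back to order 0 otherwise.
 */
static int calc_order(unsigned long addr, unsigned long vma_start,
		      unsigned long vma_end, const int *pmd_ptes,
		      unsigned long pmd_base, int order)
{
	for (; order > 0; order--) {
		unsigned long size  = PAGE_SIZE << order;
		unsigned long start = addr & ~(size - 1); /* ALIGN_DOWN */

		if (start < vma_start || start + size > vma_end)
			continue;
		if (check_ptes_none(&pmd_ptes[(start - pmd_base) >> PAGE_SHIFT],
				    1 << order))
			break;
	}
	return order;
}
```

In this model, a populated PTE immediately above the fault address forces the result all the way down to order 0, which is exactly the racing-neighbour case where the series chooses to re-fault rather than retry inline.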
* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance 2023-07-04 14:20 ` Ryan Roberts @ 2023-07-04 23:57 ` Matthew Wilcox -1 siblings, 0 replies; 167+ messages in thread From: Matthew Wilcox @ 2023-07-04 23:57 UTC (permalink / raw) To: Ryan Roberts Cc: Yin, Fengwei, Andrew Morton, Kirill A. Shutemov, David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm On Tue, Jul 04, 2023 at 03:20:35PM +0100, Ryan Roberts wrote: > On 04/07/2023 04:45, Yin, Fengwei wrote: > > > > On 7/3/2023 9:53 PM, Ryan Roberts wrote: > >> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be > > THP is for huge page which is 2M size. We are not huge page here. But > > I don't have good name either. > > Is that really true? On arm64 with 16K pages, huge pages are 32M and with 64K > base page, they are 512M. So huge pages already have a variable size. And they > sometimes get PTE-mapped. So can't we just think of this as an extension of the > THP feature? The confusing thing is that we have counters for the number of THP allocated (and number of THP mapped), and for those we always use PMD-size folios. If we must have a config option, then this is ANON_LARGE_FOLIOS. But why do we need a config option? We don't have one for the page cache, and we're better off for it. Yes, it depends on CONFIG_TRANSPARENT_HUGEPAGE today, but that's more of an accidental heritage, and it'd be great to do away with that dependency eventually. Hardware support isn't needed. Large folios benefit us from a software point of view. if we need a chicken bit, we can edit the source code to not create anon folios larger than order 0. ^ permalink raw reply [flat|nested] 167+ messages in thread
* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance 2023-07-04 23:57 ` Matthew Wilcox @ 2023-07-05 9:54 ` Ryan Roberts -1 siblings, 0 replies; 167+ messages in thread From: Ryan Roberts @ 2023-07-05 9:54 UTC (permalink / raw) To: Matthew Wilcox Cc: Yin, Fengwei, Andrew Morton, Kirill A. Shutemov, David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm On 05/07/2023 00:57, Matthew Wilcox wrote: > On Tue, Jul 04, 2023 at 03:20:35PM +0100, Ryan Roberts wrote: >> On 04/07/2023 04:45, Yin, Fengwei wrote: >>> >>> On 7/3/2023 9:53 PM, Ryan Roberts wrote: >>>> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be >>> THP is for huge page which is 2M size. We are not huge page here. But >>> I don't have good name either. >> >> Is that really true? On arm64 with 16K pages, huge pages are 32M and with 64K >> base page, they are 512M. So huge pages already have a variable size. And they >> sometimes get PTE-mapped. So can't we just think of this as an extension of the >> THP feature? > > The confusing thing is that we have counters for the number of THP > allocated (and number of THP mapped), and for those we always use > PMD-size folios. OK fair point. I really don't have a strong opinion on the name - I changed it from LARGE_ANON_FOLIO because Yu was suggesting it should be tied to THP. So I'm happy to change it back to LARGE_ANON_FOLIO (or something else) if that's the concensus. But I expect I'll end up in a game of ping-pong. So I'm going to keep this name for now and focus on converging the actual implementation to something that is agreeable. Once we are there, we can argue about the name. > > If we must have a config option, then this is ANON_LARGE_FOLIOS. > > But why do we need a config option? We don't have one for the > page cache, and we're better off for it. 
Yes, it depends on > CONFIG_TRANSPARENT_HUGEPAGE today, but that's more of an accidental > heritage, and it'd be great to do away with that dependency eventually. > > Hardware support isn't needed. Large folios benefit us from a software > point of view. if we need a chicken bit, we can edit the source code > to not create anon folios larger than order 0. From my PoV it's about managing risk; there are currently parts of the mm that will interact poorly with large pte-mapped folios (madvise, compaction, ...). We want to incrementally fix that stuff, but until it's all fixed, we can't deploy this as always-on. Further down the line when things are more complete and there is more test coverage, we could remove the Kconfig or default it to enabled. ^ permalink raw reply [flat|nested] 167+ messages in thread
* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance 2023-07-05 9:54 ` Ryan Roberts @ 2023-07-05 12:08 ` Matthew Wilcox -1 siblings, 0 replies; 167+ messages in thread From: Matthew Wilcox @ 2023-07-05 12:08 UTC (permalink / raw) To: Ryan Roberts Cc: Yin, Fengwei, Andrew Morton, Kirill A. Shutemov, David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm On Wed, Jul 05, 2023 at 10:54:30AM +0100, Ryan Roberts wrote: > On 05/07/2023 00:57, Matthew Wilcox wrote: > > The confusing thing is that we have counters for the number of THP > > allocated (and number of THP mapped), and for those we always use > > PMD-size folios. > > OK fair point. I really don't have a strong opinion on the name - I changed it > from LARGE_ANON_FOLIO because Yu was suggesting it should be tied to THP. So I'm > happy to change it back to LARGE_ANON_FOLIO (or something else) if that's the > concensus. But I expect I'll end up in a game of ping-pong. So I'm going to keep > this name for now and focus on converging the actual implementation to something > that is agreeable. Once we are there, we can argue about the name. I didn't see Yu arguing for changing the name of the config options, just having far fewer of them. > > If we must have a config option, then this is ANON_LARGE_FOLIOS. > > > > But why do we need a config option? We don't have one for the > > page cache, and we're better off for it. Yes, it depends on > > CONFIG_TRANSPARENT_HUGEPAGE today, but that's more of an accidental > > heritage, and it'd be great to do away with that dependency eventually. > > > > Hardware support isn't needed. Large folios benefit us from a software > > point of view. if we need a chicken bit, we can edit the source code > > to not create anon folios larger than order 0. 
> > >From my PoV it's about managing risk; there are currently parts of the mm that > will interact poorly with large pte-mapped folios (madvise, compaction, ...). We > want to incrementally fix that stuff, but until it's all fixed, we can't deploy > this as always-on. Further down the line when things are more complete and there > is more test coverage, we could remove the Kconfig or default it to enabled. We have to fix those places with the bad interactions, not merge a Kconfig option that lets you turn it on to experiment. That's how you get a bad reputation and advice to disable a config option. We had that for years with CONFIG_TRANSPARENT_HUGEPAGE; people tried it out early on, found the performance problems, and all these years later we still have articles being published that say to turn it off. By all means, we can have a golden patchset that we all agree is the one to use for finding problems, and we can merge the pre-enabling work "We don't have large anonymous folios yet, but when we do, this will need to iterate over each page in the folio". ^ permalink raw reply [flat|nested] 167+ messages in thread
* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance 2023-07-03 13:53 ` Ryan Roberts @ 2023-07-07 8:01 ` Huang, Ying -1 siblings, 0 replies; 167+ messages in thread From: Huang, Ying @ 2023-07-07 8:01 UTC (permalink / raw) To: Ryan Roberts Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei, David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm Ryan Roberts <ryan.roberts@arm.com> writes: > Introduce FLEXIBLE_THP feature, which allows anonymous memory to be > allocated in large folios of a specified order. All pages of the large > folio are pte-mapped during the same page fault, significantly reducing > the number of page faults. The number of per-page operations (e.g. ref > counting, rmap management, lru list management) are also significantly > reduced since those ops now become per-folio. I like the idea to share as much code as possible between large (anonymous) folio and THP. Finally, THP becomes just a special kind of large folio. Although we can use smaller page order for FLEXIBLE_THP, it's hard to avoid internal fragmentation completely. So, I think that finally we will need to provide a mechanism for the users to opt out, e.g., something like "always madvise never" via /sys/kernel/mm/transparent_hugepage/enabled. I'm not sure whether it's a good idea to reuse the existing interface of THP. Best Regards, Huang, Ying ^ permalink raw reply [flat|nested] 167+ messages in thread
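The internal-fragmentation worry can be made concrete: if a single touched byte ends up backed by an order-N folio, up to (2^N - 1) pages are allocated without being needed. A minimal helper, assuming 4K base pages:

```c
#define PAGE_SIZE 4096UL

/* Worst-case bytes over-allocated when a 1-byte touch gets an order-N folio. */
static unsigned long max_waste(int order)
{
	return ((1UL << order) - 1) * PAGE_SIZE;
}
```

For the 64K (order-4) folios this series defaults to, that is up to 60KiB per fault; for an order-9 PMD-size THP it is just under 2MiB, which is why an opt-out analogous to the existing THP knob is being argued for here.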
* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance 2023-07-07 8:01 ` Huang, Ying @ 2023-07-07 9:52 ` Ryan Roberts -1 siblings, 0 replies; 167+ messages in thread From: Ryan Roberts @ 2023-07-07 9:52 UTC (permalink / raw) To: Huang, Ying Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei, David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm On 07/07/2023 09:01, Huang, Ying wrote: > Ryan Roberts <ryan.roberts@arm.com> writes: > >> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be >> allocated in large folios of a specified order. All pages of the large >> folio are pte-mapped during the same page fault, significantly reducing >> the number of page faults. The number of per-page operations (e.g. ref >> counting, rmap management lru list management) are also significantly >> reduced since those ops now become per-folio. > > I likes the idea to share as much code as possible between large > (anonymous) folio and THP. Finally, THP becomes just a special kind of > large folio. > > Although we can use smaller page order for FLEXIBLE_THP, it's hard to > avoid internal fragmentation completely. So, I think that finally we > will need to provide a mechanism for the users to opt out, e.g., > something like "always madvise never" via > /sys/kernel/mm/transparent_hugepage/enabled. I'm not sure whether it's > a good idea to reuse the existing interface of THP. I wouldn't want to tie this to the existing interface, simply because that implies that we would want to follow the "always" and "madvise" advice too; That means that on a thp=madvise system (which is certainly the case for android and other client systems) we would have to disable large anon folios for VMAs that haven't explicitly opted in. That breaks the intention that this should be an invisible performance boost. 
I think it's important to set the policy for use of THP separately from use of large anon folios. I could be persuaded on the merits of a new runtime enable/disable interface if there is consensus. > > Best Regards, > Huang, Ying ^ permalink raw reply [flat|nested] 167+ messages in thread
* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance 2023-07-07 9:52 ` Ryan Roberts @ 2023-07-07 11:29 ` David Hildenbrand -1 siblings, 0 replies; 167+ messages in thread From: David Hildenbrand @ 2023-07-07 11:29 UTC (permalink / raw) To: Ryan Roberts, Huang, Ying Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei, Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm On 07.07.23 11:52, Ryan Roberts wrote: > On 07/07/2023 09:01, Huang, Ying wrote: >> Ryan Roberts <ryan.roberts@arm.com> writes: >> >>> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be >>> allocated in large folios of a specified order. All pages of the large >>> folio are pte-mapped during the same page fault, significantly reducing >>> the number of page faults. The number of per-page operations (e.g. ref >>> counting, rmap management lru list management) are also significantly >>> reduced since those ops now become per-folio. >> >> I likes the idea to share as much code as possible between large >> (anonymous) folio and THP. Finally, THP becomes just a special kind of >> large folio. >> >> Although we can use smaller page order for FLEXIBLE_THP, it's hard to >> avoid internal fragmentation completely. So, I think that finally we >> will need to provide a mechanism for the users to opt out, e.g., >> something like "always madvise never" via >> /sys/kernel/mm/transparent_hugepage/enabled. I'm not sure whether it's >> a good idea to reuse the existing interface of THP. > > I wouldn't want to tie this to the existing interface, simply because that > implies that we would want to follow the "always" and "madvise" advice too; That > means that on a thp=madvise system (which is certainly the case for android and > other client systems) we would have to disable large anon folios for VMAs that > haven't explicitly opted in. That breaks the intention that this should be an > invisible performance boost. 
I think it's important to set the policy for use of It will never ever be a completely invisible performance boost, just like ordinary THP. Using the exact same existing toggle is the right thing to do. If someone specify "never" or "madvise", then do exactly that. It might make sense to have more modes or additional toggles, but "madvise=never" means no memory waste. I remember I raised it already in the past, but you *absolutely* have to respect the MADV_NOHUGEPAGE flag. There is user space out there (for example, userfaultfd) that doesn't want the kernel to populate any additional page tables. So if you have to respect that already, then also respect MADV_HUGEPAGE, simple. > THP separately to use of large anon folios. > > I could be persuaded on the merrits of a new runtime enable/disable interface if > there is concensus. There would have to be very good reason for a completely separate control. Bypassing MADV_NOHUGEPAGE or "madvise=never" simply because we add a "flexible" before the THP sounds broken. -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 167+ messages in thread
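The gating David describes — honour MADV_NOHUGEPAGE (and userfaultfd registration) unconditionally, and MADV_HUGEPAGE under thp=madvise — reduces to a small policy function. The flag bits and names below are illustrative stand-ins, not the kernel's actual VM_* definitions:

```c
/* Illustrative flag bits; the kernel's real VM_* values differ. */
#define VM_HUGEPAGE   (1u << 0)  /* MADV_HUGEPAGE was applied */
#define VM_NOHUGEPAGE (1u << 1)  /* MADV_NOHUGEPAGE was applied */
#define VM_UFFD       (1u << 2)  /* userfaultfd registered on the VMA */

enum thp_mode { THP_NEVER, THP_MADVISE, THP_ALWAYS };

/*
 * Hypothetical policy check: the folio order to attempt for an anonymous
 * fault, honouring the same knobs ordinary THP honours.
 */
static int anon_folio_order(unsigned vm_flags, enum thp_mode mode,
			    int preferred_order)
{
	if (vm_flags & (VM_NOHUGEPAGE | VM_UFFD))
		return 0;		/* must not populate extra PTEs */
	if (mode == THP_NEVER)
		return 0;
	if (mode == THP_MADVISE && !(vm_flags & VM_HUGEPAGE))
		return 0;
	return preferred_order;
}
```

Under this model a thp=madvise system only gets large anon folios in VMAs that opted in, which is precisely the behaviour Ryan argues breaks the "invisible performance boost" goal and David argues is required for correctness with userfaultfd.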
* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance 2023-07-07 11:29 ` David Hildenbrand @ 2023-07-07 13:57 ` Matthew Wilcox -1 siblings, 0 replies; 167+ messages in thread From: Matthew Wilcox @ 2023-07-07 13:57 UTC (permalink / raw) To: David Hildenbrand Cc: Ryan Roberts, Huang, Ying, Andrew Morton, Kirill A. Shutemov, Yin Fengwei, Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote: > On 07.07.23 11:52, Ryan Roberts wrote: > > On 07/07/2023 09:01, Huang, Ying wrote: > > > Although we can use smaller page order for FLEXIBLE_THP, it's hard to > > > avoid internal fragmentation completely. So, I think that finally we > > > will need to provide a mechanism for the users to opt out, e.g., > > > something like "always madvise never" via > > > /sys/kernel/mm/transparent_hugepage/enabled. I'm not sure whether it's > > > a good idea to reuse the existing interface of THP. > > > > I wouldn't want to tie this to the existing interface, simply because that > > implies that we would want to follow the "always" and "madvise" advice too; That > > means that on a thp=madvise system (which is certainly the case for android and > > other client systems) we would have to disable large anon folios for VMAs that > > haven't explicitly opted in. That breaks the intention that this should be an > > invisible performance boost. I think it's important to set the policy for use of > > It will never ever be a completely invisible performance boost, just like > ordinary THP. > > Using the exact same existing toggle is the right thing to do. If someone > specify "never" or "madvise", then do exactly that. > > It might make sense to have more modes or additional toggles, but > "madvise=never" means no memory waste. I hate the existing mechanisms. 
They are an abdication of our responsibility, and an attempt to blame the user (be it the sysadmin or the programmer) of our code for using it wrongly. We should not replicate this mistake.

Our code should be auto-tuning. I posted a long, detailed outline here:
https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/

> I remember I raised it already in the past, but you *absolutely* have to
> respect the MADV_NOHUGEPAGE flag. There is user space out there (for
> example, userfaultfd) that doesn't want the kernel to populate any
> additional page tables. So if you have to respect that already, then also
> respect MADV_HUGEPAGE, simple.

Possibly having uffd enabled on a VMA should disable using large folios, I can get behind that. But the notion that userspace knows what it's doing ... hahaha. Just ignore the madvise flags. Userspace doesn't know what it's doing.
* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance 2023-07-07 13:57 ` Matthew Wilcox @ 2023-07-07 14:07 ` David Hildenbrand -1 siblings, 0 replies; 167+ messages in thread From: David Hildenbrand @ 2023-07-07 14:07 UTC (permalink / raw) To: Matthew Wilcox Cc: Ryan Roberts, Huang, Ying, Andrew Morton, Kirill A. Shutemov, Yin Fengwei, Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm On 07.07.23 15:57, Matthew Wilcox wrote: > On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote: >> On 07.07.23 11:52, Ryan Roberts wrote: >>> On 07/07/2023 09:01, Huang, Ying wrote: >>>> Although we can use smaller page order for FLEXIBLE_THP, it's hard to >>>> avoid internal fragmentation completely. So, I think that finally we >>>> will need to provide a mechanism for the users to opt out, e.g., >>>> something like "always madvise never" via >>>> /sys/kernel/mm/transparent_hugepage/enabled. I'm not sure whether it's >>>> a good idea to reuse the existing interface of THP. >>> >>> I wouldn't want to tie this to the existing interface, simply because that >>> implies that we would want to follow the "always" and "madvise" advice too; That >>> means that on a thp=madvise system (which is certainly the case for android and >>> other client systems) we would have to disable large anon folios for VMAs that >>> haven't explicitly opted in. That breaks the intention that this should be an >>> invisible performance boost. I think it's important to set the policy for use of >> >> It will never ever be a completely invisible performance boost, just like >> ordinary THP. >> >> Using the exact same existing toggle is the right thing to do. If someone >> specify "never" or "madvise", then do exactly that. >> >> It might make sense to have more modes or additional toggles, but >> "madvise=never" means no memory waste. > > I hate the existing mechanisms. 
They are an abdication of our > responsibility, and an attempt to blame the user (be it the sysadmin > or the programmer) of our code for using it wrongly. We should not > replicate this mistake. I don't agree regarding the programmer responsibility. In some cases the programmer really doesn't want to get more memory populated than requested -- and knows exactly why setting MADV_NOHUGEPAGE is the right thing to do. Regarding the madvise=never/madvise/always (sys admin decision), memory waste (and nailing down bugs or working around them in customer setups) have been very good reasons to let the admin have a word. > > Our code should be auto-tuning. I posted a long, detailed outline here: > https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/ > Well, "auto-tuning" also should be perfect for everybody, but once reality strikes you know it isn't. If people don't feel like using THP, let them have a word. The "madvise" config option is probably more controversial. But the "always vs. never" absolutely makes sense to me. >> I remember I raised it already in the past, but you *absolutely* have to >> respect the MADV_NOHUGEPAGE flag. There is user space out there (for >> example, userfaultfd) that doesn't want the kernel to populate any >> additional page tables. So if you have to respect that already, then also >> respect MADV_HUGEPAGE, simple. > > Possibly having uffd enabled on a VMA should disable using large folios, There are cases where we enable uffd *after* already touching memory (postcopy live migration in QEMU being the famous example). That doesn't fly. > I can get behind that. But the notion that userspace knows what it's > doing ... hahaha. Just ignore the madvise flags. Userspace doesn't > know what it's doing. If user space sets MADV_NOHUGEPAGE, it exactly knows what it is doing ... in some cases. 
And these include cases I care about messing with sparse VM memory :)

I have strong opinions against populating more than required when user space sets MADV_NOHUGEPAGE.

--
Cheers,

David / dhildenb
* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance 2023-07-07 14:07 ` David Hildenbrand @ 2023-07-07 15:13 ` Ryan Roberts -1 siblings, 0 replies; 167+ messages in thread From: Ryan Roberts @ 2023-07-07 15:13 UTC (permalink / raw) To: David Hildenbrand, Matthew Wilcox Cc: Huang, Ying, Andrew Morton, Kirill A. Shutemov, Yin Fengwei, Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm On 07/07/2023 15:07, David Hildenbrand wrote: > On 07.07.23 15:57, Matthew Wilcox wrote: >> On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote: >>> On 07.07.23 11:52, Ryan Roberts wrote: >>>> On 07/07/2023 09:01, Huang, Ying wrote: >>>>> Although we can use smaller page order for FLEXIBLE_THP, it's hard to >>>>> avoid internal fragmentation completely. So, I think that finally we >>>>> will need to provide a mechanism for the users to opt out, e.g., >>>>> something like "always madvise never" via >>>>> /sys/kernel/mm/transparent_hugepage/enabled. I'm not sure whether it's >>>>> a good idea to reuse the existing interface of THP. >>>> >>>> I wouldn't want to tie this to the existing interface, simply because that >>>> implies that we would want to follow the "always" and "madvise" advice too; >>>> That >>>> means that on a thp=madvise system (which is certainly the case for android and >>>> other client systems) we would have to disable large anon folios for VMAs that >>>> haven't explicitly opted in. That breaks the intention that this should be an >>>> invisible performance boost. I think it's important to set the policy for >>>> use of >>> >>> It will never ever be a completely invisible performance boost, just like >>> ordinary THP. >>> >>> Using the exact same existing toggle is the right thing to do. If someone >>> specify "never" or "madvise", then do exactly that. >>> >>> It might make sense to have more modes or additional toggles, but >>> "madvise=never" means no memory waste. 
>> >> I hate the existing mechanisms. They are an abdication of our >> responsibility, and an attempt to blame the user (be it the sysadmin >> or the programmer) of our code for using it wrongly. We should not >> replicate this mistake. > > I don't agree regarding the programmer responsibility. In some cases the > programmer really doesn't want to get more memory populated than requested -- > and knows exactly why setting MADV_NOHUGEPAGE is the right thing to do. > > Regarding the madvise=never/madvise/always (sys admin decision), memory waste > (and nailing down bugs or working around them in customer setups) have been very > good reasons to let the admin have a word. > >> >> Our code should be auto-tuning. I posted a long, detailed outline here: >> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/ >> > > Well, "auto-tuning" also should be perfect for everybody, but once reality > strikes you know it isn't. > > If people don't feel like using THP, let them have a word. The "madvise" config > option is probably more controversial. But the "always vs. never" absolutely > makes sense to me. > >>> I remember I raised it already in the past, but you *absolutely* have to >>> respect the MADV_NOHUGEPAGE flag. There is user space out there (for >>> example, userfaultfd) that doesn't want the kernel to populate any >>> additional page tables. So if you have to respect that already, then also >>> respect MADV_HUGEPAGE, simple. >> >> Possibly having uffd enabled on a VMA should disable using large folios, > > There are cases where we enable uffd *after* already touching memory (postcopy > live migration in QEMU being the famous example). That doesn't fly. > >> I can get behind that. But the notion that userspace knows what it's >> doing ... hahaha. Just ignore the madvise flags. Userspace doesn't >> know what it's doing. > > If user space sets MADV_NOHUGEPAGE, it exactly knows what it is doing ... in > some cases. 
> And these include cases I care about messing with sparse VM memory :)
>
> I have strong opinions against populating more than required when user space sets
> MADV_NOHUGEPAGE.

I can see your point about honouring MADV_NOHUGEPAGE, so I think it is reasonable to fall back to allocating an order-0 page in a VMA that has it set. The app has gone out of its way to explicitly set it, after all.

I think the correct behaviour for the global thp controls (cmdline and sysfs) is less obvious though. I could get on board with disabling large anon folios globally when thp="never". But for other situations, I would prefer to keep large anon folios enabled (treat "madvise" as "always"), with the argument that their order is much smaller than traditional THP and therefore the internal fragmentation is significantly reduced. I really don't want to end up with user space ever having to opt in (with MADV_HUGEPAGE) to see the benefits of large anon folios.

I still feel that it would be better for the thp and large anon folio controls to be independent though - what's the argument for tying them together?
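As a toy model only (the flag and helper names here are invented for illustration and are not the series' actual code), the fallback Ryan concedes above - honour MADV_NOHUGEPAGE and armed uffd by dropping to order-0, otherwise use the preferred large order - can be sketched as:

```c
#include <stdbool.h>

/* Illustrative per-VMA order selection; 4 stands in for an
 * arch-preferred order (e.g. 64K folios with 4K base pages). */
#define PREFERRED_ORDER 4

int anon_folio_order(bool vm_nohugepage, bool uffd_armed)
{
    if (vm_nohugepage) /* app explicitly opted out via MADV_NOHUGEPAGE */
        return 0;      /* fall back to a single order-0 page */
    if (uffd_armed)    /* uffd must not observe extra populated PTEs */
        return 0;
    return PREFERRED_ORDER;
}
```

The open question in the thread is whether the global thp= mode should be a third input to this decision, or whether large anon folios get their own control.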
* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance 2023-07-07 15:13 ` Ryan Roberts @ 2023-07-07 16:06 ` David Hildenbrand -1 siblings, 0 replies; 167+ messages in thread From: David Hildenbrand @ 2023-07-07 16:06 UTC (permalink / raw) To: Ryan Roberts, Matthew Wilcox Cc: Huang, Ying, Andrew Morton, Kirill A. Shutemov, Yin Fengwei, Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm On 07.07.23 17:13, Ryan Roberts wrote: > On 07/07/2023 15:07, David Hildenbrand wrote: >> On 07.07.23 15:57, Matthew Wilcox wrote: >>> On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote: >>>> On 07.07.23 11:52, Ryan Roberts wrote: >>>>> On 07/07/2023 09:01, Huang, Ying wrote: >>>>>> Although we can use smaller page order for FLEXIBLE_THP, it's hard to >>>>>> avoid internal fragmentation completely. So, I think that finally we >>>>>> will need to provide a mechanism for the users to opt out, e.g., >>>>>> something like "always madvise never" via >>>>>> /sys/kernel/mm/transparent_hugepage/enabled. I'm not sure whether it's >>>>>> a good idea to reuse the existing interface of THP. >>>>> >>>>> I wouldn't want to tie this to the existing interface, simply because that >>>>> implies that we would want to follow the "always" and "madvise" advice too; >>>>> That >>>>> means that on a thp=madvise system (which is certainly the case for android and >>>>> other client systems) we would have to disable large anon folios for VMAs that >>>>> haven't explicitly opted in. That breaks the intention that this should be an >>>>> invisible performance boost. I think it's important to set the policy for >>>>> use of >>>> >>>> It will never ever be a completely invisible performance boost, just like >>>> ordinary THP. >>>> >>>> Using the exact same existing toggle is the right thing to do. If someone >>>> specify "never" or "madvise", then do exactly that. 
>>>> >>>> It might make sense to have more modes or additional toggles, but >>>> "madvise=never" means no memory waste. >>> >>> I hate the existing mechanisms. They are an abdication of our >>> responsibility, and an attempt to blame the user (be it the sysadmin >>> or the programmer) of our code for using it wrongly. We should not >>> replicate this mistake. >> >> I don't agree regarding the programmer responsibility. In some cases the >> programmer really doesn't want to get more memory populated than requested -- >> and knows exactly why setting MADV_NOHUGEPAGE is the right thing to do. >> >> Regarding the madvise=never/madvise/always (sys admin decision), memory waste >> (and nailing down bugs or working around them in customer setups) have been very >> good reasons to let the admin have a word. >> >>> >>> Our code should be auto-tuning. I posted a long, detailed outline here: >>> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/ >>> >> >> Well, "auto-tuning" also should be perfect for everybody, but once reality >> strikes you know it isn't. >> >> If people don't feel like using THP, let them have a word. The "madvise" config >> option is probably more controversial. But the "always vs. never" absolutely >> makes sense to me. >> >>>> I remember I raised it already in the past, but you *absolutely* have to >>>> respect the MADV_NOHUGEPAGE flag. There is user space out there (for >>>> example, userfaultfd) that doesn't want the kernel to populate any >>>> additional page tables. So if you have to respect that already, then also >>>> respect MADV_HUGEPAGE, simple. >>> >>> Possibly having uffd enabled on a VMA should disable using large folios, >> >> There are cases where we enable uffd *after* already touching memory (postcopy >> live migration in QEMU being the famous example). That doesn't fly. >> >>> I can get behind that. But the notion that userspace knows what it's >>> doing ... hahaha. Just ignore the madvise flags. 
Userspace doesn't >>> know what it's doing. >> >> If user space sets MADV_NOHUGEPAGE, it exactly knows what it is doing ... in >> some cases. And these include cases I care about messing with sparse VM memory :) >> >> I have strong opinions against populating more than required when user space set >> MADV_NOHUGEPAGE. > > I can see your point about honouring MADV_NOHUGEPAGE, so think that it is > reasonable to fallback to allocating an order-0 page in a VMA that has it set. > The app has gone out of its way to explicitly set it, after all. > > I think the correct behaviour for the global thp controls (cmdline and sysfs) > are less obvious though. I could get on board with disabling large anon folios > globally when thp="never". But for other situations, I would prefer to keep > large anon folios enabled (treat "madvise" as "always"), with the argument that > their order is much smaller than traditional THP and therefore the internal > fragmentation is significantly reduced. I really don't want to end up with user > space ever having to opt-in (with MADV_HUGEPAGE) to see the benefits of large > anon folios. I was briefly playing with a nasty idea of an additional "madvise-pmd" option (that could be the new default), that would use PMD THP only in madvise'd regions, and ordinary everywhere else. But let's disregard that for now. I think there is a bigger issue (below). > > I still feel that it would be better for the thp and large anon folio controls > to be independent though - what's the argument for tying them together? Thinking about desired 2 MiB flexible THP on aarch64 (64k kernel) vs, 2 MiB PMD THP on aarch64 (4k kernel), how are they any different? Just the way they are mapped ... It's easy to say "64k vs. 2 MiB" is a difference and we want separate controls, but how is "2MiB vs. 2 MiB" different? Having that said, I think we have to make up our mind how much control we want to give user space. Again, the "2MiB vs. 
2 MiB" case nicely shows that it's not trivial: memory waste is a real issue on some systems where we limit THP to madvise().

Just throwing it out for discussion: What about keeping the "always / madvise / never" semantics (and MADV_NOHUGEPAGE ...) but having an additional config knob that specifies in which cases we *still* allow flexible THP even though the system was configured for "madvise".

I can't come up with a good name for that, but something like "max_auto_size=64k" could be something reasonable to set. We could have an arch+hw specific default.

(we all hate config options, I know, but I think there are good reasons to have such bare-minimum ones)

--
Cheers,

David / dhildenb
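The semantics David argues for - reuse the existing global mode and always let the per-VMA madvise flags win - can be captured in a small toy model (names invented for illustration; this is not kernel code, and it deliberately omits the "max_auto_size" refinement he floats above):

```c
#include <stdbool.h>

enum thp_mode { THP_NEVER, THP_MADVISE, THP_ALWAYS };

/* Would a large anon folio be allowed for this fault?
 * MADV_NOHUGEPAGE overrides everything; otherwise defer to the
 * global mode, with "madvise" requiring per-VMA opt-in. */
bool allow_large_folio(enum thp_mode mode,
                       bool vm_hugepage, bool vm_nohugepage)
{
    if (vm_nohugepage) /* MADV_NOHUGEPAGE: never populate extra PTEs */
        return false;
    switch (mode) {
    case THP_NEVER:
        return false;
    case THP_MADVISE:
        return vm_hugepage; /* only explicitly opted-in VMAs */
    case THP_ALWAYS:
        return true;
    }
    return false;
}
```

Ryan's counter-position corresponds to making THP_MADVISE return true here for sufficiently small orders, which is exactly what a "max_auto_size"-style knob would parameterize.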
* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance @ 2023-07-07 16:06 ` David Hildenbrand 0 siblings, 0 replies; 167+ messages in thread From: David Hildenbrand @ 2023-07-07 16:06 UTC (permalink / raw) To: Ryan Roberts, Matthew Wilcox Cc: Huang, Ying, Andrew Morton, Kirill A. Shutemov, Yin Fengwei, Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm On 07.07.23 17:13, Ryan Roberts wrote: > On 07/07/2023 15:07, David Hildenbrand wrote: >> On 07.07.23 15:57, Matthew Wilcox wrote: >>> On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote: >>>> On 07.07.23 11:52, Ryan Roberts wrote: >>>>> On 07/07/2023 09:01, Huang, Ying wrote: >>>>>> Although we can use smaller page order for FLEXIBLE_THP, it's hard to >>>>>> avoid internal fragmentation completely. So, I think that finally we >>>>>> will need to provide a mechanism for the users to opt out, e.g., >>>>>> something like "always madvise never" via >>>>>> /sys/kernel/mm/transparent_hugepage/enabled. I'm not sure whether it's >>>>>> a good idea to reuse the existing interface of THP. >>>>> >>>>> I wouldn't want to tie this to the existing interface, simply because that >>>>> implies that we would want to follow the "always" and "madvise" advice too; >>>>> That >>>>> means that on a thp=madvise system (which is certainly the case for android and >>>>> other client systems) we would have to disable large anon folios for VMAs that >>>>> haven't explicitly opted in. That breaks the intention that this should be an >>>>> invisible performance boost. I think it's important to set the policy for >>>>> use of >>>> >>>> It will never ever be a completely invisible performance boost, just like >>>> ordinary THP. >>>> >>>> Using the exact same existing toggle is the right thing to do. If someone >>>> specify "never" or "madvise", then do exactly that. 
>>>> >>>> It might make sense to have more modes or additional toggles, but >>>> "madvise=never" means no memory waste. >>> >>> I hate the existing mechanisms. They are an abdication of our >>> responsibility, and an attempt to blame the user (be it the sysadmin >>> or the programmer) of our code for using it wrongly. We should not >>> replicate this mistake. >> >> I don't agree regarding the programmer responsibility. In some cases the >> programmer really doesn't want to get more memory populated than requested -- >> and knows exactly why setting MADV_NOHUGEPAGE is the right thing to do. >> >> Regarding the madvise=never/madvise/always (sys admin decision), memory waste >> (and nailing down bugs or working around them in customer setups) have been very >> good reasons to let the admin have a word. >> >>> >>> Our code should be auto-tuning. I posted a long, detailed outline here: >>> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/ >>> >> >> Well, "auto-tuning" also should be perfect for everybody, but once reality >> strikes you know it isn't. >> >> If people don't feel like using THP, let them have a word. The "madvise" config >> option is probably more controversial. But the "always vs. never" absolutely >> makes sense to me. >> >>>> I remember I raised it already in the past, but you *absolutely* have to >>>> respect the MADV_NOHUGEPAGE flag. There is user space out there (for >>>> example, userfaultfd) that doesn't want the kernel to populate any >>>> additional page tables. So if you have to respect that already, then also >>>> respect MADV_HUGEPAGE, simple. >>> >>> Possibly having uffd enabled on a VMA should disable using large folios, >> >> There are cases where we enable uffd *after* already touching memory (postcopy >> live migration in QEMU being the famous example). That doesn't fly. >> >>> I can get behind that. But the notion that userspace knows what it's >>> doing ... hahaha. Just ignore the madvise flags. 
Userspace doesn't >>> know what it's doing. >> >> If user space sets MADV_NOHUGEPAGE, it exactly knows what it is doing ... in >> some cases. And these include cases I care about messing with sparse VM memory :) >> >> I have strong opinions against populating more than required when user space set >> MADV_NOHUGEPAGE. > > I can see your point about honouring MADV_NOHUGEPAGE, so think that it is > reasonable to fallback to allocating an order-0 page in a VMA that has it set. > The app has gone out of its way to explicitly set it, after all. > > I think the correct behaviour for the global thp controls (cmdline and sysfs) > are less obvious though. I could get on board with disabling large anon folios > globally when thp="never". But for other situations, I would prefer to keep > large anon folios enabled (treat "madvise" as "always"), with the argument that > their order is much smaller than traditional THP and therefore the internal > fragmentation is significantly reduced. I really don't want to end up with user > space ever having to opt-in (with MADV_HUGEPAGE) to see the benefits of large > anon folios. I was briefly playing with a nasty idea of an additional "madvise-pmd" option (that could be the new default), that would use PMD THP only in madvise'd regions, and ordinary everywhere else. But let's disregard that for now. I think there is a bigger issue (below). > > I still feel that it would be better for the thp and large anon folio controls > to be independent though - what's the argument for tying them together? Thinking about desired 2 MiB flexible THP on aarch64 (64k kernel) vs, 2 MiB PMD THP on aarch64 (4k kernel), how are they any different? Just the way they are mapped ... It's easy to say "64k vs. 2 MiB" is a difference and we want separate controls, but how is "2MiB vs. 2 MiB" different? Having that said, I think we have to make up our mind how much control we want to give user space. Again, the "2MiB vs. 
2 MiB" case nicely shows that it's not trivial: memory waste is a real issue on some systems where we limit THP to madvise().

Just throwing it out for discussion:

What about keeping the "always / madvise / never" semantics (and MADV_NOHUGEPAGE ...) but having an additional config knob that specifies in which cases we *still* allow flexible THP even though the system was configured for "madvise".

I can't come up with a good name for that, but something like "max_auto_size=64k" could be something reasonable to set. We could have an arch+hw specific default.

(we all hate config options, I know, but I think there are good reasons to have such bare-minimum ones) -- Cheers, David / dhildenb _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 167+ messages in thread
* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance 2023-07-07 16:06 ` David Hildenbrand @ 2023-07-07 16:22 ` Ryan Roberts -1 siblings, 0 replies; 167+ messages in thread From: Ryan Roberts @ 2023-07-07 16:22 UTC (permalink / raw) To: David Hildenbrand, Matthew Wilcox Cc: Huang, Ying, Andrew Morton, Kirill A. Shutemov, Yin Fengwei, Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm On 07/07/2023 17:06, David Hildenbrand wrote: > On 07.07.23 17:13, Ryan Roberts wrote: >> On 07/07/2023 15:07, David Hildenbrand wrote: >>> On 07.07.23 15:57, Matthew Wilcox wrote: >>>> On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote: >>>>> On 07.07.23 11:52, Ryan Roberts wrote: >>>>>> On 07/07/2023 09:01, Huang, Ying wrote: >>>>>>> Although we can use smaller page order for FLEXIBLE_THP, it's hard to >>>>>>> avoid internal fragmentation completely. So, I think that finally we >>>>>>> will need to provide a mechanism for the users to opt out, e.g., >>>>>>> something like "always madvise never" via >>>>>>> /sys/kernel/mm/transparent_hugepage/enabled. I'm not sure whether it's >>>>>>> a good idea to reuse the existing interface of THP. >>>>>> >>>>>> I wouldn't want to tie this to the existing interface, simply because that >>>>>> implies that we would want to follow the "always" and "madvise" advice too; >>>>>> That >>>>>> means that on a thp=madvise system (which is certainly the case for >>>>>> android and >>>>>> other client systems) we would have to disable large anon folios for VMAs >>>>>> that >>>>>> haven't explicitly opted in. That breaks the intention that this should be an >>>>>> invisible performance boost. I think it's important to set the policy for >>>>>> use of >>>>> >>>>> It will never ever be a completely invisible performance boost, just like >>>>> ordinary THP. >>>>> >>>>> Using the exact same existing toggle is the right thing to do. 
If someone >>>>> specify "never" or "madvise", then do exactly that. >>>>> >>>>> It might make sense to have more modes or additional toggles, but >>>>> "madvise=never" means no memory waste. >>>> >>>> I hate the existing mechanisms. They are an abdication of our >>>> responsibility, and an attempt to blame the user (be it the sysadmin >>>> or the programmer) of our code for using it wrongly. We should not >>>> replicate this mistake. >>> >>> I don't agree regarding the programmer responsibility. In some cases the >>> programmer really doesn't want to get more memory populated than requested -- >>> and knows exactly why setting MADV_NOHUGEPAGE is the right thing to do. >>> >>> Regarding the madvise=never/madvise/always (sys admin decision), memory waste >>> (and nailing down bugs or working around them in customer setups) have been very >>> good reasons to let the admin have a word. >>> >>>> >>>> Our code should be auto-tuning. I posted a long, detailed outline here: >>>> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/ >>>> >>> >>> Well, "auto-tuning" also should be perfect for everybody, but once reality >>> strikes you know it isn't. >>> >>> If people don't feel like using THP, let them have a word. The "madvise" config >>> option is probably more controversial. But the "always vs. never" absolutely >>> makes sense to me. >>> >>>>> I remember I raised it already in the past, but you *absolutely* have to >>>>> respect the MADV_NOHUGEPAGE flag. There is user space out there (for >>>>> example, userfaultfd) that doesn't want the kernel to populate any >>>>> additional page tables. So if you have to respect that already, then also >>>>> respect MADV_HUGEPAGE, simple. >>>> >>>> Possibly having uffd enabled on a VMA should disable using large folios, >>> >>> There are cases where we enable uffd *after* already touching memory (postcopy >>> live migration in QEMU being the famous example). That doesn't fly. >>> >>>> I can get behind that. 
But the notion that userspace knows what it's >>>> doing ... hahaha. Just ignore the madvise flags. Userspace doesn't >>>> know what it's doing. >>> >>> If user space sets MADV_NOHUGEPAGE, it exactly knows what it is doing ... in >>> some cases. And these include cases I care about messing with sparse VM >>> memory :) >>> >>> I have strong opinions against populating more than required when user space set >>> MADV_NOHUGEPAGE. >> >> I can see your point about honouring MADV_NOHUGEPAGE, so think that it is >> reasonable to fallback to allocating an order-0 page in a VMA that has it set. >> The app has gone out of its way to explicitly set it, after all. >> >> I think the correct behaviour for the global thp controls (cmdline and sysfs) >> are less obvious though. I could get on board with disabling large anon folios >> globally when thp="never". But for other situations, I would prefer to keep >> large anon folios enabled (treat "madvise" as "always"), with the argument that >> their order is much smaller than traditional THP and therefore the internal >> fragmentation is significantly reduced. I really don't want to end up with user >> space ever having to opt-in (with MADV_HUGEPAGE) to see the benefits of large >> anon folios. > > I was briefly playing with a nasty idea of an additional "madvise-pmd" option > (that could be the new default), that would use PMD THP only in madvise'd > regions, and ordinary everywhere else. But let's disregard that for now. I think > there is a bigger issue (below). > >> >> I still feel that it would be better for the thp and large anon folio controls >> to be independent though - what's the argument for tying them together? > > Thinking about desired 2 MiB flexible THP on aarch64 (64k kernel) vs, 2 MiB PMD > THP on aarch64 (4k kernel), how are they any different? Just the way they are > mapped ... 
The last patch in the series shows my current approach to that:

int arch_wants_pte_order(struct vm_area_struct *vma)
{
	if (hugepage_vma_check(vma, vma->vm_flags, false, true, true))
		return CONFIG_ARM64_PTE_ORDER_THP;	<<< always the contpte size
	else
		return CONFIG_ARM64_PTE_ORDER_NOTHP;	<<< limited to 64K
}

But Yu has raised concerns that this type of policy needs to be in the core mm. So we could have the arch blindly return the preferred order from the HW perspective (which would be contpte size for arm64). Then for !hugepage_vma_check(), mm could take the min of that value and some determined "acceptable" limit (which in my mind is 64K ;-).

> > It's easy to say "64k vs. 2 MiB" is a difference and we want separate controls, > but how is "2MiB vs. 2 MiB" different? > > Having that said, I think we have to make up our mind how much control we want > to give user space. Again, the "2MiB vs. 2 MiB" case nicely shows that it's not > trivial: memory waste is a real issue on some systems where we limit THP to > madvise(). > > > Just throwing it out for discussing: > > What about keeping the "all / madvise / never" semantics (and MADV_NOHUGEPAGE > ...) but having an additional config knob that specifies in which cases we > *still* allow flexible THP even though the system was configured for "madvise". > > I can't come up with a good name for that, but something like > "max_auto_size=64k" could be something reasonable to set. We could have an > arch+hw specific default.

Ahha, yes, that's essentially what I have above. I personally also like the idea of the limit being an absolute value rather than an order. Although I know Yu feels differently (see [1]).

[1] https://lore.kernel.org/linux-mm/4d4c45a2-0037-71de-b182-f516fee07e67@arm.com/T/#m2aff6eebd7f14d0d0620b48497d26eacecf970e6

> > (we all hate config options, I know, but I think there are good reasons to have > such bare-minimum ones) > ^ permalink raw reply [flat|nested] 167+ messages in thread
* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance 2023-07-07 16:22 ` Ryan Roberts @ 2023-07-07 19:06 ` David Hildenbrand -1 siblings, 0 replies; 167+ messages in thread From: David Hildenbrand @ 2023-07-07 19:06 UTC (permalink / raw) To: Ryan Roberts, Matthew Wilcox Cc: Huang, Ying, Andrew Morton, Kirill A. Shutemov, Yin Fengwei, Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm >>> I still feel that it would be better for the thp and large anon folio controls >>> to be independent though - what's the argument for tying them together? >> >> Thinking about desired 2 MiB flexible THP on aarch64 (64k kernel) vs, 2 MiB PMD >> THP on aarch64 (4k kernel), how are they any different? Just the way they are >> mapped ... > > The last patch in the series shows my current approach to that: > > int arch_wants_pte_order(struct vm_area_struct *vma) > { > if (hugepage_vma_check(vma, vma->vm_flags, false, true, true)) > return CONFIG_ARM64_PTE_ORDER_THP; <<< always the contpte size > else > return CONFIG_ARM64_PTE_ORDER_NOTHP; <<< limited to 64K > } > > But Yu has raised concerns that this type of policy needs to be in the core mm. > So we could have the arch blindly return the preferred order from HW perspective > (which would be contpte size for arm64). Then for !hugepage_vma_check(), mm > could take the min of that value and some determined "acceptable" limit (which > in my mind is 64K ;-). Yeah, it's really tricky. Because why should arm64 with 64k base pages *not* return 2MiB (which is one possible cont-pte size IIRC) ? I share the idea that 64k might *currently* on *some platforms* be a reasonable choice. But that's where the "fun" begins. > >> >> It's easy to say "64k vs. 2 MiB" is a difference and we want separate controls, >> but how is "2MiB vs. 2 MiB" different? >> >> Having that said, I think we have to make up our mind how much control we want >> to give user space. Again, the "2MiB vs. 
2 MiB" case nicely shows that it's not >> trivial: memory waste is a real issue on some systems where we limit THP to >> madvise(). >> >> >> Just throwing it out for discussing: >> >> What about keeping the "all / madvise / never" semantics (and MADV_NOHUGEPAGE >> ...) but having an additional config knob that specifies in which cases we >> *still* allow flexible THP even though the system was configured for "madvise". >> >> I can't come up with a good name for that, but something like >> "max_auto_size=64k" could be something reasonable to set. We could have an >> arch+hw specific default. > > Ahha, yes, that's essentially what I have above. I personally also like the idea > of the limit being an absolute value rather than an order. Although I know Yu > feels differently (see [1]).

Exposed to user space I think it should be a human-readable value. Inside the kernel, I don't particularly care.

(Having databases/VMs on aarch64 with 64k in mind) I think it might be interesting to have something like the following:

thp=madvise
max_auto_size=64k/128k/256k

So in MADV_HUGEPAGE VMAs (such as under QEMU), we'd happily take any flexible THP, especially ones < PMD THP (512 MiB) as well. 2 MiB or 4 MiB THP? sure, give them to my VM. You're barely going to find 512 MiB THP either way in practice ....

But for the remainder of my system, just do something reasonable and don't go crazy on the memory waste.

I'll try reading all the previous discussions next week. -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 167+ messages in thread
* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance 2023-07-07 19:06 ` David Hildenbrand @ 2023-07-10 8:41 ` Ryan Roberts -1 siblings, 0 replies; 167+ messages in thread From: Ryan Roberts @ 2023-07-10 8:41 UTC (permalink / raw) To: David Hildenbrand, Matthew Wilcox Cc: Huang, Ying, Andrew Morton, Kirill A. Shutemov, Yin Fengwei, Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm On 07/07/2023 20:06, David Hildenbrand wrote: >>>> I still feel that it would be better for the thp and large anon folio controls >>>> to be independent though - what's the argument for tying them together? >>> >>> Thinking about desired 2 MiB flexible THP on aarch64 (64k kernel) vs, 2 MiB PMD >>> THP on aarch64 (4k kernel), how are they any different? Just the way they are >>> mapped ... >> >> The last patch in the series shows my current approach to that: >> >> int arch_wants_pte_order(struct vm_area_struct *vma) >> { >> if (hugepage_vma_check(vma, vma->vm_flags, false, true, true)) >> return CONFIG_ARM64_PTE_ORDER_THP; <<< always the contpte size >> else >> return CONFIG_ARM64_PTE_ORDER_NOTHP; <<< limited to 64K >> } >> >> But Yu has raised concerns that this type of policy needs to be in the core mm. >> So we could have the arch blindly return the preferred order from HW perspective >> (which would be contpte size for arm64). Then for !hugepage_vma_check(), mm >> could take the min of that value and some determined "acceptable" limit (which >> in my mind is 64K ;-). > > Yeah, it's really tricky. Because why should arm64 with 64k base pages *not* > return 2MiB (which is one possible cont-pte size IIRC) ? > > I share the idea that 64k might *currently* on *some platforms* be a reasonable > choice. But that's where the "fun" begins. > >> >>> >>> It's easy to say "64k vs. 2 MiB" is a difference and we want separate controls, >>> but how is "2MiB vs. 2 MiB" different? 
>>> >>> Having that said, I think we have to make up our mind how much control we want >>> to give user space. Again, the "2MiB vs. 2 MiB" case nicely shows that it's not >>> trivial: memory waste is a real issue on some systems where we limit THP to >>> madvise(). >>> >>> >>> Just throwing it out for discussing: >>> >>> What about keeping the "all / madvise / never" semantics (and MADV_NOHUGEPAGE >>> ...) but having an additional config knob that specifies in which cases we >>> *still* allow flexible THP even though the system was configured for "madvise". >>> >>> I can't come up with a good name for that, but something like >>> "max_auto_size=64k" could be something reasonable to set. We could have an >>> arch+hw specific default. >> >> Ahha, yes, that's essentially what I have above. I personally also like the idea >> of the limit being an absolute value rather than an order. Although I know Yu >> feels differently (see [1]). > > Exposed to user space I think it should be a human-readable value. Inside the > kernel, I don't particularly care.

My point was less about human-readable vs not. It was about expressing a value that is relative to the base page size vs expressing a value that is independent of base page size. If the concern is about limiting internal fragmentation, I think it's the absolute size that matters.

> > (Having databases/VMs on arch64 with 64k in mind) I think it might be > interesting to have something like the following: > > thp=madvise > max_auto_size=64k/128k/256k > > > So in MADV_HUGEPAGE VMAs (such as under QEMU), we'd happily take any flexible > THP, especially ones < PMD THP (512 MiB) as well. 2 MiB or 4 MiB THP? sure, give > them to my VM. You're barely going to find 512 MiB THP either way in practice .... > > But for the remainder of my system, just do something reasonable and don't go > crazy on the memory waste.

Yep, we're on the same page. I've got a v3 that's almost ready to go, based on Yu's previous round of review.
I'm going to incorporate this mechanism into it, then post hopefully later in the week. Now I just need to figure out a decent name for the max_auto_size control... > > > I'll try reading all the previous discussions next week. > ^ permalink raw reply [flat|nested] 167+ messages in thread
* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance @ 2023-07-10 8:41 ` Ryan Roberts 0 siblings, 0 replies; 167+ messages in thread From: Ryan Roberts @ 2023-07-10 8:41 UTC (permalink / raw) To: David Hildenbrand, Matthew Wilcox Cc: Huang, Ying, Andrew Morton, Kirill A. Shutemov, Yin Fengwei, Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm On 07/07/2023 20:06, David Hildenbrand wrote: >>>> I still feel that it would be better for the thp and large anon folio controls >>>> to be independent though - what's the argument for tying them together? >>> >>> Thinking about desired 2 MiB flexible THP on aarch64 (64k kernel) vs, 2 MiB PMD >>> THP on aarch64 (4k kernel), how are they any different? Just the way they are >>> mapped ... >> >> The last patch in the series shows my current approach to that: >> >> int arch_wants_pte_order(struct vm_area_struct *vma) >> { >> if (hugepage_vma_check(vma, vma->vm_flags, false, true, true)) >> return CONFIG_ARM64_PTE_ORDER_THP; <<< always the contpte size >> else >> return CONFIG_ARM64_PTE_ORDER_NOTHP; <<< limited to 64K >> } >> >> But Yu has raised concerns that this type of policy needs to be in the core mm. >> So we could have the arch blindly return the preferred order from HW perspective >> (which would be contpte size for arm64). Then for !hugepage_vma_check(), mm >> could take the min of that value and some determined "acceptable" limit (which >> in my mind is 64K ;-). > > Yeah, it's really tricky. Because why should arm64 with 64k base pages *not* > return 2MiB (which is one possible cont-pte size IIRC) ? > > I share the idea that 64k might *currently* on *some platforms* be a reasonable > choice. But that's where the "fun" begins. > >> >>> >>> It's easy to say "64k vs. 2 MiB" is a difference and we want separate controls, >>> but how is "2MiB vs. 2 MiB" different? 
>>> >>> Having that said, I think we have to make up our mind how much control we want >>> to give user space. Again, the "2MiB vs. 2 MiB" case nicely shows that it's not >>> trivial: memory waste is a real issue on some systems where we limit THP to >>> madvise(). >>> >>> >>> Just throwing it out for discussing: >>> >>> What about keeping the "all / madvise / never" semantics (and MADV_NOHUGEPAGE >>> ...) but having an additional config knob that specifies in which cases we >>> *still* allow flexible THP even though the system was configured for "madvise". >>> >>> I can't come up with a good name for that, but something like >>> "max_auto_size=64k" could be something reasonable to set. We could have an >>> arch+hw specific default. >> >> Ahha, yes, that's essentially what I have above. I personally also like the idea >> of the limit being an absolute value rather than an order. Although I know Yu >> feels differently (see [1]). > > Exposed to user space I think it should be a human-readable value. Inside the > kernel, I don't particularly care. My point was less about human-readable vs not. It was about expressing a value that is relative to the base page size vs expressing a value that is independent of base page size. If the concern is about limiting internal fragmentation, I think its the absolute size that matters. > > (Having databases/VMs on arch64 with 64k in mind) I think it might be > interesting to have something like the following: > > thp=madvise > max_auto_size=64k/128k/256k > > > So in MADV_HUGEPAGE VMAs (such as under QEMU), we'd happily take any flexible > THP, especially ones < PMD THP (512 MiB) as well. 2 MiB or 4 MiB THP? sure, give > them to my VM. You're barely going to find 512 MiB THP either way in practice .... > > But for the remainder of my system, just do something reasonable and don't go > crazy on the memory waste. Yep, we're on the same page. I've got a v3 that's almost ready to go, based on Yu's prevuous round of review. 
I'm going to incorporate this mechanism into it and then post, hopefully later in the week. Now I just need to figure out a decent name for the max_auto_size control... > > > I'll try reading all the previous discussions next week. > _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 167+ messages in thread
* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance 2023-07-07 15:13 ` Ryan Roberts @ 2023-07-10 3:03 ` Huang, Ying -1 siblings, 0 replies; 167+ messages in thread From: Huang, Ying @ 2023-07-10 3:03 UTC (permalink / raw) To: Ryan Roberts Cc: David Hildenbrand, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Yin Fengwei, Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm, Johannes Weiner, Alexander Zhu Ryan Roberts <ryan.roberts@arm.com> writes: > On 07/07/2023 15:07, David Hildenbrand wrote: >> On 07.07.23 15:57, Matthew Wilcox wrote: >>> On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote: >>>> On 07.07.23 11:52, Ryan Roberts wrote: >>>>> On 07/07/2023 09:01, Huang, Ying wrote: >>>>>> Although we can use smaller page order for FLEXIBLE_THP, it's hard to >>>>>> avoid internal fragmentation completely. So, I think that finally we >>>>>> will need to provide a mechanism for the users to opt out, e.g., >>>>>> something like "always madvise never" via >>>>>> /sys/kernel/mm/transparent_hugepage/enabled. I'm not sure whether it's >>>>>> a good idea to reuse the existing interface of THP. >>>>> >>>>> I wouldn't want to tie this to the existing interface, simply because that >>>>> implies that we would want to follow the "always" and "madvise" advice too; >>>>> That >>>>> means that on a thp=madvise system (which is certainly the case for android and >>>>> other client systems) we would have to disable large anon folios for VMAs that >>>>> haven't explicitly opted in. That breaks the intention that this should be an >>>>> invisible performance boost. I think it's important to set the policy for >>>>> use of >>>> >>>> It will never ever be a completely invisible performance boost, just like >>>> ordinary THP. >>>> >>>> Using the exact same existing toggle is the right thing to do. If someone >>>> specify "never" or "madvise", then do exactly that. 
>>>> >>>> It might make sense to have more modes or additional toggles, but >>>> "madvise=never" means no memory waste. >>> >>> I hate the existing mechanisms. They are an abdication of our >>> responsibility, and an attempt to blame the user (be it the sysadmin >>> or the programmer) of our code for using it wrongly. We should not >>> replicate this mistake. >> >> I don't agree regarding the programmer responsibility. In some cases the >> programmer really doesn't want to get more memory populated than requested -- >> and knows exactly why setting MADV_NOHUGEPAGE is the right thing to do. >> >> Regarding the madvise=never/madvise/always (sys admin decision), memory waste >> (and nailing down bugs or working around them in customer setups) have been very >> good reasons to let the admin have a word. >> >>> >>> Our code should be auto-tuning. I posted a long, detailed outline here: >>> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/ >>> >> >> Well, "auto-tuning" also should be perfect for everybody, but once reality >> strikes you know it isn't. >> >> If people don't feel like using THP, let them have a word. The "madvise" config >> option is probably more controversial. But the "always vs. never" absolutely >> makes sense to me. >> >>>> I remember I raised it already in the past, but you *absolutely* have to >>>> respect the MADV_NOHUGEPAGE flag. There is user space out there (for >>>> example, userfaultfd) that doesn't want the kernel to populate any >>>> additional page tables. So if you have to respect that already, then also >>>> respect MADV_HUGEPAGE, simple. >>> >>> Possibly having uffd enabled on a VMA should disable using large folios, >> >> There are cases where we enable uffd *after* already touching memory (postcopy >> live migration in QEMU being the famous example). That doesn't fly. >> >>> I can get behind that. But the notion that userspace knows what it's >>> doing ... hahaha. Just ignore the madvise flags. 
Userspace doesn't >>> know what it's doing. >> >> If user space sets MADV_NOHUGEPAGE, it exactly knows what it is doing ... in >> some cases. And these include cases I care about messing with sparse VM memory :) >> >> I have strong opinions against populating more than required when user space set >> MADV_NOHUGEPAGE. > I can see your point about honouring MADV_NOHUGEPAGE, so think that it is > reasonable to fallback to allocating an order-0 page in a VMA that has it set. > The app has gone out of its way to explicitly set it, after all. > > I think the correct behaviour for the global thp controls (cmdline and sysfs) > are less obvious though. I could get on board with disabling large anon folios > globally when thp="never". But for other situations, I would prefer to keep > large anon folios enabled (treat "madvise" as "always"), If we have some mechanism to auto-tune the large folios usage, for example, detect the internal fragmentation and split the large folio, then we can use thp="always" as the default configuration. If my memory serves correctly, this is what Johannes and Alexander are working on. > with the argument that > their order is much smaller than traditional THP and therefore the internal > fragmentation is significantly reduced. Do you have any data for this? > I really don't want to end up with user > space ever having to opt-in (with MADV_HUGEPAGE) to see the benefits of large > anon folios. > > I still feel that it would be better for the thp and large anon folio controls > to be independent though - what's the argument for tying them together? > Best Regards, Huang, Ying ^ permalink raw reply [flat|nested] 167+ messages in thread
* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance 2023-07-10 3:03 ` Huang, Ying @ 2023-07-10 8:55 ` Ryan Roberts -1 siblings, 0 replies; 167+ messages in thread From: Ryan Roberts @ 2023-07-10 8:55 UTC (permalink / raw) To: Huang, Ying Cc: David Hildenbrand, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Yin Fengwei, Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm, Johannes Weiner, Alexander Zhu On 10/07/2023 04:03, Huang, Ying wrote: > Ryan Roberts <ryan.roberts@arm.com> writes: > >> On 07/07/2023 15:07, David Hildenbrand wrote: >>> On 07.07.23 15:57, Matthew Wilcox wrote: >>>> On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote: >>>>> On 07.07.23 11:52, Ryan Roberts wrote: >>>>>> On 07/07/2023 09:01, Huang, Ying wrote: >>>>>>> Although we can use smaller page order for FLEXIBLE_THP, it's hard to >>>>>>> avoid internal fragmentation completely. So, I think that finally we >>>>>>> will need to provide a mechanism for the users to opt out, e.g., >>>>>>> something like "always madvise never" via >>>>>>> /sys/kernel/mm/transparent_hugepage/enabled. I'm not sure whether it's >>>>>>> a good idea to reuse the existing interface of THP. >>>>>> >>>>>> I wouldn't want to tie this to the existing interface, simply because that >>>>>> implies that we would want to follow the "always" and "madvise" advice too; >>>>>> That >>>>>> means that on a thp=madvise system (which is certainly the case for android and >>>>>> other client systems) we would have to disable large anon folios for VMAs that >>>>>> haven't explicitly opted in. That breaks the intention that this should be an >>>>>> invisible performance boost. I think it's important to set the policy for >>>>>> use of >>>>> >>>>> It will never ever be a completely invisible performance boost, just like >>>>> ordinary THP. >>>>> >>>>> Using the exact same existing toggle is the right thing to do. 
If someone >>>>> specify "never" or "madvise", then do exactly that. >>>>> >>>>> It might make sense to have more modes or additional toggles, but >>>>> "madvise=never" means no memory waste. >>>> >>>> I hate the existing mechanisms. They are an abdication of our >>>> responsibility, and an attempt to blame the user (be it the sysadmin >>>> or the programmer) of our code for using it wrongly. We should not >>>> replicate this mistake. >>> >>> I don't agree regarding the programmer responsibility. In some cases the >>> programmer really doesn't want to get more memory populated than requested -- >>> and knows exactly why setting MADV_NOHUGEPAGE is the right thing to do. >>> >>> Regarding the madvise=never/madvise/always (sys admin decision), memory waste >>> (and nailing down bugs or working around them in customer setups) have been very >>> good reasons to let the admin have a word. >>> >>>> >>>> Our code should be auto-tuning. I posted a long, detailed outline here: >>>> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/ >>>> >>> >>> Well, "auto-tuning" also should be perfect for everybody, but once reality >>> strikes you know it isn't. >>> >>> If people don't feel like using THP, let them have a word. The "madvise" config >>> option is probably more controversial. But the "always vs. never" absolutely >>> makes sense to me. >>> >>>>> I remember I raised it already in the past, but you *absolutely* have to >>>>> respect the MADV_NOHUGEPAGE flag. There is user space out there (for >>>>> example, userfaultfd) that doesn't want the kernel to populate any >>>>> additional page tables. So if you have to respect that already, then also >>>>> respect MADV_HUGEPAGE, simple. >>>> >>>> Possibly having uffd enabled on a VMA should disable using large folios, >>> >>> There are cases where we enable uffd *after* already touching memory (postcopy >>> live migration in QEMU being the famous example). That doesn't fly. >>> >>>> I can get behind that. 
But the notion that userspace knows what it's >>>> doing ... hahaha. Just ignore the madvise flags. Userspace doesn't >>>> know what it's doing. >>> >>> If user space sets MADV_NOHUGEPAGE, it exactly knows what it is doing ... in >>> some cases. And these include cases I care about messing with sparse VM memory :) >>> >>> I have strong opinions against populating more than required when user space set >>> MADV_NOHUGEPAGE. >> >> I can see your point about honouring MADV_NOHUGEPAGE, so think that it is >> reasonable to fallback to allocating an order-0 page in a VMA that has it set. >> The app has gone out of its way to explicitly set it, after all. >> >> I think the correct behaviour for the global thp controls (cmdline and sysfs) >> are less obvious though. I could get on board with disabling large anon folios >> globally when thp="never". But for other situations, I would prefer to keep >> large anon folios enabled (treat "madvise" as "always"), > > If we have some mechanism to auto-tune the large folios usage, for > example, detect the internal fragmentation and split the large folio, > then we can use thp="always" as default configuration. If my memory > were correct, this is what Johannes and Alexander is working on. Could you point me to that work? I'd like to understand what the mechanism is. The other half of my work aims to use arm64's pte "contiguous bit" to tell the HW that a span of PTEs share the same mapping and is therefore coalesced into a single TLB entry. The side effect of this, however, is that we only have a single access and dirty bit for the whole contpte extent. So I'd like to avoid any mechanism that relies on getting access/dirty at the base page granularity for a large folio. > >> with the argument that >> their order is much smaller than traditional THP and therefore the internal >> fragmentation is significantly reduced. > > Do you have any data for this? 
Some; it's partly based on intuition that the smaller the allocation unit, the smaller the internal fragmentation. And partly on peak memory usage data I've collected for the benchmarks I'm running, comparing a baseline-4k kernel with baseline-16k and baseline-64k kernels, along with a 4k kernel that supports large anon folios (I appreciate that's not exactly what we are talking about here, and it's not exactly an extensive set of results!):

Kernel Compilation with 8 Jobs:
| kernel        |   peak |
|:--------------|-------:|
| baseline-4k   |   0.0% |
| anonfolio     |   0.1% |
| baseline-16k  |   6.3% |
| baseline-64k  |  28.1% |

Kernel Compilation with 80 Jobs:
| kernel        |   peak |
|:--------------|-------:|
| baseline-4k   |   0.0% |
| anonfolio     |   1.7% |
| baseline-16k  |   2.6% |
| baseline-64k  |  12.3% |

> >> I really don't want to end up with user >> space ever having to opt-in (with MADV_HUGEPAGE) to see the benefits of large >> anon folios. >> >> I still feel that it would be better for the thp and large anon folio controls >> to be independent though - what's the argument for tying them together? >> > > Best Regards, > Huang, Ying > ^ permalink raw reply [flat|nested] 167+ messages in thread
* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance 2023-07-10 8:55 ` Ryan Roberts @ 2023-07-10 9:18 ` Huang, Ying -1 siblings, 0 replies; 167+ messages in thread From: Huang, Ying @ 2023-07-10 9:18 UTC (permalink / raw) To: Ryan Roberts Cc: David Hildenbrand, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Yin Fengwei, Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm, Johannes Weiner, Alexander Zhu Ryan Roberts <ryan.roberts@arm.com> writes: > On 10/07/2023 04:03, Huang, Ying wrote: >> Ryan Roberts <ryan.roberts@arm.com> writes: >> >>> On 07/07/2023 15:07, David Hildenbrand wrote: >>>> On 07.07.23 15:57, Matthew Wilcox wrote: >>>>> On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote: >>>>>> On 07.07.23 11:52, Ryan Roberts wrote: >>>>>>> On 07/07/2023 09:01, Huang, Ying wrote: >>>>>>>> Although we can use smaller page order for FLEXIBLE_THP, it's hard to >>>>>>>> avoid internal fragmentation completely. So, I think that finally we >>>>>>>> will need to provide a mechanism for the users to opt out, e.g., >>>>>>>> something like "always madvise never" via >>>>>>>> /sys/kernel/mm/transparent_hugepage/enabled. I'm not sure whether it's >>>>>>>> a good idea to reuse the existing interface of THP. >>>>>>> >>>>>>> I wouldn't want to tie this to the existing interface, simply because that >>>>>>> implies that we would want to follow the "always" and "madvise" advice too; >>>>>>> That >>>>>>> means that on a thp=madvise system (which is certainly the case for android and >>>>>>> other client systems) we would have to disable large anon folios for VMAs that >>>>>>> haven't explicitly opted in. That breaks the intention that this should be an >>>>>>> invisible performance boost. I think it's important to set the policy for >>>>>>> use of >>>>>> >>>>>> It will never ever be a completely invisible performance boost, just like >>>>>> ordinary THP. 
>>>>>> >>>>>> Using the exact same existing toggle is the right thing to do. If someone >>>>>> specify "never" or "madvise", then do exactly that. >>>>>> >>>>>> It might make sense to have more modes or additional toggles, but >>>>>> "madvise=never" means no memory waste. >>>>> >>>>> I hate the existing mechanisms. They are an abdication of our >>>>> responsibility, and an attempt to blame the user (be it the sysadmin >>>>> or the programmer) of our code for using it wrongly. We should not >>>>> replicate this mistake. >>>> >>>> I don't agree regarding the programmer responsibility. In some cases the >>>> programmer really doesn't want to get more memory populated than requested -- >>>> and knows exactly why setting MADV_NOHUGEPAGE is the right thing to do. >>>> >>>> Regarding the madvise=never/madvise/always (sys admin decision), memory waste >>>> (and nailing down bugs or working around them in customer setups) have been very >>>> good reasons to let the admin have a word. >>>> >>>>> >>>>> Our code should be auto-tuning. I posted a long, detailed outline here: >>>>> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/ >>>>> >>>> >>>> Well, "auto-tuning" also should be perfect for everybody, but once reality >>>> strikes you know it isn't. >>>> >>>> If people don't feel like using THP, let them have a word. The "madvise" config >>>> option is probably more controversial. But the "always vs. never" absolutely >>>> makes sense to me. >>>> >>>>>> I remember I raised it already in the past, but you *absolutely* have to >>>>>> respect the MADV_NOHUGEPAGE flag. There is user space out there (for >>>>>> example, userfaultfd) that doesn't want the kernel to populate any >>>>>> additional page tables. So if you have to respect that already, then also >>>>>> respect MADV_HUGEPAGE, simple. 
>>>>> >>>>> Possibly having uffd enabled on a VMA should disable using large folios, >>>> >>>> There are cases where we enable uffd *after* already touching memory (postcopy >>>> live migration in QEMU being the famous example). That doesn't fly. >>>> >>>>> I can get behind that. But the notion that userspace knows what it's >>>>> doing ... hahaha. Just ignore the madvise flags. Userspace doesn't >>>>> know what it's doing. >>>> >>>> If user space sets MADV_NOHUGEPAGE, it exactly knows what it is doing ... in >>>> some cases. And these include cases I care about messing with sparse VM memory :) >>>> >>>> I have strong opinions against populating more than required when user space set >>>> MADV_NOHUGEPAGE. >>> >>> I can see your point about honouring MADV_NOHUGEPAGE, so think that it is >>> reasonable to fallback to allocating an order-0 page in a VMA that has it set. >>> The app has gone out of its way to explicitly set it, after all. >>> >>> I think the correct behaviour for the global thp controls (cmdline and sysfs) >>> are less obvious though. I could get on board with disabling large anon folios >>> globally when thp="never". But for other situations, I would prefer to keep >>> large anon folios enabled (treat "madvise" as "always"), >> >> If we have some mechanism to auto-tune the large folios usage, for >> example, detect the internal fragmentation and split the large folio, >> then we can use thp="always" as default configuration. If my memory >> were correct, this is what Johannes and Alexander is working on. > > Could you point me to that work? I'd like to understand what the mechanism is. > The other half of my work aims to use arm64's pte "contiguous bit" to tell the > HW that a span of PTEs share the same mapping and is therefore coalesced into a > single TLB entry. The side effect of this, however, is that we only have a > single access and dirty bit for the whole contpte extent. 
So I'd like to avoid > any mechanism that relies on getting access/dirty at the base page granularity > for a large folio. Please take a look at the THP shrinker patchset, https://lore.kernel.org/linux-mm/cover.1667454613.git.alexlzhu@fb.com/ >> >>> with the argument that >>> their order is much smaller than traditional THP and therefore the internal >>> fragmentation is significantly reduced. >> >> Do you have any data for this? > > Some; its partly based on intuition that the smaller the allocation unit, the > smaller the internal fragmentation. And partly on peak memory usage data I've > collected for the benchmarks I'm running, comparing baseline-4k kernel with > baseline-16k and baseline-64 kernels along with a 4k kernel that supports large > anon folios (I appreciate that's not exactly what we are talking about here, and > it's not exactly an extensive set of results!): > > > Kernel Compliation with 8 Jobs:
> | kernel | peak |
> |:--------------|-------:|
> | baseline-4k | 0.0% |
> | anonfolio | 0.1% |
> | baseline-16k | 6.3% |
> | baseline-64k | 28.1% |
> > > Kernel Compliation with 80 Jobs:
> | kernel | peak |
> |:--------------|-------:|
> | baseline-4k | 0.0% |
> | anonfolio | 1.7% |
> | baseline-16k | 2.6% |
> | baseline-64k | 12.3% |

Why is anonfolio better than baseline-64k if you always allocate 64k anonymous folios? Because the page cache uses 64k in baseline-64k? We may need to test some workloads with sparse access patterns too. Best Regards, Huang, Ying >> >>> I really don't want to end up with user >>> space ever having to opt-in (with MADV_HUGEPAGE) to see the benefits of large >>> anon folios. >>> >>> I still feel that it would be better for the thp and large anon folio controls >>> to be independent though - what's the argument for tying them together? >>> ^ permalink raw reply [flat|nested] 167+ messages in thread
* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance @ 2023-07-10 9:18 ` Huang, Ying 0 siblings, 0 replies; 167+ messages in thread From: Huang, Ying @ 2023-07-10 9:18 UTC (permalink / raw) To: Ryan Roberts Cc: David Hildenbrand, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Yin Fengwei, Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm, Johannes Weiner, Alexander Zhu Ryan Roberts <ryan.roberts@arm.com> writes: > On 10/07/2023 04:03, Huang, Ying wrote: >> Ryan Roberts <ryan.roberts@arm.com> writes: >> >>> On 07/07/2023 15:07, David Hildenbrand wrote: >>>> On 07.07.23 15:57, Matthew Wilcox wrote: >>>>> On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote: >>>>>> On 07.07.23 11:52, Ryan Roberts wrote: >>>>>>> On 07/07/2023 09:01, Huang, Ying wrote: >>>>>>>> Although we can use smaller page order for FLEXIBLE_THP, it's hard to >>>>>>>> avoid internal fragmentation completely. So, I think that finally we >>>>>>>> will need to provide a mechanism for the users to opt out, e.g., >>>>>>>> something like "always madvise never" via >>>>>>>> /sys/kernel/mm/transparent_hugepage/enabled. I'm not sure whether it's >>>>>>>> a good idea to reuse the existing interface of THP. >>>>>>> >>>>>>> I wouldn't want to tie this to the existing interface, simply because that >>>>>>> implies that we would want to follow the "always" and "madvise" advice too; >>>>>>> That >>>>>>> means that on a thp=madvise system (which is certainly the case for android and >>>>>>> other client systems) we would have to disable large anon folios for VMAs that >>>>>>> haven't explicitly opted in. That breaks the intention that this should be an >>>>>>> invisible performance boost. I think it's important to set the policy for >>>>>>> use of >>>>>> >>>>>> It will never ever be a completely invisible performance boost, just like >>>>>> ordinary THP. >>>>>> >>>>>> Using the exact same existing toggle is the right thing to do. 
If someone >>>>>> specify "never" or "madvise", then do exactly that. >>>>>> >>>>>> It might make sense to have more modes or additional toggles, but >>>>>> "madvise=never" means no memory waste. >>>>> >>>>> I hate the existing mechanisms. They are an abdication of our >>>>> responsibility, and an attempt to blame the user (be it the sysadmin >>>>> or the programmer) of our code for using it wrongly. We should not >>>>> replicate this mistake. >>>> >>>> I don't agree regarding the programmer responsibility. In some cases the >>>> programmer really doesn't want to get more memory populated than requested -- >>>> and knows exactly why setting MADV_NOHUGEPAGE is the right thing to do. >>>> >>>> Regarding the madvise=never/madvise/always (sys admin decision), memory waste >>>> (and nailing down bugs or working around them in customer setups) have been very >>>> good reasons to let the admin have a word. >>>> >>>>> >>>>> Our code should be auto-tuning. I posted a long, detailed outline here: >>>>> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/ >>>>> >>>> >>>> Well, "auto-tuning" also should be perfect for everybody, but once reality >>>> strikes you know it isn't. >>>> >>>> If people don't feel like using THP, let them have a word. The "madvise" config >>>> option is probably more controversial. But the "always vs. never" absolutely >>>> makes sense to me. >>>> >>>>>> I remember I raised it already in the past, but you *absolutely* have to >>>>>> respect the MADV_NOHUGEPAGE flag. There is user space out there (for >>>>>> example, userfaultfd) that doesn't want the kernel to populate any >>>>>> additional page tables. So if you have to respect that already, then also >>>>>> respect MADV_HUGEPAGE, simple. >>>>> >>>>> Possibly having uffd enabled on a VMA should disable using large folios, >>>> >>>> There are cases where we enable uffd *after* already touching memory (postcopy >>>> live migration in QEMU being the famous example). 
That doesn't fly. >>>> >>>>> I can get behind that. But the notion that userspace knows what it's >>>>> doing ... hahaha. Just ignore the madvise flags. Userspace doesn't >>>>> know what it's doing. >>>> >>>> If user space sets MADV_NOHUGEPAGE, it exactly knows what it is doing ... in >>>> some cases. And these include cases I care about messing with sparse VM memory :) >>>> >>>> I have strong opinions against populating more than required when user space set >>>> MADV_NOHUGEPAGE. >>> >>> I can see your point about honouring MADV_NOHUGEPAGE, so think that it is >>> reasonable to fallback to allocating an order-0 page in a VMA that has it set. >>> The app has gone out of its way to explicitly set it, after all. >>> >>> I think the correct behaviour for the global thp controls (cmdline and sysfs) >>> are less obvious though. I could get on board with disabling large anon folios >>> globally when thp="never". But for other situations, I would prefer to keep >>> large anon folios enabled (treat "madvise" as "always"), >> >> If we have some mechanism to auto-tune the large folios usage, for >> example, detect the internal fragmentation and split the large folio, >> then we can use thp="always" as default configuration. If my memory >> were correct, this is what Johannes and Alexander is working on. > > Could you point me to that work? I'd like to understand what the mechanism is. > The other half of my work aims to use arm64's pte "contiguous bit" to tell the > HW that a span of PTEs share the same mapping and is therefore coalesced into a > single TLB entry. The side effect of this, however, is that we only have a > single access and dirty bit for the whole contpte extent. So I'd like to avoid > any mechanism that relies on getting access/dirty at the base page granularity > for a large folio. 
Please take a look at the THP shrinker patchset, https://lore.kernel.org/linux-mm/cover.1667454613.git.alexlzhu@fb.com/ >> >>> with the argument that >>> their order is much smaller than traditional THP and therefore the internal >>> fragmentation is significantly reduced. >> >> Do you have any data for this? > > Some; it's partly based on intuition that the smaller the allocation unit, the > smaller the internal fragmentation. And partly on peak memory usage data I've > collected for the benchmarks I'm running, comparing baseline-4k kernel with > baseline-16k and baseline-64k kernels along with a 4k kernel that supports large > anon folios (I appreciate that's not exactly what we are talking about here, and > it's not exactly an extensive set of results!): > > > Kernel Compilation with 8 Jobs: > | kernel | peak | > |:--------------|-------:| > | baseline-4k | 0.0% | > | anonfolio | 0.1% | > | baseline-16k | 6.3% | > | baseline-64k | 28.1% | > > > Kernel Compilation with 80 Jobs: > | kernel | peak | > |:--------------|-------:| > | baseline-4k | 0.0% | > | anonfolio | 1.7% | > | baseline-16k | 2.6% | > | baseline-64k | 12.3% | > Why is anonfolio better than baseline-64k if you always allocate 64k anonymous folio? Because page cache uses 64k in baseline-64k? We may need to test some workloads with sparse access patterns too. Best Regards, Huang, Ying >> >>> I really don't want to end up with user >>> space ever having to opt-in (with MADV_HUGEPAGE) to see the benefits of large >>> anon folios. >>> >>> I still feel that it would be better for the thp and large anon folio controls >>> to be independent though - what's the argument for tying them together? >>> _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 167+ messages in thread
* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance 2023-07-10 9:18 ` Huang, Ying @ 2023-07-10 9:25 ` Ryan Roberts -1 siblings, 0 replies; 167+ messages in thread From: Ryan Roberts @ 2023-07-10 9:25 UTC (permalink / raw) To: Huang, Ying Cc: David Hildenbrand, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Yin Fengwei, Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm, Johannes Weiner, Alexander Zhu On 10/07/2023 10:18, Huang, Ying wrote: > Ryan Roberts <ryan.roberts@arm.com> writes: > >> On 10/07/2023 04:03, Huang, Ying wrote: >>> Ryan Roberts <ryan.roberts@arm.com> writes: >>> >>>> On 07/07/2023 15:07, David Hildenbrand wrote: >>>>> On 07.07.23 15:57, Matthew Wilcox wrote: >>>>>> On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote: >>>>>>> On 07.07.23 11:52, Ryan Roberts wrote: >>>>>>>> On 07/07/2023 09:01, Huang, Ying wrote: >>>>>>>>> Although we can use smaller page order for FLEXIBLE_THP, it's hard to >>>>>>>>> avoid internal fragmentation completely. So, I think that finally we >>>>>>>>> will need to provide a mechanism for the users to opt out, e.g., >>>>>>>>> something like "always madvise never" via >>>>>>>>> /sys/kernel/mm/transparent_hugepage/enabled. I'm not sure whether it's >>>>>>>>> a good idea to reuse the existing interface of THP. >>>>>>>> >>>>>>>> I wouldn't want to tie this to the existing interface, simply because that >>>>>>>> implies that we would want to follow the "always" and "madvise" advice too; >>>>>>>> That >>>>>>>> means that on a thp=madvise system (which is certainly the case for android and >>>>>>>> other client systems) we would have to disable large anon folios for VMAs that >>>>>>>> haven't explicitly opted in. That breaks the intention that this should be an >>>>>>>> invisible performance boost. 
I think it's important to set the policy for >>>>>>>> use of >>>>>>> >>>>>>> It will never ever be a completely invisible performance boost, just like >>>>>>> ordinary THP. >>>>>>> >>>>>>> Using the exact same existing toggle is the right thing to do. If someone >>>>>>> specify "never" or "madvise", then do exactly that. >>>>>>> >>>>>>> It might make sense to have more modes or additional toggles, but >>>>>>> "madvise=never" means no memory waste. >>>>>> >>>>>> I hate the existing mechanisms. They are an abdication of our >>>>>> responsibility, and an attempt to blame the user (be it the sysadmin >>>>>> or the programmer) of our code for using it wrongly. We should not >>>>>> replicate this mistake. >>>>> >>>>> I don't agree regarding the programmer responsibility. In some cases the >>>>> programmer really doesn't want to get more memory populated than requested -- >>>>> and knows exactly why setting MADV_NOHUGEPAGE is the right thing to do. >>>>> >>>>> Regarding the madvise=never/madvise/always (sys admin decision), memory waste >>>>> (and nailing down bugs or working around them in customer setups) have been very >>>>> good reasons to let the admin have a word. >>>>> >>>>>> >>>>>> Our code should be auto-tuning. I posted a long, detailed outline here: >>>>>> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/ >>>>>> >>>>> >>>>> Well, "auto-tuning" also should be perfect for everybody, but once reality >>>>> strikes you know it isn't. >>>>> >>>>> If people don't feel like using THP, let them have a word. The "madvise" config >>>>> option is probably more controversial. But the "always vs. never" absolutely >>>>> makes sense to me. >>>>> >>>>>>> I remember I raised it already in the past, but you *absolutely* have to >>>>>>> respect the MADV_NOHUGEPAGE flag. There is user space out there (for >>>>>>> example, userfaultfd) that doesn't want the kernel to populate any >>>>>>> additional page tables. 
So if you have to respect that already, then also >>>>>>> respect MADV_HUGEPAGE, simple. >>>>>> >>>>>> Possibly having uffd enabled on a VMA should disable using large folios, >>>>> >>>>> There are cases where we enable uffd *after* already touching memory (postcopy >>>>> live migration in QEMU being the famous example). That doesn't fly. >>>>> >>>>>> I can get behind that. But the notion that userspace knows what it's >>>>>> doing ... hahaha. Just ignore the madvise flags. Userspace doesn't >>>>>> know what it's doing. >>>>> >>>>> If user space sets MADV_NOHUGEPAGE, it exactly knows what it is doing ... in >>>>> some cases. And these include cases I care about messing with sparse VM memory :) >>>>> >>>>> I have strong opinions against populating more than required when user space set >>>>> MADV_NOHUGEPAGE. >>>> >>>> I can see your point about honouring MADV_NOHUGEPAGE, so think that it is >>>> reasonable to fallback to allocating an order-0 page in a VMA that has it set. >>>> The app has gone out of its way to explicitly set it, after all. >>>> >>>> I think the correct behaviour for the global thp controls (cmdline and sysfs) >>>> are less obvious though. I could get on board with disabling large anon folios >>>> globally when thp="never". But for other situations, I would prefer to keep >>>> large anon folios enabled (treat "madvise" as "always"), >>> >>> If we have some mechanism to auto-tune the large folios usage, for >>> example, detect the internal fragmentation and split the large folio, >>> then we can use thp="always" as default configuration. If my memory >>> were correct, this is what Johannes and Alexander is working on. >> >> Could you point me to that work? I'd like to understand what the mechanism is. >> The other half of my work aims to use arm64's pte "contiguous bit" to tell the >> HW that a span of PTEs share the same mapping and is therefore coalesced into a >> single TLB entry. 
The side effect of this, however, is that we only have a >> single access and dirty bit for the whole contpte extent. So I'd like to avoid >> any mechanism that relies on getting access/dirty at the base page granularity >> for a large folio. > > Please take a look at the THP shrinker patchset, > > https://lore.kernel.org/linux-mm/cover.1667454613.git.alexlzhu@fb.com/ Thanks! > >>> >>>> with the argument that >>>> their order is much smaller than traditional THP and therefore the internal >>>> fragmentation is significantly reduced. >>> >>> Do you have any data for this? >> >> Some; it's partly based on intuition that the smaller the allocation unit, the >> smaller the internal fragmentation. And partly on peak memory usage data I've >> collected for the benchmarks I'm running, comparing baseline-4k kernel with >> baseline-16k and baseline-64k kernels along with a 4k kernel that supports large >> anon folios (I appreciate that's not exactly what we are talking about here, and >> it's not exactly an extensive set of results!): >> >> >> Kernel Compilation with 8 Jobs: >> | kernel | peak | >> |:--------------|-------:| >> | baseline-4k | 0.0% | >> | anonfolio | 0.1% | >> | baseline-16k | 6.3% | >> | baseline-64k | 28.1% | >> >> >> Kernel Compilation with 80 Jobs: >> | kernel | peak | >> |:--------------|-------:| >> | baseline-4k | 0.0% | >> | anonfolio | 1.7% | >> | baseline-16k | 2.6% | >> | baseline-64k | 12.3% | >> > > Why is anonfolio better than baseline-64k if you always allocate 64k > anonymous folio? Because page cache uses 64k in baseline-64k? No, because the VMA boundaries are aligned to 4K and not 64K. Large Anon Folios only allocates a 64K folio if it does not breach the bounds of the VMA (and if it doesn't overlap other allocated PTEs). > > We may need to test some workloads with sparse access patterns too.
Yes, I agree that if you have a workload with a pathological memory access pattern where it writes to addresses with a stride of 64K, all contained in a single VMA, then you will end up allocating 16x the memory. This is obviously an unrealistic extreme though. > > Best Regards, > Huang, Ying > >>> >>>> I really don't want to end up with user >>>> space ever having to opt-in (with MADV_HUGEPAGE) to see the benefits of large >>>> anon folios. >>>> >>>> I still feel that it would be better for the thp and large anon folio controls >>>> to be independent though - what's the argument for tying them together? >>>> > ^ permalink raw reply [flat|nested] 167+ messages in thread
* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance 2023-07-10 9:25 ` Ryan Roberts @ 2023-07-11 0:48 ` Huang, Ying -1 siblings, 0 replies; 167+ messages in thread From: Huang, Ying @ 2023-07-11 0:48 UTC (permalink / raw) To: Ryan Roberts Cc: David Hildenbrand, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Yin Fengwei, Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm, Johannes Weiner, Alexander Zhu Ryan Roberts <ryan.roberts@arm.com> writes: > On 10/07/2023 10:18, Huang, Ying wrote: >> Ryan Roberts <ryan.roberts@arm.com> writes: >> >>> On 10/07/2023 04:03, Huang, Ying wrote: >>>> Ryan Roberts <ryan.roberts@arm.com> writes: >>>> >>>>> On 07/07/2023 15:07, David Hildenbrand wrote: >>>>>> On 07.07.23 15:57, Matthew Wilcox wrote: >>>>>>> On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote: >>>>>>>> On 07.07.23 11:52, Ryan Roberts wrote: >>>>>>>>> On 07/07/2023 09:01, Huang, Ying wrote: >>>>>>>>>> Although we can use smaller page order for FLEXIBLE_THP, it's hard to >>>>>>>>>> avoid internal fragmentation completely. So, I think that finally we >>>>>>>>>> will need to provide a mechanism for the users to opt out, e.g., >>>>>>>>>> something like "always madvise never" via >>>>>>>>>> /sys/kernel/mm/transparent_hugepage/enabled. I'm not sure whether it's >>>>>>>>>> a good idea to reuse the existing interface of THP. >>>>>>>>> >>>>>>>>> I wouldn't want to tie this to the existing interface, simply because that >>>>>>>>> implies that we would want to follow the "always" and "madvise" advice too; >>>>>>>>> That >>>>>>>>> means that on a thp=madvise system (which is certainly the case for android and >>>>>>>>> other client systems) we would have to disable large anon folios for VMAs that >>>>>>>>> haven't explicitly opted in. That breaks the intention that this should be an >>>>>>>>> invisible performance boost. 
I think it's important to set the policy for >>>>>>>>> use of >>>>>>>> >>>>>>>> It will never ever be a completely invisible performance boost, just like >>>>>>>> ordinary THP. >>>>>>>> >>>>>>>> Using the exact same existing toggle is the right thing to do. If someone >>>>>>>> specify "never" or "madvise", then do exactly that. >>>>>>>> >>>>>>>> It might make sense to have more modes or additional toggles, but >>>>>>>> "madvise=never" means no memory waste. >>>>>>> >>>>>>> I hate the existing mechanisms. They are an abdication of our >>>>>>> responsibility, and an attempt to blame the user (be it the sysadmin >>>>>>> or the programmer) of our code for using it wrongly. We should not >>>>>>> replicate this mistake. >>>>>> >>>>>> I don't agree regarding the programmer responsibility. In some cases the >>>>>> programmer really doesn't want to get more memory populated than requested -- >>>>>> and knows exactly why setting MADV_NOHUGEPAGE is the right thing to do. >>>>>> >>>>>> Regarding the madvise=never/madvise/always (sys admin decision), memory waste >>>>>> (and nailing down bugs or working around them in customer setups) have been very >>>>>> good reasons to let the admin have a word. >>>>>> >>>>>>> >>>>>>> Our code should be auto-tuning. I posted a long, detailed outline here: >>>>>>> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/ >>>>>>> >>>>>> >>>>>> Well, "auto-tuning" also should be perfect for everybody, but once reality >>>>>> strikes you know it isn't. >>>>>> >>>>>> If people don't feel like using THP, let them have a word. The "madvise" config >>>>>> option is probably more controversial. But the "always vs. never" absolutely >>>>>> makes sense to me. >>>>>> >>>>>>>> I remember I raised it already in the past, but you *absolutely* have to >>>>>>>> respect the MADV_NOHUGEPAGE flag. There is user space out there (for >>>>>>>> example, userfaultfd) that doesn't want the kernel to populate any >>>>>>>> additional page tables. 
So if you have to respect that already, then also >>>>>>>> respect MADV_HUGEPAGE, simple. >>>>>>> >>>>>>> Possibly having uffd enabled on a VMA should disable using large folios, >>>>>> >>>>>> There are cases where we enable uffd *after* already touching memory (postcopy >>>>>> live migration in QEMU being the famous example). That doesn't fly. >>>>>> >>>>>>> I can get behind that. But the notion that userspace knows what it's >>>>>>> doing ... hahaha. Just ignore the madvise flags. Userspace doesn't >>>>>>> know what it's doing. >>>>>> >>>>>> If user space sets MADV_NOHUGEPAGE, it exactly knows what it is doing ... in >>>>>> some cases. And these include cases I care about messing with sparse VM memory :) >>>>>> >>>>>> I have strong opinions against populating more than required when user space set >>>>>> MADV_NOHUGEPAGE. >>>>> >>>>> I can see your point about honouring MADV_NOHUGEPAGE, so think that it is >>>>> reasonable to fallback to allocating an order-0 page in a VMA that has it set. >>>>> The app has gone out of its way to explicitly set it, after all. >>>>> >>>>> I think the correct behaviour for the global thp controls (cmdline and sysfs) >>>>> are less obvious though. I could get on board with disabling large anon folios >>>>> globally when thp="never". But for other situations, I would prefer to keep >>>>> large anon folios enabled (treat "madvise" as "always"), >>>> >>>> If we have some mechanism to auto-tune the large folios usage, for >>>> example, detect the internal fragmentation and split the large folio, >>>> then we can use thp="always" as default configuration. If my memory >>>> were correct, this is what Johannes and Alexander is working on. >>> >>> Could you point me to that work? I'd like to understand what the mechanism is. >>> The other half of my work aims to use arm64's pte "contiguous bit" to tell the >>> HW that a span of PTEs share the same mapping and is therefore coalesced into a >>> single TLB entry. 
The side effect of this, however, is that we only have a >>> single access and dirty bit for the whole contpte extent. So I'd like to avoid >>> any mechanism that relies on getting access/dirty at the base page granularity >>> for a large folio. >> >> Please take a look at the THP shrinker patchset, >> >> https://lore.kernel.org/linux-mm/cover.1667454613.git.alexlzhu@fb.com/ > > Thanks! > >> >>>> >>>>> with the argument that >>>>> their order is much smaller than traditional THP and therefore the internal >>>>> fragmentation is significantly reduced. >>>> >>>> Do you have any data for this? >>> >>> Some; it's partly based on intuition that the smaller the allocation unit, the >>> smaller the internal fragmentation. And partly on peak memory usage data I've >>> collected for the benchmarks I'm running, comparing baseline-4k kernel with >>> baseline-16k and baseline-64k kernels along with a 4k kernel that supports large >>> anon folios (I appreciate that's not exactly what we are talking about here, and >>> it's not exactly an extensive set of results!): >>> >>> >>> Kernel Compilation with 8 Jobs: >>> | kernel | peak | >>> |:--------------|-------:| >>> | baseline-4k | 0.0% | >>> | anonfolio | 0.1% | >>> | baseline-16k | 6.3% | >>> | baseline-64k | 28.1% | >>> >>> >>> Kernel Compilation with 80 Jobs: >>> | kernel | peak | >>> |:--------------|-------:| >>> | baseline-4k | 0.0% | >>> | anonfolio | 1.7% | >>> | baseline-16k | 2.6% | >>> | baseline-64k | 12.3% | >>> >> >> Why is anonfolio better than baseline-64k if you always allocate 64k >> anonymous folio? Because page cache uses 64k in baseline-64k? > > No, because the VMA boundaries are aligned to 4K and not 64K. Large Anon Folios > only allocates a 64K folio if it does not breach the bounds of the VMA (and if > it doesn't overlap other allocated PTEs). Thanks for the explanation! We will use more memory for file cache too for baseline-64k, right?
So, you observed much more anonymous pages, but not so for file cache pages? >> >> We may need to test some workloads with sparse access patterns too. > > Yes, I agree if you have a workload with a pathalogical memory access pattern > where it writes to addresses with a stride of 64K, all contained in a single > VMA, then you will end up allocating 16x the memory. This is obviously an > unrealistic extreme though. I think that there should be some realistic workload which has sparse access patterns. Best Regards, Huang, Ying >> >>>> >>>>> I really don't want to end up with user >>>>> space ever having to opt-in (with MADV_HUGEPAGE) to see the benefits of large >>>>> anon folios. >>>>> >>>>> I still feel that it would be better for the thp and large anon folio controls >>>>> to be independent though - what's the argument for tying them together? >>>>> >> ^ permalink raw reply [flat|nested] 167+ messages in thread
* Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance
  2023-07-07 13:57           ` Matthew Wilcox
@ 2023-07-10  2:49             ` Huang, Ying
  -1 siblings, 0 replies; 167+ messages in thread
From: Huang, Ying @ 2023-07-10 2:49 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: David Hildenbrand, Ryan Roberts, Andrew Morton, Kirill A. Shutemov,
	Yin Fengwei, Yu Zhao, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm

Matthew Wilcox <willy@infradead.org> writes:

> On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote:
>> On 07.07.23 11:52, Ryan Roberts wrote:
>> > On 07/07/2023 09:01, Huang, Ying wrote:
>> > > Although we can use smaller page order for FLEXIBLE_THP, it's hard to
>> > > avoid internal fragmentation completely. So, I think that finally we
>> > > will need to provide a mechanism for the users to opt out, e.g.,
>> > > something like "always madvise never" via
>> > > /sys/kernel/mm/transparent_hugepage/enabled. I'm not sure whether it's
>> > > a good idea to reuse the existing interface of THP.
>> >
>> > I wouldn't want to tie this to the existing interface, simply because that
>> > implies that we would want to follow the "always" and "madvise" advice too; That
>> > means that on a thp=madvise system (which is certainly the case for android and
>> > other client systems) we would have to disable large anon folios for VMAs that
>> > haven't explicitly opted in. That breaks the intention that this should be an
>> > invisible performance boost. I think it's important to set the policy for use of
>>
>> It will never ever be a completely invisible performance boost, just like
>> ordinary THP.
>>
>> Using the exact same existing toggle is the right thing to do. If someone
>> specifies "never" or "madvise", then do exactly that.
>>
>> It might make sense to have more modes or additional toggles, but
>> "madvise=never" means no memory waste.
>
> I hate the existing mechanisms. They are an abdication of our
> responsibility, and an attempt to blame the user (be it the sysadmin
> or the programmer) of our code for using it wrongly. We should not
> replicate this mistake.
>
> Our code should be auto-tuning. I posted a long, detailed outline here:
> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/

Yes. Auto-tuning should be preferable to any configuration mechanism.
Something like the THP shrinker could be another way of auto-tuning.

https://lore.kernel.org/linux-mm/cover.1667454613.git.alexlzhu@fb.com/

That is, allocate the large folios on page fault, then try to detect
internal fragmentation.

>> I remember I raised it already in the past, but you *absolutely* have to
>> respect the MADV_NOHUGEPAGE flag. There is user space out there (for
>> example, userfaultfd) that doesn't want the kernel to populate any
>> additional page tables. So if you have to respect that already, then also
>> respect MADV_HUGEPAGE, simple.
>
> Possibly having uffd enabled on a VMA should disable using large folios,
> I can get behind that. But the notion that userspace knows what it's
> doing ... hahaha. Just ignore the madvise flags. Userspace doesn't
> know what it's doing.

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 167+ messages in thread
* [PATCH v2 5/5] arm64: mm: Override arch_wants_pte_order()
  2023-07-03 13:53 ` Ryan Roberts
@ 2023-07-03 13:53   ` Ryan Roberts
  -1 siblings, 0 replies; 167+ messages in thread
From: Ryan Roberts @ 2023-07-03 13:53 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei,
	David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon,
	Anshuman Khandual, Yang Shi
  Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, linux-mm

Define an arch-specific override of arch_wants_pte_order() so that when
FLEXIBLE_THP is enabled, large folios will be allocated for anonymous
memory with an order that is compatible with arm64's contpte mappings.

arch_wants_pte_order() returns an order according to the following
policy: For the unhinted case, when THP is not requested for the vma,
don't allow anything bigger than 64K. This means we don't waste too much
memory. Additionally, for 4K pages this is the contpte size, and for
16K, this is (usually) the HPA size when the uarch feature is
implemented. For the hinted case, when THP is requested for the vma,
allow the contpte size for all page size configurations; 64K for 4K, 2M
for 16K and 2M for 64K.

Additionally, the THP and NOTHP order constants are defined using
Kconfig so it is possible to override them at build time.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/Kconfig               | 12 ++++++++++++
 arch/arm64/include/asm/pgtable.h |  4 ++++
 arch/arm64/mm/mmu.c              |  8 ++++++++
 3 files changed, 24 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 343e1e1cae10..689c5bf13dc1 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -281,6 +281,18 @@ config ARM64_CONT_PMD_SHIFT
 	default 5 if ARM64_16K_PAGES
 	default 4
 
+config ARM64_PTE_ORDER_NOTHP
+	int
+	default 0 if ARM64_64K_PAGES	# 64K (1 page)
+	default 2 if ARM64_16K_PAGES	# 64K (4 pages; benefits from HPA where HW supports it)
+	default 4 if ARM64_4K_PAGES	# 64K (16 pages; eligible for contpte-mapping)
+
+config ARM64_PTE_ORDER_THP
+	int
+	default 5 if ARM64_64K_PAGES	# 2M (32 pages; eligible for contpte-mapping)
+	default 7 if ARM64_16K_PAGES	# 2M (128 pages; eligible for contpte-mapping)
+	default 4 if ARM64_4K_PAGES	# 64K (16 pages; eligible for contpte-mapping)
+
 config ARCH_MMAP_RND_BITS_MIN
 	default 14 if ARM64_64K_PAGES
 	default 16 if ARM64_16K_PAGES
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 6fd012663a01..8463d5f9f307 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1117,6 +1117,10 @@ extern pte_t ptep_modify_prot_start(struct vm_area_struct *vma,
 extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
 				    unsigned long addr, pte_t *ptep,
 				    pte_t old_pte, pte_t new_pte);
+
+#define arch_wants_pte_order arch_wants_pte_order
+extern int arch_wants_pte_order(struct vm_area_struct *vma);
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* __ASM_PGTABLE_H */
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index af6bc8403ee4..8556c4a9b507 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1481,3 +1481,11 @@ void ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr, pte
 {
 	set_pte_at(vma->vm_mm, addr, ptep, pte);
 }
+
+int arch_wants_pte_order(struct vm_area_struct *vma)
+{
+	if (hugepage_vma_check(vma, vma->vm_flags, false, true, true))
+		return CONFIG_ARM64_PTE_ORDER_THP;
+	else
+		return CONFIG_ARM64_PTE_ORDER_NOTHP;
+}
-- 
2.25.1

^ permalink raw reply related	[flat|nested] 167+ messages in thread
* Re: [PATCH v2 5/5] arm64: mm: Override arch_wants_pte_order()
  2023-07-03 13:53 ` Ryan Roberts
@ 2023-07-03 20:02   ` Yu Zhao
  -1 siblings, 0 replies; 167+ messages in thread
From: Yu Zhao @ 2023-07-03 20:02 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei,
	David Hildenbrand, Catalin Marinas, Will Deacon, Anshuman Khandual,
	Yang Shi, linux-arm-kernel, linux-kernel, linux-mm

On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Define an arch-specific override of arch_wants_pte_order() so that when
> FLEXIBLE_THP is enabled, large folios will be allocated for anonymous
> memory with an order that is compatible with arm64's contpte mappings.
>
> arch_wants_pte_order() returns an order according to the following
> policy: For the unhinted case, when THP is not requested for the vma,
> don't allow anything bigger than 64K. This means we don't waste too much
> memory. Additionally, for 4K pages this is the contpte size, and for
> 16K, this is (usually) the HPA size when the uarch feature is
> implemented. For the hinted case, when THP is requested for the vma,
> allow the contpte size for all page size configurations; 64K for 4K, 2M
> for 16K and 2M for 64K.
>
> Additionally, the THP and NOTHP order constants are defined using
> Kconfig so it is possible to override them at build time.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  arch/arm64/Kconfig               | 12 ++++++++++++
>  arch/arm64/include/asm/pgtable.h |  4 ++++
>  arch/arm64/mm/mmu.c              |  8 ++++++++
>  3 files changed, 24 insertions(+)
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 343e1e1cae10..689c5bf13dc1 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -281,6 +281,18 @@ config ARM64_CONT_PMD_SHIFT
>  	default 5 if ARM64_16K_PAGES
>  	default 4
>
> +config ARM64_PTE_ORDER_NOTHP
> +	int
> +	default 0 if ARM64_64K_PAGES	# 64K (1 page)
> +	default 2 if ARM64_16K_PAGES	# 64K (4 pages; benefits from HPA where HW supports it)
> +	default 4 if ARM64_4K_PAGES	# 64K (16 pages; eligible for contpte-mapping)
> +
> +config ARM64_PTE_ORDER_THP
> +	int
> +	default 5 if ARM64_64K_PAGES	# 2M (32 pages; eligible for contpte-mapping)
> +	default 7 if ARM64_16K_PAGES	# 2M (128 pages; eligible for contpte-mapping)
> +	default 4 if ARM64_4K_PAGES	# 64K (16 pages; eligible for contpte-mapping)
> +
>  config ARCH_MMAP_RND_BITS_MIN
>  	default 14 if ARM64_64K_PAGES
>  	default 16 if ARM64_16K_PAGES
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 6fd012663a01..8463d5f9f307 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1117,6 +1117,10 @@ extern pte_t ptep_modify_prot_start(struct vm_area_struct *vma,
>  extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
>  				    unsigned long addr, pte_t *ptep,
>  				    pte_t old_pte, pte_t new_pte);
> +
> +#define arch_wants_pte_order arch_wants_pte_order
> +extern int arch_wants_pte_order(struct vm_area_struct *vma);
> +
>  #endif /* !__ASSEMBLY__ */
>
>  #endif /* __ASM_PGTABLE_H */
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index af6bc8403ee4..8556c4a9b507 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -1481,3 +1481,11 @@ void ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr, pte
>  {
>  	set_pte_at(vma->vm_mm, addr, ptep, pte);
>  }
> +
> +int arch_wants_pte_order(struct vm_area_struct *vma)
> +{
> +	if (hugepage_vma_check(vma, vma->vm_flags, false, true, true))
> +		return CONFIG_ARM64_PTE_ORDER_THP;
> +	else
> +		return CONFIG_ARM64_PTE_ORDER_NOTHP;
> +}

I don't really like this because it's a mix of h/w preference and s/w
policy -- from my POV, it's supposed to be the former only. The policy
part should be left to core MM (arch-independent).

That being said, no objection if ARM MM people think this is really
what they want.

^ permalink raw reply	[flat|nested] 167+ messages in thread
* Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory
  2023-07-03 13:53 ` Ryan Roberts
@ 2023-07-04  2:18   ` Yu Zhao
  -1 siblings, 0 replies; 167+ messages in thread
From: Yu Zhao @ 2023-07-04 2:18 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei,
	David Hildenbrand, Catalin Marinas, Will Deacon, Anshuman Khandual,
	Yang Shi, linux-arm-kernel, linux-kernel, linux-mm

On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Hi All,
>
> This is v2 of a series to implement variable order, large folios for anonymous
> memory. The objective of this is to improve performance by allocating larger
> chunks of memory during anonymous page faults. See [1] for background.

Thanks for the quick response!

> I've significantly reworked and simplified the patch set based on comments from
> Yu Zhao (thanks for all your feedback!). I've also renamed the feature to
> VARIABLE_THP, on Yu's advice.
>
> The last patch is for arm64 to explicitly override the default
> arch_wants_pte_order() and is intended as an example. If this series is accepted
> I suggest taking the first 4 patches through the mm tree and the arm64 change
> could be handled through the arm64 tree separately. Neither has any build
> dependency on the other.
>
> The one area where I haven't followed Yu's advice is in the determination of the
> size of folio to use. It was suggested that I have a single preferred large
> order, and if it doesn't fit in the VMA (due to exceeding VMA bounds, or there
> being existing overlapping populated PTEs, etc) then fallback immediately to
> order-0. It turned out that this approach caused a performance regression in the
> Speedometer benchmark.

I suppose it's a regression against v1, not against the unpatched kernel.

> With my v1 patch, there were significant quantities of
> memory which could not be placed in the 64K bucket and were instead being
> allocated for the 32K and 16K buckets. With the proposed simplification, that
> memory ended up using the 4K bucket, so page faults increased by 2.75x compared
> to the v1 patch (although due to the 64K bucket, this number is still a bit
> lower than the baseline). So instead, I continue to calculate a folio order that
> is somewhere between the preferred order and 0. (See below for more details).

I suppose the benchmark wasn't running under memory pressure, which is
uncommon for client devices. It could easily be the other way around:
using 32/16KB shows a regression whereas order-0 shows better performance
under memory pressure.

I'm not sure we should use v1 as the baseline. The unpatched kernel sounds
more reasonable at this point. If 32/16KB is proven to be better in most
scenarios, including under memory pressure, we can reintroduce that policy.
I highly doubt this is the case: we tried a 16KB base page size on client
devices, and overall, the regressions outweigh the benefits.

> The patches are based on top of v6.4 plus Matthew Wilcox's set_ptes() series
> [2], which is a hard dependency. I have a branch at [3].

It's not clear to me why [2] is a hard dependency. It seems to me we are
getting close and I was hoping we could get into mm-unstable soon without
depending on other series...

^ permalink raw reply	[flat|nested] 167+ messages in thread
* Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory
From: Yin, Fengwei @ 2023-07-04  6:22 UTC
To: Yu Zhao, Ryan Roberts
Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei, David Hildenbrand, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm

On 7/4/2023 10:18 AM, Yu Zhao wrote:
> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> Hi All,
>>
>> This is v2 of a series to implement variable order, large folios for anonymous memory. The objective of this is to improve performance by allocating larger chunks of memory during anonymous page faults. See [1] for background.
>
> Thanks for the quick response!
>
>> I've significantly reworked and simplified the patch set based on comments from Yu Zhao (thanks for all your feedback!). I've also renamed the feature to VARIABLE_THP, on Yu's advice.
>>
>> The last patch is for arm64 to explicitly override the default arch_wants_pte_order() and is intended as an example. If this series is accepted I suggest taking the first 4 patches through the mm tree and the arm64 change could be handled through the arm64 tree separately. Neither has any build dependency on the other.
>>
>> The one area where I haven't followed Yu's advice is in the determination of the size of folio to use. It was suggested that I have a single preferred large order, and if it doesn't fit in the VMA (due to exceeding VMA bounds, or there being existing overlapping populated PTEs, etc) then fallback immediately to order-0. It turned out that this approach caused a performance regression in the Speedometer benchmark.
>
> I suppose it's regression against the v1, not the unpatched kernel.

From the performance data Ryan shared, it's against the unpatched kernel:

Speedometer 2.0:

| kernel                         | runs_per_min |
|:-------------------------------|-------------:|
| baseline-4k                    |         0.0% |
| anonfolio-lkml-v1              |         0.7% |
| anonfolio-lkml-v2-simple-order |        -0.9% |
| anonfolio-lkml-v2              |         0.5% |

What if we use 32K or 16K instead of 64K as the default anonymous folio size? I suspect this app may have its sweet spot at 32K or 16K anon folios.

Regards
Yin, Fengwei

>
>> With my v1 patch, there were significant quantities of memory which could not be placed in the 64K bucket and were instead being allocated for the 32K and 16K buckets. With the proposed simplification, that memory ended up using the 4K bucket, so page faults increased by 2.75x compared to the v1 patch (although due to the 64K bucket, this number is still a bit lower than the baseline). So instead, I continue to calculate a folio order that is somewhere between the preferred order and 0. (See below for more details).
>
> I suppose the benchmark wasn't running under memory pressure, which is uncommon for client devices. It could be easier the other way around: using 32/16KB shows regression whereas order-0 shows better performance under memory pressure.
>
> I'm not sure we should use v1 as the baseline. Unpatched kernel sounds more reasonable at this point. If 32/16KB is proven to be better in most scenarios including under memory pressure, we can reintroduce that policy. I highly doubt this is the case: we tried 16KB base page size on client devices, and overall, the regressions outweigh the benefits.
>
>> The patches are based on top of v6.4 plus Matthew Wilcox's set_ptes() series [2], which is a hard dependency. I have a branch at [3].
>
> It's not clear to me why [2] is a hard dependency.
>
> It seems to me we are getting close and I was hoping we could get into mm-unstable soon without depending on other series...
* Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory
From: Yu Zhao @ 2023-07-04  7:11 UTC
To: Yin, Fengwei, Ryan Roberts
Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, David Hildenbrand, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm

On Tue, Jul 4, 2023 at 12:22 AM Yin, Fengwei <fengwei.yin@intel.com> wrote:
>
> On 7/4/2023 10:18 AM, Yu Zhao wrote:
>> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>
>>> Hi All,
>>>
>>> This is v2 of a series to implement variable order, large folios for anonymous memory. The objective of this is to improve performance by allocating larger chunks of memory during anonymous page faults. See [1] for background.
>>
>> Thanks for the quick response!
>>
>>> I've significantly reworked and simplified the patch set based on comments from Yu Zhao (thanks for all your feedback!). I've also renamed the feature to VARIABLE_THP, on Yu's advice.
>>>
>>> The last patch is for arm64 to explicitly override the default arch_wants_pte_order() and is intended as an example. If this series is accepted I suggest taking the first 4 patches through the mm tree and the arm64 change could be handled through the arm64 tree separately. Neither has any build dependency on the other.
>>>
>>> The one area where I haven't followed Yu's advice is in the determination of the size of folio to use. It was suggested that I have a single preferred large order, and if it doesn't fit in the VMA (due to exceeding VMA bounds, or there being existing overlapping populated PTEs, etc) then fallback immediately to order-0. It turned out that this approach caused a performance regression in the Speedometer benchmark.
>>
>> I suppose it's regression against the v1, not the unpatched kernel.
> From the performance data Ryan shared, it's against unpatched kernel:
>
> Speedometer 2.0:
>
> | kernel                         | runs_per_min |
> |:-------------------------------|-------------:|
> | baseline-4k                    |         0.0% |
> | anonfolio-lkml-v1              |         0.7% |
> | anonfolio-lkml-v2-simple-order |        -0.9% |
> | anonfolio-lkml-v2              |         0.5% |

I see. Thanks.

A couple of questions:
1. Do we have a stddev?
2. Do we have a theory why it regressed? Assuming no bugs, I don't see how a real regression could happen -- falling back to order-0 isn't different from the original behavior.

Ryan, could you `perf record` and `cat /proc/vmstat` and share them?
* Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory
From: Ryan Roberts @ 2023-07-04 15:36 UTC
To: Yu Zhao, Yin, Fengwei
Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, David Hildenbrand, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm

On 04/07/2023 08:11, Yu Zhao wrote:
> On Tue, Jul 4, 2023 at 12:22 AM Yin, Fengwei <fengwei.yin@intel.com> wrote:
>>
>> On 7/4/2023 10:18 AM, Yu Zhao wrote:
>>> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> Hi All,
>>>>
>>>> This is v2 of a series to implement variable order, large folios for anonymous memory. The objective of this is to improve performance by allocating larger chunks of memory during anonymous page faults. See [1] for background.
>>>
>>> Thanks for the quick response!
>>>
>>>> I've significantly reworked and simplified the patch set based on comments from Yu Zhao (thanks for all your feedback!). I've also renamed the feature to VARIABLE_THP, on Yu's advice.
>>>>
>>>> The last patch is for arm64 to explicitly override the default arch_wants_pte_order() and is intended as an example. If this series is accepted I suggest taking the first 4 patches through the mm tree and the arm64 change could be handled through the arm64 tree separately. Neither has any build dependency on the other.
>>>>
>>>> The one area where I haven't followed Yu's advice is in the determination of the size of folio to use. It was suggested that I have a single preferred large order, and if it doesn't fit in the VMA (due to exceeding VMA bounds, or there being existing overlapping populated PTEs, etc) then fallback immediately to order-0. It turned out that this approach caused a performance regression in the Speedometer benchmark.
>>>
>>> I suppose it's regression against the v1, not the unpatched kernel.
>> From the performance data Ryan shared, it's against unpatched kernel:
>>
>> Speedometer 2.0:
>>
>> | kernel                         | runs_per_min |
>> |:-------------------------------|-------------:|
>> | baseline-4k                    |         0.0% |
>> | anonfolio-lkml-v1              |         0.7% |
>> | anonfolio-lkml-v2-simple-order |        -0.9% |
>> | anonfolio-lkml-v2              |         0.5% |
>
> I see. Thanks.
>
> A couple of questions:
> 1. Do we have a stddev?

| kernel                    | mean_abs | std_abs | mean_rel | std_rel |
|:--------------------------|---------:|--------:|---------:|--------:|
| baseline-4k               |    117.4 |     0.8 |     0.0% |    0.7% |
| anonfolio-v1              |    118.2 |     1.0 |     0.7% |    0.9% |
| anonfolio-v2-simple-order |    116.4 |     1.1 |    -0.9% |    0.9% |
| anonfolio-v2              |    118.0 |     1.2 |     0.5% |    1.0% |

This is with 3 runs per reboot across 5 reboots, with the first run after each reboot trimmed (it's always a bit slower, I assume due to a cold page cache). So 10 data points per kernel in total.

I've rerun the test multiple times and see similar results each time.

I've also run anonfolio-v2 with Kconfig FLEXIBLE_THP=disabled, and in this case I see the same performance as baseline-4k.

> 2. Do we have a theory why it regressed?

I have a woolly hypothesis: I think Chromium is doing mmap/munmap in ways that mean when we fault, order-4 is often too big to fit in the VMA, so we fall back to order-0. I guess this is happening so often for this workload that the cost of doing the checks and fallback outweighs the benefit of the memory that does end up with order-4 folios.

I've sampled the memory in each bucket (once per second) while running, and it's roughly:

64K: 25%
32K: 15%
16K: 15%
4K: 45%

32K and 16K obviously fold into the 4K bucket with anonfolio-v2-simple-order. But potentially, I suspect there is lots of mmap/munmap for the smaller sizes and the 64K contents are more static - that's just a guess though.
> Assuming no bugs, I don't see how a real regression could happen --
> falling back to order-0 isn't different from the original behavior.
> Ryan, could you `perf record` and `cat /proc/vmstat` and share them?

I can, but it will have to be a bit later in the week. I'll do some more test runs overnight so we have a larger number of runs - hopefully that might tell us that this is noise to a certain extent.

I'd still like to hear a clear technical argument for why the bin-packing approach is not the correct one!

Thanks,
Ryan
* Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory
From: Yin Fengwei @ 2023-07-04 23:52 UTC
To: Ryan Roberts, Yu Zhao
Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, David Hildenbrand, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm

On 7/4/23 23:36, Ryan Roberts wrote:
> On 04/07/2023 08:11, Yu Zhao wrote:
>> On Tue, Jul 4, 2023 at 12:22 AM Yin, Fengwei <fengwei.yin@intel.com> wrote:
>>>
>>> On 7/4/2023 10:18 AM, Yu Zhao wrote:
>>>> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>
>>>>> Hi All,
>>>>>
>>>>> This is v2 of a series to implement variable order, large folios for anonymous memory. The objective of this is to improve performance by allocating larger chunks of memory during anonymous page faults. See [1] for background.
>>>>
>>>> Thanks for the quick response!
>>>>
>>>>> I've significantly reworked and simplified the patch set based on comments from Yu Zhao (thanks for all your feedback!). I've also renamed the feature to VARIABLE_THP, on Yu's advice.
>>>>>
>>>>> The last patch is for arm64 to explicitly override the default arch_wants_pte_order() and is intended as an example. If this series is accepted I suggest taking the first 4 patches through the mm tree and the arm64 change could be handled through the arm64 tree separately. Neither has any build dependency on the other.
>>>>>
>>>>> The one area where I haven't followed Yu's advice is in the determination of the size of folio to use. It was suggested that I have a single preferred large order, and if it doesn't fit in the VMA (due to exceeding VMA bounds, or there being existing overlapping populated PTEs, etc) then fallback immediately to order-0. It turned out that this approach caused a performance regression in the Speedometer benchmark.
>>>>
>>>> I suppose it's regression against the v1, not the unpatched kernel.
>>> From the performance data Ryan shared, it's against unpatched kernel:
>>>
>>> Speedometer 2.0:
>>>
>>> | kernel                         | runs_per_min |
>>> |:-------------------------------|-------------:|
>>> | baseline-4k                    |         0.0% |
>>> | anonfolio-lkml-v1              |         0.7% |
>>> | anonfolio-lkml-v2-simple-order |        -0.9% |
>>> | anonfolio-lkml-v2              |         0.5% |
>>
>> I see. Thanks.
>>
>> A couple of questions:
>> 1. Do we have a stddev?
>
> | kernel                    | mean_abs | std_abs | mean_rel | std_rel |
> |:--------------------------|---------:|--------:|---------:|--------:|
> | baseline-4k               |    117.4 |     0.8 |     0.0% |    0.7% |
> | anonfolio-v1              |    118.2 |     1.0 |     0.7% |    0.9% |
> | anonfolio-v2-simple-order |    116.4 |     1.1 |    -0.9% |    0.9% |
> | anonfolio-v2              |    118.0 |     1.2 |     0.5% |    1.0% |
>
> This is with 3 runs per reboot across 5 reboots, with first run after reboot trimmed (it's always a bit slower, I assume due to cold page cache). So 10 data points per kernel in total.
>
> I've rerun the test multiple times and see similar results each time.
>
> I've also run anonfolio-v2 with Kconfig FLEXIBLE_THP=disabled and in this case I see the same performance as baseline-4k.
>
>> 2. Do we have a theory why it regressed?
>
> I have a woolly hypothesis; I think Chromium is doing mmap/munmap in ways that mean when we fault, order-4 is often too big to fit in the VMA. So we fallback to order-0. I guess this is happening so often for this workload that the cost of doing the checks and fallback is outweighing the benefit of the memory that does end up with order-4 folios.
>
> I've sampled the memory in each bucket (once per second) while running and its roughly:
>
> 64K: 25%
> 32K: 15%
> 16K: 15%
> 4K: 45%
>
> 32K and 16K obviously fold into the 4K bucket with anonfolio-v2-simple-order. But potentially, I suspect there is lots of mmap/unmap for the smaller sizes and the 64K contents is more static - that's just a guess though.

So this is the out-of-VMA-range case.

>
>> Assuming no bugs, I don't see how a real regression could happen -- falling back to order-0 isn't different from the original behavior. Ryan, could you `perf record` and `cat /proc/vmstat` and share them?
>
> I can, but it will have to be a bit later in the week. I'll do some more test runs overnight so we have a larger number of runs - hopefully that might tell us that this is noise to a certain extent.
>
> I'd still like to hear a clear technical argument for why the bin-packing approach is not the correct one!

My understanding of Yu's comments (Yu, correct me if I am wrong) is that we postpone this part of the change, get basic anon large folio support in first, and then discuss which approach to take. Maybe people will agree that retrying smaller orders is the right choice, or maybe another approach will be taken... For example, for this out-of-VMA-range case, a per-VMA order should be considered. We don't need to decide now that retrying is the way to go.

Regards
Yin, Fengwei

>
> Thanks,
> Ryan
* Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory
  2023-07-04 23:52 ` Yin Fengwei
@ 2023-07-05  0:21 ` Yu Zhao
  0 siblings, 0 replies; 167+ messages in thread
From: Yu Zhao @ 2023-07-05 0:21 UTC (permalink / raw)
  To: Yin Fengwei, Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, David Hildenbrand, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm

On Tue, Jul 4, 2023 at 5:53 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
>
>
>
> On 7/4/23 23:36, Ryan Roberts wrote:
> > On 04/07/2023 08:11, Yu Zhao wrote:
> >> On Tue, Jul 4, 2023 at 12:22 AM Yin, Fengwei <fengwei.yin@intel.com> wrote:
> >>>
> >>> On 7/4/2023 10:18 AM, Yu Zhao wrote:
> >>>> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>>
> >>>>> Hi All,
> >>>>>
> >>>>> This is v2 of a series to implement variable order, large folios for anonymous memory. The objective of this is to improve performance by allocating larger chunks of memory during anonymous page faults. See [1] for background.
> >>>>
> >>>> Thanks for the quick response!
> >>>>
> >>>>> I've significantly reworked and simplified the patch set based on comments from Yu Zhao (thanks for all your feedback!). I've also renamed the feature to VARIABLE_THP, on Yu's advice.
> >>>>>
> >>>>> The last patch is for arm64 to explicitly override the default arch_wants_pte_order() and is intended as an example. If this series is accepted I suggest taking the first 4 patches through the mm tree and the arm64 change could be handled through the arm64 tree separately. Neither has any build dependency on the other.
> >>>>>
> >>>>> The one area where I haven't followed Yu's advice is in the determination of the size of folio to use. It was suggested that I have a single preferred large order, and if it doesn't fit in the VMA (due to exceeding VMA bounds, or there being existing overlapping populated PTEs, etc) then fall back immediately to order-0. It turned out that this approach caused a performance regression in the Speedometer benchmark.
> >>>>
> >>>> I suppose it's a regression against the v1, not the unpatched kernel.
> >>> From the performance data Ryan shared, it's against the unpatched kernel:
> >>>
> >>> Speedometer 2.0:
> >>>
> >>> | kernel                         | runs_per_min |
> >>> |:-------------------------------|-------------:|
> >>> | baseline-4k                    |         0.0% |
> >>> | anonfolio-lkml-v1              |         0.7% |
> >>> | anonfolio-lkml-v2-simple-order |        -0.9% |
> >>> | anonfolio-lkml-v2              |         0.5% |
> >>
> >> I see. Thanks.
> >>
> >> A couple of questions:
> >> 1. Do we have a stddev?
> >
> > | kernel                    | mean_abs | std_abs | mean_rel | std_rel |
> > |:--------------------------|---------:|--------:|---------:|--------:|
> > | baseline-4k               |    117.4 |     0.8 |     0.0% |    0.7% |
> > | anonfolio-v1              |    118.2 |     1.0 |     0.7% |    0.9% |
> > | anonfolio-v2-simple-order |    116.4 |     1.1 |    -0.9% |    0.9% |
> > | anonfolio-v2              |    118.0 |     1.2 |     0.5% |    1.0% |
> >
> > This is with 3 runs per reboot across 5 reboots, with the first run after each reboot trimmed (it's always a bit slower, I assume due to cold page cache). So 10 data points per kernel in total.
> >
> > I've rerun the test multiple times and see similar results each time.
> >
> > I've also run anonfolio-v2 with Kconfig FLEXIBLE_THP=disabled, and in this case I see the same performance as baseline-4k.
> >
> >
> >> 2. Do we have a theory why it regressed?
> >
> > I have a woolly hypothesis; I think Chromium is doing mmap/munmap in ways that mean when we fault, order-4 is often too big to fit in the VMA, so we fall back to order-0. I guess this is happening so often for this workload that the cost of doing the checks and fallback outweighs the benefit of the memory that does end up in order-4 folios.
> >
> > I've sampled the memory in each bucket (once per second) while running and it's roughly:
> >
> > 64K: 25%
> > 32K: 15%
> > 16K: 15%
> >  4K: 45%
> >
> > 32K and 16K obviously fold into the 4K bucket with anonfolio-v2-simple-order.
> > But potentially, I suspect there is lots of mmap/munmap for the smaller sizes and the 64K content is more static - that's just a guess though.
> So this is like the out-of-VMA-range thing.
>
> >
> >> Assuming no bugs, I don't see how a real regression could happen -- falling back to order-0 isn't different from the original behavior. Ryan, could you `perf record` and `cat /proc/vmstat` and share them?
> >
> > I can, but it will have to be a bit later in the week. I'll do some more test runs overnight so we have a larger number of runs - hopefully that might tell us that this is noise to a certain extent.
> >
> > I'd still like to hear a clear technical argument for why the bin-packing approach is not the correct one!
> My understanding of Yu's comments (Yu, correct me if I am wrong) is that we postpone this part of the change and get basic anon large folio support in, then discuss which approach we should take. Maybe people will agree that retry is the right choice; maybe another approach will be taken...
>
> For example, for this out-of-VMA-range case, a per-VMA order should be considered. We don't need to decide now that retry is the approach to take.

I've articulated the reasons in another email. Just to summarize the most important point here: using more fallback orders makes a system reach equilibrium faster, at which point it can't allocate the order of arch_wants_pte_order() anymore. IOW, this best-fit policy can reduce the number of folios of the h/w preferred order for a system running long enough.
* Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory
  2023-07-05  0:21 ` Yu Zhao
@ 2023-07-05 10:16 ` Ryan Roberts
  -1 siblings, 0 replies; 167+ messages in thread
From: Ryan Roberts @ 2023-07-05 10:16 UTC (permalink / raw)
  To: Yu Zhao, Yin Fengwei
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, David Hildenbrand, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm

On 05/07/2023 01:21, Yu Zhao wrote:
> On Tue, Jul 4, 2023 at 5:53 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
>>
>>
>>
>> On 7/4/23 23:36, Ryan Roberts wrote:
>>> On 04/07/2023 08:11, Yu Zhao wrote:
>>>> On Tue, Jul 4, 2023 at 12:22 AM Yin, Fengwei <fengwei.yin@intel.com> wrote:
>>>>>
>>>>> On 7/4/2023 10:18 AM, Yu Zhao wrote:
>>>>>> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> This is v2 of a series to implement variable order, large folios for anonymous memory. The objective of this is to improve performance by allocating larger chunks of memory during anonymous page faults. See [1] for background.
>>>>>>
>>>>>> Thanks for the quick response!
>>>>>>
>>>>>>> I've significantly reworked and simplified the patch set based on comments from Yu Zhao (thanks for all your feedback!). I've also renamed the feature to VARIABLE_THP, on Yu's advice.
>>>>>>>
>>>>>>> The last patch is for arm64 to explicitly override the default arch_wants_pte_order() and is intended as an example. If this series is accepted I suggest taking the first 4 patches through the mm tree and the arm64 change could be handled through the arm64 tree separately. Neither has any build dependency on the other.
>>>>>>>
>>>>>>> The one area where I haven't followed Yu's advice is in the determination of the size of folio to use. It was suggested that I have a single preferred large order, and if it doesn't fit in the VMA (due to exceeding VMA bounds, or there being existing overlapping populated PTEs, etc) then fall back immediately to order-0. It turned out that this approach caused a performance regression in the Speedometer benchmark.
>>>>>>
>>>>>> I suppose it's a regression against the v1, not the unpatched kernel.
>>>>> From the performance data Ryan shared, it's against the unpatched kernel:
>>>>>
>>>>> Speedometer 2.0:
>>>>>
>>>>> | kernel                         | runs_per_min |
>>>>> |:-------------------------------|-------------:|
>>>>> | baseline-4k                    |         0.0% |
>>>>> | anonfolio-lkml-v1              |         0.7% |
>>>>> | anonfolio-lkml-v2-simple-order |        -0.9% |
>>>>> | anonfolio-lkml-v2              |         0.5% |
>>>>
>>>> I see. Thanks.
>>>>
>>>> A couple of questions:
>>>> 1. Do we have a stddev?
>>>
>>> | kernel                    | mean_abs | std_abs | mean_rel | std_rel |
>>> |:--------------------------|---------:|--------:|---------:|--------:|
>>> | baseline-4k               |    117.4 |     0.8 |     0.0% |    0.7% |
>>> | anonfolio-v1              |    118.2 |     1.0 |     0.7% |    0.9% |
>>> | anonfolio-v2-simple-order |    116.4 |     1.1 |    -0.9% |    0.9% |
>>> | anonfolio-v2              |    118.0 |     1.2 |     0.5% |    1.0% |
>>>
>>> This is with 3 runs per reboot across 5 reboots, with the first run after each reboot trimmed (it's always a bit slower, I assume due to cold page cache). So 10 data points per kernel in total.
>>>
>>> I've rerun the test multiple times and see similar results each time.
>>>
>>> I've also run anonfolio-v2 with Kconfig FLEXIBLE_THP=disabled, and in this case I see the same performance as baseline-4k.
>>>
>>>
>>>> 2. Do we have a theory why it regressed?
>>>
>>> I have a woolly hypothesis; I think Chromium is doing mmap/munmap in ways that mean when we fault, order-4 is often too big to fit in the VMA, so we fall back to order-0. I guess this is happening so often for this workload that the cost of doing the checks and fallback outweighs the benefit of the memory that does end up in order-4 folios.
>>>
>>> I've sampled the memory in each bucket (once per second) while running and it's roughly:
>>>
>>> 64K: 25%
>>> 32K: 15%
>>> 16K: 15%
>>>  4K: 45%
>>>
>>> 32K and 16K obviously fold into the 4K bucket with anonfolio-v2-simple-order.
>>> But potentially, I suspect there is lots of mmap/munmap for the smaller sizes and the 64K content is more static - that's just a guess though.
>> So this is like the out-of-VMA-range thing.
>>
>>>
>>>> Assuming no bugs, I don't see how a real regression could happen -- falling back to order-0 isn't different from the original behavior. Ryan, could you `perf record` and `cat /proc/vmstat` and share them?
>>>
>>> I can, but it will have to be a bit later in the week. I'll do some more test runs overnight so we have a larger number of runs - hopefully that might tell us that this is noise to a certain extent.
>>>
>>> I'd still like to hear a clear technical argument for why the bin-packing approach is not the correct one!
>> My understanding of Yu's comments (Yu, correct me if I am wrong) is that we postpone this part of the change and get basic anon large folio support in, then discuss which approach we should take. Maybe people will agree that retry is the right choice; maybe another approach will be taken...
>>
>> For example, for this out-of-VMA-range case, a per-VMA order should be considered. We don't need to decide now that retry is the approach to take.
>
> I've articulated the reasons in another email. Just to summarize the most important point here: using more fallback orders makes a system reach equilibrium faster, at which point it can't allocate the order of arch_wants_pte_order() anymore. IOW, this best-fit policy can reduce the number of folios of the h/w preferred order for a system running long enough.

Thanks for taking the time to write all the arguments down. I understand what you are saying. If we are considering the whole system, then we also need to think about the page cache, though; that will allocate multiple orders, so you are still going to suffer fragmentation from that user.

That said, I like the proposed patch where we have up to 3 orders that we try in order of preference: hw-preferred, PAGE_ALLOC_COSTLY_ORDER and 0. That feels like a good compromise that allows me to fulfil my objectives. I'm going to pull this together into a v3 patch set and aim to post towards the end of the week.

Are you OK for me to add a Suggested-by: for you? (submitting-patches.rst says I need your explicit permission.)

On the regression front, I've done a much bigger test run and see the regression is still present (although the mean has shifted a little). I've also built a kernel based on anonfolio-lkml-v2 but where arch_wants_pte_order() returns order-3. The aim was to test your hypothesis that 64K allocation is slow. This kernel performs even better, so I think that confirms your hypothesis:

| kernel                         | runs_per_min | runs | sessions |
|:-------------------------------|-------------:|-----:|---------:|
| baseline-4k                    |         0.0% |   75 |       15 |
| anonfolio-lkml-v1              |         1.0% |   75 |       15 |
| anonfolio-lkml-v2-simple-order |        -0.4% |   75 |       15 |
| anonfolio-lkml-v2              |         0.9% |   75 |       15 |
| anonfolio-lkml-v2-32k          |         1.4% |   10 |        5 |

Thanks,
Ryan
* Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory
  2023-07-05 10:16 ` Ryan Roberts
@ 2023-07-05 19:00 ` Yu Zhao
  -1 siblings, 0 replies; 167+ messages in thread
From: Yu Zhao @ 2023-07-05 19:00 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Yin Fengwei, Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, David Hildenbrand, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm

On Wed, Jul 5, 2023 at 4:16 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 05/07/2023 01:21, Yu Zhao wrote:
> > On Tue, Jul 4, 2023 at 5:53 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
> >>
> >>
> >>
> >> On 7/4/23 23:36, Ryan Roberts wrote:
> >>> On 04/07/2023 08:11, Yu Zhao wrote:
> >>>> On Tue, Jul 4, 2023 at 12:22 AM Yin, Fengwei <fengwei.yin@intel.com> wrote:
> >>>>>
> >>>>> On 7/4/2023 10:18 AM, Yu Zhao wrote:
> >>>>>> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>>>>
> >>>>>>> Hi All,
> >>>>>>>
> >>>>>>> This is v2 of a series to implement variable order, large folios for anonymous memory. The objective of this is to improve performance by allocating larger chunks of memory during anonymous page faults. See [1] for background.
> >>>>>>
> >>>>>> Thanks for the quick response!
> >>>>>>
> >>>>>>> I've significantly reworked and simplified the patch set based on comments from Yu Zhao (thanks for all your feedback!). I've also renamed the feature to VARIABLE_THP, on Yu's advice.
> >>>>>>>
> >>>>>>> The last patch is for arm64 to explicitly override the default arch_wants_pte_order() and is intended as an example. If this series is accepted I suggest taking the first 4 patches through the mm tree and the arm64 change could be handled through the arm64 tree separately. Neither has any build dependency on the other.
> >>>>>>>
> >>>>>>> The one area where I haven't followed Yu's advice is in the determination of the size of folio to use. It was suggested that I have a single preferred large order, and if it doesn't fit in the VMA (due to exceeding VMA bounds, or there being existing overlapping populated PTEs, etc) then fall back immediately to order-0. It turned out that this approach caused a performance regression in the Speedometer benchmark.
> >>>>>>
> >>>>>> I suppose it's a regression against the v1, not the unpatched kernel.
> >>>>> From the performance data Ryan shared, it's against the unpatched kernel:
> >>>>>
> >>>>> Speedometer 2.0:
> >>>>>
> >>>>> | kernel                         | runs_per_min |
> >>>>> |:-------------------------------|-------------:|
> >>>>> | baseline-4k                    |         0.0% |
> >>>>> | anonfolio-lkml-v1              |         0.7% |
> >>>>> | anonfolio-lkml-v2-simple-order |        -0.9% |
> >>>>> | anonfolio-lkml-v2              |         0.5% |
> >>>>
> >>>> I see. Thanks.
> >>>>
> >>>> A couple of questions:
> >>>> 1. Do we have a stddev?
> >>>
> >>> | kernel                    | mean_abs | std_abs | mean_rel | std_rel |
> >>> |:--------------------------|---------:|--------:|---------:|--------:|
> >>> | baseline-4k               |    117.4 |     0.8 |     0.0% |    0.7% |
> >>> | anonfolio-v1              |    118.2 |     1.0 |     0.7% |    0.9% |
> >>> | anonfolio-v2-simple-order |    116.4 |     1.1 |    -0.9% |    0.9% |
> >>> | anonfolio-v2              |    118.0 |     1.2 |     0.5% |    1.0% |
> >>>
> >>> This is with 3 runs per reboot across 5 reboots, with the first run after each reboot trimmed (it's always a bit slower, I assume due to cold page cache). So 10 data points per kernel in total.
> >>>
> >>> I've rerun the test multiple times and see similar results each time.
> >>>
> >>> I've also run anonfolio-v2 with Kconfig FLEXIBLE_THP=disabled, and in this case I see the same performance as baseline-4k.
> >>>
> >>>
> >>>> 2. Do we have a theory why it regressed?
> >>>
> >>> I have a woolly hypothesis; I think Chromium is doing mmap/munmap in ways that mean when we fault, order-4 is often too big to fit in the VMA, so we fall back to order-0. I guess this is happening so often for this workload that the cost of doing the checks and fallback outweighs the benefit of the memory that does end up in order-4 folios.
> >>>
> >>> I've sampled the memory in each bucket (once per second) while running and it's roughly:
> >>>
> >>> 64K: 25%
> >>> 32K: 15%
> >>> 16K: 15%
> >>>  4K: 45%
> >>>
> >>> 32K and 16K obviously fold into the 4K bucket with anonfolio-v2-simple-order.
> >>> But potentially, I suspect there is lots of mmap/munmap for the smaller sizes and the 64K content is more static - that's just a guess though.
> >> So this is like the out-of-VMA-range thing.
> >>
> >>>
> >>>> Assuming no bugs, I don't see how a real regression could happen -- falling back to order-0 isn't different from the original behavior. Ryan, could you `perf record` and `cat /proc/vmstat` and share them?
> >>>
> >>> I can, but it will have to be a bit later in the week. I'll do some more test runs overnight so we have a larger number of runs - hopefully that might tell us that this is noise to a certain extent.
> >>>
> >>> I'd still like to hear a clear technical argument for why the bin-packing approach is not the correct one!
> >> My understanding of Yu's comments (Yu, correct me if I am wrong) is that we postpone this part of the change and get basic anon large folio support in, then discuss which approach we should take. Maybe people will agree that retry is the right choice; maybe another approach will be taken...
> >>
> >> For example, for this out-of-VMA-range case, a per-VMA order should be considered. We don't need to decide now that retry is the approach to take.
> >
> > I've articulated the reasons in another email. Just to summarize the most important point here: using more fallback orders makes a system reach equilibrium faster, at which point it can't allocate the order of arch_wants_pte_order() anymore. IOW, this best-fit policy can reduce the number of folios of the h/w preferred order for a system running long enough.
>
> Thanks for taking the time to write all the arguments down. I understand what you are saying. If we are considering the whole system, then we also need to think about the page cache, though; that will allocate multiple orders, so you are still going to suffer fragmentation from that user.

1. Page cache doesn't use the best-fit policy -- it has the advantage of having readahead hit/miss numbers -- IOW, it doesn't try all orders without an estimated ROI.
2. Page cache causes far less fragmentation in my experience: clean page cache gets reclaimed first under memory pressure, and unmapped page cache is less costly to migrate. Neither is true for anon, and what makes it worse is that heavy anon users usually enable zram/zswap: allocating memory (to store compressed data) under memory pressure makes reclaim/compaction even harder.

> That said, I like the proposed patch where we have up to 3 orders that we try in order of preference: hw-preferred, PAGE_ALLOC_COSTLY_ORDER and 0. That feels like a good compromise that allows me to fulfil my objectives. I'm going to pull this together into a v3 patch set and aim to post towards the end of the week.
>
> Are you OK for me to add a Suggested-by: for you? (submitting-patches.rst says I need your explicit permission.)

Thanks for asking. No need to worry about it -- it's been a great team effort with you, Fengwei, Yang et al.

I'm attaching a single patch containing all the pieces I spelled out, implied, or forgot to mention. It doesn't depend on other series -- I just stress-tested it on top of the latest mm-unstable. Please feel free to reuse any bits you see fit. Again, no need to worry about Suggested-by.
> On the regression front, I've done a much bigger test run and see the regression > is still present (although the mean has shifted a little bit). I've also built a > kernel based on anonfolio-lkml-v2 but where arch_wants_pte_order() returns > order-3. The aim was to test your hypothesis that 64K allocation is slow. This > kernel is performing even better, so I think that confirms your hypothesis: Great, thanks for confirming. > | kernel | runs_per_min | runs | sessions | > |:-------------------------------|---------------:|-------:|-----------:| > | baseline-4k | 0.0% | 75 | 15 | > | anonfolio-lkml-v1 | 1.0% | 75 | 15 | > | anonfolio-lkml-v2-simple-order | -0.4% | 75 | 15 | > | anonfolio-lkml-v2 | 0.9% | 75 | 15 | > | anonfolio-lkml-v2-32k | 1.4% | 10 | 5 | Since we are all committed to the effort long term, the last number is good enough for the initial step to conclude. Hopefully v3 can address all pending comments and get into mm-unstable. [-- Attachment #2: large_anon.patch --] [-- Type: application/octet-stream, Size: 7081 bytes --] diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 5063b482e34f..113d35d993ce 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -313,6 +313,13 @@ static inline bool arch_has_hw_pte_young(void) } #endif +#ifndef arch_wants_pte_order +static inline int arch_wants_pte_order(void) +{ + return 0; +} +#endif + #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long address, diff --git a/mm/memory.c b/mm/memory.c index cf4ae87b1563..238f1a2ffbff 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4059,6 +4059,81 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) return ret; } +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages) +{ + int i; + + if (nr_pages == 1) + return vmf_pte_changed(vmf); + + for (i = 0; i < nr_pages; i++) { + if (!pte_none(ptep_get_lockless(vmf->pte + i))) + return true; + } + + return false; +} + +#ifdef 
CONFIG_TRANSPARENT_HUGEPAGE +static struct folio *alloc_anon_folio(struct vm_fault *vmf) +{ + int i; + gfp_t gfp; + pte_t *pte; + unsigned long addr; + struct vm_area_struct *vma = vmf->vma; + int preferred = arch_wants_pte_order() ? : PAGE_ALLOC_COSTLY_ORDER; + int orders[] = { + preferred, + preferred > PAGE_ALLOC_COSTLY_ORDER ? PAGE_ALLOC_COSTLY_ORDER : 0, + 0, + }; + + if (vmf_orig_pte_uffd_wp(vmf)) + goto fallback; + + for (i = 0; orders[i]; i++) { + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]); + if (addr >= vma->vm_start && addr + (PAGE_SIZE << orders[i]) <= vma->vm_end) + break; + } + + if (!orders[i]) + goto fallback; + + pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK); + VM_WARN_ON_ONCE(vmf->pte); + + for (; orders[i]; i++) { + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]); + vmf->pte = pte + pte_index(addr); + if (!vmf_pte_range_changed(vmf, 1 << orders[i])) + break; + } + + vmf->pte = NULL; + pte_unmap(pte); + + gfp = vma_thp_gfp_mask(vma); + + for (; orders[i]; i++) { + struct folio *folio; + + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]); + folio = vma_alloc_folio(gfp, orders[i], vma, addr, true); + if (folio) { + clear_huge_page(&folio->page, addr, 1 << orders[i]); + vmf->address = addr; + return folio; + } + } +fallback: + return vma_alloc_zeroed_movable_folio(vma, vmf->address); +} +#else +#define alloc_anon_folio(vmf) vma_alloc_zeroed_movable_folio((vmf)->vma, (vmf)->address) +#endif + /* * We enter with non-exclusive mmap_lock (to exclude vma changes, * but allow concurrent faults), and pte mapped but not yet locked. @@ -4066,6 +4141,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) */ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) { + int i = 0; + int nr_pages = 1; bool uffd_wp = vmf_orig_pte_uffd_wp(vmf); struct vm_area_struct *vma = vmf->vma; struct folio *folio; @@ -4110,10 +4187,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) /* Allocate our own private page. 
*/ if (unlikely(anon_vma_prepare(vma))) goto oom; - folio = vma_alloc_zeroed_movable_folio(vma, vmf->address); + folio = alloc_anon_folio(vmf); if (!folio) goto oom; + nr_pages = folio_nr_pages(folio); + if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL)) goto oom_free_page; folio_throttle_swaprate(folio, GFP_KERNEL); @@ -4125,17 +4204,13 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) */ __folio_mark_uptodate(folio); - entry = mk_pte(&folio->page, vma->vm_page_prot); - entry = pte_sw_mkyoung(entry); - if (vma->vm_flags & VM_WRITE) - entry = pte_mkwrite(pte_mkdirty(entry)); - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); if (!vmf->pte) goto release; - if (vmf_pte_changed(vmf)) { - update_mmu_tlb(vma, vmf->address, vmf->pte); + if (vmf_pte_range_changed(vmf, nr_pages)) { + for (i = 0; i < nr_pages; i++) + update_mmu_tlb(vma, vmf->address + PAGE_SIZE * i, vmf->pte + i); goto release; } @@ -4150,16 +4225,24 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) return handle_userfault(vmf, VM_UFFD_MISSING); } - inc_mm_counter(vma->vm_mm, MM_ANONPAGES); + folio_ref_add(folio, nr_pages - 1); + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages); folio_add_new_anon_rmap(folio, vma, vmf->address); folio_add_lru_vma(folio, vma); + + for (i = 0; i < nr_pages; i++) { + entry = mk_pte(folio_page(folio, i), vma->vm_page_prot); + entry = pte_sw_mkyoung(entry); + if (vma->vm_flags & VM_WRITE) + entry = pte_mkwrite(pte_mkdirty(entry)); setpte: - if (uffd_wp) - entry = pte_mkuffd_wp(entry); - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); + if (uffd_wp) + entry = pte_mkuffd_wp(entry); + set_pte_at(vma->vm_mm, vmf->address + PAGE_SIZE * i, vmf->pte + i, entry); - /* No need to invalidate - it was non-present before */ - update_mmu_cache(vma, vmf->address, vmf->pte); + /* No need to invalidate - it was non-present before */ + update_mmu_cache(vma, vmf->address + PAGE_SIZE * i, vmf->pte + i); + } unlock: if (vmf->pte) 
pte_unmap_unlock(vmf->pte, vmf->ptl); diff --git a/mm/rmap.c b/mm/rmap.c index 0c0d8857dfce..fb120c8717ec 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1284,25 +1284,36 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma, void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma, unsigned long address) { - int nr; + int nr = folio_nr_pages(folio); - VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma); + VM_BUG_ON_VMA(address < vma->vm_start || address + PAGE_SIZE * nr > vma->vm_end, vma); __folio_set_swapbacked(folio); - if (likely(!folio_test_pmd_mappable(folio))) { + if (!folio_test_large(folio)) { /* increment count (starts at -1) */ atomic_set(&folio->_mapcount, 0); - nr = 1; + __page_set_anon_rmap(folio, &folio->page, vma, address, 1); + } else if (!folio_test_pmd_mappable(folio)) { + int i; + + for (i = 0; i < nr; i++) { + struct page *page = folio_page(folio, i); + + /* increment count (starts at -1) */ + atomic_set(&page->_mapcount, 0); + __page_set_anon_rmap(folio, page, vma, address + PAGE_SIZE * i, 1); + } + /* increment count (starts at 0) */ + atomic_set(&folio->_nr_pages_mapped, nr); } else { /* increment count (starts at -1) */ atomic_set(&folio->_entire_mapcount, 0); atomic_set(&folio->_nr_pages_mapped, COMPOUND_MAPPED); - nr = folio_nr_pages(folio); + __page_set_anon_rmap(folio, &folio->page, vma, address, 1); __lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr); } __lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr); - __page_set_anon_rmap(folio, &folio->page, vma, address, 1); } /** @@ -1430,7 +1441,7 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma, * page of the folio is unmapped and at least one page * is still mapped. */ - if (folio_test_pmd_mappable(folio) && folio_test_anon(folio)) + if (folio_test_large(folio) && folio_test_anon(folio)) if (!compound || nr < nr_pmdmapped) deferred_split_folio(folio); } ^ permalink raw reply related [flat|nested] 167+ messages in thread
* Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory @ 2023-07-05 19:00 ` Yu Zhao 0 siblings, 0 replies; 167+ messages in thread From: Yu Zhao @ 2023-07-05 19:00 UTC (permalink / raw) To: Ryan Roberts Cc: Yin Fengwei, Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, David Hildenbrand, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm [-- Attachment #1: Type: text/plain, Size: 8903 bytes --] On Wed, Jul 5, 2023 at 4:16 AM Ryan Roberts <ryan.roberts@arm.com> wrote: > > On 05/07/2023 01:21, Yu Zhao wrote: > > On Tue, Jul 4, 2023 at 5:53 PM Yin Fengwei <fengwei.yin@intel.com> wrote: > >> > >> > >> > >> On 7/4/23 23:36, Ryan Roberts wrote: > >>> On 04/07/2023 08:11, Yu Zhao wrote: > >>>> On Tue, Jul 4, 2023 at 12:22 AM Yin, Fengwei <fengwei.yin@intel.com> wrote: > >>>>> > >>>>> On 7/4/2023 10:18 AM, Yu Zhao wrote: > >>>>>> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote: > >>>>>>> > >>>>>>> Hi All, > >>>>>>> > >>>>>>> This is v2 of a series to implement variable order, large folios for anonymous > >>>>>>> memory. The objective of this is to improve performance by allocating larger > >>>>>>> chunks of memory during anonymous page faults. See [1] for background. > >>>>>> > >>>>>> Thanks for the quick response! > >>>>>> > >>>>>>> I've significantly reworked and simplified the patch set based on comments from > >>>>>>> Yu Zhao (thanks for all your feedback!). I've also renamed the feature to > >>>>>>> VARIABLE_THP, on Yu's advice. > >>>>>>> > >>>>>>> The last patch is for arm64 to explicitly override the default > >>>>>>> arch_wants_pte_order() and is intended as an example. If this series is accepted > >>>>>>> I suggest taking the first 4 patches through the mm tree and the arm64 change > >>>>>>> could be handled through the arm64 tree separately. Neither has any build > >>>>>>> dependency on the other. 
> >>>>>>> > >>>>>>> The one area where I haven't followed Yu's advice is in the determination of the > >>>>>>> size of folio to use. It was suggested that I have a single preferred large > >>>>>>> order, and if it doesn't fit in the VMA (due to exceeding VMA bounds, or there > >>>>>>> being existing overlapping populated PTEs, etc) then fallback immediately to > >>>>>>> order-0. It turned out that this approach caused a performance regression in the > >>>>>>> Speedometer benchmark. > >>>>>> > >>>>>> I suppose it's regression against the v1, not the unpatched kernel. > >>>>> From the performance data Ryan shared, it's against unpatched kernel: > >>>>> > >>>>> Speedometer 2.0: > >>>>> > >>>>> | kernel | runs_per_min | > >>>>> |:-------------------------------|---------------:| > >>>>> | baseline-4k | 0.0% | > >>>>> | anonfolio-lkml-v1 | 0.7% | > >>>>> | anonfolio-lkml-v2-simple-order | -0.9% | > >>>>> | anonfolio-lkml-v2 | 0.5% | > >>>> > >>>> I see. Thanks. > >>>> > >>>> A couple of questions: > >>>> 1. Do we have a stddev? > >>> > >>> | kernel | mean_abs | std_abs | mean_rel | std_rel | > >>> |:------------------------- |-----------:|----------:|-----------:|----------:| > >>> | baseline-4k | 117.4 | 0.8 | 0.0% | 0.7% | > >>> | anonfolio-v1 | 118.2 | 1 | 0.7% | 0.9% | > >>> | anonfolio-v2-simple-order | 116.4 | 1.1 | -0.9% | 0.9% | > >>> | anonfolio-v2 | 118 | 1.2 | 0.5% | 1.0% | > >>> > >>> This is with 3 runs per reboot across 5 reboots, with first run after reboot > >>> trimmed (it's always a bit slower, I assume due to cold page cache). So 10 data > >>> points per kernel in total. > >>> > >>> I've rerun the test multiple times and see similar results each time. > >>> > >>> I've also run anonfolio-v2 with Kconfig FLEXIBLE_THP=disabled and in this case I > >>> see the same performance as baseline-4k. > >>> > >>> > >>>> 2. Do we have a theory why it regressed? 
> >>> > >>> I have a woolly hypothesis; I think Chromium is doing mmap/munmap in ways that > >>> mean when we fault, order-4 is often too big to fit in the VMA. So we fall back > >>> to order-0. I guess this is happening so often for this workload that the cost > >>> of doing the checks and fallback is outweighing the benefit of the memory that > >>> does end up with order-4 folios. > >>> > >>> I've sampled the memory in each bucket (once per second) while running and it's > >>> roughly: > >>> > >>> 64K: 25% > >>> 32K: 15% > >>> 16K: 15% > >>> 4K: 45% > >>> > >>> 32K and 16K obviously fold into the 4K bucket with anonfolio-v2-simple-order. > >>> But potentially, I suspect there is lots of mmap/munmap for the smaller sizes and > >>> the 64K contents are more static - that's just a guess though. > >> So this is the out-of-VMA-range case. > >> > >>> > >>>> Assuming no bugs, I don't see how a real regression could happen -- > >>>> falling back to order-0 isn't different from the original behavior. > >>>> Ryan, could you `perf record` and `cat /proc/vmstat` and share them? > >>> > >>> I can, but it will have to be a bit later in the week. I'll do some more test > >>> runs overnight so we have a larger number of runs - hopefully that might tell us > >>> whether this is noise to a certain extent. > >>> > >>> I'd still like to hear a clear technical argument for why the bin-packing > >>> approach is not the correct one! > >> My understanding of Yu's (Yu, correct me if I am wrong) comments is that we > >> postpone this part of the change and get basic anon large folio support in first. Then > >> we can discuss which approach we should take. Maybe people will agree retry is the > >> choice, maybe another approach will be taken... > >> > >> For example, for this out-of-VMA-range case, a per-VMA order should be considered. > >> We don't need to decide now that the retry approach should be taken. > > > > I've articulated the reasons in another email. 
Just to summarize the most > > important point here: > > using more fallback orders makes a system reach equilibrium faster, at > > which point it can't allocate the order of arch_wants_pte_order() > > anymore. IOW, this best-fit policy can reduce the number of folios of > > the h/w preferred order for a system running long enough. > > Thanks for taking the time to write all the arguments down. I understand what > you are saying. If we are considering the whole system, then we also need to > think about the page cache though, and that will allocate multiple orders, so > you are still going to suffer fragmentation from that user. 1. page cache doesn't use the best-fit policy -- it has the advantage of having RA hit/miss numbers -- IOW, it doesn't try all orders without an estimated ROI. 2. page cache causes far less fragmentation in my experience: clean page cache gets reclaimed first under memory pressure; unmapped page cache is less costly to migrate. Neither is true for anon, and what makes it worse is that heavy anon users usually enable zram/zswap: allocating memory (to store compressed data) under memory pressure makes reclaim/compaction even harder. > That said, I like the proposed patch where we have up to 3 orders that we > try in order of preference; hw-preferred, PAGE_ALLOC_COSTLY_ORDER and 0. That > feels like a good compromise that allows me to fulfil my objectives. I'm going > to pull this together into a v3 patch set and aim to post towards the end of the > week. > > Are you ok for me to add a Suggested-by: for you? (submitting-patches.rst says I > need your explicit permission). Thanks for asking. No need to worry about it -- it's been great teamwork with you, Fengwei, Yang et al. I'm attaching a single patch containing all the pieces I spelled out/implied/forgot to mention. It doesn't depend on other series -- I just stress-tested it on top of the latest mm-unstable. Please feel free to reuse any bits you see fit. Again, no need to worry about Suggested-by. 
> On the regression front, I've done a much bigger test run and see the regression > is still present (although the mean has shifted a little bit). I've also built a > kernel based on anonfolio-lkml-v2 but where arch_wants_pte_order() returns > order-3. The aim was to test your hypothesis that 64K allocation is slow. This > kernel is performing even better, so I think that confirms your hypothesis: Great, thanks for confirming. > | kernel | runs_per_min | runs | sessions | > |:-------------------------------|---------------:|-------:|-----------:| > | baseline-4k | 0.0% | 75 | 15 | > | anonfolio-lkml-v1 | 1.0% | 75 | 15 | > | anonfolio-lkml-v2-simple-order | -0.4% | 75 | 15 | > | anonfolio-lkml-v2 | 0.9% | 75 | 15 | > | anonfolio-lkml-v2-32k | 1.4% | 10 | 5 | Since we are all committed to the effort long term, the last number is good enough for the initial step to conclude. Hopefully v3 can address all pending comments and get into mm-unstable. [-- Attachment #2: large_anon.patch --] [-- Type: application/octet-stream, Size: 7081 bytes --] diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 5063b482e34f..113d35d993ce 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -313,6 +313,13 @@ static inline bool arch_has_hw_pte_young(void) } #endif +#ifndef arch_wants_pte_order +static inline int arch_wants_pte_order(void) +{ + return 0; +} +#endif + #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long address, diff --git a/mm/memory.c b/mm/memory.c index cf4ae87b1563..238f1a2ffbff 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4059,6 +4059,81 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) return ret; } +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages) +{ + int i; + + if (nr_pages == 1) + return vmf_pte_changed(vmf); + + for (i = 0; i < nr_pages; i++) { + if (!pte_none(ptep_get_lockless(vmf->pte + i))) + return true; + } + + return false; +} + +#ifdef 
CONFIG_TRANSPARENT_HUGEPAGE +static struct folio *alloc_anon_folio(struct vm_fault *vmf) +{ + int i; + gfp_t gfp; + pte_t *pte; + unsigned long addr; + struct vm_area_struct *vma = vmf->vma; + int preferred = arch_wants_pte_order() ? : PAGE_ALLOC_COSTLY_ORDER; + int orders[] = { + preferred, + preferred > PAGE_ALLOC_COSTLY_ORDER ? PAGE_ALLOC_COSTLY_ORDER : 0, + 0, + }; + + if (vmf_orig_pte_uffd_wp(vmf)) + goto fallback; + + for (i = 0; orders[i]; i++) { + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]); + if (addr >= vma->vm_start && addr + (PAGE_SIZE << orders[i]) <= vma->vm_end) + break; + } + + if (!orders[i]) + goto fallback; + + pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK); + VM_WARN_ON_ONCE(vmf->pte); + + for (; orders[i]; i++) { + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]); + vmf->pte = pte + pte_index(addr); + if (!vmf_pte_range_changed(vmf, 1 << orders[i])) + break; + } + + vmf->pte = NULL; + pte_unmap(pte); + + gfp = vma_thp_gfp_mask(vma); + + for (; orders[i]; i++) { + struct folio *folio; + + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]); + folio = vma_alloc_folio(gfp, orders[i], vma, addr, true); + if (folio) { + clear_huge_page(&folio->page, addr, 1 << orders[i]); + vmf->address = addr; + return folio; + } + } +fallback: + return vma_alloc_zeroed_movable_folio(vma, vmf->address); +} +#else +#define alloc_anon_folio(vmf) vma_alloc_zeroed_movable_folio((vmf)->vma, (vmf)->address) +#endif + /* * We enter with non-exclusive mmap_lock (to exclude vma changes, * but allow concurrent faults), and pte mapped but not yet locked. @@ -4066,6 +4141,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) */ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) { + int i = 0; + int nr_pages = 1; bool uffd_wp = vmf_orig_pte_uffd_wp(vmf); struct vm_area_struct *vma = vmf->vma; struct folio *folio; @@ -4110,10 +4187,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) /* Allocate our own private page. 
*/ if (unlikely(anon_vma_prepare(vma))) goto oom; - folio = vma_alloc_zeroed_movable_folio(vma, vmf->address); + folio = alloc_anon_folio(vmf); if (!folio) goto oom; + nr_pages = folio_nr_pages(folio); + if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL)) goto oom_free_page; folio_throttle_swaprate(folio, GFP_KERNEL); @@ -4125,17 +4204,13 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) */ __folio_mark_uptodate(folio); - entry = mk_pte(&folio->page, vma->vm_page_prot); - entry = pte_sw_mkyoung(entry); - if (vma->vm_flags & VM_WRITE) - entry = pte_mkwrite(pte_mkdirty(entry)); - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); if (!vmf->pte) goto release; - if (vmf_pte_changed(vmf)) { - update_mmu_tlb(vma, vmf->address, vmf->pte); + if (vmf_pte_range_changed(vmf, nr_pages)) { + for (i = 0; i < nr_pages; i++) + update_mmu_tlb(vma, vmf->address + PAGE_SIZE * i, vmf->pte + i); goto release; } @@ -4150,16 +4225,24 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) return handle_userfault(vmf, VM_UFFD_MISSING); } - inc_mm_counter(vma->vm_mm, MM_ANONPAGES); + folio_ref_add(folio, nr_pages - 1); + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages); folio_add_new_anon_rmap(folio, vma, vmf->address); folio_add_lru_vma(folio, vma); + + for (i = 0; i < nr_pages; i++) { + entry = mk_pte(folio_page(folio, i), vma->vm_page_prot); + entry = pte_sw_mkyoung(entry); + if (vma->vm_flags & VM_WRITE) + entry = pte_mkwrite(pte_mkdirty(entry)); setpte: - if (uffd_wp) - entry = pte_mkuffd_wp(entry); - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); + if (uffd_wp) + entry = pte_mkuffd_wp(entry); + set_pte_at(vma->vm_mm, vmf->address + PAGE_SIZE * i, vmf->pte + i, entry); - /* No need to invalidate - it was non-present before */ - update_mmu_cache(vma, vmf->address, vmf->pte); + /* No need to invalidate - it was non-present before */ + update_mmu_cache(vma, vmf->address + PAGE_SIZE * i, vmf->pte + i); + } unlock: if (vmf->pte) 
pte_unmap_unlock(vmf->pte, vmf->ptl); diff --git a/mm/rmap.c b/mm/rmap.c index 0c0d8857dfce..fb120c8717ec 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1284,25 +1284,36 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma, void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma, unsigned long address) { - int nr; + int nr = folio_nr_pages(folio); - VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma); + VM_BUG_ON_VMA(address < vma->vm_start || address + PAGE_SIZE * nr > vma->vm_end, vma); __folio_set_swapbacked(folio); - if (likely(!folio_test_pmd_mappable(folio))) { + if (!folio_test_large(folio)) { /* increment count (starts at -1) */ atomic_set(&folio->_mapcount, 0); - nr = 1; + __page_set_anon_rmap(folio, &folio->page, vma, address, 1); + } else if (!folio_test_pmd_mappable(folio)) { + int i; + + for (i = 0; i < nr; i++) { + struct page *page = folio_page(folio, i); + + /* increment count (starts at -1) */ + atomic_set(&page->_mapcount, 0); + __page_set_anon_rmap(folio, page, vma, address + PAGE_SIZE * i, 1); + } + /* increment count (starts at 0) */ + atomic_set(&folio->_nr_pages_mapped, nr); } else { /* increment count (starts at -1) */ atomic_set(&folio->_entire_mapcount, 0); atomic_set(&folio->_nr_pages_mapped, COMPOUND_MAPPED); - nr = folio_nr_pages(folio); + __page_set_anon_rmap(folio, &folio->page, vma, address, 1); __lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr); } __lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr); - __page_set_anon_rmap(folio, &folio->page, vma, address, 1); } /** @@ -1430,7 +1441,7 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma, * page of the folio is unmapped and at least one page * is still mapped. 
*/ - if (folio_test_pmd_mappable(folio) && folio_test_anon(folio)) + if (folio_test_large(folio) && folio_test_anon(folio)) if (!compound || nr < nr_pmdmapped) deferred_split_folio(folio); } [-- Attachment #3: Type: text/plain, Size: 176 bytes --] _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply related [flat|nested] 167+ messages in thread
* Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory 2023-07-03 13:53 ` Ryan Roberts @ 2023-07-05 19:38 ` David Hildenbrand -1 siblings, 0 replies; 167+ messages in thread From: David Hildenbrand @ 2023-07-05 19:38 UTC (permalink / raw) To: Ryan Roberts, Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei, Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi Cc: linux-arm-kernel, linux-kernel, linux-mm On 03.07.23 15:53, Ryan Roberts wrote: > Hi All, > > This is v2 of a series to implement variable order, large folios for anonymous > memory. The objective of this is to improve performance by allocating larger > chunks of memory during anonymous page faults. See [1] for background. > > I've significantly reworked and simplified the patch set based on comments from > Yu Zhao (thanks for all your feedback!). I've also renamed the feature to > VARIABLE_THP, on Yu's advice. > > The last patch is for arm64 to explicitly override the default > arch_wants_pte_order() and is intended as an example. If this series is accepted > I suggest taking the first 4 patches through the mm tree and the arm64 change > could be handled through the arm64 tree separately. Neither has any build > dependency on the other. > > The one area where I haven't followed Yu's advice is in the determination of the > size of folio to use. It was suggested that I have a single preferred large > order, and if it doesn't fit in the VMA (due to exceeding VMA bounds, or there > being existing overlapping populated PTEs, etc) then fallback immediately to > order-0. It turned out that this approach caused a performance regression in the > Speedometer benchmark. With my v1 patch, there were significant quantities of > memory which could not be placed in the 64K bucket and were instead being > allocated for the 32K and 16K buckets. 
With the proposed simplification, that > memory ended up using the 4K bucket, so page faults increased by 2.75x compared > to the v1 patch (although due to the 64K bucket, this number is still a bit > lower than the baseline). So instead, I continue to calculate a folio order that > is somewhere between the preferred order and 0. (See below for more details). > > The patches are based on top of v6.4 plus Matthew Wilcox's set_ptes() series > [2], which is a hard dependency. I have a branch at [3]. > > > Changes since v1 [1] > -------------------- > > - removed changes to arch-dependent vma_alloc_zeroed_movable_folio() > - replaced with arch-independent alloc_anon_folio() > - follows THP allocation approach > - no longer retry with intermediate orders if allocation fails > - fallback directly to order-0 > - remove folio_add_new_anon_rmap_range() patch > - instead add its new functionality to folio_add_new_anon_rmap() > - remove batch-zap pte mappings optimization patch > - remove enabler folio_remove_rmap_range() patch too > - These offer real perf improvement so will submit separately > - simplify Kconfig > - single FLEXIBLE_THP option, which is independent of arch > - depends on TRANSPARENT_HUGEPAGE > - when enabled default to max anon folio size of 64K unless arch > explicitly overrides > - simplify changes to do_anonymous_page(): > - no more retry loop > > > Performance > ----------- > > Below results show 3 benchmarks; kernel compilation with 8 jobs, kernel > compilation with 80 jobs, and speedometer 2.0 (a javascript benchmark running in > Chromium). All cases are running on Ampere Altra with 1 NUMA node enabled, > Ubuntu 22.04 and XFS filesystem. Each benchmark is repeated 15 times over 5 > reboots and averaged. > > 'anonfolio-lkml-v1' is the v1 patchset at [1]. 'anonfolio-lkml-v2' is this v2 > patchset. 
'anonfolio-lkml-v2-simple-order' is anonfolio-lkml-v2 but with the > order selection simplification that Yu Zhao suggested - I'm trying to justify > here why I did not follow the advice. > > > Kernel compilation with 8 jobs: > > | kernel | real-time | kern-time | user-time | > |:-------------------------------|------------:|------------:|------------:| > | baseline-4k | 0.0% | 0.0% | 0.0% | > | anonfolio-lkml-v1 | -5.3% | -42.9% | -0.6% | > | anonfolio-lkml-v2-simple-order | -4.4% | -36.5% | -0.4% | > | anonfolio-lkml-v2 | -4.8% | -38.6% | -0.6% | > > We can see that the simple-order approach is responsible for a regression of > 0.4%. > > > Kernel compilation with 80 jobs: > > | kernel | real-time | kern-time | user-time | > |:-------------------------------|------------:|------------:|------------:| > | baseline-4k | 0.0% | 0.0% | 0.0% | > | anonfolio-lkml-v1 | -4.6% | -45.7% | 1.4% | > | anonfolio-lkml-v2-simple-order | -4.7% | -40.2% | -0.1% | > | anonfolio-lkml-v2 | -5.0% | -42.6% | -0.3% | > > simple-order costs 0.3 % here. v2 is actually performing higher than v1 due to > fixing the v1 regression on user-time. > > > Speedometer 2.0: > > | kernel | runs_per_min | > |:-------------------------------|---------------:| > | baseline-4k | 0.0% | > | anonfolio-lkml-v1 | 0.7% | > | anonfolio-lkml-v2-simple-order | -0.9% | > | anonfolio-lkml-v2 | 0.5% | > > simple-order regresses performance by 0.9% vs the baseline, for a total negative > swing of 1.6% vs v1. This is fixed by keeping the more complex order selection > mechanism from v1. > > > The remaining (kernel time) performance gap between v1 and v2 for the above > benchmarks is due to the removal of the "batch zap" patch in v2. Adding that > back in gives us the performance back. I intend to submit that as a separate > series once this series is accepted. 
> > > [1] https://lore.kernel.org/linux-mm/20230626171430.3167004-1-ryan.roberts@arm.com/ > [2] https://lore.kernel.org/linux-mm/20230315051444.3229621-1-willy@infradead.org/ > [3] https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/anonfolio-lkml_v2 > > Thanks, > Ryan Hi Ryan, is page migration already working as expected (what about page compaction?), and do we handle migration -ENOMEM when allocating a target page: do we split and fall back to 4K page migration? -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 167+ messages in thread
* Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory
From: Ryan Roberts @ 2023-07-06 8:02 UTC
To: David Hildenbrand, Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei, Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi
Cc: linux-arm-kernel, linux-kernel, linux-mm

On 05/07/2023 20:38, David Hildenbrand wrote:
> On 03.07.23 15:53, Ryan Roberts wrote:
>> Hi All,
>>
>> This is v2 of a series to implement variable order, large folios for anonymous
>> memory. The objective of this is to improve performance by allocating larger
>> chunks of memory during anonymous page faults. See [1] for background.
>>
[...]
>> Thanks,
>> Ryan
>
> Hi Ryan,
>
> is page migration already working as expected (what about page compaction?), and
> do we handle migration -ENOMEM when allocating a target page: do we split and
> fall back to 4k page migration?
>

Hi David, All,

This series aims to be the bare minimum to demonstrate allocation of large anon folios. As such, there is a laundry list of things that need to be done for this feature to play nicely with other features. My preferred route is to merge this with its Kconfig defaulted to disabled, and its Kconfig description clearly shouting that it's EXPERIMENTAL, with an explanation of why (similar to READ_ONLY_THP_FOR_FS).

That said, I've put together a table of the items that I'm aware of that need attention. It would be great if people can review it and add any missing items. Then we can hopefully parallelize the implementation work. David, I don't think the items you raised are covered - would you mind providing a bit more detail so I can add them to the list? (or just add them to the list yourself, if you prefer).

---

- item:
    mlock

  description: >-
    Large, pte-mapped folios are ignored when mlock is requested. Code comment
    for mlock_vma_folio() says "...filter out pte mappings of THPs, which
    cannot be consistently counted: a pte mapping of the THP head cannot be
    distinguished by the page alone."

  location:
    - mlock_pte_range()
    - mlock_vma_folio()

  assignee:
    Yin, Fengwei


- item:
    numa balancing

  description: >-
    Large, pte-mapped folios are ignored by numa-balancing code. Commit
    comment (e81c480): "We're going to have THP mapped with PTEs. It will
    confuse numabalancing. Let's skip them for now."

  location:
    - do_numa_page()

  assignee:
    <none>


- item:
    madvise

  description: >-
    MADV_COLD, MADV_PAGEOUT, MADV_FREE: For large folios, code assumes
    exclusive only if mapcount==1, else skips remainder of operation. For
    large, pte-mapped folios, exclusive folios can have mapcount up to
    nr_pages and still be exclusive. Even better; don't split the folio if it
    fits entirely within the range? Discussion at
    https://lore.kernel.org/linux-mm/6cec6f68-248e-63b4-5615-9e0f3f819a0a@redhat.com/
    talks about changing folio mapcounting - may help determine if exclusive
    without pgtable scan?

  location:
    - madvise_cold_or_pageout_pte_range()
    - madvise_free_pte_range()

  assignee:
    <none>


- item:
    shrink_folio_list

  description: >-
    Raised by Yu Zhao; I can't see the problem in the code - need
    clarification.

  location:
    - shrink_folio_list()

  assignee:
    <none>


- item:
    compaction

  description: >-
    Raised at LSFMM: Compaction skips non-order-0 pages. Already a problem for
    page-cache pages today. Is my understanding correct?

  location:
    - <where?>

  assignee:
    <none>

---

Thanks,
Ryan
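The madvise item above can be reduced to a toy model (userspace C, all names hypothetical): the existing order-0 heuristic treats any mapcount above 1 as shared, while a large pte-mapped folio mapped by exactly one process legitimately accumulates one mapcount per PTE, up to nr_pages.

```c
#include <assert.h>
#include <stdbool.h>

/* Today's order-0 heuristic: exclusive iff mapped exactly once. */
static bool exclusive_order0(long mapcount)
{
    return mapcount == 1;
}

/* For a large pte-mapped folio, each of its nr_pages PTEs adds one to the
 * total mapcount, so an exclusively-mapped folio can reach nr_pages. A real
 * check would still have to rule out partial sharing; this only shows why
 * "mapcount == 1" is too strict for large folios. */
static bool maybe_exclusive_large(long mapcount, long nr_pages)
{
    return mapcount >= 1 && mapcount <= nr_pages;
}
```

Under the order-0 rule, a fully and exclusively pte-mapped 16-page folio (mapcount 16) is wrongly classified as shared, which is exactly why the operation gets skipped today.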
* Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory
From: David Hildenbrand @ 2023-07-07 11:40 UTC
To: Ryan Roberts, Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yin Fengwei, Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi
Cc: linux-arm-kernel, linux-kernel, linux-mm

On 06.07.23 10:02, Ryan Roberts wrote:
> On 05/07/2023 20:38, David Hildenbrand wrote:
>> On 03.07.23 15:53, Ryan Roberts wrote:
>>> Hi All,
>>>
>>> This is v2 of a series to implement variable order, large folios for anonymous
>>> memory. The objective of this is to improve performance by allocating larger
>>> chunks of memory during anonymous page faults. See [1] for background.
>>>
>
> [...]
>
>>> Thanks,
>>> Ryan
>>
>> Hi Ryan,
>>
>> is page migration already working as expected (what about page compaction?), and
>> do we handle migration -ENOMEM when allocating a target page: do we split and
>> fall back to 4k page migration?
>>
>
> Hi David, All,

Hi Ryan,

thanks a lot for the list. But can you comment on the page migration part (IOW did you try it already)?

For example, memory hotunplug, CMA, MCE handling, and compaction all rely on page migration of something that was allocated using GFP_MOVABLE to actually work. Compaction seems to skip any higher-order folios, but the question is whether the underlying migration itself works.

If it already works: great! If not, this really has to be tackled early, because otherwise we'll be breaking the GFP_MOVABLE semantics.

> This series aims to be the bare minimum to demonstrate allocation of large anon
> folios. As such, there is a laundry list of things that need to be done for this
> feature to play nicely with other features.
> My preferred route is to merge this with its Kconfig defaulted to disabled,
> and its Kconfig description clearly shouting that it's EXPERIMENTAL with an
> explanation of why (similar to READ_ONLY_THP_FOR_FS).

As long as we are not sure about the user space control, and as long as basic functionality is not working (for example, page migration), I would tend not to merge this early just for the sake of it. But yes, something like mlock can eventually be tackled later: as long as there is a runtime interface to disable it ;)

> That said, I've put together a table of the items that I'm aware of that need
> attention. It would be great if people can review and add any missing items.
> Then we can hopefully parallelize the implementation work. David, I don't think
> the items you raised are covered - would you mind providing a bit more detail so
> I can add them to the list? (or just add them to the list yourself, if you prefer).
>
> ---
>
> - item:
>     mlock
>
>   description: >-
>     Large, pte-mapped folios are ignored when mlock is requested. Code comment
>     for mlock_vma_folio() says "...filter out pte mappings of THPs, which
>     cannot be consistently counted: a pte mapping of the THP head cannot be
>     distinguished by the page alone."
>
>   location:
>     - mlock_pte_range()
>     - mlock_vma_folio()
>
>   assignee:
>     Yin, Fengwei
>
>
> - item:
>     numa balancing
>
>   description: >-
>     Large, pte-mapped folios are ignored by numa-balancing code. Commit
>     comment (e81c480): "We're going to have THP mapped with PTEs. It will
>     confuse numabalancing. Let's skip them for now."
>
>   location:
>     - do_numa_page()
>
>   assignee:
>     <none>
>
>
> - item:
>     madvise
>
>   description: >-
>     MADV_COLD, MADV_PAGEOUT, MADV_FREE: For large folios, code assumes
>     exclusive only if mapcount==1, else skips remainder of operation. For
>     large, pte-mapped folios, exclusive folios can have mapcount up to
>     nr_pages and still be exclusive. Even better; don't split the folio if it
>     fits entirely within the range?
>     Discussion at
>     https://lore.kernel.org/linux-mm/6cec6f68-248e-63b4-5615-9e0f3f819a0a@redhat.com/
>     talks about changing folio mapcounting - may help determine if exclusive
>     without pgtable scan?
>
>   location:
>     - madvise_cold_or_pageout_pte_range()
>     - madvise_free_pte_range()
>
>   assignee:
>     <none>
>
>
> - item:
>     shrink_folio_list
>
>   description: >-
>     Raised by Yu Zhao; I can't see the problem in the code - need
>     clarification.
>
>   location:
>     - shrink_folio_list()
>
>   assignee:
>     <none>
>
>
> - item:
>     compaction
>
>   description: >-
>     Raised at LSFMM: Compaction skips non-order-0 pages. Already a problem for
>     page-cache pages today. Is my understanding correct?
>
>   location:
>     - <where?>
>
>   assignee:
>     <none>

I'm still thinking about the whole mapcount thingy (and I burned way too much time on that yesterday), which is a big item for such a list and affects some of these items. A pagetable scan is pretty much irrelevant for order-2 pages. But once we're talking about higher orders, we really don't want to do that. I'm preparing a writeup with users and challenges.

Is swapping working as expected? zswap?

--
Cheers,

David / dhildenb
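David's point about pagetable scans can be made concrete with a sketch (hypothetical representation, userspace C): exclusivity can be decided by scanning the 2^order PTEs covering the folio and comparing the hits against the folio's total mapcount. That walk is cheap at order-2 (4 PTEs) and unpleasant at order-9 (512 PTEs) and beyond.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: pte_folio_id[i] identifies the folio mapped by PTE i
 * of the range covering the folio (nr = 2^order entries). If every one of
 * the folio's total mappings is accounted for within this range, no other
 * process maps it. */
static bool folio_exclusive_by_scan(const int *pte_folio_id, long nr,
                                    int folio_id, long total_mapcount)
{
    long hits = 0;

    for (long i = 0; i < nr; i++)   /* O(2^order) walk */
        if (pte_folio_id[i] == folio_id)
            hits++;
    return hits == total_mapcount;
}
```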
* Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory
From: Matthew Wilcox @ 2023-07-07 13:12 UTC
To: David Hildenbrand
Cc: Ryan Roberts, Andrew Morton, Kirill A. Shutemov, Yin Fengwei, Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm

On Fri, Jul 07, 2023 at 01:40:53PM +0200, David Hildenbrand wrote:
> On 06.07.23 10:02, Ryan Roberts wrote:
> But can you comment on the page migration part (IOW did you try it already)?
>
> For example, memory hotunplug, CMA, MCE handling, compaction all rely on
> page migration of something that was allocated using GFP_MOVABLE to actually
> work.
>
> Compaction seems to skip any higher-order folios, but the question is if the
> underlying migration itself works.
>
> If it already works: great! If not, this really has to be tackled early,
> because otherwise we'll be breaking the GFP_MOVABLE semantics.

I have looked at this a bit. _Migration_ should be fine. _Compaction_ is not.

If you look at a function like folio_migrate_mapping(), it all seems appropriately folio-ised. There might be something in there that is slightly wrong, but that would just be a bug to fix, not a huge architectural problem.

The problem comes in the callers of migrate_pages(). They pass a new_folio_t callback. alloc_migration_target() is the usual one passed, and as far as I can tell it is fine. I've seen no problems reported with it.

compaction_alloc() is a disaster, and I don't know how to fix it. The compaction code has its own allocator, which is populated with order-0 folios. How it populates that freelist is awful ... see split_map_pages().

> Is swapping working as expected? zswap?

Suboptimally. Swap will split folios in order to swap them. Somebody needs to fix that, but it should work.
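David's earlier question about handling -ENOMEM when allocating a migration target suggests a split-and-retry policy layered on the new_folio_t callbacks Matthew describes. A hedged userspace sketch of that policy (the allocator model and all names here are invented, not kernel API):

```c
#include <assert.h>

/* Hypothetical allocator model: allocating a 2^order target succeeds only
 * while order <= max_avail_order, otherwise it "returns -ENOMEM" (-1). */
static int alloc_target(int order, int max_avail_order)
{
    return order <= max_avail_order ? order : -1;
}

/* Try a target folio of the source's order; on allocation failure, split
 * the source and migrate its order-0 pages instead. Returns the order
 * actually used, or -1 if even order-0 cannot be allocated. */
static int migrate_with_fallback(int src_order, int max_avail_order)
{
    int got = alloc_target(src_order, max_avail_order);

    if (got >= 0)
        return got;
    return alloc_target(0, max_avail_order); /* split -> 4k migration */
}
```

The point of the sketch is only the policy shape: a large-folio migration never has to fail outright just because a same-order target is unavailable.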
* Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory
From: David Hildenbrand @ 2023-07-07 13:24 UTC
To: Matthew Wilcox
Cc: Ryan Roberts, Andrew Morton, Kirill A. Shutemov, Yin Fengwei, Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm

On 07.07.23 15:12, Matthew Wilcox wrote:
> On Fri, Jul 07, 2023 at 01:40:53PM +0200, David Hildenbrand wrote:
>> On 06.07.23 10:02, Ryan Roberts wrote:
>> But can you comment on the page migration part (IOW did you try it already)?
>>
>> For example, memory hotunplug, CMA, MCE handling, compaction all rely on
>> page migration of something that was allocated using GFP_MOVABLE to actually
>> work.
>>
>> Compaction seems to skip any higher-order folios, but the question is if the
>> underlying migration itself works.
>>
>> If it already works: great! If not, this really has to be tackled early,
>> because otherwise we'll be breaking the GFP_MOVABLE semantics.
>
> I have looked at this a bit. _Migration_ should be fine. _Compaction_
> is not.

Thanks! Very nice if at least ordinary migration works.

> If you look at a function like folio_migrate_mapping(), it all seems
> appropriately folio-ised. There might be something in there that is
> slightly wrong, but that would just be a bug to fix, not a huge
> architectural problem.
>
> The problem comes in the callers of migrate_pages(). They pass a
> new_folio_t callback. alloc_migration_target() is the usual one passed
> and as far as I can tell is fine. I've seen no problems reported with it.
>
> compaction_alloc() is a disaster, and I don't know how to fix it.
> The compaction code has its own allocator which is populated with order-0
> folios. How it populates that freelist is awful ... see split_map_pages()

Yeah, all that code was written under the assumption that we're moving order-0 pages (which is what the anon+pagecache pages are). From what I recall, we're allocating order-0 pages from the high memory addresses, so we can migrate from low memory addresses, effectively freeing up low memory addresses and filling high memory addresses. Adjusting that will be ... interesting.

Instead of allocating order-0 pages from high addresses, we might want to allocate "as large as possible" ("grab what we can") from high addresses and then have our own kind of buddy for allocating from that pool a compaction destination page, depending on our source page. Nasty. What should always work is the split->migrate. But that's definitely not what we want in many cases.

>> Is swapping working as expected? zswap?
>
> Suboptimally. Swap will split folios in order to swap them. Somebody
> needs to fix that, but it should work.

Good! It would be great to have some kind of feature matrix that tells us what works perfectly, sub-optimally, barely, or not at all (and what has not been tested). Maybe (likely!) we'll also find things that are sub-optimal for ordinary THP (like swapping; I'm not even sure about that one).

I suspect that KSM should work mostly fine with flexible-thp. When deduplicating, we'll simply split the compound page and proceed as expected. But might be worth testing as well.

--
Cheers,

David / dhildenb
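David's "grab what we can, then run our own kind of buddy" idea for compaction destinations can be sketched as a per-order free pool with buddy-style splitting. This is purely illustrative (no such structure exists in the compaction code today, and all names are hypothetical):

```c
#include <assert.h>

#define POOL_MAX_ORDER 4

/* Hypothetical pool of free compaction destination blocks, counted per order. */
struct target_pool {
    int free[POOL_MAX_ORDER + 1];
};

/* Serve a 2^order destination: take an exact match if one is free, else
 * split the smallest larger free block, returning one buddy half to each
 * lower level. Returns 0 on success, -1 if nothing large enough remains. */
static int pool_take(struct target_pool *p, int order)
{
    int o = order;

    while (o <= POOL_MAX_ORDER && p->free[o] == 0)
        o++;
    if (o > POOL_MAX_ORDER)
        return -1;
    p->free[o]--;
    while (o > order) {
        o--;
        p->free[o]++;   /* the unused buddy half stays free at this order */
    }
    return 0;
}
```

With such a pool, a large source folio could be matched to a same-order destination when one is available, instead of always being handed order-0 pages from the split freelist.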
* Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory 2023-07-07 13:24 ` David Hildenbrand @ 2023-07-10 10:07 ` Ryan Roberts -1 siblings, 0 replies; 167+ messages in thread From: Ryan Roberts @ 2023-07-10 10:07 UTC (permalink / raw) To: David Hildenbrand, Matthew Wilcox Cc: Andrew Morton, Kirill A. Shutemov, Yin Fengwei, Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm On 07/07/2023 14:24, David Hildenbrand wrote: > On 07.07.23 15:12, Matthew Wilcox wrote: >> On Fri, Jul 07, 2023 at 01:40:53PM +0200, David Hildenbrand wrote: >>> On 06.07.23 10:02, Ryan Roberts wrote: >>> But can you comment on the page migration part (IOW did you try it already)? >>> >>> For example, memory hotunplug, CMA, MCE handling, compaction all rely on >>> page migration of something that was allocated using GFP_MOVABLE to actually >>> work. >>> >>> Compaction seems to skip any higher-order folios, but the question is if the >>> udnerlying migration itself works. >>> >>> If it already works: great! If not, this really has to be tackled early, >>> because otherwise we'll be breaking the GFP_MOVABLE semantics. >> >> I have looked at this a bit. _Migration_ should be fine. _Compaction_ >> is not. > > Thanks! Very nice if at least ordinary migration works. That's good to hear - I hadn't personally investigated. > >> >> If you look at a function like folio_migrate_mapping(), it all seems >> appropriately folio-ised. There might be something in there that is >> slightly wrong, but that would just be a bug to fix, not a huge >> architectural problem. >> >> The problem comes in the callers of migrate_pages(). They pass a >> new_folio_t callback. alloc_migration_target() is the usual one passed >> and as far as I can tell is fine. I've seen no problems reported with it. >> >> compaction_alloc() is a disaster, and I don't know how to fix it. 
>> The compaction code has its own allocator which is populated with order-0 >> folios. How it populates that freelist is awful ... see split_map_pages() I think this compaction issue also affects large folios in the page cache? So really it is a pre-existing bug in the code base that needs to be fixed independently of large anon folios? Should I assume you are tackling this, Matthew? > > Yeah, all that code was written under the assumption that we're moving order-0 > pages (which is what the anon+pagecache pages part). > > From what I recall, we're allocating order-0 pages from the high memory > addresses, so we can migrate from low memory addresses, effectively freeing up > low memory addresses and filling high memory addresses. > > Adjusting that will be ... interesting. Instead of allocating order-0 pages from > high addresses, we might want to allocate "as large as possible" ("grab what we > can") from high addresses and then have our own kind of buddy for allocating > from that pool a compaction destination page, depending on our source page. Nasty. > > What should always work is the split->migrate. But that's definitely not what we > want in many cases. > >> >>> Is swapping working as expected? zswap? >> >> Suboptimally. Swap will split folios in order to swap them. Somebody >> needs to fix that, but it should work. > > Good! > > It would be great to have some kind of a feature matrix that tells us what works > perfectly, sub-optimally, barely, not at all (and what has not been tested). > Maybe (likely!) we'll also find things that are sub-optimal for ordinary THP > (like swapping, not even sure about). I'm building a list of known issues, but so far it has been based on code I've found during review and things raised by people in these threads. Are there test suites that explicitly test these features? If so I'll happily run them against large anon folios, but at the moment I'm ignorant I'm afraid. 
I have been trying to get mm selftests up and running, but I currently have a bunch of failures on arm64, even without any of my patches - something I'm working through. > > I suspect that KSM should work mostly fine with flexible-thp. When > deduplicating, we'll simply split the compound page and proceed as expected. But > might be worth testing as well. > ^ permalink raw reply [flat|nested] 167+ messages in thread
* Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory 2023-07-10 10:07 ` Ryan Roberts @ 2023-07-10 16:57 ` Matthew Wilcox -1 siblings, 0 replies; 167+ messages in thread From: Matthew Wilcox @ 2023-07-10 16:57 UTC (permalink / raw) To: Ryan Roberts Cc: David Hildenbrand, Andrew Morton, Kirill A. Shutemov, Yin Fengwei, Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm On Mon, Jul 10, 2023 at 11:07:47AM +0100, Ryan Roberts wrote: > I think this compaction issue also affects large folios in the page cache? So > really it is a pre-existing bug in the code base that needs to be fixed > independently of large anon folios? Should I assume you are tackling this, Matthew? It does need to be fixed independently of large anon folios. Said fix should probably be backported to 6.1 once it's suitably stable. However, I'm not working on it. I have a lot of projects and this one's a missed-opportunity, not a show-stopper. Sounds like Zi Yan might be interested in tackling it though! ^ permalink raw reply [flat|nested] 167+ messages in thread
* Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory 2023-07-07 13:24 ` David Hildenbrand @ 2023-07-10 16:53 ` Zi Yan -1 siblings, 0 replies; 167+ messages in thread From: Zi Yan @ 2023-07-10 16:53 UTC (permalink / raw) To: David Hildenbrand Cc: Matthew Wilcox, Ryan Roberts, Andrew Morton, Kirill A. Shutemov, Yin Fengwei, Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm [-- Attachment #1: Type: text/plain, Size: 2749 bytes --] On 7 Jul 2023, at 9:24, David Hildenbrand wrote: > On 07.07.23 15:12, Matthew Wilcox wrote: >> On Fri, Jul 07, 2023 at 01:40:53PM +0200, David Hildenbrand wrote: >>> On 06.07.23 10:02, Ryan Roberts wrote: >>> But can you comment on the page migration part (IOW did you try it already)? >>> >>> For example, memory hotunplug, CMA, MCE handling, compaction all rely on >>> page migration of something that was allocated using GFP_MOVABLE to actually >>> work. >>> >>> Compaction seems to skip any higher-order folios, but the question is if the >>> udnerlying migration itself works. >>> >>> If it already works: great! If not, this really has to be tackled early, >>> because otherwise we'll be breaking the GFP_MOVABLE semantics. >> >> I have looked at this a bit. _Migration_ should be fine. _Compaction_ >> is not. > > Thanks! Very nice if at least ordinary migration works. > >> >> If you look at a function like folio_migrate_mapping(), it all seems >> appropriately folio-ised. There might be something in there that is >> slightly wrong, but that would just be a bug to fix, not a huge >> architectural problem. >> >> The problem comes in the callers of migrate_pages(). They pass a >> new_folio_t callback. alloc_migration_target() is the usual one passed >> and as far as I can tell is fine. I've seen no problems reported with it. >> >> compaction_alloc() is a disaster, and I don't know how to fix it. 
>> The compaction code has its own allocator which is populated with order-0 >> folios. How it populates that freelist is awful ... see split_map_pages() > > Yeah, all that code was written under the assumption that we're moving order-0 pages (which is what the anon+pagecache pages are). > > From what I recall, we're allocating order-0 pages from the high memory addresses, so we can migrate from low memory addresses, effectively freeing up low memory addresses and filling high memory addresses. > > Adjusting that will be ... interesting. Instead of allocating order-0 pages from high addresses, we might want to allocate "as large as possible" ("grab what we can") from high addresses and then have our own kind of buddy for allocating from that pool a compaction destination page, depending on our source page. Nasty. We probably do not need a pool, since before migration, we have isolated folios to be migrated and can come up with stats on how many folios there are at each order. Then, we can isolate free pages based on the stats and do not split free pages all the way down to order-0. We can sort the source folios based on their orders and isolate free pages from largest order to smallest order. That could avoid a free page pool. -- Best Regards, Yan, Zi ^ permalink raw reply [flat|nested] 167+ messages in thread
* Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory 2023-07-10 16:53 ` Zi Yan @ 2023-07-19 15:49 ` Ryan Roberts -1 siblings, 0 replies; 167+ messages in thread From: Ryan Roberts @ 2023-07-19 15:49 UTC (permalink / raw) To: Zi Yan, David Hildenbrand Cc: Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Yin Fengwei, Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm On 10/07/2023 17:53, Zi Yan wrote: > On 7 Jul 2023, at 9:24, David Hildenbrand wrote: > >> On 07.07.23 15:12, Matthew Wilcox wrote: >>> On Fri, Jul 07, 2023 at 01:40:53PM +0200, David Hildenbrand wrote: >>>> On 06.07.23 10:02, Ryan Roberts wrote: >>>> But can you comment on the page migration part (IOW did you try it already)? >>>> >>>> For example, memory hotunplug, CMA, MCE handling, compaction all rely on >>>> page migration of something that was allocated using GFP_MOVABLE to actually >>>> work. >>>> >>>> Compaction seems to skip any higher-order folios, but the question is if the >>>> udnerlying migration itself works. >>>> >>>> If it already works: great! If not, this really has to be tackled early, >>>> because otherwise we'll be breaking the GFP_MOVABLE semantics. >>> >>> I have looked at this a bit. _Migration_ should be fine. _Compaction_ >>> is not. >> >> Thanks! Very nice if at least ordinary migration works. >> >>> >>> If you look at a function like folio_migrate_mapping(), it all seems >>> appropriately folio-ised. There might be something in there that is >>> slightly wrong, but that would just be a bug to fix, not a huge >>> architectural problem. >>> >>> The problem comes in the callers of migrate_pages(). They pass a >>> new_folio_t callback. alloc_migration_target() is the usual one passed >>> and as far as I can tell is fine. I've seen no problems reported with it. >>> >>> compaction_alloc() is a disaster, and I don't know how to fix it. 
>>> The compaction code has its own allocator which is populated with order-0 >>> folios. How it populates that freelist is awful ... see split_map_pages() >> >> Yeah, all that code was written under the assumption that we're moving order-0 pages (which is what the anon+pagecache pages part). >> >> From what I recall, we're allocating order-0 pages from the high memory addresses, so we can migrate from low memory addresses, effectively freeing up low memory addresses and filling high memory addresses. >> >> Adjusting that will be ... interesting. Instead of allocating order-0 pages from high addresses, we might want to allocate "as large as possible" ("grab what we can") from high addresses and then have our own kind of buddy for allocating from that pool a compaction destination page, depending on our source page. Nasty. > > We probably do not need a pool, since before migration, we have isolated folios to > be migrated and can come up with a stats on how many folios there are at each order. > Then, we can isolate free pages based on the stats and do not split free pages > all the way down to order-0. We can sort the source folios based on their orders > and isolate free pages from largest order to smallest order. That could avoid > a free page pool. Hi Zi, I just wanted to check; is this something you are working on or planning to work on? I'm trying to maintain a list of all the items that need to get sorted for large anon folios. It would be great to put your name against it! ;-) > > -- > Best Regards, > Yan, Zi ^ permalink raw reply [flat|nested] 167+ messages in thread
* Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory 2023-07-19 15:49 ` Ryan Roberts @ 2023-07-19 16:05 ` Zi Yan -1 siblings, 0 replies; 167+ messages in thread From: Zi Yan @ 2023-07-19 16:05 UTC (permalink / raw) To: Ryan Roberts Cc: David Hildenbrand, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Yin Fengwei, Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm [-- Attachment #1: Type: text/plain, Size: 3221 bytes --] On 19 Jul 2023, at 11:49, Ryan Roberts wrote: > On 10/07/2023 17:53, Zi Yan wrote: >> On 7 Jul 2023, at 9:24, David Hildenbrand wrote: >> >>> On 07.07.23 15:12, Matthew Wilcox wrote: >>>> On Fri, Jul 07, 2023 at 01:40:53PM +0200, David Hildenbrand wrote: >>>>> On 06.07.23 10:02, Ryan Roberts wrote: >>>>> But can you comment on the page migration part (IOW did you try it already)? >>>>> >>>>> For example, memory hotunplug, CMA, MCE handling, compaction all rely on >>>>> page migration of something that was allocated using GFP_MOVABLE to actually >>>>> work. >>>>> >>>>> Compaction seems to skip any higher-order folios, but the question is if the >>>>> udnerlying migration itself works. >>>>> >>>>> If it already works: great! If not, this really has to be tackled early, >>>>> because otherwise we'll be breaking the GFP_MOVABLE semantics. >>>> >>>> I have looked at this a bit. _Migration_ should be fine. _Compaction_ >>>> is not. >>> >>> Thanks! Very nice if at least ordinary migration works. >>> >>>> >>>> If you look at a function like folio_migrate_mapping(), it all seems >>>> appropriately folio-ised. There might be something in there that is >>>> slightly wrong, but that would just be a bug to fix, not a huge >>>> architectural problem. >>>> >>>> The problem comes in the callers of migrate_pages(). They pass a >>>> new_folio_t callback. alloc_migration_target() is the usual one passed >>>> and as far as I can tell is fine. I've seen no problems reported with it. 
>>>> >>>> compaction_alloc() is a disaster, and I don't know how to fix it. >>>> The compaction code has its own allocator which is populated with order-0 >>>> folios. How it populates that freelist is awful ... see split_map_pages() >>> >>> Yeah, all that code was written under the assumption that we're moving order-0 pages (which is what the anon+pagecache pages part). >>> >>> From what I recall, we're allocating order-0 pages from the high memory addresses, so we can migrate from low memory addresses, effectively freeing up low memory addresses and filling high memory addresses. >>> >>> Adjusting that will be ... interesting. Instead of allocating order-0 pages from high addresses, we might want to allocate "as large as possible" ("grab what we can") from high addresses and then have our own kind of buddy for allocating from that pool a compaction destination page, depending on our source page. Nasty. >> >> We probably do not need a pool, since before migration, we have isolated folios to >> be migrated and can come up with a stats on how many folios there are at each order. >> Then, we can isolate free pages based on the stats and do not split free pages >> all the way down to order-0. We can sort the source folios based on their orders >> and isolate free pages from largest order to smallest order. That could avoid >> a free page pool. > > Hi Zi, I just wanted to check; is this something you are working on or planning > to work on? I'm trying to maintain a list of all the items that need to get > sorted for large anon folios. It would be great to put your name against it! ;-) Sure. I can work on this one. -- Best Regards, Yan, Zi [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 854 bytes --] ^ permalink raw reply [flat|nested] 167+ messages in thread
* Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory 2023-07-19 16:05 ` Zi Yan @ 2023-07-19 18:37 ` Ryan Roberts 0 siblings, 0 replies; 167+ messages in thread From: Ryan Roberts @ 2023-07-19 18:37 UTC (permalink / raw) To: Zi Yan Cc: David Hildenbrand, Matthew Wilcox, Andrew Morton, Kirill A. Shutemov, Yin Fengwei, Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm On 19/07/2023 17:05, Zi Yan wrote: > On 19 Jul 2023, at 11:49, Ryan Roberts wrote: > >> On 10/07/2023 17:53, Zi Yan wrote: >>> On 7 Jul 2023, at 9:24, David Hildenbrand wrote: >>> >>>> On 07.07.23 15:12, Matthew Wilcox wrote: >>>>> On Fri, Jul 07, 2023 at 01:40:53PM +0200, David Hildenbrand wrote: >>>>>> On 06.07.23 10:02, Ryan Roberts wrote: >>>>>> But can you comment on the page migration part (IOW did you try it already)? >>>>>> >>>>>> For example, memory hotunplug, CMA, MCE handling, compaction all rely on >>>>>> page migration of something that was allocated using GFP_MOVABLE to actually >>>>>> work. >>>>>> >>>>>> Compaction seems to skip any higher-order folios, but the question is if the >>>>>> underlying migration itself works. >>>>>> >>>>>> If it already works: great! If not, this really has to be tackled early, >>>>>> because otherwise we'll be breaking the GFP_MOVABLE semantics. >>>>> >>>>> I have looked at this a bit. _Migration_ should be fine. _Compaction_ >>>>> is not. >>>> >>>> Thanks! Very nice if at least ordinary migration works. >>>> >>>>> >>>>> If you look at a function like folio_migrate_mapping(), it all seems >>>>> appropriately folio-ised. There might be something in there that is >>>>> slightly wrong, but that would just be a bug to fix, not a huge >>>>> architectural problem. >>>>> >>>>> The problem comes in the callers of migrate_pages(). They pass a >>>>> new_folio_t callback. alloc_migration_target() is the usual one passed >>>>> and as far as I can tell is fine. 
I've seen no problems reported with it. >>>>> >>>>> compaction_alloc() is a disaster, and I don't know how to fix it. >>>>> The compaction code has its own allocator which is populated with order-0 >>>>> folios. How it populates that freelist is awful ... see split_map_pages() >>>> >>>> Yeah, all that code was written under the assumption that we're moving order-0 pages (which is what the anon+pagecache path assumes). >>>> >>>> From what I recall, we're allocating order-0 pages from the high memory addresses, so we can migrate from low memory addresses, effectively freeing up low memory addresses and filling high memory addresses. >>>> >>>> Adjusting that will be ... interesting. Instead of allocating order-0 pages from high addresses, we might want to allocate "as large as possible" ("grab what we can") from high addresses and then have our own kind of buddy for allocating from that pool a compaction destination page, depending on our source page. Nasty. >>> >>> We probably do not need a pool, since before migration, we have isolated folios to >>> be migrated and can come up with stats on how many folios there are at each order. >>> Then, we can isolate free pages based on the stats and do not split free pages >>> all the way down to order-0. We can sort the source folios based on their orders >>> and isolate free pages from largest order to smallest order. That could avoid >>> a free page pool. >> >> Hi Zi, I just wanted to check; is this something you are working on or planning >> to work on? I'm trying to maintain a list of all the items that need to get >> sorted for large anon folios. It would be great to put your name against it! ;-) > > Sure. I can work on this one. Awesome - thanks! > > -- > Best Regards, > Yan, Zi
* Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory 2023-07-07 13:12 ` Matthew Wilcox @ 2023-07-11 21:11 ` Luis Chamberlain 0 siblings, 0 replies; 167+ messages in thread From: Luis Chamberlain @ 2023-07-11 21:11 UTC (permalink / raw) To: Matthew Wilcox Cc: David Hildenbrand, Ryan Roberts, Andrew Morton, Kirill A. Shutemov, Yin Fengwei, Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm On Fri, Jul 07, 2023 at 02:12:01PM +0100, Matthew Wilcox wrote: > On Fri, Jul 07, 2023 at 01:40:53PM +0200, David Hildenbrand wrote: > > > Is swapping working as expected? zswap? > > Suboptimally. Swap will split folios in order to swap them. Wouldn't that mean that if high order folios are used a lot but swap is also used, then until this is fixed you wouldn't get the expected reclaim gains for high order folios, and we'd need compaction more? > Somebody needs to fix that, but it should work. As we look at the shmem stuff, this was on the path, so it is something we have considered doing. I.e., it's on our team's list of items to help with, but currently on a backburner. Luis
* Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory 2023-07-11 21:11 ` Luis Chamberlain @ 2023-07-11 21:59 ` Matthew Wilcox 0 siblings, 0 replies; 167+ messages in thread From: Matthew Wilcox @ 2023-07-11 21:59 UTC (permalink / raw) To: Luis Chamberlain Cc: David Hildenbrand, Ryan Roberts, Andrew Morton, Kirill A. Shutemov, Yin Fengwei, Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel, linux-kernel, linux-mm On Tue, Jul 11, 2023 at 02:11:19PM -0700, Luis Chamberlain wrote: > On Fri, Jul 07, 2023 at 02:12:01PM +0100, Matthew Wilcox wrote: > > On Fri, Jul 07, 2023 at 01:40:53PM +0200, David Hildenbrand wrote: > > > > > Is swapping working as expected? zswap? > > > > Suboptimally. Swap will split folios in order to swap them. > > Wouldn't that mean that if high order folios are used a lot but swap is also > used, then until this is fixed you wouldn't get the expected reclaim gains > for high order folios, and we'd need compaction more? They're split in shrink_folio_list(), so they stay intact until that point. > > Somebody needs to fix that, but it should work. > > As we look at the shmem stuff, this was on the path, so it is something we have > considered doing. I.e., it's on our team's list of items to help with, > but currently on a backburner. Something I was thinking about is that you'll need to prohibit swap devices or swap files being created on large block devices. Until we rewrite the entire swap subsystem ...
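The cost Luis and Matthew are discussing can be made concrete with a toy model. This is an illustrative sketch, not kernel code — the function name and the can_swap_large switch are invented for the example; in the actual kernel the split happens in shrink_folio_list() during reclaim:

```python
def swap_writes(folio_orders, can_swap_large=False):
    """Count swap-out operations for the given folios.

    Today reclaim splits each large folio into order-0 pages before
    swapping it, so one order-N folio costs 2**N separate writes;
    if swap could handle large folios natively, each folio would
    cost a single write.
    """
    if can_swap_large:
        return len(folio_orders)                      # one write per folio
    return sum(1 << order for order in folio_orders)  # one write per base page
```

So a workload whose anonymous memory sits mostly in order-4 (64K) folios pays 16x the swap operations under the current behaviour, which is why heavy swapping erases much of the expected reclaim gain until swap learns to handle large folios.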
end of thread, other threads:[~2023-07-19 18:38 UTC | newest] Thread overview: 167+ messages