* [PATCH v4 00/13] mm/gup: Unify hugetlb, part 2
From: peterx @ 2024-03-27 15:23 UTC
  To: linux-mm, linux-kernel
  Cc: Yang Shi, Kirill A . Shutemov, Mike Kravetz, John Hubbard,
	Michael Ellerman, peterx, Andrew Jones, Muchun Song, linux-riscv,
	linuxppc-dev, Christophe Leroy, Andrew Morton, Christoph Hellwig,
	Lorenzo Stoakes, Matthew Wilcox, Rik van Riel, linux-arm-kernel,
	Andrea Arcangeli, David Hildenbrand, Aneesh Kumar K . V,
	Vlastimil Babka, James Houghton, Jason Gunthorpe, Mike Rapoport,
	Axel Rasmussen

From: Peter Xu <peterx@redhat.com>

v4:
- Fixed build issues, tested on more archs/configs ([x86_64, i386, arm,
  arm64, powerpc, riscv, s390] x [allno, alldef, allmod]).
  - Squashed the fixup series into v3, touched up commit messages [1]
  - Added the patch to fix pud_pfn() into the series [2]
  - Fixed one more build issue on arm+alldefconfig, where pgd_t is a
    two-item array.
- Managed R-b tags: added some, removed some (due to the squashes above)
- Rebased to latest mm-unstable (2f6182cd23a7, March 26th)

rfc: https://lore.kernel.org/r/20231116012908.392077-1-peterx@redhat.com
v1:  https://lore.kernel.org/r/20231219075538.414708-1-peterx@redhat.com
v2:  https://lore.kernel.org/r/20240103091423.400294-1-peterx@redhat.com
v3:  https://lore.kernel.org/r/20240321220802.679544-1-peterx@redhat.com

This series removes the hugetlb slow gup path, following previous
refactoring work [1], so that slow gup now uses the exact same path to
process all kinds of memory, including hugetlb.

In the long term, we may want to remove most, if not all, call sites of
huge_pte_offset(); ideally that API could be dropped from the arch
hugetlb API entirely.  This series is one small step towards merging
hugetlb-specific code into the generic mm paths.  From that POV, it
removes one of the many remaining references to huge_pte_offset().

One goal of this route is that we can reconsider merging hugetlb
features like High Granularity Mapping (HGM).  HGM was not accepted in
the past because it would add a lot of hugetlb-specific code and make
the mm code even harder to maintain.  With a merged code base, features
like HGM can hopefully share some code with THP, whether legacy (PMD and
above) or modern (contiguous PTEs).

To make this work, the generic slow gup code needs to at least
understand hugepd, as fast gup already does.  Because hugepd is a
software-only construct (no hardware recognizes the hugepd format; the
structures are purely artificial), there is a chance we can merge some
or all hugepd formats with cont_pte in the future.  That question is
still unsettled, pending an acknowledgement from the Power side.  For
now, this series keeps the hugepd handling, both because we may still
need it until the future of hugepd becomes clearer, and simply because
we already did it for fast gup and most of that code is around to be
reused.  It makes more sense to keep slow and fast gup behaving the same
until a decision is made to remove hugepd.
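
As in fast gup, this means the walker must check for hugepd entries
before treating an entry as a regular page table pointer.  The shape of
that check, as it already exists in the fast-gup walker (slightly
trimmed here):

	if (unlikely(is_hugepd(__hugepd(pgd_val(pgd))))) {
		/* Not a regular pgd: descend into the hugepd directory */
		if (!gup_huge_pd(__hugepd(pgd_val(pgd)), addr,
				 PGDIR_SHIFT, next, flags, pages, nr))
			return;
	}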

There is one major difference for slow gup in cont_pte / cont_pmd
handling, currently supported on three architectures (aarch64, riscv,
ppc).  Before this series, slow gup could recognize e.g. cont_pte
entries with the help of huge_pte_offset() when an hstate was around.
That is now gone, but everything keeps working by looking up pgtable
entries one by one.
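
For example, on aarch64 with 4KB base pages, a cont_pte mapping covers
16 consecutive ptes (64KB), so gup over such a range now takes 16 pte
lookups where, roughly, a single huge_pte_offset() lookup used to do.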

It's not ideal, but hopefully this change should not yet affect major
workloads.  There is some more information in the commit message of the
last patch.  If this becomes a concern, we can consider teaching slow
gup to recognize cont pte/pmd entries, which should recover the lost
performance.  I doubt its necessity for now, though, so I kept it as
simple as it can be.

Test Done
=========

For x86_64, I tested the full gup_test matrix over 2MB huge pages.  For
aarch64, I tested the same over 64KB cont_pte huge pages.

One note is that this version did not go through any ppc testing, as
finding such a system can always take time.  This relies on the fact
that previous versions were tested there, and this version has zero
changes in the hugepd sections.

If anyone (Christophe?) wants to give it a shot on PowerPC, please do
and I would appreciate it: "./run_vmtests.sh -a -t gup_test" should do
well enough (please make sure [2] is applied if hugepd is <1MB), as long
as we're sure the hugepd pages are touched as expected.
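
If the selftest complains about missing huge pages, reserving some
beforehand via the standard procfs knob may help, e.g.:

	echo 512 > /proc/sys/vm/nr_hugepages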

Patch layout
============

Patch 1-8:    Preparation work, or cleanups in relevant code paths
Patch 9-12:   Teach slow gup to handle all kinds of huge entries (pXd, hugepd)
Patch 13:     Drop hugetlb_follow_page_mask()

More information can be found in the commit message of each patch.  Any
comments are welcome.  Thanks.

[1] https://lore.kernel.org/all/20230628215310.73782-1-peterx@redhat.com
[2] https://lore.kernel.org/r/20240321215047.678172-1-peterx@redhat.com

Peter Xu (13):
  mm/Kconfig: CONFIG_PGTABLE_HAS_HUGE_LEAVES
  mm/hugetlb: Declare hugetlbfs_pagecache_present() non-static
  mm: Make HPAGE_PXD_* macros even if !THP
  mm: Introduce vma_pgtable_walk_{begin|end}()
  mm/arch: Provide pud_pfn() fallback
  mm/gup: Drop folio_fast_pin_allowed() in hugepd processing
  mm/gup: Refactor record_subpages() to find 1st small page
  mm/gup: Handle hugetlb for no_page_table()
  mm/gup: Cache *pudp in follow_pud_mask()
  mm/gup: Handle huge pud for follow_pud_mask()
  mm/gup: Handle huge pmd for follow_pmd_mask()
  mm/gup: Handle hugepd for follow_page()
  mm/gup: Handle hugetlb in the generic follow_page_mask code

 arch/riscv/include/asm/pgtable.h    |   1 +
 arch/s390/include/asm/pgtable.h     |   1 +
 arch/sparc/include/asm/pgtable_64.h |   1 +
 arch/x86/include/asm/pgtable.h      |   1 +
 include/linux/huge_mm.h             |  37 +-
 include/linux/hugetlb.h             |  16 +-
 include/linux/mm.h                  |   3 +
 include/linux/pgtable.h             |  10 +
 mm/Kconfig                          |   6 +
 mm/gup.c                            | 518 ++++++++++++++++++++--------
 mm/huge_memory.c                    | 133 +------
 mm/hugetlb.c                        |  75 +---
 mm/internal.h                       |   7 +-
 mm/memory.c                         |  12 +
 14 files changed, 441 insertions(+), 380 deletions(-)

-- 
2.44.0


* [PATCH v4 01/13] mm/Kconfig: CONFIG_PGTABLE_HAS_HUGE_LEAVES
From: peterx @ 2024-03-27 15:23 UTC
  To: linux-mm, linux-kernel
  Cc: Yang Shi, Kirill A . Shutemov, Mike Kravetz, John Hubbard,
	Michael Ellerman, peterx, Andrew Jones, Muchun Song, linux-riscv,
	linuxppc-dev, Christophe Leroy, Andrew Morton, Christoph Hellwig,
	Lorenzo Stoakes, Matthew Wilcox, Rik van Riel, linux-arm-kernel,
	Andrea Arcangeli, David Hildenbrand, Aneesh Kumar K . V,
	Vlastimil Babka, James Houghton, Jason Gunthorpe, Mike Rapoport,
	Axel Rasmussen

From: Peter Xu <peterx@redhat.com>

Introduce a config option that will be selected whenever huge pgtable
leaves can be involved (THP or hugetlbfs).  It is useful for marking any
code that can process either hugetlb or THP pages at any level higher
than the pte level.
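
For instance, code that only makes sense when huge pgtable leaves can
exist might be guarded like below (an illustrative sketch, not part of
this patch; pmd_leaf() is the existing generic helper):

	#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
	/* Only built when pmd-level huge leaves are possible */
	static bool leaf_is_huge(pmd_t pmd)
	{
		return pmd_leaf(pmd);
	}
	#endif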

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/Kconfig | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index b924f4a5a3ef..497cdf4d8ebf 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -850,6 +850,12 @@ config READ_ONLY_THP_FOR_FS
 
 endif # TRANSPARENT_HUGEPAGE
 
+#
+# The architecture supports pgtable leaves that is larger than PAGE_SIZE
+#
+config PGTABLE_HAS_HUGE_LEAVES
+	def_bool TRANSPARENT_HUGEPAGE || HUGETLB_PAGE
+
 #
 # UP and nommu archs use km based percpu allocator
 #
-- 
2.44.0



* [PATCH v4 02/13] mm/hugetlb: Declare hugetlbfs_pagecache_present() non-static
From: peterx @ 2024-03-27 15:23 UTC
  To: linux-mm, linux-kernel
  Cc: Yang Shi, Kirill A . Shutemov, Mike Kravetz, John Hubbard,
	Michael Ellerman, peterx, Andrew Jones, Muchun Song, linux-riscv,
	linuxppc-dev, Christophe Leroy, Andrew Morton, Christoph Hellwig,
	Lorenzo Stoakes, Matthew Wilcox, Rik van Riel, linux-arm-kernel,
	Andrea Arcangeli, David Hildenbrand, Aneesh Kumar K . V,
	Vlastimil Babka, James Houghton, Jason Gunthorpe, Mike Rapoport,
	Axel Rasmussen

From: Peter Xu <peterx@redhat.com>

It will be used outside hugetlb.c soon.
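
A sketch of the kind of caller this enables (illustrative only; the
real user is added later in this series, when generic gup needs to
decide whether a pgtable hole should be reported under FOLL_DUMP):

	/* During a dump, skip holes that have no page cache backing them */
	if ((flags & FOLL_DUMP) &&
	    !hugetlbfs_pagecache_present(h, vma, address))
		return ERR_PTR(-EFAULT);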

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/hugetlb.h | 9 +++++++++
 mm/hugetlb.c            | 4 ++--
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d748628efc5e..294c78b3549f 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -174,6 +174,9 @@ u32 hugetlb_fault_mutex_hash(struct address_space *mapping, pgoff_t idx);
 
 pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 		      unsigned long addr, pud_t *pud);
+bool hugetlbfs_pagecache_present(struct hstate *h,
+				 struct vm_area_struct *vma,
+				 unsigned long address);
 
 struct address_space *hugetlb_page_mapping_lock_write(struct page *hpage);
 
@@ -1228,6 +1231,12 @@ static inline void hugetlb_register_node(struct node *node)
 static inline void hugetlb_unregister_node(struct node *node)
 {
 }
+
+static inline bool hugetlbfs_pagecache_present(
+    struct hstate *h, struct vm_area_struct *vma, unsigned long address)
+{
+	return false;
+}
 #endif	/* CONFIG_HUGETLB_PAGE */
 
 static inline spinlock_t *huge_pte_lock(struct hstate *h,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f9640a81226e..65b9c9a48fd2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6110,8 +6110,8 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
 /*
  * Return whether there is a pagecache page to back given address within VMA.
  */
-static bool hugetlbfs_pagecache_present(struct hstate *h,
-			struct vm_area_struct *vma, unsigned long address)
+bool hugetlbfs_pagecache_present(struct hstate *h,
+				 struct vm_area_struct *vma, unsigned long address)
 {
 	struct address_space *mapping = vma->vm_file->f_mapping;
 	pgoff_t idx = linear_page_index(vma, address);
-- 
2.44.0



* [PATCH v4 03/13] mm: Make HPAGE_PXD_* macros even if !THP
From: peterx @ 2024-03-27 15:23 UTC
  To: linux-mm, linux-kernel
  Cc: Yang Shi, Kirill A . Shutemov, Mike Kravetz, John Hubbard,
	Michael Ellerman, peterx, Andrew Jones, Muchun Song, linux-riscv,
	linuxppc-dev, Christophe Leroy, Andrew Morton, Christoph Hellwig,
	Lorenzo Stoakes, Matthew Wilcox, Rik van Riel, linux-arm-kernel,
	Andrea Arcangeli, David Hildenbrand, Aneesh Kumar K . V,
	Vlastimil Babka, James Houghton, Jason Gunthorpe, Mike Rapoport,
	Axel Rasmussen

From: Peter Xu <peterx@redhat.com>

These macros can be helpful when we plan to merge hugetlb code into
generic code.  Move them out and define them as long as
PGTABLE_HAS_HUGE_LEAVES is selected, because there are systems that only
define HUGETLB_PAGE, not THP.

One note here is that HPAGE_PMD_SHIFT must be defined even if PMD_SHIFT
is not defined (e.g. the !CONFIG_MMU case); it (or other forms of it,
like HPAGE_PMD_NR) is already used in a lot of common code without ifdef
guards.  Use the old trick to keep compilation working.

Here we only need to differentiate the HPAGE_PXD_SHIFT definitions; all
the remaining macros will be defined based on them.  While at it, move
HPAGE_PMD_NR / HPAGE_PMD_ORDER over together.
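
The trick works because BUILD_BUG() only triggers when the expression is
actually emitted, so any use must be compile-time dead when the macros
are unusable; an illustrative sketch of the usual pattern (not from this
patch):

	if (IS_ENABLED(CONFIG_PGTABLE_HAS_HUGE_LEAVES))
		size = HPAGE_PMD_SIZE;	/* dead code otherwise: no BUILD_BUG() */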

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/huge_mm.h | 29 +++++++++++++++--------------
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 7576025db55d..d3bb25c39482 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -64,9 +64,6 @@ ssize_t single_hugepage_flag_show(struct kobject *kobj,
 				  enum transparent_hugepage_flag flag);
 extern struct kobj_attribute shmem_enabled_attr;
 
-#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
-#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
-
 /*
  * Mask of all large folio orders supported for anonymous THP; all orders up to
  * and including PMD_ORDER, except order-0 (which is not "huge") and order-1
@@ -87,14 +84,25 @@ extern struct kobj_attribute shmem_enabled_attr;
 #define thp_vma_allowable_order(vma, vm_flags, smaps, in_pf, enforce_sysfs, order) \
 	(!!thp_vma_allowable_orders(vma, vm_flags, smaps, in_pf, enforce_sysfs, BIT(order)))
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
 #define HPAGE_PMD_SHIFT PMD_SHIFT
-#define HPAGE_PMD_SIZE	((1UL) << HPAGE_PMD_SHIFT)
+#define HPAGE_PUD_SHIFT PUD_SHIFT
+#else
+#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
+#define HPAGE_PUD_SHIFT ({ BUILD_BUG(); 0; })
+#endif
+
+#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
+#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
 #define HPAGE_PMD_MASK	(~(HPAGE_PMD_SIZE - 1))
+#define HPAGE_PMD_SIZE	((1UL) << HPAGE_PMD_SHIFT)
 
-#define HPAGE_PUD_SHIFT PUD_SHIFT
-#define HPAGE_PUD_SIZE	((1UL) << HPAGE_PUD_SHIFT)
+#define HPAGE_PUD_ORDER (HPAGE_PUD_SHIFT-PAGE_SHIFT)
+#define HPAGE_PUD_NR (1<<HPAGE_PUD_ORDER)
 #define HPAGE_PUD_MASK	(~(HPAGE_PUD_SIZE - 1))
+#define HPAGE_PUD_SIZE	((1UL) << HPAGE_PUD_SHIFT)
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 
 extern unsigned long transparent_hugepage_flags;
 extern unsigned long huge_anon_orders_always;
@@ -377,13 +385,6 @@ static inline bool thp_migration_supported(void)
 }
 
 #else /* CONFIG_TRANSPARENT_HUGEPAGE */
-#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
-#define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
-#define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })
-
-#define HPAGE_PUD_SHIFT ({ BUILD_BUG(); 0; })
-#define HPAGE_PUD_MASK ({ BUILD_BUG(); 0; })
-#define HPAGE_PUD_SIZE ({ BUILD_BUG(); 0; })
 
 static inline bool folio_test_pmd_mappable(struct folio *folio)
 {
-- 
2.44.0



* [PATCH v4 04/13] mm: Introduce vma_pgtable_walk_{begin|end}()
@ 2024-03-27 15:23   ` peterx
  0 siblings, 0 replies; 160+ messages in thread
From: peterx @ 2024-03-27 15:23 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Yang Shi, Kirill A . Shutemov, Mike Kravetz, John Hubbard,
	Michael Ellerman, peterx, Andrew Jones, Muchun Song, linux-riscv,
	linuxppc-dev, Christophe Leroy, Andrew Morton, Christoph Hellwig,
	Lorenzo Stoakes, Matthew Wilcox, Rik van Riel, linux-arm-kernel,
	Andrea Arcangeli, David Hildenbrand, Aneesh Kumar K . V,
	Vlastimil Babka, James Houghton, Jason Gunthorpe, Mike Rapoport,
	Axel Rasmussen

From: Peter Xu <peterx@redhat.com>

Introduce per-vma begin()/end() helpers for pgtable walks.  This is
preparation work for merging the hugetlb pgtable walkers with generic mm.

The helpers need to be called before and after a pgtable walk; they will
become necessary once the pgtable walker code supports hugetlb pages.
It's a hook point for any type of VMA, but for now only hugetlb uses it,
to stabilize the pgtable pages against being freed under the walker (due
to possible pmd unsharing).
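
A hedged usage sketch, with walk_one_vma() as a hypothetical caller: a
pgtable walker that may cross hugetlb VMAs brackets the walk with the two
hooks, which for hugetlb take the per-VMA lock in read mode so that shared
pmd pgtable pages cannot be unshared and freed mid-walk:

static void walk_one_vma(struct vm_area_struct *vma)
{
	vma_pgtable_walk_begin(vma);
	/* ... walk pgtable entries in [vma->vm_start, vma->vm_end) ... */
	vma_pgtable_walk_end(vma);
}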

Reviewed-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/mm.h |  3 +++
 mm/memory.c        | 12 ++++++++++++
 2 files changed, 15 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index afe27ff3fa94..d8f78017d271 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4233,4 +4233,7 @@ static inline bool pfn_is_unaccepted_memory(unsigned long pfn)
 	return range_contains_unaccepted_memory(paddr, paddr + PAGE_SIZE);
 }
 
+void vma_pgtable_walk_begin(struct vm_area_struct *vma);
+void vma_pgtable_walk_end(struct vm_area_struct *vma);
+
 #endif /* _LINUX_MM_H */
diff --git a/mm/memory.c b/mm/memory.c
index 3d0c0cc33c57..27d173f9a521 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6438,3 +6438,15 @@ void ptlock_free(struct ptdesc *ptdesc)
 	kmem_cache_free(page_ptl_cachep, ptdesc->ptl);
 }
 #endif
+
+void vma_pgtable_walk_begin(struct vm_area_struct *vma)
+{
+	if (is_vm_hugetlb_page(vma))
+		hugetlb_vma_lock_read(vma);
+}
+
+void vma_pgtable_walk_end(struct vm_area_struct *vma)
+{
+	if (is_vm_hugetlb_page(vma))
+		hugetlb_vma_unlock_read(vma);
+}
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH v4 05/13] mm/arch: Provide pud_pfn() fallback
@ 2024-03-27 15:23   ` peterx
  0 siblings, 0 replies; 160+ messages in thread
From: peterx @ 2024-03-27 15:23 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Yang Shi, Kirill A . Shutemov, Mike Kravetz, John Hubbard,
	Michael Ellerman, peterx, Andrew Jones, Muchun Song, linux-riscv,
	linuxppc-dev, Christophe Leroy, Andrew Morton, Christoph Hellwig,
	Lorenzo Stoakes, Matthew Wilcox, Rik van Riel, linux-arm-kernel,
	Andrea Arcangeli, David Hildenbrand, Aneesh Kumar K . V,
	Vlastimil Babka, James Houghton, Jason Gunthorpe, Mike Rapoport,
	Axel Rasmussen

From: Peter Xu <peterx@redhat.com>

The comment in the code explains the reasons.  We take a different
approach compared to pmd_pfn() by providing a fallback definition.

Another option would be to provide lower-level config options (compared
to HUGETLB_PAGE or THP) to identify which pgtable level an arch supports
for such huge mappings.  However, that would be overkill.
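
A minimal sketch of the pattern, with arch_pud_to_pfn() as a hypothetical
helper: an arch implementing pud_pfn() also defines the macro to itself,
which acts as a feature flag so the generic #ifndef fallback is skipped
for that arch:

/* arch header */
#define pud_pfn pud_pfn
static inline unsigned long pud_pfn(pud_t pud)
{
	return arch_pud_to_pfn(pud);
}

/* generic header: only takes effect when no arch definition exists */
#ifndef pud_pfn
#define pud_pfn(x) ({ BUILD_BUG(); 0; })
#endif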

Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/riscv/include/asm/pgtable.h    |  1 +
 arch/s390/include/asm/pgtable.h     |  1 +
 arch/sparc/include/asm/pgtable_64.h |  1 +
 arch/x86/include/asm/pgtable.h      |  1 +
 include/linux/pgtable.h             | 10 ++++++++++
 5 files changed, 14 insertions(+)

diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 20242402fc11..0ca28cc8e3fa 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -646,6 +646,7 @@ static inline unsigned long pmd_pfn(pmd_t pmd)
 
 #define __pud_to_phys(pud)  (__page_val_to_pfn(pud_val(pud)) << PAGE_SHIFT)
 
+#define pud_pfn pud_pfn
 static inline unsigned long pud_pfn(pud_t pud)
 {
 	return ((__pud_to_phys(pud) & PUD_MASK) >> PAGE_SHIFT);
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 1a71cb19c089..6cbbe473f680 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1414,6 +1414,7 @@ static inline unsigned long pud_deref(pud_t pud)
 	return (unsigned long)__va(pud_val(pud) & origin_mask);
 }
 
+#define pud_pfn pud_pfn
 static inline unsigned long pud_pfn(pud_t pud)
 {
 	return __pa(pud_deref(pud)) >> PAGE_SHIFT;
diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 4d1bafaba942..26efc9bb644a 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -875,6 +875,7 @@ static inline bool pud_leaf(pud_t pud)
 	return pte_val(pte) & _PAGE_PMD_HUGE;
 }
 
+#define pud_pfn pud_pfn
 static inline unsigned long pud_pfn(pud_t pud)
 {
 	pte_t pte = __pte(pud_val(pud));
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index cefc7a84f7a4..273f7557218c 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -234,6 +234,7 @@ static inline unsigned long pmd_pfn(pmd_t pmd)
 	return (pfn & pmd_pfn_mask(pmd)) >> PAGE_SHIFT;
 }
 
+#define pud_pfn pud_pfn
 static inline unsigned long pud_pfn(pud_t pud)
 {
 	phys_addr_t pfn = pud_val(pud);
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 600e17d03659..75fe309a4e10 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1817,6 +1817,16 @@ typedef unsigned int pgtbl_mod_mask;
 #define pte_leaf_size(x) PAGE_SIZE
 #endif
 
+/*
+ * We always define pmd_pfn() for all archs as it's used in lots of
+ * generic code.  Now the same applies to pud_pfn() (and can apply to
+ * larger mappings in the future; we're not there yet).  Instead of
+ * defining it for all archs (like pmd_pfn()), provide a fallback.
+ */
+#ifndef pud_pfn
+#define pud_pfn(x) ({ BUILD_BUG(); 0; })
+#endif
+
 /*
  * Some architectures have MMUs that are configurable or selectable at boot
  * time. These lead to variable PTRS_PER_x. For statically allocated arrays it
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH v4 06/13] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing
@ 2024-03-27 15:23   ` peterx
  0 siblings, 0 replies; 160+ messages in thread
From: peterx @ 2024-03-27 15:23 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Yang Shi, Kirill A . Shutemov, Mike Kravetz, John Hubbard,
	Michael Ellerman, peterx, Andrew Jones, Muchun Song, linux-riscv,
	linuxppc-dev, Christophe Leroy, Andrew Morton, Christoph Hellwig,
	Lorenzo Stoakes, Matthew Wilcox, Rik van Riel, linux-arm-kernel,
	Andrea Arcangeli, David Hildenbrand, Aneesh Kumar K . V,
	Vlastimil Babka, James Houghton, Jason Gunthorpe, Mike Rapoport,
	Axel Rasmussen

From: Peter Xu <peterx@redhat.com>

The hugepd format is only used for GUP on PowerPC with hugetlbfs.  There
are some kernel usages of hugepd (see hugepd_populate_kernel() for
PPC_8XX); however, those pages are not candidates for GUP.

Commit a6e79df92e4a ("mm/gup: disallow FOLL_LONGTERM GUP-fast writing to
file-backed mappings") added a check to fail gup-fast if there's a
potential risk of violating GUP over writeback file systems.  That should
never apply to hugepd.  Considering that hugepd is an old (and
software-only) format, there's no plan to extend it to other file-backed
memory types that are prone to the same issue.

Drop that check, not only because it'll never be true for hugepd per any
known plan, but also because it paves the way for reusing the function
outside fast-gup.

To make sure we still remember this issue in case hugepd is ever extended
to support non-hugetlbfs memories, add a rich comment above gup_huge_pd()
explaining the issue with proper references.

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/gup.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index e7510b6ce765..db35b056fc9a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2832,11 +2832,6 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
 		return 0;
 	}
 
-	if (!folio_fast_pin_allowed(folio, flags)) {
-		gup_put_folio(folio, refs, flags);
-		return 0;
-	}
-
 	if (!pte_write(pte) && gup_must_unshare(NULL, flags, &folio->page)) {
 		gup_put_folio(folio, refs, flags);
 		return 0;
@@ -2847,6 +2842,14 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
 	return 1;
 }
 
+/*
+ * NOTE: currently GUP for a hugepd is only possible on hugetlbfs file
+ * systems on Power, which do not have issues with folio writeback racing
+ * against GUP updates.  If hugepd is ever extended to support
+ * non-hugetlbfs or even anonymous memory, we will need the same extra
+ * checks as for most other folios.  See writable_file_mapping_allowed()
+ * and folio_fast_pin_allowed() for more information.
+ */
 static int gup_huge_pd(hugepd_t hugepd, unsigned long addr,
 		unsigned int pdshift, unsigned long end, unsigned int flags,
 		struct page **pages, int *nr)
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH v4 07/13] mm/gup: Refactor record_subpages() to find 1st small page
@ 2024-03-27 15:23   ` peterx
  0 siblings, 0 replies; 160+ messages in thread
From: peterx @ 2024-03-27 15:23 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Yang Shi, Kirill A . Shutemov, Mike Kravetz, John Hubbard,
	Michael Ellerman, peterx, Andrew Jones, Muchun Song, linux-riscv,
	linuxppc-dev, Christophe Leroy, Andrew Morton, Christoph Hellwig,
	Lorenzo Stoakes, Matthew Wilcox, Rik van Riel, linux-arm-kernel,
	Andrea Arcangeli, David Hildenbrand, Aneesh Kumar K . V,
	Vlastimil Babka, James Houghton, Jason Gunthorpe, Mike Rapoport,
	Axel Rasmussen

From: Peter Xu <peterx@redhat.com>

All the fast-gup functions take a tail page to operate on, and always
need to do page mask calculations before feeding it into
record_subpages().

Merge that logic into record_subpages(), so that it does the nth_page()
calculation itself.
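
A hedged worked example of the new arithmetic, assuming 4K pages and a
2M PMD leaf whose head page is 'page':

/*
 * For addr = mapping base + 0x3000 and sz = PMD_SIZE (0x200000):
 *     (addr & (sz - 1)) >> PAGE_SHIFT == 0x3000 >> 12 == 3
 * so start_page = nth_page(page, 3), and the loop then records
 * pages 3, 4, 5, ... of the folio until addr reaches 'end'.
 */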

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/gup.c | 25 ++++++++++++++-----------
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index db35b056fc9a..c2881772216b 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2779,13 +2779,16 @@ static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
 }
 #endif
 
-static int record_subpages(struct page *page, unsigned long addr,
-			   unsigned long end, struct page **pages)
+static int record_subpages(struct page *page, unsigned long sz,
+			   unsigned long addr, unsigned long end,
+			   struct page **pages)
 {
+	struct page *start_page;
 	int nr;
 
+	start_page = nth_page(page, (addr & (sz - 1)) >> PAGE_SHIFT);
 	for (nr = 0; addr != end; nr++, addr += PAGE_SIZE)
-		pages[nr] = nth_page(page, nr);
+		pages[nr] = nth_page(start_page, nr);
 
 	return nr;
 }
@@ -2820,8 +2823,8 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
 	/* hugepages are never "special" */
 	VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 
-	page = nth_page(pte_page(pte), (addr & (sz - 1)) >> PAGE_SHIFT);
-	refs = record_subpages(page, addr, end, pages + *nr);
+	page = pte_page(pte);
+	refs = record_subpages(page, sz, addr, end, pages + *nr);
 
 	folio = try_grab_folio(page, refs, flags);
 	if (!folio)
@@ -2894,8 +2897,8 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 					     pages, nr);
 	}
 
-	page = nth_page(pmd_page(orig), (addr & ~PMD_MASK) >> PAGE_SHIFT);
-	refs = record_subpages(page, addr, end, pages + *nr);
+	page = pmd_page(orig);
+	refs = record_subpages(page, PMD_SIZE, addr, end, pages + *nr);
 
 	folio = try_grab_folio(page, refs, flags);
 	if (!folio)
@@ -2938,8 +2941,8 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 					     pages, nr);
 	}
 
-	page = nth_page(pud_page(orig), (addr & ~PUD_MASK) >> PAGE_SHIFT);
-	refs = record_subpages(page, addr, end, pages + *nr);
+	page = pud_page(orig);
+	refs = record_subpages(page, PUD_SIZE, addr, end, pages + *nr);
 
 	folio = try_grab_folio(page, refs, flags);
 	if (!folio)
@@ -2978,8 +2981,8 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
 
 	BUILD_BUG_ON(pgd_devmap(orig));
 
-	page = nth_page(pgd_page(orig), (addr & ~PGDIR_MASK) >> PAGE_SHIFT);
-	refs = record_subpages(page, addr, end, pages + *nr);
+	page = pgd_page(orig);
+	refs = record_subpages(page, PGDIR_SIZE, addr, end, pages + *nr);
 
 	folio = try_grab_folio(page, refs, flags);
 	if (!folio)
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH v4 08/13] mm/gup: Handle hugetlb for no_page_table()
@ 2024-03-27 15:23   ` peterx
  0 siblings, 0 replies; 160+ messages in thread
From: peterx @ 2024-03-27 15:23 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Yang Shi, Kirill A . Shutemov, Mike Kravetz, John Hubbard,
	Michael Ellerman, peterx, Andrew Jones, Muchun Song, linux-riscv,
	linuxppc-dev, Christophe Leroy, Andrew Morton, Christoph Hellwig,
	Lorenzo Stoakes, Matthew Wilcox, Rik van Riel, linux-arm-kernel,
	Andrea Arcangeli, David Hildenbrand, Aneesh Kumar K . V,
	Vlastimil Babka, James Houghton, Jason Gunthorpe, Mike Rapoport,
	Axel Rasmussen

From: Peter Xu <peterx@redhat.com>

no_page_table() is not yet used for hugetlb code paths.  Prepare it for
that.

The major difference here is that hugetlb will return -EFAULT whenever the
page cache entry does not exist, even for VM_SHARED mappings.  See
hugetlb_follow_page_mask().

Pass "address" into no_page_table() too, as hugetlb will need it.

Reviewed-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/gup.c | 44 ++++++++++++++++++++++++++------------------
 1 file changed, 26 insertions(+), 18 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index c2881772216b..ef46a7053e16 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -501,19 +501,27 @@ static inline void mm_set_has_pinned_flag(unsigned long *mm_flags)
 
 #ifdef CONFIG_MMU
 static struct page *no_page_table(struct vm_area_struct *vma,
-		unsigned int flags)
+				  unsigned int flags, unsigned long address)
 {
+	if (!(flags & FOLL_DUMP))
+		return NULL;
+
 	/*
-	 * When core dumping an enormous anonymous area that nobody
-	 * has touched so far, we don't want to allocate unnecessary pages or
+	 * When core dumping, we don't want to allocate unnecessary pages or
 	 * page tables.  Return error instead of NULL to skip handle_mm_fault,
 	 * then get_dump_page() will return NULL to leave a hole in the dump.
 	 * But we can only make this optimization where a hole would surely
 	 * be zero-filled if handle_mm_fault() actually did handle it.
 	 */
-	if ((flags & FOLL_DUMP) &&
-			(vma_is_anonymous(vma) || !vma->vm_ops->fault))
+	if (is_vm_hugetlb_page(vma)) {
+		struct hstate *h = hstate_vma(vma);
+
+		if (!hugetlbfs_pagecache_present(h, vma, address))
+			return ERR_PTR(-EFAULT);
+	} else if ((vma_is_anonymous(vma) || !vma->vm_ops->fault)) {
 		return ERR_PTR(-EFAULT);
+	}
+
 	return NULL;
 }
 
@@ -593,7 +601,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 
 	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
 	if (!ptep)
-		return no_page_table(vma, flags);
+		return no_page_table(vma, flags, address);
 	pte = ptep_get(ptep);
 	if (!pte_present(pte))
 		goto no_page;
@@ -685,7 +693,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	pte_unmap_unlock(ptep, ptl);
 	if (!pte_none(pte))
 		return NULL;
-	return no_page_table(vma, flags);
+	return no_page_table(vma, flags, address);
 }
 
 static struct page *follow_pmd_mask(struct vm_area_struct *vma,
@@ -701,27 +709,27 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
 	pmd = pmd_offset(pudp, address);
 	pmdval = pmdp_get_lockless(pmd);
 	if (pmd_none(pmdval))
-		return no_page_table(vma, flags);
+		return no_page_table(vma, flags, address);
 	if (!pmd_present(pmdval))
-		return no_page_table(vma, flags);
+		return no_page_table(vma, flags, address);
 	if (pmd_devmap(pmdval)) {
 		ptl = pmd_lock(mm, pmd);
 		page = follow_devmap_pmd(vma, address, pmd, flags, &ctx->pgmap);
 		spin_unlock(ptl);
 		if (page)
 			return page;
-		return no_page_table(vma, flags);
+		return no_page_table(vma, flags, address);
 	}
 	if (likely(!pmd_trans_huge(pmdval)))
 		return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
 
 	if (pmd_protnone(pmdval) && !gup_can_follow_protnone(vma, flags))
-		return no_page_table(vma, flags);
+		return no_page_table(vma, flags, address);
 
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_present(*pmd))) {
 		spin_unlock(ptl);
-		return no_page_table(vma, flags);
+		return no_page_table(vma, flags, address);
 	}
 	if (unlikely(!pmd_trans_huge(*pmd))) {
 		spin_unlock(ptl);
@@ -752,17 +760,17 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma,
 
 	pud = pud_offset(p4dp, address);
 	if (pud_none(*pud))
-		return no_page_table(vma, flags);
+		return no_page_table(vma, flags, address);
 	if (pud_devmap(*pud)) {
 		ptl = pud_lock(mm, pud);
 		page = follow_devmap_pud(vma, address, pud, flags, &ctx->pgmap);
 		spin_unlock(ptl);
 		if (page)
 			return page;
-		return no_page_table(vma, flags);
+		return no_page_table(vma, flags, address);
 	}
 	if (unlikely(pud_bad(*pud)))
-		return no_page_table(vma, flags);
+		return no_page_table(vma, flags, address);
 
 	return follow_pmd_mask(vma, address, pud, flags, ctx);
 }
@@ -777,10 +785,10 @@ static struct page *follow_p4d_mask(struct vm_area_struct *vma,
 	p4dp = p4d_offset(pgdp, address);
 	p4d = READ_ONCE(*p4dp);
 	if (!p4d_present(p4d))
-		return no_page_table(vma, flags);
+		return no_page_table(vma, flags, address);
 	BUILD_BUG_ON(p4d_leaf(p4d));
 	if (unlikely(p4d_bad(p4d)))
-		return no_page_table(vma, flags);
+		return no_page_table(vma, flags, address);
 
 	return follow_pud_mask(vma, address, p4dp, flags, ctx);
 }
@@ -830,7 +838,7 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
 	pgd = pgd_offset(mm, address);
 
 	if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
-		return no_page_table(vma, flags);
+		return no_page_table(vma, flags, address);
 
 	return follow_p4d_mask(vma, address, pgd, flags, ctx);
 }
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH v4 09/13] mm/gup: Cache *pudp in follow_pud_mask()
  2024-03-27 15:23 ` peterx
@ 2024-03-27 15:23   ` peterx
  -1 siblings, 0 replies; 160+ messages in thread
From: peterx @ 2024-03-27 15:23 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Yang Shi, Kirill A . Shutemov, Mike Kravetz, John Hubbard,
	Michael Ellerman, peterx, Andrew Jones, Muchun Song, linux-riscv,
	linuxppc-dev, Christophe Leroy, Andrew Morton, Christoph Hellwig,
	Lorenzo Stoakes, Matthew Wilcox, Rik van Riel, linux-arm-kernel,
	Andrea Arcangeli, David Hildenbrand, Aneesh Kumar K . V,
	Vlastimil Babka, James Houghton, Jason Gunthorpe, Mike Rapoport,
	Axel Rasmussen

From: Peter Xu <peterx@redhat.com>

Introduce "pud_t pud" in the function, so the code won't dereference *pudp
multiple time.  Not only because that looks less straightforward, but also
because if the dereference really happened, it's not clear whether there
can be race to see different *pudp values if it's being modified at the
same time.
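
As a minimal sketch of the pattern (illustrative only, not taken from the
patch):

	/* Before: each check dereferences *pudp again and may race
	 * with a concurrent writer, observing different values.
	 */
	if (pud_none(*pudp) || pud_bad(*pudp))
		return no_page_table(vma, flags, address);

	/* After: snapshot once, so every check sees one value. */
	pud_t pud = READ_ONCE(*pudp);

	if (pud_none(pud) || pud_bad(pud))
		return no_page_table(vma, flags, address);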

Acked-by: James Houghton <jthoughton@google.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/gup.c | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index ef46a7053e16..26b8cca24077 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -753,26 +753,27 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma,
 				    unsigned int flags,
 				    struct follow_page_context *ctx)
 {
-	pud_t *pud;
+	pud_t *pudp, pud;
 	spinlock_t *ptl;
 	struct page *page;
 	struct mm_struct *mm = vma->vm_mm;
 
-	pud = pud_offset(p4dp, address);
-	if (pud_none(*pud))
+	pudp = pud_offset(p4dp, address);
+	pud = READ_ONCE(*pudp);
+	if (pud_none(pud))
 		return no_page_table(vma, flags, address);
-	if (pud_devmap(*pud)) {
-		ptl = pud_lock(mm, pud);
-		page = follow_devmap_pud(vma, address, pud, flags, &ctx->pgmap);
+	if (pud_devmap(pud)) {
+		ptl = pud_lock(mm, pudp);
+		page = follow_devmap_pud(vma, address, pudp, flags, &ctx->pgmap);
 		spin_unlock(ptl);
 		if (page)
 			return page;
 		return no_page_table(vma, flags, address);
 	}
-	if (unlikely(pud_bad(*pud)))
+	if (unlikely(pud_bad(pud)))
 		return no_page_table(vma, flags, address);
 
-	return follow_pmd_mask(vma, address, pud, flags, ctx);
+	return follow_pmd_mask(vma, address, pudp, flags, ctx);
 }
 
 static struct page *follow_p4d_mask(struct vm_area_struct *vma,
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH v4 10/13] mm/gup: Handle huge pud for follow_pud_mask()
  2024-03-27 15:23 ` peterx
@ 2024-03-27 15:23   ` peterx
  -1 siblings, 0 replies; 160+ messages in thread
From: peterx @ 2024-03-27 15:23 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Yang Shi, Kirill A . Shutemov, Mike Kravetz, John Hubbard,
	Michael Ellerman, peterx, Andrew Jones, Muchun Song, linux-riscv,
	linuxppc-dev, Christophe Leroy, Andrew Morton, Christoph Hellwig,
	Lorenzo Stoakes, Matthew Wilcox, Rik van Riel, linux-arm-kernel,
	Andrea Arcangeli, David Hildenbrand, Aneesh Kumar K . V,
	Vlastimil Babka, James Houghton, Jason Gunthorpe, Mike Rapoport,
	Axel Rasmussen

From: Peter Xu <peterx@redhat.com>

Teach follow_pud_mask() to handle normal PUD pages like hugetlb.

Rename follow_devmap_pud() to follow_huge_pud() so that it can process
either huge devmap or hugetlb. Move it out of TRANSPARENT_HUGEPAGE_PUD and
huge_memory.c (which relies on CONFIG_THP).  Switch to pud_leaf() to
detect both cases in the slow gup.

The new follow_huge_pud() takes care of possible CoR (the
gup_must_unshare() handling) for hugetlb when necessary.  touch_pud()
needs to be moved out of huge_memory.c to be accessible from gup.c even
if !THP.

While at it, optimize the non-present check by adding an early
pud_present() check before taking the pgtable lock, failing follow_page()
early if the PUD is not present: that is required by both devmap and
hugetlb.  pud_leaf() also covers the pud_devmap() case.

One more trivial thing to mention: introduce "pud_t pud" in the code paths
along the way, so the code doesn't dereference *pudp multiple times.
Repeated dereferences not only read less straightforwardly, but, if they
really happened, could also race with concurrent modifications of *pudp
and observe different values.

Set ctx->page_mask properly for a PUD entry.  As a side effect, this patch
should also allow devmap GUP on a PUD to jump over the whole PUD range,
though that is not yet verified.  Hugetlb can already do so prior to this
patch.
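
For reference, a simplified sketch of the consumer side of page_mask in
__get_user_pages() (condensed from mm/gup.c), showing how a PUD leaf lets
the walker advance over the whole mapping in one step:

	/*
	 * With ctx.page_mask == HPAGE_PUD_NR - 1, all subpages that
	 * remain inside the PUD leaf are consumed in one step rather
	 * than one follow_page_mask() call per PAGE_SIZE.
	 */
	page_increm = 1 + (~(start >> PAGE_SHIFT) & ctx.page_mask);
	if (page_increm > nr_pages)
		page_increm = nr_pages;
	start += page_increm * PAGE_SIZE;
	nr_pages -= page_increm;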

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/huge_mm.h |  8 -----
 mm/gup.c                | 70 +++++++++++++++++++++++++++++++++++++++--
 mm/huge_memory.c        | 47 ++-------------------------
 mm/internal.h           |  2 ++
 4 files changed, 71 insertions(+), 56 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index d3bb25c39482..3f36511bdc02 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -351,8 +351,6 @@ static inline bool folio_test_pmd_mappable(struct folio *folio)
 
 struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
 		pmd_t *pmd, int flags, struct dev_pagemap **pgmap);
-struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
-		pud_t *pud, int flags, struct dev_pagemap **pgmap);
 
 vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf);
 
@@ -507,12 +505,6 @@ static inline struct page *follow_devmap_pmd(struct vm_area_struct *vma,
 	return NULL;
 }
 
-static inline struct page *follow_devmap_pud(struct vm_area_struct *vma,
-	unsigned long addr, pud_t *pud, int flags, struct dev_pagemap **pgmap)
-{
-	return NULL;
-}
-
 static inline bool thp_migration_supported(void)
 {
 	return false;
diff --git a/mm/gup.c b/mm/gup.c
index 26b8cca24077..1e5d42211bb4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -525,6 +525,70 @@ static struct page *no_page_table(struct vm_area_struct *vma,
 	return NULL;
 }
 
+#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
+static struct page *follow_huge_pud(struct vm_area_struct *vma,
+				    unsigned long addr, pud_t *pudp,
+				    int flags, struct follow_page_context *ctx)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct page *page;
+	pud_t pud = *pudp;
+	unsigned long pfn = pud_pfn(pud);
+	int ret;
+
+	assert_spin_locked(pud_lockptr(mm, pudp));
+
+	if ((flags & FOLL_WRITE) && !pud_write(pud))
+		return NULL;
+
+	if (!pud_present(pud))
+		return NULL;
+
+	pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT;
+
+	if (IS_ENABLED(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) &&
+	    pud_devmap(pud)) {
+		/*
+		 * device mapped pages can only be returned if the caller
+		 * will manage the page reference count.
+		 *
+		 * At least one of FOLL_GET | FOLL_PIN must be set, so
+		 * assert that here:
+		 */
+		if (!(flags & (FOLL_GET | FOLL_PIN)))
+			return ERR_PTR(-EEXIST);
+
+		if (flags & FOLL_TOUCH)
+			touch_pud(vma, addr, pudp, flags & FOLL_WRITE);
+
+		ctx->pgmap = get_dev_pagemap(pfn, ctx->pgmap);
+		if (!ctx->pgmap)
+			return ERR_PTR(-EFAULT);
+	}
+
+	page = pfn_to_page(pfn);
+
+	if (!pud_devmap(pud) && !pud_write(pud) &&
+	    gup_must_unshare(vma, flags, page))
+		return ERR_PTR(-EMLINK);
+
+	ret = try_grab_page(page, flags);
+	if (ret)
+		page = ERR_PTR(ret);
+	else
+		ctx->page_mask = HPAGE_PUD_NR - 1;
+
+	return page;
+}
+#else  /* CONFIG_PGTABLE_HAS_HUGE_LEAVES */
+static struct page *follow_huge_pud(struct vm_area_struct *vma,
+				    unsigned long addr, pud_t *pudp,
+				    int flags, struct follow_page_context *ctx)
+{
+	return NULL;
+}
+#endif	/* CONFIG_PGTABLE_HAS_HUGE_LEAVES */
+
 static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
 		pte_t *pte, unsigned int flags)
 {
@@ -760,11 +824,11 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma,
 
 	pudp = pud_offset(p4dp, address);
 	pud = READ_ONCE(*pudp);
-	if (pud_none(pud))
+	if (!pud_present(pud))
 		return no_page_table(vma, flags, address);
-	if (pud_devmap(pud)) {
+	if (pud_leaf(pud)) {
 		ptl = pud_lock(mm, pudp);
-		page = follow_devmap_pud(vma, address, pudp, flags, &ctx->pgmap);
+		page = follow_huge_pud(vma, address, pudp, flags, ctx);
 		spin_unlock(ptl);
 		if (page)
 			return page;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bc6fa82d9815..2979198d7b71 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1377,8 +1377,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 }
 
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
-static void touch_pud(struct vm_area_struct *vma, unsigned long addr,
-		      pud_t *pud, bool write)
+void touch_pud(struct vm_area_struct *vma, unsigned long addr,
+	       pud_t *pud, bool write)
 {
 	pud_t _pud;
 
@@ -1390,49 +1390,6 @@ static void touch_pud(struct vm_area_struct *vma, unsigned long addr,
 		update_mmu_cache_pud(vma, addr, pud);
 }
 
-struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
-		pud_t *pud, int flags, struct dev_pagemap **pgmap)
-{
-	unsigned long pfn = pud_pfn(*pud);
-	struct mm_struct *mm = vma->vm_mm;
-	struct page *page;
-	int ret;
-
-	assert_spin_locked(pud_lockptr(mm, pud));
-
-	if (flags & FOLL_WRITE && !pud_write(*pud))
-		return NULL;
-
-	if (pud_present(*pud) && pud_devmap(*pud))
-		/* pass */;
-	else
-		return NULL;
-
-	if (flags & FOLL_TOUCH)
-		touch_pud(vma, addr, pud, flags & FOLL_WRITE);
-
-	/*
-	 * device mapped pages can only be returned if the
-	 * caller will manage the page reference count.
-	 *
-	 * At least one of FOLL_GET | FOLL_PIN must be set, so assert that here:
-	 */
-	if (!(flags & (FOLL_GET | FOLL_PIN)))
-		return ERR_PTR(-EEXIST);
-
-	pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT;
-	*pgmap = get_dev_pagemap(pfn, *pgmap);
-	if (!*pgmap)
-		return ERR_PTR(-EFAULT);
-	page = pfn_to_page(pfn);
-
-	ret = try_grab_page(page, flags);
-	if (ret)
-		page = ERR_PTR(ret);
-
-	return page;
-}
-
 int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		  pud_t *dst_pud, pud_t *src_pud, unsigned long addr,
 		  struct vm_area_struct *vma)
diff --git a/mm/internal.h b/mm/internal.h
index 6c8d3844b6a3..eee8c82740b5 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1111,6 +1111,8 @@ int __must_check try_grab_page(struct page *page, unsigned int flags);
 /*
  * mm/huge_memory.c
  */
+void touch_pud(struct vm_area_struct *vma, unsigned long addr,
+	       pud_t *pud, bool write);
 struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 				   unsigned long addr, pmd_t *pmd,
 				   unsigned int flags);
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH v4 10/13] mm/gup: Handle huge pud for follow_pud_mask()
@ 2024-03-27 15:23   ` peterx
  0 siblings, 0 replies; 160+ messages in thread
From: peterx @ 2024-03-27 15:23 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Yang Shi, Kirill A . Shutemov, Mike Kravetz, John Hubbard,
	Michael Ellerman, peterx, Andrew Jones, Muchun Song, linux-riscv,
	linuxppc-dev, Christophe Leroy, Andrew Morton, Christoph Hellwig,
	Lorenzo Stoakes, Matthew Wilcox, Rik van Riel, linux-arm-kernel,
	Andrea Arcangeli, David Hildenbrand, Aneesh Kumar K . V,
	Vlastimil Babka, James Houghton, Jason Gunthorpe, Mike Rapoport,
	Axel Rasmussen

From: Peter Xu <peterx@redhat.com>

Teach follow_pud_mask() to be able to handle normal PUD pages like hugetlb.

Rename follow_devmap_pud() to follow_huge_pud() so that it can process
either huge devmap or hugetlb. Move it out of TRANSPARENT_HUGEPAGE_PUD and
and huge_memory.c (which relies on CONFIG_THP).  Switch to pud_leaf() to
detect both cases in the slow gup.

In the new follow_huge_pud(), taking care of possible CoR for hugetlb if
necessary.  touch_pud() needs to be moved out of huge_memory.c to be
accessable from gup.c even if !THP.

Since at it, optimize the non-present check by adding a pud_present() early
check before taking the pgtable lock, failing the follow_page() early if
PUD is not present: that is required by both devmap or hugetlb.  Use
pud_huge() to also cover the pud_devmap() case.

One more trivial thing to mention is, introduce "pud_t pud" in the code
paths along the way, so the code doesn't dereference *pudp multiple time.
Not only because that looks less straightforward, but also because if the
dereference really happened, it's not clear whether there can be race to
see different *pudp values when it's being modified at the same time.

Setting ctx->page_mask properly for a PUD entry.  As a side effect, this
patch should also be able to optimize devmap GUP on PUD to be able to jump
over the whole PUD range, but not yet verified.  Hugetlb already can do so
prior to this patch.

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/huge_mm.h |  8 -----
 mm/gup.c                | 70 +++++++++++++++++++++++++++++++++++++++--
 mm/huge_memory.c        | 47 ++-------------------------
 mm/internal.h           |  2 ++
 4 files changed, 71 insertions(+), 56 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index d3bb25c39482..3f36511bdc02 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -351,8 +351,6 @@ static inline bool folio_test_pmd_mappable(struct folio *folio)
 
 struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
 		pmd_t *pmd, int flags, struct dev_pagemap **pgmap);
-struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
-		pud_t *pud, int flags, struct dev_pagemap **pgmap);
 
 vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf);
 
@@ -507,12 +505,6 @@ static inline struct page *follow_devmap_pmd(struct vm_area_struct *vma,
 	return NULL;
 }
 
-static inline struct page *follow_devmap_pud(struct vm_area_struct *vma,
-	unsigned long addr, pud_t *pud, int flags, struct dev_pagemap **pgmap)
-{
-	return NULL;
-}
-
 static inline bool thp_migration_supported(void)
 {
 	return false;
diff --git a/mm/gup.c b/mm/gup.c
index 26b8cca24077..1e5d42211bb4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -525,6 +525,70 @@ static struct page *no_page_table(struct vm_area_struct *vma,
 	return NULL;
 }
 
+#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
+static struct page *follow_huge_pud(struct vm_area_struct *vma,
+				    unsigned long addr, pud_t *pudp,
+				    int flags, struct follow_page_context *ctx)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct page *page;
+	pud_t pud = *pudp;
+	unsigned long pfn = pud_pfn(pud);
+	int ret;
+
+	assert_spin_locked(pud_lockptr(mm, pudp));
+
+	if ((flags & FOLL_WRITE) && !pud_write(pud))
+		return NULL;
+
+	if (!pud_present(pud))
+		return NULL;
+
+	pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT;
+
+	if (IS_ENABLED(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) &&
+	    pud_devmap(pud)) {
+		/*
+		 * device mapped pages can only be returned if the caller
+		 * will manage the page reference count.
+		 *
+		 * At least one of FOLL_GET | FOLL_PIN must be set, so
+		 * assert that here:
+		 */
+		if (!(flags & (FOLL_GET | FOLL_PIN)))
+			return ERR_PTR(-EEXIST);
+
+		if (flags & FOLL_TOUCH)
+			touch_pud(vma, addr, pudp, flags & FOLL_WRITE);
+
+		ctx->pgmap = get_dev_pagemap(pfn, ctx->pgmap);
+		if (!ctx->pgmap)
+			return ERR_PTR(-EFAULT);
+	}
+
+	page = pfn_to_page(pfn);
+
+	if (!pud_devmap(pud) && !pud_write(pud) &&
+	    gup_must_unshare(vma, flags, page))
+		return ERR_PTR(-EMLINK);
+
+	ret = try_grab_page(page, flags);
+	if (ret)
+		page = ERR_PTR(ret);
+	else
+		ctx->page_mask = HPAGE_PUD_NR - 1;
+
+	return page;
+}
+#else  /* CONFIG_PGTABLE_HAS_HUGE_LEAVES */
+static struct page *follow_huge_pud(struct vm_area_struct *vma,
+				    unsigned long addr, pud_t *pudp,
+				    int flags, struct follow_page_context *ctx)
+{
+	return NULL;
+}
+#endif	/* CONFIG_PGTABLE_HAS_HUGE_LEAVES */
+
 static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
 		pte_t *pte, unsigned int flags)
 {
@@ -760,11 +824,11 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma,
 
 	pudp = pud_offset(p4dp, address);
 	pud = READ_ONCE(*pudp);
-	if (pud_none(pud))
+	if (!pud_present(pud))
 		return no_page_table(vma, flags, address);
-	if (pud_devmap(pud)) {
+	if (pud_leaf(pud)) {
 		ptl = pud_lock(mm, pudp);
-		page = follow_devmap_pud(vma, address, pudp, flags, &ctx->pgmap);
+		page = follow_huge_pud(vma, address, pudp, flags, ctx);
 		spin_unlock(ptl);
 		if (page)
 			return page;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bc6fa82d9815..2979198d7b71 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1377,8 +1377,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 }
 
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
-static void touch_pud(struct vm_area_struct *vma, unsigned long addr,
-		      pud_t *pud, bool write)
+void touch_pud(struct vm_area_struct *vma, unsigned long addr,
+	       pud_t *pud, bool write)
 {
 	pud_t _pud;
 
@@ -1390,49 +1390,6 @@ static void touch_pud(struct vm_area_struct *vma, unsigned long addr,
 		update_mmu_cache_pud(vma, addr, pud);
 }
 
-struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
-		pud_t *pud, int flags, struct dev_pagemap **pgmap)
-{
-	unsigned long pfn = pud_pfn(*pud);
-	struct mm_struct *mm = vma->vm_mm;
-	struct page *page;
-	int ret;
-
-	assert_spin_locked(pud_lockptr(mm, pud));
-
-	if (flags & FOLL_WRITE && !pud_write(*pud))
-		return NULL;
-
-	if (pud_present(*pud) && pud_devmap(*pud))
-		/* pass */;
-	else
-		return NULL;
-
-	if (flags & FOLL_TOUCH)
-		touch_pud(vma, addr, pud, flags & FOLL_WRITE);
-
-	/*
-	 * device mapped pages can only be returned if the
-	 * caller will manage the page reference count.
-	 *
-	 * At least one of FOLL_GET | FOLL_PIN must be set, so assert that here:
-	 */
-	if (!(flags & (FOLL_GET | FOLL_PIN)))
-		return ERR_PTR(-EEXIST);
-
-	pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT;
-	*pgmap = get_dev_pagemap(pfn, *pgmap);
-	if (!*pgmap)
-		return ERR_PTR(-EFAULT);
-	page = pfn_to_page(pfn);
-
-	ret = try_grab_page(page, flags);
-	if (ret)
-		page = ERR_PTR(ret);
-
-	return page;
-}
-
 int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		  pud_t *dst_pud, pud_t *src_pud, unsigned long addr,
 		  struct vm_area_struct *vma)
diff --git a/mm/internal.h b/mm/internal.h
index 6c8d3844b6a3..eee8c82740b5 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1111,6 +1111,8 @@ int __must_check try_grab_page(struct page *page, unsigned int flags);
 /*
  * mm/huge_memory.c
  */
+void touch_pud(struct vm_area_struct *vma, unsigned long addr,
+	       pud_t *pud, bool write);
 struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 				   unsigned long addr, pmd_t *pmd,
 				   unsigned int flags);
-- 
2.44.0


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH v4 10/13] mm/gup: Handle huge pud for follow_pud_mask()
@ 2024-03-27 15:23   ` peterx
  0 siblings, 0 replies; 160+ messages in thread
From: peterx @ 2024-03-27 15:23 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Yang Shi, Kirill A . Shutemov, Mike Kravetz, John Hubbard,
	Michael Ellerman, peterx, Andrew Jones, Muchun Song, linux-riscv,
	linuxppc-dev, Christophe Leroy, Andrew Morton, Christoph Hellwig,
	Lorenzo Stoakes, Matthew Wilcox, Rik van Riel, linux-arm-kernel,
	Andrea Arcangeli, David Hildenbrand, Aneesh Kumar K . V,
	Vlastimil Babka, James Houghton, Jason Gunthorpe, Mike Rapoport,
	Axel Rasmussen

From: Peter Xu <peterx@redhat.com>

Teach follow_pud_mask() to be able to handle normal PUD pages like hugetlb.

Rename follow_devmap_pud() to follow_huge_pud() so that it can process
either huge devmap or hugetlb. Move it out of TRANSPARENT_HUGEPAGE_PUD and
and huge_memory.c (which relies on CONFIG_THP).  Switch to pud_leaf() to
detect both cases in the slow gup.

In the new follow_huge_pud(), taking care of possible CoR for hugetlb if
necessary.  touch_pud() needs to be moved out of huge_memory.c to be
accessable from gup.c even if !THP.

Since at it, optimize the non-present check by adding a pud_present() early
check before taking the pgtable lock, failing the follow_page() early if
PUD is not present: that is required by both devmap or hugetlb.  Use
pud_huge() to also cover the pud_devmap() case.

One more trivial thing to mention is, introduce "pud_t pud" in the code
paths along the way, so the code doesn't dereference *pudp multiple time.
Not only because that looks less straightforward, but also because if the
dereference really happened, it's not clear whether there can be race to
see different *pudp values when it's being modified at the same time.

Setting ctx->page_mask properly for a PUD entry.  As a side effect, this
patch should also be able to optimize devmap GUP on PUD to be able to jump
over the whole PUD range, but not yet verified.  Hugetlb already can do so
prior to this patch.

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/huge_mm.h |  8 -----
 mm/gup.c                | 70 +++++++++++++++++++++++++++++++++++++++--
 mm/huge_memory.c        | 47 ++-------------------------
 mm/internal.h           |  2 ++
 4 files changed, 71 insertions(+), 56 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index d3bb25c39482..3f36511bdc02 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -351,8 +351,6 @@ static inline bool folio_test_pmd_mappable(struct folio *folio)
 
 struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
 		pmd_t *pmd, int flags, struct dev_pagemap **pgmap);
-struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
-		pud_t *pud, int flags, struct dev_pagemap **pgmap);
 
 vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf);
 
@@ -507,12 +505,6 @@ static inline struct page *follow_devmap_pmd(struct vm_area_struct *vma,
 	return NULL;
 }
 
-static inline struct page *follow_devmap_pud(struct vm_area_struct *vma,
-	unsigned long addr, pud_t *pud, int flags, struct dev_pagemap **pgmap)
-{
-	return NULL;
-}
-
 static inline bool thp_migration_supported(void)
 {
 	return false;
diff --git a/mm/gup.c b/mm/gup.c
index 26b8cca24077..1e5d42211bb4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -525,6 +525,70 @@ static struct page *no_page_table(struct vm_area_struct *vma,
 	return NULL;
 }
 
+#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
+static struct page *follow_huge_pud(struct vm_area_struct *vma,
+				    unsigned long addr, pud_t *pudp,
+				    int flags, struct follow_page_context *ctx)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct page *page;
+	pud_t pud = *pudp;
+	unsigned long pfn = pud_pfn(pud);
+	int ret;
+
+	assert_spin_locked(pud_lockptr(mm, pudp));
+
+	if ((flags & FOLL_WRITE) && !pud_write(pud))
+		return NULL;
+
+	if (!pud_present(pud))
+		return NULL;
+
+	pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT;
+
+	if (IS_ENABLED(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) &&
+	    pud_devmap(pud)) {
+		/*
+		 * device mapped pages can only be returned if the caller
+		 * will manage the page reference count.
+		 *
+		 * At least one of FOLL_GET | FOLL_PIN must be set, so
+		 * assert that here:
+		 */
+		if (!(flags & (FOLL_GET | FOLL_PIN)))
+			return ERR_PTR(-EEXIST);
+
+		if (flags & FOLL_TOUCH)
+			touch_pud(vma, addr, pudp, flags & FOLL_WRITE);
+
+		ctx->pgmap = get_dev_pagemap(pfn, ctx->pgmap);
+		if (!ctx->pgmap)
+			return ERR_PTR(-EFAULT);
+	}
+
+	page = pfn_to_page(pfn);
+
+	if (!pud_devmap(pud) && !pud_write(pud) &&
+	    gup_must_unshare(vma, flags, page))
+		return ERR_PTR(-EMLINK);
+
+	ret = try_grab_page(page, flags);
+	if (ret)
+		page = ERR_PTR(ret);
+	else
+		ctx->page_mask = HPAGE_PUD_NR - 1;
+
+	return page;
+}
+#else  /* CONFIG_PGTABLE_HAS_HUGE_LEAVES */
+static struct page *follow_huge_pud(struct vm_area_struct *vma,
+				    unsigned long addr, pud_t *pudp,
+				    int flags, struct follow_page_context *ctx)
+{
+	return NULL;
+}
+#endif	/* CONFIG_PGTABLE_HAS_HUGE_LEAVES */
+
 static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
 		pte_t *pte, unsigned int flags)
 {
@@ -760,11 +824,11 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma,
 
 	pudp = pud_offset(p4dp, address);
 	pud = READ_ONCE(*pudp);
-	if (pud_none(pud))
+	if (!pud_present(pud))
 		return no_page_table(vma, flags, address);
-	if (pud_devmap(pud)) {
+	if (pud_leaf(pud)) {
 		ptl = pud_lock(mm, pudp);
-		page = follow_devmap_pud(vma, address, pudp, flags, &ctx->pgmap);
+		page = follow_huge_pud(vma, address, pudp, flags, ctx);
 		spin_unlock(ptl);
 		if (page)
 			return page;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bc6fa82d9815..2979198d7b71 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1377,8 +1377,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 }
 
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
-static void touch_pud(struct vm_area_struct *vma, unsigned long addr,
-		      pud_t *pud, bool write)
+void touch_pud(struct vm_area_struct *vma, unsigned long addr,
+	       pud_t *pud, bool write)
 {
 	pud_t _pud;
 
@@ -1390,49 +1390,6 @@ static void touch_pud(struct vm_area_struct *vma, unsigned long addr,
 		update_mmu_cache_pud(vma, addr, pud);
 }
 
-struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
-		pud_t *pud, int flags, struct dev_pagemap **pgmap)
-{
-	unsigned long pfn = pud_pfn(*pud);
-	struct mm_struct *mm = vma->vm_mm;
-	struct page *page;
-	int ret;
-
-	assert_spin_locked(pud_lockptr(mm, pud));
-
-	if (flags & FOLL_WRITE && !pud_write(*pud))
-		return NULL;
-
-	if (pud_present(*pud) && pud_devmap(*pud))
-		/* pass */;
-	else
-		return NULL;
-
-	if (flags & FOLL_TOUCH)
-		touch_pud(vma, addr, pud, flags & FOLL_WRITE);
-
-	/*
-	 * device mapped pages can only be returned if the
-	 * caller will manage the page reference count.
-	 *
-	 * At least one of FOLL_GET | FOLL_PIN must be set, so assert that here:
-	 */
-	if (!(flags & (FOLL_GET | FOLL_PIN)))
-		return ERR_PTR(-EEXIST);
-
-	pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT;
-	*pgmap = get_dev_pagemap(pfn, *pgmap);
-	if (!*pgmap)
-		return ERR_PTR(-EFAULT);
-	page = pfn_to_page(pfn);
-
-	ret = try_grab_page(page, flags);
-	if (ret)
-		page = ERR_PTR(ret);
-
-	return page;
-}
-
 int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		  pud_t *dst_pud, pud_t *src_pud, unsigned long addr,
 		  struct vm_area_struct *vma)
diff --git a/mm/internal.h b/mm/internal.h
index 6c8d3844b6a3..eee8c82740b5 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1111,6 +1111,8 @@ int __must_check try_grab_page(struct page *page, unsigned int flags);
 /*
  * mm/huge_memory.c
  */
+void touch_pud(struct vm_area_struct *vma, unsigned long addr,
+	       pud_t *pud, bool write);
 struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 				   unsigned long addr, pmd_t *pmd,
 				   unsigned int flags);
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH v4 11/13] mm/gup: Handle huge pmd for follow_pmd_mask()
@ 2024-03-27 15:23   ` peterx
  0 siblings, 0 replies; 160+ messages in thread
From: peterx @ 2024-03-27 15:23 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Yang Shi, Kirill A . Shutemov, Mike Kravetz, John Hubbard,
	Michael Ellerman, peterx, Andrew Jones, Muchun Song, linux-riscv,
	linuxppc-dev, Christophe Leroy, Andrew Morton, Christoph Hellwig,
	Lorenzo Stoakes, Matthew Wilcox, Rik van Riel, linux-arm-kernel,
	Andrea Arcangeli, David Hildenbrand, Aneesh Kumar K . V,
	Vlastimil Babka, James Houghton, Jason Gunthorpe, Mike Rapoport,
	Axel Rasmussen

From: Peter Xu <peterx@redhat.com>

Replace pmd_trans_huge() with pmd_leaf() so that it also covers pmd_huge()
whenever that is enabled.
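
For illustration (editor's sketch, not part of the patch): pmd_leaf()
matches any huge leaf at the PMD level, so the slow-gup dispatch
conceptually becomes:

	/* pmd_leaf() covers both THP and hugetlb leaves;
	 * pmd_trans_huge() used to match THP only. */
	if (pmd_leaf(pmdval))
		page = follow_huge_pmd(vma, address, pmd, flags, ctx);
	else
		page = follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);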

FOLL_TOUCH and FOLL_SPLIT_PMD only apply to THP, not to hugetlb huge pages
yet.
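
Both flags therefore stay gated on pmd_trans_huge(), as in these two hunks
quoted from the patch:

	if (pmd_trans_huge(pmdval) && (flags & FOLL_SPLIT_PMD)) {
		spin_unlock(ptl);
		split_huge_pmd(vma, pmd, address);
		...
	}

	if (pmd_trans_huge(pmdval) && (flags & FOLL_TOUCH))
		touch_pmd(vma, addr, pmd, flags & FOLL_WRITE);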

Since follow_trans_huge_pmd() can now process hugetlb pages, rename it to
follow_huge_pmd() to match what it does.  Move it into gup.c so it no
longer depends on CONFIG_TRANSPARENT_HUGEPAGE.

While at it, move the ctx->page_mask setup into follow_huge_pmd() and only
set it when the page is valid.  It was not a bug to set it before even if
GUP failed (page==NULL), because follow_page_mask() callers always ignore
page_mask in that case; but doing it this way makes the code cleaner.
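
The reason page_mask is worth setting at all: __get_user_pages() consumes
it to step over the whole huge mapping in one go (abridged from the
existing caller):

	/* skip the remaining subpages of this huge page in one step */
	page_increm = 1 + (~(start >> PAGE_SHIFT) & ctx.page_mask);
	start += page_increm * PAGE_SIZE;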

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/gup.c         | 107 ++++++++++++++++++++++++++++++++++++++++++++---
 mm/huge_memory.c |  86 +------------------------------------
 mm/internal.h    |   5 +--
 3 files changed, 105 insertions(+), 93 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 1e5d42211bb4..a81184b01276 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -580,6 +580,93 @@ static struct page *follow_huge_pud(struct vm_area_struct *vma,
 
 	return page;
 }
+
+/* FOLL_FORCE can write to even unwritable PMDs in COW mappings. */
+static inline bool can_follow_write_pmd(pmd_t pmd, struct page *page,
+					struct vm_area_struct *vma,
+					unsigned int flags)
+{
+	/* If the pmd is writable, we can write to the page. */
+	if (pmd_write(pmd))
+		return true;
+
+	/* Maybe FOLL_FORCE is set to override it? */
+	if (!(flags & FOLL_FORCE))
+		return false;
+
+	/* But FOLL_FORCE has no effect on shared mappings */
+	if (vma->vm_flags & (VM_MAYSHARE | VM_SHARED))
+		return false;
+
+	/* ... or read-only private ones */
+	if (!(vma->vm_flags & VM_MAYWRITE))
+		return false;
+
+	/* ... or already writable ones that just need to take a write fault */
+	if (vma->vm_flags & VM_WRITE)
+		return false;
+
+	/*
+	 * See can_change_pte_writable(): we broke COW and could map the page
+	 * writable if we have an exclusive anonymous page ...
+	 */
+	if (!page || !PageAnon(page) || !PageAnonExclusive(page))
+		return false;
+
+	/* ... and a write-fault isn't required for other reasons. */
+	if (vma_soft_dirty_enabled(vma) && !pmd_soft_dirty(pmd))
+		return false;
+	return !userfaultfd_huge_pmd_wp(vma, pmd);
+}
+
+static struct page *follow_huge_pmd(struct vm_area_struct *vma,
+				    unsigned long addr, pmd_t *pmd,
+				    unsigned int flags,
+				    struct follow_page_context *ctx)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pmd_t pmdval = *pmd;
+	struct page *page;
+	int ret;
+
+	assert_spin_locked(pmd_lockptr(mm, pmd));
+
+	page = pmd_page(pmdval);
+	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
+
+	if ((flags & FOLL_WRITE) &&
+	    !can_follow_write_pmd(pmdval, page, vma, flags))
+		return NULL;
+
+	/* Avoid dumping huge zero page */
+	if ((flags & FOLL_DUMP) && is_huge_zero_pmd(pmdval))
+		return ERR_PTR(-EFAULT);
+
+	if (pmd_protnone(*pmd) && !gup_can_follow_protnone(vma, flags))
+		return NULL;
+
+	if (!pmd_write(pmdval) && gup_must_unshare(vma, flags, page))
+		return ERR_PTR(-EMLINK);
+
+	VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
+			!PageAnonExclusive(page), page);
+
+	ret = try_grab_page(page, flags);
+	if (ret)
+		return ERR_PTR(ret);
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	if (pmd_trans_huge(pmdval) && (flags & FOLL_TOUCH))
+		touch_pmd(vma, addr, pmd, flags & FOLL_WRITE);
+#endif	/* CONFIG_TRANSPARENT_HUGEPAGE */
+
+	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
+	ctx->page_mask = HPAGE_PMD_NR - 1;
+	VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page);
+
+	return page;
+}
+
 #else  /* CONFIG_PGTABLE_HAS_HUGE_LEAVES */
 static struct page *follow_huge_pud(struct vm_area_struct *vma,
 				    unsigned long addr, pud_t *pudp,
@@ -587,6 +674,14 @@ static struct page *follow_huge_pud(struct vm_area_struct *vma,
 {
 	return NULL;
 }
+
+static struct page *follow_huge_pmd(struct vm_area_struct *vma,
+				    unsigned long addr, pmd_t *pmd,
+				    unsigned int flags,
+				    struct follow_page_context *ctx)
+{
+	return NULL;
+}
 #endif	/* CONFIG_PGTABLE_HAS_HUGE_LEAVES */
 
 static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
@@ -784,31 +879,31 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
 			return page;
 		return no_page_table(vma, flags, address);
 	}
-	if (likely(!pmd_trans_huge(pmdval)))
+	if (likely(!pmd_leaf(pmdval)))
 		return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
 
 	if (pmd_protnone(pmdval) && !gup_can_follow_protnone(vma, flags))
 		return no_page_table(vma, flags, address);
 
 	ptl = pmd_lock(mm, pmd);
-	if (unlikely(!pmd_present(*pmd))) {
+	pmdval = *pmd;
+	if (unlikely(!pmd_present(pmdval))) {
 		spin_unlock(ptl);
 		return no_page_table(vma, flags, address);
 	}
-	if (unlikely(!pmd_trans_huge(*pmd))) {
+	if (unlikely(!pmd_leaf(pmdval))) {
 		spin_unlock(ptl);
 		return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
 	}
-	if (flags & FOLL_SPLIT_PMD) {
+	if (pmd_trans_huge(pmdval) && (flags & FOLL_SPLIT_PMD)) {
 		spin_unlock(ptl);
 		split_huge_pmd(vma, pmd, address);
 		/* If pmd was left empty, stuff a page table in there quickly */
 		return pte_alloc(mm, pmd) ? ERR_PTR(-ENOMEM) :
 			follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
 	}
-	page = follow_trans_huge_pmd(vma, address, pmd, flags);
+	page = follow_huge_pmd(vma, address, pmd, flags, ctx);
 	spin_unlock(ptl);
-	ctx->page_mask = HPAGE_PMD_NR - 1;
 	return page;
 }
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2979198d7b71..ed0d82c4b829 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1220,8 +1220,8 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write)
 EXPORT_SYMBOL_GPL(vmf_insert_pfn_pud);
 #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 
-static void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
-		      pmd_t *pmd, bool write)
+void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
+	       pmd_t *pmd, bool write)
 {
 	pmd_t _pmd;
 
@@ -1576,88 +1576,6 @@ static inline bool can_change_pmd_writable(struct vm_area_struct *vma,
 	return pmd_dirty(pmd);
 }
 
-/* FOLL_FORCE can write to even unwritable PMDs in COW mappings. */
-static inline bool can_follow_write_pmd(pmd_t pmd, struct page *page,
-					struct vm_area_struct *vma,
-					unsigned int flags)
-{
-	/* If the pmd is writable, we can write to the page. */
-	if (pmd_write(pmd))
-		return true;
-
-	/* Maybe FOLL_FORCE is set to override it? */
-	if (!(flags & FOLL_FORCE))
-		return false;
-
-	/* But FOLL_FORCE has no effect on shared mappings */
-	if (vma->vm_flags & (VM_MAYSHARE | VM_SHARED))
-		return false;
-
-	/* ... or read-only private ones */
-	if (!(vma->vm_flags & VM_MAYWRITE))
-		return false;
-
-	/* ... or already writable ones that just need to take a write fault */
-	if (vma->vm_flags & VM_WRITE)
-		return false;
-
-	/*
-	 * See can_change_pte_writable(): we broke COW and could map the page
-	 * writable if we have an exclusive anonymous page ...
-	 */
-	if (!page || !PageAnon(page) || !PageAnonExclusive(page))
-		return false;
-
-	/* ... and a write-fault isn't required for other reasons. */
-	if (vma_soft_dirty_enabled(vma) && !pmd_soft_dirty(pmd))
-		return false;
-	return !userfaultfd_huge_pmd_wp(vma, pmd);
-}
-
-struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
-				   unsigned long addr,
-				   pmd_t *pmd,
-				   unsigned int flags)
-{
-	struct mm_struct *mm = vma->vm_mm;
-	struct page *page;
-	int ret;
-
-	assert_spin_locked(pmd_lockptr(mm, pmd));
-
-	page = pmd_page(*pmd);
-	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
-
-	if ((flags & FOLL_WRITE) &&
-	    !can_follow_write_pmd(*pmd, page, vma, flags))
-		return NULL;
-
-	/* Avoid dumping huge zero page */
-	if ((flags & FOLL_DUMP) && is_huge_zero_pmd(*pmd))
-		return ERR_PTR(-EFAULT);
-
-	if (pmd_protnone(*pmd) && !gup_can_follow_protnone(vma, flags))
-		return NULL;
-
-	if (!pmd_write(*pmd) && gup_must_unshare(vma, flags, page))
-		return ERR_PTR(-EMLINK);
-
-	VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
-			!PageAnonExclusive(page), page);
-
-	ret = try_grab_page(page, flags);
-	if (ret)
-		return ERR_PTR(ret);
-
-	if (flags & FOLL_TOUCH)
-		touch_pmd(vma, addr, pmd, flags & FOLL_WRITE);
-
-	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
-	VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page);
-
-	return page;
-}
-
 /* NUMA hinting page fault entry point for trans huge pmds */
 vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 {
diff --git a/mm/internal.h b/mm/internal.h
index eee8c82740b5..e10ecc6594f1 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1113,9 +1113,8 @@ int __must_check try_grab_page(struct page *page, unsigned int flags);
  */
 void touch_pud(struct vm_area_struct *vma, unsigned long addr,
 	       pud_t *pud, bool write);
-struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
-				   unsigned long addr, pmd_t *pmd,
-				   unsigned int flags);
+void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
+	       pmd_t *pmd, bool write);
 
 /*
  * mm/mmap.c
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH v4 12/13] mm/gup: Handle hugepd for follow_page()
@ 2024-03-27 15:23   ` peterx
  0 siblings, 0 replies; 160+ messages in thread
From: peterx @ 2024-03-27 15:23 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Yang Shi, Kirill A . Shutemov, Mike Kravetz, John Hubbard,
	Michael Ellerman, peterx, Andrew Jones, Muchun Song, linux-riscv,
	linuxppc-dev, Christophe Leroy, Andrew Morton, Christoph Hellwig,
	Lorenzo Stoakes, Matthew Wilcox, Rik van Riel, linux-arm-kernel,
	Andrea Arcangeli, David Hildenbrand, Aneesh Kumar K . V,
	Vlastimil Babka, James Houghton, Jason Gunthorpe, Mike Rapoport,
	Axel Rasmussen

From: Peter Xu <peterx@redhat.com>

Hugepd is so far only used on PowerPC, on 4K page size kernels where the
hash MMU is used.  follow_page_mask() used to leverage hugetlb APIs to
access hugepd entries; teach follow_page_mask() to handle hugepd itself.
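
For readers unfamiliar with hugepd: it is a software-only page directory
format that no hardware ever walks; the mapping size is encoded in the
entry itself.  A minimal sketch using the kernel's helpers:

	/* recover the huge page size and the first hugepte slot */
	unsigned long sz = 1UL << hugepd_shift(hugepd);
	pte_t *ptep = hugepte_offset(hugepd, addr, pdshift);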

With the previous refactoring of fast-gup's gup_huge_pd(), most of that
code can be reused.  Parts of it are not strictly needed for follow_page();
for example, gup_hugepte() tries to detect pgtable entry changes (quoted
below), which can never happen with slow gup because the pgtable lock is
held, but the check is harmless to keep.
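
The re-check in question is this one in gup_hugepte() (quoted from the
patch); with the pgtable lock held it can never fire, it is merely
redundant:

	if (unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) {
		gup_put_folio(folio, refs, flags);
		return 0;
	}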

Since follow_page() only ever fetches one page, setting the end to
"address + PAGE_SIZE" should suffice.  We will still only do the pgtable
walk once for each hugetlb page, by setting ctx->page_mask properly.
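
As a worked example with hypothetical numbers: a 16M hugetlb page on a 4K
base-page kernel has huge_page_order(h) == 12, so follow_hugepd() sets

	ctx->page_mask = (1U << huge_page_order(h)) - 1;	/* == 4095 */

and a single walk covers all 4096 subpages.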

One thing worth mentioning is that some levels of the pgtable _bad()
helpers will report is_hugepd() entries as TRUE on Power8 hash MMUs; I
think it at least applies to the PUD level on Power8 with 4K pgsize.  It
means feeding a hugepd entry to pud_bad() will produce a false positive.
Let's leave that for now, because it can be arch-specific and I am a bit
reluctant to touch it.  In this patch it is not a problem, as long as
hugepd is detected before any bad pgtable entries.
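
That ordering requirement is visible in the p4d-level hunk (quoted from
the patch, comment added): the hugepd test must come before the _bad()
check, which would otherwise false-positive:

	if (unlikely(is_hugepd(__hugepd(p4d_val(p4d)))))
		return follow_hugepd(vma, __hugepd(p4d_val(p4d)),
				     address, P4D_SHIFT, flags, ctx);

	/* safe only after the hugepd test above */
	if (!p4d_present(p4d) || p4d_bad(p4d))
		return no_page_table(vma, flags, address);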

To allow slow-gup paths like follow_*_page() to access the hugepd helpers,
the hugepd code is moved to the top of the file.  Besides that, the helper
record_subpages() is now used by both hugepd and fast-gup, so to avoid
"unused function" warnings we unfortunately must wrap it in an #ifdef.
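
Concretely, the guard compiles record_subpages() only when at least one of
its two users exists (signature as in the patch; body elided here):

#if defined(CONFIG_ARCH_HAS_HUGEPD) || defined(CONFIG_HAVE_FAST_GUP)
static int record_subpages(struct page *page, unsigned long sz,
			   unsigned long addr, unsigned long end,
			   struct page **pages);
#endif	/* CONFIG_ARCH_HAS_HUGEPD || CONFIG_HAVE_FAST_GUP */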

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/gup.c | 269 +++++++++++++++++++++++++++++++++----------------------
 1 file changed, 163 insertions(+), 106 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index a81184b01276..a02463c9420e 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -500,6 +500,149 @@ static inline void mm_set_has_pinned_flag(unsigned long *mm_flags)
 }
 
 #ifdef CONFIG_MMU
+
+#if defined(CONFIG_ARCH_HAS_HUGEPD) || defined(CONFIG_HAVE_FAST_GUP)
+static int record_subpages(struct page *page, unsigned long sz,
+			   unsigned long addr, unsigned long end,
+			   struct page **pages)
+{
+	struct page *start_page;
+	int nr;
+
+	start_page = nth_page(page, (addr & (sz - 1)) >> PAGE_SHIFT);
+	for (nr = 0; addr != end; nr++, addr += PAGE_SIZE)
+		pages[nr] = nth_page(start_page, nr);
+
+	return nr;
+}
+#endif	/* CONFIG_ARCH_HAS_HUGEPD || CONFIG_HAVE_FAST_GUP */
+
+#ifdef CONFIG_ARCH_HAS_HUGEPD
+static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
+				      unsigned long sz)
+{
+	unsigned long __boundary = (addr + sz) & ~(sz-1);
+	return (__boundary - 1 < end - 1) ? __boundary : end;
+}
+
+static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
+		       unsigned long end, unsigned int flags,
+		       struct page **pages, int *nr)
+{
+	unsigned long pte_end;
+	struct page *page;
+	struct folio *folio;
+	pte_t pte;
+	int refs;
+
+	pte_end = (addr + sz) & ~(sz-1);
+	if (pte_end < end)
+		end = pte_end;
+
+	pte = huge_ptep_get(ptep);
+
+	if (!pte_access_permitted(pte, flags & FOLL_WRITE))
+		return 0;
+
+	/* hugepages are never "special" */
+	VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
+
+	page = pte_page(pte);
+	refs = record_subpages(page, sz, addr, end, pages + *nr);
+
+	folio = try_grab_folio(page, refs, flags);
+	if (!folio)
+		return 0;
+
+	if (unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) {
+		gup_put_folio(folio, refs, flags);
+		return 0;
+	}
+
+	if (!pte_write(pte) && gup_must_unshare(NULL, flags, &folio->page)) {
+		gup_put_folio(folio, refs, flags);
+		return 0;
+	}
+
+	*nr += refs;
+	folio_set_referenced(folio);
+	return 1;
+}
+
+/*
+ * NOTE: currently GUP for a hugepd is only possible on hugetlbfs file
+ * systems on Power, which does not have issue with folio writeback against
+ * GUP updates.  When hugepd will be extended to support non-hugetlbfs or
+ * even anonymous memory, we need to do extra check as what we do with most
+ * of the other folios. See writable_file_mapping_allowed() and
+ * folio_fast_pin_allowed() for more information.
+ */
+static int gup_huge_pd(hugepd_t hugepd, unsigned long addr,
+		unsigned int pdshift, unsigned long end, unsigned int flags,
+		struct page **pages, int *nr)
+{
+	pte_t *ptep;
+	unsigned long sz = 1UL << hugepd_shift(hugepd);
+	unsigned long next;
+
+	ptep = hugepte_offset(hugepd, addr, pdshift);
+	do {
+		next = hugepte_addr_end(addr, end, sz);
+		if (!gup_hugepte(ptep, sz, addr, end, flags, pages, nr))
+			return 0;
+	} while (ptep++, addr = next, addr != end);
+
+	return 1;
+}
+
+static struct page *follow_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
+				  unsigned long addr, unsigned int pdshift,
+				  unsigned int flags,
+				  struct follow_page_context *ctx)
+{
+	struct page *page;
+	struct hstate *h;
+	spinlock_t *ptl;
+	int nr = 0, ret;
+	pte_t *ptep;
+
+	/* Only hugetlb supports hugepd */
+	if (WARN_ON_ONCE(!is_vm_hugetlb_page(vma)))
+		return ERR_PTR(-EFAULT);
+
+	h = hstate_vma(vma);
+	ptep = hugepte_offset(hugepd, addr, pdshift);
+	ptl = huge_pte_lock(h, vma->vm_mm, ptep);
+	ret = gup_huge_pd(hugepd, addr, pdshift, addr + PAGE_SIZE,
+			  flags, &page, &nr);
+	spin_unlock(ptl);
+
+	if (ret) {
+		WARN_ON_ONCE(nr != 1);
+		ctx->page_mask = (1U << huge_page_order(h)) - 1;
+		return page;
+	}
+
+	return NULL;
+}
+#else /* CONFIG_ARCH_HAS_HUGEPD */
+static inline int gup_huge_pd(hugepd_t hugepd, unsigned long addr,
+		unsigned int pdshift, unsigned long end, unsigned int flags,
+		struct page **pages, int *nr)
+{
+	return 0;
+}
+
+static struct page *follow_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
+				  unsigned long addr, unsigned int pdshift,
+				  unsigned int flags,
+				  struct follow_page_context *ctx)
+{
+	return NULL;
+}
+#endif /* CONFIG_ARCH_HAS_HUGEPD */
+
+
 static struct page *no_page_table(struct vm_area_struct *vma,
 				  unsigned int flags, unsigned long address)
 {
@@ -871,6 +1014,9 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
 		return no_page_table(vma, flags, address);
 	if (!pmd_present(pmdval))
 		return no_page_table(vma, flags, address);
+	if (unlikely(is_hugepd(__hugepd(pmd_val(pmdval)))))
+		return follow_hugepd(vma, __hugepd(pmd_val(pmdval)),
+				     address, PMD_SHIFT, flags, ctx);
 	if (pmd_devmap(pmdval)) {
 		ptl = pmd_lock(mm, pmd);
 		page = follow_devmap_pmd(vma, address, pmd, flags, &ctx->pgmap);
@@ -921,6 +1067,9 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma,
 	pud = READ_ONCE(*pudp);
 	if (!pud_present(pud))
 		return no_page_table(vma, flags, address);
+	if (unlikely(is_hugepd(__hugepd(pud_val(pud)))))
+		return follow_hugepd(vma, __hugepd(pud_val(pud)),
+				     address, PUD_SHIFT, flags, ctx);
 	if (pud_leaf(pud)) {
 		ptl = pud_lock(mm, pudp);
 		page = follow_huge_pud(vma, address, pudp, flags, ctx);
@@ -944,10 +1093,13 @@ static struct page *follow_p4d_mask(struct vm_area_struct *vma,
 
 	p4dp = p4d_offset(pgdp, address);
 	p4d = READ_ONCE(*p4dp);
-	if (!p4d_present(p4d))
-		return no_page_table(vma, flags, address);
 	BUILD_BUG_ON(p4d_leaf(p4d));
-	if (unlikely(p4d_bad(p4d)))
+
+	if (unlikely(is_hugepd(__hugepd(p4d_val(p4d)))))
+		return follow_hugepd(vma, __hugepd(p4d_val(p4d)),
+				     address, P4D_SHIFT, flags, ctx);
+
+	if (!p4d_present(p4d) || p4d_bad(p4d))
 		return no_page_table(vma, flags, address);
 
 	return follow_pud_mask(vma, address, p4dp, flags, ctx);
@@ -997,10 +1149,15 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
 
 	pgd = pgd_offset(mm, address);
 
-	if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
-		return no_page_table(vma, flags, address);
+	if (unlikely(is_hugepd(__hugepd(pgd_val(*pgd)))))
+		page = follow_hugepd(vma, __hugepd(pgd_val(*pgd)),
+				     address, PGDIR_SHIFT, flags, ctx);
+	else if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
+		page = no_page_table(vma, flags, address);
+	else
+		page = follow_p4d_mask(vma, address, pgd, flags, ctx);
 
-	return follow_p4d_mask(vma, address, pgd, flags, ctx);
+	return page;
 }
 
 struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
@@ -2947,106 +3104,6 @@ static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
 }
 #endif
 
-static int record_subpages(struct page *page, unsigned long sz,
-			   unsigned long addr, unsigned long end,
-			   struct page **pages)
-{
-	struct page *start_page;
-	int nr;
-
-	start_page = nth_page(page, (addr & (sz - 1)) >> PAGE_SHIFT);
-	for (nr = 0; addr != end; nr++, addr += PAGE_SIZE)
-		pages[nr] = nth_page(start_page, nr);
-
-	return nr;
-}
-
-#ifdef CONFIG_ARCH_HAS_HUGEPD
-static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
-				      unsigned long sz)
-{
-	unsigned long __boundary = (addr + sz) & ~(sz-1);
-	return (__boundary - 1 < end - 1) ? __boundary : end;
-}
-
-static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
-		       unsigned long end, unsigned int flags,
-		       struct page **pages, int *nr)
-{
-	unsigned long pte_end;
-	struct page *page;
-	struct folio *folio;
-	pte_t pte;
-	int refs;
-
-	pte_end = (addr + sz) & ~(sz-1);
-	if (pte_end < end)
-		end = pte_end;
-
-	pte = huge_ptep_get(ptep);
-
-	if (!pte_access_permitted(pte, flags & FOLL_WRITE))
-		return 0;
-
-	/* hugepages are never "special" */
-	VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
-
-	page = pte_page(pte);
-	refs = record_subpages(page, sz, addr, end, pages + *nr);
-
-	folio = try_grab_folio(page, refs, flags);
-	if (!folio)
-		return 0;
-
-	if (unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) {
-		gup_put_folio(folio, refs, flags);
-		return 0;
-	}
-
-	if (!pte_write(pte) && gup_must_unshare(NULL, flags, &folio->page)) {
-		gup_put_folio(folio, refs, flags);
-		return 0;
-	}
-
-	*nr += refs;
-	folio_set_referenced(folio);
-	return 1;
-}
-
-/*
- * NOTE: currently GUP for a hugepd is only possible on hugetlbfs file
- * systems on Power, which does not have issue with folio writeback against
- * GUP updates.  When hugepd will be extended to support non-hugetlbfs or
- * even anonymous memory, we need to do extra check as what we do with most
- * of the other folios. See writable_file_mapping_allowed() and
- * folio_fast_pin_allowed() for more information.
- */
-static int gup_huge_pd(hugepd_t hugepd, unsigned long addr,
-		unsigned int pdshift, unsigned long end, unsigned int flags,
-		struct page **pages, int *nr)
-{
-	pte_t *ptep;
-	unsigned long sz = 1UL << hugepd_shift(hugepd);
-	unsigned long next;
-
-	ptep = hugepte_offset(hugepd, addr, pdshift);
-	do {
-		next = hugepte_addr_end(addr, end, sz);
-		if (!gup_hugepte(ptep, sz, addr, end, flags, pages, nr))
-			return 0;
-	} while (ptep++, addr = next, addr != end);
-
-	return 1;
-}
-#else
-static inline int gup_huge_pd(hugepd_t hugepd, unsigned long addr,
-		unsigned int pdshift, unsigned long end, unsigned int flags,
-		struct page **pages, int *nr)
-{
-	return 0;
-}
-#endif /* CONFIG_ARCH_HAS_HUGEPD */
-
 static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 			unsigned long end, unsigned int flags,
 			struct page **pages, int *nr)
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH v4 13/13] mm/gup: Handle hugetlb in the generic follow_page_mask code
@ 2024-03-27 15:23   ` peterx
  0 siblings, 0 replies; 160+ messages in thread
From: peterx @ 2024-03-27 15:23 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Yang Shi, Kirill A . Shutemov, Mike Kravetz, John Hubbard,
	Michael Ellerman, peterx, Andrew Jones, Muchun Song, linux-riscv,
	linuxppc-dev, Christophe Leroy, Andrew Morton, Christoph Hellwig,
	Lorenzo Stoakes, Matthew Wilcox, Rik van Riel, linux-arm-kernel,
	Andrea Arcangeli, David Hildenbrand, Aneesh Kumar K . V,
	Vlastimil Babka, James Houghton, Jason Gunthorpe, Mike Rapoport,
	Axel Rasmussen

From: Peter Xu <peterx@redhat.com>

Now follow_page() is ready to handle hugetlb pages in whatever form, across
all architectures.  Switch to the generic code path.

Time to retire hugetlb_follow_page_mask(), following the previous
retirement of follow_hugetlb_page() in 4849807114b8.

There may be a slight difference in how the loops run when processing slow
GUP over a large hugetlb range on archs supporting cont_pte/cont_pmd: with
the patch applied, each loop of __get_user_pages() will resolve one pgtable
entry, rather than relying on the size of the hugetlb hstate, which may
cover multiple entries in one loop.
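
As a concrete illustration (assuming arm64's cont-pte layout with 4K base
pages, where one contiguous range spans 16 PTEs): a single 64K cont-pte
hugetlb page used to be resolved in one loop iteration via the hstate
size, while the generic walk now takes 16 iterations, one per PTE.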

A quick performance test on an aarch64 VM on an M1 chip shows a 15%
degradation over a tight loop of slow gup after the path switch.  That
shouldn't be a problem, because slow gup should not be a hot path for GUP
in general: when the page is commonly present, fast gup will already
succeed, while when the page is indeed missing and requires a follow-up
page fault, the slow-gup degradation will probably be buried in the fault
paths anyway.  It also explains why slow gup for THP used to be very slow
before 57edfcfd3419 ("mm/gup: accelerate thp gup even for "pages !=
NULL"") landed; the latter was not part of a performance analysis but a
side benefit.  If performance becomes a concern, we can consider handling
CONT_PTE in follow_page().

Until that is justified as necessary, keep everything clean and simple.
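
To make that option concrete, widening ctx->page_mask for a contiguous
mapping could look roughly like the sketch below.  This is a minimal
illustration only, not code from this series: pte_cont() and CONT_PTES
are arm64-specific today, so a generic version would need an arch hook.

	/* Illustrative sketch only -- not part of this series. */
	static void follow_cont_pte_fixup(pte_t pte,
					  struct follow_page_context *ctx)
	{
		/* pte_cont()/CONT_PTES: arm64's contiguous-PTE helpers */
		if (pte_cont(pte))
			ctx->page_mask = CONT_PTES - 1;
	}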

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/hugetlb.h |  7 ----
 mm/gup.c                | 15 +++------
 mm/hugetlb.c            | 71 -----------------------------------------
 3 files changed, 5 insertions(+), 88 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 294c78b3549f..a546140f89cd 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -328,13 +328,6 @@ static inline void hugetlb_zap_end(
 {
 }
 
-static inline struct page *hugetlb_follow_page_mask(
-    struct vm_area_struct *vma, unsigned long address, unsigned int flags,
-    unsigned int *page_mask)
-{
-	BUILD_BUG(); /* should never be compiled in if !CONFIG_HUGETLB_PAGE*/
-}
-
 static inline int copy_hugetlb_page_range(struct mm_struct *dst,
 					  struct mm_struct *src,
 					  struct vm_area_struct *dst_vma,
diff --git a/mm/gup.c b/mm/gup.c
index a02463c9420e..c803d0b0f358 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1135,18 +1135,11 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
 {
 	pgd_t *pgd;
 	struct mm_struct *mm = vma->vm_mm;
+	struct page *page;
 
-	ctx->page_mask = 0;
-
-	/*
-	 * Call hugetlb_follow_page_mask for hugetlb vmas as it will use
-	 * special hugetlb page table walking code.  This eliminates the
-	 * need to check for hugetlb entries in the general walking code.
-	 */
-	if (is_vm_hugetlb_page(vma))
-		return hugetlb_follow_page_mask(vma, address, flags,
-						&ctx->page_mask);
+	vma_pgtable_walk_begin(vma);
 
+	ctx->page_mask = 0;
 	pgd = pgd_offset(mm, address);
 
 	if (unlikely(is_hugepd(__hugepd(pgd_val(*pgd)))))
@@ -1157,6 +1150,8 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
 	else
 		page = follow_p4d_mask(vma, address, pgd, flags, ctx);
 
+	vma_pgtable_walk_end(vma);
+
 	return page;
 }
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 65b9c9a48fd2..cc79891a3597 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6870,77 +6870,6 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
 }
 #endif /* CONFIG_USERFAULTFD */
 
-struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
-				      unsigned long address, unsigned int flags,
-				      unsigned int *page_mask)
-{
-	struct hstate *h = hstate_vma(vma);
-	struct mm_struct *mm = vma->vm_mm;
-	unsigned long haddr = address & huge_page_mask(h);
-	struct page *page = NULL;
-	spinlock_t *ptl;
-	pte_t *pte, entry;
-	int ret;
-
-	hugetlb_vma_lock_read(vma);
-	pte = hugetlb_walk(vma, haddr, huge_page_size(h));
-	if (!pte)
-		goto out_unlock;
-
-	ptl = huge_pte_lock(h, mm, pte);
-	entry = huge_ptep_get(pte);
-	if (pte_present(entry)) {
-		page = pte_page(entry);
-
-		if (!huge_pte_write(entry)) {
-			if (flags & FOLL_WRITE) {
-				page = NULL;
-				goto out;
-			}
-
-			if (gup_must_unshare(vma, flags, page)) {
-				/* Tell the caller to do unsharing */
-				page = ERR_PTR(-EMLINK);
-				goto out;
-			}
-		}
-
-		page = nth_page(page, ((address & ~huge_page_mask(h)) >> PAGE_SHIFT));
-
-		/*
-		 * Note that page may be a sub-page, and with vmemmap
-		 * optimizations the page struct may be read only.
-		 * try_grab_page() will increase the ref count on the
-		 * head page, so this will be OK.
-		 *
-		 * try_grab_page() should always be able to get the page here,
-		 * because we hold the ptl lock and have verified pte_present().
-		 */
-		ret = try_grab_page(page, flags);
-
-		if (WARN_ON_ONCE(ret)) {
-			page = ERR_PTR(ret);
-			goto out;
-		}
-
-		*page_mask = (1U << huge_page_order(h)) - 1;
-	}
-out:
-	spin_unlock(ptl);
-out_unlock:
-	hugetlb_vma_unlock_read(vma);
-
-	/*
-	 * Fixup retval for dump requests: if pagecache doesn't exist,
-	 * don't try to allocate a new page but just skip it.
-	 */
-	if (!page && (flags & FOLL_DUMP) &&
-	    !hugetlbfs_pagecache_present(h, vma, address))
-		page = ERR_PTR(-EFAULT);
-
-	return page;
-}
-
 long hugetlb_change_protection(struct vm_area_struct *vma,
 		unsigned long address, unsigned long end,
 		pgprot_t newprot, unsigned long cp_flags)
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH v4 13/13] mm/gup: Handle hugetlb in the generic follow_page_mask code
@ 2024-03-27 15:23   ` peterx
  0 siblings, 0 replies; 160+ messages in thread
From: peterx @ 2024-03-27 15:23 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Yang Shi, Kirill A . Shutemov, Mike Kravetz, John Hubbard,
	Michael Ellerman, peterx, Andrew Jones, Muchun Song, linux-riscv,
	linuxppc-dev, Christophe Leroy, Andrew Morton, Christoph Hellwig,
	Lorenzo Stoakes, Matthew Wilcox, Rik van Riel, linux-arm-kernel,
	Andrea Arcangeli, David Hildenbrand, Aneesh Kumar K . V,
	Vlastimil Babka, James Houghton, Jason Gunthorpe, Mike Rapoport,
	Axel Rasmussen

From: Peter Xu <peterx@redhat.com>

Now follow_page() is ready to handle hugetlb pages in whatever form, and
over all architectures.  Switch to the generic code path.

Time to retire hugetlb_follow_page_mask(), following the previous
retirement of follow_hugetlb_page() in 4849807114b8.

There may be a slight difference of how the loops run when processing slow
GUP over a large hugetlb range on cont_pte/cont_pmd supported archs: each
loop of __get_user_pages() will resolve one pgtable entry with the patch
applied, rather than relying on the size of hugetlb hstate, the latter may
cover multiple entries in one loop.

A quick performance test on an aarch64 VM on M1 chip shows 15% degrade over
a tight loop of slow gup after the path switched.  That shouldn't be a
problem because slow-gup should not be a hot path for GUP in general: when
page is commonly present, fast-gup will already succeed, while when the
page is indeed missing and require a follow up page fault, the slow gup
degrade will probably buried in the fault paths anyway.  It also explains
why slow gup for THP used to be very slow before 57edfcfd3419 ("mm/gup:
accelerate thp gup even for "pages != NULL"") lands, the latter not part of
a performance analysis but a side benefit.  If the performance will be a
concern, we can consider handle CONT_PTE in follow_page().

Before that is justified to be necessary, keep everything clean and simple.

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/hugetlb.h |  7 ----
 mm/gup.c                | 15 +++------
 mm/hugetlb.c            | 71 -----------------------------------------
 3 files changed, 5 insertions(+), 88 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 294c78b3549f..a546140f89cd 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -328,13 +328,6 @@ static inline void hugetlb_zap_end(
 {
 }
 
-static inline struct page *hugetlb_follow_page_mask(
-    struct vm_area_struct *vma, unsigned long address, unsigned int flags,
-    unsigned int *page_mask)
-{
-	BUILD_BUG(); /* should never be compiled in if !CONFIG_HUGETLB_PAGE*/
-}
-
 static inline int copy_hugetlb_page_range(struct mm_struct *dst,
 					  struct mm_struct *src,
 					  struct vm_area_struct *dst_vma,
diff --git a/mm/gup.c b/mm/gup.c
index a02463c9420e..c803d0b0f358 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1135,18 +1135,11 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
 {
 	pgd_t *pgd;
 	struct mm_struct *mm = vma->vm_mm;
+	struct page *page;
 
-	ctx->page_mask = 0;
-
-	/*
-	 * Call hugetlb_follow_page_mask for hugetlb vmas as it will use
-	 * special hugetlb page table walking code.  This eliminates the
-	 * need to check for hugetlb entries in the general walking code.
-	 */
-	if (is_vm_hugetlb_page(vma))
-		return hugetlb_follow_page_mask(vma, address, flags,
-						&ctx->page_mask);
+	vma_pgtable_walk_begin(vma);
 
+	ctx->page_mask = 0;
 	pgd = pgd_offset(mm, address);
 
 	if (unlikely(is_hugepd(__hugepd(pgd_val(*pgd)))))
@@ -1157,6 +1150,8 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
 	else
 		page = follow_p4d_mask(vma, address, pgd, flags, ctx);
 
+	vma_pgtable_walk_end(vma);
+
 	return page;
 }
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 65b9c9a48fd2..cc79891a3597 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6870,77 +6870,6 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
 }
 #endif /* CONFIG_USERFAULTFD */
 
-struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
-				      unsigned long address, unsigned int flags,
-				      unsigned int *page_mask)
-{
-	struct hstate *h = hstate_vma(vma);
-	struct mm_struct *mm = vma->vm_mm;
-	unsigned long haddr = address & huge_page_mask(h);
-	struct page *page = NULL;
-	spinlock_t *ptl;
-	pte_t *pte, entry;
-	int ret;
-
-	hugetlb_vma_lock_read(vma);
-	pte = hugetlb_walk(vma, haddr, huge_page_size(h));
-	if (!pte)
-		goto out_unlock;
-
-	ptl = huge_pte_lock(h, mm, pte);
-	entry = huge_ptep_get(pte);
-	if (pte_present(entry)) {
-		page = pte_page(entry);
-
-		if (!huge_pte_write(entry)) {
-			if (flags & FOLL_WRITE) {
-				page = NULL;
-				goto out;
-			}
-
-			if (gup_must_unshare(vma, flags, page)) {
-				/* Tell the caller to do unsharing */
-				page = ERR_PTR(-EMLINK);
-				goto out;
-			}
-		}
-
-		page = nth_page(page, ((address & ~huge_page_mask(h)) >> PAGE_SHIFT));
-
-		/*
-		 * Note that page may be a sub-page, and with vmemmap
-		 * optimizations the page struct may be read only.
-		 * try_grab_page() will increase the ref count on the
-		 * head page, so this will be OK.
-		 *
-		 * try_grab_page() should always be able to get the page here,
-		 * because we hold the ptl lock and have verified pte_present().
-		 */
-		ret = try_grab_page(page, flags);
-
-		if (WARN_ON_ONCE(ret)) {
-			page = ERR_PTR(ret);
-			goto out;
-		}
-
-		*page_mask = (1U << huge_page_order(h)) - 1;
-	}
-out:
-	spin_unlock(ptl);
-out_unlock:
-	hugetlb_vma_unlock_read(vma);
-
-	/*
-	 * Fixup retval for dump requests: if pagecache doesn't exist,
-	 * don't try to allocate a new page but just skip it.
-	 */
-	if (!page && (flags & FOLL_DUMP) &&
-	    !hugetlbfs_pagecache_present(h, vma, address))
-		page = ERR_PTR(-EFAULT);
-
-	return page;
-}
-
 long hugetlb_change_protection(struct vm_area_struct *vma,
 		unsigned long address, unsigned long end,
 		pgprot_t newprot, unsigned long cp_flags)
-- 
2.44.0


_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH v4 13/13] mm/gup: Handle hugetlb in the generic follow_page_mask code
@ 2024-03-27 15:23   ` peterx
  0 siblings, 0 replies; 160+ messages in thread
From: peterx @ 2024-03-27 15:23 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Yang Shi, Kirill A . Shutemov, Mike Kravetz, John Hubbard,
	Michael Ellerman, peterx, Andrew Jones, Muchun Song, linux-riscv,
	linuxppc-dev, Christophe Leroy, Andrew Morton, Christoph Hellwig,
	Lorenzo Stoakes, Matthew Wilcox, Rik van Riel, linux-arm-kernel,
	Andrea Arcangeli, David Hildenbrand, Aneesh Kumar K . V,
	Vlastimil Babka, James Houghton, Jason Gunthorpe, Mike Rapoport,
	Axel Rasmussen

From: Peter Xu <peterx@redhat.com>

Now follow_page() is ready to handle hugetlb pages in whatever form, and
over all architectures.  Switch to the generic code path.

Time to retire hugetlb_follow_page_mask(), following the previous
retirement of follow_hugetlb_page() in 4849807114b8.

There may be a slight difference of how the loops run when processing slow
GUP over a large hugetlb range on cont_pte/cont_pmd supported archs: each
loop of __get_user_pages() will resolve one pgtable entry with the patch
applied, rather than relying on the size of hugetlb hstate, the latter may
cover multiple entries in one loop.

A quick performance test on an aarch64 VM on M1 chip shows 15% degrade over
a tight loop of slow gup after the path switched.  That shouldn't be a
problem because slow-gup should not be a hot path for GUP in general: when
page is commonly present, fast-gup will already succeed, while when the
page is indeed missing and require a follow up page fault, the slow gup
degrade will probably buried in the fault paths anyway.  It also explains
why slow gup for THP used to be very slow before 57edfcfd3419 ("mm/gup:
accelerate thp gup even for "pages != NULL"") lands, the latter not part of
a performance analysis but a side benefit.  If the performance will be a
concern, we can consider handle CONT_PTE in follow_page().

Before that is justified to be necessary, keep everything clean and simple.

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/hugetlb.h |  7 ----
 mm/gup.c                | 15 +++------
 mm/hugetlb.c            | 71 -----------------------------------------
 3 files changed, 5 insertions(+), 88 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 294c78b3549f..a546140f89cd 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -328,13 +328,6 @@ static inline void hugetlb_zap_end(
 {
 }
 
-static inline struct page *hugetlb_follow_page_mask(
-    struct vm_area_struct *vma, unsigned long address, unsigned int flags,
-    unsigned int *page_mask)
-{
-	BUILD_BUG(); /* should never be compiled in if !CONFIG_HUGETLB_PAGE*/
-}
-
 static inline int copy_hugetlb_page_range(struct mm_struct *dst,
 					  struct mm_struct *src,
 					  struct vm_area_struct *dst_vma,
diff --git a/mm/gup.c b/mm/gup.c
index a02463c9420e..c803d0b0f358 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1135,18 +1135,11 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
 {
 	pgd_t *pgd;
 	struct mm_struct *mm = vma->vm_mm;
+	struct page *page;
 
-	ctx->page_mask = 0;
-
-	/*
-	 * Call hugetlb_follow_page_mask for hugetlb vmas as it will use
-	 * special hugetlb page table walking code.  This eliminates the
-	 * need to check for hugetlb entries in the general walking code.
-	 */
-	if (is_vm_hugetlb_page(vma))
-		return hugetlb_follow_page_mask(vma, address, flags,
-						&ctx->page_mask);
+	vma_pgtable_walk_begin(vma);
 
+	ctx->page_mask = 0;
 	pgd = pgd_offset(mm, address);
 
 	if (unlikely(is_hugepd(__hugepd(pgd_val(*pgd)))))
@@ -1157,6 +1150,8 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
 	else
 		page = follow_p4d_mask(vma, address, pgd, flags, ctx);
 
+	vma_pgtable_walk_end(vma);
+
 	return page;
 }
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 65b9c9a48fd2..cc79891a3597 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6870,77 +6870,6 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
 }
 #endif /* CONFIG_USERFAULTFD */
 
-struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
-				      unsigned long address, unsigned int flags,
-				      unsigned int *page_mask)
-{
-	struct hstate *h = hstate_vma(vma);
-	struct mm_struct *mm = vma->vm_mm;
-	unsigned long haddr = address & huge_page_mask(h);
-	struct page *page = NULL;
-	spinlock_t *ptl;
-	pte_t *pte, entry;
-	int ret;
-
-	hugetlb_vma_lock_read(vma);
-	pte = hugetlb_walk(vma, haddr, huge_page_size(h));
-	if (!pte)
-		goto out_unlock;
-
-	ptl = huge_pte_lock(h, mm, pte);
-	entry = huge_ptep_get(pte);
-	if (pte_present(entry)) {
-		page = pte_page(entry);
-
-		if (!huge_pte_write(entry)) {
-			if (flags & FOLL_WRITE) {
-				page = NULL;
-				goto out;
-			}
-
-			if (gup_must_unshare(vma, flags, page)) {
-				/* Tell the caller to do unsharing */
-				page = ERR_PTR(-EMLINK);
-				goto out;
-			}
-		}
-
-		page = nth_page(page, ((address & ~huge_page_mask(h)) >> PAGE_SHIFT));
-
-		/*
-		 * Note that page may be a sub-page, and with vmemmap
-		 * optimizations the page struct may be read only.
-		 * try_grab_page() will increase the ref count on the
-		 * head page, so this will be OK.
-		 *
-		 * try_grab_page() should always be able to get the page here,
-		 * because we hold the ptl lock and have verified pte_present().
-		 */
-		ret = try_grab_page(page, flags);
-
-		if (WARN_ON_ONCE(ret)) {
-			page = ERR_PTR(ret);
-			goto out;
-		}
-
-		*page_mask = (1U << huge_page_order(h)) - 1;
-	}
-out:
-	spin_unlock(ptl);
-out_unlock:
-	hugetlb_vma_unlock_read(vma);
-
-	/*
-	 * Fixup retval for dump requests: if pagecache doesn't exist,
-	 * don't try to allocate a new page but just skip it.
-	 */
-	if (!page && (flags & FOLL_DUMP) &&
-	    !hugetlbfs_pagecache_present(h, vma, address))
-		page = ERR_PTR(-EFAULT);
-
-	return page;
-}
-
 long hugetlb_change_protection(struct vm_area_struct *vma,
 		unsigned long address, unsigned long end,
 		pgprot_t newprot, unsigned long cp_flags)
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 160+ messages in thread


* Re: [PATCH v4 06/13] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing
  2024-03-27 15:23   ` peterx
@ 2024-03-28 10:10     ` David Hildenbrand
  -1 siblings, 0 replies; 160+ messages in thread
From: David Hildenbrand @ 2024-03-28 10:10 UTC (permalink / raw)
  To: peterx, linux-mm, linux-kernel
  Cc: Yang Shi, Kirill A . Shutemov, Mike Kravetz, John Hubbard,
	Michael Ellerman, Andrew Jones, Muchun Song, linux-riscv,
	linuxppc-dev, Christophe Leroy, Andrew Morton, Christoph Hellwig,
	Lorenzo Stoakes, Matthew Wilcox, Rik van Riel, linux-arm-kernel,
	Andrea Arcangeli, Aneesh Kumar K . V, Vlastimil Babka,
	James Houghton, Jason Gunthorpe, Mike Rapoport, Axel Rasmussen

On 27.03.24 16:23, peterx@redhat.com wrote:
> From: Peter Xu <peterx@redhat.com>
> 
> Hugepd format for GUP is only used in PowerPC with hugetlbfs.  There are
> some kernel usage of hugepd (can refer to hugepd_populate_kernel() for
> PPC_8XX), however those pages are not candidates for GUP.
> 
> Commit a6e79df92e4a ("mm/gup: disallow FOLL_LONGTERM GUP-fast writing to
> file-backed mappings") added a check to fail gup-fast if there's potential
> risk of violating GUP over writeback file systems.  That should never apply
> to hugepd.  Considering that hugepd is an old format (and even
> software-only), there's no plan to extend hugepd into other file typed
> memories that is prone to the same issue.
> 
> Drop that check, not only because it'll never be true for hugepd per any
> known plan, but also it paves way for reusing the function outside
> fast-gup.
> 
> To make sure we'll still remember this issue just in case hugepd will be
> extended to support non-hugetlbfs memories, add a rich comment above
> gup_huge_pd(), explaining the issue with proper references.
> 
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Lorenzo Stoakes <lstoakes@gmail.com>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: linuxppc-dev@lists.ozlabs.org
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---

@Andrew, you properly adjusted the code to remove the
gup_fast_folio_allowed() call instead of the folio_fast_pin_allowed()
call, but

(1) the commit subject
(2) the comment for gup_huge_pd()

still mention folio_fast_pin_allowed().

The patch "mm/gup: handle hugepd for follow_page()" then moves that 
(outdated) comment.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 160+ messages in thread


* Re: [PATCH v4 06/13] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing
  2024-03-28 10:10     ` David Hildenbrand
@ 2024-03-28 19:01       ` Andrew Morton
  -1 siblings, 0 replies; 160+ messages in thread
From: Andrew Morton @ 2024-03-28 19:01 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: peterx, linux-mm, linux-kernel, Yang Shi, Kirill A . Shutemov,
	Mike Kravetz, John Hubbard, Michael Ellerman, Andrew Jones,
	Muchun Song, linux-riscv, linuxppc-dev, Christophe Leroy,
	Christoph Hellwig, Lorenzo Stoakes, Matthew Wilcox, Rik van Riel,
	linux-arm-kernel, Andrea Arcangeli, Aneesh Kumar K . V,
	Vlastimil Babka, James Houghton, Jason Gunthorpe, Mike Rapoport,
	Axel Rasmussen

On Thu, 28 Mar 2024 11:10:59 +0100 David Hildenbrand <david@redhat.com> wrote:

> @Andrew, you properly adjusted the code to remove the 
> gup_fast_folio_allowed() call instead of the folio_fast_pin_allowed() 
> call, but
> 
> (1) the commit subject
> (2) comment for gup_huge_pd()
> 
> Still mention folio_fast_pin_allowed().
> 
> The patch "mm/gup: handle hugepd for follow_page()" then moves that 
> (outdated) comment.

OK, thanks, I fixed all that up.

^ permalink raw reply	[flat|nested] 160+ messages in thread


* Re: [PATCH v4 13/13] mm/gup: Handle hugetlb in the generic follow_page_mask code
  2024-03-27 15:23   ` peterx
@ 2024-04-02 14:48     ` Ryan Roberts
  -1 siblings, 0 replies; 160+ messages in thread
From: Ryan Roberts @ 2024-04-02 14:48 UTC (permalink / raw)
  To: peterx, linux-mm, linux-kernel
  Cc: Yang Shi, Kirill A . Shutemov, Mike Kravetz, John Hubbard,
	Michael Ellerman, Andrew Jones, Muchun Song, linux-riscv,
	linuxppc-dev, Christophe Leroy, Andrew Morton, Christoph Hellwig,
	Lorenzo Stoakes, Matthew Wilcox, Rik van Riel, linux-arm-kernel,
	Andrea Arcangeli, David Hildenbrand, Aneesh Kumar K . V,
	Vlastimil Babka, James Houghton, Jason Gunthorpe, Mike Rapoport,
	Axel Rasmussen

Hi Peter,

On 27/03/2024 15:23, peterx@redhat.com wrote:
> From: Peter Xu <peterx@redhat.com>
> 
> Now follow_page() is ready to handle hugetlb pages in whatever form, and
> over all architectures.  Switch to the generic code path.
> 
> Time to retire hugetlb_follow_page_mask(), following the previous
> retirement of follow_hugetlb_page() in 4849807114b8.
> 
> There may be a slight difference of how the loops run when processing slow
> GUP over a large hugetlb range on cont_pte/cont_pmd supported archs: each
> loop of __get_user_pages() will resolve one pgtable entry with the patch
> applied, rather than relying on the size of hugetlb hstate, the latter may
> cover multiple entries in one loop.
> 
> A quick performance test on an aarch64 VM on M1 chip shows 15% degrade over
> a tight loop of slow gup after the path switched.  That shouldn't be a
> problem, because slow gup should not be a hot path for GUP in general:
> when the page is commonly present, fast gup will already succeed, while
> when the page is indeed missing and requires a follow-up page fault, the
> slow gup degradation will probably be buried in the fault paths anyway.
> It also explains why slow gup for THP used to be very slow before
> 57edfcfd3419 ("mm/gup: accelerate thp gup even for "pages != NULL"")
> landed; that speedup was a side benefit rather than the product of a
> performance analysis.  If performance becomes a concern, we can consider
> handling CONT_PTE in follow_page().
> 
> Before that is justified to be necessary, keep everything clean and simple.
> 
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Afraid I'm seeing an oops when running the gup_longterm test on arm64 with current mm-unstable. Git bisect blames this patch. The oops reproduces for me every time on two different machines:


[    9.340416] kernel BUG at mm/gup.c:778!
[    9.340746] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
[    9.341199] Modules linked in:
[    9.341481] CPU: 1 PID: 1159 Comm: gup_longterm Not tainted 6.9.0-rc2-00210-g910ff1a347e4 #11
[    9.342232] Hardware name: linux,dummy-virt (DT)
[    9.342647] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    9.343195] pc : follow_page_mask+0x4d4/0x880
[    9.343580] lr : follow_page_mask+0x4d4/0x880
[    9.344018] sp : ffff8000898b3aa0
[    9.344345] x29: ffff8000898b3aa0 x28: fffffdffc53973e8 x27: 00003c0005d08000
[    9.345028] x26: ffff00014e5cfd08 x25: ffffd3513a40c000 x24: fffffdffc5d08000
[    9.345682] x23: ffffc1ffc0000000 x22: 0000000000080101 x21: ffff8000898b3ba8
[    9.346337] x20: 0000fffff4200000 x19: ffff00014e52d508 x18: 0000000000000010
[    9.347005] x17: 5f656e6f7a5f7369 x16: 2120262620296567 x15: 6170286461654865
[    9.347713] x14: 6761502128454741 x13: 2929656761702865 x12: 6761705f65636976
[    9.348371] x11: 65645f656e6f7a5f x10: ffffd3513b31d6e0 x9 : ffffd3513852f090
[    9.349062] x8 : 00000000ffffefff x7 : ffffd3513b31d6e0 x6 : 0000000000000000
[    9.349753] x5 : ffff00017ff98cc8 x4 : 0000000000000fff x3 : 0000000000000000
[    9.350397] x2 : 0000000000000000 x1 : ffff000190e8b480 x0 : 0000000000000052
[    9.351097] Call trace:
[    9.351312]  follow_page_mask+0x4d4/0x880
[    9.351700]  __get_user_pages+0xf4/0x3e8
[    9.352089]  __gup_longterm_locked+0x204/0xa70
[    9.352516]  pin_user_pages+0x88/0xc0
[    9.352873]  gup_test_ioctl+0x860/0xc40
[    9.353249]  __arm64_sys_ioctl+0xb0/0x100
[    9.353648]  invoke_syscall+0x50/0x128
[    9.354022]  el0_svc_common.constprop.0+0x48/0xf8
[    9.354488]  do_el0_svc+0x28/0x40
[    9.354822]  el0_svc+0x34/0xe0
[    9.355128]  el0t_64_sync_handler+0x13c/0x158
[    9.355489]  el0t_64_sync+0x190/0x198
[    9.355793] Code: aa1803e0 d000d8e1 91220021 97fff560 (d4210000) 
[    9.356280] ---[ end trace 0000000000000000 ]---
[    9.356651] note: gup_longterm[1159] exited with irqs disabled
[    9.357141] note: gup_longterm[1159] exited with preempt_count 2
[    9.358033] ------------[ cut here ]------------
[    9.358800] WARNING: CPU: 1 PID: 0 at kernel/context_tracking.c:128 ct_kernel_exit.constprop.0+0x108/0x120
[    9.360157] Modules linked in:
[    9.360541] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G      D            6.9.0-rc2-00210-g910ff1a347e4 #11
[    9.361626] Hardware name: linux,dummy-virt (DT)
[    9.362087] pstate: 204003c5 (nzCv DAIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    9.362758] pc : ct_kernel_exit.constprop.0+0x108/0x120
[    9.363306] lr : ct_idle_enter+0x10/0x20
[    9.363845] sp : ffff8000801abdc0
[    9.364222] x29: ffff8000801abdc0 x28: 0000000000000000 x27: 0000000000000000
[    9.364961] x26: 0000000000000000 x25: ffff00014149d780 x24: 0000000000000000
[    9.365557] x23: 0000000000000000 x22: ffffd3513b299d48 x21: ffffd3513a785730
[    9.366239] x20: ffffd3513b299c28 x19: ffff00017ffa7da0 x18: 0000fffff5ffffff
[    9.366869] x17: 0000000000000000 x16: 1fffe0002a21a8c1 x15: 0000000000000000
[    9.367524] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000002
[    9.368207] x11: 0000000000000001 x10: 0000000000000ad0 x9 : ffffd35138589230
[    9.369123] x8 : ffff00014149e2b0 x7 : 0000000000000000 x6 : 000000000f8c0fb2
[    9.370403] x5 : 4000000000000002 x4 : ffff2cb045825000 x3 : ffff8000801abdc0
[    9.371170] x2 : ffffd3513a782da0 x1 : 4000000000000000 x0 : ffffd3513a782da0
[    9.372279] Call trace:
[    9.372519]  ct_kernel_exit.constprop.0+0x108/0x120
[    9.373216]  ct_idle_enter+0x10/0x20
[    9.373562]  default_idle_call+0x3c/0x160
[    9.374055]  do_idle+0x21c/0x280
[    9.374394]  cpu_startup_entry+0x3c/0x50
[    9.374797]  secondary_start_kernel+0x140/0x168
[    9.375220]  __secondary_switched+0xb8/0xc0
[    9.375875] ---[ end trace 0000000000000000 ]---


The oops trigger is at mm/gup.c:778:
VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);


This is the output of gup_longterm (the last line shown is from just before the oops):

# [INFO] detected hugetlb page size: 2048 KiB
# [INFO] detected hugetlb page size: 32768 KiB
# [INFO] detected hugetlb page size: 64 KiB
# [INFO] detected hugetlb page size: 1048576 KiB
TAP version 13
1..70
# [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd
ok 1 Should have worked
# [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with tmpfile
ok 2 Should have failed
# [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with local tmpfile
ok 3 Should have failed
# [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
ok 4 Should have worked
# [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (32768 kB)


So 2M passed OK, and it's failing for 32M, which is cont-pmd. I'm guessing you're trying to iterate 2M into a cont-pmd folio and ending up with an unexpected tail page?
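
If that is what is happening, a hypothetical sketch of the failure mode
(illustrative only, not the actual mm/gup.c code) would be: for a 32M
folio mapped with cont-pmd, the pfn in every pmd entry past the first
points into the middle of the folio, so the per-entry walk starts from a
tail page:

	/*
	 * Sketch: for the second cont-pmd entry of a 32M hugetlb folio,
	 * pmd_page() already returns a tail page 2M into the folio...
	 */
	struct page *page = pmd_page(*pmd);

	/* ...which would trip a head-page assertion like the one above: */
	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);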


I'm running with defconfig plus these:

./scripts/config --enable CONFIG_SQUASHFS_LZ4
./scripts/config --enable CONFIG_SQUASHFS_LZO
./scripts/config --enable CONFIG_SQUASHFS_XZ
./scripts/config --enable CONFIG_SQUASHFS_ZSTD
./scripts/config --enable CONFIG_XFS_FS
./scripts/config --enable CONFIG_FTRACE
./scripts/config --enable CONFIG_FUNCTION_TRACER
./scripts/config --enable CONFIG_KPROBES
./scripts/config --enable CONFIG_HIST_TRIGGERS
./scripts/config --enable CONFIG_FTRACE_SYSCALLS
./scripts/config --enable CONFIG_DEBUG_VM
./scripts/config --enable CONFIG_DEBUG_VM_MAPLE_TREE
./scripts/config --enable CONFIG_DEBUG_VM_RB
./scripts/config --enable CONFIG_DEBUG_VM_PGFLAGS
./scripts/config --enable CONFIG_DEBUG_VM_PGTABLE
./scripts/config --enable CONFIG_PAGE_TABLE_CHECK
./scripts/config --enable CONFIG_USERFAULTFD
./scripts/config --enable CONFIG_TEST_VMALLOC
./scripts/config --enable CONFIG_GUP_TEST


Thanks,
Ryan




> ---
>  include/linux/hugetlb.h |  7 ----
>  mm/gup.c                | 15 +++------
>  mm/hugetlb.c            | 71 -----------------------------------------
>  3 files changed, 5 insertions(+), 88 deletions(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 294c78b3549f..a546140f89cd 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -328,13 +328,6 @@ static inline void hugetlb_zap_end(
>  {
>  }
>  
> -static inline struct page *hugetlb_follow_page_mask(
> -    struct vm_area_struct *vma, unsigned long address, unsigned int flags,
> -    unsigned int *page_mask)
> -{
> -	BUILD_BUG(); /* should never be compiled in if !CONFIG_HUGETLB_PAGE*/
> -}
> -
>  static inline int copy_hugetlb_page_range(struct mm_struct *dst,
>  					  struct mm_struct *src,
>  					  struct vm_area_struct *dst_vma,
> diff --git a/mm/gup.c b/mm/gup.c
> index a02463c9420e..c803d0b0f358 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1135,18 +1135,11 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
>  {
>  	pgd_t *pgd;
>  	struct mm_struct *mm = vma->vm_mm;
> +	struct page *page;
>  
> -	ctx->page_mask = 0;
> -
> -	/*
> -	 * Call hugetlb_follow_page_mask for hugetlb vmas as it will use
> -	 * special hugetlb page table walking code.  This eliminates the
> -	 * need to check for hugetlb entries in the general walking code.
> -	 */
> -	if (is_vm_hugetlb_page(vma))
> -		return hugetlb_follow_page_mask(vma, address, flags,
> -						&ctx->page_mask);
> +	vma_pgtable_walk_begin(vma);
>  
> +	ctx->page_mask = 0;
>  	pgd = pgd_offset(mm, address);
>  
>  	if (unlikely(is_hugepd(__hugepd(pgd_val(*pgd)))))
> @@ -1157,6 +1150,8 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
>  	else
>  		page = follow_p4d_mask(vma, address, pgd, flags, ctx);
>  
> +	vma_pgtable_walk_end(vma);
> +
>  	return page;
>  }
>  
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 65b9c9a48fd2..cc79891a3597 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6870,77 +6870,6 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
>  }
>  #endif /* CONFIG_USERFAULTFD */
>  
> -struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> -				      unsigned long address, unsigned int flags,
> -				      unsigned int *page_mask)
> -{
> -	struct hstate *h = hstate_vma(vma);
> -	struct mm_struct *mm = vma->vm_mm;
> -	unsigned long haddr = address & huge_page_mask(h);
> -	struct page *page = NULL;
> -	spinlock_t *ptl;
> -	pte_t *pte, entry;
> -	int ret;
> -
> -	hugetlb_vma_lock_read(vma);
> -	pte = hugetlb_walk(vma, haddr, huge_page_size(h));
> -	if (!pte)
> -		goto out_unlock;
> -
> -	ptl = huge_pte_lock(h, mm, pte);
> -	entry = huge_ptep_get(pte);
> -	if (pte_present(entry)) {
> -		page = pte_page(entry);
> -
> -		if (!huge_pte_write(entry)) {
> -			if (flags & FOLL_WRITE) {
> -				page = NULL;
> -				goto out;
> -			}
> -
> -			if (gup_must_unshare(vma, flags, page)) {
> -				/* Tell the caller to do unsharing */
> -				page = ERR_PTR(-EMLINK);
> -				goto out;
> -			}
> -		}
> -
> -		page = nth_page(page, ((address & ~huge_page_mask(h)) >> PAGE_SHIFT));
> -
> -		/*
> -		 * Note that page may be a sub-page, and with vmemmap
> -		 * optimizations the page struct may be read only.
> -		 * try_grab_page() will increase the ref count on the
> -		 * head page, so this will be OK.
> -		 *
> -		 * try_grab_page() should always be able to get the page here,
> -		 * because we hold the ptl lock and have verified pte_present().
> -		 */
> -		ret = try_grab_page(page, flags);
> -
> -		if (WARN_ON_ONCE(ret)) {
> -			page = ERR_PTR(ret);
> -			goto out;
> -		}
> -
> -		*page_mask = (1U << huge_page_order(h)) - 1;
> -	}
> -out:
> -	spin_unlock(ptl);
> -out_unlock:
> -	hugetlb_vma_unlock_read(vma);
> -
> -	/*
> -	 * Fixup retval for dump requests: if pagecache doesn't exist,
> -	 * don't try to allocate a new page but just skip it.
> -	 */
> -	if (!page && (flags & FOLL_DUMP) &&
> -	    !hugetlbfs_pagecache_present(h, vma, address))
> -		page = ERR_PTR(-EFAULT);
> -
> -	return page;
> -}
> -
>  long hugetlb_change_protection(struct vm_area_struct *vma,
>  		unsigned long address, unsigned long end,
>  		pgprot_t newprot, unsigned long cp_flags)


^ permalink raw reply	[flat|nested] 160+ messages in thread


> -
>  long hugetlb_change_protection(struct vm_area_struct *vma,
>  		unsigned long address, unsigned long end,
>  		pgprot_t newprot, unsigned long cp_flags)


^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v4 13/13] mm/gup: Handle hugetlb in the generic follow_page_mask code
  2024-04-02 14:48     ` Ryan Roberts
@ 2024-04-02 15:26       ` David Hildenbrand
  0 siblings, 0 replies; 160+ messages in thread
From: David Hildenbrand @ 2024-04-02 15:26 UTC (permalink / raw)
  To: Ryan Roberts, peterx, linux-mm, linux-kernel
  Cc: Yang Shi, Kirill A . Shutemov, Mike Kravetz, John Hubbard,
	Michael Ellerman, Andrew Jones, Muchun Song, linux-riscv,
	linuxppc-dev, Christophe Leroy, Andrew Morton, Christoph Hellwig,
	Lorenzo Stoakes, Matthew Wilcox, Rik van Riel, linux-arm-kernel,
	Andrea Arcangeli, Aneesh Kumar K . V, Vlastimil Babka,
	James Houghton, Jason Gunthorpe, Mike Rapoport, Axel Rasmussen

On 02.04.24 16:48, Ryan Roberts wrote:
> Hi Peter,
> 
> On 27/03/2024 15:23, peterx@redhat.com wrote:
>> From: Peter Xu <peterx@redhat.com>
>>
>> Now follow_page() is ready to handle hugetlb pages in whatever form,
>> across all architectures.  Switch to the generic code path.
>>
>> Time to retire hugetlb_follow_page_mask(), following the previous
>> retirement of follow_hugetlb_page() in 4849807114b8.
>>
>> There may be a slight difference in how the loops run when processing slow
>> GUP over a large hugetlb range on cont_pte/cont_pmd supported archs: each
>> loop of __get_user_pages() will resolve one pgtable entry with the patch
>> applied, rather than relying on the size of the hugetlb hstate, which may
>> cover multiple entries in one loop.
>>
>> A quick performance test on an aarch64 VM on an M1 chip shows a 15%
>> degradation over a tight loop of slow gup after the path switch.  That
>> shouldn't be a problem because slow gup should not be a hot path for GUP
>> in general: when the page is present (the common case), fast gup will
>> already succeed, while when the page is indeed missing and requires a
>> follow-up page fault, the slow-gup degradation will probably be buried in
>> the fault paths anyway.  It also explains why slow gup for THP used to be
>> very slow before 57edfcfd3419 ("mm/gup: accelerate thp gup even for
>> "pages != NULL"") landed, the latter not part of a performance analysis
>> but a side benefit.  If performance becomes a concern, we can consider
>> handling CONT_PTE in follow_page().
>>
>> Until that is justified as necessary, keep everything clean and simple.
>>
>> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
>> Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> Afraid I'm seeing an oops when running gup_longterm test on arm64 with current mm-unstable. Git bisect blames this patch. The oops reproduces for me every time on 2 different machines:
> 
> 
> [    9.340416] kernel BUG at mm/gup.c:778!
> [    9.340746] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
> [    9.341199] Modules linked in:
> [    9.341481] CPU: 1 PID: 1159 Comm: gup_longterm Not tainted 6.9.0-rc2-00210-g910ff1a347e4 #11
> [    9.342232] Hardware name: linux,dummy-virt (DT)
> [    9.342647] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [    9.343195] pc : follow_page_mask+0x4d4/0x880
> [    9.343580] lr : follow_page_mask+0x4d4/0x880
> [    9.344018] sp : ffff8000898b3aa0
> [    9.344345] x29: ffff8000898b3aa0 x28: fffffdffc53973e8 x27: 00003c0005d08000
> [    9.345028] x26: ffff00014e5cfd08 x25: ffffd3513a40c000 x24: fffffdffc5d08000
> [    9.345682] x23: ffffc1ffc0000000 x22: 0000000000080101 x21: ffff8000898b3ba8
> [    9.346337] x20: 0000fffff4200000 x19: ffff00014e52d508 x18: 0000000000000010
> [    9.347005] x17: 5f656e6f7a5f7369 x16: 2120262620296567 x15: 6170286461654865
> [    9.347713] x14: 6761502128454741 x13: 2929656761702865 x12: 6761705f65636976
> [    9.348371] x11: 65645f656e6f7a5f x10: ffffd3513b31d6e0 x9 : ffffd3513852f090
> [    9.349062] x8 : 00000000ffffefff x7 : ffffd3513b31d6e0 x6 : 0000000000000000
> [    9.349753] x5 : ffff00017ff98cc8 x4 : 0000000000000fff x3 : 0000000000000000
> [    9.350397] x2 : 0000000000000000 x1 : ffff000190e8b480 x0 : 0000000000000052
> [    9.351097] Call trace:
> [    9.351312]  follow_page_mask+0x4d4/0x880
> [    9.351700]  __get_user_pages+0xf4/0x3e8
> [    9.352089]  __gup_longterm_locked+0x204/0xa70
> [    9.352516]  pin_user_pages+0x88/0xc0
> [    9.352873]  gup_test_ioctl+0x860/0xc40
> [    9.353249]  __arm64_sys_ioctl+0xb0/0x100
> [    9.353648]  invoke_syscall+0x50/0x128
> [    9.354022]  el0_svc_common.constprop.0+0x48/0xf8
> [    9.354488]  do_el0_svc+0x28/0x40
> [    9.354822]  el0_svc+0x34/0xe0
> [    9.355128]  el0t_64_sync_handler+0x13c/0x158
> [    9.355489]  el0t_64_sync+0x190/0x198
> [    9.355793] Code: aa1803e0 d000d8e1 91220021 97fff560 (d4210000)
> [    9.356280] ---[ end trace 0000000000000000 ]---
> [    9.356651] note: gup_longterm[1159] exited with irqs disabled
> [    9.357141] note: gup_longterm[1159] exited with preempt_count 2
> [    9.358033] ------------[ cut here ]------------
> [    9.358800] WARNING: CPU: 1 PID: 0 at kernel/context_tracking.c:128 ct_kernel_exit.constprop.0+0x108/0x120
> [    9.360157] Modules linked in:
> [    9.360541] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G      D            6.9.0-rc2-00210-g910ff1a347e4 #11
> [    9.361626] Hardware name: linux,dummy-virt (DT)
> [    9.362087] pstate: 204003c5 (nzCv DAIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [    9.362758] pc : ct_kernel_exit.constprop.0+0x108/0x120
> [    9.363306] lr : ct_idle_enter+0x10/0x20
> [    9.363845] sp : ffff8000801abdc0
> [    9.364222] x29: ffff8000801abdc0 x28: 0000000000000000 x27: 0000000000000000
> [    9.364961] x26: 0000000000000000 x25: ffff00014149d780 x24: 0000000000000000
> [    9.365557] x23: 0000000000000000 x22: ffffd3513b299d48 x21: ffffd3513a785730
> [    9.366239] x20: ffffd3513b299c28 x19: ffff00017ffa7da0 x18: 0000fffff5ffffff
> [    9.366869] x17: 0000000000000000 x16: 1fffe0002a21a8c1 x15: 0000000000000000
> [    9.367524] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000002
> [    9.368207] x11: 0000000000000001 x10: 0000000000000ad0 x9 : ffffd35138589230
> [    9.369123] x8 : ffff00014149e2b0 x7 : 0000000000000000 x6 : 000000000f8c0fb2
> [    9.370403] x5 : 4000000000000002 x4 : ffff2cb045825000 x3 : ffff8000801abdc0
> [    9.371170] x2 : ffffd3513a782da0 x1 : 4000000000000000 x0 : ffffd3513a782da0
> [    9.372279] Call trace:
> [    9.372519]  ct_kernel_exit.constprop.0+0x108/0x120
> [    9.373216]  ct_idle_enter+0x10/0x20
> [    9.373562]  default_idle_call+0x3c/0x160
> [    9.374055]  do_idle+0x21c/0x280
> [    9.374394]  cpu_startup_entry+0x3c/0x50
> [    9.374797]  secondary_start_kernel+0x140/0x168
> [    9.375220]  __secondary_switched+0xb8/0xc0
> [    9.375875] ---[ end trace 0000000000000000 ]---
> 
> 
> The oops trigger is at mm/gup.c:778:
> VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
> 
> 
> This is the output of gup_longterm (the last line is from just before the oops):
> 
> # [INFO] detected hugetlb page size: 2048 KiB
> # [INFO] detected hugetlb page size: 32768 KiB
> # [INFO] detected hugetlb page size: 64 KiB
> # [INFO] detected hugetlb page size: 1048576 KiB
> TAP version 13
> 1..70
> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd
> ok 1 Should have worked
> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with tmpfile
> ok 2 Should have failed
> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with local tmpfile
> ok 3 Should have failed
> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
> ok 4 Should have worked
> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (32768 kB)
> 
> 
> So 2M passed ok, and it's failing for 32M, which is cont-pmd. I'm guessing you're iterating 2M into a cont-pmd folio and ending up with an unexpected tail page?

I assume we do find the expected tail page; it's just that the check

VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);

doesn't make sense with hugetlb folios. We might have a tail page mapped 
in a cont-pmd entry. As soon as we call follow_huge_pmd() on "not the 
first cont-pmd entry", we trigger this check.

Likely this sanity check must also allow for hugetlb folios. Or we 
should just remove it completely.

In the past, we wanted to make sure that we never get tail pages of THP 
from PMD entries, because something would currently be broken (we don't 
support THP > PMD).
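
If we keep the check, a minimal relaxation could look something like this
(completely untested, just to illustrate the idea):

	/* hugetlb may map tail pages via cont-pmd entries, so exempt
	 * hugetlb folios while still catching stray THP tails: */
	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page) &&
		       !folio_test_hugetlb(page_folio(page)), page);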

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v4 13/13] mm/gup: Handle hugetlb in the generic follow_page_mask code
@ 2024-04-02 16:00         ` Matthew Wilcox
  0 siblings, 0 replies; 160+ messages in thread
From: Matthew Wilcox @ 2024-04-02 16:00 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: James Houghton, Yang Shi, peterx, Andrew Jones, linux-mm,
	linux-riscv, Andrea Arcangeli, Christoph Hellwig,
	Aneesh Kumar K . V, linux-arm-kernel, Jason Gunthorpe,
	Axel Rasmussen, Ryan Roberts, Rik van Riel, John Hubbard,
	Kirill A . Shutemov, Vlastimil Babka, Lorenzo Stoakes,
	Muchun Song, linux-kernel, Andrew Morton, linuxppc-dev,
	Mike Rapoport, Mike Kravetz

On Tue, Apr 02, 2024 at 05:26:28PM +0200, David Hildenbrand wrote:
> > The oops trigger is at mm/gup.c:778:
> > VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
> > 
> > So 2M passed ok, and its failing for 32M, which is cont-pmd. I'm guessing you're trying to iterate 2M into a cont-pmd folio and ending up with an unexpected tail page?
> 
> I assume we find the expected tail page, it's just that the check
> 
> VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
> 
> Doesn't make sense with hugetlb folios. We might have a tail page mapped in
> a cont-pmd entry. As soon as we call follow_huge_pmd() on "not the first
> cont-pmd entry", we trigger this check.
> 
> Likely this sanity check must also allow for hugetlb folios. Or we should
> just remove it completely.
> 
> In the past, we wanted to make sure that we never get tail pages of THP from
> PMD entries, because something would currently be broken (we don't support
> THP > PMD).

That was a practical limitation on my part.  We have various parts of
the MM which assume that pmd_page() returns a head page and until we
get all of those fixed, adding support for folios larger than PMD_SIZE
was only going to cause trouble for no significant wins.

I agree with you we should get rid of this assertion entirely.  We should
fix all the places which assume that pmd_page() returns a head page,
but that may take some time.

As an example, filemap_map_pmd() has:

       if (pmd_none(*vmf->pmd) && folio_test_pmd_mappable(folio)) {
                struct page *page = folio_file_page(folio, start);
                vm_fault_t ret = do_set_pmd(vmf, page);

and then do_set_pmd() has:

        if (page != &folio->page || folio_order(folio) != HPAGE_PMD_ORDER)
                return ret;

so we'd simply refuse to use a PMD to map a folio larger than PMD_SIZE.
There's a lot of work to be done to make this work generally (not to
mention figuring out how to handle mapcount for such folios ;-).

This particular case seems straightforward though.  Just remove the
assertion.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v4 13/13] mm/gup: Handle hugetlb in the generic follow_page_mask code
  2024-04-02 16:00         ` Matthew Wilcox
@ 2024-04-02 16:18           ` Ryan Roberts
  0 siblings, 0 replies; 160+ messages in thread
From: Ryan Roberts @ 2024-04-02 16:18 UTC (permalink / raw)
  To: Matthew Wilcox, David Hildenbrand
  Cc: peterx, linux-mm, linux-kernel, Yang Shi, Kirill A . Shutemov,
	Mike Kravetz, John Hubbard, Michael Ellerman, Andrew Jones,
	Muchun Song, linux-riscv, linuxppc-dev, Christophe Leroy,
	Andrew Morton, Christoph Hellwig, Lorenzo Stoakes, Rik van Riel,
	linux-arm-kernel, Andrea Arcangeli, Aneesh Kumar K . V,
	Vlastimil Babka, James Houghton, Jason Gunthorpe, Mike Rapoport,
	Axel Rasmussen

On 02/04/2024 17:00, Matthew Wilcox wrote:
> On Tue, Apr 02, 2024 at 05:26:28PM +0200, David Hildenbrand wrote:
>>> The oops trigger is at mm/gup.c:778:
>>> VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
>>>
>>> So 2M passed ok, and it's failing for 32M, which is cont-pmd. I'm guessing you're trying to iterate 2M into a cont-pmd folio and ending up with an unexpected tail page?
>>
>> I assume we find the expected tail page, it's just that the check
>>
>> VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
>>
>> Doesn't make sense with hugetlb folios. We might have a tail page mapped in
>> a cont-pmd entry. As soon as we call follow_huge_pmd() on "not the first
>> cont-pmd entry", we trigger this check.
>>
>> Likely this sanity check must also allow for hugetlb folios. Or we should
>> just remove it completely.
>>
>> In the past, we wanted to make sure that we never get tail pages of THP from
>> PMD entries, because something would currently be broken (we don't support
>> THP > PMD).
> 
> That was a practical limitation on my part.  We have various parts of
> the MM which assume that pmd_page() returns a head page and until we
> get all of those fixed, adding support for folios larger than PMD_SIZE
> was only going to cause trouble for no significant wins.
> 
> I agree with you we should get rid of this assertion entirely.  We should
> fix all the places which assume that pmd_page() returns a head page,
> but that may take some time.
> 
> As an example, filemap_map_pmd() has:
> 
>        if (pmd_none(*vmf->pmd) && folio_test_pmd_mappable(folio)) {
>                 struct page *page = folio_file_page(folio, start);
>                 vm_fault_t ret = do_set_pmd(vmf, page);
> 
> and then do_set_pmd() has:
> 
>         if (page != &folio->page || folio_order(folio) != HPAGE_PMD_ORDER)
>                 return ret;
> 
> so we'd simply refuse to use a PMD to map a folio larger than PMD_SIZE.
> There's a lot of work to be done to make this work generally (not to
> mention figuring out how to handle mapcount for such folios ;-).
> 
> This particular case seems straightforward though.  Just remove the
> assertion.

Removing the assertion gets me further, but then I end up with this:

[    9.748422] kernel BUG at include/linux/page-flags.h:1098!
[    9.748897] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
[    9.749590] Modules linked in:
[    9.749867] CPU: 2 PID: 1155 Comm: gup_longterm Not tainted 6.9.0-rc2-00210-g910ff1a347e4-dirty #12
[    9.750682] Hardware name: linux,dummy-virt (DT)
[    9.751095] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    9.751729] pc : follow_page_mask+0x730/0x850
[    9.752152] lr : follow_page_mask+0x730/0x850
[    9.752573] sp : ffff8000898f3aa0
[    9.752882] x29: ffff8000898f3aa0 x28: fffffdffc52b91a8 x27: 0000000000000001
[    9.753543] x26: ffff00014ae46d08 x25: 00003c0005d88000 x24: fffffdffc5d88000
[    9.754221] x23: ffffc1ffc0000000 x22: 0000000000080101 x21: ffff8000898f3ba8
[    9.754875] x20: 0000fffff4200000 x19: ffff0001a3d64450 x18: 0000000000000010
[    9.755567] x17: 2864616548656761 x16: 5021202626202965 x15: 6761702865677548
[    9.756254] x14: 6567615028454741 x13: 2929656761702864 x12: 6165486567615021
[    9.756953] x11: 2026262029656761 x10: ffffaaac08f1d6e0 x9 : ffffaaac0612f090
[    9.757671] x8 : 00000000ffffefff x7 : ffffaaac08f1d6e0 x6 : 0000000000000000
[    9.758356] x5 : ffff00017ffb9cc8 x4 : 0000000000000fff x3 : 0000000000000000
[    9.758983] x2 : 0000000000000000 x1 : ffff000189ecb480 x0 : 0000000000000046
[    9.759663] Call trace:
[    9.759901]  follow_page_mask+0x730/0x850
[    9.760293]  __get_user_pages+0xf4/0x3e8
[    9.760683]  __gup_longterm_locked+0x204/0xa70
[    9.761110]  pin_user_pages+0x88/0xc0
[    9.761486]  gup_test_ioctl+0x860/0xc40
[    9.761866]  __arm64_sys_ioctl+0xb0/0x100
[    9.762254]  invoke_syscall+0x50/0x128
[    9.762630]  el0_svc_common.constprop.0+0x48/0xf8
[    9.763104]  do_el0_svc+0x28/0x40
[    9.763413]  el0_svc+0x34/0xe0
[    9.763699]  el0t_64_sync_handler+0x13c/0x158
[    9.764139]  el0t_64_sync+0x190/0x198
[    9.764465] Code: aa1803e0 d000d8e1 911d6021 97fff4c9 (d4210000) 
[    9.765053] ---[ end trace 0000000000000000 ]---
[    9.765520] note: gup_longterm[1155] exited with irqs disabled
[    9.766146] note: gup_longterm[1155] exited with preempt_count 2
[    9.767366] ------------[ cut here ]------------
[    9.768062] WARNING: CPU: 2 PID: 0 at kernel/context_tracking.c:128 ct_kernel_exit.constprop.0+0x108/0x120
[    9.769146] Modules linked in:
[    9.769429] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G      D            6.9.0-rc2-00210-g910ff1a347e4-dirty #12
[    9.770338] Hardware name: linux,dummy-virt (DT)
[    9.770837] pstate: 204003c5 (nzCv DAIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    9.771615] pc : ct_kernel_exit.constprop.0+0x108/0x120
[    9.772150] lr : ct_idle_enter+0x10/0x20
[    9.772539] sp : ffff8000801b3dc0
[    9.772913] x29: ffff8000801b3dc0 x28: 0000000000000000 x27: 0000000000000000
[    9.773769] x26: 0000000000000000 x25: ffff00014149e900 x24: 0000000000000000
[    9.774526] x23: 0000000000000000 x22: ffffaaac08e99d48 x21: ffffaaac08385730
[    9.775255] x20: ffffaaac08e99c28 x19: ffff00017ffc8da0 x18: 0000fffff5ffffff
[    9.775924] x17: 0000000000000000 x16: 1fffe0002a57c9e1 x15: 0000000000000001
[    9.776619] x14: ffffffffffffffff x13: 0000000000000000 x12: ffffaaac07a06968
[    9.777246] x11: 000000ae44c42eec x10: 0000000000000ad0 x9 : ffffaaac06189230
[    9.777942] x8 : ffff00014149f430 x7 : 02c9acb509db422c x6 : 000000001015a9f0
[    9.778635] x5 : 4000000000000002 x4 : ffff555577c46000 x3 : ffff8000801b3dc0
[    9.779671] x2 : ffffaaac08382da0 x1 : 4000000000000000 x0 : ffffaaac08382da0
[    9.780703] Call trace:
[    9.781150]  ct_kernel_exit.constprop.0+0x108/0x120
[    9.781949]  ct_idle_enter+0x10/0x20
[    9.782246]  default_idle_call+0x3c/0x160
[    9.782624]  do_idle+0x21c/0x280
[    9.782945]  cpu_startup_entry+0x3c/0x50
[    9.783268]  secondary_start_kernel+0x140/0x168
[    9.783818]  __secondary_switched+0xb8/0xc0
[    9.784163] ---[ end trace 0000000000000000 ]---


Which is caused by this:

static __always_inline int PageAnonExclusive(const struct page *page)
{
	VM_BUG_ON_PGFLAGS(!PageAnon(page), page);
	VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page); <<<<
	return test_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
}

Which is called from can_follow_write_pmd(), just after the assert I commented out.


It's triggered by this test:

# [RUN] R/W longterm GUP pin in MAP_PRIVATE file mapping ... with memfd hugetlb (32768 kB)

Which is the first MAP_PRIVATE test for cont-pmd mapped hugetlb. (All MAP_SHARED tests are passing).
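
For reference, the failing scenario boils down to roughly the following
minimal sketch (not the actual selftest code; the real test takes the
longterm pin through the gup_test ioctl, elided here, and the MFD_*
fallback defines are only for older userspace headers):

#define _GNU_SOURCE
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#ifndef MFD_HUGETLB
#define MFD_HUGETLB	0x0004U
#endif
#ifndef MFD_HUGE_32MB
#define MFD_HUGE_32MB	(25U << 26)	/* HUGETLB_FLAG_ENCODE_32MB */
#endif

int main(void)
{
	size_t size = 32UL << 20;	/* one 32M (cont-pmd) hugetlb folio */
	int fd = memfd_create("huge", MFD_HUGETLB | MFD_HUGE_32MB);
	char *p;

	if (fd < 0 || ftruncate(fd, size))
		return 1;

	p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	/* Write-fault in anonymous (CoW) hugetlb pages. */
	memset(p, 1, size);

	/*
	 * The selftest now longterm-pins [p, p + size); iterating 2M into
	 * the folio reaches follow_huge_pmd() on a non-head page.
	 */
	return 0;
}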


Looks like can_follow_write_pmd() returns early for VM_SHARED mappings.
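
The early return is in the FOLL_FORCE handling; roughly this shape
(a paraphrased, abridged sketch of can_follow_write_pmd(), not the
verbatim kernel code; the trailing soft-dirty/uffd-wp checks are
omitted):

static inline bool can_follow_write_pmd(pmd_t pmd, struct page *page,
					struct vm_area_struct *vma,
					unsigned int flags)
{
	/* If the pmd is writable, we can write to the page. */
	if (pmd_write(pmd))
		return true;

	/* FOLL_FORCE is needed to override a read-only pmd ... */
	if (!(flags & FOLL_FORCE))
		return false;

	/* ... but it has no effect on shared mappings, */
	if (vma->vm_flags & (VM_MAYSHARE | VM_SHARED))
		return false;

	/*
	 * ... and otherwise only an exclusive anonymous page may be
	 * followed for writing -- this is the PageAnonExclusive() call
	 * that trips the VM_BUG_ON_PGFLAGS above on a hugetlb tail page.
	 */
	if (!page || !PageAnon(page) || !PageAnonExclusive(page))
		return false;

	return true;
}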

I don't think we keep the PAE flag only in the head page for hugetlb pages? So we can't just remove this assert?

I tried just commenting it out and hit the assert further down in follow_huge_pmd():

VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
			!PageAnonExclusive(page), page);

Thanks,
Ryan


^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v4 13/13] mm/gup: Handle hugetlb in the generic follow_page_mask code
  2024-04-02 15:26       ` David Hildenbrand
@ 2024-04-02 16:20         ` Peter Xu
  0 siblings, 0 replies; 160+ messages in thread
From: Peter Xu @ 2024-04-02 16:20 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Ryan Roberts, linux-mm, linux-kernel, Yang Shi,
	Kirill A . Shutemov, Mike Kravetz, John Hubbard,
	Michael Ellerman, Andrew Jones, Muchun Song, linux-riscv,
	linuxppc-dev, Christophe Leroy, Andrew Morton, Christoph Hellwig,
	Lorenzo Stoakes, Matthew Wilcox, Rik van Riel, linux-arm-kernel,
	Andrea Arcangeli, Aneesh Kumar K . V, Vlastimil Babka,
	James Houghton, Jason Gunthorpe, Mike Rapoport, Axel Rasmussen

On Tue, Apr 02, 2024 at 05:26:28PM +0200, David Hildenbrand wrote:
> On 02.04.24 16:48, Ryan Roberts wrote:
> > Hi Peter,

Hey, Ryan,

Thanks for the report!

> > 
> > On 27/03/2024 15:23, peterx@redhat.com wrote:
> > > From: Peter Xu <peterx@redhat.com>
> > > 
> > > Now follow_page() is ready to handle hugetlb pages in whatever form, and
> > > across all architectures.  Switch to the generic code path.
> > > 
> > > Time to retire hugetlb_follow_page_mask(), following the previous
> > > retirement of follow_hugetlb_page() in 4849807114b8.
> > > 
> > > There may be a slight difference in how the loops run when processing slow
> > > GUP over a large hugetlb range on cont_pte/cont_pmd supported archs: with
> > > the patch applied, each loop of __get_user_pages() will resolve one pgtable
> > > entry, rather than relying on the size of the hugetlb hstate, which may
> > > cover multiple entries in one loop.
> > > 
> > > A quick performance test on an aarch64 VM on an M1 chip shows a 15%
> > > degradation over a tight loop of slow gup after the path switch.  That
> > > shouldn't be a problem because slow gup should not be a hot path for GUP in
> > > general: when the page is commonly present, fast gup will already succeed,
> > > while when the page is indeed missing and requires a follow-up page fault,
> > > the slow-gup degradation will probably be buried in the fault paths anyway.
> > > It also explains why slow gup for THP used to be very slow before
> > > 57edfcfd3419 ("mm/gup: accelerate thp gup even for "pages != NULL"")
> > > landed; that speedup came not from a performance analysis but as a side
> > > benefit.  If performance becomes a concern, we can consider handling
> > > CONT_PTE in follow_page().
> > > 
> > > Until that is justified as necessary, keep everything clean and simple.
> > > 
> > > Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > 
> > Afraid I'm seeing an oops when running gup_longterm test on arm64 with current mm-unstable. Git bisect blames this patch. The oops reproduces for me every time on 2 different machines:
> > 
> > 
> > [    9.340416] kernel BUG at mm/gup.c:778!
> > [    9.340746] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
> > [    9.341199] Modules linked in:
> > [    9.341481] CPU: 1 PID: 1159 Comm: gup_longterm Not tainted 6.9.0-rc2-00210-g910ff1a347e4 #11
> > [    9.342232] Hardware name: linux,dummy-virt (DT)
> > [    9.342647] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > [    9.343195] pc : follow_page_mask+0x4d4/0x880
> > [    9.343580] lr : follow_page_mask+0x4d4/0x880
> > [    9.344018] sp : ffff8000898b3aa0
> > [    9.344345] x29: ffff8000898b3aa0 x28: fffffdffc53973e8 x27: 00003c0005d08000
> > [    9.345028] x26: ffff00014e5cfd08 x25: ffffd3513a40c000 x24: fffffdffc5d08000
> > [    9.345682] x23: ffffc1ffc0000000 x22: 0000000000080101 x21: ffff8000898b3ba8
> > [    9.346337] x20: 0000fffff4200000 x19: ffff00014e52d508 x18: 0000000000000010
> > [    9.347005] x17: 5f656e6f7a5f7369 x16: 2120262620296567 x15: 6170286461654865
> > [    9.347713] x14: 6761502128454741 x13: 2929656761702865 x12: 6761705f65636976
> > [    9.348371] x11: 65645f656e6f7a5f x10: ffffd3513b31d6e0 x9 : ffffd3513852f090
> > [    9.349062] x8 : 00000000ffffefff x7 : ffffd3513b31d6e0 x6 : 0000000000000000
> > [    9.349753] x5 : ffff00017ff98cc8 x4 : 0000000000000fff x3 : 0000000000000000
> > [    9.350397] x2 : 0000000000000000 x1 : ffff000190e8b480 x0 : 0000000000000052
> > [    9.351097] Call trace:
> > [    9.351312]  follow_page_mask+0x4d4/0x880
> > [    9.351700]  __get_user_pages+0xf4/0x3e8
> > [    9.352089]  __gup_longterm_locked+0x204/0xa70
> > [    9.352516]  pin_user_pages+0x88/0xc0
> > [    9.352873]  gup_test_ioctl+0x860/0xc40
> > [    9.353249]  __arm64_sys_ioctl+0xb0/0x100
> > [    9.353648]  invoke_syscall+0x50/0x128
> > [    9.354022]  el0_svc_common.constprop.0+0x48/0xf8
> > [    9.354488]  do_el0_svc+0x28/0x40
> > [    9.354822]  el0_svc+0x34/0xe0
> > [    9.355128]  el0t_64_sync_handler+0x13c/0x158
> > [    9.355489]  el0t_64_sync+0x190/0x198
> > [    9.355793] Code: aa1803e0 d000d8e1 91220021 97fff560 (d4210000)
> > [    9.356280] ---[ end trace 0000000000000000 ]---
> > [    9.356651] note: gup_longterm[1159] exited with irqs disabled
> > [    9.357141] note: gup_longterm[1159] exited with preempt_count 2
> > [    9.358033] ------------[ cut here ]------------
> > [    9.358800] WARNING: CPU: 1 PID: 0 at kernel/context_tracking.c:128 ct_kernel_exit.constprop.0+0x108/0x120
> > [    9.360157] Modules linked in:
> > [    9.360541] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G      D            6.9.0-rc2-00210-g910ff1a347e4 #11
> > [    9.361626] Hardware name: linux,dummy-virt (DT)
> > [    9.362087] pstate: 204003c5 (nzCv DAIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > [    9.362758] pc : ct_kernel_exit.constprop.0+0x108/0x120
> > [    9.363306] lr : ct_idle_enter+0x10/0x20
> > [    9.363845] sp : ffff8000801abdc0
> > [    9.364222] x29: ffff8000801abdc0 x28: 0000000000000000 x27: 0000000000000000
> > [    9.364961] x26: 0000000000000000 x25: ffff00014149d780 x24: 0000000000000000
> > [    9.365557] x23: 0000000000000000 x22: ffffd3513b299d48 x21: ffffd3513a785730
> > [    9.366239] x20: ffffd3513b299c28 x19: ffff00017ffa7da0 x18: 0000fffff5ffffff
> > [    9.366869] x17: 0000000000000000 x16: 1fffe0002a21a8c1 x15: 0000000000000000
> > [    9.367524] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000002
> > [    9.368207] x11: 0000000000000001 x10: 0000000000000ad0 x9 : ffffd35138589230
> > [    9.369123] x8 : ffff00014149e2b0 x7 : 0000000000000000 x6 : 000000000f8c0fb2
> > [    9.370403] x5 : 4000000000000002 x4 : ffff2cb045825000 x3 : ffff8000801abdc0
> > [    9.371170] x2 : ffffd3513a782da0 x1 : 4000000000000000 x0 : ffffd3513a782da0
> > [    9.372279] Call trace:
> > [    9.372519]  ct_kernel_exit.constprop.0+0x108/0x120
> > [    9.373216]  ct_idle_enter+0x10/0x20
> > [    9.373562]  default_idle_call+0x3c/0x160
> > [    9.374055]  do_idle+0x21c/0x280
> > [    9.374394]  cpu_startup_entry+0x3c/0x50
> > [    9.374797]  secondary_start_kernel+0x140/0x168
> > [    9.375220]  __secondary_switched+0xb8/0xc0
> > [    9.375875] ---[ end trace 0000000000000000 ]---
> > 
> > 
> > The oops trigger is at mm/gup.c:778:
> > VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
> > 
> > 
> > This is the output of gup_longterm (last output is just before oops):
> > 
> > # [INFO] detected hugetlb page size: 2048 KiB
> > # [INFO] detected hugetlb page size: 32768 KiB
> > # [INFO] detected hugetlb page size: 64 KiB
> > # [INFO] detected hugetlb page size: 1048576 KiB
> > TAP version 13
> > 1..70
> > # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd
> > ok 1 Should have worked
> > # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with tmpfile
> > ok 2 Should have failed
> > # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with local tmpfile
> > ok 3 Should have failed
> > # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
> > ok 4 Should have worked
> > # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (32768 kB)
> > 
> > 
> > So 2M passed ok, and it's failing for 32M, which is cont-pmd. I'm guessing you're trying to iterate 2M into a cont-pmd folio and ending up with an unexpected tail page?
> 
> I assume we find the expected tail page, it's just that the check
> 
> VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
> 
> Doesn't make sense with hugetlb folios. We might have a tail page mapped in
> a cont-pmd entry. As soon as we call follow_huge_pmd() on "not the first
> cont-pmd entry", we trigger this check.
> 
> Likely this sanity check must also allow for hugetlb folios. Or we should
> just remove it completely.

Right, IMHO it'll be easier if we remove it; actually, I see there's one more
at the end, so I think we need to remove both.

> 
> In the past, we wanted to make sure that we never get tail pages of THP from
> PMD entries, because something would currently be broken (we don't support
> THP > PMD).

There's probably one more thing we need to do: allow PageAnonExclusive()
to work with hugetlb tails.  Even if we remove these warnings, if I read
the code right, we can still BUG_ON when checking anon-exclusive on tail
pages of a PageHuge folio.

So to fix it completely, I assume we may need two changes: patch 1 to
prepare PageAnonExclusive() to work on hugetlb tails, then patch 2 to be
squashed into the patch "mm/gup: handle huge pmd for follow_pmd_mask()".
Note: the fixup is not against this patch, as this patch only does the
"switchover" to the new path; the culprit should be the other patch.

I have them attached below; I'll also go and see whether I can run some
arm tests later today or tomorrow.  David, any comments from the
anon-exclusive side?

Thanks,

===8<===

From 26f0670acea948945222c97a9cab58428782ca69 Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Tue, 2 Apr 2024 11:52:28 -0400
Subject: [PATCH 1/2] mm: Allow anon exclusive check over hugetlb tail pages

PageAnonExclusive() used to forbid tail pages for hugetlbfs, as it used
to be called mostly in hugetlb-specific paths where the head page was
guaranteed.

As we move forward towards merging hugetlb paths into generic mm, we may
start to pass in hugetlb tail pages (with cont-pte/cont-pmd huge pages)
for such a check.  Allow it to properly fetch the head page, whose
anon-exclusiveness always represents that of the tail pages.

There's already a sign of this in fast-gup, which already contains the
hugetlb processing: we used to have a specific commit 5805192c7b72
("mm/gup: handle cont-PTE hugetlb pages correctly in gup_must_unshare()
via GUP-fast") covering that area.  Now, with this more generic change,
that workaround can also go away.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/page-flags.h |  8 +++++++-
 mm/internal.h              | 10 ----------
 2 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 888353c209c0..225357f48a79 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -1095,7 +1095,13 @@ PAGEFLAG(Isolated, isolated, PF_ANY);
 static __always_inline int PageAnonExclusive(const struct page *page)
 {
 	VM_BUG_ON_PGFLAGS(!PageAnon(page), page);
-	VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
+	/*
+	 * Allow the anon-exclusive check to work on hugetlb tail pages.
+	 * For hugetlb, the anon-exclusiveness of the head page always
+	 * represents that of the tail pages.
+	 */
+	if (PageHuge(page) && !PageHead(page))
+		page = compound_head(page);
 	return test_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
 }
 
diff --git a/mm/internal.h b/mm/internal.h
index 9512de7398d5..87f6e4fd56a5 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1259,16 +1259,6 @@ static inline bool gup_must_unshare(struct vm_area_struct *vma,
 	if (IS_ENABLED(CONFIG_HAVE_FAST_GUP))
 		smp_rmb();
 
-	/*
-	 * During GUP-fast we might not get called on the head page for a
-	 * hugetlb page that is mapped using cont-PTE, because GUP-fast does
-	 * not work with the abstracted hugetlb PTEs that always point at the
-	 * head page. For hugetlb, PageAnonExclusive only applies on the head
-	 * page (as it cannot be partially COW-shared), so lookup the head page.
-	 */
-	if (unlikely(!PageHead(page) && PageHuge(page)))
-		page = compound_head(page);
-
 	/*
 	 * Note that PageKsm() pages cannot be exclusive, and consequently,
 	 * cannot get pinned.
-- 
2.44.0

From 2411d1c6cc95a937c8b7c74e13a206c22c034fab Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Tue, 2 Apr 2024 12:04:35 -0400
Subject: [PATCH 2/2] fixup! mm/gup: handle huge pmd for follow_pmd_mask()

Allow follow_pmd_mask() to take hugetlb tail pages.  The old warnings no
longer help, as hugetlb now allows this to happen, so drop them.

Reported-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/gup.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 91d70057aea0..d60b63fcfc82 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -775,8 +775,6 @@ static struct page *follow_huge_pmd(struct vm_area_struct *vma,
 	assert_spin_locked(pmd_lockptr(mm, pmd));
 
 	page = pmd_page(pmdval);
-	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
-
 	if ((flags & FOLL_WRITE) &&
 	    !can_follow_write_pmd(pmdval, page, vma, flags))
 		return NULL;
@@ -805,7 +803,6 @@ static struct page *follow_huge_pmd(struct vm_area_struct *vma,
 
 	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
 	ctx->page_mask = HPAGE_PMD_NR - 1;
-	VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page);
 
 	return page;
 }
-- 
2.44.0

-- 
Peter Xu


^ permalink raw reply related	[flat|nested] 160+ messages in thread

* Re: [PATCH v4 13/13] mm/gup: Handle hugetlb in the generic follow_page_mask code
@ 2024-04-02 16:20         ` Peter Xu
  0 siblings, 0 replies; 160+ messages in thread
From: Peter Xu @ 2024-04-02 16:20 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Ryan Roberts, linux-mm, linux-kernel, Yang Shi,
	Kirill A . Shutemov, Mike Kravetz, John Hubbard,
	Michael Ellerman, Andrew Jones, Muchun Song, linux-riscv,
	linuxppc-dev, Christophe Leroy, Andrew Morton, Christoph Hellwig,
	Lorenzo Stoakes, Matthew Wilcox, Rik van Riel, linux-arm-kernel,
	Andrea Arcangeli, Aneesh Kumar K . V, Vlastimil Babka,
	James Houghton, Jason Gunthorpe, Mike Rapoport, Axel Rasmussen

On Tue, Apr 02, 2024 at 05:26:28PM +0200, David Hildenbrand wrote:
> On 02.04.24 16:48, Ryan Roberts wrote:
> > Hi Peter,

Hey, Ryan,

Thanks for the report!

> > 
> > On 27/03/2024 15:23, peterx@redhat.com wrote:
> > > From: Peter Xu <peterx@redhat.com>
> > > 
> > > Now follow_page() is ready to handle hugetlb pages in whatever form, and
> > > over all architectures.  Switch to the generic code path.
> > > 
> > > Time to retire hugetlb_follow_page_mask(), following the previous
> > > retirement of follow_hugetlb_page() in 4849807114b8.
> > > 
> > > There may be a slight difference of how the loops run when processing slow
> > > GUP over a large hugetlb range on cont_pte/cont_pmd supported archs: each
> > > loop of __get_user_pages() will resolve one pgtable entry with the patch
> > > applied, rather than relying on the size of hugetlb hstate, the latter may
> > > cover multiple entries in one loop.
> > > 
> > > A quick performance test on an aarch64 VM on M1 chip shows 15% degrade over
> > > a tight loop of slow gup after the path switched.  That shouldn't be a
> > > problem because slow-gup should not be a hot path for GUP in general: when
> > > page is commonly present, fast-gup will already succeed, while when the
> > > page is indeed missing and require a follow up page fault, the slow gup
> > > degrade will probably buried in the fault paths anyway.  It also explains
> > > why slow gup for THP used to be very slow before 57edfcfd3419 ("mm/gup:
> > > accelerate thp gup even for "pages != NULL"") lands, the latter not part of
> > > a performance analysis but a side benefit.  If the performance will be a
> > > concern, we can consider handle CONT_PTE in follow_page().
> > > 
> > > Before that is justified to be necessary, keep everything clean and simple.
> > > 
> > > Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > 
> > Afraid I'm seeing an oops when running gup_longterm test on arm64 with current mm-unstable. Git bisect blames this patch. The oops reproduces for me every time on 2 different machines:
> > 
> > 
> > [    9.340416] kernel BUG at mm/gup.c:778!
> > [    9.340746] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
> > [    9.341199] Modules linked in:
> > [    9.341481] CPU: 1 PID: 1159 Comm: gup_longterm Not tainted 6.9.0-rc2-00210-g910ff1a347e4 #11
> > [    9.342232] Hardware name: linux,dummy-virt (DT)
> > [    9.342647] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > [    9.343195] pc : follow_page_mask+0x4d4/0x880
> > [    9.343580] lr : follow_page_mask+0x4d4/0x880
> > [    9.344018] sp : ffff8000898b3aa0
> > [    9.344345] x29: ffff8000898b3aa0 x28: fffffdffc53973e8 x27: 00003c0005d08000
> > [    9.345028] x26: ffff00014e5cfd08 x25: ffffd3513a40c000 x24: fffffdffc5d08000
> > [    9.345682] x23: ffffc1ffc0000000 x22: 0000000000080101 x21: ffff8000898b3ba8
> > [    9.346337] x20: 0000fffff4200000 x19: ffff00014e52d508 x18: 0000000000000010
> > [    9.347005] x17: 5f656e6f7a5f7369 x16: 2120262620296567 x15: 6170286461654865
> > [    9.347713] x14: 6761502128454741 x13: 2929656761702865 x12: 6761705f65636976
> > [    9.348371] x11: 65645f656e6f7a5f x10: ffffd3513b31d6e0 x9 : ffffd3513852f090
> > [    9.349062] x8 : 00000000ffffefff x7 : ffffd3513b31d6e0 x6 : 0000000000000000
> > [    9.349753] x5 : ffff00017ff98cc8 x4 : 0000000000000fff x3 : 0000000000000000
> > [    9.350397] x2 : 0000000000000000 x1 : ffff000190e8b480 x0 : 0000000000000052
> > [    9.351097] Call trace:
> > [    9.351312]  follow_page_mask+0x4d4/0x880
> > [    9.351700]  __get_user_pages+0xf4/0x3e8
> > [    9.352089]  __gup_longterm_locked+0x204/0xa70
> > [    9.352516]  pin_user_pages+0x88/0xc0
> > [    9.352873]  gup_test_ioctl+0x860/0xc40
> > [    9.353249]  __arm64_sys_ioctl+0xb0/0x100
> > [    9.353648]  invoke_syscall+0x50/0x128
> > [    9.354022]  el0_svc_common.constprop.0+0x48/0xf8
> > [    9.354488]  do_el0_svc+0x28/0x40
> > [    9.354822]  el0_svc+0x34/0xe0
> > [    9.355128]  el0t_64_sync_handler+0x13c/0x158
> > [    9.355489]  el0t_64_sync+0x190/0x198
> > [    9.355793] Code: aa1803e0 d000d8e1 91220021 97fff560 (d4210000)
> > [    9.356280] ---[ end trace 0000000000000000 ]---
> > [    9.356651] note: gup_longterm[1159] exited with irqs disabled
> > [    9.357141] note: gup_longterm[1159] exited with preempt_count 2
> > [    9.358033] ------------[ cut here ]------------
> > [    9.358800] WARNING: CPU: 1 PID: 0 at kernel/context_tracking.c:128 ct_kernel_exit.constprop.0+0x108/0x120
> > [    9.360157] Modules linked in:
> > [    9.360541] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G      D            6.9.0-rc2-00210-g910ff1a347e4 #11
> > [    9.361626] Hardware name: linux,dummy-virt (DT)
> > [    9.362087] pstate: 204003c5 (nzCv DAIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > [    9.362758] pc : ct_kernel_exit.constprop.0+0x108/0x120
> > [    9.363306] lr : ct_idle_enter+0x10/0x20
> > [    9.363845] sp : ffff8000801abdc0
> > [    9.364222] x29: ffff8000801abdc0 x28: 0000000000000000 x27: 0000000000000000
> > [    9.364961] x26: 0000000000000000 x25: ffff00014149d780 x24: 0000000000000000
> > [    9.365557] x23: 0000000000000000 x22: ffffd3513b299d48 x21: ffffd3513a785730
> > [    9.366239] x20: ffffd3513b299c28 x19: ffff00017ffa7da0 x18: 0000fffff5ffffff
> > [    9.366869] x17: 0000000000000000 x16: 1fffe0002a21a8c1 x15: 0000000000000000
> > [    9.367524] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000002
> > [    9.368207] x11: 0000000000000001 x10: 0000000000000ad0 x9 : ffffd35138589230
> > [    9.369123] x8 : ffff00014149e2b0 x7 : 0000000000000000 x6 : 000000000f8c0fb2
> > [    9.370403] x5 : 4000000000000002 x4 : ffff2cb045825000 x3 : ffff8000801abdc0
> > [    9.371170] x2 : ffffd3513a782da0 x1 : 4000000000000000 x0 : ffffd3513a782da0
> > [    9.372279] Call trace:
> > [    9.372519]  ct_kernel_exit.constprop.0+0x108/0x120
> > [    9.373216]  ct_idle_enter+0x10/0x20
> > [    9.373562]  default_idle_call+0x3c/0x160
> > [    9.374055]  do_idle+0x21c/0x280
> > [    9.374394]  cpu_startup_entry+0x3c/0x50
> > [    9.374797]  secondary_start_kernel+0x140/0x168
> > [    9.375220]  __secondary_switched+0xb8/0xc0
> > [    9.375875] ---[ end trace 0000000000000000 ]---
> > 
> > 
> > The oops trigger is at mm/gup.c:778:
> > VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
> > 
> > 
> > This is the output of gup_longterm (last output is just before oops):
> > 
> > # [INFO] detected hugetlb page size: 2048 KiB
> > # [INFO] detected hugetlb page size: 32768 KiB
> > # [INFO] detected hugetlb page size: 64 KiB
> > # [INFO] detected hugetlb page size: 1048576 KiB
> > TAP version 13
> > 1..70
> > # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd
> > ok 1 Should have worked
> > # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with tmpfile
> > ok 2 Should have failed
> > # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with local tmpfile
> > ok 3 Should have failed
> > # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
> > ok 4 Should have worked
> > # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (32768 kB)
> > 
> > 
> > So 2M passed OK, and it's failing for 32M, which is cont-pmd. I'm guessing you're iterating 2M into a cont-pmd folio and ending up with an unexpected tail page?
> 
> I assume we find the expected tail page, it's just that the check
> 
> VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
> 
> Doesn't make sense with hugetlb folios. We might have a tail page mapped in
> a cont-pmd entry. As soon as we call follow_huge_pmd() on "not the first
> cont-pmd entry", we trigger this check.
> 
> Likely this sanity check must also allow for hugetlb folios. Or we should
> just remove it completely.

Right, IMHO it'll be easier if we remove it; actually I see there's one
more at the end of the function, so I think we need to remove both.
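
To make the failure mode concrete, below is a minimal user-space sketch
(a toy model of the arithmetic only, not kernel code; it assumes arm64
with 4K base pages, so HPAGE_PMD_NR is 512 and a 32M cont-pmd span
covers 16 pmd entries).  pmd_page() on cont-pmd entry i resolves to the
folio page at index i * HPAGE_PMD_NR, which is a tail page for every
i > 0, so the VM_BUG_ON fires on all but the first entry:

#include <stdio.h>

#define PAGE_SHIFT	12		/* assumption: 4K base pages */
#define PMD_SHIFT	21		/* 2M PMDs */
#define HPAGE_PMD_NR	(1UL << (PMD_SHIFT - PAGE_SHIFT))	/* 512 */
#define CONT_PMDS	16		/* 32M folio / 2M per pmd entry */

int main(void)
{
	unsigned long i;

	for (i = 0; i < CONT_PMDS; i++) {
		/* toy stand-in for what pmd_page() returns on entry i */
		unsigned long idx = i * HPAGE_PMD_NR;

		printf("cont-pmd entry %2lu -> folio page %5lu (%s)\n",
		       i, idx, idx ? "tail" : "head");
	}
	return 0;
}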

> 
> In the past, we wanted to make sure that we never get tail pages of THP from
> PMD entries, because something would currently be broken (we don't support
> THP > PMD).

There's probably one more thing we need to do: allow PageAnonExclusive()
to work with hugetlb tail pages.  Even if we remove these warnings, if I
read the code right we can BUG_ON again when checking anon-exclusiveness
on the tail pages of a PageHuge folio.
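
As a toy model of the delegation we want (plain C, not the kernel's real
page-flags machinery; the struct and helper below are simplified
stand-ins): the anon-exclusive bit lives only in the head page, and a
lookup on a tail page should consult the head instead of asserting:

#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-in for struct page: only heads carry the PAE bit. */
struct page {
	struct page *head;	/* compound_head(); a head points to itself */
	bool anon_exclusive;	/* meaningful on the head only */
};

static bool page_anon_exclusive(const struct page *page)
{
	/* Instead of BUG_ON(!PageHead(page)), fall back to the head. */
	return page->head->anon_exclusive;
}

int main(void)
{
	static struct page folio[4];	/* zero-initialized: head + 3 tails */
	int i;

	for (i = 0; i < 4; i++)
		folio[i].head = &folio[0];
	folio[0].anon_exclusive = true;

	/* A tail lookup now agrees with the head rather than tripping. */
	printf("head: %d tail[3]: %d\n",
	       page_anon_exclusive(&folio[0]),
	       page_anon_exclusive(&folio[3]));
	return 0;
}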

So to fix it completely, I assume we need two changes: patch 1 to
prepare PageAnonExclusive() to work on hugetlb tails, then patch 2 to be
squashed into the earlier patch "mm/gup: handle huge pmd for
follow_pmd_mask()".  Note that this is not the patch to fix up: it only
does the "switchover" to the new path, so the culprit should be the
other patch.

I have them attached below; I'll also go and see whether I can run some
arm tests later today or tomorrow.  David, any comments from the
anon-exclusive side?

Thanks,

===8<===

From 26f0670acea948945222c97a9cab58428782ca69 Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Tue, 2 Apr 2024 11:52:28 -0400
Subject: [PATCH 1/2] mm: Allow anon exclusive check over hugetlb tail pages

PageAnonExclusive() used to forbid tail pages for hugetlbfs, as it used
to be called mostly in hugetlb-specific paths where the head page was
guaranteed.

As we move towards merging hugetlb paths into generic mm, we may start
to pass in hugetlb tail pages (with cont-pte/cont-pmd huge pages) for
such checks.  Allow the helper to fetch the head page instead, whose
anon-exclusiveness always represents that of its tail pages.

There's already a sign of this in fast-gup, which already contains the
hugetlb processing: we used to need a dedicated commit 5805192c7b72
("mm/gup: handle cont-PTE hugetlb pages correctly in gup_must_unshare()
via GUP-fast") to cover that area.  With this more generic change, that
special case can also go away.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/page-flags.h |  8 +++++++-
 mm/internal.h              | 10 ----------
 2 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 888353c209c0..225357f48a79 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -1095,7 +1095,13 @@ PAGEFLAG(Isolated, isolated, PF_ANY);
 static __always_inline int PageAnonExclusive(const struct page *page)
 {
 	VM_BUG_ON_PGFLAGS(!PageAnon(page), page);
-	VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
+	/*
+	 * Allow the anon-exclusive check to work on hugetlb tail pages.
+	 * For hugetlb, the anon-exclusiveness of the head page always
+	 * represents that of its tail pages.
+	 */
+	if (PageHuge(page) && !PageHead(page))
+		page = compound_head(page);
 	return test_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
 }
 
diff --git a/mm/internal.h b/mm/internal.h
index 9512de7398d5..87f6e4fd56a5 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1259,16 +1259,6 @@ static inline bool gup_must_unshare(struct vm_area_struct *vma,
 	if (IS_ENABLED(CONFIG_HAVE_FAST_GUP))
 		smp_rmb();
 
-	/*
-	 * During GUP-fast we might not get called on the head page for a
-	 * hugetlb page that is mapped using cont-PTE, because GUP-fast does
-	 * not work with the abstracted hugetlb PTEs that always point at the
-	 * head page. For hugetlb, PageAnonExclusive only applies on the head
-	 * page (as it cannot be partially COW-shared), so lookup the head page.
-	 */
-	if (unlikely(!PageHead(page) && PageHuge(page)))
-		page = compound_head(page);
-
 	/*
 	 * Note that PageKsm() pages cannot be exclusive, and consequently,
 	 * cannot get pinned.
-- 
2.44.0

From 2411d1c6cc95a937c8b7c74e13a206c22c034fab Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Tue, 2 Apr 2024 12:04:35 -0400
Subject: [PATCH 2/2] fixup! mm/gup: handle huge pmd for follow_pmd_mask()

Allow follow_pmd_mask() to take hugetlb tail pages.  The old warnings
no longer help now that hugetlb allows tail pages to show up here, so
drop them.

Reported-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/gup.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 91d70057aea0..d60b63fcfc82 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -775,8 +775,6 @@ static struct page *follow_huge_pmd(struct vm_area_struct *vma,
 	assert_spin_locked(pmd_lockptr(mm, pmd));
 
 	page = pmd_page(pmdval);
-	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
-
 	if ((flags & FOLL_WRITE) &&
 	    !can_follow_write_pmd(pmdval, page, vma, flags))
 		return NULL;
@@ -805,7 +803,6 @@ static struct page *follow_huge_pmd(struct vm_area_struct *vma,
 
 	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
 	ctx->page_mask = HPAGE_PMD_NR - 1;
-	VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page);
 
 	return page;
 }
-- 
2.44.0

-- 
Peter Xu


* Re: [PATCH v4 13/13] mm/gup: Handle hugetlb in the generic follow_page_mask code
  2024-04-02 16:18           ` Ryan Roberts
@ 2024-04-02 16:26             ` Peter Xu
  0 siblings, 0 replies; 160+ messages in thread
From: Peter Xu @ 2024-04-02 16:26 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Matthew Wilcox, David Hildenbrand, linux-mm, linux-kernel,
	Yang Shi, Kirill A . Shutemov, Mike Kravetz, John Hubbard,
	Michael Ellerman, Andrew Jones, Muchun Song, linux-riscv,
	linuxppc-dev, Christophe Leroy, Andrew Morton, Christoph Hellwig,
	Lorenzo Stoakes, Rik van Riel, linux-arm-kernel,
	Andrea Arcangeli, Aneesh Kumar K . V, Vlastimil Babka,
	James Houghton, Jason Gunthorpe, Mike Rapoport, Axel Rasmussen

On Tue, Apr 02, 2024 at 05:18:36PM +0100, Ryan Roberts wrote:
> On 02/04/2024 17:00, Matthew Wilcox wrote:
> > On Tue, Apr 02, 2024 at 05:26:28PM +0200, David Hildenbrand wrote:
> >>> The oops trigger is at mm/gup.c:778:
> >>> VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
> >>>
> >>> So 2M passed OK, and it's failing for 32M, which is cont-pmd. I'm guessing you're iterating 2M into a cont-pmd folio and ending up with an unexpected tail page?
> >>
> >> I assume we find the expected tail page, it's just that the check
> >>
> >> VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
> >>
> >> Doesn't make sense with hugetlb folios. We might have a tail page mapped in
> >> a cont-pmd entry. As soon as we call follow_huge_pmd() on "not the first
> >> cont-pmd entry", we trigger this check.
> >>
> >> Likely this sanity check must also allow for hugetlb folios. Or we should
> >> just remove it completely.
> >>
> >> In the past, we wanted to make sure that we never get tail pages of THP from
> >> PMD entries, because something would currently be broken (we don't support
> >> THP > PMD).
> > 
> > That was a practical limitation on my part.  We have various parts of
> > the MM which assume that pmd_page() returns a head page and until we
> > get all of those fixed, adding support for folios larger than PMD_SIZE
> > was only going to cause trouble for no significant wins.
> > 
> > I agree with you we should get rid of this assertion entirely.  We should
> > fix all the places which assume that pmd_page() returns a head page,
> > but that may take some time.
> > 
> > As an example, filemap_map_pmd() has:
> > 
> >        if (pmd_none(*vmf->pmd) && folio_test_pmd_mappable(folio)) {
> >                 struct page *page = folio_file_page(folio, start);
> >                 vm_fault_t ret = do_set_pmd(vmf, page);
> > 
> > and then do_set_pmd() has:
> > 
> >         if (page != &folio->page || folio_order(folio) != HPAGE_PMD_ORDER)
> >                 return ret;
> > 
> > so we'd simply refuse to use a PMD to map a folio larger than PMD_SIZE.
> > There's a lot of work to be done to make this work generally (not to
> > mention figuring out how to handle mapcount for such folios ;-).

Hmm, I think it means there's more work than I was thinking... but
that's okay; let's move one step at a time.
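
As a quick illustration (toy C, not the kernel's real types; the helper
below is a simplified stand-in) of the do_set_pmd() guard quoted above:
it only proceeds when the candidate page is the folio head and the folio
order exactly matches the PMD order, so any folio larger than PMD_SIZE
is refused:

#include <stdbool.h>
#include <stdio.h>

#define HPAGE_PMD_ORDER	9	/* assumption: 4K pages, 2M PMDs */

/* Toy stand-in for the check in do_set_pmd() quoted above. */
static bool can_set_pmd(unsigned long page_idx, unsigned int folio_order)
{
	/* page must be the folio head and the folio exactly PMD-sized */
	return page_idx == 0 && folio_order == HPAGE_PMD_ORDER;
}

int main(void)
{
	printf("2M head:  %d\n", can_set_pmd(0, 9));	/* allowed */
	printf("32M head: %d\n", can_set_pmd(0, 13));	/* refused */
	printf("2M tail:  %d\n", can_set_pmd(5, 9));	/* refused */
	return 0;
}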

> > 
> > This particular case seems straightforward though.  Just remove the
> > assertion.
> 
> Removing the assertion gets me further, but then I end up with this:
> 
> [    9.748422] kernel BUG at include/linux/page-flags.h:1098!
> [    9.748897] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
> [    9.749590] Modules linked in:
> [    9.749867] CPU: 2 PID: 1155 Comm: gup_longterm Not tainted 6.9.0-rc2-00210-g910ff1a347e4-dirty #12
> [    9.750682] Hardware name: linux,dummy-virt (DT)
> [    9.751095] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [    9.751729] pc : follow_page_mask+0x730/0x850
> [    9.752152] lr : follow_page_mask+0x730/0x850
> [    9.752573] sp : ffff8000898f3aa0
> [    9.752882] x29: ffff8000898f3aa0 x28: fffffdffc52b91a8 x27: 0000000000000001
> [    9.753543] x26: ffff00014ae46d08 x25: 00003c0005d88000 x24: fffffdffc5d88000
> [    9.754221] x23: ffffc1ffc0000000 x22: 0000000000080101 x21: ffff8000898f3ba8
> [    9.754875] x20: 0000fffff4200000 x19: ffff0001a3d64450 x18: 0000000000000010
> [    9.755567] x17: 2864616548656761 x16: 5021202626202965 x15: 6761702865677548
> [    9.756254] x14: 6567615028454741 x13: 2929656761702864 x12: 6165486567615021
> [    9.756953] x11: 2026262029656761 x10: ffffaaac08f1d6e0 x9 : ffffaaac0612f090
> [    9.757671] x8 : 00000000ffffefff x7 : ffffaaac08f1d6e0 x6 : 0000000000000000
> [    9.758356] x5 : ffff00017ffb9cc8 x4 : 0000000000000fff x3 : 0000000000000000
> [    9.758983] x2 : 0000000000000000 x1 : ffff000189ecb480 x0 : 0000000000000046
> [    9.759663] Call trace:
> [    9.759901]  follow_page_mask+0x730/0x850
> [    9.760293]  __get_user_pages+0xf4/0x3e8
> [    9.760683]  __gup_longterm_locked+0x204/0xa70
> [    9.761110]  pin_user_pages+0x88/0xc0
> [    9.761486]  gup_test_ioctl+0x860/0xc40
> [    9.761866]  __arm64_sys_ioctl+0xb0/0x100
> [    9.762254]  invoke_syscall+0x50/0x128
> [    9.762630]  el0_svc_common.constprop.0+0x48/0xf8
> [    9.763104]  do_el0_svc+0x28/0x40
> [    9.763413]  el0_svc+0x34/0xe0
> [    9.763699]  el0t_64_sync_handler+0x13c/0x158
> [    9.764139]  el0t_64_sync+0x190/0x198
> [    9.764465] Code: aa1803e0 d000d8e1 911d6021 97fff4c9 (d4210000) 
> [    9.765053] ---[ end trace 0000000000000000 ]---
> [    9.765520] note: gup_longterm[1155] exited with irqs disabled
> [    9.766146] note: gup_longterm[1155] exited with preempt_count 2
> [    9.767366] ------------[ cut here ]------------
> [    9.768062] WARNING: CPU: 2 PID: 0 at kernel/context_tracking.c:128 ct_kernel_exit.constprop.0+0x108/0x120
> [    9.769146] Modules linked in:
> [    9.769429] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G      D            6.9.0-rc2-00210-g910ff1a347e4-dirty #12
> [    9.770338] Hardware name: linux,dummy-virt (DT)
> [    9.770837] pstate: 204003c5 (nzCv DAIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [    9.771615] pc : ct_kernel_exit.constprop.0+0x108/0x120
> [    9.772150] lr : ct_idle_enter+0x10/0x20
> [    9.772539] sp : ffff8000801b3dc0
> [    9.772913] x29: ffff8000801b3dc0 x28: 0000000000000000 x27: 0000000000000000
> [    9.773769] x26: 0000000000000000 x25: ffff00014149e900 x24: 0000000000000000
> [    9.774526] x23: 0000000000000000 x22: ffffaaac08e99d48 x21: ffffaaac08385730
> [    9.775255] x20: ffffaaac08e99c28 x19: ffff00017ffc8da0 x18: 0000fffff5ffffff
> [    9.775924] x17: 0000000000000000 x16: 1fffe0002a57c9e1 x15: 0000000000000001
> [    9.776619] x14: ffffffffffffffff x13: 0000000000000000 x12: ffffaaac07a06968
> [    9.777246] x11: 000000ae44c42eec x10: 0000000000000ad0 x9 : ffffaaac06189230
> [    9.777942] x8 : ffff00014149f430 x7 : 02c9acb509db422c x6 : 000000001015a9f0
> [    9.778635] x5 : 4000000000000002 x4 : ffff555577c46000 x3 : ffff8000801b3dc0
> [    9.779671] x2 : ffffaaac08382da0 x1 : 4000000000000000 x0 : ffffaaac08382da0
> [    9.780703] Call trace:
> [    9.781150]  ct_kernel_exit.constprop.0+0x108/0x120
> [    9.781949]  ct_idle_enter+0x10/0x20
> [    9.782246]  default_idle_call+0x3c/0x160
> [    9.782624]  do_idle+0x21c/0x280
> [    9.782945]  cpu_startup_entry+0x3c/0x50
> [    9.783268]  secondary_start_kernel+0x140/0x168
> [    9.783818]  __secondary_switched+0xb8/0xc0
> [    9.784163] ---[ end trace 0000000000000000 ]---
> 
> 
> Which is caused by this:
> 
> static __always_inline int PageAnonExclusive(const struct page *page)
> {
> 	VM_BUG_ON_PGFLAGS(!PageAnon(page), page);
> 	VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page); <<<<
> 	return test_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
> }
> 
> Which is called from can_follow_write_pmd(), just after the assert I commented out.
> 
> 
> It's triggered by this test:
> 
> # [RUN] R/W longterm GUP pin in MAP_PRIVATE file mapping ... with memfd hugetlb (32768 kB)
> 
> Which is the first MAP_PRIVATE test for cont-pmd mapped hugetlb. (All MAP_SHARED tests are passing).
> 
> 
> Looks like can_follow_write_pmd() returns early for VM_SHARED mappings.
> 
> I don't think we keep the PAE flag only in the head page for hugetlb pages, do we? So we can't just remove this assert?
> 
> I tried just commenting it out and then hit the assert further down in follow_huge_pmd():
> 
> VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
> 			!PageAnonExclusive(page), page);

I just replied in another email; we can try the two patches I attached,
or we can wait until I do some tests (but I will be mostly unavailable
this afternoon).

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v4 13/13] mm/gup: Handle hugetlb in the generic follow_page_mask code
@ 2024-04-02 16:26             ` Peter Xu
  0 siblings, 0 replies; 160+ messages in thread
From: Peter Xu @ 2024-04-02 16:26 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Matthew Wilcox, David Hildenbrand, linux-mm, linux-kernel,
	Yang Shi, Kirill A . Shutemov, Mike Kravetz, John Hubbard,
	Michael Ellerman, Andrew Jones, Muchun Song, linux-riscv,
	linuxppc-dev, Christophe Leroy, Andrew Morton, Christoph Hellwig,
	Lorenzo Stoakes, Rik van Riel, linux-arm-kernel,
	Andrea Arcangeli, Aneesh Kumar K . V, Vlastimil Babka,
	James Houghton, Jason Gunthorpe, Mike Rapoport, Axel Rasmussen

On Tue, Apr 02, 2024 at 05:18:36PM +0100, Ryan Roberts wrote:
> On 02/04/2024 17:00, Matthew Wilcox wrote:
> > On Tue, Apr 02, 2024 at 05:26:28PM +0200, David Hildenbrand wrote:
> >>> The oops trigger is at mm/gup.c:778:
> >>> VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
> >>>
> >>> So 2M passed ok, and its failing for 32M, which is cont-pmd. I'm guessing you're trying to iterate 2M into a cont-pmd folio and ending up with an unexpected tail page?
> >>
> >> I assume we find the expected tail page, it's just that the check
> >>
> >> VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
> >>
> >> Doesn't make sense with hugetlb folios. We might have a tail page mapped in
> >> a cont-pmd entry. As soon as we call follow_huge_pmd() on "not the first
> >> cont-pmd entry", we trigger this check.
> >>
> >> Likely this sanity check must also allow for hugetlb folios. Or we should
> >> just remove it completely.
> >>
> >> In the past, we wanted to make sure that we never get tail pages of THP from
> >> PMD entries, because something would currently be broken (we don't support
> >> THP > PMD).
> > 
> > That was a practical limitation on my part.  We have various parts of
> > the MM which assume that pmd_page() returns a head page and until we
> > get all of those fixed, adding support for folios larger than PMD_SIZE
> > was only going to cause trouble for no significant wins.
> > 
> > I agree with you we should get rid of this assertion entirely.  We should
> > fix all the places which assume that pmd_page() returns a head page,
> > but that may take some time.
> > 
> > As an example, filemap_map_pmd() has:
> > 
> >        if (pmd_none(*vmf->pmd) && folio_test_pmd_mappable(folio)) {
> >                 struct page *page = folio_file_page(folio, start);
> >                 vm_fault_t ret = do_set_pmd(vmf, page);
> > 
> > and then do_set_pmd() has:
> > 
> >         if (page != &folio->page || folio_order(folio) != HPAGE_PMD_ORDER)
> >                 return ret;
> > 
> > so we'd simply refuse to use a PMD to map a folio larger than PMD_SIZE.
> > There's a lot of work to be done to make this work generally (not to
> > mention figuring out how to handle mapcount for such folios ;-).

Hmm, I think it means there're more work than I was thinking... but that's
okay, let's move one step at a time..

> > 
> > This particular case seems straightforward though.  Just remove the
> > assertion.
> 
> Removing the assertion gets me further, but then I end up with this:
> 
> [    9.748422] kernel BUG at include/linux/page-flags.h:1098!
> [    9.748897] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
> [    9.749590] Modules linked in:
> [    9.749867] CPU: 2 PID: 1155 Comm: gup_longterm Not tainted 6.9.0-rc2-00210-g910ff1a347e4-dirty #12
> [    9.750682] Hardware name: linux,dummy-virt (DT)
> [    9.751095] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [    9.751729] pc : follow_page_mask+0x730/0x850
> [    9.752152] lr : follow_page_mask+0x730/0x850
> [    9.752573] sp : ffff8000898f3aa0
> [    9.752882] x29: ffff8000898f3aa0 x28: fffffdffc52b91a8 x27: 0000000000000001
> [    9.753543] x26: ffff00014ae46d08 x25: 00003c0005d88000 x24: fffffdffc5d88000
> [    9.754221] x23: ffffc1ffc0000000 x22: 0000000000080101 x21: ffff8000898f3ba8
> [    9.754875] x20: 0000fffff4200000 x19: ffff0001a3d64450 x18: 0000000000000010
> [    9.755567] x17: 2864616548656761 x16: 5021202626202965 x15: 6761702865677548
> [    9.756254] x14: 6567615028454741 x13: 2929656761702864 x12: 6165486567615021
> [    9.756953] x11: 2026262029656761 x10: ffffaaac08f1d6e0 x9 : ffffaaac0612f090
> [    9.757671] x8 : 00000000ffffefff x7 : ffffaaac08f1d6e0 x6 : 0000000000000000
> [    9.758356] x5 : ffff00017ffb9cc8 x4 : 0000000000000fff x3 : 0000000000000000
> [    9.758983] x2 : 0000000000000000 x1 : ffff000189ecb480 x0 : 0000000000000046
> [    9.759663] Call trace:
> [    9.759901]  follow_page_mask+0x730/0x850
> [    9.760293]  __get_user_pages+0xf4/0x3e8
> [    9.760683]  __gup_longterm_locked+0x204/0xa70
> [    9.761110]  pin_user_pages+0x88/0xc0
> [    9.761486]  gup_test_ioctl+0x860/0xc40
> [    9.761866]  __arm64_sys_ioctl+0xb0/0x100
> [    9.762254]  invoke_syscall+0x50/0x128
> [    9.762630]  el0_svc_common.constprop.0+0x48/0xf8
> [    9.763104]  do_el0_svc+0x28/0x40
> [    9.763413]  el0_svc+0x34/0xe0
> [    9.763699]  el0t_64_sync_handler+0x13c/0x158
> [    9.764139]  el0t_64_sync+0x190/0x198
> [    9.764465] Code: aa1803e0 d000d8e1 911d6021 97fff4c9 (d4210000) 
> [    9.765053] ---[ end trace 0000000000000000 ]---
> [    9.765520] note: gup_longterm[1155] exited with irqs disabled
> [    9.766146] note: gup_longterm[1155] exited with preempt_count 2
> [    9.767366] ------------[ cut here ]------------
> [    9.768062] WARNING: CPU: 2 PID: 0 at kernel/context_tracking.c:128 ct_kernel_exit.constprop.0+0x108/0x120
> [    9.769146] Modules linked in:
> [    9.769429] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G      D            6.9.0-rc2-00210-g910ff1a347e4-dirty #12
> [    9.770338] Hardware name: linux,dummy-virt (DT)
> [    9.770837] pstate: 204003c5 (nzCv DAIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [    9.771615] pc : ct_kernel_exit.constprop.0+0x108/0x120
> [    9.772150] lr : ct_idle_enter+0x10/0x20
> [    9.772539] sp : ffff8000801b3dc0
> [    9.772913] x29: ffff8000801b3dc0 x28: 0000000000000000 x27: 0000000000000000
> [    9.773769] x26: 0000000000000000 x25: ffff00014149e900 x24: 0000000000000000
> [    9.774526] x23: 0000000000000000 x22: ffffaaac08e99d48 x21: ffffaaac08385730
> [    9.775255] x20: ffffaaac08e99c28 x19: ffff00017ffc8da0 x18: 0000fffff5ffffff
> [    9.775924] x17: 0000000000000000 x16: 1fffe0002a57c9e1 x15: 0000000000000001
> [    9.776619] x14: ffffffffffffffff x13: 0000000000000000 x12: ffffaaac07a06968
> [    9.777246] x11: 000000ae44c42eec x10: 0000000000000ad0 x9 : ffffaaac06189230
> [    9.777942] x8 : ffff00014149f430 x7 : 02c9acb509db422c x6 : 000000001015a9f0
> [    9.778635] x5 : 4000000000000002 x4 : ffff555577c46000 x3 : ffff8000801b3dc0
> [    9.779671] x2 : ffffaaac08382da0 x1 : 4000000000000000 x0 : ffffaaac08382da0
> [    9.780703] Call trace:
> [    9.781150]  ct_kernel_exit.constprop.0+0x108/0x120
> [    9.781949]  ct_idle_enter+0x10/0x20
> [    9.782246]  default_idle_call+0x3c/0x160
> [    9.782624]  do_idle+0x21c/0x280
> [    9.782945]  cpu_startup_entry+0x3c/0x50
> [    9.783268]  secondary_start_kernel+0x140/0x168
> [    9.783818]  __secondary_switched+0xb8/0xc0
> [    9.784163] ---[ end trace 0000000000000000 ]---
> 
> 
> Which is caused by this:
> 
> static __always_inline int PageAnonExclusive(const struct page *page)
> {
> 	VM_BUG_ON_PGFLAGS(!PageAnon(page), page);
> 	VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page); <<<<
> 	return test_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
> }
> 
> Which is called from can_follow_write_pmd(), called just after the assert I just commented out.
> 
> 
> It's triggered by this test:
> 
> # [RUN] R/W longterm GUP pin in MAP_PRIVATE file mapping ... with memfd hugetlb (32768 kB)
> 
> Which is the first MAP_PRIVATE test for cont-pmd mapped hugetlb. (All MAP_SHARED tests are passing).
> 
> 
> Looks like can_follow_write_pmd() returns early for VM_SHARED mappings.
> 
> I don't think we only keep the PAE flag in the head page for hugetlb pages? So we can't just remove this assert?
> 
> I tried just commenting it out and get assert further down follow_huge_pmd():
> 
> VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
> 			!PageAnonExclusive(page), page);

I just replied in another email; we can try the two patches I attached, or
we can wait until I do some tests (but will be mostly unavailable this
afternoon).

Thanks,

-- 
Peter Xu


_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v4 13/13] mm/gup: Handle hugetlb in the generic follow_page_mask code
@ 2024-04-02 16:26             ` Peter Xu
  0 siblings, 0 replies; 160+ messages in thread
From: Peter Xu @ 2024-04-02 16:26 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Matthew Wilcox, David Hildenbrand, linux-mm, linux-kernel,
	Yang Shi, Kirill A . Shutemov, Mike Kravetz, John Hubbard,
	Michael Ellerman, Andrew Jones, Muchun Song, linux-riscv,
	linuxppc-dev, Christophe Leroy, Andrew Morton, Christoph Hellwig,
	Lorenzo Stoakes, Rik van Riel, linux-arm-kernel,
	Andrea Arcangeli, Aneesh Kumar K . V, Vlastimil Babka,
	James Houghton, Jason Gunthorpe, Mike Rapoport, Axel Rasmussen

On Tue, Apr 02, 2024 at 05:18:36PM +0100, Ryan Roberts wrote:
> On 02/04/2024 17:00, Matthew Wilcox wrote:
> > On Tue, Apr 02, 2024 at 05:26:28PM +0200, David Hildenbrand wrote:
> >>> The oops trigger is at mm/gup.c:778:
> >>> VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
> >>>
> >>> So 2M passed ok, and its failing for 32M, which is cont-pmd. I'm guessing you're trying to iterate 2M into a cont-pmd folio and ending up with an unexpected tail page?
> >>
> >> I assume we find the expected tail page, it's just that the check
> >>
> >> VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
> >>
> >> Doesn't make sense with hugetlb folios. We might have a tail page mapped in
> >> a cont-pmd entry. As soon as we call follow_huge_pmd() on "not the first
> >> cont-pmd entry", we trigger this check.
> >>
> >> Likely this sanity check must also allow for hugetlb folios. Or we should
> >> just remove it completely.
> >>
> >> In the past, we wanted to make sure that we never get tail pages of THP from
> >> PMD entries, because something would currently be broken (we don't support
> >> THP > PMD).
> > 
> > That was a practical limitation on my part.  We have various parts of
> > the MM which assume that pmd_page() returns a head page and until we
> > get all of those fixed, adding support for folios larger than PMD_SIZE
> > was only going to cause trouble for no significant wins.
> > 
> > I agree with you we should get rid of this assertion entirely.  We should
> > fix all the places which assume that pmd_page() returns a head page,
> > but that may take some time.
> > 
> > As an example, filemap_map_pmd() has:
> > 
> >        if (pmd_none(*vmf->pmd) && folio_test_pmd_mappable(folio)) {
> >                 struct page *page = folio_file_page(folio, start);
> >                 vm_fault_t ret = do_set_pmd(vmf, page);
> > 
> > and then do_set_pmd() has:
> > 
> >         if (page != &folio->page || folio_order(folio) != HPAGE_PMD_ORDER)
> >                 return ret;
> > 
> > so we'd simply refuse to use a PMD to map a folio larger than PMD_SIZE.
> > There's a lot of work to be done to make this work generally (not to
> > mention figuring out how to handle mapcount for such folios ;-).

Hmm, I think it means there're more work than I was thinking... but that's
okay, let's move one step at a time..

> > 
> > This particular case seems straightforward though.  Just remove the
> > assertion.
> 
> Removing the assertion gets me further, but then I end up with this:
> 
> [    9.748422] kernel BUG at include/linux/page-flags.h:1098!
> [    9.748897] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
> [    9.749590] Modules linked in:
> [    9.749867] CPU: 2 PID: 1155 Comm: gup_longterm Not tainted 6.9.0-rc2-00210-g910ff1a347e4-dirty #12
> [    9.750682] Hardware name: linux,dummy-virt (DT)
> [    9.751095] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [    9.751729] pc : follow_page_mask+0x730/0x850
> [    9.752152] lr : follow_page_mask+0x730/0x850
> [    9.752573] sp : ffff8000898f3aa0
> [    9.752882] x29: ffff8000898f3aa0 x28: fffffdffc52b91a8 x27: 0000000000000001
> [    9.753543] x26: ffff00014ae46d08 x25: 00003c0005d88000 x24: fffffdffc5d88000
> [    9.754221] x23: ffffc1ffc0000000 x22: 0000000000080101 x21: ffff8000898f3ba8
> [    9.754875] x20: 0000fffff4200000 x19: ffff0001a3d64450 x18: 0000000000000010
> [    9.755567] x17: 2864616548656761 x16: 5021202626202965 x15: 6761702865677548
> [    9.756254] x14: 6567615028454741 x13: 2929656761702864 x12: 6165486567615021
> [    9.756953] x11: 2026262029656761 x10: ffffaaac08f1d6e0 x9 : ffffaaac0612f090
> [    9.757671] x8 : 00000000ffffefff x7 : ffffaaac08f1d6e0 x6 : 0000000000000000
> [    9.758356] x5 : ffff00017ffb9cc8 x4 : 0000000000000fff x3 : 0000000000000000
> [    9.758983] x2 : 0000000000000000 x1 : ffff000189ecb480 x0 : 0000000000000046
> [    9.759663] Call trace:
> [    9.759901]  follow_page_mask+0x730/0x850
> [    9.760293]  __get_user_pages+0xf4/0x3e8
> [    9.760683]  __gup_longterm_locked+0x204/0xa70
> [    9.761110]  pin_user_pages+0x88/0xc0
> [    9.761486]  gup_test_ioctl+0x860/0xc40
> [    9.761866]  __arm64_sys_ioctl+0xb0/0x100
> [    9.762254]  invoke_syscall+0x50/0x128
> [    9.762630]  el0_svc_common.constprop.0+0x48/0xf8
> [    9.763104]  do_el0_svc+0x28/0x40
> [    9.763413]  el0_svc+0x34/0xe0
> [    9.763699]  el0t_64_sync_handler+0x13c/0x158
> [    9.764139]  el0t_64_sync+0x190/0x198
> [    9.764465] Code: aa1803e0 d000d8e1 911d6021 97fff4c9 (d4210000) 
> [    9.765053] ---[ end trace 0000000000000000 ]---
> [    9.765520] note: gup_longterm[1155] exited with irqs disabled
> [    9.766146] note: gup_longterm[1155] exited with preempt_count 2
> [    9.767366] ------------[ cut here ]------------
> [    9.768062] WARNING: CPU: 2 PID: 0 at kernel/context_tracking.c:128 ct_kernel_exit.constprop.0+0x108/0x120
> [    9.769146] Modules linked in:
> [    9.769429] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G      D            6.9.0-rc2-00210-g910ff1a347e4-dirty #12
> [    9.770338] Hardware name: linux,dummy-virt (DT)
> [    9.770837] pstate: 204003c5 (nzCv DAIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [    9.771615] pc : ct_kernel_exit.constprop.0+0x108/0x120
> [    9.772150] lr : ct_idle_enter+0x10/0x20
> [    9.772539] sp : ffff8000801b3dc0
> [    9.772913] x29: ffff8000801b3dc0 x28: 0000000000000000 x27: 0000000000000000
> [    9.773769] x26: 0000000000000000 x25: ffff00014149e900 x24: 0000000000000000
> [    9.774526] x23: 0000000000000000 x22: ffffaaac08e99d48 x21: ffffaaac08385730
> [    9.775255] x20: ffffaaac08e99c28 x19: ffff00017ffc8da0 x18: 0000fffff5ffffff
> [    9.775924] x17: 0000000000000000 x16: 1fffe0002a57c9e1 x15: 0000000000000001
> [    9.776619] x14: ffffffffffffffff x13: 0000000000000000 x12: ffffaaac07a06968
> [    9.777246] x11: 000000ae44c42eec x10: 0000000000000ad0 x9 : ffffaaac06189230
> [    9.777942] x8 : ffff00014149f430 x7 : 02c9acb509db422c x6 : 000000001015a9f0
> [    9.778635] x5 : 4000000000000002 x4 : ffff555577c46000 x3 : ffff8000801b3dc0
> [    9.779671] x2 : ffffaaac08382da0 x1 : 4000000000000000 x0 : ffffaaac08382da0
> [    9.780703] Call trace:
> [    9.781150]  ct_kernel_exit.constprop.0+0x108/0x120
> [    9.781949]  ct_idle_enter+0x10/0x20
> [    9.782246]  default_idle_call+0x3c/0x160
> [    9.782624]  do_idle+0x21c/0x280
> [    9.782945]  cpu_startup_entry+0x3c/0x50
> [    9.783268]  secondary_start_kernel+0x140/0x168
> [    9.783818]  __secondary_switched+0xb8/0xc0
> [    9.784163] ---[ end trace 0000000000000000 ]---
> 
> 
> Which is caused by this:
> 
> static __always_inline int PageAnonExclusive(const struct page *page)
> {
> 	VM_BUG_ON_PGFLAGS(!PageAnon(page), page);
> 	VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page); <<<<
> 	return test_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
> }
> 
> This is called from can_follow_write_pmd(), just after the assert I commented out.
> 
> 
> It's triggered by this test:
> 
> # [RUN] R/W longterm GUP pin in MAP_PRIVATE file mapping ... with memfd hugetlb (32768 kB)
> 
> Which is the first MAP_PRIVATE test for cont-pmd mapped hugetlb. (All MAP_SHARED tests are passing).
> 
> 
> Looks like can_follow_write_pmd() returns early for VM_SHARED mappings.
> 
> Don't we only keep the PAE flag in the head page for hugetlb pages? So we can't just remove this assert?
> 
> I tried just commenting it out and got the assert further down in follow_huge_pmd():
> 
> VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
> 			!PageAnonExclusive(page), page);

I just replied in another email; we can try the two patches I attached, or
we can wait until I do some tests (but I will be mostly unavailable this
afternoon).
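
To illustrate the direction of the first change (a sketch only, not the
attached patch itself; it assumes the PAE bit is kept on the hugetlb head
page, so tail-page queries get redirected there):

	static __always_inline int PageAnonExclusive(const struct page *page)
	{
		VM_BUG_ON_PGFLAGS(!PageAnon(page), page);
		/* hugetlb keeps PG_anon_exclusive on the head page only */
		if (PageHuge(page))
			page = compound_head(page);
		return test_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
	}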

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v4 13/13] mm/gup: Handle hugetlb in the generic follow_page_mask code
  2024-04-02 16:20         ` Peter Xu
@ 2024-04-02 16:39           ` David Hildenbrand
  -1 siblings, 0 replies; 160+ messages in thread
From: David Hildenbrand @ 2024-04-02 16:39 UTC (permalink / raw)
  To: Peter Xu
  Cc: Ryan Roberts, linux-mm, linux-kernel, Yang Shi,
	Kirill A . Shutemov, Mike Kravetz, John Hubbard,
	Michael Ellerman, Andrew Jones, Muchun Song, linux-riscv,
	linuxppc-dev, Christophe Leroy, Andrew Morton, Christoph Hellwig,
	Lorenzo Stoakes, Matthew Wilcox, Rik van Riel, linux-arm-kernel,
	Andrea Arcangeli, Aneesh Kumar K . V, Vlastimil Babka,
	James Houghton, Jason Gunthorpe, Mike Rapoport, Axel Rasmussen

On 02.04.24 18:20, Peter Xu wrote:
> On Tue, Apr 02, 2024 at 05:26:28PM +0200, David Hildenbrand wrote:
>> On 02.04.24 16:48, Ryan Roberts wrote:
>>> Hi Peter,
> 
> Hey, Ryan,
> 
> Thanks for the report!
> 
>>>
>>> On 27/03/2024 15:23, peterx@redhat.com wrote:
>>>> From: Peter Xu <peterx@redhat.com>
>>>>
>>>> Now follow_page() is ready to handle hugetlb pages in whatever form, on
>>>> all architectures.  Switch to the generic code path.
>>>>
>>>> Time to retire hugetlb_follow_page_mask(), following the previous
>>>> retirement of follow_hugetlb_page() in 4849807114b8.
>>>>
>>>> There may be a slight difference in how the loops run when processing slow
>>>> GUP over a large hugetlb range on cont_pte/cont_pmd supported archs: with
>>>> the patch applied, each loop of __get_user_pages() will resolve one pgtable
>>>> entry, rather than relying on the size of the hugetlb hstate, which may
>>>> cover multiple entries in one loop.
>>>>
>>>> A quick performance test on an aarch64 VM on an M1 chip shows a 15%
>>>> degradation over a tight loop of slow gup after the path switch.  That
>>>> shouldn't be a problem because slow gup should not be a hot path for GUP in
>>>> general: when a page is commonly present, fast gup will already succeed,
>>>> while when the page is indeed missing and requires a follow-up page fault,
>>>> the slow gup degradation will probably be buried in the fault paths anyway.
>>>> It also explains why slow gup for THP used to be very slow before
>>>> 57edfcfd3419 ("mm/gup: accelerate thp gup even for "pages != NULL"")
>>>> landed, the latter not part of a performance analysis but a side benefit.
>>>> If performance becomes a concern, we can consider handling CONT_PTE in
>>>> follow_page().
>>>>
>>>> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
>>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>>
>>> Afraid I'm seeing an oops when running the gup_longterm test on arm64 with current mm-unstable. Git bisect blames this patch. The oops reproduces for me every time on 2 different machines:
>>>
>>>
>>> [    9.340416] kernel BUG at mm/gup.c:778!
>>> [    9.340746] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
>>> [    9.341199] Modules linked in:
>>> [    9.341481] CPU: 1 PID: 1159 Comm: gup_longterm Not tainted 6.9.0-rc2-00210-g910ff1a347e4 #11
>>> [    9.342232] Hardware name: linux,dummy-virt (DT)
>>> [    9.342647] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>> [    9.343195] pc : follow_page_mask+0x4d4/0x880
>>> [    9.343580] lr : follow_page_mask+0x4d4/0x880
>>> [    9.344018] sp : ffff8000898b3aa0
>>> [    9.344345] x29: ffff8000898b3aa0 x28: fffffdffc53973e8 x27: 00003c0005d08000
>>> [    9.345028] x26: ffff00014e5cfd08 x25: ffffd3513a40c000 x24: fffffdffc5d08000
>>> [    9.345682] x23: ffffc1ffc0000000 x22: 0000000000080101 x21: ffff8000898b3ba8
>>> [    9.346337] x20: 0000fffff4200000 x19: ffff00014e52d508 x18: 0000000000000010
>>> [    9.347005] x17: 5f656e6f7a5f7369 x16: 2120262620296567 x15: 6170286461654865
>>> [    9.347713] x14: 6761502128454741 x13: 2929656761702865 x12: 6761705f65636976
>>> [    9.348371] x11: 65645f656e6f7a5f x10: ffffd3513b31d6e0 x9 : ffffd3513852f090
>>> [    9.349062] x8 : 00000000ffffefff x7 : ffffd3513b31d6e0 x6 : 0000000000000000
>>> [    9.349753] x5 : ffff00017ff98cc8 x4 : 0000000000000fff x3 : 0000000000000000
>>> [    9.350397] x2 : 0000000000000000 x1 : ffff000190e8b480 x0 : 0000000000000052
>>> [    9.351097] Call trace:
>>> [    9.351312]  follow_page_mask+0x4d4/0x880
>>> [    9.351700]  __get_user_pages+0xf4/0x3e8
>>> [    9.352089]  __gup_longterm_locked+0x204/0xa70
>>> [    9.352516]  pin_user_pages+0x88/0xc0
>>> [    9.352873]  gup_test_ioctl+0x860/0xc40
>>> [    9.353249]  __arm64_sys_ioctl+0xb0/0x100
>>> [    9.353648]  invoke_syscall+0x50/0x128
>>> [    9.354022]  el0_svc_common.constprop.0+0x48/0xf8
>>> [    9.354488]  do_el0_svc+0x28/0x40
>>> [    9.354822]  el0_svc+0x34/0xe0
>>> [    9.355128]  el0t_64_sync_handler+0x13c/0x158
>>> [    9.355489]  el0t_64_sync+0x190/0x198
>>> [    9.355793] Code: aa1803e0 d000d8e1 91220021 97fff560 (d4210000)
>>> [    9.356280] ---[ end trace 0000000000000000 ]---
>>> [    9.356651] note: gup_longterm[1159] exited with irqs disabled
>>> [    9.357141] note: gup_longterm[1159] exited with preempt_count 2
>>> [    9.358033] ------------[ cut here ]------------
>>> [    9.358800] WARNING: CPU: 1 PID: 0 at kernel/context_tracking.c:128 ct_kernel_exit.constprop.0+0x108/0x120
>>> [    9.360157] Modules linked in:
>>> [    9.360541] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G      D            6.9.0-rc2-00210-g910ff1a347e4 #11
>>> [    9.361626] Hardware name: linux,dummy-virt (DT)
>>> [    9.362087] pstate: 204003c5 (nzCv DAIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>> [    9.362758] pc : ct_kernel_exit.constprop.0+0x108/0x120
>>> [    9.363306] lr : ct_idle_enter+0x10/0x20
>>> [    9.363845] sp : ffff8000801abdc0
>>> [    9.364222] x29: ffff8000801abdc0 x28: 0000000000000000 x27: 0000000000000000
>>> [    9.364961] x26: 0000000000000000 x25: ffff00014149d780 x24: 0000000000000000
>>> [    9.365557] x23: 0000000000000000 x22: ffffd3513b299d48 x21: ffffd3513a785730
>>> [    9.366239] x20: ffffd3513b299c28 x19: ffff00017ffa7da0 x18: 0000fffff5ffffff
>>> [    9.366869] x17: 0000000000000000 x16: 1fffe0002a21a8c1 x15: 0000000000000000
>>> [    9.367524] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000002
>>> [    9.368207] x11: 0000000000000001 x10: 0000000000000ad0 x9 : ffffd35138589230
>>> [    9.369123] x8 : ffff00014149e2b0 x7 : 0000000000000000 x6 : 000000000f8c0fb2
>>> [    9.370403] x5 : 4000000000000002 x4 : ffff2cb045825000 x3 : ffff8000801abdc0
>>> [    9.371170] x2 : ffffd3513a782da0 x1 : 4000000000000000 x0 : ffffd3513a782da0
>>> [    9.372279] Call trace:
>>> [    9.372519]  ct_kernel_exit.constprop.0+0x108/0x120
>>> [    9.373216]  ct_idle_enter+0x10/0x20
>>> [    9.373562]  default_idle_call+0x3c/0x160
>>> [    9.374055]  do_idle+0x21c/0x280
>>> [    9.374394]  cpu_startup_entry+0x3c/0x50
>>> [    9.374797]  secondary_start_kernel+0x140/0x168
>>> [    9.375220]  __secondary_switched+0xb8/0xc0
>>> [    9.375875] ---[ end trace 0000000000000000 ]---
>>>
>>>
>>> The oops trigger is at mm/gup.c:778:
>>> VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
>>>
>>>
>>> This is the output of gup_longterm (last output is just before oops):
>>>
>>> # [INFO] detected hugetlb page size: 2048 KiB
>>> # [INFO] detected hugetlb page size: 32768 KiB
>>> # [INFO] detected hugetlb page size: 64 KiB
>>> # [INFO] detected hugetlb page size: 1048576 KiB
>>> TAP version 13
>>> 1..70
>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd
>>> ok 1 Should have worked
>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with tmpfile
>>> ok 2 Should have failed
>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with local tmpfile
>>> ok 3 Should have failed
>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
>>> ok 4 Should have worked
>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (32768 kB)
>>>
>>>
>>> So 2M passed OK, and it's failing for 32M, which is cont-pmd. I'm guessing you're trying to iterate 2M into a cont-pmd folio and ending up with an unexpected tail page?
>>
>> I assume we find the expected tail page, it's just that the check
>>
>> VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
>>
>> doesn't make sense with hugetlb folios. We might have a tail page mapped in
>> a cont-pmd entry. As soon as we call follow_huge_pmd() on "not the first
>> cont-pmd entry", we trigger this check.
>>
>> Likely this sanity check must also allow for hugetlb folios. Or we should
>> just remove it completely.
> 
> Right, IMHO it'll be easier if we remove it; actually I see there's one more
> at the end, so I think we need to remove both.
> 
>>
>> In the past, we wanted to make sure that we never get tail pages of THP from
>> PMD entries, because something would currently be broken (we don't support
>> THP > PMD).
> 
> There's probably one more thing we need to do: allow
> PageAnonExclusive() to work with hugetlb tails. Even if we remove the
> warnings, if I read the code right, we can hit a BUG_ON again when checking
> tail pages for anon-exclusive on PageHuge.
> 
> So I assume that to fix it completely, we may need two changes: patch 1 to
> prepare PageAnonExclusive() to work on hugetlb tails, then patch 2 to be
> squashed into the patch "mm/gup: handle huge pmd for follow_pmd_mask()".
> Note: the fixup is not for this patch, as this patch only does the
> "switchover" to the new path; the culprit should be the other patch.
> 
> I have them attached below; I'll also go and see whether I can run some
> arm tests later today or tomorrow.  David, any comments from the
> anon-exclusive side?

I added the PageAnonExclusive checks for hugetlb back then, because 
calling it on a tail page indicated real trouble for hugetlb.

Well, and I didn't want to have runtime-hugetlb checks in 
PageAnonExclusive code called on certainly-not-hugetlb code paths.

Personally, I'd fix up the problematic callsite where we know nothing
nasty is happening (like we did for gup_must_unshare(), because we don't
expect hugetlb tail pages from arbitrary other code).

But as I'm getting closer to a folio_test_anon_exclusive() 
implementation as we speak (closer, but not done :) ... ), where I'd 
remove any such hugetlb special handling, I don't particularly care how 
we handle GUP here in the meantime.
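
For context, the gup_must_unshare() handling I'm referring to is roughly
the following pattern (paraphrased; the exact code in mm/internal.h may
differ):

	/*
	 * GUP-fast might not see the head page for a hugetlb page mapped
	 * via cont-PTE, and PageAnonExclusive() only applies to the head
	 * page there, so look it up explicitly.
	 */
	if (unlikely(!PageHead(page) && PageHuge(page)))
		page = compound_head(page);
	return !PageAnonExclusive(page);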

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v4 13/13] mm/gup: Handle hugetlb in the generic follow_page_mask code
@ 2024-04-02 16:39           ` David Hildenbrand
  0 siblings, 0 replies; 160+ messages in thread
From: David Hildenbrand @ 2024-04-02 16:39 UTC (permalink / raw)
  To: Peter Xu
  Cc: Ryan Roberts, linux-mm, linux-kernel, Yang Shi,
	Kirill A . Shutemov, Mike Kravetz, John Hubbard,
	Michael Ellerman, Andrew Jones, Muchun Song, linux-riscv,
	linuxppc-dev, Christophe Leroy, Andrew Morton, Christoph Hellwig,
	Lorenzo Stoakes, Matthew Wilcox, Rik van Riel, linux-arm-kernel,
	Andrea Arcangeli, Aneesh Kumar K . V, Vlastimil Babka,
	James Houghton, Jason Gunthorpe, Mike Rapoport, Axel Rasmussen

On 02.04.24 18:20, Peter Xu wrote:
> On Tue, Apr 02, 2024 at 05:26:28PM +0200, David Hildenbrand wrote:
>> On 02.04.24 16:48, Ryan Roberts wrote:
>>> Hi Peter,
> 
> Hey, Ryan,
> 
> Thanks for the report!
> 
>>>
>>> On 27/03/2024 15:23, peterx@redhat.com wrote:
>>>> From: Peter Xu <peterx@redhat.com>
>>>>
>>>> Now follow_page() is ready to handle hugetlb pages in whatever form, and
>>>> over all architectures.  Switch to the generic code path.
>>>>
>>>> Time to retire hugetlb_follow_page_mask(), following the previous
>>>> retirement of follow_hugetlb_page() in 4849807114b8.
>>>>
>>>> There may be a slight difference of how the loops run when processing slow
>>>> GUP over a large hugetlb range on cont_pte/cont_pmd supported archs: each
>>>> loop of __get_user_pages() will resolve one pgtable entry with the patch
>>>> applied, rather than relying on the size of hugetlb hstate, the latter may
>>>> cover multiple entries in one loop.
>>>>
>>>> A quick performance test on an aarch64 VM on M1 chip shows 15% degrade over
>>>> a tight loop of slow gup after the path switched.  That shouldn't be a
>>>> problem because slow-gup should not be a hot path for GUP in general: when
>>>> page is commonly present, fast-gup will already succeed, while when the
>>>> page is indeed missing and require a follow up page fault, the slow gup
>>>> degrade will probably buried in the fault paths anyway.  It also explains
>>>> why slow gup for THP used to be very slow before 57edfcfd3419 ("mm/gup:
>>>> accelerate thp gup even for "pages != NULL"") lands, the latter not part of
>>>> a performance analysis but a side benefit.  If the performance will be a
>>>> concern, we can consider handle CONT_PTE in follow_page().
>>>>
>>>> Before that is justified to be necessary, keep everything clean and simple.
>>>>
>>>> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
>>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>>
>>> Afraid I'm seeing an oops when running gup_longterm test on arm64 with current mm-unstable. Git bisect blames this patch. The oops reproduces for me every time on 2 different machines:
>>>
>>>
>>> [    9.340416] kernel BUG at mm/gup.c:778!
>>> [    9.340746] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
>>> [    9.341199] Modules linked in:
>>> [    9.341481] CPU: 1 PID: 1159 Comm: gup_longterm Not tainted 6.9.0-rc2-00210-g910ff1a347e4 #11
>>> [    9.342232] Hardware name: linux,dummy-virt (DT)
>>> [    9.342647] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>> [    9.343195] pc : follow_page_mask+0x4d4/0x880
>>> [    9.343580] lr : follow_page_mask+0x4d4/0x880
>>> [    9.344018] sp : ffff8000898b3aa0
>>> [    9.344345] x29: ffff8000898b3aa0 x28: fffffdffc53973e8 x27: 00003c0005d08000
>>> [    9.345028] x26: ffff00014e5cfd08 x25: ffffd3513a40c000 x24: fffffdffc5d08000
>>> [    9.345682] x23: ffffc1ffc0000000 x22: 0000000000080101 x21: ffff8000898b3ba8
>>> [    9.346337] x20: 0000fffff4200000 x19: ffff00014e52d508 x18: 0000000000000010
>>> [    9.347005] x17: 5f656e6f7a5f7369 x16: 2120262620296567 x15: 6170286461654865
>>> [    9.347713] x14: 6761502128454741 x13: 2929656761702865 x12: 6761705f65636976
>>> [    9.348371] x11: 65645f656e6f7a5f x10: ffffd3513b31d6e0 x9 : ffffd3513852f090
>>> [    9.349062] x8 : 00000000ffffefff x7 : ffffd3513b31d6e0 x6 : 0000000000000000
>>> [    9.349753] x5 : ffff00017ff98cc8 x4 : 0000000000000fff x3 : 0000000000000000
>>> [    9.350397] x2 : 0000000000000000 x1 : ffff000190e8b480 x0 : 0000000000000052
>>> [    9.351097] Call trace:
>>> [    9.351312]  follow_page_mask+0x4d4/0x880
>>> [    9.351700]  __get_user_pages+0xf4/0x3e8
>>> [    9.352089]  __gup_longterm_locked+0x204/0xa70
>>> [    9.352516]  pin_user_pages+0x88/0xc0
>>> [    9.352873]  gup_test_ioctl+0x860/0xc40
>>> [    9.353249]  __arm64_sys_ioctl+0xb0/0x100
>>> [    9.353648]  invoke_syscall+0x50/0x128
>>> [    9.354022]  el0_svc_common.constprop.0+0x48/0xf8
>>> [    9.354488]  do_el0_svc+0x28/0x40
>>> [    9.354822]  el0_svc+0x34/0xe0
>>> [    9.355128]  el0t_64_sync_handler+0x13c/0x158
>>> [    9.355489]  el0t_64_sync+0x190/0x198
>>> [    9.355793] Code: aa1803e0 d000d8e1 91220021 97fff560 (d4210000)
>>> [    9.356280] ---[ end trace 0000000000000000 ]---
>>> [    9.356651] note: gup_longterm[1159] exited with irqs disabled
>>> [    9.357141] note: gup_longterm[1159] exited with preempt_count 2
>>> [    9.358033] ------------[ cut here ]------------
>>> [    9.358800] WARNING: CPU: 1 PID: 0 at kernel/context_tracking.c:128 ct_kernel_exit.constprop.0+0x108/0x120
>>> [    9.360157] Modules linked in:
>>> [    9.360541] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G      D            6.9.0-rc2-00210-g910ff1a347e4 #11
>>> [    9.361626] Hardware name: linux,dummy-virt (DT)
>>> [    9.362087] pstate: 204003c5 (nzCv DAIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>> [    9.362758] pc : ct_kernel_exit.constprop.0+0x108/0x120
>>> [    9.363306] lr : ct_idle_enter+0x10/0x20
>>> [    9.363845] sp : ffff8000801abdc0
>>> [    9.364222] x29: ffff8000801abdc0 x28: 0000000000000000 x27: 0000000000000000
>>> [    9.364961] x26: 0000000000000000 x25: ffff00014149d780 x24: 0000000000000000
>>> [    9.365557] x23: 0000000000000000 x22: ffffd3513b299d48 x21: ffffd3513a785730
>>> [    9.366239] x20: ffffd3513b299c28 x19: ffff00017ffa7da0 x18: 0000fffff5ffffff
>>> [    9.366869] x17: 0000000000000000 x16: 1fffe0002a21a8c1 x15: 0000000000000000
>>> [    9.367524] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000002
>>> [    9.368207] x11: 0000000000000001 x10: 0000000000000ad0 x9 : ffffd35138589230
>>> [    9.369123] x8 : ffff00014149e2b0 x7 : 0000000000000000 x6 : 000000000f8c0fb2
>>> [    9.370403] x5 : 4000000000000002 x4 : ffff2cb045825000 x3 : ffff8000801abdc0
>>> [    9.371170] x2 : ffffd3513a782da0 x1 : 4000000000000000 x0 : ffffd3513a782da0
>>> [    9.372279] Call trace:
>>> [    9.372519]  ct_kernel_exit.constprop.0+0x108/0x120
>>> [    9.373216]  ct_idle_enter+0x10/0x20
>>> [    9.373562]  default_idle_call+0x3c/0x160
>>> [    9.374055]  do_idle+0x21c/0x280
>>> [    9.374394]  cpu_startup_entry+0x3c/0x50
>>> [    9.374797]  secondary_start_kernel+0x140/0x168
>>> [    9.375220]  __secondary_switched+0xb8/0xc0
>>> [    9.375875] ---[ end trace 0000000000000000 ]---
>>>
>>>
>>> The oops trigger is at mm/gup.c:778:
>>> VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
>>>
>>>
>>> This is the output of gup_longterm (last output is just before oops):
>>>
>>> # [INFO] detected hugetlb page size: 2048 KiB
>>> # [INFO] detected hugetlb page size: 32768 KiB
>>> # [INFO] detected hugetlb page size: 64 KiB
>>> # [INFO] detected hugetlb page size: 1048576 KiB
>>> TAP version 13
>>> 1..70
>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd
>>> ok 1 Should have worked
>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with tmpfile
>>> ok 2 Should have failed
>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with local tmpfile
>>> ok 3 Should have failed
>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
>>> ok 4 Should have worked
>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (32768 kB)
>>>
>>>
>>> So 2M passed ok, and its failing for 32M, which is cont-pmd. I'm guessing you're trying to iterate 2M into a cont-pmd folio and ending up with an unexpected tail page?
>>
>> I assume we find the expected tail page, it's just that the check
>>
>> VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
>>
>> Doesn't make sense with hugetlb folios. We might have a tail page mapped in
>> a cont-pmd entry. As soon as we call follow_huge_pmd() on "not the first
>> cont-pmd entry", we trigger this check.
>>
>> Likely this sanity check must also allow for hugetlb folios. Or we should
>> just remove it completely.
> 
> Right, IMHO it'll be easier we remove it, actually I see there's one more
> at the end, so I think we need to remove both.
> 
>>
>> In the past, we wanted to make sure that we never get tail pages of THP from
>> PMD entries, because something would currently be broken (we don't support
>> THP > PMD).
> 
> There's probably one more thing we need to do, on allowing
> PageAnonExclusive() to work with hugetlb tails. Even if we remove the
> warnings and if I read the code right, we can BUG_ON again on checking tail
> pages over anon-exclusive for PageHuge.
> 
> So I assume to fix it completely, we may need two changes: Patch 1 to
> prepare PageAnonExclusive() to work on hugetlb tails, then patch 2 to be
> squashed into the patch "mm/gup: handle huge pmd for follow_pmd_mask()".
> Note: not this patch to fixup, as this patch only does the "switchover" to
> the new path, the culprit should be the other patch..
> 
> I have them attached below first, before I'll also go and see whether I can
> run some arm tests later today or tomorrow.  David, any comments from
> anon-exclusive side?

I added the PageAnonExclusive checks for hugetlb back then, because 
calling it on a tail page indicated real trouble for hugetlb.

Well, and I didn't want to have runtime-hugetlb checks in 
PageAnonExclusive code called on certainly-not-hugetlb code paths.

Personally, I'd fixup the problematic callsite where we know nothing 
nasty is happening (like we did for gup_must_unshare(), because we don't 
expect hugetlb tail pages from arbitrary other code).

But as I'm getting closer to a folio_test_anon_exclusive() 
implementation as we speak (closer, but not done :) ... ), where I'd 
remove any such hugetlb special handling, I don't particularly care how 
we handle GUP here in the meantime.

-- 
Cheers,

David / dhildenb


_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v4 13/13] mm/gup: Handle hugetlb in the generic follow_page_mask code
@ 2024-04-02 16:39           ` David Hildenbrand
  0 siblings, 0 replies; 160+ messages in thread
From: David Hildenbrand @ 2024-04-02 16:39 UTC (permalink / raw)
  To: Peter Xu
  Cc: Ryan Roberts, linux-mm, linux-kernel, Yang Shi,
	Kirill A . Shutemov, Mike Kravetz, John Hubbard,
	Michael Ellerman, Andrew Jones, Muchun Song, linux-riscv,
	linuxppc-dev, Christophe Leroy, Andrew Morton, Christoph Hellwig,
	Lorenzo Stoakes, Matthew Wilcox, Rik van Riel, linux-arm-kernel,
	Andrea Arcangeli, Aneesh Kumar K . V, Vlastimil Babka,
	James Houghton, Jason Gunthorpe, Mike Rapoport, Axel Rasmussen

On 02.04.24 18:20, Peter Xu wrote:
> On Tue, Apr 02, 2024 at 05:26:28PM +0200, David Hildenbrand wrote:
>> On 02.04.24 16:48, Ryan Roberts wrote:
>>> Hi Peter,
> 
> Hey, Ryan,
> 
> Thanks for the report!
> 
>>>
>>> On 27/03/2024 15:23, peterx@redhat.com wrote:
>>>> From: Peter Xu <peterx@redhat.com>
>>>>
>>>> Now follow_page() is ready to handle hugetlb pages in whatever form, and
>>>> over all architectures.  Switch to the generic code path.
>>>>
>>>> Time to retire hugetlb_follow_page_mask(), following the previous
>>>> retirement of follow_hugetlb_page() in 4849807114b8.
>>>>
>>>> There may be a slight difference of how the loops run when processing slow
>>>> GUP over a large hugetlb range on cont_pte/cont_pmd supported archs: each
>>>> loop of __get_user_pages() will resolve one pgtable entry with the patch
>>>> applied, rather than relying on the size of hugetlb hstate, the latter may
>>>> cover multiple entries in one loop.
>>>>
>>>> A quick performance test on an aarch64 VM on M1 chip shows 15% degrade over
>>>> a tight loop of slow gup after the path switched.  That shouldn't be a
>>>> problem because slow-gup should not be a hot path for GUP in general: when
>>>> page is commonly present, fast-gup will already succeed, while when the
>>>> page is indeed missing and require a follow up page fault, the slow gup
>>>> degrade will probably buried in the fault paths anyway.  It also explains
>>>> why slow gup for THP used to be very slow before 57edfcfd3419 ("mm/gup:
>>>> accelerate thp gup even for "pages != NULL"") lands, the latter not part of
>>>> a performance analysis but a side benefit.  If the performance will be a
>>>> concern, we can consider handle CONT_PTE in follow_page().
>>>>
>>>> Before that is justified to be necessary, keep everything clean and simple.
>>>>
>>>> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
>>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>>
>>> Afraid I'm seeing an oops when running gup_longterm test on arm64 with current mm-unstable. Git bisect blames this patch. The oops reproduces for me every time on 2 different machines:
>>>
>>>
>>> [    9.340416] kernel BUG at mm/gup.c:778!
>>> [    9.340746] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
>>> [    9.341199] Modules linked in:
>>> [    9.341481] CPU: 1 PID: 1159 Comm: gup_longterm Not tainted 6.9.0-rc2-00210-g910ff1a347e4 #11
>>> [    9.342232] Hardware name: linux,dummy-virt (DT)
>>> [    9.342647] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>> [    9.343195] pc : follow_page_mask+0x4d4/0x880
>>> [    9.343580] lr : follow_page_mask+0x4d4/0x880
>>> [    9.344018] sp : ffff8000898b3aa0
>>> [    9.344345] x29: ffff8000898b3aa0 x28: fffffdffc53973e8 x27: 00003c0005d08000
>>> [    9.345028] x26: ffff00014e5cfd08 x25: ffffd3513a40c000 x24: fffffdffc5d08000
>>> [    9.345682] x23: ffffc1ffc0000000 x22: 0000000000080101 x21: ffff8000898b3ba8
>>> [    9.346337] x20: 0000fffff4200000 x19: ffff00014e52d508 x18: 0000000000000010
>>> [    9.347005] x17: 5f656e6f7a5f7369 x16: 2120262620296567 x15: 6170286461654865
>>> [    9.347713] x14: 6761502128454741 x13: 2929656761702865 x12: 6761705f65636976
>>> [    9.348371] x11: 65645f656e6f7a5f x10: ffffd3513b31d6e0 x9 : ffffd3513852f090
>>> [    9.349062] x8 : 00000000ffffefff x7 : ffffd3513b31d6e0 x6 : 0000000000000000
>>> [    9.349753] x5 : ffff00017ff98cc8 x4 : 0000000000000fff x3 : 0000000000000000
>>> [    9.350397] x2 : 0000000000000000 x1 : ffff000190e8b480 x0 : 0000000000000052
>>> [    9.351097] Call trace:
>>> [    9.351312]  follow_page_mask+0x4d4/0x880
>>> [    9.351700]  __get_user_pages+0xf4/0x3e8
>>> [    9.352089]  __gup_longterm_locked+0x204/0xa70
>>> [    9.352516]  pin_user_pages+0x88/0xc0
>>> [    9.352873]  gup_test_ioctl+0x860/0xc40
>>> [    9.353249]  __arm64_sys_ioctl+0xb0/0x100
>>> [    9.353648]  invoke_syscall+0x50/0x128
>>> [    9.354022]  el0_svc_common.constprop.0+0x48/0xf8
>>> [    9.354488]  do_el0_svc+0x28/0x40
>>> [    9.354822]  el0_svc+0x34/0xe0
>>> [    9.355128]  el0t_64_sync_handler+0x13c/0x158
>>> [    9.355489]  el0t_64_sync+0x190/0x198
>>> [    9.355793] Code: aa1803e0 d000d8e1 91220021 97fff560 (d4210000)
>>> [    9.356280] ---[ end trace 0000000000000000 ]---
>>> [    9.356651] note: gup_longterm[1159] exited with irqs disabled
>>> [    9.357141] note: gup_longterm[1159] exited with preempt_count 2
>>> [    9.358033] ------------[ cut here ]------------
>>> [    9.358800] WARNING: CPU: 1 PID: 0 at kernel/context_tracking.c:128 ct_kernel_exit.constprop.0+0x108/0x120
>>> [    9.360157] Modules linked in:
>>> [    9.360541] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G      D            6.9.0-rc2-00210-g910ff1a347e4 #11
>>> [    9.361626] Hardware name: linux,dummy-virt (DT)
>>> [    9.362087] pstate: 204003c5 (nzCv DAIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>> [    9.362758] pc : ct_kernel_exit.constprop.0+0x108/0x120
>>> [    9.363306] lr : ct_idle_enter+0x10/0x20
>>> [    9.363845] sp : ffff8000801abdc0
>>> [    9.364222] x29: ffff8000801abdc0 x28: 0000000000000000 x27: 0000000000000000
>>> [    9.364961] x26: 0000000000000000 x25: ffff00014149d780 x24: 0000000000000000
>>> [    9.365557] x23: 0000000000000000 x22: ffffd3513b299d48 x21: ffffd3513a785730
>>> [    9.366239] x20: ffffd3513b299c28 x19: ffff00017ffa7da0 x18: 0000fffff5ffffff
>>> [    9.366869] x17: 0000000000000000 x16: 1fffe0002a21a8c1 x15: 0000000000000000
>>> [    9.367524] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000002
>>> [    9.368207] x11: 0000000000000001 x10: 0000000000000ad0 x9 : ffffd35138589230
>>> [    9.369123] x8 : ffff00014149e2b0 x7 : 0000000000000000 x6 : 000000000f8c0fb2
>>> [    9.370403] x5 : 4000000000000002 x4 : ffff2cb045825000 x3 : ffff8000801abdc0
>>> [    9.371170] x2 : ffffd3513a782da0 x1 : 4000000000000000 x0 : ffffd3513a782da0
>>> [    9.372279] Call trace:
>>> [    9.372519]  ct_kernel_exit.constprop.0+0x108/0x120
>>> [    9.373216]  ct_idle_enter+0x10/0x20
>>> [    9.373562]  default_idle_call+0x3c/0x160
>>> [    9.374055]  do_idle+0x21c/0x280
>>> [    9.374394]  cpu_startup_entry+0x3c/0x50
>>> [    9.374797]  secondary_start_kernel+0x140/0x168
>>> [    9.375220]  __secondary_switched+0xb8/0xc0
>>> [    9.375875] ---[ end trace 0000000000000000 ]---
>>>
>>>
>>> The oops trigger is at mm/gup.c:778:
>>> VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
>>>
>>>
>>> This is the output of gup_longterm (last output is just before oops):
>>>
>>> # [INFO] detected hugetlb page size: 2048 KiB
>>> # [INFO] detected hugetlb page size: 32768 KiB
>>> # [INFO] detected hugetlb page size: 64 KiB
>>> # [INFO] detected hugetlb page size: 1048576 KiB
>>> TAP version 13
>>> 1..70
>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd
>>> ok 1 Should have worked
>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with tmpfile
>>> ok 2 Should have failed
>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with local tmpfile
>>> ok 3 Should have failed
>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
>>> ok 4 Should have worked
>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (32768 kB)
>>>
>>>
>>> So 2M passed ok, and its failing for 32M, which is cont-pmd. I'm guessing you're trying to iterate 2M into a cont-pmd folio and ending up with an unexpected tail page?
>>
>> I assume we find the expected tail page, it's just that the check
>>
>> VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
>>
>> Doesn't make sense with hugetlb folios. We might have a tail page mapped in
>> a cont-pmd entry. As soon as we call follow_huge_pmd() on "not the first
>> cont-pmd entry", we trigger this check.
>>
>> Likely this sanity check must also allow for hugetlb folios. Or we should
>> just remove it completely.
> 
> Right, IMHO it'll be easier we remove it, actually I see there's one more
> at the end, so I think we need to remove both.
> 
>>
>> In the past, we wanted to make sure that we never get tail pages of THP from
>> PMD entries, because something would currently be broken (we don't support
>> THP > PMD).
> 
> There's probably one more thing we need to do, on allowing
> PageAnonExclusive() to work with hugetlb tails. Even if we remove the
> warnings and if I read the code right, we can BUG_ON again on checking tail
> pages over anon-exclusive for PageHuge.
> 
> So I assume to fix it completely, we may need two changes: Patch 1 to
> prepare PageAnonExclusive() to work on hugetlb tails, then patch 2 to be
> squashed into the patch "mm/gup: handle huge pmd for follow_pmd_mask()".
> Note: not this patch to fixup, as this patch only does the "switchover" to
> the new path, the culprit should be the other patch..
> 
> I have them attached below first, before I'll also go and see whether I can
> run some arm tests later today or tomorrow.  David, any comments from
> anon-exclusive side?

I added the PageAnonExclusive checks for hugetlb back then, because 
calling it on a tail page indicated real trouble for hugetlb.

Well, and I didn't want to have runtime-hugetlb checks in 
PageAnonExclusive code called on certainly-not-hugetlb code paths.

Personally, I'd fixup the problematic callsite where we know nothing 
nasty is happening (like we did for gup_must_unshare(), because we don't 
expect hugetlb tail pages from arbitrary other code).

But as I'm getting closer to a folio_test_anon_exclusive() 
implementation as we speak (closer, but not done :) ... ), where I'd 
remove any such hugetlb special handling, I don't particularly care how 
we handle GUP here in the meantime.

-- 
Cheers,

David / dhildenb


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v4 13/13] mm/gup: Handle hugetlb in the generic follow_page_mask code
@ 2024-04-02 16:39           ` David Hildenbrand
  0 siblings, 0 replies; 160+ messages in thread
From: David Hildenbrand @ 2024-04-02 16:39 UTC (permalink / raw)
  To: Peter Xu
  Cc: James Houghton, Yang Shi, Andrew Jones, linux-mm, Matthew Wilcox,
	linux-riscv, Andrea Arcangeli, Christoph Hellwig,
	Aneesh Kumar K . V, linux-arm-kernel, Jason Gunthorpe,
	Axel Rasmussen, Ryan Roberts, Rik van Riel, John Hubbard,
	Kirill A . Shutemov, Vlastimil Babka, Lorenzo Stoakes,
	Muchun Song, linux-kernel, Andrew Morton, linuxppc-dev,
	Mike Rapoport, Mike Kravetz

On 02.04.24 18:20, Peter Xu wrote:
> On Tue, Apr 02, 2024 at 05:26:28PM +0200, David Hildenbrand wrote:
>> On 02.04.24 16:48, Ryan Roberts wrote:
>>> Hi Peter,
> 
> Hey, Ryan,
> 
> Thanks for the report!
> 
>>>
>>> On 27/03/2024 15:23, peterx@redhat.com wrote:
>>>> From: Peter Xu <peterx@redhat.com>
>>>>
>>>> Now follow_page() is ready to handle hugetlb pages in whatever form, and
>>>> over all architectures.  Switch to the generic code path.
>>>>
>>>> Time to retire hugetlb_follow_page_mask(), following the previous
>>>> retirement of follow_hugetlb_page() in 4849807114b8.
>>>>
>>>> There may be a slight difference of how the loops run when processing slow
>>>> GUP over a large hugetlb range on cont_pte/cont_pmd supported archs: each
>>>> loop of __get_user_pages() will resolve one pgtable entry with the patch
>>>> applied, rather than relying on the size of hugetlb hstate, the latter may
>>>> cover multiple entries in one loop.
>>>>
>>>> A quick performance test on an aarch64 VM on M1 chip shows 15% degrade over
>>>> a tight loop of slow gup after the path switched.  That shouldn't be a
>>>> problem because slow-gup should not be a hot path for GUP in general: when
>>>> page is commonly present, fast-gup will already succeed, while when the
>>>> page is indeed missing and require a follow up page fault, the slow gup
>>>> degrade will probably buried in the fault paths anyway.  It also explains
>>>> why slow gup for THP used to be very slow before 57edfcfd3419 ("mm/gup:
>>>> accelerate thp gup even for "pages != NULL"") lands, the latter not part of
>>>> a performance analysis but a side benefit.  If the performance will be a
>>>> concern, we can consider handle CONT_PTE in follow_page().
>>>>
>>>> Before that is justified to be necessary, keep everything clean and simple.
>>>>
>>>> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
>>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>>
>>> Afraid I'm seeing an oops when running gup_longterm test on arm64 with current mm-unstable. Git bisect blames this patch. The oops reproduces for me every time on 2 different machines:
>>>
>>>
>>> [    9.340416] kernel BUG at mm/gup.c:778!
>>> [    9.340746] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
>>> [    9.341199] Modules linked in:
>>> [    9.341481] CPU: 1 PID: 1159 Comm: gup_longterm Not tainted 6.9.0-rc2-00210-g910ff1a347e4 #11
>>> [    9.342232] Hardware name: linux,dummy-virt (DT)
>>> [    9.342647] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>> [    9.343195] pc : follow_page_mask+0x4d4/0x880
>>> [    9.343580] lr : follow_page_mask+0x4d4/0x880
>>> [    9.344018] sp : ffff8000898b3aa0
>>> [    9.344345] x29: ffff8000898b3aa0 x28: fffffdffc53973e8 x27: 00003c0005d08000
>>> [    9.345028] x26: ffff00014e5cfd08 x25: ffffd3513a40c000 x24: fffffdffc5d08000
>>> [    9.345682] x23: ffffc1ffc0000000 x22: 0000000000080101 x21: ffff8000898b3ba8
>>> [    9.346337] x20: 0000fffff4200000 x19: ffff00014e52d508 x18: 0000000000000010
>>> [    9.347005] x17: 5f656e6f7a5f7369 x16: 2120262620296567 x15: 6170286461654865
>>> [    9.347713] x14: 6761502128454741 x13: 2929656761702865 x12: 6761705f65636976
>>> [    9.348371] x11: 65645f656e6f7a5f x10: ffffd3513b31d6e0 x9 : ffffd3513852f090
>>> [    9.349062] x8 : 00000000ffffefff x7 : ffffd3513b31d6e0 x6 : 0000000000000000
>>> [    9.349753] x5 : ffff00017ff98cc8 x4 : 0000000000000fff x3 : 0000000000000000
>>> [    9.350397] x2 : 0000000000000000 x1 : ffff000190e8b480 x0 : 0000000000000052
>>> [    9.351097] Call trace:
>>> [    9.351312]  follow_page_mask+0x4d4/0x880
>>> [    9.351700]  __get_user_pages+0xf4/0x3e8
>>> [    9.352089]  __gup_longterm_locked+0x204/0xa70
>>> [    9.352516]  pin_user_pages+0x88/0xc0
>>> [    9.352873]  gup_test_ioctl+0x860/0xc40
>>> [    9.353249]  __arm64_sys_ioctl+0xb0/0x100
>>> [    9.353648]  invoke_syscall+0x50/0x128
>>> [    9.354022]  el0_svc_common.constprop.0+0x48/0xf8
>>> [    9.354488]  do_el0_svc+0x28/0x40
>>> [    9.354822]  el0_svc+0x34/0xe0
>>> [    9.355128]  el0t_64_sync_handler+0x13c/0x158
>>> [    9.355489]  el0t_64_sync+0x190/0x198
>>> [    9.355793] Code: aa1803e0 d000d8e1 91220021 97fff560 (d4210000)
>>> [    9.356280] ---[ end trace 0000000000000000 ]---
>>> [    9.356651] note: gup_longterm[1159] exited with irqs disabled
>>> [    9.357141] note: gup_longterm[1159] exited with preempt_count 2
>>> [    9.358033] ------------[ cut here ]------------
>>> [    9.358800] WARNING: CPU: 1 PID: 0 at kernel/context_tracking.c:128 ct_kernel_exit.constprop.0+0x108/0x120
>>> [    9.360157] Modules linked in:
>>> [    9.360541] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G      D            6.9.0-rc2-00210-g910ff1a347e4 #11
>>> [    9.361626] Hardware name: linux,dummy-virt (DT)
>>> [    9.362087] pstate: 204003c5 (nzCv DAIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>> [    9.362758] pc : ct_kernel_exit.constprop.0+0x108/0x120
>>> [    9.363306] lr : ct_idle_enter+0x10/0x20
>>> [    9.363845] sp : ffff8000801abdc0
>>> [    9.364222] x29: ffff8000801abdc0 x28: 0000000000000000 x27: 0000000000000000
>>> [    9.364961] x26: 0000000000000000 x25: ffff00014149d780 x24: 0000000000000000
>>> [    9.365557] x23: 0000000000000000 x22: ffffd3513b299d48 x21: ffffd3513a785730
>>> [    9.366239] x20: ffffd3513b299c28 x19: ffff00017ffa7da0 x18: 0000fffff5ffffff
>>> [    9.366869] x17: 0000000000000000 x16: 1fffe0002a21a8c1 x15: 0000000000000000
>>> [    9.367524] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000002
>>> [    9.368207] x11: 0000000000000001 x10: 0000000000000ad0 x9 : ffffd35138589230
>>> [    9.369123] x8 : ffff00014149e2b0 x7 : 0000000000000000 x6 : 000000000f8c0fb2
>>> [    9.370403] x5 : 4000000000000002 x4 : ffff2cb045825000 x3 : ffff8000801abdc0
>>> [    9.371170] x2 : ffffd3513a782da0 x1 : 4000000000000000 x0 : ffffd3513a782da0
>>> [    9.372279] Call trace:
>>> [    9.372519]  ct_kernel_exit.constprop.0+0x108/0x120
>>> [    9.373216]  ct_idle_enter+0x10/0x20
>>> [    9.373562]  default_idle_call+0x3c/0x160
>>> [    9.374055]  do_idle+0x21c/0x280
>>> [    9.374394]  cpu_startup_entry+0x3c/0x50
>>> [    9.374797]  secondary_start_kernel+0x140/0x168
>>> [    9.375220]  __secondary_switched+0xb8/0xc0
>>> [    9.375875] ---[ end trace 0000000000000000 ]---
>>>
>>>
>>> The oops trigger is at mm/gup.c:778:
>>> VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
>>>
>>>
>>> This is the output of gup_longterm (last output is just before oops):
>>>
>>> # [INFO] detected hugetlb page size: 2048 KiB
>>> # [INFO] detected hugetlb page size: 32768 KiB
>>> # [INFO] detected hugetlb page size: 64 KiB
>>> # [INFO] detected hugetlb page size: 1048576 KiB
>>> TAP version 13
>>> 1..70
>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd
>>> ok 1 Should have worked
>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with tmpfile
>>> ok 2 Should have failed
>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with local tmpfile
>>> ok 3 Should have failed
>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
>>> ok 4 Should have worked
>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (32768 kB)
>>>
>>>
>>> So 2M passed ok, and its failing for 32M, which is cont-pmd. I'm guessing you're trying to iterate 2M into a cont-pmd folio and ending up with an unexpected tail page?
>>
>> I assume we find the expected tail page, it's just that the check
>>
>> VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
>>
>> Doesn't make sense with hugetlb folios. We might have a tail page mapped in
>> a cont-pmd entry. As soon as we call follow_huge_pmd() on "not the first
>> cont-pmd entry", we trigger this check.
>>
>> Likely this sanity check must also allow for hugetlb folios. Or we should
>> just remove it completely.
> 
> Right, IMHO it'll be easier we remove it, actually I see there's one more
> at the end, so I think we need to remove both.
> 
>>
>> In the past, we wanted to make sure that we never get tail pages of THP from
>> PMD entries, because something would currently be broken (we don't support
>> THP > PMD).
> 
> There's probably one more thing we need to do, on allowing
> PageAnonExclusive() to work with hugetlb tails. Even if we remove the
> warnings and if I read the code right, we can BUG_ON again on checking tail
> pages over anon-exclusive for PageHuge.
> 
> So I assume to fix it completely, we may need two changes: Patch 1 to
> prepare PageAnonExclusive() to work on hugetlb tails, then patch 2 to be
> squashed into the patch "mm/gup: handle huge pmd for follow_pmd_mask()".
> Note: not this patch to fixup, as this patch only does the "switchover" to
> the new path, the culprit should be the other patch..
> 
> I have them attached below first, before I'll also go and see whether I can
> run some arm tests later today or tomorrow.  David, any comments from
> anon-exclusive side?

I added the PageAnonExclusive checks for hugetlb back then, because 
calling it on a tail page indicated real trouble for hugetlb.

Well, and I didn't want to have runtime-hugetlb checks in 
PageAnonExclusive code called on certainly-not-hugetlb code paths.

Personally, I'd fixup the problematic callsite where we know nothing 
nasty is happening (like we did for gup_must_unshare(), because we don't 
expect hugetlb tail pages from arbitrary other code).

But as I'm getting closer to a folio_test_anon_exclusive() 
implementation as we speak (closer, but not done :) ... ), where I'd 
remove any such hugetlb special handling, I don't particularly care how 
we handle GUP here in the meantime.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v4 13/13] mm/gup: Handle hugetlb in the generic follow_page_mask code
  2024-04-02 16:00         ` Matthew Wilcox
  (?)
  (?)
@ 2024-04-02 16:40           ` David Hildenbrand
  -1 siblings, 0 replies; 160+ messages in thread
From: David Hildenbrand @ 2024-04-02 16:40 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Ryan Roberts, peterx, linux-mm, linux-kernel, Yang Shi,
	Kirill A . Shutemov, Mike Kravetz, John Hubbard,
	Michael Ellerman, Andrew Jones, Muchun Song, linux-riscv,
	linuxppc-dev, Christophe Leroy, Andrew Morton, Christoph Hellwig,
	Lorenzo Stoakes, Rik van Riel, linux-arm-kernel,
	Andrea Arcangeli, Aneesh Kumar K . V, Vlastimil Babka,
	James Houghton, Jason Gunthorpe, Mike Rapoport, Axel Rasmussen

On 02.04.24 18:00, Matthew Wilcox wrote:
> On Tue, Apr 02, 2024 at 05:26:28PM +0200, David Hildenbrand wrote:
>>> The oops trigger is at mm/gup.c:778:
>>> VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
>>>
>>> So 2M passed ok, and its failing for 32M, which is cont-pmd. I'm guessing you're trying to iterate 2M into a cont-pmd folio and ending up with an unexpected tail page?
>>
>> I assume we find the expected tail page, it's just that the check
>>
>> VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
>>
>> Doesn't make sense with hugetlb folios. We might have a tail page mapped in
>> a cont-pmd entry. As soon as we call follow_huge_pmd() on "not the first
>> cont-pmd entry", we trigger this check.
>>
>> Likely this sanity check must also allow for hugetlb folios. Or we should
>> just remove it completely.
>>
>> In the past, we wanted to make sure that we never get tail pages of THP from
>> PMD entries, because something would currently be broken (we don't support
>> THP > PMD).
> 
> That was a practical limitation on my part.  We have various parts of
> the MM which assume that pmd_page() returns a head page and until we
> get all of those fixed, adding support for folios larger than PMD_SIZE
> was only going to cause trouble for no significant wins.
> 
> I agree with you we should get rid of this assertion entirely.  We should
> fix all the places which assume that pmd_page() returns a head page,
> but that may take some time.
> 
> As an example, filemap_map_pmd() has:
> 
>         if (pmd_none(*vmf->pmd) && folio_test_pmd_mappable(folio)) {
>                  struct page *page = folio_file_page(folio, start);
>                  vm_fault_t ret = do_set_pmd(vmf, page);
> 
> and then do_set_pmd() has:
> 
>          if (page != &folio->page || folio_order(folio) != HPAGE_PMD_ORDER)
>                  return ret;
> 
> so we'd simply refuse to use a PMD to map a folio larger than PMD_SIZE.
> There's a lot of work to be done to make this work generally (not to
> mention figuring out how to handle mapcount for such folios ;-).

Yes :)

-- 
Cheers,

David / dhildenb


* Re: [PATCH v4 13/13] mm/gup: Handle hugetlb in the generic follow_page_mask code
  2024-04-02 16:20         ` Peter Xu
@ 2024-04-02 16:46           ` Ryan Roberts
  -1 siblings, 0 replies; 160+ messages in thread
From: Ryan Roberts @ 2024-04-02 16:46 UTC (permalink / raw)
  To: Peter Xu, David Hildenbrand
  Cc: linux-mm, linux-kernel, Yang Shi, Kirill A . Shutemov,
	Mike Kravetz, John Hubbard, Michael Ellerman, Andrew Jones,
	Muchun Song, linux-riscv, linuxppc-dev, Christophe Leroy,
	Andrew Morton, Christoph Hellwig, Lorenzo Stoakes,
	Matthew Wilcox, Rik van Riel, linux-arm-kernel, Andrea Arcangeli,
	Aneesh Kumar K . V, Vlastimil Babka, James Houghton,
	Jason Gunthorpe, Mike Rapoport, Axel Rasmussen

On 02/04/2024 17:20, Peter Xu wrote:
> On Tue, Apr 02, 2024 at 05:26:28PM +0200, David Hildenbrand wrote:
>> On 02.04.24 16:48, Ryan Roberts wrote:
>>> Hi Peter,
> 
> Hey, Ryan,
> 
> Thanks for the report!
> 
>>>
>>> On 27/03/2024 15:23, peterx@redhat.com wrote:
>>>> From: Peter Xu <peterx@redhat.com>
>>>>
>>>> Now follow_page() is ready to handle hugetlb pages in whatever form, on
>>>> all architectures.  Switch to the generic code path.
>>>>
>>>> Time to retire hugetlb_follow_page_mask(), following the previous
>>>> retirement of follow_hugetlb_page() in 4849807114b8.
>>>>
>>>> There may be a slight difference in how the loops run when processing slow
>>>> GUP over a large hugetlb range on cont_pte/cont_pmd supported archs: each
>>>> loop of __get_user_pages() will resolve one pgtable entry with the patch
>>>> applied, rather than relying on the size of the hugetlb hstate, which may
>>>> cover multiple entries in one loop.
>>>>
>>>> A quick performance test on an aarch64 VM on an M1 chip shows a 15%
>>>> degradation over a tight loop of slow gup after the path switch.  That
>>>> shouldn't be a problem because slow gup should not be a hot path in
>>>> general: when the page is commonly present, fast-gup will already succeed,
>>>> while when the page is indeed missing and requires a follow-up page fault,
>>>> the slow-gup degradation will probably be buried in the fault paths anyway.
>>>> It also explains why slow gup for THP used to be very slow before
>>>> 57edfcfd3419 ("mm/gup: accelerate thp gup even for "pages != NULL"")
>>>> landed; the latter is not part of a performance analysis but a side
>>>> benefit.  If performance becomes a concern, we can consider handling
>>>> CONT_PTE in follow_page().
>>>>
>>>> Before that is justified to be necessary, keep everything clean and simple.
>>>>
>>>> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
>>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>>
>>> Afraid I'm seeing an oops when running the gup_longterm test on arm64 with current mm-unstable. Git bisect blames this patch. The oops reproduces for me every time on two different machines:
>>>
>>>
>>> [    9.340416] kernel BUG at mm/gup.c:778!
>>> [    9.340746] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
>>> [    9.341199] Modules linked in:
>>> [    9.341481] CPU: 1 PID: 1159 Comm: gup_longterm Not tainted 6.9.0-rc2-00210-g910ff1a347e4 #11
>>> [    9.342232] Hardware name: linux,dummy-virt (DT)
>>> [    9.342647] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>> [    9.343195] pc : follow_page_mask+0x4d4/0x880
>>> [    9.343580] lr : follow_page_mask+0x4d4/0x880
>>> [    9.344018] sp : ffff8000898b3aa0
>>> [    9.344345] x29: ffff8000898b3aa0 x28: fffffdffc53973e8 x27: 00003c0005d08000
>>> [    9.345028] x26: ffff00014e5cfd08 x25: ffffd3513a40c000 x24: fffffdffc5d08000
>>> [    9.345682] x23: ffffc1ffc0000000 x22: 0000000000080101 x21: ffff8000898b3ba8
>>> [    9.346337] x20: 0000fffff4200000 x19: ffff00014e52d508 x18: 0000000000000010
>>> [    9.347005] x17: 5f656e6f7a5f7369 x16: 2120262620296567 x15: 6170286461654865
>>> [    9.347713] x14: 6761502128454741 x13: 2929656761702865 x12: 6761705f65636976
>>> [    9.348371] x11: 65645f656e6f7a5f x10: ffffd3513b31d6e0 x9 : ffffd3513852f090
>>> [    9.349062] x8 : 00000000ffffefff x7 : ffffd3513b31d6e0 x6 : 0000000000000000
>>> [    9.349753] x5 : ffff00017ff98cc8 x4 : 0000000000000fff x3 : 0000000000000000
>>> [    9.350397] x2 : 0000000000000000 x1 : ffff000190e8b480 x0 : 0000000000000052
>>> [    9.351097] Call trace:
>>> [    9.351312]  follow_page_mask+0x4d4/0x880
>>> [    9.351700]  __get_user_pages+0xf4/0x3e8
>>> [    9.352089]  __gup_longterm_locked+0x204/0xa70
>>> [    9.352516]  pin_user_pages+0x88/0xc0
>>> [    9.352873]  gup_test_ioctl+0x860/0xc40
>>> [    9.353249]  __arm64_sys_ioctl+0xb0/0x100
>>> [    9.353648]  invoke_syscall+0x50/0x128
>>> [    9.354022]  el0_svc_common.constprop.0+0x48/0xf8
>>> [    9.354488]  do_el0_svc+0x28/0x40
>>> [    9.354822]  el0_svc+0x34/0xe0
>>> [    9.355128]  el0t_64_sync_handler+0x13c/0x158
>>> [    9.355489]  el0t_64_sync+0x190/0x198
>>> [    9.355793] Code: aa1803e0 d000d8e1 91220021 97fff560 (d4210000)
>>> [    9.356280] ---[ end trace 0000000000000000 ]---
>>> [    9.356651] note: gup_longterm[1159] exited with irqs disabled
>>> [    9.357141] note: gup_longterm[1159] exited with preempt_count 2
>>> [    9.358033] ------------[ cut here ]------------
>>> [    9.358800] WARNING: CPU: 1 PID: 0 at kernel/context_tracking.c:128 ct_kernel_exit.constprop.0+0x108/0x120
>>> [    9.360157] Modules linked in:
>>> [    9.360541] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G      D            6.9.0-rc2-00210-g910ff1a347e4 #11
>>> [    9.361626] Hardware name: linux,dummy-virt (DT)
>>> [    9.362087] pstate: 204003c5 (nzCv DAIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>> [    9.362758] pc : ct_kernel_exit.constprop.0+0x108/0x120
>>> [    9.363306] lr : ct_idle_enter+0x10/0x20
>>> [    9.363845] sp : ffff8000801abdc0
>>> [    9.364222] x29: ffff8000801abdc0 x28: 0000000000000000 x27: 0000000000000000
>>> [    9.364961] x26: 0000000000000000 x25: ffff00014149d780 x24: 0000000000000000
>>> [    9.365557] x23: 0000000000000000 x22: ffffd3513b299d48 x21: ffffd3513a785730
>>> [    9.366239] x20: ffffd3513b299c28 x19: ffff00017ffa7da0 x18: 0000fffff5ffffff
>>> [    9.366869] x17: 0000000000000000 x16: 1fffe0002a21a8c1 x15: 0000000000000000
>>> [    9.367524] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000002
>>> [    9.368207] x11: 0000000000000001 x10: 0000000000000ad0 x9 : ffffd35138589230
>>> [    9.369123] x8 : ffff00014149e2b0 x7 : 0000000000000000 x6 : 000000000f8c0fb2
>>> [    9.370403] x5 : 4000000000000002 x4 : ffff2cb045825000 x3 : ffff8000801abdc0
>>> [    9.371170] x2 : ffffd3513a782da0 x1 : 4000000000000000 x0 : ffffd3513a782da0
>>> [    9.372279] Call trace:
>>> [    9.372519]  ct_kernel_exit.constprop.0+0x108/0x120
>>> [    9.373216]  ct_idle_enter+0x10/0x20
>>> [    9.373562]  default_idle_call+0x3c/0x160
>>> [    9.374055]  do_idle+0x21c/0x280
>>> [    9.374394]  cpu_startup_entry+0x3c/0x50
>>> [    9.374797]  secondary_start_kernel+0x140/0x168
>>> [    9.375220]  __secondary_switched+0xb8/0xc0
>>> [    9.375875] ---[ end trace 0000000000000000 ]---
>>>
>>>
>>> The oops trigger is at mm/gup.c:778:
>>> VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
>>>
>>>
>>> This is the output of gup_longterm (last output is just before oops):
>>>
>>> # [INFO] detected hugetlb page size: 2048 KiB
>>> # [INFO] detected hugetlb page size: 32768 KiB
>>> # [INFO] detected hugetlb page size: 64 KiB
>>> # [INFO] detected hugetlb page size: 1048576 KiB
>>> TAP version 13
>>> 1..70
>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd
>>> ok 1 Should have worked
>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with tmpfile
>>> ok 2 Should have failed
>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with local tmpfile
>>> ok 3 Should have failed
>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
>>> ok 4 Should have worked
>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (32768 kB)
>>>
>>>
>>> So 2M passed ok, and it's failing for 32M, which is cont-pmd. I'm guessing you're trying to iterate 2M into a cont-pmd folio and ending up with an unexpected tail page?
>>
>> I assume we find the expected tail page, it's just that the check
>>
>> VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
>>
>> doesn't make sense with hugetlb folios. We might have a tail page mapped in
>> a cont-pmd entry. As soon as we call follow_huge_pmd() on "not the first
>> cont-pmd entry", we trigger this check.
>>
>> Likely this sanity check must also allow for hugetlb folios. Or we should
>> just remove it completely.
> 
> Right, IMHO it'll be easier if we remove it; actually I see there's one
> more at the end, so I think we need to remove both.
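> 
> (If we'd rather relax the check than remove it, a hypothetical sketch
> would be to teach the assertion about hugetlb, e.g.:
> 
> 	VM_BUG_ON_PAGE(!PageHead(page) && !PageHuge(page) &&
> 		       !is_zone_device_page(page), page);
> 
> but simply dropping both checks looks easier.)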
> 
>>
>> In the past, we wanted to make sure that we never get tail pages of THP from
>> PMD entries, because something would currently be broken (we don't support
>> THP > PMD).
> 
> There's probably one more thing we need to do: allow PageAnonExclusive()
> to work with hugetlb tails. Even if we remove these warnings, if I read
> the code right, we can still hit a BUG_ON when checking anon-exclusiveness
> on a hugetlb tail page.
> 
> So I assume that to fix it completely, we may need two changes: patch 1 to
> prepare PageAnonExclusive() to work on hugetlb tails, then patch 2 to be
> squashed into the patch "mm/gup: handle huge pmd for follow_pmd_mask()".
> Note: the fixup doesn't belong to this patch, as this patch only does the
> "switchover" to the new path; the culprit is the other patch.

I'll leave you to do the testing on this, if that's ok.

Just to make my config explicit, I have this kernel command line, which reserves
hugetlbs of all sizes for the tests:

"transparent_hugepage=madvise earlycon root=/dev/vda2 secretmem.enable
hugepagesz=1G hugepages=0:2,1:2 hugepagesz=32M hugepages=0:2,1:2
default_hugepagesz=2M hugepages=0:64,1:64 hugepagesz=64K hugepages=0:2,1:2"

Thanks,
Ryan

> 
> I have them attached below; I'll also go and see whether I can run some
> arm tests later today or tomorrow.  David, any comments from the
> anon-exclusive side?
> 
> Thanks,
> 
> ===8<===
> 
> From 26f0670acea948945222c97a9cab58428782ca69 Mon Sep 17 00:00:00 2001
> From: Peter Xu <peterx@redhat.com>
> Date: Tue, 2 Apr 2024 11:52:28 -0400
> Subject: [PATCH 1/2] mm: Allow anon exclusive check over hugetlb tail pages
> 
> PageAnonExclusive() used to forbid tail pages for hugetlbfs, as that used
> to be called mostly in hugetlb specific paths and the head page was
> guaranteed.
> 
> As we move forward towards merging hugetlb paths into generic mm, we may
> start to pass in tail hugetlb pages (with cont-pte/cont-pmd huge
> pages) for such a check.  Allow it to properly fetch the head, in which
> case the anon-exclusiveness of the head always represents the tail pages.
> 
> There's already a sign of this when we look at fast-gup, which already
> contains the hugetlb processing altogether: we used to have a specific
> commit 5805192c7b72 ("mm/gup: handle cont-PTE hugetlb pages correctly in
> gup_must_unshare() via GUP-fast") covering that area.  Now with this more
> generic change, that can also go away.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/linux/page-flags.h |  8 +++++++-
>  mm/internal.h              | 10 ----------
>  2 files changed, 7 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 888353c209c0..225357f48a79 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -1095,7 +1095,13 @@ PAGEFLAG(Isolated, isolated, PF_ANY);
>  static __always_inline int PageAnonExclusive(const struct page *page)
>  {
>  	VM_BUG_ON_PGFLAGS(!PageAnon(page), page);
> -	VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
> +	/*
> +	 * Allow the anon-exclusive check to work on hugetlb tail pages.
> +	 * For hugetlb, the anon-exclusiveness of the head page always
> +	 * represents that of the tail pages.
> +	 */
> +	if (PageHuge(page) && !PageHead(page))
> +		page = compound_head(page);
>  	return test_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
>  }
>  
> diff --git a/mm/internal.h b/mm/internal.h
> index 9512de7398d5..87f6e4fd56a5 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -1259,16 +1259,6 @@ static inline bool gup_must_unshare(struct vm_area_struct *vma,
>  	if (IS_ENABLED(CONFIG_HAVE_FAST_GUP))
>  		smp_rmb();
>  
> -	/*
> -	 * During GUP-fast we might not get called on the head page for a
> -	 * hugetlb page that is mapped using cont-PTE, because GUP-fast does
> -	 * not work with the abstracted hugetlb PTEs that always point at the
> -	 * head page. For hugetlb, PageAnonExclusive only applies on the head
> -	 * page (as it cannot be partially COW-shared), so lookup the head page.
> -	 */
> -	if (unlikely(!PageHead(page) && PageHuge(page)))
> -		page = compound_head(page);
> -
>  	/*
>  	 * Note that PageKsm() pages cannot be exclusive, and consequently,
>  	 * cannot get pinned.


^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v4 13/13] mm/gup: Handle hugetlb in the generic follow_page_mask code
@ 2024-04-02 16:46           ` Ryan Roberts
  0 siblings, 0 replies; 160+ messages in thread
From: Ryan Roberts @ 2024-04-02 16:46 UTC (permalink / raw)
  To: Peter Xu, David Hildenbrand
  Cc: linux-mm, linux-kernel, Yang Shi, Kirill A . Shutemov,
	Mike Kravetz, John Hubbard, Michael Ellerman, Andrew Jones,
	Muchun Song, linux-riscv, linuxppc-dev, Christophe Leroy,
	Andrew Morton, Christoph Hellwig, Lorenzo Stoakes,
	Matthew Wilcox, Rik van Riel, linux-arm-kernel, Andrea Arcangeli,
	Aneesh Kumar K . V, Vlastimil Babka, James Houghton,
	Jason Gunthorpe, Mike Rapoport, Axel Rasmussen

On 02/04/2024 17:20, Peter Xu wrote:
> On Tue, Apr 02, 2024 at 05:26:28PM +0200, David Hildenbrand wrote:
>> On 02.04.24 16:48, Ryan Roberts wrote:
>>> Hi Peter,
> 
> Hey, Ryan,
> 
> Thanks for the report!
> 
>>>
>>> On 27/03/2024 15:23, peterx@redhat.com wrote:
>>>> From: Peter Xu <peterx@redhat.com>
>>>>
>>>> Now follow_page() is ready to handle hugetlb pages in whatever form, and
>>>> over all architectures.  Switch to the generic code path.
>>>>
>>>> Time to retire hugetlb_follow_page_mask(), following the previous
>>>> retirement of follow_hugetlb_page() in 4849807114b8.
>>>>
>>>> There may be a slight difference of how the loops run when processing slow
>>>> GUP over a large hugetlb range on cont_pte/cont_pmd supported archs: each
>>>> loop of __get_user_pages() will resolve one pgtable entry with the patch
>>>> applied, rather than relying on the size of hugetlb hstate, the latter may
>>>> cover multiple entries in one loop.
>>>>
>>>> A quick performance test on an aarch64 VM on M1 chip shows 15% degrade over
>>>> a tight loop of slow gup after the path switched.  That shouldn't be a
>>>> problem because slow-gup should not be a hot path for GUP in general: when
>>>> page is commonly present, fast-gup will already succeed, while when the
>>>> page is indeed missing and require a follow up page fault, the slow gup
>>>> degrade will probably buried in the fault paths anyway.  It also explains
>>>> why slow gup for THP used to be very slow before 57edfcfd3419 ("mm/gup:
>>>> accelerate thp gup even for "pages != NULL"") lands, the latter not part of
>>>> a performance analysis but a side benefit.  If the performance will be a
>>>> concern, we can consider handle CONT_PTE in follow_page().
>>>>
>>>> Before that is justified to be necessary, keep everything clean and simple.
>>>>
>>>> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
>>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>>
>>> Afraid I'm seeing an oops when running gup_longterm test on arm64 with current mm-unstable. Git bisect blames this patch. The oops reproduces for me every time on 2 different machines:
>>>
>>>
>>> [    9.340416] kernel BUG at mm/gup.c:778!
>>> [    9.340746] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
>>> [    9.341199] Modules linked in:
>>> [    9.341481] CPU: 1 PID: 1159 Comm: gup_longterm Not tainted 6.9.0-rc2-00210-g910ff1a347e4 #11
>>> [    9.342232] Hardware name: linux,dummy-virt (DT)
>>> [    9.342647] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>> [    9.343195] pc : follow_page_mask+0x4d4/0x880
>>> [    9.343580] lr : follow_page_mask+0x4d4/0x880
>>> [    9.344018] sp : ffff8000898b3aa0
>>> [    9.344345] x29: ffff8000898b3aa0 x28: fffffdffc53973e8 x27: 00003c0005d08000
>>> [    9.345028] x26: ffff00014e5cfd08 x25: ffffd3513a40c000 x24: fffffdffc5d08000
>>> [    9.345682] x23: ffffc1ffc0000000 x22: 0000000000080101 x21: ffff8000898b3ba8
>>> [    9.346337] x20: 0000fffff4200000 x19: ffff00014e52d508 x18: 0000000000000010
>>> [    9.347005] x17: 5f656e6f7a5f7369 x16: 2120262620296567 x15: 6170286461654865
>>> [    9.347713] x14: 6761502128454741 x13: 2929656761702865 x12: 6761705f65636976
>>> [    9.348371] x11: 65645f656e6f7a5f x10: ffffd3513b31d6e0 x9 : ffffd3513852f090
>>> [    9.349062] x8 : 00000000ffffefff x7 : ffffd3513b31d6e0 x6 : 0000000000000000
>>> [    9.349753] x5 : ffff00017ff98cc8 x4 : 0000000000000fff x3 : 0000000000000000
>>> [    9.350397] x2 : 0000000000000000 x1 : ffff000190e8b480 x0 : 0000000000000052
>>> [    9.351097] Call trace:
>>> [    9.351312]  follow_page_mask+0x4d4/0x880
>>> [    9.351700]  __get_user_pages+0xf4/0x3e8
>>> [    9.352089]  __gup_longterm_locked+0x204/0xa70
>>> [    9.352516]  pin_user_pages+0x88/0xc0
>>> [    9.352873]  gup_test_ioctl+0x860/0xc40
>>> [    9.353249]  __arm64_sys_ioctl+0xb0/0x100
>>> [    9.353648]  invoke_syscall+0x50/0x128
>>> [    9.354022]  el0_svc_common.constprop.0+0x48/0xf8
>>> [    9.354488]  do_el0_svc+0x28/0x40
>>> [    9.354822]  el0_svc+0x34/0xe0
>>> [    9.355128]  el0t_64_sync_handler+0x13c/0x158
>>> [    9.355489]  el0t_64_sync+0x190/0x198
>>> [    9.355793] Code: aa1803e0 d000d8e1 91220021 97fff560 (d4210000)
>>> [    9.356280] ---[ end trace 0000000000000000 ]---
>>> [    9.356651] note: gup_longterm[1159] exited with irqs disabled
>>> [    9.357141] note: gup_longterm[1159] exited with preempt_count 2
>>> [    9.358033] ------------[ cut here ]------------
>>> [    9.358800] WARNING: CPU: 1 PID: 0 at kernel/context_tracking.c:128 ct_kernel_exit.constprop.0+0x108/0x120
>>> [    9.360157] Modules linked in:
>>> [    9.360541] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G      D            6.9.0-rc2-00210-g910ff1a347e4 #11
>>> [    9.361626] Hardware name: linux,dummy-virt (DT)
>>> [    9.362087] pstate: 204003c5 (nzCv DAIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>> [    9.362758] pc : ct_kernel_exit.constprop.0+0x108/0x120
>>> [    9.363306] lr : ct_idle_enter+0x10/0x20
>>> [    9.363845] sp : ffff8000801abdc0
>>> [    9.364222] x29: ffff8000801abdc0 x28: 0000000000000000 x27: 0000000000000000
>>> [    9.364961] x26: 0000000000000000 x25: ffff00014149d780 x24: 0000000000000000
>>> [    9.365557] x23: 0000000000000000 x22: ffffd3513b299d48 x21: ffffd3513a785730
>>> [    9.366239] x20: ffffd3513b299c28 x19: ffff00017ffa7da0 x18: 0000fffff5ffffff
>>> [    9.366869] x17: 0000000000000000 x16: 1fffe0002a21a8c1 x15: 0000000000000000
>>> [    9.367524] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000002
>>> [    9.368207] x11: 0000000000000001 x10: 0000000000000ad0 x9 : ffffd35138589230
>>> [    9.369123] x8 : ffff00014149e2b0 x7 : 0000000000000000 x6 : 000000000f8c0fb2
>>> [    9.370403] x5 : 4000000000000002 x4 : ffff2cb045825000 x3 : ffff8000801abdc0
>>> [    9.371170] x2 : ffffd3513a782da0 x1 : 4000000000000000 x0 : ffffd3513a782da0
>>> [    9.372279] Call trace:
>>> [    9.372519]  ct_kernel_exit.constprop.0+0x108/0x120
>>> [    9.373216]  ct_idle_enter+0x10/0x20
>>> [    9.373562]  default_idle_call+0x3c/0x160
>>> [    9.374055]  do_idle+0x21c/0x280
>>> [    9.374394]  cpu_startup_entry+0x3c/0x50
>>> [    9.374797]  secondary_start_kernel+0x140/0x168
>>> [    9.375220]  __secondary_switched+0xb8/0xc0
>>> [    9.375875] ---[ end trace 0000000000000000 ]---
>>>
>>>
>>> The oops trigger is at mm/gup.c:778:
>>> VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
>>>
>>>
>>> This is the output of gup_longterm (last output is just before oops):
>>>
>>> # [INFO] detected hugetlb page size: 2048 KiB
>>> # [INFO] detected hugetlb page size: 32768 KiB
>>> # [INFO] detected hugetlb page size: 64 KiB
>>> # [INFO] detected hugetlb page size: 1048576 KiB
>>> TAP version 13
>>> 1..70
>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd
>>> ok 1 Should have worked
>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with tmpfile
>>> ok 2 Should have failed
>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with local tmpfile
>>> ok 3 Should have failed
>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
>>> ok 4 Should have worked
>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (32768 kB)
>>>
>>>
>>> So 2M passed ok, and its failing for 32M, which is cont-pmd. I'm guessing you're trying to iterate 2M into a cont-pmd folio and ending up with an unexpected tail page?
>>
>> I assume we find the expected tail page, it's just that the check
>>
>> VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
>>
>> Doesn't make sense with hugetlb folios. We might have a tail page mapped in
>> a cont-pmd entry. As soon as we call follow_huge_pmd() on "not the first
>> cont-pmd entry", we trigger this check.
>>
>> Likely this sanity check must also allow for hugetlb folios. Or we should
>> just remove it completely.
> 
> Right, IMHO it'll be easier we remove it, actually I see there's one more
> at the end, so I think we need to remove both.
> 
>>
>> In the past, we wanted to make sure that we never get tail pages of THP from
>> PMD entries, because something would currently be broken (we don't support
>> THP > PMD).
> 
> There's probably one more thing we need to do, on allowing
> PageAnonExclusive() to work with hugetlb tails. Even if we remove the
> warnings and if I read the code right, we can BUG_ON again on checking tail
> pages over anon-exclusive for PageHuge.
> 
> So I assume to fix it completely, we may need two changes: Patch 1 to
> prepare PageAnonExclusive() to work on hugetlb tails, then patch 2 to be
> squashed into the patch "mm/gup: handle huge pmd for follow_pmd_mask()".
> Note: not this patch to fixup, as this patch only does the "switchover" to
> the new path, the culprit should be the other patch..

I'll leave you to do the testing on this, if that's ok.

Just to make my config explicit, I have this kernel command line, which reserves
hugetlbs of all sizes for the tests:

"transparent_hugepage=madvise earlycon root=/dev/vda2 secretmem.enable
hugepagesz=1G hugepages=0:2,1:2 hugepagesz=32M hugepages=0:2,1:2
default_hugepagesz=2M hugepages=0:64,1:64 hugepagesz=64K hugepages=0:2,1:2"

Thanks,
Ryan

> 
> I have them attached below first, before I'll also go and see whether I can
> run some arm tests later today or tomorrow.  David, any comments from
> anon-exclusive side?
> 
> Thanks,
> 
> ===8<===
> 
> From 26f0670acea948945222c97a9cab58428782ca69 Mon Sep 17 00:00:00 2001
> From: Peter Xu <peterx@redhat.com>
> Date: Tue, 2 Apr 2024 11:52:28 -0400
> Subject: [PATCH 1/2] mm: Allow anon exclusive check over hugetlb tail pages
> 
> PageAnonExclusive() used to forbid tail pages for hugetlbfs, as that used
> to be called mostly in hugetlb specific paths and the head page was
> guaranteed.
> 
> As we move forward towards merging hugetlb paths into generic mm, we may
> start to pass in tail hugetlb pages (when with cont-pte/cont-pmd huge
> pages) for such check.  Allow it to properly fetch the head, in which case
> the anon-exclusiveness of the head will always represents the tail page.
> 
> There's already a sign of it when we look at the fast-gup which already
> contain the hugetlb processing altogether: we used to have a specific
> commit 5805192c7b72 ("mm/gup: handle cont-PTE hugetlb pages correctly in
> gup_must_unshare() via GUP-fast") covering that area.  Now with this more
> generic change, that can also go away.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/linux/page-flags.h |  8 +++++++-
>  mm/internal.h              | 10 ----------
>  2 files changed, 7 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 888353c209c0..225357f48a79 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -1095,7 +1095,13 @@ PAGEFLAG(Isolated, isolated, PF_ANY);
>  static __always_inline int PageAnonExclusive(const struct page *page)
>  {
>  	VM_BUG_ON_PGFLAGS(!PageAnon(page), page);
> -	VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
> +	/*
> +	 * Allow the anon-exclusive check to work on hugetlb tail pages.
> +	 * Here hugetlb pages will always guarantee the anon-exclusiveness
> +	 * of the head page represents the tail pages.
> +	 */
> +	if (PageHuge(page) && !PageHead(page))
> +		page = compound_head(page);
>  	return test_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
>  }
>  
> diff --git a/mm/internal.h b/mm/internal.h
> index 9512de7398d5..87f6e4fd56a5 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -1259,16 +1259,6 @@ static inline bool gup_must_unshare(struct vm_area_struct *vma,
>  	if (IS_ENABLED(CONFIG_HAVE_FAST_GUP))
>  		smp_rmb();
>  
> -	/*
> -	 * During GUP-fast we might not get called on the head page for a
> -	 * hugetlb page that is mapped using cont-PTE, because GUP-fast does
> -	 * not work with the abstracted hugetlb PTEs that always point at the
> -	 * head page. For hugetlb, PageAnonExclusive only applies on the head
> -	 * page (as it cannot be partially COW-shared), so lookup the head page.
> -	 */
> -	if (unlikely(!PageHead(page) && PageHuge(page)))
> -		page = compound_head(page);
> -
>  	/*
>  	 * Note that PageKsm() pages cannot be exclusive, and consequently,
>  	 * cannot get pinned.


_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v4 13/13] mm/gup: Handle hugetlb in the generic follow_page_mask code
@ 2024-04-02 16:46           ` Ryan Roberts
  0 siblings, 0 replies; 160+ messages in thread
From: Ryan Roberts @ 2024-04-02 16:46 UTC (permalink / raw)
  To: Peter Xu, David Hildenbrand
  Cc: linux-mm, linux-kernel, Yang Shi, Kirill A . Shutemov,
	Mike Kravetz, John Hubbard, Michael Ellerman, Andrew Jones,
	Muchun Song, linux-riscv, linuxppc-dev, Christophe Leroy,
	Andrew Morton, Christoph Hellwig, Lorenzo Stoakes,
	Matthew Wilcox, Rik van Riel, linux-arm-kernel, Andrea Arcangeli,
	Aneesh Kumar K . V, Vlastimil Babka, James Houghton,
	Jason Gunthorpe, Mike Rapoport, Axel Rasmussen

On 02/04/2024 17:20, Peter Xu wrote:
> On Tue, Apr 02, 2024 at 05:26:28PM +0200, David Hildenbrand wrote:
>> On 02.04.24 16:48, Ryan Roberts wrote:
>>> Hi Peter,
> 
> Hey, Ryan,
> 
> Thanks for the report!
> 
>>>
>>> On 27/03/2024 15:23, peterx@redhat.com wrote:
>>>> From: Peter Xu <peterx@redhat.com>
>>>>
>>>> Now follow_page() is ready to handle hugetlb pages in whatever form, and
>>>> over all architectures.  Switch to the generic code path.
>>>>
>>>> Time to retire hugetlb_follow_page_mask(), following the previous
>>>> retirement of follow_hugetlb_page() in 4849807114b8.
>>>>
>>>> There may be a slight difference of how the loops run when processing slow
>>>> GUP over a large hugetlb range on cont_pte/cont_pmd supported archs: each
>>>> loop of __get_user_pages() will resolve one pgtable entry with the patch
>>>> applied, rather than relying on the size of hugetlb hstate, the latter may
>>>> cover multiple entries in one loop.
>>>>
>>>> A quick performance test on an aarch64 VM on M1 chip shows 15% degrade over
>>>> a tight loop of slow gup after the path switched.  That shouldn't be a
>>>> problem because slow-gup should not be a hot path for GUP in general: when
>>>> page is commonly present, fast-gup will already succeed, while when the
>>>> page is indeed missing and require a follow up page fault, the slow gup
>>>> degrade will probably buried in the fault paths anyway.  It also explains
>>>> why slow gup for THP used to be very slow before 57edfcfd3419 ("mm/gup:
>>>> accelerate thp gup even for "pages != NULL"") lands, the latter not part of
>>>> a performance analysis but a side benefit.  If the performance will be a
>>>> concern, we can consider handle CONT_PTE in follow_page().
>>>>
>>>> Before that is justified to be necessary, keep everything clean and simple.
>>>>
>>>> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
>>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>>
>>> Afraid I'm seeing an oops when running gup_longterm test on arm64 with current mm-unstable. Git bisect blames this patch. The oops reproduces for me every time on 2 different machines:
>>>
>>>
>>> [    9.340416] kernel BUG at mm/gup.c:778!
>>> [    9.340746] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
>>> [    9.341199] Modules linked in:
>>> [    9.341481] CPU: 1 PID: 1159 Comm: gup_longterm Not tainted 6.9.0-rc2-00210-g910ff1a347e4 #11
>>> [    9.342232] Hardware name: linux,dummy-virt (DT)
>>> [    9.342647] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>> [    9.343195] pc : follow_page_mask+0x4d4/0x880
>>> [    9.343580] lr : follow_page_mask+0x4d4/0x880
>>> [    9.344018] sp : ffff8000898b3aa0
>>> [    9.344345] x29: ffff8000898b3aa0 x28: fffffdffc53973e8 x27: 00003c0005d08000
>>> [    9.345028] x26: ffff00014e5cfd08 x25: ffffd3513a40c000 x24: fffffdffc5d08000
>>> [    9.345682] x23: ffffc1ffc0000000 x22: 0000000000080101 x21: ffff8000898b3ba8
>>> [    9.346337] x20: 0000fffff4200000 x19: ffff00014e52d508 x18: 0000000000000010
>>> [    9.347005] x17: 5f656e6f7a5f7369 x16: 2120262620296567 x15: 6170286461654865
>>> [    9.347713] x14: 6761502128454741 x13: 2929656761702865 x12: 6761705f65636976
>>> [    9.348371] x11: 65645f656e6f7a5f x10: ffffd3513b31d6e0 x9 : ffffd3513852f090
>>> [    9.349062] x8 : 00000000ffffefff x7 : ffffd3513b31d6e0 x6 : 0000000000000000
>>> [    9.349753] x5 : ffff00017ff98cc8 x4 : 0000000000000fff x3 : 0000000000000000
>>> [    9.350397] x2 : 0000000000000000 x1 : ffff000190e8b480 x0 : 0000000000000052
>>> [    9.351097] Call trace:
>>> [    9.351312]  follow_page_mask+0x4d4/0x880
>>> [    9.351700]  __get_user_pages+0xf4/0x3e8
>>> [    9.352089]  __gup_longterm_locked+0x204/0xa70
>>> [    9.352516]  pin_user_pages+0x88/0xc0
>>> [    9.352873]  gup_test_ioctl+0x860/0xc40
>>> [    9.353249]  __arm64_sys_ioctl+0xb0/0x100
>>> [    9.353648]  invoke_syscall+0x50/0x128
>>> [    9.354022]  el0_svc_common.constprop.0+0x48/0xf8
>>> [    9.354488]  do_el0_svc+0x28/0x40
>>> [    9.354822]  el0_svc+0x34/0xe0
>>> [    9.355128]  el0t_64_sync_handler+0x13c/0x158
>>> [    9.355489]  el0t_64_sync+0x190/0x198
>>> [    9.355793] Code: aa1803e0 d000d8e1 91220021 97fff560 (d4210000)
>>> [    9.356280] ---[ end trace 0000000000000000 ]---
>>> [    9.356651] note: gup_longterm[1159] exited with irqs disabled
>>> [    9.357141] note: gup_longterm[1159] exited with preempt_count 2
>>> [    9.358033] ------------[ cut here ]------------
>>> [    9.358800] WARNING: CPU: 1 PID: 0 at kernel/context_tracking.c:128 ct_kernel_exit.constprop.0+0x108/0x120
>>> [    9.360157] Modules linked in:
>>> [    9.360541] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G      D            6.9.0-rc2-00210-g910ff1a347e4 #11
>>> [    9.361626] Hardware name: linux,dummy-virt (DT)
>>> [    9.362087] pstate: 204003c5 (nzCv DAIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>> [    9.362758] pc : ct_kernel_exit.constprop.0+0x108/0x120
>>> [    9.363306] lr : ct_idle_enter+0x10/0x20
>>> [    9.363845] sp : ffff8000801abdc0
>>> [    9.364222] x29: ffff8000801abdc0 x28: 0000000000000000 x27: 0000000000000000
>>> [    9.364961] x26: 0000000000000000 x25: ffff00014149d780 x24: 0000000000000000
>>> [    9.365557] x23: 0000000000000000 x22: ffffd3513b299d48 x21: ffffd3513a785730
>>> [    9.366239] x20: ffffd3513b299c28 x19: ffff00017ffa7da0 x18: 0000fffff5ffffff
>>> [    9.366869] x17: 0000000000000000 x16: 1fffe0002a21a8c1 x15: 0000000000000000
>>> [    9.367524] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000002
>>> [    9.368207] x11: 0000000000000001 x10: 0000000000000ad0 x9 : ffffd35138589230
>>> [    9.369123] x8 : ffff00014149e2b0 x7 : 0000000000000000 x6 : 000000000f8c0fb2
>>> [    9.370403] x5 : 4000000000000002 x4 : ffff2cb045825000 x3 : ffff8000801abdc0
>>> [    9.371170] x2 : ffffd3513a782da0 x1 : 4000000000000000 x0 : ffffd3513a782da0
>>> [    9.372279] Call trace:
>>> [    9.372519]  ct_kernel_exit.constprop.0+0x108/0x120
>>> [    9.373216]  ct_idle_enter+0x10/0x20
>>> [    9.373562]  default_idle_call+0x3c/0x160
>>> [    9.374055]  do_idle+0x21c/0x280
>>> [    9.374394]  cpu_startup_entry+0x3c/0x50
>>> [    9.374797]  secondary_start_kernel+0x140/0x168
>>> [    9.375220]  __secondary_switched+0xb8/0xc0
>>> [    9.375875] ---[ end trace 0000000000000000 ]---
>>>
>>>
>>> The oops trigger is at mm/gup.c:778:
>>> VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
>>>
>>>
>>> This is the output of gup_longterm (last output is just before oops):
>>>
>>> # [INFO] detected hugetlb page size: 2048 KiB
>>> # [INFO] detected hugetlb page size: 32768 KiB
>>> # [INFO] detected hugetlb page size: 64 KiB
>>> # [INFO] detected hugetlb page size: 1048576 KiB
>>> TAP version 13
>>> 1..70
>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd
>>> ok 1 Should have worked
>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with tmpfile
>>> ok 2 Should have failed
>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with local tmpfile
>>> ok 3 Should have failed
>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
>>> ok 4 Should have worked
>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (32768 kB)
>>>
>>>
>>> So 2M passed ok, and its failing for 32M, which is cont-pmd. I'm guessing you're trying to iterate 2M into a cont-pmd folio and ending up with an unexpected tail page?
>>
>> I assume we find the expected tail page, it's just that the check
>>
>> VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
>>
>> Doesn't make sense with hugetlb folios. We might have a tail page mapped in
>> a cont-pmd entry. As soon as we call follow_huge_pmd() on "not the first
>> cont-pmd entry", we trigger this check.
>>
>> Likely this sanity check must also allow for hugetlb folios. Or we should
>> just remove it completely.
> 
> Right, IMHO it'll be easier we remove it, actually I see there's one more
> at the end, so I think we need to remove both.
> 
>>
>> In the past, we wanted to make sure that we never get tail pages of THP from
>> PMD entries, because something would currently be broken (we don't support
>> THP > PMD).
> 
> There's probably one more thing we need to do, on allowing
> PageAnonExclusive() to work with hugetlb tails. Even if we remove the
> warnings and if I read the code right, we can BUG_ON again on checking tail
> pages over anon-exclusive for PageHuge.
> 
> So I assume to fix it completely, we may need two changes: Patch 1 to
> prepare PageAnonExclusive() to work on hugetlb tails, then patch 2 to be
> squashed into the patch "mm/gup: handle huge pmd for follow_pmd_mask()".
> Note: not this patch to fixup, as this patch only does the "switchover" to
> the new path, the culprit should be the other patch..

I'll leave you to do the testing on this, if that's ok.

Just to make my config explicit, I have this kernel command line, which reserves
hugetlbs of all sizes for the tests:

"transparent_hugepage=madvise earlycon root=/dev/vda2 secretmem.enable
hugepagesz=1G hugepages=0:2,1:2 hugepagesz=32M hugepages=0:2,1:2
default_hugepagesz=2M hugepages=0:64,1:64 hugepagesz=64K hugepages=0:2,1:2"

Thanks,
Ryan

> 
> I have them attached below first, before I'll also go and see whether I can
> run some arm tests later today or tomorrow.  David, any comments from
> anon-exclusive side?
> 
> Thanks,
> 
> ===8<===
> 
> From 26f0670acea948945222c97a9cab58428782ca69 Mon Sep 17 00:00:00 2001
> From: Peter Xu <peterx@redhat.com>
> Date: Tue, 2 Apr 2024 11:52:28 -0400
> Subject: [PATCH 1/2] mm: Allow anon exclusive check over hugetlb tail pages
> 
> PageAnonExclusive() used to forbid tail pages for hugetlbfs, as that used
> to be called mostly in hugetlb specific paths and the head page was
> guaranteed.
> 
> As we move forward towards merging hugetlb paths into generic mm, we may
> start to pass in tail hugetlb pages (when with cont-pte/cont-pmd huge
> pages) for such check.  Allow it to properly fetch the head, in which case
> the anon-exclusiveness of the head will always represents the tail page.
> 
> There's already a sign of it when we look at the fast-gup which already
> contain the hugetlb processing altogether: we used to have a specific
> commit 5805192c7b72 ("mm/gup: handle cont-PTE hugetlb pages correctly in
> gup_must_unshare() via GUP-fast") covering that area.  Now with this more
> generic change, that can also go away.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/linux/page-flags.h |  8 +++++++-
>  mm/internal.h              | 10 ----------
>  2 files changed, 7 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 888353c209c0..225357f48a79 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -1095,7 +1095,13 @@ PAGEFLAG(Isolated, isolated, PF_ANY);
>  static __always_inline int PageAnonExclusive(const struct page *page)
>  {
>  	VM_BUG_ON_PGFLAGS(!PageAnon(page), page);
> -	VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
> +	/*
> +	 * Allow the anon-exclusive check to work on hugetlb tail pages.
> +	 * Here hugetlb pages will always guarantee the anon-exclusiveness
> +	 * of the head page represents the tail pages.
> +	 */
> +	if (PageHuge(page) && !PageHead(page))
> +		page = compound_head(page);
>  	return test_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
>  }
>  
> diff --git a/mm/internal.h b/mm/internal.h
> index 9512de7398d5..87f6e4fd56a5 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -1259,16 +1259,6 @@ static inline bool gup_must_unshare(struct vm_area_struct *vma,
>  	if (IS_ENABLED(CONFIG_HAVE_FAST_GUP))
>  		smp_rmb();
>  
> -	/*
> -	 * During GUP-fast we might not get called on the head page for a
> -	 * hugetlb page that is mapped using cont-PTE, because GUP-fast does
> -	 * not work with the abstracted hugetlb PTEs that always point at the
> -	 * head page. For hugetlb, PageAnonExclusive only applies on the head
> -	 * page (as it cannot be partially COW-shared), so lookup the head page.
> -	 */
> -	if (unlikely(!PageHead(page) && PageHuge(page)))
> -		page = compound_head(page);
> -
>  	/*
>  	 * Note that PageKsm() pages cannot be exclusive, and consequently,
>  	 * cannot get pinned.


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v4 13/13] mm/gup: Handle hugetlb in the generic follow_page_mask code
@ 2024-04-02 16:46           ` Ryan Roberts
  0 siblings, 0 replies; 160+ messages in thread
From: Ryan Roberts @ 2024-04-02 16:46 UTC (permalink / raw)
  To: Peter Xu, David Hildenbrand
  Cc: James Houghton, Yang Shi, Andrew Jones, linux-mm, Matthew Wilcox,
	linux-riscv, Andrea Arcangeli, Christoph Hellwig,
	Aneesh Kumar K . V, linux-arm-kernel, Jason Gunthorpe,
	Axel Rasmussen, Rik van Riel, John Hubbard, Kirill A . Shutemov,
	Vlastimil Babka, Lorenzo Stoakes, Muchun Song, linux-kernel,
	Andrew Morton, linuxppc-dev, Mike Rapoport, Mike Kravetz

On 02/04/2024 17:20, Peter Xu wrote:
> On Tue, Apr 02, 2024 at 05:26:28PM +0200, David Hildenbrand wrote:
>> On 02.04.24 16:48, Ryan Roberts wrote:
>>> Hi Peter,
> 
> Hey, Ryan,
> 
> Thanks for the report!
> 
>>>
>>> On 27/03/2024 15:23, peterx@redhat.com wrote:
>>>> From: Peter Xu <peterx@redhat.com>
>>>>
>>>> Now follow_page() is ready to handle hugetlb pages in whatever form, and
>>>> over all architectures.  Switch to the generic code path.
>>>>
>>>> Time to retire hugetlb_follow_page_mask(), following the previous
>>>> retirement of follow_hugetlb_page() in 4849807114b8.
>>>>
>>>> There may be a slight difference of how the loops run when processing slow
>>>> GUP over a large hugetlb range on cont_pte/cont_pmd supported archs: each
>>>> loop of __get_user_pages() will resolve one pgtable entry with the patch
>>>> applied, rather than relying on the size of hugetlb hstate, the latter may
>>>> cover multiple entries in one loop.
>>>>
>>>> A quick performance test on an aarch64 VM on M1 chip shows 15% degrade over
>>>> a tight loop of slow gup after the path switched.  That shouldn't be a
>>>> problem because slow-gup should not be a hot path for GUP in general: when
>>>> page is commonly present, fast-gup will already succeed, while when the
>>>> page is indeed missing and require a follow up page fault, the slow gup
>>>> degrade will probably buried in the fault paths anyway.  It also explains
>>>> why slow gup for THP used to be very slow before 57edfcfd3419 ("mm/gup:
>>>> accelerate thp gup even for "pages != NULL"") lands, the latter not part of
>>>> a performance analysis but a side benefit.  If the performance will be a
>>>> concern, we can consider handle CONT_PTE in follow_page().
>>>>
>>>> Before that is justified to be necessary, keep everything clean and simple.
>>>>
>>>> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
>>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>>
>>> Afraid I'm seeing an oops when running gup_longterm test on arm64 with current mm-unstable. Git bisect blames this patch. The oops reproduces for me every time on 2 different machines:
>>>
>>>
>>> [    9.340416] kernel BUG at mm/gup.c:778!
>>> [    9.340746] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
>>> [    9.341199] Modules linked in:
>>> [    9.341481] CPU: 1 PID: 1159 Comm: gup_longterm Not tainted 6.9.0-rc2-00210-g910ff1a347e4 #11
>>> [    9.342232] Hardware name: linux,dummy-virt (DT)
>>> [    9.342647] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>> [    9.343195] pc : follow_page_mask+0x4d4/0x880
>>> [    9.343580] lr : follow_page_mask+0x4d4/0x880
>>> [    9.344018] sp : ffff8000898b3aa0
>>> [    9.344345] x29: ffff8000898b3aa0 x28: fffffdffc53973e8 x27: 00003c0005d08000
>>> [    9.345028] x26: ffff00014e5cfd08 x25: ffffd3513a40c000 x24: fffffdffc5d08000
>>> [    9.345682] x23: ffffc1ffc0000000 x22: 0000000000080101 x21: ffff8000898b3ba8
>>> [    9.346337] x20: 0000fffff4200000 x19: ffff00014e52d508 x18: 0000000000000010
>>> [    9.347005] x17: 5f656e6f7a5f7369 x16: 2120262620296567 x15: 6170286461654865
>>> [    9.347713] x14: 6761502128454741 x13: 2929656761702865 x12: 6761705f65636976
>>> [    9.348371] x11: 65645f656e6f7a5f x10: ffffd3513b31d6e0 x9 : ffffd3513852f090
>>> [    9.349062] x8 : 00000000ffffefff x7 : ffffd3513b31d6e0 x6 : 0000000000000000
>>> [    9.349753] x5 : ffff00017ff98cc8 x4 : 0000000000000fff x3 : 0000000000000000
>>> [    9.350397] x2 : 0000000000000000 x1 : ffff000190e8b480 x0 : 0000000000000052
>>> [    9.351097] Call trace:
>>> [    9.351312]  follow_page_mask+0x4d4/0x880
>>> [    9.351700]  __get_user_pages+0xf4/0x3e8
>>> [    9.352089]  __gup_longterm_locked+0x204/0xa70
>>> [    9.352516]  pin_user_pages+0x88/0xc0
>>> [    9.352873]  gup_test_ioctl+0x860/0xc40
>>> [    9.353249]  __arm64_sys_ioctl+0xb0/0x100
>>> [    9.353648]  invoke_syscall+0x50/0x128
>>> [    9.354022]  el0_svc_common.constprop.0+0x48/0xf8
>>> [    9.354488]  do_el0_svc+0x28/0x40
>>> [    9.354822]  el0_svc+0x34/0xe0
>>> [    9.355128]  el0t_64_sync_handler+0x13c/0x158
>>> [    9.355489]  el0t_64_sync+0x190/0x198
>>> [    9.355793] Code: aa1803e0 d000d8e1 91220021 97fff560 (d4210000)
>>> [    9.356280] ---[ end trace 0000000000000000 ]---
>>> [    9.356651] note: gup_longterm[1159] exited with irqs disabled
>>> [    9.357141] note: gup_longterm[1159] exited with preempt_count 2
>>> [    9.358033] ------------[ cut here ]------------
>>> [    9.358800] WARNING: CPU: 1 PID: 0 at kernel/context_tracking.c:128 ct_kernel_exit.constprop.0+0x108/0x120
>>> [    9.360157] Modules linked in:
>>> [    9.360541] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G      D            6.9.0-rc2-00210-g910ff1a347e4 #11
>>> [    9.361626] Hardware name: linux,dummy-virt (DT)
>>> [    9.362087] pstate: 204003c5 (nzCv DAIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>> [    9.362758] pc : ct_kernel_exit.constprop.0+0x108/0x120
>>> [    9.363306] lr : ct_idle_enter+0x10/0x20
>>> [    9.363845] sp : ffff8000801abdc0
>>> [    9.364222] x29: ffff8000801abdc0 x28: 0000000000000000 x27: 0000000000000000
>>> [    9.364961] x26: 0000000000000000 x25: ffff00014149d780 x24: 0000000000000000
>>> [    9.365557] x23: 0000000000000000 x22: ffffd3513b299d48 x21: ffffd3513a785730
>>> [    9.366239] x20: ffffd3513b299c28 x19: ffff00017ffa7da0 x18: 0000fffff5ffffff
>>> [    9.366869] x17: 0000000000000000 x16: 1fffe0002a21a8c1 x15: 0000000000000000
>>> [    9.367524] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000002
>>> [    9.368207] x11: 0000000000000001 x10: 0000000000000ad0 x9 : ffffd35138589230
>>> [    9.369123] x8 : ffff00014149e2b0 x7 : 0000000000000000 x6 : 000000000f8c0fb2
>>> [    9.370403] x5 : 4000000000000002 x4 : ffff2cb045825000 x3 : ffff8000801abdc0
>>> [    9.371170] x2 : ffffd3513a782da0 x1 : 4000000000000000 x0 : ffffd3513a782da0
>>> [    9.372279] Call trace:
>>> [    9.372519]  ct_kernel_exit.constprop.0+0x108/0x120
>>> [    9.373216]  ct_idle_enter+0x10/0x20
>>> [    9.373562]  default_idle_call+0x3c/0x160
>>> [    9.374055]  do_idle+0x21c/0x280
>>> [    9.374394]  cpu_startup_entry+0x3c/0x50
>>> [    9.374797]  secondary_start_kernel+0x140/0x168
>>> [    9.375220]  __secondary_switched+0xb8/0xc0
>>> [    9.375875] ---[ end trace 0000000000000000 ]---
>>>
>>>
>>> The oops trigger is at mm/gup.c:778:
>>> VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
>>>
>>>
>>> This is the output of gup_longterm (last output is just before oops):
>>>
>>> # [INFO] detected hugetlb page size: 2048 KiB
>>> # [INFO] detected hugetlb page size: 32768 KiB
>>> # [INFO] detected hugetlb page size: 64 KiB
>>> # [INFO] detected hugetlb page size: 1048576 KiB
>>> TAP version 13
>>> 1..70
>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd
>>> ok 1 Should have worked
>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with tmpfile
>>> ok 2 Should have failed
>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with local tmpfile
>>> ok 3 Should have failed
>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
>>> ok 4 Should have worked
>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (32768 kB)
>>>
>>>
>>> So 2M passed ok, and it's failing for 32M, which is cont-pmd. I'm guessing you're trying to iterate 2M into a cont-pmd folio and ending up with an unexpected tail page?
>>
>> I assume we find the expected tail page, it's just that the check
>>
>> VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
>>
>> doesn't make sense with hugetlb folios. We might have a tail page mapped in
>> a cont-pmd entry. As soon as we call follow_huge_pmd() on "not the first
>> cont-pmd entry", we trigger this check.
>>
>> Likely this sanity check must also allow for hugetlb folios. Or we should
>> just remove it completely.
> 
> Right, IMHO it'll be easier if we remove it; actually I see there's one more
> at the end, so I think we need to remove both.
> 
>>
>> In the past, we wanted to make sure that we never get tail pages of THP from
>> PMD entries, because something would currently be broken (we don't support
>> THP > PMD).
> 
> There's probably one more thing we need to do: allow
> PageAnonExclusive() to work with hugetlb tails.  Even if we remove the
> warnings, if I read the code right, we can still BUG_ON when checking
> anon-exclusive on tail pages of a PageHuge folio.
> 
> So I assume, to fix it completely, we may need two changes: patch 1 to
> prepare PageAnonExclusive() to work on hugetlb tails, then patch 2 to be
> squashed into the patch "mm/gup: handle huge pmd for follow_pmd_mask()".
> Note: not this patch to fix up, as this patch only does the "switchover"
> to the new path; the culprit should be the other patch.
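
To spell out the failure mode with a concrete example: a 32M cont-pmd
hugetlb folio on arm64 is mapped by 16 contiguous 2M PMD entries, so every
entry other than the first already points into the middle of the folio.  A
rough sketch of the lookup (names assumed from the new mm/gup.c path, not
the exact code):

	/* Sketch only: follow_huge_pmd()-style page resolution. */
	struct page *page = pmd_page(pmdval);
	/* For the Nth cont-pmd entry of a 32M folio, pmd_page() already
	 * returns head + N * 512 (4K pages), i.e. a tail page, so
	 * !PageHead(page) is expected here and the VM_BUG_ON_PAGE() fires. */
	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;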

I'll leave you to do the testing on this, if that's ok.

Just to make my config explicit, I have this kernel command line, which reserves
hugetlbs of all sizes for the tests:

"transparent_hugepage=madvise earlycon root=/dev/vda2 secretmem.enable
hugepagesz=1G hugepages=0:2,1:2 hugepagesz=32M hugepages=0:2,1:2
default_hugepagesz=2M hugepages=0:64,1:64 hugepagesz=64K hugepages=0:2,1:2"
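
For reference, "hugepages=<node>:<count>[,<node>:<count>]" is the per-node
syntax: it reserves that many pages of the preceding hugepagesz on each
NUMA node.  E.g., assuming a two-node machine:

	hugepagesz=32M hugepages=0:2,1:2
	# reserve two 32M hugetlb pages on node 0 and two on node 1

so every supported size gets a small pool on both nodes.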

Thanks,
Ryan

> 
> I have them attached below first; I'll also go and see whether I can
> run some arm tests later today or tomorrow.  David, any comments from
> the anon-exclusive side?
> 
> Thanks,
> 
> ===8<===
> 
> From 26f0670acea948945222c97a9cab58428782ca69 Mon Sep 17 00:00:00 2001
> From: Peter Xu <peterx@redhat.com>
> Date: Tue, 2 Apr 2024 11:52:28 -0400
> Subject: [PATCH 1/2] mm: Allow anon exclusive check over hugetlb tail pages
> 
> PageAnonExclusive() used to forbid tail pages for hugetlbfs, as it used
> to be called mostly in hugetlb-specific paths where the head page was
> guaranteed.
> 
> As we move forward towards merging hugetlb paths into generic mm, we may
> start to pass in hugetlb tail pages (with cont-pte/cont-pmd huge pages)
> for such checks.  Allow it to properly fetch the head, in which case the
> anon-exclusiveness of the head always represents the tail pages.
> 
> There's already a sign of this in fast-gup, which already contains the
> hugetlb processing altogether: we used to have the specific commit
> 5805192c7b72 ("mm/gup: handle cont-PTE hugetlb pages correctly in
> gup_must_unshare() via GUP-fast") covering that area.  Now, with this
> more generic change, that can also go away.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/linux/page-flags.h |  8 +++++++-
>  mm/internal.h              | 10 ----------
>  2 files changed, 7 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 888353c209c0..225357f48a79 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -1095,7 +1095,13 @@ PAGEFLAG(Isolated, isolated, PF_ANY);
>  static __always_inline int PageAnonExclusive(const struct page *page)
>  {
>  	VM_BUG_ON_PGFLAGS(!PageAnon(page), page);
> -	VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
> +	/*
> +	 * Allow the anon-exclusive check to work on hugetlb tail pages:
> +	 * hugetlb pages always guarantee that the anon-exclusiveness of
> +	 * the head page represents the tail pages.
> +	 */
> +	if (PageHuge(page) && !PageHead(page))
> +		page = compound_head(page);
>  	return test_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
>  }
>  
> diff --git a/mm/internal.h b/mm/internal.h
> index 9512de7398d5..87f6e4fd56a5 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -1259,16 +1259,6 @@ static inline bool gup_must_unshare(struct vm_area_struct *vma,
>  	if (IS_ENABLED(CONFIG_HAVE_FAST_GUP))
>  		smp_rmb();
>  
> -	/*
> -	 * During GUP-fast we might not get called on the head page for a
> -	 * hugetlb page that is mapped using cont-PTE, because GUP-fast does
> -	 * not work with the abstracted hugetlb PTEs that always point at the
> -	 * head page. For hugetlb, PageAnonExclusive only applies on the head
> -	 * page (as it cannot be partially COW-shared), so lookup the head page.
> -	 */
> -	if (unlikely(!PageHead(page) && PageHuge(page)))
> -		page = compound_head(page);
> -
>  	/*
>  	 * Note that PageKsm() pages cannot be exclusive, and consequently,
>  	 * cannot get pinned.
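
Patch 2 is not quoted above.  From the discussion it should amount to
dropping the two head-page assertions in the new follow_huge_pmd() path,
roughly along these lines (a hypothetical sketch, not the patch as
actually posted; the surrounding context line is an assumption):

	-	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
	 	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;

plus the similar sanity check at the end of the function.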


^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v4 13/13] mm/gup: Handle hugetlb in the generic follow_page_mask code
  2024-04-02 16:39           ` David Hildenbrand
@ 2024-04-02 17:57             ` Peter Xu
  0 siblings, 0 replies; 160+ messages in thread
From: Peter Xu @ 2024-04-02 17:57 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Ryan Roberts, linux-mm, linux-kernel, Yang Shi,
	Kirill A . Shutemov, Mike Kravetz, John Hubbard,
	Michael Ellerman, Andrew Jones, Muchun Song, linux-riscv,
	linuxppc-dev, Christophe Leroy, Andrew Morton, Christoph Hellwig,
	Lorenzo Stoakes, Matthew Wilcox, Rik van Riel, linux-arm-kernel,
	Andrea Arcangeli, Aneesh Kumar K . V, Vlastimil Babka,
	James Houghton, Jason Gunthorpe, Mike Rapoport, Axel Rasmussen

On Tue, Apr 02, 2024 at 06:39:31PM +0200, David Hildenbrand wrote:
> On 02.04.24 18:20, Peter Xu wrote:
> > On Tue, Apr 02, 2024 at 05:26:28PM +0200, David Hildenbrand wrote:
> > > On 02.04.24 16:48, Ryan Roberts wrote:
> > > > Hi Peter,
> > 
> > Hey, Ryan,
> > 
> > Thanks for the report!
> > 
> > > > 
> > > > On 27/03/2024 15:23, peterx@redhat.com wrote:
> > > > > From: Peter Xu <peterx@redhat.com>
> > > > > 
> > > > > Now follow_page() is ready to handle hugetlb pages in whatever form and
> > > > > on all architectures.  Switch to the generic code path.
> > > > > 
> > > > > Time to retire hugetlb_follow_page_mask(), following the previous
> > > > > retirement of follow_hugetlb_page() in 4849807114b8.
> > > > > 
> > > > > There may be a slight difference in how the loops run when processing slow
> > > > > GUP over a large hugetlb range on cont_pte/cont_pmd supported archs: each
> > > > > loop of __get_user_pages() will resolve one pgtable entry with the patch
> > > > > applied, rather than relying on the size of the hugetlb hstate; the latter
> > > > > may cover multiple entries in one loop.
> > > > > 
> > > > > A quick performance test on an aarch64 VM on an M1 chip shows a 15%
> > > > > degradation over a tight loop of slow gup after the path switch.  That
> > > > > shouldn't be a problem because slow gup should not be a hot path for GUP
> > > > > in general: when the page is commonly present, fast gup will already
> > > > > succeed, while when the page is indeed missing and requires a follow-up
> > > > > page fault, the slow-gup degradation will probably be buried in the fault
> > > > > paths anyway.  It also explains why slow gup for THP used to be very slow
> > > > > before 57edfcfd3419 ("mm/gup: accelerate thp gup even for "pages !=
> > > > > NULL"") landed; the latter was not part of a performance analysis but a
> > > > > side benefit.  If performance becomes a concern, we can consider handling
> > > > > CONT_PTE in follow_page().
> > > > > 
> > > > > Before that is justified to be necessary, keep everything clean and simple.
> > > > > 
> > > > > Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> > > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > > 
> > > > Afraid I'm seeing an oops when running the gup_longterm test on arm64 with current mm-unstable. Git bisect blames this patch. The oops reproduces for me every time on two different machines:
> > > > 
> > > > 
> > > > [    9.340416] kernel BUG at mm/gup.c:778!
> > > > [    9.340746] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
> > > > [    9.341199] Modules linked in:
> > > > [    9.341481] CPU: 1 PID: 1159 Comm: gup_longterm Not tainted 6.9.0-rc2-00210-g910ff1a347e4 #11
> > > > [    9.342232] Hardware name: linux,dummy-virt (DT)
> > > > [    9.342647] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > > [    9.343195] pc : follow_page_mask+0x4d4/0x880
> > > > [    9.343580] lr : follow_page_mask+0x4d4/0x880
> > > > [    9.344018] sp : ffff8000898b3aa0
> > > > [    9.344345] x29: ffff8000898b3aa0 x28: fffffdffc53973e8 x27: 00003c0005d08000
> > > > [    9.345028] x26: ffff00014e5cfd08 x25: ffffd3513a40c000 x24: fffffdffc5d08000
> > > > [    9.345682] x23: ffffc1ffc0000000 x22: 0000000000080101 x21: ffff8000898b3ba8
> > > > [    9.346337] x20: 0000fffff4200000 x19: ffff00014e52d508 x18: 0000000000000010
> > > > [    9.347005] x17: 5f656e6f7a5f7369 x16: 2120262620296567 x15: 6170286461654865
> > > > [    9.347713] x14: 6761502128454741 x13: 2929656761702865 x12: 6761705f65636976
> > > > [    9.348371] x11: 65645f656e6f7a5f x10: ffffd3513b31d6e0 x9 : ffffd3513852f090
> > > > [    9.349062] x8 : 00000000ffffefff x7 : ffffd3513b31d6e0 x6 : 0000000000000000
> > > > [    9.349753] x5 : ffff00017ff98cc8 x4 : 0000000000000fff x3 : 0000000000000000
> > > > [    9.350397] x2 : 0000000000000000 x1 : ffff000190e8b480 x0 : 0000000000000052
> > > > [    9.351097] Call trace:
> > > > [    9.351312]  follow_page_mask+0x4d4/0x880
> > > > [    9.351700]  __get_user_pages+0xf4/0x3e8
> > > > [    9.352089]  __gup_longterm_locked+0x204/0xa70
> > > > [    9.352516]  pin_user_pages+0x88/0xc0
> > > > [    9.352873]  gup_test_ioctl+0x860/0xc40
> > > > [    9.353249]  __arm64_sys_ioctl+0xb0/0x100
> > > > [    9.353648]  invoke_syscall+0x50/0x128
> > > > [    9.354022]  el0_svc_common.constprop.0+0x48/0xf8
> > > > [    9.354488]  do_el0_svc+0x28/0x40
> > > > [    9.354822]  el0_svc+0x34/0xe0
> > > > [    9.355128]  el0t_64_sync_handler+0x13c/0x158
> > > > [    9.355489]  el0t_64_sync+0x190/0x198
> > > > [    9.355793] Code: aa1803e0 d000d8e1 91220021 97fff560 (d4210000)
> > > > [    9.356280] ---[ end trace 0000000000000000 ]---
> > > > [    9.356651] note: gup_longterm[1159] exited with irqs disabled
> > > > [    9.357141] note: gup_longterm[1159] exited with preempt_count 2
> > > > [    9.358033] ------------[ cut here ]------------
> > > > [    9.358800] WARNING: CPU: 1 PID: 0 at kernel/context_tracking.c:128 ct_kernel_exit.constprop.0+0x108/0x120
> > > > [    9.360157] Modules linked in:
> > > > [    9.360541] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G      D            6.9.0-rc2-00210-g910ff1a347e4 #11
> > > > [    9.361626] Hardware name: linux,dummy-virt (DT)
> > > > [    9.362087] pstate: 204003c5 (nzCv DAIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > > [    9.362758] pc : ct_kernel_exit.constprop.0+0x108/0x120
> > > > [    9.363306] lr : ct_idle_enter+0x10/0x20
> > > > [    9.363845] sp : ffff8000801abdc0
> > > > [    9.364222] x29: ffff8000801abdc0 x28: 0000000000000000 x27: 0000000000000000
> > > > [    9.364961] x26: 0000000000000000 x25: ffff00014149d780 x24: 0000000000000000
> > > > [    9.365557] x23: 0000000000000000 x22: ffffd3513b299d48 x21: ffffd3513a785730
> > > > [    9.366239] x20: ffffd3513b299c28 x19: ffff00017ffa7da0 x18: 0000fffff5ffffff
> > > > [    9.366869] x17: 0000000000000000 x16: 1fffe0002a21a8c1 x15: 0000000000000000
> > > > [    9.367524] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000002
> > > > [    9.368207] x11: 0000000000000001 x10: 0000000000000ad0 x9 : ffffd35138589230
> > > > [    9.369123] x8 : ffff00014149e2b0 x7 : 0000000000000000 x6 : 000000000f8c0fb2
> > > > [    9.370403] x5 : 4000000000000002 x4 : ffff2cb045825000 x3 : ffff8000801abdc0
> > > > [    9.371170] x2 : ffffd3513a782da0 x1 : 4000000000000000 x0 : ffffd3513a782da0
> > > > [    9.372279] Call trace:
> > > > [    9.372519]  ct_kernel_exit.constprop.0+0x108/0x120
> > > > [    9.373216]  ct_idle_enter+0x10/0x20
> > > > [    9.373562]  default_idle_call+0x3c/0x160
> > > > [    9.374055]  do_idle+0x21c/0x280
> > > > [    9.374394]  cpu_startup_entry+0x3c/0x50
> > > > [    9.374797]  secondary_start_kernel+0x140/0x168
> > > > [    9.375220]  __secondary_switched+0xb8/0xc0
> > > > [    9.375875] ---[ end trace 0000000000000000 ]---
> > > > 
> > > > 
> > > > The oops trigger is at mm/gup.c:778:
> > > > VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
> > > > 
> > > > 
> > > > This is the output of gup_longterm (last output is just before oops):
> > > > 
> > > > # [INFO] detected hugetlb page size: 2048 KiB
> > > > # [INFO] detected hugetlb page size: 32768 KiB
> > > > # [INFO] detected hugetlb page size: 64 KiB
> > > > # [INFO] detected hugetlb page size: 1048576 KiB
> > > > TAP version 13
> > > > 1..70
> > > > # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd
> > > > ok 1 Should have worked
> > > > # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with tmpfile
> > > > ok 2 Should have failed
> > > > # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with local tmpfile
> > > > ok 3 Should have failed
> > > > # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
> > > > ok 4 Should have worked
> > > > # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (32768 kB)
> > > > 
> > > > 
> > > > So 2M passed ok, and it's failing for 32M, which is cont-pmd. I'm guessing you're trying to iterate 2M into a cont-pmd folio and ending up with an unexpected tail page?
> > > 
> > > I assume we find the expected tail page, it's just that the check
> > > 
> > > VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
> > > 
> > > doesn't make sense with hugetlb folios. We might have a tail page mapped in
> > > a cont-pmd entry. As soon as we call follow_huge_pmd() on "not the first
> > > cont-pmd entry", we trigger this check.
> > > 
> > > Likely this sanity check must also allow for hugetlb folios. Or we should
> > > just remove it completely.
> > 
> > Right, IMHO it'll be easier if we remove it; actually I see there's one more
> > at the end, so I think we need to remove both.
> > 
> > > 
> > > In the past, we wanted to make sure that we never get tail pages of THP from
> > > PMD entries, because something would currently be broken (we don't support
> > > THP > PMD).
> > 
> > There's probably one more thing we need to do: allow
> > PageAnonExclusive() to work with hugetlb tails.  Even if we remove the
> > warnings, if I read the code right, we can still BUG_ON when checking
> > anon-exclusive on tail pages of a PageHuge folio.
> > 
> > So I assume, to fix it completely, we may need two changes: patch 1 to
> > prepare PageAnonExclusive() to work on hugetlb tails, then patch 2 to be
> > squashed into the patch "mm/gup: handle huge pmd for follow_pmd_mask()".
> > Note: not this patch to fix up, as this patch only does the "switchover"
> > to the new path; the culprit should be the other patch.
> > 
> > I have them attached below first; I'll also go and see whether I can
> > run some arm tests later today or tomorrow.  David, any comments from
> > the anon-exclusive side?
> 
> I added the PageAnonExclusive checks for hugetlb back then, because calling
> it on a tail page indicated real trouble for hugetlb.
> 
> Well, and I didn't want to have runtime-hugetlb checks in PageAnonExclusive
> code called on certainly-not-hugetlb code paths.
> 
> Personally, I'd fix up the problematic callsite where we know nothing nasty
> is happening (like we did for gup_must_unshare(), because we don't expect
> hugetlb tail pages from arbitrary other code).
> 
> But as I'm getting closer to a folio_test_anon_exclusive() implementation as
> we speak (closer, but not done :) ... ), where I'd remove any such hugetlb
> special handling, I don't particularly care how we handle GUP here in the
> meantime.

That's what I was looking for and found missing just now, when I wanted to
allow follow_huge_pmd() to pass the page / folio (which will be the head
then) properly into the different checks.  I think patch 1 is the simplest
I can come up with that works, mostly like what you said, before a
follow-up cleanup on top if possible.  It mostly pushes the existing
runtime check in gup_must_unshare() to be more generic.

IIUC it's also a matter of whether you'd want PageAnonExclusive() to take
care of both THP + hugetlb in one shot, rather than letting callers handle
it with things like "if (PageHuge()) ... else ...", which I would try to
avoid.  It seems cleaner so far to allow PageAnonExclusive() to take
whatever tail pages, THP or hugetlb.  But maybe your ultimate patchset can
be even better than that.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v4 13/13] mm/gup: Handle hugetlb in the generic follow_page_mask code
  2024-04-02 16:46           ` Ryan Roberts
@ 2024-04-02 17:58             ` Peter Xu
  0 siblings, 0 replies; 160+ messages in thread
From: Peter Xu @ 2024-04-02 17:58 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: David Hildenbrand, linux-mm, linux-kernel, Yang Shi,
	Kirill A . Shutemov, Mike Kravetz, John Hubbard,
	Michael Ellerman, Andrew Jones, Muchun Song, linux-riscv,
	linuxppc-dev, Christophe Leroy, Andrew Morton, Christoph Hellwig,
	Lorenzo Stoakes, Matthew Wilcox, Rik van Riel, linux-arm-kernel,
	Andrea Arcangeli, Aneesh Kumar K . V, Vlastimil Babka,
	James Houghton, Jason Gunthorpe, Mike Rapoport, Axel Rasmussen

On Tue, Apr 02, 2024 at 05:46:57PM +0100, Ryan Roberts wrote:
> I'll leave you to do the testing on this, if that's ok.

Definitely.  I'll test and send formal patches.

> 
> Just to make my config explicit, I have this kernel command line, which reserves
> hugetlbs of all sizes for the tests:
> 
> "transparent_hugepage=madvise earlycon root=/dev/vda2 secretmem.enable
> hugepagesz=1G hugepages=0:2,1:2 hugepagesz=32M hugepages=0:2,1:2
> default_hugepagesz=2M hugepages=0:64,1:64 hugepagesz=64K hugepages=0:2,1:2"

This helps, thanks.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v4 13/13] mm/gup: Handle hugetlb in the generic follow_page_mask code
  2024-04-02 17:57             ` Peter Xu
  (?)
  (?)
@ 2024-04-02 18:43               ` David Hildenbrand
  -1 siblings, 0 replies; 160+ messages in thread
From: David Hildenbrand @ 2024-04-02 18:43 UTC (permalink / raw)
  To: Peter Xu
  Cc: Ryan Roberts, linux-mm, linux-kernel, Yang Shi,
	Kirill A . Shutemov, Mike Kravetz, John Hubbard,
	Michael Ellerman, Andrew Jones, Muchun Song, linux-riscv,
	linuxppc-dev, Christophe Leroy, Andrew Morton, Christoph Hellwig,
	Lorenzo Stoakes, Matthew Wilcox, Rik van Riel, linux-arm-kernel,
	Andrea Arcangeli, Aneesh Kumar K . V, Vlastimil Babka,
	James Houghton, Jason Gunthorpe, Mike Rapoport, Axel Rasmussen

On 02.04.24 19:57, Peter Xu wrote:
> On Tue, Apr 02, 2024 at 06:39:31PM +0200, David Hildenbrand wrote:
>> On 02.04.24 18:20, Peter Xu wrote:
>>> On Tue, Apr 02, 2024 at 05:26:28PM +0200, David Hildenbrand wrote:
>>>> On 02.04.24 16:48, Ryan Roberts wrote:
>>>>> Hi Peter,
>>>
>>> Hey, Ryan,
>>>
>>> Thanks for the report!
>>>
>>>>>
>>>>> On 27/03/2024 15:23, peterx@redhat.com wrote:
>>>>>> From: Peter Xu <peterx@redhat.com>
>>>>>>
>>>>>> Now follow_page() is ready to handle hugetlb pages in whatever form, and
>>>>>> over all architectures.  Switch to the generic code path.
>>>>>>
>>>>>> Time to retire hugetlb_follow_page_mask(), following the previous
>>>>>> retirement of follow_hugetlb_page() in 4849807114b8.
>>>>>>
>>>>>> There may be a slight difference of how the loops run when processing slow
>>>>>> GUP over a large hugetlb range on cont_pte/cont_pmd supported archs: each
>>>>>> loop of __get_user_pages() will resolve one pgtable entry with the patch
>>>>>> applied, rather than relying on the size of hugetlb hstate, the latter may
>>>>>> cover multiple entries in one loop.
>>>>>>
>>>>>> A quick performance test on an aarch64 VM on M1 chip shows 15% degrade over
>>>>>> a tight loop of slow gup after the path switched.  That shouldn't be a
>>>>>> problem because slow-gup should not be a hot path for GUP in general: when
>>>>>> page is commonly present, fast-gup will already succeed, while when the
>>>>>> page is indeed missing and require a follow up page fault, the slow gup
>>>>>> degrade will probably buried in the fault paths anyway.  It also explains
>>>>>> why slow gup for THP used to be very slow before 57edfcfd3419 ("mm/gup:
>>>>>> accelerate thp gup even for "pages != NULL"") lands, the latter not part of
>>>>>> a performance analysis but a side benefit.  If the performance will be a
>>>>>> concern, we can consider handle CONT_PTE in follow_page().
>>>>>>
>>>>>> Before that is justified to be necessary, keep everything clean and simple.
>>>>>>
>>>>>> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
>>>>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>>>>
>>>>> Afraid I'm seeing an oops when running gup_longterm test on arm64 with current mm-unstable. Git bisect blames this patch. The oops reproduces for me every time on 2 different machines:
>>>>>
>>>>>
>>>>> [    9.340416] kernel BUG at mm/gup.c:778!
>>>>> [    9.340746] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
>>>>> [    9.341199] Modules linked in:
>>>>> [    9.341481] CPU: 1 PID: 1159 Comm: gup_longterm Not tainted 6.9.0-rc2-00210-g910ff1a347e4 #11
>>>>> [    9.342232] Hardware name: linux,dummy-virt (DT)
>>>>> [    9.342647] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>>>> [    9.343195] pc : follow_page_mask+0x4d4/0x880
>>>>> [    9.343580] lr : follow_page_mask+0x4d4/0x880
>>>>> [    9.344018] sp : ffff8000898b3aa0
>>>>> [    9.344345] x29: ffff8000898b3aa0 x28: fffffdffc53973e8 x27: 00003c0005d08000
>>>>> [    9.345028] x26: ffff00014e5cfd08 x25: ffffd3513a40c000 x24: fffffdffc5d08000
>>>>> [    9.345682] x23: ffffc1ffc0000000 x22: 0000000000080101 x21: ffff8000898b3ba8
>>>>> [    9.346337] x20: 0000fffff4200000 x19: ffff00014e52d508 x18: 0000000000000010
>>>>> [    9.347005] x17: 5f656e6f7a5f7369 x16: 2120262620296567 x15: 6170286461654865
>>>>> [    9.347713] x14: 6761502128454741 x13: 2929656761702865 x12: 6761705f65636976
>>>>> [    9.348371] x11: 65645f656e6f7a5f x10: ffffd3513b31d6e0 x9 : ffffd3513852f090
>>>>> [    9.349062] x8 : 00000000ffffefff x7 : ffffd3513b31d6e0 x6 : 0000000000000000
>>>>> [    9.349753] x5 : ffff00017ff98cc8 x4 : 0000000000000fff x3 : 0000000000000000
>>>>> [    9.350397] x2 : 0000000000000000 x1 : ffff000190e8b480 x0 : 0000000000000052
>>>>> [    9.351097] Call trace:
>>>>> [    9.351312]  follow_page_mask+0x4d4/0x880
>>>>> [    9.351700]  __get_user_pages+0xf4/0x3e8
>>>>> [    9.352089]  __gup_longterm_locked+0x204/0xa70
>>>>> [    9.352516]  pin_user_pages+0x88/0xc0
>>>>> [    9.352873]  gup_test_ioctl+0x860/0xc40
>>>>> [    9.353249]  __arm64_sys_ioctl+0xb0/0x100
>>>>> [    9.353648]  invoke_syscall+0x50/0x128
>>>>> [    9.354022]  el0_svc_common.constprop.0+0x48/0xf8
>>>>> [    9.354488]  do_el0_svc+0x28/0x40
>>>>> [    9.354822]  el0_svc+0x34/0xe0
>>>>> [    9.355128]  el0t_64_sync_handler+0x13c/0x158
>>>>> [    9.355489]  el0t_64_sync+0x190/0x198
>>>>> [    9.355793] Code: aa1803e0 d000d8e1 91220021 97fff560 (d4210000)
>>>>> [    9.356280] ---[ end trace 0000000000000000 ]---
>>>>> [    9.356651] note: gup_longterm[1159] exited with irqs disabled
>>>>> [    9.357141] note: gup_longterm[1159] exited with preempt_count 2
>>>>> [    9.358033] ------------[ cut here ]------------
>>>>> [    9.358800] WARNING: CPU: 1 PID: 0 at kernel/context_tracking.c:128 ct_kernel_exit.constprop.0+0x108/0x120
>>>>> [    9.360157] Modules linked in:
>>>>> [    9.360541] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G      D            6.9.0-rc2-00210-g910ff1a347e4 #11
>>>>> [    9.361626] Hardware name: linux,dummy-virt (DT)
>>>>> [    9.362087] pstate: 204003c5 (nzCv DAIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>>>> [    9.362758] pc : ct_kernel_exit.constprop.0+0x108/0x120
>>>>> [    9.363306] lr : ct_idle_enter+0x10/0x20
>>>>> [    9.363845] sp : ffff8000801abdc0
>>>>> [    9.364222] x29: ffff8000801abdc0 x28: 0000000000000000 x27: 0000000000000000
>>>>> [    9.364961] x26: 0000000000000000 x25: ffff00014149d780 x24: 0000000000000000
>>>>> [    9.365557] x23: 0000000000000000 x22: ffffd3513b299d48 x21: ffffd3513a785730
>>>>> [    9.366239] x20: ffffd3513b299c28 x19: ffff00017ffa7da0 x18: 0000fffff5ffffff
>>>>> [    9.366869] x17: 0000000000000000 x16: 1fffe0002a21a8c1 x15: 0000000000000000
>>>>> [    9.367524] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000002
>>>>> [    9.368207] x11: 0000000000000001 x10: 0000000000000ad0 x9 : ffffd35138589230
>>>>> [    9.369123] x8 : ffff00014149e2b0 x7 : 0000000000000000 x6 : 000000000f8c0fb2
>>>>> [    9.370403] x5 : 4000000000000002 x4 : ffff2cb045825000 x3 : ffff8000801abdc0
>>>>> [    9.371170] x2 : ffffd3513a782da0 x1 : 4000000000000000 x0 : ffffd3513a782da0
>>>>> [    9.372279] Call trace:
>>>>> [    9.372519]  ct_kernel_exit.constprop.0+0x108/0x120
>>>>> [    9.373216]  ct_idle_enter+0x10/0x20
>>>>> [    9.373562]  default_idle_call+0x3c/0x160
>>>>> [    9.374055]  do_idle+0x21c/0x280
>>>>> [    9.374394]  cpu_startup_entry+0x3c/0x50
>>>>> [    9.374797]  secondary_start_kernel+0x140/0x168
>>>>> [    9.375220]  __secondary_switched+0xb8/0xc0
>>>>> [    9.375875] ---[ end trace 0000000000000000 ]---
>>>>>
>>>>>
>>>>> The oops trigger is at mm/gup.c:778:
>>>>> VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
>>>>>
>>>>>
>>>>> This is the output of gup_longterm (last output is just before oops):
>>>>>
>>>>> # [INFO] detected hugetlb page size: 2048 KiB
>>>>> # [INFO] detected hugetlb page size: 32768 KiB
>>>>> # [INFO] detected hugetlb page size: 64 KiB
>>>>> # [INFO] detected hugetlb page size: 1048576 KiB
>>>>> TAP version 13
>>>>> 1..70
>>>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd
>>>>> ok 1 Should have worked
>>>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with tmpfile
>>>>> ok 2 Should have failed
>>>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with local tmpfile
>>>>> ok 3 Should have failed
>>>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
>>>>> ok 4 Should have worked
>>>>> # [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (32768 kB)
>>>>>
>>>>>
>>>>> So 2M passed ok, and it's failing for 32M, which is cont-pmd. I'm guessing you're trying to iterate 2M into a cont-pmd folio and ending up with an unexpected tail page?
>>>>
>>>> I assume we find the expected tail page, it's just that the check
>>>>
>>>> VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
>>>>
>>>> Doesn't make sense with hugetlb folios. We might have a tail page mapped in
>>>> a cont-pmd entry. As soon as we call follow_huge_pmd() on "not the first
>>>> cont-pmd entry", we trigger this check.
>>>>
>>>> Likely this sanity check must also allow for hugetlb folios. Or we should
>>>> just remove it completely.
>>>
>>> Right, IMHO it'll be easier we remove it, actually I see there's one more
>>> at the end, so I think we need to remove both.
>>>
>>>>
>>>> In the past, we wanted to make sure that we never get tail pages of THP from
>>>> PMD entries, because something would currently be broken (we don't support
>>>> THP > PMD).
>>>
>>> There's probably one more thing we need to do, on allowing
>>> PageAnonExclusive() to work with hugetlb tails. Even if we remove the
>>> warnings and if I read the code right, we can BUG_ON again on checking tail
>>> pages over anon-exclusive for PageHuge.
>>>
>>> So I assume to fix it completely, we may need two changes: Patch 1 to
>>> prepare PageAnonExclusive() to work on hugetlb tails, then patch 2 to be
>>> squashed into the patch "mm/gup: handle huge pmd for follow_pmd_mask()".
>>> Note: not this patch to fixup, as this patch only does the "switchover" to
>>> the new path, the culprit should be the other patch..
>>>
>>> I have them attached below first, before I'll also go and see whether I can
>>> run some arm tests later today or tomorrow.  David, any comments from
>>> anon-exclusive side?
>>
>> I added the PageAnonExclusive checks for hugetlb back then, because calling
>> it on a tail page indicated real trouble for hugetlb.
>>
>> Well, and I didn't want to have runtime-hugetlb checks in PageAnonExclusive
>> code called on certainly-not-hugetlb code paths.
>>
>> Personally, I'd fixup the problematic callsite where we know nothing nasty
>> is happening (like we did for gup_must_unshare(), because we don't expect
>> hugetlb tail pages from arbitrary other code).
>>
>> But as I'm getting closer to a folio_test_anon_exclusive() implementation as
>> we speak (closer, but not done :) ... ), where I'd remove any such hugetlb
>> special handling, I don't particularly care how we handle GUP here in the
>> meantime.
> 
> That's what I was looking for and found missing just now, when I wanted to
> allow follow_huge_pmd() pass page / folio (which will be the head then)
> properly into different checks.  I think patch 1 is the simplest I can
> come up with that works mostly like what you said, with a follow-up
> cleanup on top if possible.  It mostly moves the existing runtime check in
> gup_must_unshare() somewhere more generic.
> 
> IIUC it's also a matter of whether you'd want PageAnonExclusive() to take
> care of both thp + hugetlb in one shot, rather than letting callers handle
> it with things like "if (PageHuge()) ... else ...", which I would try to avoid.

I tried to not let the caller pass in things that didn't make any sense.

Getting a tail page of a hugetlb folio from any page table walker except 
GUP-fast was completely bogus before your patch.
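
So checks like the one Ryan hit either have to go away or learn about
hugetlb; if we kept them, a sketch of the relaxed form could look
something like:

        VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page) &&
                       !PageHuge(page), page);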

PageAnonExclusive was designed to be set on the page that a PTE points 
to, like having an additional PTE bit.  Cont-pte/cont-pmd, with the 
hugetlb fuzz we all love around it (huge_pte_offset()), did the right 
thing, because it abstracted the "multiple cont-pte/cont-pmd" PTEs into 
just a single logical PTE, with a single dedicated PageAnonExclusive bit.

So "conceptually", the caller that knows how the "single logical PTE" 
was the one to handle it. That meant, GUP-fast needed to be special, 
because it was unaware of the huge_pte_offset() logic.
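
(For reference, that special-casing is roughly the following, paraphrasing
the tail of gup_must_unshare() from memory rather than quoting it verbatim:

        /*
         * GUP-fast may hand us a cont-pte tail page, while hugetlb
         * only tracks the flag on the head page.
         */
        if (unlikely(!PageHead(page) && PageHuge(page)))
                page = compound_head(page);
        return !PageAnonExclusive(page);

so every other walker was expected to pass in the head page directly.)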

But that seems to change now as we are changing our page table walkers, 
so I don't particularly care how we handle it.

> It seems cleaner so far to allow PageAnonExclusive() to take whatever
> tail pages, thp or hugetlb.  But maybe your ultimate patchset can be even
> better than that.

At least that part will be much cleaner.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v4 05/13] mm/arch: Provide pud_pfn() fallback
  2024-03-27 15:23   ` peterx
  (?)
  (?)
@ 2024-04-02 19:05     ` Nathan Chancellor
  -1 siblings, 0 replies; 160+ messages in thread
From: Nathan Chancellor @ 2024-04-02 19:05 UTC (permalink / raw)
  To: peterx
  Cc: linux-mm, linux-kernel, Yang Shi, Kirill A . Shutemov,
	Mike Kravetz, John Hubbard, Michael Ellerman, Andrew Jones,
	Muchun Song, linux-riscv, linuxppc-dev, Christophe Leroy,
	Andrew Morton, Christoph Hellwig, Lorenzo Stoakes,
	Matthew Wilcox, Rik van Riel, linux-arm-kernel, Andrea Arcangeli,
	David Hildenbrand, Aneesh Kumar K . V, Vlastimil Babka,
	James Houghton, Jason Gunthorpe, Mike Rapoport, Axel Rasmussen,
	Huacai Chen, WANG Xuerui, loongarch

Hi Peter (and LoongArch folks),

On Wed, Mar 27, 2024 at 11:23:24AM -0400, peterx@redhat.com wrote:
> From: Peter Xu <peterx@redhat.com>
> 
> The comment in the code explains the reasons.  We took a different approach
> comparing to pmd_pfn() by providing a fallback function.
> 
> Another option is to provide some lower level config options (compare to
> HUGETLB_PAGE or THP) to identify which layer an arch can support for such
> huge mappings.  However that can be an overkill.
> 
> Cc: Mike Rapoport (IBM) <rppt@kernel.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  arch/riscv/include/asm/pgtable.h    |  1 +
>  arch/s390/include/asm/pgtable.h     |  1 +
>  arch/sparc/include/asm/pgtable_64.h |  1 +
>  arch/x86/include/asm/pgtable.h      |  1 +
>  include/linux/pgtable.h             | 10 ++++++++++
>  5 files changed, 14 insertions(+)
> 
> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
> index 20242402fc11..0ca28cc8e3fa 100644
> --- a/arch/riscv/include/asm/pgtable.h
> +++ b/arch/riscv/include/asm/pgtable.h
> @@ -646,6 +646,7 @@ static inline unsigned long pmd_pfn(pmd_t pmd)
>  
>  #define __pud_to_phys(pud)  (__page_val_to_pfn(pud_val(pud)) << PAGE_SHIFT)
>  
> +#define pud_pfn pud_pfn
>  static inline unsigned long pud_pfn(pud_t pud)
>  {
>  	return ((__pud_to_phys(pud) & PUD_MASK) >> PAGE_SHIFT);
> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index 1a71cb19c089..6cbbe473f680 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -1414,6 +1414,7 @@ static inline unsigned long pud_deref(pud_t pud)
>  	return (unsigned long)__va(pud_val(pud) & origin_mask);
>  }
>  
> +#define pud_pfn pud_pfn
>  static inline unsigned long pud_pfn(pud_t pud)
>  {
>  	return __pa(pud_deref(pud)) >> PAGE_SHIFT;
> diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
> index 4d1bafaba942..26efc9bb644a 100644
> --- a/arch/sparc/include/asm/pgtable_64.h
> +++ b/arch/sparc/include/asm/pgtable_64.h
> @@ -875,6 +875,7 @@ static inline bool pud_leaf(pud_t pud)
>  	return pte_val(pte) & _PAGE_PMD_HUGE;
>  }
>  
> +#define pud_pfn pud_pfn
>  static inline unsigned long pud_pfn(pud_t pud)
>  {
>  	pte_t pte = __pte(pud_val(pud));
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index cefc7a84f7a4..273f7557218c 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -234,6 +234,7 @@ static inline unsigned long pmd_pfn(pmd_t pmd)
>  	return (pfn & pmd_pfn_mask(pmd)) >> PAGE_SHIFT;
>  }
>  
> +#define pud_pfn pud_pfn
>  static inline unsigned long pud_pfn(pud_t pud)
>  {
>  	phys_addr_t pfn = pud_val(pud);
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 600e17d03659..75fe309a4e10 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1817,6 +1817,16 @@ typedef unsigned int pgtbl_mod_mask;
>  #define pte_leaf_size(x) PAGE_SIZE
>  #endif
>  
> +/*
> + * We always define pmd_pfn for all archs as it's used in lots of generic
> + * code.  Now it happens too for pud_pfn (and can happen for larger
> + * mappings too in the future; we're not there yet).  Instead of defining
> + * it for all archs (like pmd_pfn), provide a fallback.
> + */
> +#ifndef pud_pfn
> +#define pud_pfn(x) ({ BUILD_BUG(); 0; })
> +#endif
> +
>  /*
>   * Some architectures have MMUs that are configurable or selectable at boot
>   * time. These lead to variable PTRS_PER_x. For statically allocated arrays it
> -- 
> 2.44.0
> 

This BUILD_BUG() triggers for LoongArch with their defconfig, so it
seems like they need to provide an implementation of pud_pfn()?

  In function 'follow_huge_pud',
      inlined from 'follow_pud_mask' at mm/gup.c:1075:10,
      inlined from 'follow_p4d_mask' at mm/gup.c:1105:9,
      inlined from 'follow_page_mask' at mm/gup.c:1151:10:
  include/linux/compiler_types.h:460:45: error: call to '__compiletime_assert_382' declared with attribute error: BUILD_BUG failed
    460 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
        |                                             ^
  include/linux/compiler_types.h:441:25: note: in definition of macro '__compiletime_assert'
    441 |                         prefix ## suffix();                             \
        |                         ^~~~~~
  include/linux/compiler_types.h:460:9: note: in expansion of macro '_compiletime_assert'
    460 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
        |         ^~~~~~~~~~~~~~~~~~~
  include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
     39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
        |                                     ^~~~~~~~~~~~~~~~~~
  include/linux/build_bug.h:59:21: note: in expansion of macro 'BUILD_BUG_ON_MSG'
     59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
        |                     ^~~~~~~~~~~~~~~~
  include/linux/pgtable.h:1887:23: note: in expansion of macro 'BUILD_BUG'
   1887 | #define pud_pfn(x) ({ BUILD_BUG(); 0; })
        |                       ^~~~~~~~~
  mm/gup.c:679:29: note: in expansion of macro 'pud_pfn'
    679 |         unsigned long pfn = pud_pfn(pud);
        |                             ^~~~~~~
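
For illustration only, a hypothetical definition following the same
pattern as the arches touched by the patch.  The self-referencing
#define is what suppresses the generic fallback in
include/linux/pgtable.h; the mask and shift names below are
placeholders that would need to match LoongArch's actual pte layout:

  /* hypothetical sketch for arch/loongarch/include/asm/pgtable.h */
  #define pud_pfn pud_pfn
  static inline unsigned long pud_pfn(pud_t pud)
  {
          /* placeholder names: mirror whatever pmd_pfn() uses here */
          return (pud_val(pud) & _PFN_MASK) >> _PFN_SHIFT;
  }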

Cheers,
Nathan

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v4 05/13] mm/arch: Provide pud_pfn() fallback
  2024-04-02 19:05     ` Nathan Chancellor
@ 2024-04-02 22:43       ` Peter Xu
  -1 siblings, 0 replies; 160+ messages in thread
From: Peter Xu @ 2024-04-02 22:43 UTC (permalink / raw)
  To: Nathan Chancellor
  Cc: linux-mm, linux-kernel, Yang Shi, Kirill A . Shutemov,
	Mike Kravetz, John Hubbard, Michael Ellerman, Andrew Jones,
	Muchun Song, linux-riscv, linuxppc-dev, Christophe Leroy,
	Andrew Morton, Christoph Hellwig, Lorenzo Stoakes,
	Matthew Wilcox, Rik van Riel, linux-arm-kernel, Andrea Arcangeli,
	David Hildenbrand, Aneesh Kumar K . V, Vlastimil Babka,
	James Houghton, Jason Gunthorpe, Mike Rapoport, Axel Rasmussen,
	Huacai Chen, WANG Xuerui, loongarch

On Tue, Apr 02, 2024 at 12:05:49PM -0700, Nathan Chancellor wrote:
> Hi Peter (and LoongArch folks),
> 
> On Wed, Mar 27, 2024 at 11:23:24AM -0400, peterx@redhat.com wrote:
> > From: Peter Xu <peterx@redhat.com>
> > 
> > The comment in the code explains the reasons.  We took a different approach
> > compared to pmd_pfn() by providing a fallback function.
> > 
> > Another option is to provide some lower level config options (compared to
> > HUGETLB_PAGE or THP) to identify which layer an arch can support for such
> > huge mappings.  However, that can be overkill.
> > 
> > Cc: Mike Rapoport (IBM) <rppt@kernel.org>
> > Cc: Matthew Wilcox <willy@infradead.org>
> > Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  arch/riscv/include/asm/pgtable.h    |  1 +
> >  arch/s390/include/asm/pgtable.h     |  1 +
> >  arch/sparc/include/asm/pgtable_64.h |  1 +
> >  arch/x86/include/asm/pgtable.h      |  1 +
> >  include/linux/pgtable.h             | 10 ++++++++++
> >  5 files changed, 14 insertions(+)
> > 
> > diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
> > index 20242402fc11..0ca28cc8e3fa 100644
> > --- a/arch/riscv/include/asm/pgtable.h
> > +++ b/arch/riscv/include/asm/pgtable.h
> > @@ -646,6 +646,7 @@ static inline unsigned long pmd_pfn(pmd_t pmd)
> >  
> >  #define __pud_to_phys(pud)  (__page_val_to_pfn(pud_val(pud)) << PAGE_SHIFT)
> >  
> > +#define pud_pfn pud_pfn
> >  static inline unsigned long pud_pfn(pud_t pud)
> >  {
> >  	return ((__pud_to_phys(pud) & PUD_MASK) >> PAGE_SHIFT);
> > diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> > index 1a71cb19c089..6cbbe473f680 100644
> > --- a/arch/s390/include/asm/pgtable.h
> > +++ b/arch/s390/include/asm/pgtable.h
> > @@ -1414,6 +1414,7 @@ static inline unsigned long pud_deref(pud_t pud)
> >  	return (unsigned long)__va(pud_val(pud) & origin_mask);
> >  }
> >  
> > +#define pud_pfn pud_pfn
> >  static inline unsigned long pud_pfn(pud_t pud)
> >  {
> >  	return __pa(pud_deref(pud)) >> PAGE_SHIFT;
> > diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
> > index 4d1bafaba942..26efc9bb644a 100644
> > --- a/arch/sparc/include/asm/pgtable_64.h
> > +++ b/arch/sparc/include/asm/pgtable_64.h
> > @@ -875,6 +875,7 @@ static inline bool pud_leaf(pud_t pud)
> >  	return pte_val(pte) & _PAGE_PMD_HUGE;
> >  }
> >  
> > +#define pud_pfn pud_pfn
> >  static inline unsigned long pud_pfn(pud_t pud)
> >  {
> >  	pte_t pte = __pte(pud_val(pud));
> > diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> > index cefc7a84f7a4..273f7557218c 100644
> > --- a/arch/x86/include/asm/pgtable.h
> > +++ b/arch/x86/include/asm/pgtable.h
> > @@ -234,6 +234,7 @@ static inline unsigned long pmd_pfn(pmd_t pmd)
> >  	return (pfn & pmd_pfn_mask(pmd)) >> PAGE_SHIFT;
> >  }
> >  
> > +#define pud_pfn pud_pfn
> >  static inline unsigned long pud_pfn(pud_t pud)
> >  {
> >  	phys_addr_t pfn = pud_val(pud);
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > index 600e17d03659..75fe309a4e10 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -1817,6 +1817,16 @@ typedef unsigned int pgtbl_mod_mask;
> >  #define pte_leaf_size(x) PAGE_SIZE
> >  #endif
> >  
> > +/*
> > + * We always define pmd_pfn for all archs as it's used in lots of generic
> > + * code.  Now it happens too for pud_pfn (and can happen for larger
> > + * mappings too in the future; we're not there yet).  Instead of defining
> > + * it for all archs (like pmd_pfn), provide a fallback.
> > + */
> > +#ifndef pud_pfn
> > +#define pud_pfn(x) ({ BUILD_BUG(); 0; })
> > +#endif
> > +
> >  /*
> >   * Some architectures have MMUs that are configurable or selectable at boot
> >   * time. These lead to variable PTRS_PER_x. For statically allocated arrays it
> > -- 
> > 2.44.0
> > 
> 
> This BUILD_BUG() triggers for LoongArch with their defconfig, so it
> seems like they need to provide an implementation of pud_pfn()?
> 
>   In function 'follow_huge_pud',
>       inlined from 'follow_pud_mask' at mm/gup.c:1075:10,
>       inlined from 'follow_p4d_mask' at mm/gup.c:1105:9,
>       inlined from 'follow_page_mask' at mm/gup.c:1151:10:
>   include/linux/compiler_types.h:460:45: error: call to '__compiletime_assert_382' declared with attribute error: BUILD_BUG failed
>     460 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>         |                                             ^
>   include/linux/compiler_types.h:441:25: note: in definition of macro '__compiletime_assert'
>     441 |                         prefix ## suffix();                             \
>         |                         ^~~~~~
>   include/linux/compiler_types.h:460:9: note: in expansion of macro '_compiletime_assert'
>     460 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>         |         ^~~~~~~~~~~~~~~~~~~
>   include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
>      39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
>         |                                     ^~~~~~~~~~~~~~~~~~
>   include/linux/build_bug.h:59:21: note: in expansion of macro 'BUILD_BUG_ON_MSG'
>      59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
>         |                     ^~~~~~~~~~~~~~~~
>   include/linux/pgtable.h:1887:23: note: in expansion of macro 'BUILD_BUG'
>    1887 | #define pud_pfn(x) ({ BUILD_BUG(); 0; })
>         |                       ^~~~~~~~~
>   mm/gup.c:679:29: note: in expansion of macro 'pud_pfn'
>     679 |         unsigned long pfn = pud_pfn(pud);
>         |                             ^~~~~~~

I actually tested this without hitting the issue (even though I didn't
mention it in the cover letter).  I re-kicked the build test, and it turns
out that "make alldefconfig" on loongarch generates a config with both
HUGETLB=n && THP=n, while arch/loongarch/configs/loongson3_defconfig has
THP=y (which I assume is the config the build above used).  I didn't check
further how "make alldefconfig" generated the config; it's a bit
surprising that it didn't fetch from there.

(and it also surprises me that this BUILD_BUG() can trigger; I had tried
 to trigger it elsewhere before but failed.)

For loongarch the best thing would be to not compile in follow_huge_pud()
at all, as it supports neither pud dax nor pud hugetlb (see the sketch
below).  However, that again may require some more CONFIG_* options to
declare which level of huge mapping an arch supports for HUGETLB_PAGE.
Maybe the simplest fix here (and one that should also cover any remaining
archs where a similar issue could happen) is to remove the BUILD_BUG() and
explain why; that is what the fixup at the end of this mail does.  It
should also be safe for loongarch in that case to leave pud_pfn()
undefined until it is properly supported.
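
For illustration, a sketch of the not-compiled-in idea (untested; whether
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD is the right gate, or whether a
new option would be needed to also cover pud hugetlb, is an assumption,
and the signature below is only approximate):

  #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
  /* keep the existing follow_huge_pud(), which uses pud_pfn() */
  #else
  static struct page *follow_huge_pud(struct vm_area_struct *vma,
                                      unsigned long addr, pud_t *pudp,
                                      int flags,
                                      struct follow_page_context *ctx)
  {
          /* no pud dax and no pud hugetlb possible on this arch */
          return NULL;
  }
  #endif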

Thanks,

===8<===
From 585f34aa3d5b12cd2186367b0882d4293f792062 Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Tue, 2 Apr 2024 18:31:07 -0400
Subject: [PATCH] fixup! mm/arch: provide pud_pfn() fallback

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/pgtable.h | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index fa8f92f6e2d7..0f4b2faa1d71 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1882,9 +1882,13 @@ typedef unsigned int pgtbl_mod_mask;
  * code.  Now it happens too for pud_pfn (and can happen for larger
  * mappings too in the future; we're not there yet).  Instead of defining
  * it for all archs (like pmd_pfn), provide a fallback.
+ *
+ * Note that returning 0 here means any arch that didn't define this can
+ * go severely wrong when it hits a real pud leaf.  It's the arch's
+ * responsibility to properly define it when a huge pud is possible.
  */
 #ifndef pud_pfn
-#define pud_pfn(x) ({ BUILD_BUG(); 0; })
+#define pud_pfn(x) 0
 #endif
 
 /*
-- 
2.44.0
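
To make the caveat in that comment concrete: with the fallback returning
0, an arch that grows a real pud leaf without defining pud_pfn() would
silently resolve it to pfn 0 instead of failing the build, roughly:

  unsigned long pfn = pud_pfn(pud);       /* 0 from the fallback */
  struct page *page = pfn_to_page(pfn);   /* silently the wrong page */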

-- 
Peter Xu


^ permalink raw reply related	[flat|nested] 160+ messages in thread

* Re: [PATCH v4 05/13] mm/arch: Provide pud_pfn() fallback
@ 2024-04-02 22:43       ` Peter Xu
  0 siblings, 0 replies; 160+ messages in thread
From: Peter Xu @ 2024-04-02 22:43 UTC (permalink / raw)
  To: Nathan Chancellor
  Cc: linux-mm, linux-kernel, Yang Shi, Kirill A . Shutemov,
	Mike Kravetz, John Hubbard, Michael Ellerman, Andrew Jones,
	Muchun Song, linux-riscv, linuxppc-dev, Christophe Leroy,
	Andrew Morton, Christoph Hellwig, Lorenzo Stoakes,
	Matthew Wilcox, Rik van Riel, linux-arm-kernel, Andrea Arcangeli,
	David Hildenbrand, Aneesh Kumar K . V, Vlastimil Babka,
	James Houghton, Jason Gunthorpe, Mike Rapoport, Axel Rasmussen,
	Huacai Chen, WANG Xuerui, loongarch

On Tue, Apr 02, 2024 at 12:05:49PM -0700, Nathan Chancellor wrote:
> Hi Peter (and LoongArch folks),
> 
> On Wed, Mar 27, 2024 at 11:23:24AM -0400, peterx@redhat.com wrote:
> > From: Peter Xu <peterx@redhat.com>
> > 
> > The comment in the code explains the reasons.  We took a different approach
> > comparing to pmd_pfn() by providing a fallback function.
> > 
> > Another option is to provide some lower level config options (compare to
> > HUGETLB_PAGE or THP) to identify which layer an arch can support for such
> > huge mappings.  However that can be an overkill.
> > 
> > Cc: Mike Rapoport (IBM) <rppt@kernel.org>
> > Cc: Matthew Wilcox <willy@infradead.org>
> > Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  arch/riscv/include/asm/pgtable.h    |  1 +
> >  arch/s390/include/asm/pgtable.h     |  1 +
> >  arch/sparc/include/asm/pgtable_64.h |  1 +
> >  arch/x86/include/asm/pgtable.h      |  1 +
> >  include/linux/pgtable.h             | 10 ++++++++++
> >  5 files changed, 14 insertions(+)
> > 
> > diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
> > index 20242402fc11..0ca28cc8e3fa 100644
> > --- a/arch/riscv/include/asm/pgtable.h
> > +++ b/arch/riscv/include/asm/pgtable.h
> > @@ -646,6 +646,7 @@ static inline unsigned long pmd_pfn(pmd_t pmd)
> >  
> >  #define __pud_to_phys(pud)  (__page_val_to_pfn(pud_val(pud)) << PAGE_SHIFT)
> >  
> > +#define pud_pfn pud_pfn
> >  static inline unsigned long pud_pfn(pud_t pud)
> >  {
> >  	return ((__pud_to_phys(pud) & PUD_MASK) >> PAGE_SHIFT);
> > diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> > index 1a71cb19c089..6cbbe473f680 100644
> > --- a/arch/s390/include/asm/pgtable.h
> > +++ b/arch/s390/include/asm/pgtable.h
> > @@ -1414,6 +1414,7 @@ static inline unsigned long pud_deref(pud_t pud)
> >  	return (unsigned long)__va(pud_val(pud) & origin_mask);
> >  }
> >  
> > +#define pud_pfn pud_pfn
> >  static inline unsigned long pud_pfn(pud_t pud)
> >  {
> >  	return __pa(pud_deref(pud)) >> PAGE_SHIFT;
> > diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
> > index 4d1bafaba942..26efc9bb644a 100644
> > --- a/arch/sparc/include/asm/pgtable_64.h
> > +++ b/arch/sparc/include/asm/pgtable_64.h
> > @@ -875,6 +875,7 @@ static inline bool pud_leaf(pud_t pud)
> >  	return pte_val(pte) & _PAGE_PMD_HUGE;
> >  }
> >  
> > +#define pud_pfn pud_pfn
> >  static inline unsigned long pud_pfn(pud_t pud)
> >  {
> >  	pte_t pte = __pte(pud_val(pud));
> > diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> > index cefc7a84f7a4..273f7557218c 100644
> > --- a/arch/x86/include/asm/pgtable.h
> > +++ b/arch/x86/include/asm/pgtable.h
> > @@ -234,6 +234,7 @@ static inline unsigned long pmd_pfn(pmd_t pmd)
> >  	return (pfn & pmd_pfn_mask(pmd)) >> PAGE_SHIFT;
> >  }
> >  
> > +#define pud_pfn pud_pfn
> >  static inline unsigned long pud_pfn(pud_t pud)
> >  {
> >  	phys_addr_t pfn = pud_val(pud);
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > index 600e17d03659..75fe309a4e10 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -1817,6 +1817,16 @@ typedef unsigned int pgtbl_mod_mask;
> >  #define pte_leaf_size(x) PAGE_SIZE
> >  #endif
> >  
> > +/*
> > + * We always define pmd_pfn for all archs as it's used in lots of generic
> > + * code.  Now it happens too for pud_pfn (and can happen for larger
> > + * mappings too in the future; we're not there yet).  Instead of defining
> > + * it for all archs (like pmd_pfn), provide a fallback.
> > + */
> > +#ifndef pud_pfn
> > +#define pud_pfn(x) ({ BUILD_BUG(); 0; })
> > +#endif
> > +
> >  /*
> >   * Some architectures have MMUs that are configurable or selectable at boot
> >   * time. These lead to variable PTRS_PER_x. For statically allocated arrays it
> > -- 
> > 2.44.0
> > 
> 
> This BUILD_BUG() triggers for LoongArch with their defconfig, so it
> seems like they need to provide an implementation of pud_pfn()?
> 
>   In function 'follow_huge_pud',
>       inlined from 'follow_pud_mask' at mm/gup.c:1075:10,
>       inlined from 'follow_p4d_mask' at mm/gup.c:1105:9,
>       inlined from 'follow_page_mask' at mm/gup.c:1151:10:
>   include/linux/compiler_types.h:460:45: error: call to '__compiletime_assert_382' declared with attribute error: BUILD_BUG failed
>     460 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>         |                                             ^
>   include/linux/compiler_types.h:441:25: note: in definition of macro '__compiletime_assert'
>     441 |                         prefix ## suffix();                             \
>         |                         ^~~~~~
>   include/linux/compiler_types.h:460:9: note: in expansion of macro '_compiletime_assert'
>     460 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>         |         ^~~~~~~~~~~~~~~~~~~
>   include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
>      39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
>         |                                     ^~~~~~~~~~~~~~~~~~
>   include/linux/build_bug.h:59:21: note: in expansion of macro 'BUILD_BUG_ON_MSG'
>      59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
>         |                     ^~~~~~~~~~~~~~~~
>   include/linux/pgtable.h:1887:23: note: in expansion of macro 'BUILD_BUG'
>    1887 | #define pud_pfn(x) ({ BUILD_BUG(); 0; })
>         |                       ^~~~~~~~~
>   mm/gup.c:679:29: note: in expansion of macro 'pud_pfn'
>     679 |         unsigned long pfn = pud_pfn(pud);
>         |                             ^~~~~~~

I actually tested this without hitting the issue (even though I didn't
mention it in the cover letter..).  I re-kicked the build test, it turns
out my "make alldefconfig" on loongarch will generate a config with both
HUGETLB=n && THP=n, while arch/loongarch/configs/loongson3_defconfig has
THP=y (which I assume was the one above build used).  I didn't further
check how "make alldefconfig" generated the config; a bit surprising that
it didn't fetch from there.

(and it also surprises me that this BUILD_BUG can trigger.. I used to try
 triggering it elsewhere but failed..)

For loongarch the best thing is not compile in follow_huge_pud(), as it
doesn't support pud dax, neither does it support pud hugetlb.  However
again that may require some more CONFIG_* options to declare the level one
arch supports on HUGETLB_PAGE.  Here maybe the simplest (and it should also
cover all the rest archs on similar issues if ever possible to happen) is
we remove the BUILD_BUG() and explain why.  It should be safe for loongarch
too here in this case to not defined it until properly supported.

Thanks,

===8<===
From 585f34aa3d5b12cd2186367b0882d4293f792062 Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Tue, 2 Apr 2024 18:31:07 -0400
Subject: [PATCH] fixup! mm/arch: provide pud_pfn() fallback

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/pgtable.h | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index fa8f92f6e2d7..0f4b2faa1d71 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1882,9 +1882,13 @@ typedef unsigned int pgtbl_mod_mask;
  * code.  Now it happens too for pud_pfn (and can happen for larger
  * mappings too in the future; we're not there yet).  Instead of defining
  * it for all archs (like pmd_pfn), provide a fallback.
+ *
+ * Note that returning 0 here means any arch that didn't define this can
+ * get severely wrong when it hits a real pud leaf.  It's arch's
+ * responsibility to properly define it when a huge pud is possible.
  */
 #ifndef pud_pfn
-#define pud_pfn(x) ({ BUILD_BUG(); 0; })
+#define pud_pfn(x) 0
 #endif
 
 /*
-- 
2.44.0

-- 
Peter Xu


_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply related	[flat|nested] 160+ messages in thread

* Re: [PATCH v4 05/13] mm/arch: Provide pud_pfn() fallback
@ 2024-04-02 22:43       ` Peter Xu
  0 siblings, 0 replies; 160+ messages in thread
From: Peter Xu @ 2024-04-02 22:43 UTC (permalink / raw)
  To: Nathan Chancellor
  Cc: linux-mm, linux-kernel, Yang Shi, Kirill A . Shutemov,
	Mike Kravetz, John Hubbard, Michael Ellerman, Andrew Jones,
	Muchun Song, linux-riscv, linuxppc-dev, Christophe Leroy,
	Andrew Morton, Christoph Hellwig, Lorenzo Stoakes,
	Matthew Wilcox, Rik van Riel, linux-arm-kernel, Andrea Arcangeli,
	David Hildenbrand, Aneesh Kumar K . V, Vlastimil Babka,
	James Houghton, Jason Gunthorpe, Mike Rapoport, Axel Rasmussen,
	Huacai Chen, WANG Xuerui, loongarch

On Tue, Apr 02, 2024 at 12:05:49PM -0700, Nathan Chancellor wrote:
> Hi Peter (and LoongArch folks),
> 
> On Wed, Mar 27, 2024 at 11:23:24AM -0400, peterx@redhat.com wrote:
> > From: Peter Xu <peterx@redhat.com>
> > 
> > The comment in the code explains the reasons.  We took a different approach
> > comparing to pmd_pfn() by providing a fallback function.
> > 
> > Another option is to provide some lower level config options (compare to
> > HUGETLB_PAGE or THP) to identify which layer an arch can support for such
> > huge mappings.  However that can be an overkill.
> > 
> > Cc: Mike Rapoport (IBM) <rppt@kernel.org>
> > Cc: Matthew Wilcox <willy@infradead.org>
> > Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  arch/riscv/include/asm/pgtable.h    |  1 +
> >  arch/s390/include/asm/pgtable.h     |  1 +
> >  arch/sparc/include/asm/pgtable_64.h |  1 +
> >  arch/x86/include/asm/pgtable.h      |  1 +
> >  include/linux/pgtable.h             | 10 ++++++++++
> >  5 files changed, 14 insertions(+)
> > 
> > diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
> > index 20242402fc11..0ca28cc8e3fa 100644
> > --- a/arch/riscv/include/asm/pgtable.h
> > +++ b/arch/riscv/include/asm/pgtable.h
> > @@ -646,6 +646,7 @@ static inline unsigned long pmd_pfn(pmd_t pmd)
> >  
> >  #define __pud_to_phys(pud)  (__page_val_to_pfn(pud_val(pud)) << PAGE_SHIFT)
> >  
> > +#define pud_pfn pud_pfn
> >  static inline unsigned long pud_pfn(pud_t pud)
> >  {
> >  	return ((__pud_to_phys(pud) & PUD_MASK) >> PAGE_SHIFT);
> > diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> > index 1a71cb19c089..6cbbe473f680 100644
> > --- a/arch/s390/include/asm/pgtable.h
> > +++ b/arch/s390/include/asm/pgtable.h
> > @@ -1414,6 +1414,7 @@ static inline unsigned long pud_deref(pud_t pud)
> >  	return (unsigned long)__va(pud_val(pud) & origin_mask);
> >  }
> >  
> > +#define pud_pfn pud_pfn
> >  static inline unsigned long pud_pfn(pud_t pud)
> >  {
> >  	return __pa(pud_deref(pud)) >> PAGE_SHIFT;
> > diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
> > index 4d1bafaba942..26efc9bb644a 100644
> > --- a/arch/sparc/include/asm/pgtable_64.h
> > +++ b/arch/sparc/include/asm/pgtable_64.h
> > @@ -875,6 +875,7 @@ static inline bool pud_leaf(pud_t pud)
> >  	return pte_val(pte) & _PAGE_PMD_HUGE;
> >  }
> >  
> > +#define pud_pfn pud_pfn
> >  static inline unsigned long pud_pfn(pud_t pud)
> >  {
> >  	pte_t pte = __pte(pud_val(pud));
> > diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> > index cefc7a84f7a4..273f7557218c 100644
> > --- a/arch/x86/include/asm/pgtable.h
> > +++ b/arch/x86/include/asm/pgtable.h
> > @@ -234,6 +234,7 @@ static inline unsigned long pmd_pfn(pmd_t pmd)
> >  	return (pfn & pmd_pfn_mask(pmd)) >> PAGE_SHIFT;
> >  }
> >  
> > +#define pud_pfn pud_pfn
> >  static inline unsigned long pud_pfn(pud_t pud)
> >  {
> >  	phys_addr_t pfn = pud_val(pud);
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > index 600e17d03659..75fe309a4e10 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -1817,6 +1817,16 @@ typedef unsigned int pgtbl_mod_mask;
> >  #define pte_leaf_size(x) PAGE_SIZE
> >  #endif
> >  
> > +/*
> > + * We always define pmd_pfn for all archs as it's used in lots of generic
> > + * code.  Now it happens too for pud_pfn (and can happen for larger
> > + * mappings too in the future; we're not there yet).  Instead of defining
> > + * it for all archs (like pmd_pfn), provide a fallback.
> > + */
> > +#ifndef pud_pfn
> > +#define pud_pfn(x) ({ BUILD_BUG(); 0; })
> > +#endif
> > +
> >  /*
> >   * Some architectures have MMUs that are configurable or selectable at boot
> >   * time. These lead to variable PTRS_PER_x. For statically allocated arrays it
> > -- 
> > 2.44.0
> > 
> 
> This BUILD_BUG() triggers for LoongArch with their defconfig, so it
> seems like they need to provide an implementation of pud_pfn()?
> 
>   In function 'follow_huge_pud',
>       inlined from 'follow_pud_mask' at mm/gup.c:1075:10,
>       inlined from 'follow_p4d_mask' at mm/gup.c:1105:9,
>       inlined from 'follow_page_mask' at mm/gup.c:1151:10:
>   include/linux/compiler_types.h:460:45: error: call to '__compiletime_assert_382' declared with attribute error: BUILD_BUG failed
>     460 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>         |                                             ^
>   include/linux/compiler_types.h:441:25: note: in definition of macro '__compiletime_assert'
>     441 |                         prefix ## suffix();                             \
>         |                         ^~~~~~
>   include/linux/compiler_types.h:460:9: note: in expansion of macro '_compiletime_assert'
>     460 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>         |         ^~~~~~~~~~~~~~~~~~~
>   include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
>      39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
>         |                                     ^~~~~~~~~~~~~~~~~~
>   include/linux/build_bug.h:59:21: note: in expansion of macro 'BUILD_BUG_ON_MSG'
>      59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
>         |                     ^~~~~~~~~~~~~~~~
>   include/linux/pgtable.h:1887:23: note: in expansion of macro 'BUILD_BUG'
>    1887 | #define pud_pfn(x) ({ BUILD_BUG(); 0; })
>         |                       ^~~~~~~~~
>   mm/gup.c:679:29: note: in expansion of macro 'pud_pfn'
>     679 |         unsigned long pfn = pud_pfn(pud);
>         |                             ^~~~~~~

I actually tested this without hitting the issue (even though I didn't
mention it in the cover letter..).  I re-kicked the build test, it turns
out my "make alldefconfig" on loongarch will generate a config with both
HUGETLB=n && THP=n, while arch/loongarch/configs/loongson3_defconfig has
THP=y (which I assume was the one above build used).  I didn't further
check how "make alldefconfig" generated the config; a bit surprising that
it didn't fetch from there.

(and it also surprises me that this BUILD_BUG can trigger.. I used to try
 triggering it elsewhere but failed..)

For loongarch the best thing is not compile in follow_huge_pud(), as it
doesn't support pud dax, neither does it support pud hugetlb.  However
again that may require some more CONFIG_* options to declare the level one
arch supports on HUGETLB_PAGE.  Here maybe the simplest (and it should also
cover all the rest archs on similar issues if ever possible to happen) is
we remove the BUILD_BUG() and explain why.  It should be safe for loongarch
too here in this case to not defined it until properly supported.

Thanks,

===8<===
From 585f34aa3d5b12cd2186367b0882d4293f792062 Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Tue, 2 Apr 2024 18:31:07 -0400
Subject: [PATCH] fixup! mm/arch: provide pud_pfn() fallback

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/pgtable.h | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index fa8f92f6e2d7..0f4b2faa1d71 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1882,9 +1882,13 @@ typedef unsigned int pgtbl_mod_mask;
  * code.  Now it happens too for pud_pfn (and can happen for larger
  * mappings too in the future; we're not there yet).  Instead of defining
  * it for all archs (like pmd_pfn), provide a fallback.
+ *
+ * Note that returning 0 here means any arch that didn't define this can
+ * get severely wrong when it hits a real pud leaf.  It's arch's
+ * responsibility to properly define it when a huge pud is possible.
  */
 #ifndef pud_pfn
-#define pud_pfn(x) ({ BUILD_BUG(); 0; })
+#define pud_pfn(x) 0
 #endif
 
 /*
-- 
2.44.0

-- 
Peter Xu


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 160+ messages in thread

* Re: [PATCH v4 05/13] mm/arch: Provide pud_pfn() fallback
@ 2024-04-02 22:43       ` Peter Xu
  0 siblings, 0 replies; 160+ messages in thread
From: Peter Xu @ 2024-04-02 22:43 UTC (permalink / raw)
  To: Nathan Chancellor
  Cc: James Houghton, David Hildenbrand, Yang Shi, Andrew Jones,
	linux-mm, Matthew Wilcox, linux-riscv, WANG Xuerui,
	Andrea Arcangeli, Christoph Hellwig, Huacai Chen,
	Aneesh Kumar K . V, linux-arm-kernel, Jason Gunthorpe,
	Axel Rasmussen, Rik van Riel, John Hubbard, loongarch,
	Kirill A . Shutemov, Vlastimil Babka, Lorenzo Stoakes,
	Muchun Song, linux-kernel, Andrew Morton, linuxppc-dev,
	Mike Rapoport, Mike Kravetz

On Tue, Apr 02, 2024 at 12:05:49PM -0700, Nathan Chancellor wrote:
> Hi Peter (and LoongArch folks),
> 
> On Wed, Mar 27, 2024 at 11:23:24AM -0400, peterx@redhat.com wrote:
> > From: Peter Xu <peterx@redhat.com>
> > 
> > The comment in the code explains the reasons.  We took a different approach
> > comparing to pmd_pfn() by providing a fallback function.
> > 
> > Another option is to provide some lower level config options (compare to
> > HUGETLB_PAGE or THP) to identify which layer an arch can support for such
> > huge mappings.  However that can be an overkill.
> > 
> > Cc: Mike Rapoport (IBM) <rppt@kernel.org>
> > Cc: Matthew Wilcox <willy@infradead.org>
> > Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  arch/riscv/include/asm/pgtable.h    |  1 +
> >  arch/s390/include/asm/pgtable.h     |  1 +
> >  arch/sparc/include/asm/pgtable_64.h |  1 +
> >  arch/x86/include/asm/pgtable.h      |  1 +
> >  include/linux/pgtable.h             | 10 ++++++++++
> >  5 files changed, 14 insertions(+)
> > 
> > diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
> > index 20242402fc11..0ca28cc8e3fa 100644
> > --- a/arch/riscv/include/asm/pgtable.h
> > +++ b/arch/riscv/include/asm/pgtable.h
> > @@ -646,6 +646,7 @@ static inline unsigned long pmd_pfn(pmd_t pmd)
> >  
> >  #define __pud_to_phys(pud)  (__page_val_to_pfn(pud_val(pud)) << PAGE_SHIFT)
> >  
> > +#define pud_pfn pud_pfn
> >  static inline unsigned long pud_pfn(pud_t pud)
> >  {
> >  	return ((__pud_to_phys(pud) & PUD_MASK) >> PAGE_SHIFT);
> > diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> > index 1a71cb19c089..6cbbe473f680 100644
> > --- a/arch/s390/include/asm/pgtable.h
> > +++ b/arch/s390/include/asm/pgtable.h
> > @@ -1414,6 +1414,7 @@ static inline unsigned long pud_deref(pud_t pud)
> >  	return (unsigned long)__va(pud_val(pud) & origin_mask);
> >  }
> >  
> > +#define pud_pfn pud_pfn
> >  static inline unsigned long pud_pfn(pud_t pud)
> >  {
> >  	return __pa(pud_deref(pud)) >> PAGE_SHIFT;
> > diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
> > index 4d1bafaba942..26efc9bb644a 100644
> > --- a/arch/sparc/include/asm/pgtable_64.h
> > +++ b/arch/sparc/include/asm/pgtable_64.h
> > @@ -875,6 +875,7 @@ static inline bool pud_leaf(pud_t pud)
> >  	return pte_val(pte) & _PAGE_PMD_HUGE;
> >  }
> >  
> > +#define pud_pfn pud_pfn
> >  static inline unsigned long pud_pfn(pud_t pud)
> >  {
> >  	pte_t pte = __pte(pud_val(pud));
> > diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> > index cefc7a84f7a4..273f7557218c 100644
> > --- a/arch/x86/include/asm/pgtable.h
> > +++ b/arch/x86/include/asm/pgtable.h
> > @@ -234,6 +234,7 @@ static inline unsigned long pmd_pfn(pmd_t pmd)
> >  	return (pfn & pmd_pfn_mask(pmd)) >> PAGE_SHIFT;
> >  }
> >  
> > +#define pud_pfn pud_pfn
> >  static inline unsigned long pud_pfn(pud_t pud)
> >  {
> >  	phys_addr_t pfn = pud_val(pud);
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > index 600e17d03659..75fe309a4e10 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -1817,6 +1817,16 @@ typedef unsigned int pgtbl_mod_mask;
> >  #define pte_leaf_size(x) PAGE_SIZE
> >  #endif
> >  
> > +/*
> > + * We always define pmd_pfn for all archs as it's used in lots of generic
> > + * code.  Now it happens too for pud_pfn (and can happen for larger
> > + * mappings too in the future; we're not there yet).  Instead of defining
> > + * it for all archs (like pmd_pfn), provide a fallback.
> > + */
> > +#ifndef pud_pfn
> > +#define pud_pfn(x) ({ BUILD_BUG(); 0; })
> > +#endif
> > +
> >  /*
> >   * Some architectures have MMUs that are configurable or selectable at boot
> >   * time. These lead to variable PTRS_PER_x. For statically allocated arrays it
> > -- 
> > 2.44.0
> > 
> 
> This BUILD_BUG() triggers for LoongArch with their defconfig, so it
> seems like they need to provide an implementation of pud_pfn()?
> 
>   In function 'follow_huge_pud',
>       inlined from 'follow_pud_mask' at mm/gup.c:1075:10,
>       inlined from 'follow_p4d_mask' at mm/gup.c:1105:9,
>       inlined from 'follow_page_mask' at mm/gup.c:1151:10:
>   include/linux/compiler_types.h:460:45: error: call to '__compiletime_assert_382' declared with attribute error: BUILD_BUG failed
>     460 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>         |                                             ^
>   include/linux/compiler_types.h:441:25: note: in definition of macro '__compiletime_assert'
>     441 |                         prefix ## suffix();                             \
>         |                         ^~~~~~
>   include/linux/compiler_types.h:460:9: note: in expansion of macro '_compiletime_assert'
>     460 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>         |         ^~~~~~~~~~~~~~~~~~~
>   include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
>      39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
>         |                                     ^~~~~~~~~~~~~~~~~~
>   include/linux/build_bug.h:59:21: note: in expansion of macro 'BUILD_BUG_ON_MSG'
>      59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
>         |                     ^~~~~~~~~~~~~~~~
>   include/linux/pgtable.h:1887:23: note: in expansion of macro 'BUILD_BUG'
>    1887 | #define pud_pfn(x) ({ BUILD_BUG(); 0; })
>         |                       ^~~~~~~~~
>   mm/gup.c:679:29: note: in expansion of macro 'pud_pfn'
>     679 |         unsigned long pfn = pud_pfn(pud);
>         |                             ^~~~~~~

I actually tested this without hitting the issue (even though I didn't
mention it in the cover letter..).  I re-kicked the build test, it turns
out my "make alldefconfig" on loongarch will generate a config with both
HUGETLB=n && THP=n, while arch/loongarch/configs/loongson3_defconfig has
THP=y (which I assume was the one above build used).  I didn't further
check how "make alldefconfig" generated the config; a bit surprising that
it didn't fetch from there.

(and it also surprises me that this BUILD_BUG can trigger.. I used to try
 triggering it elsewhere but failed..)

For loongarch the best thing is not compile in follow_huge_pud(), as it
doesn't support pud dax, neither does it support pud hugetlb.  However
again that may require some more CONFIG_* options to declare the level one
arch supports on HUGETLB_PAGE.  Here maybe the simplest (and it should also
cover all the rest archs on similar issues if ever possible to happen) is
we remove the BUILD_BUG() and explain why.  It should be safe for loongarch
too here in this case to not defined it until properly supported.

Thanks,

===8<===
From 585f34aa3d5b12cd2186367b0882d4293f792062 Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Tue, 2 Apr 2024 18:31:07 -0400
Subject: [PATCH] fixup! mm/arch: provide pud_pfn() fallback

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/pgtable.h | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index fa8f92f6e2d7..0f4b2faa1d71 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1882,9 +1882,13 @@ typedef unsigned int pgtbl_mod_mask;
  * code.  Now it happens too for pud_pfn (and can happen for larger
  * mappings too in the future; we're not there yet).  Instead of defining
  * it for all archs (like pmd_pfn), provide a fallback.
+ *
+ * Note that returning 0 here means any arch that didn't define this can
+ * go severely wrong when it hits a real pud leaf.  It's the arch's
+ * responsibility to properly define it when a huge pud is possible.
  */
 #ifndef pud_pfn
-#define pud_pfn(x) ({ BUILD_BUG(); 0; })
+#define pud_pfn(x) 0
 #endif
 
 /*
-- 
2.44.0

-- 
Peter Xu


^ permalink raw reply related	[flat|nested] 160+ messages in thread

* Re: [PATCH v4 05/13] mm/arch: Provide pud_pfn() fallback
  2024-04-02 22:43       ` Peter Xu
@ 2024-04-02 22:53         ` Jason Gunthorpe
  -1 siblings, 0 replies; 160+ messages in thread
From: Jason Gunthorpe @ 2024-04-02 22:53 UTC (permalink / raw)
  To: Peter Xu
  Cc: Nathan Chancellor, linux-mm, linux-kernel, Yang Shi,
	Kirill A . Shutemov, Mike Kravetz, John Hubbard,
	Michael Ellerman, Andrew Jones, Muchun Song, linux-riscv,
	linuxppc-dev, Christophe Leroy, Andrew Morton, Christoph Hellwig,
	Lorenzo Stoakes, Matthew Wilcox, Rik van Riel, linux-arm-kernel,
	Andrea Arcangeli, David Hildenbrand, Aneesh Kumar K . V,
	Vlastimil Babka, James Houghton, Mike Rapoport, Axel Rasmussen,
	Huacai Chen, WANG Xuerui, loongarch

On Tue, Apr 02, 2024 at 06:43:56PM -0400, Peter Xu wrote:

> I actually tested this without hitting the issue (even though I didn't
> mention it in the cover letter..).  I re-kicked the build test, it turns
> out my "make alldefconfig" on loongarch will generate a config with both
> HUGETLB=n && THP=n, while arch/loongarch/configs/loongson3_defconfig has
> THP=y (which I assume was the one above build used).  I didn't further
> check how "make alldefconfig" generated the config; a bit surprising that
> it didn't fetch from there.

I suspect it is weird compiler variations.. Maybe something is not
being inlined.

> (and it also surprises me that this BUILD_BUG can trigger.. I used to try
>  triggering it elsewhere but failed..)

As the pud_leaf() == FALSE should result in the BUILD_BUG never being
called and the optimizer removing it.
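
Roughly, the structure is (a minimal sketch, not the exact mm/gup.c
code; the helper name here is made up for illustration):

/* Sketch only: mirrors the guard structure in follow_pud_mask(). */
static struct page *sketch_follow_pud(pud_t pud)
{
	if (!pud_leaf(pud))	/* constant-false when huge PUDs are impossible */
		return NULL;
	/* Unreachable then; the optimizer should remove this call,
	 * and with it the BUILD_BUG() inside the pud_pfn() fallback.
	 */
	return pfn_to_page(pud_pfn(pud));
}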

Perhaps the issue is that the pud_leaf() is too far from the pud_pfn?

Jason

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v4 05/13] mm/arch: Provide pud_pfn() fallback
  2024-04-02 22:53         ` Jason Gunthorpe
@ 2024-04-02 23:35           ` Peter Xu
  -1 siblings, 0 replies; 160+ messages in thread
From: Peter Xu @ 2024-04-02 23:35 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Nathan Chancellor, linux-mm, linux-kernel, Yang Shi,
	Kirill A . Shutemov, Mike Kravetz, John Hubbard,
	Michael Ellerman, Andrew Jones, Muchun Song, linux-riscv,
	linuxppc-dev, Christophe Leroy, Andrew Morton, Christoph Hellwig,
	Lorenzo Stoakes, Matthew Wilcox, Rik van Riel, linux-arm-kernel,
	Andrea Arcangeli, David Hildenbrand, Aneesh Kumar K . V,
	Vlastimil Babka, James Houghton, Mike Rapoport, Axel Rasmussen,
	Huacai Chen, WANG Xuerui, loongarch

On Tue, Apr 02, 2024 at 07:53:20PM -0300, Jason Gunthorpe wrote:
> On Tue, Apr 02, 2024 at 06:43:56PM -0400, Peter Xu wrote:
> 
> > I actually tested this without hitting the issue (even though I didn't
> > mention it in the cover letter..).  I re-kicked the build test, it turns
> > out my "make alldefconfig" on loongarch will generate a config with both
> > HUGETLB=n && THP=n, while arch/loongarch/configs/loongson3_defconfig has
> > THP=y (which I assume was the one above build used).  I didn't further
> > check how "make alldefconfig" generated the config; a bit surprising that
> > it didn't fetch from there.
> 
> I suspect it is weird compiler variations.. Maybe something is not
> being inlined.
> 
> > (and it also surprises me that this BUILD_BUG can trigger.. I used to try
> >  triggering it elsewhere but failed..)
> 
> As the pud_leaf() == FALSE should result in the BUILD_BUG never being
> called and the optimizer removing it.

Good point, for some reason loongarch defined pud_leaf() without defining
pud_pfn(), which does look strange.

#define pud_leaf(pud)		((pud_val(pud) & _PAGE_HUGE) != 0)

But I noticed at least MIPS also does it..  Logically I think an arch
should define either both or neither.
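
For comparison, an arch that really supports PUD leaves would pair the
two helpers, along these lines (a sketch only; the pfn mask/shift names
are placeholders, not loongarch's real macros):

/* Hypothetical paired definitions; _PFN_MASK/_PFN_SHIFT are placeholders. */
#define pud_leaf(pud)	((pud_val(pud) & _PAGE_HUGE) != 0)
#define pud_pfn(pud)	((pud_val(pud) & _PFN_MASK) >> _PFN_SHIFT)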

> 
> Perhaps the issue is that the pud_leaf() is too far from the pud_pfn?

My understanding is that follow_pud_mask() should get completely optimized
and follow_huge_pud() dropped from the compiler output if pud_leaf()==false.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v4 05/13] mm/arch: Provide pud_pfn() fallback
  2024-04-02 23:35           ` Peter Xu
@ 2024-04-03 12:08             ` Jason Gunthorpe
  -1 siblings, 0 replies; 160+ messages in thread
From: Jason Gunthorpe @ 2024-04-03 12:08 UTC (permalink / raw)
  To: Peter Xu
  Cc: Nathan Chancellor, linux-mm, linux-kernel, Yang Shi,
	Kirill A . Shutemov, Mike Kravetz, John Hubbard,
	Michael Ellerman, Andrew Jones, Muchun Song, linux-riscv,
	linuxppc-dev, Christophe Leroy, Andrew Morton, Christoph Hellwig,
	Lorenzo Stoakes, Matthew Wilcox, Rik van Riel, linux-arm-kernel,
	Andrea Arcangeli, David Hildenbrand, Aneesh Kumar K . V,
	Vlastimil Babka, James Houghton, Mike Rapoport, Axel Rasmussen,
	Huacai Chen, WANG Xuerui, loongarch

On Tue, Apr 02, 2024 at 07:35:45PM -0400, Peter Xu wrote:
> On Tue, Apr 02, 2024 at 07:53:20PM -0300, Jason Gunthorpe wrote:
> > On Tue, Apr 02, 2024 at 06:43:56PM -0400, Peter Xu wrote:
> > 
> > > I actually tested this without hitting the issue (even though I didn't
> > > mention it in the cover letter..).  I re-kicked the build test, it turns
> > > out my "make alldefconfig" on loongarch will generate a config with both
> > > HUGETLB=n && THP=n, while arch/loongarch/configs/loongson3_defconfig has
> > > THP=y (which I assume was the one above build used).  I didn't further
> > > check how "make alldefconfig" generated the config; a bit surprising that
> > > it didn't fetch from there.
> > 
> > I suspect it is weird compiler variations.. Maybe something is not
> > being inlined.
> > 
> > > (and it also surprises me that this BUILD_BUG can trigger.. I used to try
> > >  triggering it elsewhere but failed..)
> > 
> > As the pud_leaf() == FALSE should result in the BUILD_BUG never being
> > called and the optimizer removing it.
> 
> Good point, for some reason loongarch defined pud_leaf() without defining
> pud_pfn(), which does look strange.
> 
> #define pud_leaf(pud)		((pud_val(pud) & _PAGE_HUGE) != 0)
> 
> But I noticed at least MIPS also does it..  Logically I think an arch
> should define either both or neither.

Wow, this is definitely an arch issue. You can't define pud_leaf() and
not have a pud_pfn(). It makes no sense at all..

I'd say the BUILD_BUG has done its job and found an issue; fix it by
not defining pud_leaf? I don't see any calls to pud_leaf() in loongarch
at least.

Jason

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v4 05/13] mm/arch: Provide pud_pfn() fallback
  2024-04-03 12:08             ` Jason Gunthorpe
@ 2024-04-03 12:26               ` Christophe Leroy
  -1 siblings, 0 replies; 160+ messages in thread
From: Christophe Leroy @ 2024-04-03 12:26 UTC (permalink / raw)
  To: Jason Gunthorpe, Peter Xu
  Cc: Nathan Chancellor, linux-mm, linux-kernel, Yang Shi,
	Kirill A . Shutemov, Mike Kravetz, John Hubbard,
	Michael Ellerman, Andrew Jones, Muchun Song, linux-riscv,
	linuxppc-dev, Andrew Morton, Christoph Hellwig, Lorenzo Stoakes,
	Matthew Wilcox, Rik van Riel, linux-arm-kernel, Andrea Arcangeli,
	David Hildenbrand, Aneesh Kumar K . V, Vlastimil Babka,
	James Houghton, Mike Rapoport, Axel Rasmussen, Huacai Chen,
	WANG Xuerui, loongarch



Le 03/04/2024 à 14:08, Jason Gunthorpe a écrit :
> On Tue, Apr 02, 2024 at 07:35:45PM -0400, Peter Xu wrote:
>> On Tue, Apr 02, 2024 at 07:53:20PM -0300, Jason Gunthorpe wrote:
>>> On Tue, Apr 02, 2024 at 06:43:56PM -0400, Peter Xu wrote:
>>>
>>>> I actually tested this without hitting the issue (even though I didn't
>>>> mention it in the cover letter..).  I re-kicked the build test, it turns
>>>> out my "make alldefconfig" on loongarch will generate a config with both
>>>> HUGETLB=n && THP=n, while arch/loongarch/configs/loongson3_defconfig has
>>>> THP=y (which I assume was the one above build used).  I didn't further
>>>> check how "make alldefconfig" generated the config; a bit surprising that
>>>> it didn't fetch from there.
>>>
>>> I suspect it is weird compiler variations.. Maybe something is not
>>> being inlined.
>>>
>>>> (and it also surprises me that this BUILD_BUG can trigger.. I used to try
>>>>   triggering it elsewhere but failed..)
>>>
>>> As the pud_leaf() == FALSE should result in the BUILD_BUG never being
>>> called and the optimizer removing it.
>>
>> Good point, for some reason loongarch defined pud_leaf() without defining
>> pud_pfn(), which does look strange.
>>
>> #define pud_leaf(pud)		((pud_val(pud) & _PAGE_HUGE) != 0)
>>
>> But I noticed at least MIPS also does it..  Logically I think an arch
>> should define either both or neither.
> 
> Wow, this is definitely an arch issue. You can't define pud_leaf() and
> not have a pud_pfn(). It makes no sense at all..
>
> I'd say the BUILD_BUG has done its job and found an issue; fix it by
> not defining pud_leaf? I don't see any calls to pud_leaf() in loongarch
> at least.

As far as I can see it was added by commit 303be4b33562 ("LoongArch: mm: 
Add p?d_leaf() definitions").

Not sure it was added for a good reason, and I'm not sure what was added 
is correct because arch/loongarch/include/asm/pgtable-bits.h has:

#define	_PAGE_HUGE_SHIFT	6  /* HUGE is a PMD bit */

So I'm not sure it is correct to use that bit for PUD, is it ?

Probably pud_leaf() should always return false.
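
In code that would be roughly (a sketch of the suggestion, not a tested
patch):

/* loongarch: _PAGE_HUGE is a PMD bit, so never report a PUD leaf. */
#define pud_leaf(pud)	false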

Christophe

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v4 05/13] mm/arch: Provide pud_pfn() fallback
  2024-04-03 12:26               ` Christophe Leroy
@ 2024-04-03 13:07                 ` Jason Gunthorpe
  -1 siblings, 0 replies; 160+ messages in thread
From: Jason Gunthorpe @ 2024-04-03 13:07 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Peter Xu, Nathan Chancellor, linux-mm, linux-kernel, Yang Shi,
	Kirill A . Shutemov, Mike Kravetz, John Hubbard,
	Michael Ellerman, Andrew Jones, Muchun Song, linux-riscv,
	linuxppc-dev, Andrew Morton, Christoph Hellwig, Lorenzo Stoakes,
	Matthew Wilcox, Rik van Riel, linux-arm-kernel, Andrea Arcangeli,
	David Hildenbrand, Aneesh Kumar K . V, Vlastimil Babka,
	James Houghton, Mike Rapoport, Axel Rasmussen, Huacai Chen,
	WANG Xuerui, loongarch

On Wed, Apr 03, 2024 at 12:26:43PM +0000, Christophe Leroy wrote:
> 
> 
> Le 03/04/2024 à 14:08, Jason Gunthorpe a écrit :
> > On Tue, Apr 02, 2024 at 07:35:45PM -0400, Peter Xu wrote:
> >> On Tue, Apr 02, 2024 at 07:53:20PM -0300, Jason Gunthorpe wrote:
> >>> On Tue, Apr 02, 2024 at 06:43:56PM -0400, Peter Xu wrote:
> >>>
> >>>> I actually tested this without hitting the issue (even though I didn't
> >>>> mention it in the cover letter..).  I re-kicked the build test, it turns
> >>>> out my "make alldefconfig" on loongarch will generate a config with both
> >>>> HUGETLB=n && THP=n, while arch/loongarch/configs/loongson3_defconfig has
> >>>> THP=y (which I assume was the one above build used).  I didn't further
> >>>> check how "make alldefconfig" generated the config; a bit surprising that
> >>>> it didn't fetch from there.
> >>>
> >>> I suspect it is weird compiler variations.. Maybe something is not
> >>> being inlined.
> >>>
> >>>> (and it also surprises me that this BUILD_BUG can trigger.. I used to try
> >>>>   triggering it elsewhere but failed..)
> >>>
> >>> As the pud_leaf() == FALSE should result in the BUILD_BUG never being
> >>> called and the optimizer removing it.
> >>
> >> Good point, for some reason loongarch defined pud_leaf() without defining
> >> pud_pfn(), which does look strange.
> >>
> >> #define pud_leaf(pud)		((pud_val(pud) & _PAGE_HUGE) != 0)
> >>
> >> But I noticed at least MIPS also does it..  Logically I think an arch
> >> should define either both or neither.
> > 
> > Wow, this is definitely an arch issue. You can't define pud_leaf() and
> > not have a pud_pfn(). It makes no sense at all..
> >
> > I'd say the BUILD_BUG has done its job and found an issue; fix it by
> > not defining pud_leaf? I don't see any calls to pud_leaf() in loongarch
> > at least.
> 
> As far as I can see it was added by commit 303be4b33562 ("LoongArch: mm: 
> Add p?d_leaf() definitions").

That commit makes it sound like the arch supports huge PUDs through
the hugepte mechanism - it says an LTP test failed, so something
populated a huge PUD at least??

So maybe this?

#define pud_pfn pte_pfn

> Not sure it was added for a good reason, and I'm not sure what was added 
> is correct because arch/loongarch/include/asm/pgtable-bits.h has:
> 
> #define	_PAGE_HUGE_SHIFT	6  /* HUGE is a PMD bit */
> 
> So I'm not sure it is correct to use that bit for PUD, is it ?

Could be, lots of arches repeat the bit layouts in each radix
level.. It is essentially why the hugepte trick of pretending every
level is a pte works.
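
A type-correct spelling of that suggestion would go through the raw
entry value, assuming the PUD and PTE bit layouts really do match
(untested sketch):

/* Reuse the PTE pfn extraction for a PUD entry with the same layout. */
#define pud_pfn(pud)	pte_pfn(__pte(pud_val(pud)))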
 
Jason

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v4 05/13] mm/arch: Provide pud_pfn() fallback
  2024-04-03 13:07                 ` Jason Gunthorpe
@ 2024-04-03 13:17                   ` Christophe Leroy
  -1 siblings, 0 replies; 160+ messages in thread
From: Christophe Leroy @ 2024-04-03 13:17 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Peter Xu, Nathan Chancellor, linux-mm, linux-kernel, Yang Shi,
	Kirill A . Shutemov, Mike Kravetz, John Hubbard,
	Michael Ellerman, Andrew Jones, Muchun Song, linux-riscv,
	linuxppc-dev, Andrew Morton, Christoph Hellwig, Lorenzo Stoakes,
	Matthew Wilcox, Rik van Riel, linux-arm-kernel, Andrea Arcangeli,
	David Hildenbrand, Aneesh Kumar K . V, Vlastimil Babka,
	James Houghton, Mike Rapoport, Axel Rasmussen, Huacai Chen,
	WANG Xuerui, loongarch



Le 03/04/2024 à 15:07, Jason Gunthorpe a écrit :
> On Wed, Apr 03, 2024 at 12:26:43PM +0000, Christophe Leroy wrote:
>>
>>
>> Le 03/04/2024 à 14:08, Jason Gunthorpe a écrit :
>>> On Tue, Apr 02, 2024 at 07:35:45PM -0400, Peter Xu wrote:
>>>> On Tue, Apr 02, 2024 at 07:53:20PM -0300, Jason Gunthorpe wrote:
>>>>> On Tue, Apr 02, 2024 at 06:43:56PM -0400, Peter Xu wrote:
>>>>>
>>>>>> I actually tested this without hitting the issue (even though I didn't
>>>>>> mention it in the cover letter..).  I re-kicked the build test, it turns
>>>>>> out my "make alldefconfig" on loongarch will generate a config with both
>>>>>> HUGETLB=n && THP=n, while arch/loongarch/configs/loongson3_defconfig has
>>>>>> THP=y (which I assume was the one above build used).  I didn't further
>>>>>> check how "make alldefconfig" generated the config; a bit surprising that
>>>>>> it didn't fetch from there.
>>>>>
>>>>> I suspect it is weird compiler variations.. Maybe something is not
>>>>> being inlined.
>>>>>
>>>>>> (and it also surprises me that this BUILD_BUG can trigger.. I used to try
>>>>>>    triggering it elsewhere but failed..)
>>>>>
>>>>> As the pud_leaf() == FALSE should result in the BUILD_BUG never being
>>>>> called and the optimizer removing it.
>>>>
>>>> Good point, for some reason loongarch defined pud_leaf() without defining
>>>> pud_pfn(), which does look strange.
>>>>
>>>> #define pud_leaf(pud)		((pud_val(pud) & _PAGE_HUGE) != 0)
>>>>
>>>> But I noticed at least MIPS also does it..  Logically I think an arch
>>>> should define either both or neither.
>>>
>>> Wow, this is definitely an arch issue. You can't define pud_leaf() and
>>> not have a pud_pfn(). It makes no sense at all..
>>>
>>> I'd say the BUILD_BUG has done its job and found an issue; fix it by
>>> not defining pud_leaf? I don't see any calls to pud_leaf() in loongarch
>>> at least.
>>
>> As far as I can see it was added by commit 303be4b33562 ("LoongArch: mm:
>> Add p?d_leaf() definitions").
> 
> That commit makes it sound like the arch supports huge PUDs through
> the hugepte mechanism - it says an LTP test failed, so something
> populated a huge PUD at least??

Not sure; to me it looks more like a copy/paste of commit 501b81046701
("mips: mm: add p?d_leaf() definitions").

The commit message says the test failed because pmd_leaf() was
missing; it says nothing about PUD.

Looking at where _PAGE_HUGE is used in loongarch, I have the
impression that it is used exclusively at the PMD level.

> 
> So maybe this?
> 
> #define pud_pfn pte_pfn
> 
>> Not sure it was added for a good reason, and I'm not sure what was added
>> is correct because arch/loongarch/include/asm/pgtable-bits.h has:
>>
>> #define	_PAGE_HUGE_SHIFT	6  /* HUGE is a PMD bit */
>>
>> So I'm not sure it is correct to use that bit for PUD, is it ?
> 
> Could be, lots of arches repeat the bit layouts in each radix
> level.. It is essentially why the hugepte trick of pretending every
> level is a pte works.
>   
> Jason

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v4 05/13] mm/arch: Provide pud_pfn() fallback
  2024-04-03 13:17                   ` Christophe Leroy
@ 2024-04-03 13:33                     ` Jason Gunthorpe
  -1 siblings, 0 replies; 160+ messages in thread
From: Jason Gunthorpe @ 2024-04-03 13:33 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Peter Xu, Nathan Chancellor, linux-mm, linux-kernel, Yang Shi,
	Kirill A . Shutemov, Mike Kravetz, John Hubbard,
	Michael Ellerman, Andrew Jones, Muchun Song, linux-riscv,
	linuxppc-dev, Andrew Morton, Christoph Hellwig, Lorenzo Stoakes,
	Matthew Wilcox, Rik van Riel, linux-arm-kernel, Andrea Arcangeli,
	David Hildenbrand, Aneesh Kumar K . V, Vlastimil Babka,
	James Houghton, Mike Rapoport, Axel Rasmussen, Huacai Chen,
	WANG Xuerui, loongarch

On Wed, Apr 03, 2024 at 01:17:06PM +0000, Christophe Leroy wrote:

> > That commit makes it sound like the arch supports huge PUDs through
> > the hugepte mechanism - it says an LTP test failed so something
> > populated a huge PUD at least??
> 
> Not sure; I see it more as just a copy/paste of commit 501b81046701
> ("mips: mm: add p?d_leaf() definitions").
> 
> The commit message says that the test failed because pmd_leaf() is
> missing; it says nothing about PUD.

Ah, fair enough, it is probably a C&P then

Jason

^ permalink raw reply	[flat|nested] 160+ messages in thread
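
If the bit layout really did repeat at the PUD level, the earlier
"#define pud_pfn pte_pfn" suggestion would amount to something like the
sketch below; hypothetical and untested, with __pte()/pud_val() used
only to spell out the type conversion:

	/* arch/loongarch (sketch, not a proposed patch) */
	static inline unsigned long pud_pfn(pud_t pud)
	{
		/* Reuse the pte decoding: the same "pretend every
		 * level is a pte" trick that hugepte relies on. */
		return pte_pfn(__pte(pud_val(pud)));
	}
	#define pud_pfn pud_pfn

Whether the pfn bits are actually valid at the PUD level on loongarch
is exactly the open question above, given that _PAGE_HUGE is documented
as a PMD bit.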

* Re: [PATCH v4 05/13] mm/arch: Provide pud_pfn() fallback
  2024-04-03 12:08             ` Jason Gunthorpe
@ 2024-04-03 18:25               ` Peter Xu
  -1 siblings, 0 replies; 160+ messages in thread
From: Peter Xu @ 2024-04-03 18:25 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Nathan Chancellor, linux-mm, linux-kernel, Yang Shi,
	Kirill A . Shutemov, Mike Kravetz, John Hubbard,
	Michael Ellerman, Andrew Jones, Muchun Song, linux-riscv,
	linuxppc-dev, Christophe Leroy, Andrew Morton, Christoph Hellwig,
	Lorenzo Stoakes, Matthew Wilcox, Rik van Riel, linux-arm-kernel,
	Andrea Arcangeli, David Hildenbrand, Aneesh Kumar K . V,
	Vlastimil Babka, James Houghton, Mike Rapoport, Axel Rasmussen,
	Huacai Chen, WANG Xuerui, loongarch

On Wed, Apr 03, 2024 at 09:08:41AM -0300, Jason Gunthorpe wrote:
> On Tue, Apr 02, 2024 at 07:35:45PM -0400, Peter Xu wrote:
> > On Tue, Apr 02, 2024 at 07:53:20PM -0300, Jason Gunthorpe wrote:
> > > On Tue, Apr 02, 2024 at 06:43:56PM -0400, Peter Xu wrote:
> > > 
> > > > I actually tested this without hitting the issue (even though I didn't
> > > > mention it in the cover letter..).  I re-kicked the build test, and it
> > > > turns out my "make alldefconfig" on loongarch generates a config with
> > > > both HUGETLB=n && THP=n, while arch/loongarch/configs/loongson3_defconfig
> > > > has THP=y (which I assume was what the build above used).  I didn't
> > > > further check how "make alldefconfig" generated the config; it's a bit
> > > > surprising that it didn't fetch from there.
> > > 
> > > I suspect it is weird compiler variations.. Maybe something is not
> > > being inlined.
> > > 
> > > > (and it also surprises me that this BUILD_BUG can trigger.. I had tried
> > > >  triggering it elsewhere before but failed..)
> > > 
> > > Indeed, pud_leaf() == FALSE should result in the BUILD_BUG() never
> > > being called and the optimizer removing it.
> > 
> > Good point, for some reason loongarch defined pud_leaf() without defining
> > pud_pfn(), which does look strange.
> > 
> > #define pud_leaf(pud)		((pud_val(pud) & _PAGE_HUGE) != 0)
> > 
> > But I noticed at least MIPS also does it..  Logically I think an arch
> > should define either both or neither.
> 
> Wow, this is definitely an arch issue. You can't define pud_leaf() and
> not have a pud_pfn(). It makes no sense at all..
> 
> I'd say the BUILD_BUG has done its job and found an issue, fix it by
> not defining pud_leaf? I don't see any calls to pud_leaf in loongarch
> at least

Yes, that sounds better to me too; however, it means we may also risk other
archs failing another defconfig build.. and I worry about bringing trouble
to multiple such cases.  Fundamentally it's indeed my patch that broke
those builds, so I still sent the change and left it to the arch developers
to decide what's best for their archs.

If wanted, I think we can add that BUILD_BUG() back once we're sure no arch
will break with it.  Such changes from the arch side can still be proposed
alongside the removal of the BUILD_BUG() (and I'd guess other affected archs
would soon notice such build issues if they existed.. so it still more or
less has a similar effect as a reminder..).

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH v4 05/13] mm/arch: Provide pud_pfn() fallback
  2024-04-03 18:25               ` Peter Xu
@ 2024-04-04 11:24                 ` Jason Gunthorpe
  -1 siblings, 0 replies; 160+ messages in thread
From: Jason Gunthorpe @ 2024-04-04 11:24 UTC (permalink / raw)
  To: Peter Xu
  Cc: Nathan Chancellor, linux-mm, linux-kernel, Yang Shi,
	Kirill A . Shutemov, Mike Kravetz, John Hubbard,
	Michael Ellerman, Andrew Jones, Muchun Song, linux-riscv,
	linuxppc-dev, Christophe Leroy, Andrew Morton, Christoph Hellwig,
	Lorenzo Stoakes, Matthew Wilcox, Rik van Riel, linux-arm-kernel,
	Andrea Arcangeli, David Hildenbrand, Aneesh Kumar K . V,
	Vlastimil Babka, James Houghton, Mike Rapoport, Axel Rasmussen,
	Huacai Chen, WANG Xuerui, loongarch

On Wed, Apr 03, 2024 at 02:25:20PM -0400, Peter Xu wrote:

> > I'd say the BUILD_BUG has done its job and found an issue, fix it by
> > not defining pud_leaf? I don't see any calls to pud_leaf in loongarch
> > at least
> 
> Yes, that sounds better to me too; however, it means we may also risk other
> archs failing another defconfig build.. and I worry about bringing trouble
> to multiple such cases.  Fundamentally it's indeed my patch that broke
> those builds, so I still sent the change and left it to the arch developers
> to decide what's best for their archs.

But your change causes silent data corruption if the code path is
run.. I think we are overall better off wading through the compile-time
bugs from linux-next. Honestly, if there were a lot, I'd think there
would be more complaints already.

Maybe it should just be a separate step from this series.

Jason

^ permalink raw reply	[flat|nested] 160+ messages in thread
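
The tradeoff argued here boils down to the two fallback bodies below;
both are sketches for illustration, with CHECK_AT_BUILD_TIME a
hypothetical switch rather than a real kernel config:

	#ifdef CHECK_AT_BUILD_TIME
	/* Fail at build time, as the original patch did: any arch with
	 * a reachable pud_pfn() call must provide its own. */
	static inline unsigned long pud_pfn(pud_t pud)
	{
		BUILD_BUG();
		return 0;
	}
	#else
	/* Build everywhere.  But if a huge PUD ever reaches this on an
	 * arch without a real pud_pfn(), the bogus pfn it returns is
	 * the "silent data corruption" being pointed at. */
	static inline unsigned long pud_pfn(pud_t pud)
	{
		return 0;
	}
	#endif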

* Re: [PATCH v4 05/13] mm/arch: Provide pud_pfn() fallback
  2024-04-04 11:24                 ` Jason Gunthorpe
@ 2024-04-04 12:00                   ` Peter Xu
  -1 siblings, 0 replies; 160+ messages in thread
From: Peter Xu @ 2024-04-04 12:00 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Nathan Chancellor, linux-mm, linux-kernel, Yang Shi,
	Kirill A . Shutemov, Mike Kravetz, John Hubbard,
	Michael Ellerman, Andrew Jones, Muchun Song, linux-riscv,
	linuxppc-dev, Christophe Leroy, Andrew Morton, Christoph Hellwig,
	Lorenzo Stoakes, Matthew Wilcox, Rik van Riel, linux-arm-kernel,
	Andrea Arcangeli, David Hildenbrand, Aneesh Kumar K . V,
	Vlastimil Babka, James Houghton, Mike Rapoport, Axel Rasmussen,
	Huacai Chen, WANG Xuerui, loongarch

On Thu, Apr 04, 2024 at 08:24:04AM -0300, Jason Gunthorpe wrote:
> On Wed, Apr 03, 2024 at 02:25:20PM -0400, Peter Xu wrote:
> 
> > > I'd say the BUILD_BUG has done its job and found an issue, fix it by
> > > not defining pud_leaf? I don't see any calls to pud_leaf in loongarch
> > > at least
> > 
> > Yes, that sounds better to me too; however, it means we may also risk other
> > archs failing another defconfig build.. and I worry about bringing trouble
> > to multiple such cases.  Fundamentally it's indeed my patch that broke
> > those builds, so I still sent the change and left it to the arch developers
> > to decide what's best for their archs.
> 
> But your change causes silent data corruption if the code path is
> run.. I think we are overall better off wading through the compile-time
> bugs from linux-next. Honestly, if there were a lot, I'd think there
> would be more complaints already.
> 
> Maybe it should just be a separate step from this series.

Right, that would IMHO be better done separately, as I think we'd better
consolidate the code.

One thing I don't worry about is the warning causing anything real to fail;
I don't yet expect any arch to leave pud_pfn() undefined when it actually
needs it.. so the build errors may not bring real benefits as of now.  But
I agree with you that we'd better have it.  I'll take a todo and try to add
it back after all these fallouts.  With my cross-build chains it shouldn't
be hard now; it'll just take some time to revisit each arch.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 160+ messages in thread
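
The "add it back later" plan leans on the usual kernel override
convention: an arch that implements pud_pfn() also defines the
preprocessor symbol to itself, so the generic fallback (and its
BUILD_BUG(), once restored) compiles out.  A sketch of what each
revisited arch would need, with PFN_FROM_PUD() standing in for
whatever arch-specific decoding applies:

	/* arch/<foo>/include/asm/pgtable.h (sketch) */
	static inline unsigned long pud_pfn(pud_t pud)
	{
		return PFN_FROM_PUD(pud);	/* placeholder decoding */
	}
	#define pud_pfn pud_pfn	/* hide the generic fallback */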

end of thread, other threads:[~2024-04-04 12:01 UTC | newest]

Thread overview: 160+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-27 15:23 [PATCH v4 00/13] mm/gup: Unify hugetlb, part 2 peterx
2024-03-27 15:23 ` [PATCH v4 01/13] mm/Kconfig: CONFIG_PGTABLE_HAS_HUGE_LEAVES peterx
2024-03-27 15:23 ` [PATCH v4 02/13] mm/hugetlb: Declare hugetlbfs_pagecache_present() non-static peterx
2024-03-27 15:23 ` [PATCH v4 03/13] mm: Make HPAGE_PXD_* macros even if !THP peterx
2024-03-27 15:23 ` [PATCH v4 04/13] mm: Introduce vma_pgtable_walk_{begin|end}() peterx
2024-03-27 15:23 ` [PATCH v4 05/13] mm/arch: Provide pud_pfn() fallback peterx
2024-04-02 19:05   ` Nathan Chancellor
2024-04-02 22:43     ` Peter Xu
2024-04-02 22:53       ` Jason Gunthorpe
2024-04-02 23:35         ` Peter Xu
2024-04-03 12:08           ` Jason Gunthorpe
2024-04-03 12:26             ` Christophe Leroy
2024-04-03 13:07               ` Jason Gunthorpe
2024-04-03 13:17                 ` Christophe Leroy
2024-04-03 13:33                   ` Jason Gunthorpe
2024-04-03 18:25             ` Peter Xu
2024-04-04 11:24               ` Jason Gunthorpe
2024-04-04 12:00                 ` Peter Xu
2024-03-27 15:23 ` [PATCH v4 06/13] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing peterx
2024-03-28 10:10   ` David Hildenbrand
2024-03-28 19:01     ` Andrew Morton
2024-03-27 15:23 ` [PATCH v4 07/13] mm/gup: Refactor record_subpages() to find 1st small page peterx
2024-03-27 15:23 ` [PATCH v4 08/13] mm/gup: Handle hugetlb for no_page_table() peterx
2024-03-27 15:23 ` [PATCH v4 09/13] mm/gup: Cache *pudp in follow_pud_mask() peterx
2024-03-27 15:23 ` [PATCH v4 10/13] mm/gup: Handle huge pud for follow_pud_mask() peterx
2024-03-27 15:23 ` [PATCH v4 11/13] mm/gup: Handle huge pmd for follow_pmd_mask() peterx
2024-03-27 15:23 ` [PATCH v4 12/13] mm/gup: Handle hugepd for follow_page() peterx
2024-03-27 15:23 ` [PATCH v4 13/13] mm/gup: Handle hugetlb in the generic follow_page_mask code peterx
2024-04-02 14:48   ` Ryan Roberts
2024-04-02 15:26     ` David Hildenbrand
2024-04-02 16:00       ` Matthew Wilcox
2024-04-02 16:18         ` Ryan Roberts
2024-04-02 16:26           ` Peter Xu
2024-04-02 16:40         ` David Hildenbrand
2024-04-02 16:20       ` Peter Xu
2024-04-02 16:39         ` David Hildenbrand
2024-04-02 17:57           ` Peter Xu
2024-04-02 18:43             ` David Hildenbrand
2024-04-02 16:46         ` Ryan Roberts
2024-04-02 17:58           ` Peter Xu
