From: Ryan Roberts <ryan.roberts@arm.com>
To: Jason Gunthorpe <jgg@nvidia.com>,
	Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: "peterx@redhat.com" <peterx@redhat.com>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	James Houghton <jthoughton@google.com>,
	David Hildenbrand <david@redhat.com>,
	"Kirill A . Shutemov" <kirill@shutemov.name>,
	Yang Shi <shy828301@gmail.com>,
	"linux-riscv@lists.infradead.org"
	<linux-riscv@lists.infradead.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	"Aneesh Kumar K . V" <aneesh.kumar@kernel.org>,
	Rik van Riel <riel@surriel.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Mike Rapoport <rppt@kernel.org>,
	John Hubbard <jhubbard@nvidia.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	Michael Ellerman <mpe@ellerman.id.au>,
	Andrew Jones <andrew.jones@linux.dev>,
	"linuxppc-dev@lists.ozlabs.org" <linuxppc-dev@lists.ozlabs.org>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Muchun Song <muchun.song@linux.dev>,
	"linux-arm-kernel@lists.infradead.org"
	<linux-arm-kernel@lists.infradead.org>,
	Christoph Hellwig <hch@infradead.org>,
	Lorenzo Stoakes <lstoakes@gmail.com>,
	Matthew Wilcox <willy@infradead.org>
Subject: Re: [PATCH v2 06/13] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing
Date: Thu, 18 Jan 2024 15:15:36 +0000	[thread overview]
Message-ID: <9e60b948-0044-4826-8551-0a3888650657@arm.com> (raw)
In-Reply-To: <20240117132243.GG734935@nvidia.com>

On 17/01/2024 13:22, Jason Gunthorpe wrote:
> On Tue, Jan 16, 2024 at 06:32:32PM +0000, Christophe Leroy wrote:
>>>> hugepd is a page directory dedicated to huge pages, where you have huge
>>>> pages listed instead of regular pages. For instance, on powerpc 32 with
>>>> each PGD entry covering 4Mbytes, a regular page table has 1024 PTEs. A
>>>> hugepd for 512k is a page table with 8 entries.
>>>>
>>>> And for 8Mbytes entries, the hugepd is a page table with only one entry.
>>>> And 2 consecutive PGD entries will point to the same hugepd to cover the
>>>> entire 8Mbytes.
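
(Spelling out the arithmetic being described here - illustrative only,
not kernel code:

	PGD entry coverage:  4MB
	regular page table:  4MB / 4KB   = 1024 PTEs
	512k hugepd:         4MB / 512KB =    8 entries
	8M hugepd:           8MB / 8MB   =    1 entry, reached via
	                     8MB / 4MB   =    2 consecutive PGD entries
	                                      pointing at the same hugepd)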
>>>
>>> That still sounds a lot like the ARM thing - except ARM replicates the
>>> entry. You also said PPC replicates the entry like ARM to get to the
>>> 8M?
>>
>> Is it like ARM? Not sure. The PTE is not in the PGD; it must be in an L2 
>> directory, even for 8M.
> 
> Your diagram looks almost exactly like ARM to me.
> 
> The key thing is that the address for the L2 Table is *always* formed as:
> 
>    L2 Table Base << 12 + L2 Index << 2 + 00
> 
> Then the L2 Descriptor must contain bits indicating the page
> size. The L2 Descriptor is replicated to every 4k entry that the page
> size covers.
> 
> The only difference I see is the 8M case which has a page size greater
> than a single L1 entry.
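
To make that concrete, a minimal sketch of how I read that scheme (the
names are illustrative, not from any real header):

	/* address of an L2 descriptor: the table base plus the VA's
	 * L2 index selecting a 4-byte slot */
	u32 l2_desc_addr(u32 l2_table_base, u32 l2_index)
	{
		return (l2_table_base << 12) | (l2_index << 2);
	}

	/* the leaf descriptor is then replicated into every 4k slot
	 * the page covers, e.g. 128 slots for a 512k page */
	for (i = 0; i < page_size / SZ_4K; i++)
		l2_table[l2_index + i] = desc;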
> 
>> Yes that's how it works on powerpc. For 8xx we used to do that for both 
>> 8M and 512k pages. Now for 512k pages we do it kind of like ARM (which 
>> means replicating the entry 128 times), as that's needed to allow mixing 
>> different page sizes for a given PGD entry.
> 
> Right, you want to have granular page sizes or it becomes unusable in
> the general case.
>  
>> But for 8M pages that would mean replicating the entry 2048 times. 
>> That's a bit too much, isn't it?
> 
> Indeed, de-duplicating the L2 Table is a neat optimization.
> 
>>> So if you imagine a pmd_leaf(), pmd_leaf_size() and a pte_leaf_size()
>>> that would return enough information for both.
>>
>> pmd_leaf()? Unless I'm missing something, I can't do a leaf at PMD (PGD) 
>> level. It must be a two-level process even for pages bigger than a PMD 
>> entry.
> 
> Right, this is the normal THP/hugetlb situation on x86/etc. It
> wouldn't apply here since it seems the HW doesn't have a bit in the L1
> descriptor to indicate leaf.
> 
> Instead for PPC this hugepd stuff should start to follow Ryan's
> generic work for ARM contig:
> 
> https://lore.kernel.org/all/20231218105100.172635-1-ryan.roberts@arm.com/
> 
> Specifically the arch implementation:
> 
> https://lore.kernel.org/linux-mm/20231218105100.172635-15-ryan.roberts@arm.com/
> 
> I.e. the arch should ultimately wire up the replication and variable
> page size bits within its implementation of set_ptes(). set_ptes()
> gets a contiguous run of addresses and should install it with maximum
> use of the variable page sizes. The core code will start to call
> set_ptes() in more cases as Ryan's project progresses.

Note that it's not just set_ptes() that you want to batch; there are other calls
that can benefit too. See patches 2 and 3 in the series you linked (although
I'm working with DavidH on this, and the details are going to change a little).
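
For reference, the generic fallback for set_ptes() has roughly this
shape (a simplified sketch; an arch override is where the contig/size
bits and the replication would get wired up):

	static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
				    pte_t *ptep, pte_t pte, unsigned int nr)
	{
		for (;;) {
			set_pte(ptep, pte);
			if (--nr == 0)
				break;
			ptep++;
			/* advance the pfn to the next 4k page */
			pte = __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT));
		}
	}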

> 
> For the purposes of GUP, where we are today and where we are going,
> it would be much better to not have a special PPC specific "hugepd"
> parser. Just process each of the 4k replicates one by one like ARM is
> starting with.
> 
> The arch would still have to return the correct page address from
> pte_phys() which I think Ryan is doing by having the replicates encode
> the full 4k based address in each entry.

Yes; although it's actually also a requirement of the arm architecture. Since the
contig bit is just a hint that the HW may or may not take any notice of, the
page tables have to be correct for the case where the HW just reads them in base
pages. Fixing up the bottom bits should be trivial using the PTE pointer, if
needed for ppc.
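
In other words, when writing a contig run the arch does something along
these lines (a sketch with arm64-flavoured names; ppc would spell it
differently):

	/* every replicate carries its own base-page pfn, so pte_pfn()
	 * and friends stay correct even if the HW ignores the hint */
	for (i = 0; i < CONT_PTES; i++, ptep++) {
		set_pte(ptep, pte_mkcont(pte));
		pte = __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT));
	}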

> The HW will ignore those low
> bits and pte_phys() then works properly. This would work for PPC as
> well, excluding the 8M optimization.
> 
> Going forward I'd expect to see some pte_page_size() that returns the
> size bits and GUP can have logic to skip reading replicates.

Yes; pte_batch_remaining() in patch 2 is an attempt at this. But as I said, the
details will likely change a little.
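
The GUP-side shape would be roughly this (a sketch based on patch 2 of
that series, so the helper name may well not survive):

	/* number of entries from here covered by one leaf mapping;
	 * returns 1 when there is no batch to skip */
	nr = pte_batch_remaining(pte, addr, end);

	/* all nr replicates map consecutive 4k pages of the same
	 * folio, so account them in one go without re-reading each */
	ptep += nr;
	addr += nr * PAGE_SIZE;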

> 
> The advantage of all this is that it stops making the feature special
> and the work Ryan is doing to generically push larger folios into
> set_ptes will become usable on these PPC platforms as well. And we can
> kill the PPC specific hugepd.
> 
> Jason

