From: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>
To: David Hildenbrand <david@redhat.com>,
	Alistair Popple <apopple@nvidia.com>,
	akpm@linux-foundation.org
Cc: Felix Kuehling <felix.kuehling@amd.com>,
	jgg@nvidia.com, linux-mm@kvack.org, rcampbell@nvidia.com,
	linux-ext4@vger.kernel.org, linux-xfs@vger.kernel.org,
	amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	hch@lst.de, jglisse@redhat.com, willy@infradead.org
Subject: Re: [PATCH v5 01/13] mm: add zone device coherent type memory support
Date: Thu, 23 Jun 2022 13:20:45 -0500	[thread overview]
Message-ID: <c7d4d9a9-ac8a-b48d-55f3-9d2450e660ef@amd.com> (raw)
In-Reply-To: <9af76814-ee3a-0af4-7300-d432050b13a3@redhat.com>


On 6/23/2022 2:57 AM, David Hildenbrand wrote:
> On 23.06.22 01:16, Sierra Guiza, Alejandro (Alex) wrote:
>> On 6/21/2022 11:16 AM, David Hildenbrand wrote:
>>> On 21.06.22 18:08, Sierra Guiza, Alejandro (Alex) wrote:
>>>> On 6/21/2022 7:25 AM, David Hildenbrand wrote:
>>>>> On 21.06.22 13:55, Alistair Popple wrote:
>>>>>> David Hildenbrand<david@redhat.com>  writes:
>>>>>>
>>>>>>> On 21.06.22 13:25, Felix Kuehling wrote:
>>>>>>>> Am 6/17/22 um 23:19 schrieb David Hildenbrand:
>>>>>>>>> On 17.06.22 21:27, Sierra Guiza, Alejandro (Alex) wrote:
>>>>>>>>>> On 6/17/2022 12:33 PM, David Hildenbrand wrote:
>>>>>>>>>>> On 17.06.22 19:20, Sierra Guiza, Alejandro (Alex) wrote:
>>>>>>>>>>>> On 6/17/2022 4:40 AM, David Hildenbrand wrote:
>>>>>>>>>>>>> On 31.05.22 22:00, Alex Sierra wrote:
>>>>>>>>>>>>>> Device memory that is cache coherent from device and CPU point of view.
>>>>>>>>>>>>>> This is used on platforms that have an advanced system bus (like CAPI
>>>>>>>>>>>>>> or CXL). Any page of a process can be migrated to such memory. However,
>>>>>>>>>>>>>> no one should be allowed to pin such memory so that it can always be
>>>>>>>>>>>>>> evicted.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Signed-off-by: Alex Sierra<alex.sierra@amd.com>
>>>>>>>>>>>>>> Acked-by: Felix Kuehling<Felix.Kuehling@amd.com>
>>>>>>>>>>>>>> Reviewed-by: Alistair Popple<apopple@nvidia.com>
>>>>>>>>>>>>>> [hch: rebased ontop of the refcount changes,
>>>>>>>>>>>>>>             removed is_dev_private_or_coherent_page]
>>>>>>>>>>>>>> Signed-off-by: Christoph Hellwig<hch@lst.de>
>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>        include/linux/memremap.h | 19 +++++++++++++++++++
>>>>>>>>>>>>>>        mm/memcontrol.c          |  7 ++++---
>>>>>>>>>>>>>>        mm/memory-failure.c      |  8 ++++++--
>>>>>>>>>>>>>>        mm/memremap.c            | 10 ++++++++++
>>>>>>>>>>>>>>        mm/migrate_device.c      | 16 +++++++---------
>>>>>>>>>>>>>>        mm/rmap.c                |  5 +++--
>>>>>>>>>>>>>>        6 files changed, 49 insertions(+), 16 deletions(-)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>>>>>>>>>>>>> index 8af304f6b504..9f752ebed613 100644
>>>>>>>>>>>>>> --- a/include/linux/memremap.h
>>>>>>>>>>>>>> +++ b/include/linux/memremap.h
>>>>>>>>>>>>>> @@ -41,6 +41,13 @@ struct vmem_altmap {
>>>>>>>>>>>>>>         * A more complete discussion of unaddressable memory may be found in
>>>>>>>>>>>>>>         * include/linux/hmm.h and Documentation/vm/hmm.rst.
>>>>>>>>>>>>>>         *
>>>>>>>>>>>>>> + * MEMORY_DEVICE_COHERENT:
>>>>>>>>>>>>>> + * Device memory that is cache coherent from device and CPU point of view. This
>>>>>>>>>>>>>> + * is used on platforms that have an advanced system bus (like CAPI or CXL). A
>>>>>>>>>>>>>> + * driver can hotplug the device memory using ZONE_DEVICE and with that memory
>>>>>>>>>>>>>> + * type. Any page of a process can be migrated to such memory. However no one
>>>>>>>>>>>>> Any page might not be right, I'm pretty sure. ... just thinking about special pages
>>>>>>>>>>>>> like vdso, shared zeropage, ... pinned pages ...
>>>>>>>>>>> Well, you cannot migrate long term pages, that's what I meant :)
>>>>>>>>>>>
>>>>>>>>>>>>>> + * should be allowed to pin such memory so that it can always be evicted.
>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>         * MEMORY_DEVICE_FS_DAX:
>>>>>>>>>>>>>>         * Host memory that has similar access semantics as System RAM i.e. DMA
>>>>>>>>>>>>>>         * coherent and supports page pinning. In support of coordinating page
>>>>>>>>>>>>>> @@ -61,6 +68,7 @@ struct vmem_altmap {
>>>>>>>>>>>>>>        enum memory_type {
>>>>>>>>>>>>>>        	/* 0 is reserved to catch uninitialized type fields */
>>>>>>>>>>>>>>        	MEMORY_DEVICE_PRIVATE = 1,
>>>>>>>>>>>>>> +	MEMORY_DEVICE_COHERENT,
>>>>>>>>>>>>>>        	MEMORY_DEVICE_FS_DAX,
>>>>>>>>>>>>>>        	MEMORY_DEVICE_GENERIC,
>>>>>>>>>>>>>>        	MEMORY_DEVICE_PCI_P2PDMA,
>>>>>>>>>>>>>> @@ -143,6 +151,17 @@ static inline bool folio_is_device_private(const struct folio *folio)
>>>>>>>>>>>>> In general, this LGTM, and it should be correct with PageAnonExclusive I think.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> However, where exactly is pinning forbidden?
>>>>>>>>>>>> Long-term pinning is forbidden since it would interfere with the device
>>>>>>>>>>>> memory manager owning the
>>>>>>>>>>>> device-coherent pages (e.g. evictions in TTM). However, normal pinning
>>>>>>>>>>>> is allowed on this device type.
>>>>>>>>>>> I don't see updates to folio_is_pinnable() in this patch.
>>>>>>>>>> Device coherent type pages should return true here, as they are pinnable
>>>>>>>>>> pages.
>>>>>>>>> That function is only called for long-term pinnings in try_grab_folio().
>>>>>>>>>
>>>>>>>>>>> So wouldn't try_grab_folio() simply pin these pages? What am I missing?
>>>>>>>>>> As far as I understand, this returns NULL for long-term pinned pages.
>>>>>>>>>> Otherwise their refcount gets incremented.
>>>>>>>>> I don't follow.
>>>>>>>>>
>>>>>>>>> You're saying
>>>>>>>>>
>>>>>>>>> a) folio_is_pinnable() returns true for device coherent pages
>>>>>>>>>
>>>>>>>>> and that
>>>>>>>>>
>>>>>>>>> b) device coherent pages don't get long-term pinned
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Yet, the code says
>>>>>>>>>
>>>>>>>>> struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>>>>>>>>> {
>>>>>>>>> 	if (flags & FOLL_GET)
>>>>>>>>> 		return try_get_folio(page, refs);
>>>>>>>>> 	else if (flags & FOLL_PIN) {
>>>>>>>>> 		struct folio *folio;
>>>>>>>>>
>>>>>>>>> 		/*
>>>>>>>>> 		 * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
>>>>>>>>> 		 * right zone, so fail and let the caller fall back to the slow
>>>>>>>>> 		 * path.
>>>>>>>>> 		 */
>>>>>>>>> 		if (unlikely((flags & FOLL_LONGTERM) &&
>>>>>>>>> 			     !is_pinnable_page(page)))
>>>>>>>>> 			return NULL;
>>>>>>>>> 		...
>>>>>>>>> 		return folio;
>>>>>>>>> 	}
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> What prevents these pages from getting long-term pinned as stated in this patch?
>>>>>>>> Long-term pinning is handled by __gup_longterm_locked, which migrates
>>>>>>>> pages returned by __get_user_pages_locked that cannot be long-term
>>>>>>>> pinned. try_grab_folio is OK to grab the pages. Anything that can't be
>>>>>>>> long-term pinned will be migrated afterwards, and
>>>>>>>> __get_user_pages_locked will be retried. The migration of
>>>>>>>> DEVICE_COHERENT pages was implemented by Alistair in patch 5/13
>>>>>>>> ("mm/gup: migrate device coherent pages when pinning instead of failing").
>>>>>>> Thanks.
>>>>>>>
>>>>>>> __gup_longterm_locked()->check_and_migrate_movable_pages()
>>>>>>>
>>>>>>> Which checks folio_is_pinnable() and doesn't do anything if set.
>>>>>>>
>>>>>>> Sorry to be dense here, but I don't see how what's stated in this patch
>>>>>>> works without adjusting folio_is_pinnable().
>>>>>> Ugh, I think you might be right about try_grab_folio().
>>>>>>
>>>>>> We didn't update folio_is_pinnable() to include device coherent pages
>>>>>> because device coherent pages are pinnable. It is really just
>>>>>> FOLL_LONGTERM that we want to prevent here.
>>>>>>
>>>>>> For normal PUP that is done by my change in
>>>>>> check_and_migrate_movable_pages() which migrates pages being pinned with
>>>>>> FOLL_LONGTERM. But I think I incorrectly assumed we would take the
>>>>>> pte_devmap() path in gup_pte_range(), which we don't for coherent pages.
>>>>>> So I think the check in try_grab_folio() needs to be:
>>>>> I think I said it already (and I might be wrong without reading the
>>>>> code), but folio_is_pinnable() is *only* called for long-term pinnings.
>>>>>
>>>>> It should actually be called folio_is_longterm_pinnable().
>>>>>
>>>>> That's where that check should go, no?
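
Concretely, the check being suggested here would look something like this
(a sketch only, assuming the folio_is_device_coherent() helper this series
introduces; the pre-existing ZONE_MOVABLE/CMA logic is elided):

	/* include/linux/mm.h */
	static inline bool folio_is_pinnable(struct folio *folio)
	{
		/*
		 * Only consulted for FOLL_LONGTERM, so returning false
		 * here forbids long-term pinning while still allowing
		 * normal (short-term) pins of device coherent pages.
		 */
		if (folio_is_device_coherent(folio))
			return false;
		/* ... existing ZONE_MOVABLE/CMA checks ... */
		return true;
	}
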
>>>> David, I think you're right. We didn't catch this since the LONGTERM gup
>>>> test we added to hmm-test only calls pin_user_pages. Apparently
>>>> try_grab_folio is called only from the fast callers (e.g.
>>>> pin_user_pages_fast/get_user_pages_fast). I have added a conditional in
>>>> try_grab_folio, similar to what Alistair proposed, to return NULL on
>>>> LONGTERM && (device-coherent page || !folio_is_pinnable). I also added
>>>> a new gup test with LONGTERM set that calls pin_user_pages_fast.
>>>> Returning NULL under this condition does cause the migration from
>>>> device to system memory.
>>>>
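
Spelled out, that conditional would sit in try_grab_folio() roughly as
follows (a sketch against the excerpt quoted earlier, not the final
patch):

	if (unlikely((flags & FOLL_LONGTERM) &&
		     (is_device_coherent_page(page) ||
		      !is_pinnable_page(page))))
		return NULL;

Folding the device-coherent check into is_pinnable_page() itself, as
David suggests in his reply below, would make the extra clause
unnecessary.
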
>>> Why can't coherent memory simply put its checks into
>>> folio_is_pinnable()? I don't get why we have to do things differently
>>> here.
>>>
>>>> Actually, I'm having a different problem with a call to PageAnonExclusive
>>>> from try_to_migrate_one during a page fault, in an HMM test that first
>>>> migrates pages to device private memory and then forks to mark these
>>>> pages as COW. Apparently it is hitting the first BUG,
>>>> VM_BUG_ON_PGFLAGS(!PageAnon(page), page).
>>> With or without this series? A backtrace would be great.
>> Here's the backtrace. This happens in an hmm-test added in this patch
>> series. However, I have tried to isolate this BUG by adding just the COW
>> test with private device memory only. It only shows up in the following
>> sequence: allocate anonymous memory -> migrate to private device memory ->
>> fork -> try to access the parent's anonymous memory (which is supposed to
>> trigger a page fault and migration to system memory). Just for the record,
>> if the child is terminated before the parent's memory is accessed, the
>> problem does not occur.
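
Compressed into code, that sequence looks roughly like this (using the
hmm_buffer/hmm_dmirror_cmd() helpers and the HMM_DMIRROR_MIGRATE ioctl
from tools/testing/selftests/vm/hmm-tests.c; buffer setup and error
handling elided):

	/* Allocate anonymous memory. */
	buffer->ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	memset(buffer->ptr, 0xaa, size);

	/* Migrate it to device private memory. */
	hmm_dmirror_cmd(fd, HMM_DMIRROR_MIGRATE, buffer, npages);

	/* Fork, so the pages become COW-shared with the child. */
	pid = fork();
	if (pid == 0)
		pause();	/* child stays alive, keeping its mapping */

	/*
	 * Parent access: expected to fault and migrate the page back to
	 * system memory; instead this is where the VM_BUG_ON fires.
	 */
	tmp = *(int *)buffer->ptr;
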
>
> The only usage of PageAnonExclusive() in try_to_migrate_one() is:
>
> anon_exclusive = folio_test_anon(folio) &&
> 		 PageAnonExclusive(subpage);
>
> Which can only possibly fail if subpage is not actually part of the folio.
>
>
> I see some controversial code in the if (folio_is_zone_device(folio)) case later:
>
> 			 * The assignment to subpage above was computed from a
> 			 * swap PTE which results in an invalid pointer.
> 			 * Since only PAGE_SIZE pages can currently be
> 			 * migrated, just set it to page. This will need to be
> 			 * changed when hugepage migrations to device private
> 			 * memory are supported.
> 			 */
> 			subpage = &folio->page;
>
> There we have our invalid pointer hint.
>
> I don't see how it could have worked if the child quit, though? Maybe
> just pure luck?
>
>
> Does the following fix your issue:

Yes, it fixed the issue. Thanks. Should we include this patch in this
patch series or send it separately?

Regards,
Alex Sierra
>
>
>
>  From 09750c714739ef3ca317b4aec82bf20283c8fd2d Mon Sep 17 00:00:00 2001
> From: David Hildenbrand <david@redhat.com>
> Date: Thu, 23 Jun 2022 09:38:45 +0200
> Subject: [PATCH] mm/rmap: fix dereferencing invalid subpage pointer in
>   try_to_migrate_one()
>
> The subpage we calculate is an invalid pointer for device private pages,
> because device private pages are mapped via non-present device private
> entries, not ordinary present PTEs.
>
> Let's just not compute broken pointers and fixup later. Move the proper
> assignment of the correct subpage to the beginning of the function and
> assert that we really only have a single page in our folio.
>
> This currently results in a BUG when trying to compute anon_exclusive,
> because:
>
> [  528.727237] BUG: unable to handle page fault for address: ffffea1fffffffc0
> [  528.739585] #PF: supervisor read access in kernel mode
> [  528.745324] #PF: error_code(0x0000) - not-present page
> [  528.751062] PGD 44eaf2067 P4D 44eaf2067 PUD 0
> [  528.756026] Oops: 0000 [#1] PREEMPT SMP NOPTI
> [  528.760890] CPU: 120 PID: 18275 Comm: hmm-tests Not tainted 5.19.0-rc3-kfd-alex #257
> [  528.769542] Hardware name: AMD Corporation BardPeak/BardPeak, BIOS RTY1002BDS 09/17/2021
> [  528.778579] RIP: 0010:try_to_migrate_one+0x21a/0x1000
> [  528.784225] Code: f6 48 89 c8 48 2b 05 45 d1 6a 01 48 c1 f8 06 48 29
> c3 48 8b 45 a8 48 c1 e3 06 48 01 cb f6 41 18 01 48 89 85 50 ff ff ff 74
> 0b <4c> 8b 33 49 c1 ee 11 41 83 e6 01 48 8b bd 48 ff ff ff e8 3f 99 02
> [  528.805194] RSP: 0000:ffffc90003cdfaa0 EFLAGS: 00010202
> [  528.811027] RAX: 00007ffff7ff4000 RBX: ffffea1fffffffc0 RCX: ffffeaffffffffc0
> [  528.818995] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffc90003cdfaf8
> [  528.826962] RBP: ffffc90003cdfb70 R08: 0000000000000000 R09: 0000000000000000
> [  528.834930] R10: ffffc90003cdf910 R11: 0000000000000002 R12: ffff888194450540
> [  528.842899] R13: ffff888160d057c0 R14: 0000000000000000 R15: 03ffffffffffffff
> [  528.850865] FS:  00007ffff7fdb740(0000) GS:ffff8883b0600000(0000) knlGS:0000000000000000
> [  528.859891] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  528.866308] CR2: ffffea1fffffffc0 CR3: 00000001562b4003 CR4: 0000000000770ee0
> [  528.874275] PKRU: 55555554
> [  528.877286] Call Trace:
> [  528.880016]  <TASK>
> [  528.882356]  ? lock_is_held_type+0xdf/0x130
> [  528.887033]  rmap_walk_anon+0x167/0x410
> [  528.891316]  try_to_migrate+0x90/0xd0
> [  528.895405]  ? try_to_unmap_one+0xe10/0xe10
> [  528.900074]  ? anon_vma_ctor+0x50/0x50
> [  528.904260]  ? put_anon_vma+0x10/0x10
> [  528.908347]  ? invalid_mkclean_vma+0x20/0x20
> [  528.913114]  migrate_vma_setup+0x5f4/0x750
> [  528.917691]  dmirror_devmem_fault+0x8c/0x250 [test_hmm]
> [  528.923532]  do_swap_page+0xac0/0xe50
> [  528.927623]  ? __lock_acquire+0x4b2/0x1ac0
> [  528.932199]  __handle_mm_fault+0x949/0x1440
> [  528.936876]  handle_mm_fault+0x13f/0x3e0
> [  528.941256]  do_user_addr_fault+0x215/0x740
> [  528.945928]  exc_page_fault+0x75/0x280
> [  528.950115]  asm_exc_page_fault+0x27/0x30
> [  528.954593] RIP: 0033:0x40366b
> ...
>
> Fixes: 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive")
> Reported-by: Sierra Guiza, Alejandro (Alex) <alex.sierra@amd.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>   mm/rmap.c | 27 +++++++++++++++++----------
>   1 file changed, 17 insertions(+), 10 deletions(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 5bcb334cd6f2..746c05acad27 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1899,8 +1899,23 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>   		/* Unexpected PMD-mapped THP? */
>   		VM_BUG_ON_FOLIO(!pvmw.pte, folio);
>   
> -		subpage = folio_page(folio,
> -				pte_pfn(*pvmw.pte) - folio_pfn(folio));
> +		if (folio_is_zone_device(folio)) {
> +			/*
> +			 * Our PTE is a non-present device exclusive entry and
> +			 * calculating the subpage as for the common case would
> +			 * result in an invalid pointer.
> +			 *
> +			 * Since only PAGE_SIZE pages can currently be
> +			 * migrated, just set it to page. This will need to be
> +			 * changed when hugepage migrations to device private
> +			 * memory are supported.
> +			 */
> +			VM_BUG_ON_FOLIO(folio_nr_pages(folio) > 1, folio);
> +			subpage = &folio->page;
> +		} else {
> +			subpage = folio_page(folio,
> +					pte_pfn(*pvmw.pte) - folio_pfn(folio));
> +		}
>   		address = pvmw.address;
>   		anon_exclusive = folio_test_anon(folio) &&
>   				 PageAnonExclusive(subpage);
> @@ -1993,15 +2008,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>   			/*
>   			 * No need to invalidate here it will synchronize on
>   			 * against the special swap migration pte.
> -			 *
> -			 * The assignment to subpage above was computed from a
> -			 * swap PTE which results in an invalid pointer.
> -			 * Since only PAGE_SIZE pages can currently be
> -			 * migrated, just set it to page. This will need to be
> -			 * changed when hugepage migrations to device private
> -			 * memory are supported.
>   			 */
> -			subpage = &folio->page;
>   		} else if (PageHWPoison(subpage)) {
>   			pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
>   			if (folio_test_hugetlb(folio)) {


Thread overview: 82+ messages

2022-05-31 20:00 [PATCH v5 00/13] Add MEMORY_DEVICE_COHERENT for coherent device memory mapping Alex Sierra
2022-05-31 20:00 ` [PATCH v5 01/13] mm: add zone device coherent type memory support Alex Sierra
2022-06-17  9:40   ` David Hildenbrand
2022-06-17 17:20     ` Sierra Guiza, Alejandro (Alex)
2022-06-17 17:33       ` David Hildenbrand
2022-06-17 19:27         ` Sierra Guiza, Alejandro (Alex)
2022-06-17 21:19           ` David Hildenbrand
2022-06-21 11:25             ` Felix Kuehling
2022-06-21 11:32               ` David Hildenbrand
2022-06-21 11:55                 ` Alistair Popple
2022-06-21 12:25                   ` David Hildenbrand
2022-06-21 16:08                     ` Sierra Guiza, Alejandro (Alex)
2022-06-21 16:16                       ` David Hildenbrand
2022-06-22  0:16                         ` Alistair Popple
2022-06-22 23:06                           ` Sierra Guiza, Alejandro (Alex)
2022-06-22 23:16                         ` Sierra Guiza, Alejandro (Alex)
2022-06-23  7:57                           ` David Hildenbrand
2022-06-23 18:20                             ` Sierra Guiza, Alejandro (Alex) [this message]
2022-06-23 18:21                               ` David Hildenbrand
2022-06-24 16:13                                 ` Sierra Guiza, Alejandro (Alex)
2022-06-18  9:32       ` Oded Gabbay
2022-06-20  0:17         ` Alistair Popple
2022-06-20  6:01           ` Oded Gabbay
2022-06-20  8:13             ` Alistair Popple
2022-06-20 12:23               ` Oded Gabbay
2022-05-31 20:00 ` [PATCH v5 02/13] mm: handling Non-LRU pages returned by vm_normal_pages Alex Sierra
2022-06-08  7:06   ` Alistair Popple
2022-06-17  9:51   ` David Hildenbrand
2022-05-31 20:00 ` [PATCH v5 03/13] mm: add device coherent vma selection for memory migration Alex Sierra
2022-05-31 20:00 ` [PATCH v5 04/13] mm: remove the vma check in migrate_vma_setup() Alex Sierra
2022-05-31 20:00 ` [PATCH v5 05/13] mm/gup: migrate device coherent pages when pinning instead of failing Alex Sierra
2022-05-31 20:00 ` [PATCH v5 06/13] drm/amdkfd: add SPM support for SVM Alex Sierra
2022-05-31 20:00 ` [PATCH v5 07/13] lib: test_hmm add ioctl to get zone device type Alex Sierra
2022-05-31 20:00 ` [PATCH v5 08/13] lib: test_hmm add module param for zone device type Alex Sierra
2022-05-31 20:00 ` [PATCH v5 09/13] lib: add support for device coherent type in test_hmm Alex Sierra
2022-05-31 20:00 ` [PATCH v5 10/13] tools: update hmm-test to support device coherent type Alex Sierra
2022-05-31 20:00 ` [PATCH v5 11/13] tools: update test_hmm script to support SP config Alex Sierra
2022-05-31 20:00 ` [PATCH v5 12/13] tools: add hmm gup tests for device coherent type Alex Sierra
2022-05-31 20:00 ` [PATCH v5 13/13] tools: add selftests to hmm for COW in device memory Alex Sierra
2022-06-17  2:19 ` [PATCH v5 00/13] Add MEMORY_DEVICE_COHERENT for coherent device memory mapping Andrew Morton
2022-06-17  7:44   ` David Hildenbrand
