* [PATCH v3 1/3] mm/migrate_device.c: Flush TLB while holding PTL
@ 2022-08-24  3:03 Alistair Popple
  2022-08-24  3:03 ` [PATCH v3 2/3] mm/migrate_device.c: Copy pte dirty bit to page Alistair Popple
                   ` (3 more replies)
  0 siblings, 4 replies; 20+ messages in thread
From: Alistair Popple @ 2022-08-24  3:03 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Peter Xu, Nadav Amit, Alistair Popple, huang ying, LKML,
	Sierra Guiza, Alejandro (Alex),
	Felix Kuehling, Jason Gunthorpe, John Hubbard, David Hildenbrand,
	Ralph Campbell, Matthew Wilcox, Karol Herbst, Lyude Paul,
	Ben Skeggs, Logan Gunthorpe, paulus, linuxppc-dev, stable

When clearing a PTE the TLB should be flushed whilst still holding the
PTL to avoid a potential race with madvise/munmap/etc. For example
consider the following sequence:

  CPU0                          CPU1
  ----                          ----

  migrate_vma_collect_pmd()
  pte_unmap_unlock()
                                madvise(MADV_DONTNEED)
                                -> zap_pte_range()
                                pte_offset_map_lock()
                                [ PTE not present, TLB not flushed ]
                                pte_unmap_unlock()
                                [ page is still accessible via stale TLB ]
  flush_tlb_range()

In this case the page may still be accessed via the stale TLB entry
after madvise returns. Fix this by flushing the TLB while holding the
PTL.
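
The fix keeps the existing batched flush but performs it before the PTL is
released, roughly like this (names as in the diff below; illustration only):

	/* PTL still held from pte_offset_map_lock() */
	if (unmapped)
		flush_tlb_range(walk->vma, start, end);

	arch_leave_lazy_mmu_mode();
	pte_unmap_unlock(ptep - 1, ptl);

Any concurrent zap_pte_range() must take the PTL first, so by the time it can
observe the cleared PTEs the stale TLB entries have already been flushed.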

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reported-by: Nadav Amit <nadav.amit@gmail.com>
Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
Cc: stable@vger.kernel.org

---

Changes for v3:

 - New for v3
---
 mm/migrate_device.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 27fb37d..6a5ef9f 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -254,13 +254,14 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 		migrate->dst[migrate->npages] = 0;
 		migrate->src[migrate->npages++] = mpfn;
 	}
-	arch_leave_lazy_mmu_mode();
-	pte_unmap_unlock(ptep - 1, ptl);
 
 	/* Only flush the TLB if we actually modified any entries */
 	if (unmapped)
 		flush_tlb_range(walk->vma, start, end);
 
+	arch_leave_lazy_mmu_mode();
+	pte_unmap_unlock(ptep - 1, ptl);
+
 	return 0;
 }
 

base-commit: ffcf9c5700e49c0aee42dcba9a12ba21338e8136
-- 
git-series 0.9.1


* [PATCH v3 2/3] mm/migrate_device.c: Copy pte dirty bit to page
  2022-08-24  3:03 [PATCH v3 1/3] mm/migrate_device.c: Flush TLB while holding PTL Alistair Popple
@ 2022-08-24  3:03 ` Alistair Popple
  2022-08-24 15:39   ` Peter Xu
  2022-08-24  3:03 ` [PATCH v3 3/3] selftests/hmm-tests: Add test for dirty bits Alistair Popple
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 20+ messages in thread
From: Alistair Popple @ 2022-08-24  3:03 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Peter Xu, Nadav Amit, Alistair Popple, huang ying, LKML,
	Sierra Guiza, Alejandro (Alex),
	Felix Kuehling, Jason Gunthorpe, John Hubbard, David Hildenbrand,
	Ralph Campbell, Matthew Wilcox, Karol Herbst, Lyude Paul,
	Ben Skeggs, Logan Gunthorpe, paulus, linuxppc-dev, stable,
	Huang Ying

migrate_vma_setup() has a fast path in migrate_vma_collect_pmd() that
installs migration entries directly if it can lock the migrating page.
When removing a dirty pte the dirty bit is supposed to be carried over
to the underlying page to prevent it being lost.

Currently migrate_vma_*() can only be used for private anonymous
mappings. That means loss of the dirty bit usually doesn't result in
data loss because these pages are typically not file-backed. However
pages may be backed by swap storage which can result in data loss if an
attempt is made to migrate a dirty page that doesn't yet have the
PageDirty flag set.

In this case migration will fail due to unexpected references but the
dirty pte bit will be lost. If the page is subsequently reclaimed data
won't be written back to swap storage as it is considered uptodate,
resulting in data loss if the page is subsequently accessed.

Prevent this by copying the dirty bit to the page when removing the pte
to match what try_to_migrate_one() does.
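
The key detail is that the dirty bit is taken from the pte value returned by
the atomic clear rather than from a separate earlier read of *ptep, so a
hardware dirty-bit update cannot slip in between the read and the clear. The
resulting pattern (matching the diff below) is:

	pte = ptep_get_and_clear(mm, addr, ptep);	/* or ptep_clear_flush() */

	/* Set the dirty flag on the folio now the pte is gone. */
	if (pte_dirty(pte))
		folio_mark_dirty(page_folio(page));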

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Acked-by: Peter Xu <peterx@redhat.com>
Reported-by: Huang Ying <ying.huang@intel.com>
Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
Cc: stable@vger.kernel.org

---

Changes for v3:

 - Defer TLB flushing
 - Split a TLB flushing fix into a separate change.

Changes for v2:

 - Fixed up Reported-by tag.
 - Added Peter's Acked-by.
 - Atomically read and clear the pte to prevent the dirty bit getting
   set after reading it.
 - Added fixes tag
---
 mm/migrate_device.c |  9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 6a5ef9f..51d9afa 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -7,6 +7,7 @@
 #include <linux/export.h>
 #include <linux/memremap.h>
 #include <linux/migrate.h>
+#include <linux/mm.h>
 #include <linux/mm_inline.h>
 #include <linux/mmu_notifier.h>
 #include <linux/oom.h>
@@ -196,7 +197,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
 			if (anon_exclusive) {
 				flush_cache_page(vma, addr, pte_pfn(*ptep));
-				ptep_clear_flush(vma, addr, ptep);
+				pte = ptep_clear_flush(vma, addr, ptep);
 
 				if (page_try_share_anon_rmap(page)) {
 					set_pte_at(mm, addr, ptep, pte);
@@ -206,11 +207,15 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 					goto next;
 				}
 			} else {
-				ptep_get_and_clear(mm, addr, ptep);
+				pte = ptep_get_and_clear(mm, addr, ptep);
 			}
 
 			migrate->cpages++;
 
+			/* Set the dirty flag on the folio now the pte is gone. */
+			if (pte_dirty(pte))
+				folio_mark_dirty(page_folio(page));
+
 			/* Setup special migration page table entry */
 			if (mpfn & MIGRATE_PFN_WRITE)
 				entry = make_writable_migration_entry(
-- 
git-series 0.9.1


* [PATCH v3 3/3] selftests/hmm-tests: Add test for dirty bits
  2022-08-24  3:03 [PATCH v3 1/3] mm/migrate_device.c: Flush TLB while holding PTL Alistair Popple
  2022-08-24  3:03 ` [PATCH v3 2/3] mm/migrate_device.c: Copy pte dirty bit to page Alistair Popple
@ 2022-08-24  3:03 ` Alistair Popple
  2022-08-24  8:21 ` [PATCH v3 1/3] mm/migrate_device.c: Flush TLB while holding PTL David Hildenbrand
  2022-08-25  1:36 ` Huang, Ying
  3 siblings, 0 replies; 20+ messages in thread
From: Alistair Popple @ 2022-08-24  3:03 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Peter Xu, Nadav Amit, Alistair Popple, huang ying, LKML,
	Sierra Guiza, Alejandro (Alex),
	Felix Kuehling, Jason Gunthorpe, John Hubbard, David Hildenbrand,
	Ralph Campbell, Matthew Wilcox, Karol Herbst, Lyude Paul,
	Ben Skeggs, Logan Gunthorpe, paulus, linuxppc-dev, stable

We were not correctly copying PTE dirty bits to pages during
migrate_vma_setup() calls. This could potentially lead to data loss, so
add a test for this.
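
The core sequence the test exercises is roughly the following (see the code
below for details):

	ptr[i] = 0;		/* dirty the pages */
	/* write to memory.reclaim so the pages are written out to swap */
	tmp += ptr[i];		/* fault them back in as clean, swap-backed pages */
	ptr[i] = i;		/* dirty the ptes again */
	/* attempt a migrate_vma migration: expected to fail, but it must not
	   lose the pte dirty bit */
	/* reclaim again, then check ptr[i] == i after faulting back from swap */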

Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
 tools/testing/selftests/vm/hmm-tests.c | 124 ++++++++++++++++++++++++++-
 1 file changed, 124 insertions(+)

diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c
index 529f53b..70fdb49 100644
--- a/tools/testing/selftests/vm/hmm-tests.c
+++ b/tools/testing/selftests/vm/hmm-tests.c
@@ -1200,6 +1200,130 @@ TEST_F(hmm, migrate_multiple)
 	}
 }
 
+static char cgroup[] = "/sys/fs/cgroup/hmm-test-XXXXXX";
+static int write_cgroup_param(char *cgroup_path, char *param, long value)
+{
+	int ret;
+	FILE *f;
+	char *filename;
+
+	if (asprintf(&filename, "%s/%s", cgroup_path, param) < 0)
+		return -1;
+
+	f = fopen(filename, "w");
+	if (!f) {
+		ret = -1;
+		goto out;
+	}
+
+	ret = fprintf(f, "%ld\n", value);
+	if (ret < 0)
+		goto out1;
+
+	ret = 0;
+
+out1:
+	fclose(f);
+out:
+	free(filename);
+
+	return ret;
+}
+
+static int setup_cgroup(void)
+{
+	pid_t pid = getpid();
+	int ret;
+
+	if (!mkdtemp(cgroup))
+		return -1;
+
+	ret = write_cgroup_param(cgroup, "cgroup.procs", pid);
+	if (ret)
+		return ret;
+
+	return 0;
+}
+
+static int destroy_cgroup(void)
+{
+	pid_t pid = getpid();
+	int ret;
+
+	ret = write_cgroup_param("/sys/fs/cgroup",
+				"cgroup.procs", pid);
+	if (ret)
+		return ret;
+
+	if (rmdir(cgroup))
+		return -1;
+
+	return 0;
+}
+
+/*
+ * Try to migrate a dirty page that has previously been swapped to disk. This
+ * checks that we don't lose dirty bits.
+ */
+TEST_F(hmm, migrate_dirty_page)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	int tmp = 0;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	ASSERT_EQ(setup_cgroup(), 0);
+
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	/* Initialize buffer in system memory. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = 0;
+
+	ASSERT_FALSE(write_cgroup_param(cgroup, "memory.reclaim", 1UL<<30));
+
+	/* Fault pages back in from swap as clean pages */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		tmp += ptr[i];
+
+	/* Dirty the pte */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/*
+	 * Attempt to migrate memory to device. This should fail because some
+	 * pages are still swap-backed, giving them unexpected extra references.
+	 */
+	ASSERT_TRUE(hmm_migrate_sys_to_dev(self->fd, buffer, npages));
+
+	ASSERT_FALSE(write_cgroup_param(cgroup, "memory.reclaim", 1UL<<30));
+
+	/* Check we still see the updated data after restoring from swap. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	hmm_buffer_free(buffer);
+	destroy_cgroup();
+}
+
 /*
  * Read anonymous memory multiple times.
  */
-- 
git-series 0.9.1


* Re: [PATCH v3 1/3] mm/migrate_device.c: Flush TLB while holding PTL
  2022-08-24  3:03 [PATCH v3 1/3] mm/migrate_device.c: Flush TLB while holding PTL Alistair Popple
  2022-08-24  3:03 ` [PATCH v3 2/3] mm/migrate_device.c: Copy pte dirty bit to page Alistair Popple
  2022-08-24  3:03 ` [PATCH v3 3/3] selftests/hmm-tests: Add test for dirty bits Alistair Popple
@ 2022-08-24  8:21 ` David Hildenbrand
  2022-08-24 12:26   ` Alistair Popple
  2022-08-25  1:36 ` Huang, Ying
  3 siblings, 1 reply; 20+ messages in thread
From: David Hildenbrand @ 2022-08-24  8:21 UTC (permalink / raw)
  To: Alistair Popple, linux-mm, akpm
  Cc: Peter Xu, Nadav Amit, huang ying, LKML, Sierra Guiza,
	Alejandro (Alex),
	Felix Kuehling, Jason Gunthorpe, John Hubbard, Ralph Campbell,
	Matthew Wilcox, Karol Herbst, Lyude Paul, Ben Skeggs,
	Logan Gunthorpe, paulus, linuxppc-dev, stable

On 24.08.22 05:03, Alistair Popple wrote:
> When clearing a PTE the TLB should be flushed whilst still holding the
> PTL to avoid a potential race with madvise/munmap/etc. For example
> consider the following sequence:
> 
>   CPU0                          CPU1
>   ----                          ----
> 
>   migrate_vma_collect_pmd()
>   pte_unmap_unlock()
>                                 madvise(MADV_DONTNEED)
>                                 -> zap_pte_range()
>                                 pte_offset_map_lock()
>                                 [ PTE not present, TLB not flushed ]
>                                 pte_unmap_unlock()
>                                 [ page is still accessible via stale TLB ]
>   flush_tlb_range()
> 
> In this case the page may still be accessed via the stale TLB entry
> after madvise returns. Fix this by flushing the TLB while holding the
> PTL.
> 
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> Reported-by: Nadav Amit <nadav.amit@gmail.com>
> Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
> Cc: stable@vger.kernel.org
> 
> ---
> 
> Changes for v3:
> 
>  - New for v3
> ---
>  mm/migrate_device.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index 27fb37d..6a5ef9f 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -254,13 +254,14 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>  		migrate->dst[migrate->npages] = 0;
>  		migrate->src[migrate->npages++] = mpfn;
>  	}
> -	arch_leave_lazy_mmu_mode();
> -	pte_unmap_unlock(ptep - 1, ptl);
>  
>  	/* Only flush the TLB if we actually modified any entries */
>  	if (unmapped)
>  		flush_tlb_range(walk->vma, start, end);
>  
> +	arch_leave_lazy_mmu_mode();
> +	pte_unmap_unlock(ptep - 1, ptl);
> +
>  	return 0;
>  }
>  
> 
> base-commit: ffcf9c5700e49c0aee42dcba9a12ba21338e8136

I'm not a TLB-flushing expert, but this matches my understanding (and
some Linux documentation on TLB flushing I stumbled over a while ago but
cannot quickly find).

In the ordinary try_to_migrate_one() path, flushing would happen via
ptep_clear_flush() (just like we do for the anon_exclusive case here as
well), correct?

-- 
Thanks,

David / dhildenb



* Re: [PATCH v3 1/3] mm/migrate_device.c: Flush TLB while holding PTL
  2022-08-24  8:21 ` [PATCH v3 1/3] mm/migrate_device.c: Flush TLB while holding PTL David Hildenbrand
@ 2022-08-24 12:26   ` Alistair Popple
  2022-08-24 12:35     ` David Hildenbrand
  0 siblings, 1 reply; 20+ messages in thread
From: Alistair Popple @ 2022-08-24 12:26 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, akpm, Peter Xu, Nadav Amit, huang ying, LKML,
	Sierra Guiza, Alejandro (Alex),
	Felix Kuehling, Jason Gunthorpe, John Hubbard, Ralph Campbell,
	Matthew Wilcox, Karol Herbst, Lyude Paul, Ben Skeggs,
	Logan Gunthorpe, paulus, linuxppc-dev, stable


David Hildenbrand <david@redhat.com> writes:

> On 24.08.22 05:03, Alistair Popple wrote:
>> When clearing a PTE the TLB should be flushed whilst still holding the
>> PTL to avoid a potential race with madvise/munmap/etc. For example
>> consider the following sequence:
>>
>>   CPU0                          CPU1
>>   ----                          ----
>>
>>   migrate_vma_collect_pmd()
>>   pte_unmap_unlock()
>>                                 madvise(MADV_DONTNEED)
>>                                 -> zap_pte_range()
>>                                 pte_offset_map_lock()
>>                                 [ PTE not present, TLB not flushed ]
>>                                 pte_unmap_unlock()
>>                                 [ page is still accessible via stale TLB ]
>>   flush_tlb_range()
>>
>> In this case the page may still be accessed via the stale TLB entry
>> after madvise returns. Fix this by flushing the TLB while holding the
>> PTL.
>>
>> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>> Reported-by: Nadav Amit <nadav.amit@gmail.com>
>> Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
>> Cc: stable@vger.kernel.org
>>
>> ---
>>
>> Changes for v3:
>>
>>  - New for v3
>> ---
>>  mm/migrate_device.c | 5 +++--
>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>> index 27fb37d..6a5ef9f 100644
>> --- a/mm/migrate_device.c
>> +++ b/mm/migrate_device.c
>> @@ -254,13 +254,14 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>  		migrate->dst[migrate->npages] = 0;
>>  		migrate->src[migrate->npages++] = mpfn;
>>  	}
>> -	arch_leave_lazy_mmu_mode();
>> -	pte_unmap_unlock(ptep - 1, ptl);
>>
>>  	/* Only flush the TLB if we actually modified any entries */
>>  	if (unmapped)
>>  		flush_tlb_range(walk->vma, start, end);
>>
>> +	arch_leave_lazy_mmu_mode();
>> +	pte_unmap_unlock(ptep - 1, ptl);
>> +
>>  	return 0;
>>  }
>>
>>
>> base-commit: ffcf9c5700e49c0aee42dcba9a12ba21338e8136
>
> I'm not a TLB-flushing expert, but this matches my understanding (and a
> TLB flushing Linux documentation I stumbled over some while ago but
> cannot quickly find).
>
> In the ordinary try_to_migrate_one() path, flushing would happen via
> ptep_clear_flush() (just like we do for the anon_exclusive case here as
> well), correct?

Correct.


* Re: [PATCH v3 1/3] mm/migrate_device.c: Flush TLB while holding PTL
  2022-08-24 12:26   ` Alistair Popple
@ 2022-08-24 12:35     ` David Hildenbrand
  0 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2022-08-24 12:35 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, akpm, Peter Xu, Nadav Amit, huang ying, LKML,
	Sierra Guiza, Alejandro (Alex),
	Felix Kuehling, Jason Gunthorpe, John Hubbard, Ralph Campbell,
	Matthew Wilcox, Karol Herbst, Lyude Paul, Ben Skeggs,
	Logan Gunthorpe, paulus, linuxppc-dev, stable

On 24.08.22 14:26, Alistair Popple wrote:
> 
> David Hildenbrand <david@redhat.com> writes:
> 
>> On 24.08.22 05:03, Alistair Popple wrote:
>>> When clearing a PTE the TLB should be flushed whilst still holding the
>>> PTL to avoid a potential race with madvise/munmap/etc. For example
>>> consider the following sequence:
>>>
>>>   CPU0                          CPU1
>>>   ----                          ----
>>>
>>>   migrate_vma_collect_pmd()
>>>   pte_unmap_unlock()
>>>                                 madvise(MADV_DONTNEED)
>>>                                 -> zap_pte_range()
>>>                                 pte_offset_map_lock()
>>>                                 [ PTE not present, TLB not flushed ]
>>>                                 pte_unmap_unlock()
>>>                                 [ page is still accessible via stale TLB ]
>>>   flush_tlb_range()
>>>
>>> In this case the page may still be accessed via the stale TLB entry
>>> after madvise returns. Fix this by flushing the TLB while holding the
>>> PTL.
>>>
>>> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>>> Reported-by: Nadav Amit <nadav.amit@gmail.com>
>>> Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
>>> Cc: stable@vger.kernel.org
>>>
>>> ---
>>>
>>> Changes for v3:
>>>
>>>  - New for v3
>>> ---
>>>  mm/migrate_device.c | 5 +++--
>>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>> index 27fb37d..6a5ef9f 100644
>>> --- a/mm/migrate_device.c
>>> +++ b/mm/migrate_device.c
>>> @@ -254,13 +254,14 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>  		migrate->dst[migrate->npages] = 0;
>>>  		migrate->src[migrate->npages++] = mpfn;
>>>  	}
>>> -	arch_leave_lazy_mmu_mode();
>>> -	pte_unmap_unlock(ptep - 1, ptl);
>>>
>>>  	/* Only flush the TLB if we actually modified any entries */
>>>  	if (unmapped)
>>>  		flush_tlb_range(walk->vma, start, end);
>>>
>>> +	arch_leave_lazy_mmu_mode();
>>> +	pte_unmap_unlock(ptep - 1, ptl);
>>> +
>>>  	return 0;
>>>  }
>>>
>>>
>>> base-commit: ffcf9c5700e49c0aee42dcba9a12ba21338e8136
>>
>> I'm not a TLB-flushing expert, but this matches my understanding (and a
>> TLB flushing Linux documentation I stumbled over some while ago but
>> cannot quickly find).
>>
>> In the ordinary try_to_migrate_one() path, flushing would happen via
>> ptep_clear_flush() (just like we do for the anon_exclusive case here as
>> well), correct?
> 
> Correct.
> 

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb



* Re: [PATCH v3 2/3] mm/migrate_device.c: Copy pte dirty bit to page
  2022-08-24  3:03 ` [PATCH v3 2/3] mm/migrate_device.c: Copy pte dirty bit to page Alistair Popple
@ 2022-08-24 15:39   ` Peter Xu
  2022-08-25 22:21     ` Alistair Popple
  0 siblings, 1 reply; 20+ messages in thread
From: Peter Xu @ 2022-08-24 15:39 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, akpm, Nadav Amit, huang ying, LKML, Sierra Guiza,
	Alejandro (Alex),
	Felix Kuehling, Jason Gunthorpe, John Hubbard, David Hildenbrand,
	Ralph Campbell, Matthew Wilcox, Karol Herbst, Lyude Paul,
	Ben Skeggs, Logan Gunthorpe, paulus, linuxppc-dev, stable,
	Huang Ying

On Wed, Aug 24, 2022 at 01:03:38PM +1000, Alistair Popple wrote:
> migrate_vma_setup() has a fast path in migrate_vma_collect_pmd() that
> installs migration entries directly if it can lock the migrating page.
> When removing a dirty pte the dirty bit is supposed to be carried over
> to the underlying page to prevent it being lost.
> 
> Currently migrate_vma_*() can only be used for private anonymous
> mappings. That means loss of the dirty bit usually doesn't result in
> data loss because these pages are typically not file-backed. However
> pages may be backed by swap storage which can result in data loss if an
> attempt is made to migrate a dirty page that doesn't yet have the
> PageDirty flag set.
> 
> In this case migration will fail due to unexpected references but the
> dirty pte bit will be lost. If the page is subsequently reclaimed data
> won't be written back to swap storage as it is considered uptodate,
> resulting in data loss if the page is subsequently accessed.
> 
> Prevent this by copying the dirty bit to the page when removing the pte
> to match what try_to_migrate_one() does.
> 
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> Acked-by: Peter Xu <peterx@redhat.com>
> Reported-by: Huang Ying <ying.huang@intel.com>
> Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
> Cc: stable@vger.kernel.org
> 
> ---
> 
> Changes for v3:
> 
>  - Defer TLB flushing
>  - Split a TLB flushing fix into a separate change.
> 
> Changes for v2:
> 
>  - Fixed up Reported-by tag.
>  - Added Peter's Acked-by.
>  - Atomically read and clear the pte to prevent the dirty bit getting
>    set after reading it.
>  - Added fixes tag
> ---
>  mm/migrate_device.c |  9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index 6a5ef9f..51d9afa 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -7,6 +7,7 @@
>  #include <linux/export.h>
>  #include <linux/memremap.h>
>  #include <linux/migrate.h>
> +#include <linux/mm.h>
>  #include <linux/mm_inline.h>
>  #include <linux/mmu_notifier.h>
>  #include <linux/oom.h>
> @@ -196,7 +197,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>  			anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
>  			if (anon_exclusive) {
>  				flush_cache_page(vma, addr, pte_pfn(*ptep));
> -				ptep_clear_flush(vma, addr, ptep);
> +				pte = ptep_clear_flush(vma, addr, ptep);
>  
>  				if (page_try_share_anon_rmap(page)) {
>  					set_pte_at(mm, addr, ptep, pte);
> @@ -206,11 +207,15 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>  					goto next;
>  				}
>  			} else {
> -				ptep_get_and_clear(mm, addr, ptep);
> +				pte = ptep_get_and_clear(mm, addr, ptep);
>  			}

I remember that in v2 both flush_cache_page() and ptep_get_and_clear() were
moved above the condition check so they're called unconditionally.  Could
you explain the rationale for why it's changed back (since I think v2 was the
correct approach)?

The other question is if we want to split the patch, would it be better to
move the tlb changes to patch 1, and leave the dirty bit fix in patch 2?

>  
>  			migrate->cpages++;
>  
> +			/* Set the dirty flag on the folio now the pte is gone. */
> +			if (pte_dirty(pte))
> +				folio_mark_dirty(page_folio(page));
> +
>  			/* Setup special migration page table entry */
>  			if (mpfn & MIGRATE_PFN_WRITE)
>  				entry = make_writable_migration_entry(
> -- 
> git-series 0.9.1
> 

-- 
Peter Xu



* Re: [PATCH v3 1/3] mm/migrate_device.c: Flush TLB while holding PTL
  2022-08-24  3:03 [PATCH v3 1/3] mm/migrate_device.c: Flush TLB while holding PTL Alistair Popple
                   ` (2 preceding siblings ...)
  2022-08-24  8:21 ` [PATCH v3 1/3] mm/migrate_device.c: Flush TLB while holding PTL David Hildenbrand
@ 2022-08-25  1:36 ` Huang, Ying
  2022-08-25 22:35   ` Alistair Popple
  3 siblings, 1 reply; 20+ messages in thread
From: Huang, Ying @ 2022-08-25  1:36 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, akpm, Peter Xu, Nadav Amit, huang ying, LKML,
	Sierra Guiza, Alejandro (Alex),
	Felix Kuehling, Jason Gunthorpe, John Hubbard, David Hildenbrand,
	Ralph Campbell, Matthew Wilcox, Karol Herbst, Lyude Paul,
	Ben Skeggs, Logan Gunthorpe, paulus, linuxppc-dev, stable

Alistair Popple <apopple@nvidia.com> writes:

> When clearing a PTE the TLB should be flushed whilst still holding the
> PTL to avoid a potential race with madvise/munmap/etc. For example
> consider the following sequence:
>
>   CPU0                          CPU1
>   ----                          ----
>
>   migrate_vma_collect_pmd()
>   pte_unmap_unlock()
>                                 madvise(MADV_DONTNEED)
>                                 -> zap_pte_range()
>                                 pte_offset_map_lock()
>                                 [ PTE not present, TLB not flushed ]
>                                 pte_unmap_unlock()
>                                 [ page is still accessible via stale TLB ]
>   flush_tlb_range()
>
> In this case the page may still be accessed via the stale TLB entry
> after madvise returns. Fix this by flushing the TLB while holding the
> PTL.
>
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> Reported-by: Nadav Amit <nadav.amit@gmail.com>
> Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
> Cc: stable@vger.kernel.org
>
> ---
>
> Changes for v3:
>
>  - New for v3
> ---
>  mm/migrate_device.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index 27fb37d..6a5ef9f 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -254,13 +254,14 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>  		migrate->dst[migrate->npages] = 0;
>  		migrate->src[migrate->npages++] = mpfn;
>  	}
> -	arch_leave_lazy_mmu_mode();
> -	pte_unmap_unlock(ptep - 1, ptl);
>  
>  	/* Only flush the TLB if we actually modified any entries */
>  	if (unmapped)
>  		flush_tlb_range(walk->vma, start, end);

It appears that we can increase "unmapped" only if ptep_get_and_clear()
is used?

Best Regards,
Huang, Ying

> +	arch_leave_lazy_mmu_mode();
> +	pte_unmap_unlock(ptep - 1, ptl);
> +
>  	return 0;
>  }
>  
>
> base-commit: ffcf9c5700e49c0aee42dcba9a12ba21338e8136


* Re: [PATCH v3 2/3] mm/migrate_device.c: Copy pte dirty bit to page
  2022-08-24 15:39   ` Peter Xu
@ 2022-08-25 22:21     ` Alistair Popple
  2022-08-25 23:27       ` Peter Xu
  0 siblings, 1 reply; 20+ messages in thread
From: Alistair Popple @ 2022-08-25 22:21 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, akpm, Nadav Amit, huang ying, LKML, Sierra Guiza,
	Alejandro (Alex),
	Felix Kuehling, Jason Gunthorpe, John Hubbard, David Hildenbrand,
	Ralph Campbell, Matthew Wilcox, Karol Herbst, Lyude Paul,
	Ben Skeggs, Logan Gunthorpe, paulus, linuxppc-dev, stable,
	Huang Ying


Peter Xu <peterx@redhat.com> writes:

> On Wed, Aug 24, 2022 at 01:03:38PM +1000, Alistair Popple wrote:
>> migrate_vma_setup() has a fast path in migrate_vma_collect_pmd() that
>> installs migration entries directly if it can lock the migrating page.
>> When removing a dirty pte the dirty bit is supposed to be carried over
>> to the underlying page to prevent it being lost.
>>
>> Currently migrate_vma_*() can only be used for private anonymous
>> mappings. That means loss of the dirty bit usually doesn't result in
>> data loss because these pages are typically not file-backed. However
>> pages may be backed by swap storage which can result in data loss if an
>> attempt is made to migrate a dirty page that doesn't yet have the
>> PageDirty flag set.
>>
>> In this case migration will fail due to unexpected references but the
>> dirty pte bit will be lost. If the page is subsequently reclaimed data
>> won't be written back to swap storage as it is considered uptodate,
>> resulting in data loss if the page is subsequently accessed.
>>
>> Prevent this by copying the dirty bit to the page when removing the pte
>> to match what try_to_migrate_one() does.
>>
>> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>> Acked-by: Peter Xu <peterx@redhat.com>
>> Reported-by: Huang Ying <ying.huang@intel.com>
>> Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
>> Cc: stable@vger.kernel.org
>>
>> ---
>>
>> Changes for v3:
>>
>>  - Defer TLB flushing
>>  - Split a TLB flushing fix into a separate change.
>>
>> Changes for v2:
>>
>>  - Fixed up Reported-by tag.
>>  - Added Peter's Acked-by.
>>  - Atomically read and clear the pte to prevent the dirty bit getting
>>    set after reading it.
>>  - Added fixes tag
>> ---
>>  mm/migrate_device.c |  9 +++++++--
>>  1 file changed, 7 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>> index 6a5ef9f..51d9afa 100644
>> --- a/mm/migrate_device.c
>> +++ b/mm/migrate_device.c
>> @@ -7,6 +7,7 @@
>>  #include <linux/export.h>
>>  #include <linux/memremap.h>
>>  #include <linux/migrate.h>
>> +#include <linux/mm.h>
>>  #include <linux/mm_inline.h>
>>  #include <linux/mmu_notifier.h>
>>  #include <linux/oom.h>
>> @@ -196,7 +197,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>  			anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
>>  			if (anon_exclusive) {
>>  				flush_cache_page(vma, addr, pte_pfn(*ptep));
>> -				ptep_clear_flush(vma, addr, ptep);
>> +				pte = ptep_clear_flush(vma, addr, ptep);
>>
>>  				if (page_try_share_anon_rmap(page)) {
>>  					set_pte_at(mm, addr, ptep, pte);
>> @@ -206,11 +207,15 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>  					goto next;
>>  				}
>>  			} else {
>> -				ptep_get_and_clear(mm, addr, ptep);
>> +				pte = ptep_get_and_clear(mm, addr, ptep);
>>  			}
>
> I remember that in v2 both flush_cache_page() and ptep_get_and_clear() are
> moved above the condition check so they're called unconditionally.  Could
> you explain the rational on why it's changed back (since I think v2 was the
> correct approach)?

Mainly because I agree with your original comments that it would be
better to keep the batching of TLB flushing if possible. After the
discussion I don't think there are any issues with HW pte dirty bits
here. There are already other cases where HW needs to get that right
anyway (e.g. zap_pte_range).

> The other question is if we want to split the patch, would it be better to
> move the tlb changes to patch 1, and leave the dirty bit fix in patch 2?

Isn't that already the case? Patch 1 moves the TLB flush to before the PTL
is dropped as suggested, and patch 2 atomically copies the dirty bit without
changing any TLB flushing.

>>
>>  			migrate->cpages++;
>>
>> +			/* Set the dirty flag on the folio now the pte is gone. */
>> +			if (pte_dirty(pte))
>> +				folio_mark_dirty(page_folio(page));
>> +
>>  			/* Setup special migration page table entry */
>>  			if (mpfn & MIGRATE_PFN_WRITE)
>>  				entry = make_writable_migration_entry(
>> --
>> git-series 0.9.1
>>


* Re: [PATCH v3 1/3] mm/migrate_device.c: Flush TLB while holding PTL
  2022-08-25  1:36 ` Huang, Ying
@ 2022-08-25 22:35   ` Alistair Popple
  2022-08-26  0:56     ` Huang, Ying
  0 siblings, 1 reply; 20+ messages in thread
From: Alistair Popple @ 2022-08-25 22:35 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, akpm, Peter Xu, Nadav Amit, huang ying, LKML,
	Sierra Guiza, Alejandro (Alex),
	Felix Kuehling, Jason Gunthorpe, John Hubbard, David Hildenbrand,
	Ralph Campbell, Matthew Wilcox, Karol Herbst, Lyude Paul,
	Ben Skeggs, Logan Gunthorpe, paulus, linuxppc-dev, stable


"Huang, Ying" <ying.huang@intel.com> writes:

> Alistair Popple <apopple@nvidia.com> writes:
>
>> When clearing a PTE the TLB should be flushed whilst still holding the
>> PTL to avoid a potential race with madvise/munmap/etc. For example
>> consider the following sequence:
>>
>>   CPU0                          CPU1
>>   ----                          ----
>>
>>   migrate_vma_collect_pmd()
>>   pte_unmap_unlock()
>>                                 madvise(MADV_DONTNEED)
>>                                 -> zap_pte_range()
>>                                 pte_offset_map_lock()
>>                                 [ PTE not present, TLB not flushed ]
>>                                 pte_unmap_unlock()
>>                                 [ page is still accessible via stale TLB ]
>>   flush_tlb_range()
>>
>> In this case the page may still be accessed via the stale TLB entry
>> after madvise returns. Fix this by flushing the TLB while holding the
>> PTL.
>>
>> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>> Reported-by: Nadav Amit <nadav.amit@gmail.com>
>> Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
>> Cc: stable@vger.kernel.org
>>
>> ---
>>
>> Changes for v3:
>>
>>  - New for v3
>> ---
>>  mm/migrate_device.c | 5 +++--
>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>> index 27fb37d..6a5ef9f 100644
>> --- a/mm/migrate_device.c
>> +++ b/mm/migrate_device.c
>> @@ -254,13 +254,14 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>  		migrate->dst[migrate->npages] = 0;
>>  		migrate->src[migrate->npages++] = mpfn;
>>  	}
>> -	arch_leave_lazy_mmu_mode();
>> -	pte_unmap_unlock(ptep - 1, ptl);
>>
>>  	/* Only flush the TLB if we actually modified any entries */
>>  	if (unmapped)
>>  		flush_tlb_range(walk->vma, start, end);
>
> It appears that we can increase "unmapped" only if ptep_get_and_clear()
> is used?

In other words you mean we only need to increase unmapped if pte_present
&& !anon_exclusive?

Agree, that's a good optimisation to make. However, I'm just trying to
solve a data corruption issue (the dirty bit not being copied to the page)
here, so will post that as a separate optimisation patch. Thanks.
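
For illustration, the suggested follow-up might look roughly like this
(hypothetical sketch only, not part of this series):

	if (anon_exclusive) {
		/* ptep_clear_flush() has already flushed this entry */
		pte = ptep_clear_flush(vma, addr, ptep);
	} else {
		pte = ptep_get_and_clear(mm, addr, ptep);
		/* only this path still needs the batched flush_tlb_range() */
		unmapped++;
	}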

 - Alistair

> Best Regards,
> Huang, Ying
>
>> +	arch_leave_lazy_mmu_mode();
>> +	pte_unmap_unlock(ptep - 1, ptl);
>> +
>>  	return 0;
>>  }
>>
>>
>> base-commit: ffcf9c5700e49c0aee42dcba9a12ba21338e8136


* Re: [PATCH v3 2/3] mm/migrate_device.c: Copy pte dirty bit to page
  2022-08-25 22:21     ` Alistair Popple
@ 2022-08-25 23:27       ` Peter Xu
  2022-08-26  1:02         ` Alistair Popple
  0 siblings, 1 reply; 20+ messages in thread
From: Peter Xu @ 2022-08-25 23:27 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, akpm, Nadav Amit, huang ying, LKML, Sierra Guiza,
	Alejandro (Alex),
	Felix Kuehling, Jason Gunthorpe, John Hubbard, David Hildenbrand,
	Ralph Campbell, Matthew Wilcox, Karol Herbst, Lyude Paul,
	Ben Skeggs, Logan Gunthorpe, paulus, linuxppc-dev, stable,
	Huang Ying

On Fri, Aug 26, 2022 at 08:21:44AM +1000, Alistair Popple wrote:
> 
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Wed, Aug 24, 2022 at 01:03:38PM +1000, Alistair Popple wrote:
> >> migrate_vma_setup() has a fast path in migrate_vma_collect_pmd() that
> >> installs migration entries directly if it can lock the migrating page.
> >> When removing a dirty pte the dirty bit is supposed to be carried over
> >> to the underlying page to prevent it being lost.
> >>
> >> Currently migrate_vma_*() can only be used for private anonymous
> >> mappings. That means loss of the dirty bit usually doesn't result in
> >> data loss because these pages are typically not file-backed. However
> >> pages may be backed by swap storage which can result in data loss if an
> >> attempt is made to migrate a dirty page that doesn't yet have the
> >> PageDirty flag set.
> >>
> >> In this case migration will fail due to unexpected references but the
> >> dirty pte bit will be lost. If the page is subsequently reclaimed data
> >> won't be written back to swap storage as it is considered uptodate,
> >> resulting in data loss if the page is subsequently accessed.
> >>
> >> Prevent this by copying the dirty bit to the page when removing the pte
> >> to match what try_to_migrate_one() does.
> >>
> >> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> >> Acked-by: Peter Xu <peterx@redhat.com>
> >> Reported-by: Huang Ying <ying.huang@intel.com>
> >> Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
> >> Cc: stable@vger.kernel.org
> >>
> >> ---
> >>
> >> Changes for v3:
> >>
> >>  - Defer TLB flushing
> >>  - Split a TLB flushing fix into a separate change.
> >>
> >> Changes for v2:
> >>
> >>  - Fixed up Reported-by tag.
> >>  - Added Peter's Acked-by.
> >>  - Atomically read and clear the pte to prevent the dirty bit getting
> >>    set after reading it.
> >>  - Added fixes tag
> >> ---
> >>  mm/migrate_device.c |  9 +++++++--
> >>  1 file changed, 7 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> >> index 6a5ef9f..51d9afa 100644
> >> --- a/mm/migrate_device.c
> >> +++ b/mm/migrate_device.c
> >> @@ -7,6 +7,7 @@
> >>  #include <linux/export.h>
> >>  #include <linux/memremap.h>
> >>  #include <linux/migrate.h>
> >> +#include <linux/mm.h>
> >>  #include <linux/mm_inline.h>
> >>  #include <linux/mmu_notifier.h>
> >>  #include <linux/oom.h>
> >> @@ -196,7 +197,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> >>  			anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
> >>  			if (anon_exclusive) {
> >>  				flush_cache_page(vma, addr, pte_pfn(*ptep));
> >> -				ptep_clear_flush(vma, addr, ptep);
> >> +				pte = ptep_clear_flush(vma, addr, ptep);
> >>
> >>  				if (page_try_share_anon_rmap(page)) {
> >>  					set_pte_at(mm, addr, ptep, pte);
> >> @@ -206,11 +207,15 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> >>  					goto next;
> >>  				}
> >>  			} else {
> >> -				ptep_get_and_clear(mm, addr, ptep);
> >> +				pte = ptep_get_and_clear(mm, addr, ptep);
> >>  			}
> >
> > I remember that in v2 both flush_cache_page() and ptep_get_and_clear() are
> > moved above the condition check so they're called unconditionally.  Could
> > you explain the rational on why it's changed back (since I think v2 was the
> > correct approach)?
> 
> Mainly because I agree with your original comments, that it would be
> better to keep the batching of TLB flushing if possible. After the
> discussion I don't think there is any issues with HW pte dirty bits
> here. There are already other cases where HW needs to get that right
> anyway (eg. zap_pte_range).

Yes, TLB batching was kept, thanks for doing it that way.  Though if we only
apply patch 1 we'll have both ptep_clear_flush() and the batched flush, which
seems redundant.

> 
> > The other question is if we want to split the patch, would it be better to
> > move the tlb changes to patch 1, and leave the dirty bit fix in patch 2?
> 
> Isn't that already the case? Patch 1 moves the TLB flush before the PTL
> as suggested, patch 2 atomically copies the dirty bit without changing
> any TLB flushing.

IMHO it's cleaner to have patch 1 fix the batched flush, replace
ptep_clear_flush() with ptep_get_and_clear() and update the pte properly.

No strong opinions on the layout, but I still think we should drop the
redundant ptep_clear_flush() above and meanwhile add the flush_cache_page()
properly for the !exclusive case too.

Thanks,

-- 
Peter Xu



* Re: [PATCH v3 1/3] mm/migrate_device.c: Flush TLB while holding PTL
  2022-08-25 22:35   ` Alistair Popple
@ 2022-08-26  0:56     ` Huang, Ying
  0 siblings, 0 replies; 20+ messages in thread
From: Huang, Ying @ 2022-08-26  0:56 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, akpm, Peter Xu, Nadav Amit, huang ying, LKML,
	Sierra Guiza, Alejandro (Alex),
	Felix Kuehling, Jason Gunthorpe, John Hubbard, David Hildenbrand,
	Ralph Campbell, Matthew Wilcox, Karol Herbst, Lyude Paul,
	Ben Skeggs, Logan Gunthorpe, paulus, linuxppc-dev, stable

Alistair Popple <apopple@nvidia.com> writes:

> "Huang, Ying" <ying.huang@intel.com> writes:
>
>> Alistair Popple <apopple@nvidia.com> writes:
>>
>>> When clearing a PTE the TLB should be flushed whilst still holding the
>>> PTL to avoid a potential race with madvise/munmap/etc. For example
>>> consider the following sequence:
>>>
>>>   CPU0                          CPU1
>>>   ----                          ----
>>>
>>>   migrate_vma_collect_pmd()
>>>   pte_unmap_unlock()
>>>                                 madvise(MADV_DONTNEED)
>>>                                 -> zap_pte_range()
>>>                                 pte_offset_map_lock()
>>>                                 [ PTE not present, TLB not flushed ]
>>>                                 pte_unmap_unlock()
>>>                                 [ page is still accessible via stale TLB ]
>>>   flush_tlb_range()
>>>
>>> In this case the page may still be accessed via the stale TLB entry
>>> after madvise returns. Fix this by flushing the TLB while holding the
>>> PTL.
>>>
>>> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>>> Reported-by: Nadav Amit <nadav.amit@gmail.com>
>>> Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
>>> Cc: stable@vger.kernel.org
>>>
>>> ---
>>>
>>> Changes for v3:
>>>
>>>  - New for v3
>>> ---
>>>  mm/migrate_device.c | 5 +++--
>>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>> index 27fb37d..6a5ef9f 100644
>>> --- a/mm/migrate_device.c
>>> +++ b/mm/migrate_device.c
>>> @@ -254,13 +254,14 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>  		migrate->dst[migrate->npages] = 0;
>>>  		migrate->src[migrate->npages++] = mpfn;
>>>  	}
>>> -	arch_leave_lazy_mmu_mode();
>>> -	pte_unmap_unlock(ptep - 1, ptl);
>>>
>>>  	/* Only flush the TLB if we actually modified any entries */
>>>  	if (unmapped)
>>>  		flush_tlb_range(walk->vma, start, end);
>>
>> It appears that we can increase "unmapped" only if ptep_get_and_clear()
>> is used?
>
> In other words you mean we only need to increase unmapped if pte_present
> && !anon_exclusive?
>
> Agree, that's a good optimisation to make. However I'm just trying to
> solve a data corruption issue (not dirtying the page) here, so will post
> that as a separate optimisation patch. Thanks.

OK.  Then the patch looks good to me.  Feel free to add my,

Reviewed-by: "Huang, Ying" <ying.huang@intel.com>

Best Regards,
Huang, Ying

>>
>>> +	arch_leave_lazy_mmu_mode();
>>> +	pte_unmap_unlock(ptep - 1, ptl);
>>> +
>>>  	return 0;
>>>  }
>>>
>>>
>>> base-commit: ffcf9c5700e49c0aee42dcba9a12ba21338e8136


* Re: [PATCH v3 2/3] mm/migrate_device.c: Copy pte dirty bit to page
  2022-08-25 23:27       ` Peter Xu
@ 2022-08-26  1:02         ` Alistair Popple
  2022-08-26  1:14           ` Huang, Ying
  2022-08-26 14:32           ` Peter Xu
  0 siblings, 2 replies; 20+ messages in thread
From: Alistair Popple @ 2022-08-26  1:02 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, akpm, Nadav Amit, huang ying, LKML, Sierra Guiza,
	Alejandro (Alex),
	Felix Kuehling, Jason Gunthorpe, John Hubbard, David Hildenbrand,
	Ralph Campbell, Matthew Wilcox, Karol Herbst, Lyude Paul,
	Ben Skeggs, Logan Gunthorpe, paulus, linuxppc-dev, stable,
	Huang Ying


Peter Xu <peterx@redhat.com> writes:

> On Fri, Aug 26, 2022 at 08:21:44AM +1000, Alistair Popple wrote:
>>
>> Peter Xu <peterx@redhat.com> writes:
>>
>> > On Wed, Aug 24, 2022 at 01:03:38PM +1000, Alistair Popple wrote:
>> >> migrate_vma_setup() has a fast path in migrate_vma_collect_pmd() that
>> >> installs migration entries directly if it can lock the migrating page.
>> >> When removing a dirty pte the dirty bit is supposed to be carried over
>> >> to the underlying page to prevent it being lost.
>> >>
>> >> Currently migrate_vma_*() can only be used for private anonymous
>> >> mappings. That means loss of the dirty bit usually doesn't result in
>> >> data loss because these pages are typically not file-backed. However
>> >> pages may be backed by swap storage which can result in data loss if an
>> >> attempt is made to migrate a dirty page that doesn't yet have the
>> >> PageDirty flag set.
>> >>
>> >> In this case migration will fail due to unexpected references but the
>> >> dirty pte bit will be lost. If the page is subsequently reclaimed data
>> >> won't be written back to swap storage as it is considered uptodate,
>> >> resulting in data loss if the page is subsequently accessed.
>> >>
>> >> Prevent this by copying the dirty bit to the page when removing the pte
>> >> to match what try_to_migrate_one() does.
>> >>
>> >> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>> >> Acked-by: Peter Xu <peterx@redhat.com>
>> >> Reported-by: Huang Ying <ying.huang@intel.com>
>> >> Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
>> >> Cc: stable@vger.kernel.org
>> >>
>> >> ---
>> >>
>> >> Changes for v3:
>> >>
>> >>  - Defer TLB flushing
>> >>  - Split a TLB flushing fix into a separate change.
>> >>
>> >> Changes for v2:
>> >>
>> >>  - Fixed up Reported-by tag.
>> >>  - Added Peter's Acked-by.
>> >>  - Atomically read and clear the pte to prevent the dirty bit getting
>> >>    set after reading it.
>> >>  - Added fixes tag
>> >> ---
>> >>  mm/migrate_device.c |  9 +++++++--
>> >>  1 file changed, 7 insertions(+), 2 deletions(-)
>> >>
>> >> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>> >> index 6a5ef9f..51d9afa 100644
>> >> --- a/mm/migrate_device.c
>> >> +++ b/mm/migrate_device.c
>> >> @@ -7,6 +7,7 @@
>> >>  #include <linux/export.h>
>> >>  #include <linux/memremap.h>
>> >>  #include <linux/migrate.h>
>> >> +#include <linux/mm.h>
>> >>  #include <linux/mm_inline.h>
>> >>  #include <linux/mmu_notifier.h>
>> >>  #include <linux/oom.h>
>> >> @@ -196,7 +197,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>> >>  			anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
>> >>  			if (anon_exclusive) {
>> >>  				flush_cache_page(vma, addr, pte_pfn(*ptep));
>> >> -				ptep_clear_flush(vma, addr, ptep);
>> >> +				pte = ptep_clear_flush(vma, addr, ptep);
>> >>
>> >>  				if (page_try_share_anon_rmap(page)) {
>> >>  					set_pte_at(mm, addr, ptep, pte);
>> >> @@ -206,11 +207,15 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>> >>  					goto next;
>> >>  				}
>> >>  			} else {
>> >> -				ptep_get_and_clear(mm, addr, ptep);
>> >> +				pte = ptep_get_and_clear(mm, addr, ptep);
>> >>  			}
>> >
>> > I remember that in v2 both flush_cache_page() and ptep_get_and_clear() are
>> > moved above the condition check so they're called unconditionally.  Could
>> > you explain the rational on why it's changed back (since I think v2 was the
>> > correct approach)?
>>
>> Mainly because I agree with your original comments, that it would be
>> better to keep the batching of TLB flushing if possible. After the
>> discussion I don't think there is any issues with HW pte dirty bits
>> here. There are already other cases where HW needs to get that right
>> anyway (eg. zap_pte_range).
>
> Yes tlb batching was kept, thanks for doing that way.  Though if only apply
> patch 1 we'll have both ptep_clear_flush() and batched flush which seems to
> be redundant.
>
>>
>> > The other question is if we want to split the patch, would it be better to
>> > move the tlb changes to patch 1, and leave the dirty bit fix in patch 2?
>>
>> Isn't that already the case? Patch 1 moves the TLB flush before the PTL
>> as suggested, patch 2 atomically copies the dirty bit without changing
>> any TLB flushing.
>
> IMHO it's cleaner to have patch 1 fix batch flush, replace
> ptep_clear_flush() with ptep_get_and_clear() and update pte properly.

Which ptep_clear_flush() are you referring to? This one?

			if (anon_exclusive) {
				flush_cache_page(vma, addr, pte_pfn(*ptep));
				ptep_clear_flush(vma, addr, ptep);

My understanding is that we need to do a flush for anon_exclusive.

> No strong opinions on the layout, but I still think we should drop the
> redundant ptep_clear_flush() above, meanwhile add the flush_cache_page()
> properly for !exclusive case too.

Good point, we need flush_cache_page() for !exclusive. Will add.
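
For illustration, the change being suggested might look roughly like this
(sketch only, not the posted code):

	flush_cache_page(vma, addr, pte_pfn(*ptep));
	if (anon_exclusive) {
		pte = ptep_clear_flush(vma, addr, ptep);
		/* page_try_share_anon_rmap() handling as before */
	} else {
		pte = ptep_get_and_clear(mm, addr, ptep);
	}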

> Thanks,


* Re: [PATCH v3 2/3] mm/migrate_device.c: Copy pte dirty bit to page
  2022-08-26  1:02         ` Alistair Popple
@ 2022-08-26  1:14           ` Huang, Ying
  2022-08-26 14:32           ` Peter Xu
  1 sibling, 0 replies; 20+ messages in thread
From: Huang, Ying @ 2022-08-26  1:14 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Peter Xu, linux-mm, akpm, Nadav Amit, huang ying, LKML,
	Sierra Guiza, Alejandro (Alex),
	Felix Kuehling, Jason Gunthorpe, John Hubbard, David Hildenbrand,
	Ralph Campbell, Matthew Wilcox, Karol Herbst, Lyude Paul,
	Ben Skeggs, Logan Gunthorpe, paulus, linuxppc-dev, stable

Alistair Popple <apopple@nvidia.com> writes:

> Peter Xu <peterx@redhat.com> writes:
>
>> On Fri, Aug 26, 2022 at 08:21:44AM +1000, Alistair Popple wrote:
>>>
>>> Peter Xu <peterx@redhat.com> writes:
>>>
>>> > On Wed, Aug 24, 2022 at 01:03:38PM +1000, Alistair Popple wrote:
>>> >> migrate_vma_setup() has a fast path in migrate_vma_collect_pmd() that
>>> >> installs migration entries directly if it can lock the migrating page.
>>> >> When removing a dirty pte the dirty bit is supposed to be carried over
>>> >> to the underlying page to prevent it being lost.
>>> >>
>>> >> Currently migrate_vma_*() can only be used for private anonymous
>>> >> mappings. That means loss of the dirty bit usually doesn't result in
>>> >> data loss because these pages are typically not file-backed. However
>>> >> pages may be backed by swap storage which can result in data loss if an
>>> >> attempt is made to migrate a dirty page that doesn't yet have the
>>> >> PageDirty flag set.
>>> >>
>>> >> In this case migration will fail due to unexpected references but the
>>> >> dirty pte bit will be lost. If the page is subsequently reclaimed data
>>> >> won't be written back to swap storage as it is considered uptodate,
>>> >> resulting in data loss if the page is subsequently accessed.
>>> >>
>>> >> Prevent this by copying the dirty bit to the page when removing the pte
>>> >> to match what try_to_migrate_one() does.
>>> >>
>>> >> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>>> >> Acked-by: Peter Xu <peterx@redhat.com>
>>> >> Reported-by: Huang Ying <ying.huang@intel.com>
>>> >> Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
>>> >> Cc: stable@vger.kernel.org
>>> >>
>>> >> ---
>>> >>
>>> >> Changes for v3:
>>> >>
>>> >>  - Defer TLB flushing
>>> >>  - Split a TLB flushing fix into a separate change.
>>> >>
>>> >> Changes for v2:
>>> >>
>>> >>  - Fixed up Reported-by tag.
>>> >>  - Added Peter's Acked-by.
>>> >>  - Atomically read and clear the pte to prevent the dirty bit getting
>>> >>    set after reading it.
>>> >>  - Added fixes tag
>>> >> ---
>>> >>  mm/migrate_device.c |  9 +++++++--
>>> >>  1 file changed, 7 insertions(+), 2 deletions(-)
>>> >>
>>> >> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>> >> index 6a5ef9f..51d9afa 100644
>>> >> --- a/mm/migrate_device.c
>>> >> +++ b/mm/migrate_device.c
>>> >> @@ -7,6 +7,7 @@
>>> >>  #include <linux/export.h>
>>> >>  #include <linux/memremap.h>
>>> >>  #include <linux/migrate.h>
>>> >> +#include <linux/mm.h>
>>> >>  #include <linux/mm_inline.h>
>>> >>  #include <linux/mmu_notifier.h>
>>> >>  #include <linux/oom.h>
>>> >> @@ -196,7 +197,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>> >>  			anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
>>> >>  			if (anon_exclusive) {
>>> >>  				flush_cache_page(vma, addr, pte_pfn(*ptep));
>>> >> -				ptep_clear_flush(vma, addr, ptep);
>>> >> +				pte = ptep_clear_flush(vma, addr, ptep);
>>> >>
>>> >>  				if (page_try_share_anon_rmap(page)) {
>>> >>  					set_pte_at(mm, addr, ptep, pte);
>>> >> @@ -206,11 +207,15 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>> >>  					goto next;
>>> >>  				}
>>> >>  			} else {
>>> >> -				ptep_get_and_clear(mm, addr, ptep);
>>> >> +				pte = ptep_get_and_clear(mm, addr, ptep);
>>> >>  			}
>>> >
>>> > I remember that in v2 both flush_cache_page() and ptep_get_and_clear() are
>>> > moved above the condition check so they're called unconditionally.  Could
>>> > you explain the rational on why it's changed back (since I think v2 was the
>>> > correct approach)?
>>>
>>> Mainly because I agree with your original comments, that it would be
>>> better to keep the batching of TLB flushing if possible. After the
>>> discussion I don't think there is any issues with HW pte dirty bits
>>> here. There are already other cases where HW needs to get that right
>>> anyway (eg. zap_pte_range).
>>
>> Yes tlb batching was kept, thanks for doing that way.  Though if only apply
>> patch 1 we'll have both ptep_clear_flush() and batched flush which seems to
>> be redundant.
>>
>>>
>>> > The other question is if we want to split the patch, would it be better to
>>> > move the tlb changes to patch 1, and leave the dirty bit fix in patch 2?
>>>
>>> Isn't that already the case? Patch 1 moves the TLB flush before the PTL
>>> as suggested, patch 2 atomically copies the dirty bit without changing
>>> any TLB flushing.
>>
>> IMHO it's cleaner to have patch 1 fix batch flush, replace
>> ptep_clear_flush() with ptep_get_and_clear() and update pte properly.
>
> Which ptep_clear_flush() are you referring to? This one?
>
> 			if (anon_exclusive) {
> 				flush_cache_page(vma, addr, pte_pfn(*ptep));
> 				ptep_clear_flush(vma, addr, ptep);
>
> My understanding is that we need to do a flush for anon_exclusive.
>
>> No strong opinions on the layout, but I still think we should drop the
>> redundant ptep_clear_flush() above, meanwhile add the flush_cache_page()
>> properly for !exclusive case too.
>
> Good point, we need flush_cache_page() for !exclusive. Will add.

That can be in another patch.  For the patch itself, it looks good to
me.  Feel free to add,

Reviewed-by: "Huang, Ying" <ying.huang@intel.com>

Best Regards,
Huang, Ying


* Re: [PATCH v3 2/3] mm/migrate_device.c: Copy pte dirty bit to page
  2022-08-26  1:02         ` Alistair Popple
  2022-08-26  1:14           ` Huang, Ying
@ 2022-08-26 14:32           ` Peter Xu
  2022-08-26 14:47             ` David Hildenbrand
  1 sibling, 1 reply; 20+ messages in thread
From: Peter Xu @ 2022-08-26 14:32 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, akpm, Nadav Amit, huang ying, LKML, Sierra Guiza,
	Alejandro (Alex),
	Felix Kuehling, Jason Gunthorpe, John Hubbard, David Hildenbrand,
	Ralph Campbell, Matthew Wilcox, Karol Herbst, Lyude Paul,
	Ben Skeggs, Logan Gunthorpe, paulus, linuxppc-dev, stable,
	Huang Ying

On Fri, Aug 26, 2022 at 11:02:58AM +1000, Alistair Popple wrote:
> 
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Fri, Aug 26, 2022 at 08:21:44AM +1000, Alistair Popple wrote:
> >>
> >> Peter Xu <peterx@redhat.com> writes:
> >>
> >> > On Wed, Aug 24, 2022 at 01:03:38PM +1000, Alistair Popple wrote:
> >> >> migrate_vma_setup() has a fast path in migrate_vma_collect_pmd() that
> >> >> installs migration entries directly if it can lock the migrating page.
> >> >> When removing a dirty pte the dirty bit is supposed to be carried over
> >> >> to the underlying page to prevent it being lost.
> >> >>
> >> >> Currently migrate_vma_*() can only be used for private anonymous
> >> >> mappings. That means loss of the dirty bit usually doesn't result in
> >> >> data loss because these pages are typically not file-backed. However
> >> >> pages may be backed by swap storage which can result in data loss if an
> >> >> attempt is made to migrate a dirty page that doesn't yet have the
> >> >> PageDirty flag set.
> >> >>
> >> >> In this case migration will fail due to unexpected references but the
> >> >> dirty pte bit will be lost. If the page is subsequently reclaimed data
> >> >> won't be written back to swap storage as it is considered uptodate,
> >> >> resulting in data loss if the page is subsequently accessed.
> >> >>
> >> >> Prevent this by copying the dirty bit to the page when removing the pte
> >> >> to match what try_to_migrate_one() does.
> >> >>
> >> >> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> >> >> Acked-by: Peter Xu <peterx@redhat.com>
> >> >> Reported-by: Huang Ying <ying.huang@intel.com>
> >> >> Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
> >> >> Cc: stable@vger.kernel.org
> >> >>
> >> >> ---
> >> >>
> >> >> Changes for v3:
> >> >>
> >> >>  - Defer TLB flushing
> >> >>  - Split a TLB flushing fix into a separate change.
> >> >>
> >> >> Changes for v2:
> >> >>
> >> >>  - Fixed up Reported-by tag.
> >> >>  - Added Peter's Acked-by.
> >> >>  - Atomically read and clear the pte to prevent the dirty bit getting
> >> >>    set after reading it.
> >> >>  - Added fixes tag
> >> >> ---
> >> >>  mm/migrate_device.c |  9 +++++++--
> >> >>  1 file changed, 7 insertions(+), 2 deletions(-)
> >> >>
> >> >> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> >> >> index 6a5ef9f..51d9afa 100644
> >> >> --- a/mm/migrate_device.c
> >> >> +++ b/mm/migrate_device.c
> >> >> @@ -7,6 +7,7 @@
> >> >>  #include <linux/export.h>
> >> >>  #include <linux/memremap.h>
> >> >>  #include <linux/migrate.h>
> >> >> +#include <linux/mm.h>
> >> >>  #include <linux/mm_inline.h>
> >> >>  #include <linux/mmu_notifier.h>
> >> >>  #include <linux/oom.h>
> >> >> @@ -196,7 +197,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> >> >>  			anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
> >> >>  			if (anon_exclusive) {
> >> >>  				flush_cache_page(vma, addr, pte_pfn(*ptep));
> >> >> -				ptep_clear_flush(vma, addr, ptep);
> >> >> +				pte = ptep_clear_flush(vma, addr, ptep);
> >> >>
> >> >>  				if (page_try_share_anon_rmap(page)) {
> >> >>  					set_pte_at(mm, addr, ptep, pte);
> >> >> @@ -206,11 +207,15 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> >> >>  					goto next;
> >> >>  				}
> >> >>  			} else {
> >> >> -				ptep_get_and_clear(mm, addr, ptep);
> >> >> +				pte = ptep_get_and_clear(mm, addr, ptep);
> >> >>  			}
> >> >
> >> > I remember that in v2 both flush_cache_page() and ptep_get_and_clear() are
> >> > moved above the condition check so they're called unconditionally.  Could
> >> > you explain the rationale on why it's changed back (since I think v2 was the
> >> > correct approach)?
> >>
> >> Mainly because I agree with your original comments, that it would be
> >> better to keep the batching of TLB flushing if possible. After the
> >> discussion I don't think there are any issues with HW pte dirty bits
> >> here. There are already other cases where HW needs to get that right
> >> anyway (eg. zap_pte_range).
> >
> > Yes tlb batching was kept, thanks for doing that way.  Though if only apply
> > patch 1 we'll have both ptep_clear_flush() and batched flush which seems to
> > be redundant.
> >
> >>
> >> > The other question is if we want to split the patch, would it be better to
> >> > move the tlb changes to patch 1, and leave the dirty bit fix in patch 2?
> >>
> >> Isn't that already the case? Patch 1 moves the TLB flush before the PTL
> >> as suggested, patch 2 atomically copies the dirty bit without changing
> >> any TLB flushing.
> >
> > IMHO it's cleaner to have patch 1 fix batch flush, replace
> > ptep_clear_flush() with ptep_get_and_clear() and update pte properly.
> 
> Which ptep_clear_flush() are you referring to? This one?
> 
> 			if (anon_exclusive) {
> 				flush_cache_page(vma, addr, pte_pfn(*ptep));
> 				ptep_clear_flush(vma, addr, ptep);

Correct.

> 
> My understanding is that we need to do a flush for anon_exclusive.

To me anon exclusive only shows this mm exclusively owns this page. I
didn't quickly figure out why that requires different handling of tlb
flushes.  Did I perhaps miss something?

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 2/3] mm/migrate_device.c: Copy pte dirty bit to page
  2022-08-26 14:32           ` Peter Xu
@ 2022-08-26 14:47             ` David Hildenbrand
  2022-08-26 15:55               ` Peter Xu
  0 siblings, 1 reply; 20+ messages in thread
From: David Hildenbrand @ 2022-08-26 14:47 UTC (permalink / raw)
  To: Peter Xu, Alistair Popple
  Cc: linux-mm, akpm, Nadav Amit, huang ying, LKML, Sierra Guiza,
	Alejandro (Alex),
	Felix Kuehling, Jason Gunthorpe, John Hubbard, Ralph Campbell,
	Matthew Wilcox, Karol Herbst, Lyude Paul, Ben Skeggs,
	Logan Gunthorpe, paulus, linuxppc-dev, stable, Huang Ying

On 26.08.22 16:32, Peter Xu wrote:
> On Fri, Aug 26, 2022 at 11:02:58AM +1000, Alistair Popple wrote:
>>
>> Peter Xu <peterx@redhat.com> writes:
>>
>>> On Fri, Aug 26, 2022 at 08:21:44AM +1000, Alistair Popple wrote:
>>>>
>>>> Peter Xu <peterx@redhat.com> writes:
>>>>
>>>>> On Wed, Aug 24, 2022 at 01:03:38PM +1000, Alistair Popple wrote:
>>>>>> migrate_vma_setup() has a fast path in migrate_vma_collect_pmd() that
>>>>>> installs migration entries directly if it can lock the migrating page.
>>>>>> When removing a dirty pte the dirty bit is supposed to be carried over
>>>>>> to the underlying page to prevent it being lost.
>>>>>>
>>>>>> Currently migrate_vma_*() can only be used for private anonymous
>>>>>> mappings. That means loss of the dirty bit usually doesn't result in
>>>>>> data loss because these pages are typically not file-backed. However
>>>>>> pages may be backed by swap storage which can result in data loss if an
>>>>>> attempt is made to migrate a dirty page that doesn't yet have the
>>>>>> PageDirty flag set.
>>>>>>
>>>>>> In this case migration will fail due to unexpected references but the
>>>>>> dirty pte bit will be lost. If the page is subsequently reclaimed data
>>>>>> won't be written back to swap storage as it is considered uptodate,
>>>>>> resulting in data loss if the page is subsequently accessed.
>>>>>>
>>>>>> Prevent this by copying the dirty bit to the page when removing the pte
>>>>>> to match what try_to_migrate_one() does.
>>>>>>
>>>>>> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>>>>>> Acked-by: Peter Xu <peterx@redhat.com>
>>>>>> Reported-by: Huang Ying <ying.huang@intel.com>
>>>>>> Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
>>>>>> Cc: stable@vger.kernel.org
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> Changes for v3:
>>>>>>
>>>>>>  - Defer TLB flushing
>>>>>>  - Split a TLB flushing fix into a separate change.
>>>>>>
>>>>>> Changes for v2:
>>>>>>
>>>>>>  - Fixed up Reported-by tag.
>>>>>>  - Added Peter's Acked-by.
>>>>>>  - Atomically read and clear the pte to prevent the dirty bit getting
>>>>>>    set after reading it.
>>>>>>  - Added fixes tag
>>>>>> ---
>>>>>>  mm/migrate_device.c |  9 +++++++--
>>>>>>  1 file changed, 7 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>>>>> index 6a5ef9f..51d9afa 100644
>>>>>> --- a/mm/migrate_device.c
>>>>>> +++ b/mm/migrate_device.c
>>>>>> @@ -7,6 +7,7 @@
>>>>>>  #include <linux/export.h>
>>>>>>  #include <linux/memremap.h>
>>>>>>  #include <linux/migrate.h>
>>>>>> +#include <linux/mm.h>
>>>>>>  #include <linux/mm_inline.h>
>>>>>>  #include <linux/mmu_notifier.h>
>>>>>>  #include <linux/oom.h>
>>>>>> @@ -196,7 +197,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>>>  			anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
>>>>>>  			if (anon_exclusive) {
>>>>>>  				flush_cache_page(vma, addr, pte_pfn(*ptep));
>>>>>> -				ptep_clear_flush(vma, addr, ptep);
>>>>>> +				pte = ptep_clear_flush(vma, addr, ptep);
>>>>>>
>>>>>>  				if (page_try_share_anon_rmap(page)) {
>>>>>>  					set_pte_at(mm, addr, ptep, pte);
>>>>>> @@ -206,11 +207,15 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>>>  					goto next;
>>>>>>  				}
>>>>>>  			} else {
>>>>>> -				ptep_get_and_clear(mm, addr, ptep);
>>>>>> +				pte = ptep_get_and_clear(mm, addr, ptep);
>>>>>>  			}
>>>>>
>>>>> I remember that in v2 both flush_cache_page() and ptep_get_and_clear() are
>>>>> moved above the condition check so they're called unconditionally.  Could
>>>>> you explain the rationale on why it's changed back (since I think v2 was the
>>>>> correct approach)?
>>>>
>>>> Mainly because I agree with your original comments, that it would be
>>>> better to keep the batching of TLB flushing if possible. After the
>>>> discussion I don't think there are any issues with HW pte dirty bits
>>>> here. There are already other cases where HW needs to get that right
>>>> anyway (eg. zap_pte_range).
>>>
>>> Yes tlb batching was kept, thanks for doing that way.  Though if only apply
>>> patch 1 we'll have both ptep_clear_flush() and batched flush which seems to
>>> be redundant.
>>>
>>>>
>>>>> The other question is if we want to split the patch, would it be better to
>>>>> move the tlb changes to patch 1, and leave the dirty bit fix in patch 2?
>>>>
>>>> Isn't that already the case? Patch 1 moves the TLB flush before the PTL
>>>> as suggested, patch 2 atomically copies the dirty bit without changing
>>>> any TLB flushing.
>>>
>>> IMHO it's cleaner to have patch 1 fix batch flush, replace
>>> ptep_clear_flush() with ptep_get_and_clear() and update pte properly.
>>
>> Which ptep_clear_flush() are you referring to? This one?
>>
>> 			if (anon_exclusive) {
>> 				flush_cache_page(vma, addr, pte_pfn(*ptep));
>> 				ptep_clear_flush(vma, addr, ptep);
> 
> Correct.
> 
>>
>> My understanding is that we need to do a flush for anon_exclusive.
> 
> To me anon exclusive only shows this mm exclusively owns this page. I
> didn't quickly figure out why that requires different handling of tlb
> flushes.  Did I perhaps miss something?

GUP-fast is the magic bit, we have to make sure that we won't see new
GUP pins, thus the TLB flush.

include/linux/mm.h:gup_must_unshare() contains documentation.

Without GUP-fast, some things would be significantly easier to handle.
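
To make that concrete, the anon-exclusive path in migrate_vma_collect_pmd()
after this series looks roughly like the following (a simplified sketch of
the hunk quoted above, not a verbatim copy of the upstream code):

	if (anon_exclusive) {
		/*
		 * Clear the pte and flush the TLB before trying to clear
		 * PageAnonExclusive via page_try_share_anon_rmap(); see
		 * gup_must_unshare() for the GUP-fast side.
		 */
		flush_cache_page(vma, addr, pte_pfn(*ptep));
		pte = ptep_clear_flush(vma, addr, ptep);

		if (page_try_share_anon_rmap(page)) {
			/* Lost the race: restore the pte and skip the page. */
			set_pte_at(mm, addr, ptep, pte);
			goto next;
		}
	} else {
		/* The TLB flush is batched and issued later, under the PTL. */
		pte = ptep_get_and_clear(mm, addr, ptep);
	}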


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 2/3] mm/migrate_device.c: Copy pte dirty bit to page
  2022-08-26 14:47             ` David Hildenbrand
@ 2022-08-26 15:55               ` Peter Xu
  2022-08-26 16:46                 ` David Hildenbrand
  0 siblings, 1 reply; 20+ messages in thread
From: Peter Xu @ 2022-08-26 15:55 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alistair Popple, linux-mm, akpm, Nadav Amit, huang ying, LKML,
	Sierra Guiza, Alejandro (Alex),
	Felix Kuehling, Jason Gunthorpe, John Hubbard, Ralph Campbell,
	Matthew Wilcox, Karol Herbst, Lyude Paul, Ben Skeggs,
	Logan Gunthorpe, paulus, linuxppc-dev, stable, Huang Ying

On Fri, Aug 26, 2022 at 04:47:22PM +0200, David Hildenbrand wrote:
> > To me anon exclusive only shows this mm exclusively owns this page. I
> > didn't quickly figure out why that requires different handling of tlb
> > flushes.  Did I perhaps miss something?
> 
> GUP-fast is the magic bit, we have to make sure that we won't see new
> GUP pins, thus the TLB flush.
> 
> include/linux/mm.h:gup_must_unshare() contains documentation.

Hmm.. Shouldn't ptep_get_and_clear() (e.g., xchg() on x86_64) already
guarantees that no other process/thread will see this pte anymore
afterwards?

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 2/3] mm/migrate_device.c: Copy pte dirty bit to page
  2022-08-26 15:55               ` Peter Xu
@ 2022-08-26 16:46                 ` David Hildenbrand
  2022-08-26 21:37                   ` Peter Xu
  0 siblings, 1 reply; 20+ messages in thread
From: David Hildenbrand @ 2022-08-26 16:46 UTC (permalink / raw)
  To: Peter Xu
  Cc: Alistair Popple, linux-mm, akpm, Nadav Amit, huang ying, LKML,
	Sierra Guiza, Alejandro (Alex),
	Felix Kuehling, Jason Gunthorpe, John Hubbard, Ralph Campbell,
	Matthew Wilcox, Karol Herbst, Lyude Paul, Ben Skeggs,
	Logan Gunthorpe, paulus, linuxppc-dev, stable, Huang Ying

On 26.08.22 17:55, Peter Xu wrote:
> On Fri, Aug 26, 2022 at 04:47:22PM +0200, David Hildenbrand wrote:
>>> To me anon exclusive only shows this mm exclusively owns this page. I
>>> didn't quickly figure out why that requires different handling of tlb
>>> flushes.  Did I perhaps miss something?
>>
>> GUP-fast is the magic bit, we have to make sure that we won't see new
>> GUP pins, thus the TLB flush.
>>
>> include/linux/mm.h:gup_must_unshare() contains documentation.
> 
> Hmm.. Shouldn't ptep_get_and_clear() (e.g., xchg() on x86_64) already
> guarantees that no other process/thread will see this pte anymore
> afterwards?

You could have a GUP-fast thread that just looked up the PTE and is
going to pin the page afterwards, after the ptep_get_and_clear()
returned. You'll have to wait until that thread finished.
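
A rough sketch of the window I mean, assuming the anon-exclusive path used
only ptep_get_and_clear() with no TLB flush:

  CPU0 (migrate_vma_collect_pmd)     CPU1 (GUP-fast, IRQs disabled)
  ------------------------------     ------------------------------
                                     reads the pte, page looks pinnable
  ptep_get_and_clear()
  page_try_share_anon_rmap()
  [ PageAnonExclusive cleared ]
                                     pins the page via the stale pte value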

Another user that relies on this interaction between GUP-fast and TLB
flushing is for example mm/ksm.c:write_protect_page()

There is a comment in there explaining the interaction in a bit more detail.

Maybe we'll be able to handle this differently in the future (maybe once
this turns out to be an actual performance problem). Unfortunately,
mm->write_protect_seq isn't easily usable because we'd have to make
sure we're the exclusive writer.


For now, it's not too complicated. For PTEs:
* try_to_migrate_one() already uses ptep_clear_flush().
* try_to_unmap_one() already conditionally used ptep_clear_flush().
* migrate_vma_collect_pmd() was the one case that didn't use it already
 (and I wonder why it's different than try_to_migrate_one()).

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 2/3] mm/migrate_device.c: Copy pte dirty bit to page
  2022-08-26 16:46                 ` David Hildenbrand
@ 2022-08-26 21:37                   ` Peter Xu
  2022-08-26 22:19                     ` David Hildenbrand
  0 siblings, 1 reply; 20+ messages in thread
From: Peter Xu @ 2022-08-26 21:37 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alistair Popple, linux-mm, akpm, Nadav Amit, huang ying, LKML,
	Sierra Guiza, Alejandro (Alex),
	Felix Kuehling, Jason Gunthorpe, John Hubbard, Ralph Campbell,
	Matthew Wilcox, Karol Herbst, Lyude Paul, Ben Skeggs,
	Logan Gunthorpe, paulus, linuxppc-dev, stable, Huang Ying

On Fri, Aug 26, 2022 at 06:46:02PM +0200, David Hildenbrand wrote:
> On 26.08.22 17:55, Peter Xu wrote:
> > On Fri, Aug 26, 2022 at 04:47:22PM +0200, David Hildenbrand wrote:
> >>> To me anon exclusive only shows this mm exclusively owns this page. I
> >>> didn't quickly figure out why that requires different handling of tlb
> >>> flushes.  Did I perhaps miss something?
> >>
> >> GUP-fast is the magic bit, we have to make sure that we won't see new
> >> GUP pins, thus the TLB flush.
> >>
> >> include/linux/mm.h:gup_must_unshare() contains documentation.
> > 
> > Hmm.. Shouldn't ptep_get_and_clear() (e.g., xchg() on x86_64) already
> > guarantees that no other process/thread will see this pte anymore
> > afterwards?
> 
> You could have a GUP-fast thread that just looked up the PTE and is
> going to pin the page afterwards, after the ptep_get_and_clear()
> returned. You'll have to wait until that thread finished.

IIUC the early tlb flush won't protect concurrent fast-gup from happening,
but I think it's safe because fast-gup will check pte after pinning, so
either:

  (1) fast-gup runs before ptep_get_and_clear(), then
      page_try_share_anon_rmap() will fail properly, or,

  (2) fast-gup runs during or after ptep_get_and_clear(), then fast-gup
      will see that either the pte is none or changed, then it'll fail the
      fast-gup itself.
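
I.e. after pinning, the lockless path effectively performs a recheck along
the lines of the sketch below (the real logic lives in
mm/gup.c:gup_pte_range(); the helper name here is made up for illustration):

	/*
	 * orig_pte is the value the fast path read before pinning; if the
	 * entry has since been cleared or changed (e.g. by
	 * migrate_vma_collect_pmd()), the pin is dropped and the fast path
	 * bails out.
	 */
	static bool gup_fast_pte_still_valid(pte_t *ptep, pte_t orig_pte)
	{
		return pte_val(ptep_get(ptep)) == pte_val(orig_pte);
	}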

> 
> Another user that relies on this interaction between GUP-fast and TLB
> flushing is for example mm/ksm.c:write_protect_page()
> 
> There is a comment in there explaining the interaction in a bit more detail.
> 
> Maybe we'll be able to handle this differently in the future (maybe once
> this turns out to be an actual performance problem). Unfortunately,
> mm->write_protect_seq isn't easily usable because we'd have to make
> sure we're the exclusive writer.
> 
> 
> For now, it's not too complicated. For PTEs:
> * try_to_migrate_one() already uses ptep_clear_flush().
> * try_to_unmap_one() already conditionally used ptep_clear_flush().
> * migrate_vma_collect_pmd() was the one case that didn't use it already
>  (and I wonder why it's different than try_to_migrate_one()).

I'm not sure whether I fully get the point, but one major difference here
is that all the rest handle one page at a time, so a tlb flush alongside
the pte clear sounds reasonable.  Even so, try_to_unmap_one() was modified
to use tlb batching, but then I see that anon exclusive made that batching
conditional.  I also have a question there on whether we can keep using
the tlb batching even with anon exclusive pages there.
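
The conditional I mean looks roughly like this in try_to_unmap_one()
(paraphrased from memory, not an exact quote of the current code):

	if (should_defer_flush(mm, flags) && !anon_exclusive) {
		/* Batched: clear now, flush later via the deferred-flush machinery. */
		pteval = ptep_get_and_clear(mm, address, pvmw.pte);
		set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
	} else {
		/* Anon-exclusive or no deferral: clear and flush together. */
		pteval = ptep_clear_flush(vma, address, pvmw.pte);
	}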

In general, I still don't see how a stale tlb could affect anon exclusive
pages when racing with fast-gup, because the only side effect of a stale
tlb is an unwanted page update iiuc; the problem is that fast-gup doesn't
even use the tlb, afaict.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 2/3] mm/migrate_device.c: Copy pte dirty bit to page
  2022-08-26 21:37                   ` Peter Xu
@ 2022-08-26 22:19                     ` David Hildenbrand
  0 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2022-08-26 22:19 UTC (permalink / raw)
  To: Peter Xu
  Cc: Alistair Popple, linux-mm, akpm, Nadav Amit, huang ying, LKML,
	Sierra Guiza, Alejandro (Alex),
	Felix Kuehling, Jason Gunthorpe, John Hubbard, Ralph Campbell,
	Matthew Wilcox, Karol Herbst, Lyude Paul, Ben Skeggs,
	Logan Gunthorpe, paulus, linuxppc-dev, stable, Huang Ying

On 26.08.22 23:37, Peter Xu wrote:
> On Fri, Aug 26, 2022 at 06:46:02PM +0200, David Hildenbrand wrote:
>> On 26.08.22 17:55, Peter Xu wrote:
>>> On Fri, Aug 26, 2022 at 04:47:22PM +0200, David Hildenbrand wrote:
>>>>> To me anon exclusive only shows this mm exclusively owns this page. I
>>>>> didn't quickly figure out why that requires different handling of tlb
>>>>> flushes.  Did I perhaps miss something?
>>>>
>>>> GUP-fast is the magic bit, we have to make sure that we won't see new
>>>> GUP pins, thus the TLB flush.
>>>>
>>>> include/linux/mm.h:gup_must_unshare() contains documentation.
>>>
>>> Hmm.. Shouldn't ptep_get_and_clear() (e.g., xchg() on x86_64) already
>>> guarantees that no other process/thread will see this pte anymore
>>> afterwards?
>>
>> You could have a GUP-fast thread that just looked up the PTE and is
>> going to pin the page afterwards, after the ptep_get_and_clear()
>> returned. You'll have to wait until that thread finished.
> 

Good that we're talking about it, very helpful! If that's actually not
required -- good.


What I learned about how GUP-fast and TLB flushes interact is the following:

GUP-fast disables local interrupts. A TLB flush will send an IPI and
wait until it has been processed. This implies that, once the TLB flush
has succeeded, the interrupt was handled and GUP-fast cannot be running
anymore.

BUT, there is the new RCU variant nowadays, and the TLB flush itself
should not actually perform such a sync. They merely protect the page
tables from getting freed and the THP from getting split IIUC. And
you're correct that that wouldn't help.


> IIUC the early tlb flush won't protect concurrent fast-gup from happening,
> but I think it's safe because fast-gup will check pte after pinning, so
> either:
> 
>   (1) fast-gup runs before ptep_get_and_clear(), then
>       page_try_share_anon_rmap() will fail properly, or,
> 
>   (2) fast-gup runs during or after ptep_get_and_clear(), then fast-gup
>       will see that either the pte is none or changed, then it'll fail the
>       fast-gup itself.

I think you're right and I might have managed to confuse myself with the
write_protect_page() comment. I placed the gup_must_unshare() check
explicitly after the "pte changed" check for this reason. So once the
PTE was cleared, GUP-fast would undo any pin performed on a now-stale PTE.

> 
>>
>> Another user that relies on this interaction between GUP-fast and TLB
>> flushing is for example mm/ksm.c:write_protect_page()
>>
>> There is a comment in there explaining the interaction in a bit more detail.
>>
>> Maybe we'll be able to handle this differently in the future (maybe once
>> this turns out to be an actual performance problem). Unfortunately,
>> mm->write_protect_seq isn't easily usable because we'd have to make
>> sure we're the exclusive writer.
>>
>>
>> For now, it's not too complicated. For PTEs:
>> * try_to_migrate_one() already uses ptep_clear_flush().
>> * try_to_unmap_one() already conditionally used ptep_clear_flush().
>> * migrate_vma_collect_pmd() was the one case that didn't use it already
>>  (and I wonder why it's different than try_to_migrate_one()).
> 
> I'm not sure whether I fully get the point, but one major difference here
> is that all the rest handle one page at a time, so a tlb flush alongside
> the pte clear sounds reasonable.  Even so, try_to_unmap_one() was modified
> to use tlb batching, but then I see that anon exclusive made that batching
> conditional.  I also have a question there on whether we can keep using
> the tlb batching even with anon exclusive pages there.
> 
> In general, I still don't see how a stale tlb could affect anon exclusive
> pages when racing with fast-gup, because the only side effect of a stale
> tlb is an unwanted page update iiuc; the problem is that fast-gup doesn't
> even use the tlb, afaict.

I have the gut feeling that the comment in write_protect_page() is
indeed stale, and that clearing PageAnonExclusive doesn't strictly need
the TLB flush.
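
For context, the pattern that comment documents is roughly the following
(paraphrased, not an exact quote; 'swapped' is whether the page sits in the
swap cache):

	flush_cache_page(vma, pvmw.address, page_to_pfn(page));
	/*
	 * The existing comment argues that, because GUP-fast takes no locks,
	 * the mapcount/refcount check below is only reliable if the pte is
	 * cleared and the TLB flushed first.
	 */
	entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte);
	if (page_mapcount(page) + 1 + swapped != page_count(page)) {
		/* Someone may still be using the page: restore and back off. */
		set_pte_at(mm, pvmw.address, pvmw.pte, entry);
		goto out_unlock;
	}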

I'll try to refresh my memory over the weekend on whether there was any
other case that I had to handle.

Thanks!

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2022-08-26 22:22 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-24  3:03 [PATCH v3 1/3] mm/migrate_device.c: Flush TLB while holding PTL Alistair Popple
2022-08-24  3:03 ` [PATCH v3 2/3] mm/migrate_device.c: Copy pte dirty bit to page Alistair Popple
2022-08-24 15:39   ` Peter Xu
2022-08-25 22:21     ` Alistair Popple
2022-08-25 23:27       ` Peter Xu
2022-08-26  1:02         ` Alistair Popple
2022-08-26  1:14           ` Huang, Ying
2022-08-26 14:32           ` Peter Xu
2022-08-26 14:47             ` David Hildenbrand
2022-08-26 15:55               ` Peter Xu
2022-08-26 16:46                 ` David Hildenbrand
2022-08-26 21:37                   ` Peter Xu
2022-08-26 22:19                     ` David Hildenbrand
2022-08-24  3:03 ` [PATCH v3 3/3] selftests/hmm-tests: Add test for dirty bits Alistair Popple
2022-08-24  8:21 ` [PATCH v3 1/3] mm/migrate_device.c: Flush TLB while holding PTL David Hildenbrand
2022-08-24 12:26   ` Alistair Popple
2022-08-24 12:35     ` David Hildenbrand
2022-08-25  1:36 ` Huang, Ying
2022-08-25 22:35   ` Alistair Popple
2022-08-26  0:56     ` Huang, Ying
