* [PATCH 1/2] mm/migrate_device.c: Copy pte dirty bit to page
@ 2022-08-12  5:22 Alistair Popple
  2022-08-12  5:22 ` [PATCH 2/2] selftests/hmm-tests: Add test for dirty bits Alistair Popple
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Alistair Popple @ 2022-08-12  5:22 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: linux-kernel, Sierra Guiza, Alejandro (Alex),
	Felix Kuehling, Jason Gunthorpe, John Hubbard, David Hildenbrand,
	Ralph Campbell, Matthew Wilcox, Karol Herbst, Lyude Paul,
	Ben Skeggs, Logan Gunthorpe, linuxram, paulus, Alistair Popple,
	Peter Xu

migrate_vma_setup() has a fast path in migrate_vma_collect_pmd() that
installs migration entries directly if it can lock the migrating page.
When removing a dirty pte the dirty bit is supposed to be carried over
to the underlying page to prevent it being lost.

Currently migrate_vma_*() can only be used for private anonymous
mappings. That means loss of the dirty bit usually doesn't result in
data loss because these pages are typically not file-backed. However
pages may be backed by swap storage which can result in data loss if an
attempt is made to migrate a dirty page that doesn't yet have the
PageDirty flag set.

In this case migration will fail due to unexpected references but the
dirty pte bit will be lost. If the page is subsequently reclaimed data
won't be written back to swap storage as it is considered uptodate,
resulting in data loss if the page is subsequently accessed.

Prevent this by copying the dirty bit to the page when removing the pte
to match what try_to_migrate_one() does.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reported-by: Peter Xu <peterx@redhat.com>
---
 mm/migrate_device.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 27fb37d..d38f8a6 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -7,6 +7,7 @@
 #include <linux/export.h>
 #include <linux/memremap.h>
 #include <linux/migrate.h>
+#include <linux/mm.h>
 #include <linux/mm_inline.h>
 #include <linux/mmu_notifier.h>
 #include <linux/oom.h>
@@ -211,6 +212,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 
 			migrate->cpages++;
 
+			/* Set the dirty flag on the folio now the pte is gone. */
+			if (pte_dirty(pte))
+				folio_mark_dirty(page_folio(page));
+
 			/* Setup special migration page table entry */
 			if (mpfn & MIGRATE_PFN_WRITE)
 				entry = make_writable_migration_entry(

base-commit: ffcf9c5700e49c0aee42dcba9a12ba21338e8136
-- 
git-series 0.9.1


* [PATCH 2/2] selftests/hmm-tests: Add test for dirty bits
  2022-08-12  5:22 [PATCH 1/2] mm/migrate_device.c: Copy pte dirty bit to page Alistair Popple
@ 2022-08-12  5:22 ` Alistair Popple
  2022-08-12  7:58   ` Mika Penttilä
  2022-08-15 20:29 ` [PATCH 1/2] mm/migrate_device.c: Copy pte dirty bit to page Peter Xu
  2022-08-16  1:39 ` huang ying
  2 siblings, 1 reply; 14+ messages in thread
From: Alistair Popple @ 2022-08-12  5:22 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: linux-kernel, Sierra Guiza, Alejandro (Alex),
	Felix Kuehling, Jason Gunthorpe, John Hubbard, David Hildenbrand,
	Ralph Campbell, Matthew Wilcox, Karol Herbst, Lyude Paul,
	Ben Skeggs, Logan Gunthorpe, linuxram, paulus, Alistair Popple

We were not correctly copying PTE dirty bits to pages during
migrate_vma_setup() calls. This could potentially lead to data loss, so
add a test for this.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
 tools/testing/selftests/vm/hmm-tests.c | 124 ++++++++++++++++++++++++++-
 1 file changed, 124 insertions(+)

diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c
index 529f53b..70fdb49 100644
--- a/tools/testing/selftests/vm/hmm-tests.c
+++ b/tools/testing/selftests/vm/hmm-tests.c
@@ -1200,6 +1200,130 @@ TEST_F(hmm, migrate_multiple)
 	}
 }
 
+static char cgroup[] = "/sys/fs/cgroup/hmm-test-XXXXXX";
+static int write_cgroup_param(char *cgroup_path, char *param, long value)
+{
+	int ret;
+	FILE *f;
+	char *filename;
+
+	if (asprintf(&filename, "%s/%s", cgroup_path, param) < 0)
+		return -1;
+
+	f = fopen(filename, "w");
+	if (!f) {
+		ret = -1;
+		goto out;
+	}
+
+	ret = fprintf(f, "%ld\n", value);
+	if (ret < 0)
+		goto out1;
+
+	ret = 0;
+
+out1:
+	fclose(f);
+out:
+	free(filename);
+
+	return ret;
+}
+
+static int setup_cgroup(void)
+{
+	pid_t pid = getpid();
+	int ret;
+
+	if (!mkdtemp(cgroup))
+		return -1;
+
+	ret = write_cgroup_param(cgroup, "cgroup.procs", pid);
+	if (ret)
+		return ret;
+
+	return 0;
+}
+
+static int destroy_cgroup(void)
+{
+	pid_t pid = getpid();
+	int ret;
+
+	ret = write_cgroup_param("/sys/fs/cgroup",
+				"cgroup.procs", pid);
+	if (ret)
+		return ret;
+
+	if (rmdir(cgroup))
+		return -1;
+
+	return 0;
+}
+
+/*
+ * Try and migrate a dirty page that has previously been swapped to disk. This
+ * checks that we don't lose dirty bits.
+ */
+TEST_F(hmm, migrate_dirty_page)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	int tmp = 0;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	ASSERT_EQ(setup_cgroup(), 0);
+
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	/* Initialize buffer in system memory. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = 0;
+
+	ASSERT_FALSE(write_cgroup_param(cgroup, "memory.reclaim", 1UL<<30));
+
+	/* Fault pages back in from swap as clean pages */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		tmp += ptr[i];
+
+	/* Dirty the pte */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/*
+	 * Attempt to migrate memory to device, which should fail because
+	 * hopefully some pages are backed by swap storage.
+	 */
+	ASSERT_TRUE(hmm_migrate_sys_to_dev(self->fd, buffer, npages));
+
+	ASSERT_FALSE(write_cgroup_param(cgroup, "memory.reclaim", 1UL<<30));
+
+	/* Check we still see the updated data after restoring from swap. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	hmm_buffer_free(buffer);
+	destroy_cgroup();
+}
+
 /*
  * Read anonymous memory multiple times.
  */
-- 
git-series 0.9.1


* Re: [PATCH 2/2] selftests/hmm-tests: Add test for dirty bits
  2022-08-12  5:22 ` [PATCH 2/2] selftests/hmm-tests: Add test for dirty bits Alistair Popple
@ 2022-08-12  7:58   ` Mika Penttilä
  2022-08-15  2:35     ` Alistair Popple
  0 siblings, 1 reply; 14+ messages in thread
From: Mika Penttilä @ 2022-08-12  7:58 UTC (permalink / raw)
  To: Alistair Popple, linux-mm, akpm
  Cc: linux-kernel, Sierra Guiza, Alejandro (Alex),
	Felix Kuehling, Jason Gunthorpe, John Hubbard, David Hildenbrand,
	Ralph Campbell, Matthew Wilcox, Karol Herbst, Lyude Paul,
	Ben Skeggs, Logan Gunthorpe, linuxram, paulus

Hi Alistair!

On 12.8.2022 8.22, Alistair Popple wrote:
> We were not correctly copying PTE dirty bits to pages during
> migrate_vma_setup() calls. This could potentially lead to data loss, so
> add a test for this.
> 
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> ---
>   tools/testing/selftests/vm/hmm-tests.c | 124 ++++++++++++++++++++++++++-
>   1 file changed, 124 insertions(+)
> 
> diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c
> index 529f53b..70fdb49 100644
> --- a/tools/testing/selftests/vm/hmm-tests.c
> +++ b/tools/testing/selftests/vm/hmm-tests.c
> @@ -1200,6 +1200,130 @@ TEST_F(hmm, migrate_multiple)
>   	}
>   }
>   
> +static char cgroup[] = "/sys/fs/cgroup/hmm-test-XXXXXX";
> +static int write_cgroup_param(char *cgroup_path, char *param, long value)
> +{
> +	int ret;
> +	FILE *f;
> +	char *filename;
> +
> +	if (asprintf(&filename, "%s/%s", cgroup_path, param) < 0)
> +		return -1;
> +
> +	f = fopen(filename, "w");
> +	if (!f) {
> +		ret = -1;
> +		goto out;
> +	}
> +
> +	ret = fprintf(f, "%ld\n", value);
> +	if (ret < 0)
> +		goto out1;
> +
> +	ret = 0;
> +
> +out1:
> +	fclose(f);
> +out:
> +	free(filename);
> +
> +	return ret;
> +}
> +
> +static int setup_cgroup(void)
> +{
> +	pid_t pid = getpid();
> +	int ret;
> +
> +	if (!mkdtemp(cgroup))
> +		return -1;
> +
> +	ret = write_cgroup_param(cgroup, "cgroup.procs", pid);
> +	if (ret)
> +		return ret;
> +
> +	return 0;
> +}
> +
> +static int destroy_cgroup(void)
> +{
> +	pid_t pid = getpid();
> +	int ret;
> +
> +	ret = write_cgroup_param("/sys/fs/cgroup",
> +				"cgroup.procs", pid);
> +	if (ret)
> +		return ret;
> +
> +	if (rmdir(cgroup))
> +		return -1;
> +
> +	return 0;
> +}
> +
> +/*
> + * Try and migrate a dirty page that has previously been swapped to disk. This
> + * checks that we don't lose dirty bits.
> + */
> +TEST_F(hmm, migrate_dirty_page)
> +{
> +	struct hmm_buffer *buffer;
> +	unsigned long npages;
> +	unsigned long size;
> +	unsigned long i;
> +	int *ptr;
> +	int tmp = 0;
> +
> +	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
> +	ASSERT_NE(npages, 0);
> +	size = npages << self->page_shift;
> +
> +	buffer = malloc(sizeof(*buffer));
> +	ASSERT_NE(buffer, NULL);
> +
> +	buffer->fd = -1;
> +	buffer->size = size;
> +	buffer->mirror = malloc(size);
> +	ASSERT_NE(buffer->mirror, NULL);
> +
> +	ASSERT_EQ(setup_cgroup(), 0);
> +
> +	buffer->ptr = mmap(NULL, size,
> +			   PROT_READ | PROT_WRITE,
> +			   MAP_PRIVATE | MAP_ANONYMOUS,
> +			   buffer->fd, 0);
> +	ASSERT_NE(buffer->ptr, MAP_FAILED);
> +
> +	/* Initialize buffer in system memory. */
> +	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
> +		ptr[i] = 0;
> +
> +	ASSERT_FALSE(write_cgroup_param(cgroup, "memory.reclaim", 1UL<<30));
> +
> +	/* Fault pages back in from swap as clean pages */
> +	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
> +		tmp += ptr[i];
> +
> +	/* Dirty the pte */
> +	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
> +		ptr[i] = i;
> +

The anon pages are quite likely in memory at this point, and dirty in pte.



> +	/*
> +	 * Attempt to migrate memory to device, which should fail because
> +	 * hopefully some pages are backed by swap storage.
> +	 */
> +	ASSERT_TRUE(hmm_migrate_sys_to_dev(self->fd, buffer, npages));


And pages marked dirty also now. But could you elaborate how and where 
the above fails in more detail, couldn't immediately see it...



> +
> +	ASSERT_FALSE(write_cgroup_param(cgroup, "memory.reclaim", 1UL<<30));
> +
> +	/* Check we still see the updated data after restoring from swap. */
> +	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
> +		ASSERT_EQ(ptr[i], i);
> +
> +	hmm_buffer_free(buffer);
> +	destroy_cgroup();
> +}
> +
>   /*
>    * Read anonymous memory multiple times.
>    */


--Mika



* Re: [PATCH 2/2] selftests/hmm-tests: Add test for dirty bits
  2022-08-12  7:58   ` Mika Penttilä
@ 2022-08-15  2:35     ` Alistair Popple
  2022-08-15  3:11       ` Mika Penttilä
  0 siblings, 1 reply; 14+ messages in thread
From: Alistair Popple @ 2022-08-15  2:35 UTC (permalink / raw)
  To: Mika Penttilä
  Cc: linux-mm, akpm, linux-kernel, Sierra Guiza, Alejandro (Alex),
	Felix Kuehling, Jason Gunthorpe, John Hubbard, David Hildenbrand,
	Ralph Campbell, Matthew Wilcox, Karol Herbst, Lyude Paul,
	Ben Skeggs, Logan Gunthorpe, linuxram, paulus


Mika Penttilä <mpenttil@redhat.com> writes:

> Hi Alistair!
>
> On 12.8.2022 8.22, Alistair Popple wrote:

[...]

>> +	buffer->ptr = mmap(NULL, size,
>> +			   PROT_READ | PROT_WRITE,
>> +			   MAP_PRIVATE | MAP_ANONYMOUS,
>> +			   buffer->fd, 0);
>> +	ASSERT_NE(buffer->ptr, MAP_FAILED);
>> +
>> +	/* Initialize buffer in system memory. */
>> +	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
>> +		ptr[i] = 0;
>> +
>> +	ASSERT_FALSE(write_cgroup_param(cgroup, "memory.reclaim", 1UL<<30));
>> +
>> +	/* Fault pages back in from swap as clean pages */
>> +	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
>> +		tmp += ptr[i];
>> +
>> +	/* Dirty the pte */
>> +	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
>> +		ptr[i] = i;
>> +
>
> The anon pages are quite likely in memory at this point, and dirty in pte.

Why would the pte be dirty? I just confirmed using some modified pagemap
code that on my system at least this isn't the case.

>> +	/*
>> +	 * Attempt to migrate memory to device, which should fail because
>> +	 * hopefully some pages are backed by swap storage.
>> +	 */
>> +	ASSERT_TRUE(hmm_migrate_sys_to_dev(self->fd, buffer, npages));
>
> And pages marked dirty also now. But could you elaborate how and where the above
> fails in more detail, couldn't immediately see it...

Not if you don't have patch 1 of this series applied. If the
trylock_page() in migrate_vma_collect_pmd() succeeds (which it almost
always does) it will have cleared the pte without setting PageDirty.

So now we have a dirty page without PageDirty set and without a dirty
pte. If this page gets swapped back to disk and is still in the swap
cache data will be lost because reclaim will see a clean page and won't
write it out again.

At least that's my understanding - please let me know if you see
something that doesn't make sense.
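
To make the failure concrete, it takes two steps. Roughly (a simplified
sketch of the pre-patch migrate_vma_collect_pmd() fast path and of the
reclaim decision, not the literal upstream code):

	/* 1. Fast path before patch 1: the hardware dirty bit is discarded */
	pte = ptep_get_and_clear(mm, addr, ptep);
	/*
	 * Migration then fails on the unexpected swap-cache reference and
	 * the page is mapped back with neither a dirty pte nor PageDirty.
	 */

	/* 2. Reclaim later sees an apparently clean swap-cache folio */
	if (!folio_test_dirty(folio)) {
		/* no writeback is issued; the stale on-disk copy survives */
	}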

>> +
>> +	ASSERT_FALSE(write_cgroup_param(cgroup, "memory.reclaim", 1UL<<30));
>> +
>> +	/* Check we still see the updated data after restoring from swap. */
>> +	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
>> +		ASSERT_EQ(ptr[i], i);
>> +
>> +	hmm_buffer_free(buffer);
>> +	destroy_cgroup();
>> +}
>> +
>>   /*
>>    * Read anonymous memory multiple times.
>>    */
>
>
> --Mika


* Re: [PATCH 2/2] selftests/hmm-tests: Add test for dirty bits
  2022-08-15  2:35     ` Alistair Popple
@ 2022-08-15  3:11       ` Mika Penttilä
  2022-08-15  3:21         ` Alistair Popple
  0 siblings, 1 reply; 14+ messages in thread
From: Mika Penttilä @ 2022-08-15  3:11 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, akpm, linux-kernel, Sierra Guiza, Alejandro (Alex),
	Felix Kuehling, Jason Gunthorpe, John Hubbard, David Hildenbrand,
	Ralph Campbell, Matthew Wilcox, Karol Herbst, Lyude Paul,
	Ben Skeggs, Logan Gunthorpe, linuxram, paulus



On 15.8.2022 5.35, Alistair Popple wrote:
> 
> Mika Penttilä <mpenttil@redhat.com> writes:
> 
>> Hi Alistair!
>>
>> On 12.8.2022 8.22, Alistair Popple wrote:
> 
> [...]
> 
>>> +	buffer->ptr = mmap(NULL, size,
>>> +			   PROT_READ | PROT_WRITE,
>>> +			   MAP_PRIVATE | MAP_ANONYMOUS,
>>> +			   buffer->fd, 0);
>>> +	ASSERT_NE(buffer->ptr, MAP_FAILED);
>>> +
>>> +	/* Initialize buffer in system memory. */
>>> +	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
>>> +		ptr[i] = 0;
>>> +
>>> +	ASSERT_FALSE(write_cgroup_param(cgroup, "memory.reclaim", 1UL<<30));
>>> +
>>> +	/* Fault pages back in from swap as clean pages */
>>> +	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
>>> +		tmp += ptr[i];
>>> +
>>> +	/* Dirty the pte */
>>> +	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
>>> +		ptr[i] = i;
>>> +
>>
>> The anon pages are quite likely in memory at this point, and dirty in pte.
> 
> Why would the pte be dirty? I just confirmed using some modified pagemap
> code that on my system at least this isn't the case.
> 
>>> +	/*
>>> +	 * Attempt to migrate memory to device, which should fail because
>>> +	 * hopefully some pages are backed by swap storage.
>>> +	 */
>>> +	ASSERT_TRUE(hmm_migrate_sys_to_dev(self->fd, buffer, npages));
>>
>> And pages marked dirty also now. But could you elaborate how and where the above
>> fails in more detail, couldn't immediately see it...
> 
> Not if you don't have patch 1 of this series applied. If the
> trylock_page() in migrate_vma_collect_pmd() succeeds (which it almost
> always does) it will have cleared the pte without setting PageDirty.
> 

Ah yes, but I meant with patch 1 applied: the comment "Attempt to
migrate memory to device, which should fail because hopefully some pages
are backed by swap storage" indicates that hmm_migrate_sys_to_dev()
would fail... and there's that ASSERT_TRUE, which means a failure is
expected here.

So I understand the data loss, but where is hmm_migrate_sys_to_dev()
failing, with or without patch 1 applied?




> So now we have a dirty page without PageDirty set and without a dirty
> pte. If this page gets swapped back to disk and is still in the swap
> cache data will be lost because reclaim will see a clean page and won't
> write it out again.
> 
> At least that's my understanding - please let me know if you see
> something that doesn't make sense.
> 
>>> +
>>> +	ASSERT_FALSE(write_cgroup_param(cgroup, "memory.reclaim", 1UL<<30));
>>> +
>>> +	/* Check we still see the updated data after restoring from swap. */
>>> +	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
>>> +		ASSERT_EQ(ptr[i], i);
>>> +
>>> +	hmm_buffer_free(buffer);
>>> +	destroy_cgroup();
>>> +}
>>> +
>>>    /*
>>>     * Read anonymous memory multiple times.
>>>     */
>>
>>
>> --Mika
> 



* Re: [PATCH 2/2] selftests/hmm-tests: Add test for dirty bits
  2022-08-15  3:11       ` Mika Penttilä
@ 2022-08-15  3:21         ` Alistair Popple
  2022-08-15  4:05           ` Mika Penttilä
  0 siblings, 1 reply; 14+ messages in thread
From: Alistair Popple @ 2022-08-15  3:21 UTC (permalink / raw)
  To: Mika Penttilä
  Cc: linux-mm, akpm, linux-kernel, Sierra Guiza, Alejandro (Alex),
	Felix Kuehling, Jason Gunthorpe, John Hubbard, David Hildenbrand,
	Ralph Campbell, Matthew Wilcox, Karol Herbst, Lyude Paul,
	Ben Skeggs, Logan Gunthorpe, linuxram, paulus


Mika Penttilä <mpenttil@redhat.com> writes:

> On 15.8.2022 5.35, Alistair Popple wrote:
>> Mika Penttilä <mpenttil@redhat.com> writes:
>>
>>> Hi Alistair!
>>>
>>> On 12.8.2022 8.22, Alistair Popple wrote:
>> [...]
>>
>>>> +	buffer->ptr = mmap(NULL, size,
>>>> +			   PROT_READ | PROT_WRITE,
>>>> +			   MAP_PRIVATE | MAP_ANONYMOUS,
>>>> +			   buffer->fd, 0);
>>>> +	ASSERT_NE(buffer->ptr, MAP_FAILED);
>>>> +
>>>> +	/* Initialize buffer in system memory. */
>>>> +	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
>>>> +		ptr[i] = 0;
>>>> +
>>>> +	ASSERT_FALSE(write_cgroup_param(cgroup, "memory.reclaim", 1UL<<30));
>>>> +
>>>> +	/* Fault pages back in from swap as clean pages */
>>>> +	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
>>>> +		tmp += ptr[i];
>>>> +
>>>> +	/* Dirty the pte */
>>>> +	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
>>>> +		ptr[i] = i;
>>>> +
>>>
>>> The anon pages are quite likely in memory at this point, and dirty in pte.
>> Why would the pte be dirty? I just confirmed using some modified pagemap
>> code that on my system at least this isn't the case.
>>
>>>> +	/*
>>>> +	 * Attempt to migrate memory to device, which should fail because
>>>> +	 * hopefully some pages are backed by swap storage.
>>>> +	 */
>>>> +	ASSERT_TRUE(hmm_migrate_sys_to_dev(self->fd, buffer, npages));
>>>
>>> And pages marked dirty also now. But could you elaborate how and where the above
>>> fails in more detail, couldn't immediately see it...
>> Not if you don't have patch 1 of this series applied. If the
>> trylock_page() in migrate_vma_collect_pmd() succeeds (which it almost
>> always does) it will have cleared the pte without setting PageDirty.
>>
>
> Ah yes, but I meant with patch 1 applied: the comment "Attempt to migrate
> memory to device, which should fail because hopefully some pages are backed by
> swap storage" indicates that hmm_migrate_sys_to_dev() would fail... and
> there's that ASSERT_TRUE, which means a failure is expected here.
>
> So I understand the data loss, but where is hmm_migrate_sys_to_dev()
> failing, with or without patch 1 applied?

Oh right. hmm_migrate_sys_to_dev() will fail because the page is in the
swap cache, and migrate_vma_*() doesn't currently support migrating
pages with a mapping.
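
Paraphrasing the relevant check in migrate_vma_pages() (a sketch only;
the details vary by kernel version):

	mapping = page_mapping(page);	/* non-NULL for swap-cache pages */
	if (is_device_private_page(newpage)) {
		/* for now only private anonymous memory is supported */
		if (mapping) {
			migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
			continue;
		}
	}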

>> So now we have a dirty page without PageDirty set and without a dirty
>> pte. If this page gets swapped back to disk and is still in the swap
>> cache data will be lost because reclaim will see a clean page and won't
>> write it out again.
>> At least that's my understanding - please let me know if you see
>> something that doesn't make sense.
>>
>>>> +
>>>> +	ASSERT_FALSE(write_cgroup_param(cgroup, "memory.reclaim", 1UL<<30));
>>>> +
>>>> +	/* Check we still see the updated data after restoring from swap. */
>>>> +	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
>>>> +		ASSERT_EQ(ptr[i], i);
>>>> +
>>>> +	hmm_buffer_free(buffer);
>>>> +	destroy_cgroup();
>>>> +}
>>>> +
>>>>    /*
>>>>     * Read anonymous memory multiple times.
>>>>     */
>>>
>>>
>>> --Mika
>>


* Re: [PATCH 2/2] selftests/hmm-tests: Add test for dirty bits
  2022-08-15  3:21         ` Alistair Popple
@ 2022-08-15  4:05           ` Mika Penttilä
  2022-08-15  4:06             ` Mika Penttilä
  0 siblings, 1 reply; 14+ messages in thread
From: Mika Penttilä @ 2022-08-15  4:05 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, akpm, linux-kernel, Sierra Guiza, Alejandro (Alex),
	Felix Kuehling, Jason Gunthorpe, John Hubbard, David Hildenbrand,
	Ralph Campbell, Matthew Wilcox, Karol Herbst, Lyude Paul,
	Ben Skeggs, Logan Gunthorpe, linuxram, paulus



On 15.8.2022 6.21, Alistair Popple wrote:
> 
> Mika Penttilä <mpenttil@redhat.com> writes:
> 
>> On 15.8.2022 5.35, Alistair Popple wrote:
>>> Mika Penttilä <mpenttil@redhat.com> writes:
>>>
>>>> Hi Alistair!
>>>>
>>>> On 12.8.2022 8.22, Alistair Popple wrote:
>>> [...]
>>>
>>>>> +	buffer->ptr = mmap(NULL, size,
>>>>> +			   PROT_READ | PROT_WRITE,
>>>>> +			   MAP_PRIVATE | MAP_ANONYMOUS,
>>>>> +			   buffer->fd, 0);
>>>>> +	ASSERT_NE(buffer->ptr, MAP_FAILED);
>>>>> +
>>>>> +	/* Initialize buffer in system memory. */
>>>>> +	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
>>>>> +		ptr[i] = 0;
>>>>> +
>>>>> +	ASSERT_FALSE(write_cgroup_param(cgroup, "memory.reclaim", 1UL<<30));
>>>>> +
>>>>> +	/* Fault pages back in from swap as clean pages */
>>>>> +	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
>>>>> +		tmp += ptr[i];
>>>>> +
>>>>> +	/* Dirty the pte */
>>>>> +	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
>>>>> +		ptr[i] = i;
>>>>> +
>>>>
>>>> The anon pages are quite likely in memory at this point, and dirty in pte.
>>> Why would the pte be dirty? I just confirmed using some modified pagemap
>>> code that on my system at least this isn't the case.
>>>
>>>>> +	/*
>>>>> +	 * Attempt to migrate memory to device, which should fail because
>>>>> +	 * hopefully some pages are backed by swap storage.
>>>>> +	 */
>>>>> +	ASSERT_TRUE(hmm_migrate_sys_to_dev(self->fd, buffer, npages));
>>>>
>>>> And pages marked dirty also now. But could you elaborate how and where the above
>>>> fails in more detail, couldn't immediately see it...
>>> Not if you don't have patch 1 of this series applied. If the
>>> trylock_page() in migrate_vma_collect_pmd() succeeds (which it almost
>>> always does) it will have cleared the pte without setting PageDirty.
>>>
>>
>> Ah yes, but I meant with patch 1 applied: the comment "Attempt to migrate
>> memory to device, which should fail because hopefully some pages are backed by
>> swap storage" indicates that hmm_migrate_sys_to_dev() would fail... and
>> there's that ASSERT_TRUE, which means a failure is expected here.
>>
>> So I understand the data loss, but where is hmm_migrate_sys_to_dev()
>> failing, with or without patch 1 applied?
> 
> Oh right. hmm_migrate_sys_to_dev() will fail because the page is in the
> swap cache, and migrate_vma_*() doesn't currently support migrating
> pages with a mapping.
> 

Ok I forgot we skip also page cache pages, not just file pages...




>>> So now we have a dirty page without PageDirty set and without a dirty
>>> pte. If this page gets swapped back to disk and is still in the swap
>>> cache data will be lost because reclaim will see a clean page and won't
>>> write it out again.
>>> At least that's my understanding - please let me know if you see
>>> something that doesn't make sense.
>>>
>>>>> +
>>>>> +	ASSERT_FALSE(write_cgroup_param(cgroup, "memory.reclaim", 1UL<<30));
>>>>> +
>>>>> +	/* Check we still see the updated data after restoring from swap. */
>>>>> +	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
>>>>> +		ASSERT_EQ(ptr[i], i);
>>>>> +
>>>>> +	hmm_buffer_free(buffer);
>>>>> +	destroy_cgroup();
>>>>> +}
>>>>> +
>>>>>     /*
>>>>>      * Read anonymous memory multiple times.
>>>>>      */
>>>>
>>>>
>>>> --Mika
>>>
> 



* Re: [PATCH 2/2] selftests/hmm-tests: Add test for dirty bits
  2022-08-15  4:05           ` Mika Penttilä
@ 2022-08-15  4:06             ` Mika Penttilä
  0 siblings, 0 replies; 14+ messages in thread
From: Mika Penttilä @ 2022-08-15  4:06 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, akpm, linux-kernel, Sierra Guiza, Alejandro (Alex),
	Felix Kuehling, Jason Gunthorpe, John Hubbard, David Hildenbrand,
	Ralph Campbell, Matthew Wilcox, Karol Herbst, Lyude Paul,
	Ben Skeggs, Logan Gunthorpe, linuxram, paulus



On 15.8.2022 7.05, Mika Penttilä wrote:
> 
> 
> On 15.8.2022 6.21, Alistair Popple wrote:
>>
>> Mika Penttilä <mpenttil@redhat.com> writes:
>>
>>> On 15.8.2022 5.35, Alistair Popple wrote:
>>>> Mika Penttilä <mpenttil@redhat.com> writes:
>>>>
>>>>> Hi Alistair!
>>>>>
>>>>> On 12.8.2022 8.22, Alistair Popple wrote:
>>>> [...]
>>>>
>>>>>> +    buffer->ptr = mmap(NULL, size,
>>>>>> +               PROT_READ | PROT_WRITE,
>>>>>> +               MAP_PRIVATE | MAP_ANONYMOUS,
>>>>>> +               buffer->fd, 0);
>>>>>> +    ASSERT_NE(buffer->ptr, MAP_FAILED);
>>>>>> +
>>>>>> +    /* Initialize buffer in system memory. */
>>>>>> +    for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
>>>>>> +        ptr[i] = 0;
>>>>>> +
>>>>>> +    ASSERT_FALSE(write_cgroup_param(cgroup, "memory.reclaim", 1UL<<30));
>>>>>> +
>>>>>> +    /* Fault pages back in from swap as clean pages */
>>>>>> +    for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
>>>>>> +        tmp += ptr[i];
>>>>>> +
>>>>>> +    /* Dirty the pte */
>>>>>> +    for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
>>>>>> +        ptr[i] = i;
>>>>>> +
>>>>>
>>>>> The anon pages are quite likely in memory at this point, and dirty 
>>>>> in pte.
>>>> Why would the pte be dirty? I just confirmed using some modified
>>>> pagemap code that on my system at least this isn't the case.
>>>>
>>>>>> +    /*
>>>>>> +     * Attempt to migrate memory to device, which should fail because
>>>>>> +     * hopefully some pages are backed by swap storage.
>>>>>> +     */
>>>>>> +    ASSERT_TRUE(hmm_migrate_sys_to_dev(self->fd, buffer, npages));
>>>>>
>>>>> And pages marked dirty also now. But could you elaborate how and
>>>>> where the above fails in more detail, couldn't immediately see it...
>>>> Not if you don't have patch 1 of this series applied. If the
>>>> trylock_page() in migrate_vma_collect_pmd() succeeds (which it almost
>>>> always does) it will have cleared the pte without setting PageDirty.
>>>>
>>>
>>> Ah yes, but I meant with patch 1 applied: the comment "Attempt to
>>> migrate memory to device, which should fail because hopefully some
>>> pages are backed by swap storage" indicates that
>>> hmm_migrate_sys_to_dev() would fail... and there's that ASSERT_TRUE,
>>> which means a failure is expected here.
>>>
>>> So I understand the data loss, but where is hmm_migrate_sys_to_dev()
>>> failing, with or without patch 1 applied?
>>
>> Oh right. hmm_migrate_sys_to_dev() will fail because the page is in the
>> swap cache, and migrate_vma_*() doesn't currently support migrating
>> pages with a mapping.
>>
> 
> Ok I forgot we skip also page cache pages, not just file pages...

Meant we skip swap cache pages also, not just file pages...



> 
> 
> 
> 
>>>> So now we have a dirty page without PageDirty set and without a dirty
>>>> pte. If this page gets swapped back to disk and is still in the swap
>>>> cache data will be lost because reclaim will see a clean page and won't
>>>> write it out again.
>>>> At least that's my understanding - please let me know if you see
>>>> something that doesn't make sense.
>>>>
>>>>>> +
>>>>>> +    ASSERT_FALSE(write_cgroup_param(cgroup, "memory.reclaim", 1UL<<30));
>>>>>> +
>>>>>> +    /* Check we still see the updated data after restoring from swap. */
>>>>>> +    for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
>>>>>> +        ASSERT_EQ(ptr[i], i);
>>>>>> +
>>>>>> +    hmm_buffer_free(buffer);
>>>>>> +    destroy_cgroup();
>>>>>> +}
>>>>>> +
>>>>>>     /*
>>>>>>      * Read anonymous memory multiple times.
>>>>>>      */
>>>>>
>>>>>
>>>>> --Mika
>>>>
>>



* Re: [PATCH 1/2] mm/migrate_device.c: Copy pte dirty bit to page
  2022-08-12  5:22 [PATCH 1/2] mm/migrate_device.c: Copy pte dirty bit to page Alistair Popple
  2022-08-12  5:22 ` [PATCH 2/2] selftests/hmm-tests: Add test for dirty bits Alistair Popple
@ 2022-08-15 20:29 ` Peter Xu
  2022-08-16  0:51   ` Alistair Popple
  2022-08-16  1:39 ` huang ying
  2 siblings, 1 reply; 14+ messages in thread
From: Peter Xu @ 2022-08-15 20:29 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, akpm, linux-kernel, Sierra Guiza, Alejandro (Alex),
	Felix Kuehling, Jason Gunthorpe, John Hubbard, David Hildenbrand,
	Ralph Campbell, Matthew Wilcox, Karol Herbst, Lyude Paul,
	Ben Skeggs, Logan Gunthorpe, linuxram, paulus

Hi, Alistair,

On Fri, Aug 12, 2022 at 03:22:30PM +1000, Alistair Popple wrote:
> migrate_vma_setup() has a fast path in migrate_vma_collect_pmd() that
> installs migration entries directly if it can lock the migrating page.
> When removing a dirty pte the dirty bit is supposed to be carried over
> to the underlying page to prevent it being lost.
> 
> Currently migrate_vma_*() can only be used for private anonymous
> mappings. That means loss of the dirty bit usually doesn't result in
> data loss because these pages are typically not file-backed. However
> pages may be backed by swap storage which can result in data loss if an
> attempt is made to migrate a dirty page that doesn't yet have the
> PageDirty flag set.
> 
> In this case migration will fail due to unexpected references but the
> dirty pte bit will be lost. If the page is subsequently reclaimed data
> won't be written back to swap storage as it is considered uptodate,
> resulting in data loss if the page is subsequently accessed.
> 
> Prevent this by copying the dirty bit to the page when removing the pte
> to match what try_to_migrate_one() does.
> 
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> Reported-by: Peter Xu <peterx@redhat.com>

This line should be:

Reported-by: Huang Ying <ying.huang@intel.com>

Please also feel free to add:

Acked-by: Peter Xu <peterx@redhat.com>

Thanks,

-- 
Peter Xu



* Re: [PATCH 1/2] mm/migrate_device.c: Copy pte dirty bit to page
  2022-08-15 20:29 ` [PATCH 1/2] mm/migrate_device.c: Copy pte dirty bit to page Peter Xu
@ 2022-08-16  0:51   ` Alistair Popple
  0 siblings, 0 replies; 14+ messages in thread
From: Alistair Popple @ 2022-08-16  0:51 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, akpm, linux-kernel, Sierra Guiza, Alejandro (Alex),
	Felix Kuehling, Jason Gunthorpe, John Hubbard, David Hildenbrand,
	Ralph Campbell, Matthew Wilcox, Karol Herbst, Lyude Paul,
	Ben Skeggs, Logan Gunthorpe, linuxram, paulus


Peter Xu <peterx@redhat.com> writes:

> Hi, Alistair,
>
> On Fri, Aug 12, 2022 at 03:22:30PM +1000, Alistair Popple wrote:
>> migrate_vma_setup() has a fast path in migrate_vma_collect_pmd() that
>> installs migration entries directly if it can lock the migrating page.
>> When removing a dirty pte the dirty bit is supposed to be carried over
>> to the underlying page to prevent it being lost.
>>
>> Currently migrate_vma_*() can only be used for private anonymous
>> mappings. That means loss of the dirty bit usually doesn't result in
>> data loss because these pages are typically not file-backed. However
>> pages may be backed by swap storage which can result in data loss if an
>> attempt is made to migrate a dirty page that doesn't yet have the
>> PageDirty flag set.
>>
>> In this case migration will fail due to unexpected references but the
>> dirty pte bit will be lost. If the page is subsequently reclaimed data
>> won't be written back to swap storage as it is considered uptodate,
>> resulting in data loss if the page is subsequently accessed.
>>
>> Prevent this by copying the dirty bit to the page when removing the pte
>> to match what try_to_migrate_one() does.
>>
>> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>> Reported-by: Peter Xu <peterx@redhat.com>
>
> This line should be:
>
> Reported-by: Huang Ying <ying.huang@intel.com>
>
> Please also feel free to add:
>
> Acked-by: Peter Xu <peterx@redhat.com>

Thanks Peter, my bad. I assume Andrew can fix up the tags if I don't
need to re-spin this series.

> Thanks,


* Re: [PATCH 1/2] mm/migrate_device.c: Copy pte dirty bit to page
  2022-08-12  5:22 [PATCH 1/2] mm/migrate_device.c: Copy pte dirty bit to page Alistair Popple
  2022-08-12  5:22 ` [PATCH 2/2] selftests/hmm-tests: Add test for dirty bits Alistair Popple
  2022-08-15 20:29 ` [PATCH 1/2] mm/migrate_device.c: Copy pte dirty bit to page Peter Xu
@ 2022-08-16  1:39 ` huang ying
  2022-08-16  2:28   ` Alistair Popple
  2 siblings, 1 reply; 14+ messages in thread
From: huang ying @ 2022-08-16  1:39 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, akpm, linux-kernel, Sierra Guiza, Alejandro (Alex),
	Felix Kuehling, Jason Gunthorpe, John Hubbard, David Hildenbrand,
	Ralph Campbell, Matthew Wilcox, Karol Herbst, Lyude Paul,
	Ben Skeggs, Logan Gunthorpe, linuxram, paulus, Peter Xu

Hi, Alistair,

On Fri, Aug 12, 2022 at 1:23 PM Alistair Popple <apopple@nvidia.com> wrote:
>
> migrate_vma_setup() has a fast path in migrate_vma_collect_pmd() that
> installs migration entries directly if it can lock the migrating page.
> When removing a dirty pte the dirty bit is supposed to be carried over
> to the underlying page to prevent it being lost.
>
> Currently migrate_vma_*() can only be used for private anonymous
> mappings. That means loss of the dirty bit usually doesn't result in
> data loss because these pages are typically not file-backed. However
> pages may be backed by swap storage which can result in data loss if an
> attempt is made to migrate a dirty page that doesn't yet have the
> PageDirty flag set.
>
> In this case migration will fail due to unexpected references but the
> dirty pte bit will be lost. If the page is subsequently reclaimed data
> won't be written back to swap storage as it is considered uptodate,
> resulting in data loss if the page is subsequently accessed.
>
> Prevent this by copying the dirty bit to the page when removing the pte
> to match what try_to_migrate_one() does.
>
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> Reported-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/migrate_device.c | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index 27fb37d..d38f8a6 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -7,6 +7,7 @@
>  #include <linux/export.h>
>  #include <linux/memremap.h>
>  #include <linux/migrate.h>
> +#include <linux/mm.h>
>  #include <linux/mm_inline.h>
>  #include <linux/mmu_notifier.h>
>  #include <linux/oom.h>
> @@ -211,6 +212,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>
>                         migrate->cpages++;
>
> +                       /* Set the dirty flag on the folio now the pte is gone. */
> +                       if (pte_dirty(pte))
> +                               folio_mark_dirty(page_folio(page));
> +

I think that this isn't sufficient to fix all issues.  Firstly, "pte"
is assigned at the beginning of the loop, before the PTE is cleared via
ptep_clear_flush() or ptep_get_and_clear().  That is, the pte isn't
changed atomically.  Between "pte" assignment and PTE clear, the PTE
may become dirty.  That is, we need to update pte when we clear the
PTE.
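
That is, something like the following (an untested sketch of the idea
only; the real anon_exclusive path does extra work around the clear):

	if (anon_exclusive)
		pte = ptep_clear_flush(vma, addr, ptep);
	else
		pte = ptep_get_and_clear(mm, addr, ptep);

	/* pte now holds the final hardware state, dirty bit included */
	if (pte_dirty(pte))
		folio_mark_dirty(page_folio(page));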

And I don't know why we use ptep_get_and_clear() to clear PTE if
(!anon_exclusive).  Why don't we need to flush the TLB?

Best Regards,
Huang, Ying

>                         /* Setup special migration page table entry */
>                         if (mpfn & MIGRATE_PFN_WRITE)
>                                 entry = make_writable_migration_entry(
>
> base-commit: ffcf9c5700e49c0aee42dcba9a12ba21338e8136
> --
> git-series 0.9.1
>


* Re: [PATCH 1/2] mm/migrate_device.c: Copy pte dirty bit to page
  2022-08-16  1:39 ` huang ying
@ 2022-08-16  2:28   ` Alistair Popple
  2022-08-16  6:37     ` huang ying
  0 siblings, 1 reply; 14+ messages in thread
From: Alistair Popple @ 2022-08-16  2:28 UTC (permalink / raw)
  To: huang ying
  Cc: linux-mm, akpm, linux-kernel, Sierra Guiza, Alejandro (Alex),
	Felix Kuehling, Jason Gunthorpe, John Hubbard, David Hildenbrand,
	Ralph Campbell, Matthew Wilcox, Karol Herbst, Lyude Paul,
	Ben Skeggs, Logan Gunthorpe, linuxram, paulus, Peter Xu


huang ying <huang.ying.caritas@gmail.com> writes:

> Hi, Alistair,
>
> On Fri, Aug 12, 2022 at 1:23 PM Alistair Popple <apopple@nvidia.com> wrote:
>>
>> migrate_vma_setup() has a fast path in migrate_vma_collect_pmd() that
>> installs migration entries directly if it can lock the migrating page.
>> When removing a dirty pte the dirty bit is supposed to be carried over
>> to the underlying page to prevent it being lost.
>>
>> Currently migrate_vma_*() can only be used for private anonymous
>> mappings. That means loss of the dirty bit usually doesn't result in
>> data loss because these pages are typically not file-backed. However
>> pages may be backed by swap storage which can result in data loss if an
>> attempt is made to migrate a dirty page that doesn't yet have the
>> PageDirty flag set.
>>
>> In this case migration will fail due to unexpected references but the
>> dirty pte bit will be lost. If the page is subsequently reclaimed data
>> won't be written back to swap storage as it is considered uptodate,
>> resulting in data loss if the page is subsequently accessed.
>>
>> Prevent this by copying the dirty bit to the page when removing the pte
>> to match what try_to_migrate_one() does.
>>
>> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>> Reported-by: Peter Xu <peterx@redhat.com>
>> ---
>>  mm/migrate_device.c | 5 +++++
>>  1 file changed, 5 insertions(+)
>>
>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>> index 27fb37d..d38f8a6 100644
>> --- a/mm/migrate_device.c
>> +++ b/mm/migrate_device.c
>> @@ -7,6 +7,7 @@
>>  #include <linux/export.h>
>>  #include <linux/memremap.h>
>>  #include <linux/migrate.h>
>> +#include <linux/mm.h>
>>  #include <linux/mm_inline.h>
>>  #include <linux/mmu_notifier.h>
>>  #include <linux/oom.h>
>> @@ -211,6 +212,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>
>>                         migrate->cpages++;
>>
>> +                       /* Set the dirty flag on the folio now the pte is gone. */
>> +                       if (pte_dirty(pte))
>> +                               folio_mark_dirty(page_folio(page));
>> +
>
> I think that this isn't sufficient to fix all issues.  Firstly, "pte"
> is assigned at the beginning of the loop, before the PTE is cleared via
> ptep_clear_flush() or ptep_get_and_clear().  That is, the pte isn't
> changed atomically.  Between "pte" assignment and PTE clear, the PTE
> may become dirty.  That is, we need to update pte when we clear the
> PTE.

Oh good catch, thanks. Will fix.

> And I don't know why we use ptep_get_and_clear() to clear PTE if
> (!anon_exclusive).  Why don't we need to flush the TLB?

We do the TLB flush at the end if anything was modified:

	/* Only flush the TLB if we actually modified any entries */
	if (unmapped)
		flush_tlb_range(walk->vma, start, end);

Obviously I don't think that will work correctly now given we have to
read the dirty bits and clear the PTE atomically. I assume it was
originally written this way for some sort of performance reason.

 - Alistair

> Best Regards,
> Huang, Ying
>
>>                         /* Setup special migration page table entry */
>>                         if (mpfn & MIGRATE_PFN_WRITE)
>>                                 entry = make_writable_migration_entry(
>>
>> base-commit: ffcf9c5700e49c0aee42dcba9a12ba21338e8136
>> --
>> git-series 0.9.1
>>


* Re: [PATCH 1/2] mm/migrate_device.c: Copy pte dirty bit to page
  2022-08-16  2:28   ` Alistair Popple
@ 2022-08-16  6:37     ` huang ying
  2022-08-17  1:27       ` Alistair Popple
  0 siblings, 1 reply; 14+ messages in thread
From: huang ying @ 2022-08-16  6:37 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, akpm, linux-kernel, Sierra Guiza, Alejandro (Alex),
	Felix Kuehling, Jason Gunthorpe, John Hubbard, David Hildenbrand,
	Ralph Campbell, Matthew Wilcox, Karol Herbst, Lyude Paul,
	Ben Skeggs, Logan Gunthorpe, linuxram, paulus, Peter Xu

On Tue, Aug 16, 2022 at 10:33 AM Alistair Popple <apopple@nvidia.com> wrote:
>
>
> huang ying <huang.ying.caritas@gmail.com> writes:
>
> > Hi, Alistair,
> >
> > On Fri, Aug 12, 2022 at 1:23 PM Alistair Popple <apopple@nvidia.com> wrote:

[snip]

>
> > And I don't know why we use ptep_get_and_clear() to clear PTE if
> > (!anon_exclusive).  Why don't we need to flush the TLB?
>
> We do the TLB flush at the end if anything was modified:
>
>         /* Only flush the TLB if we actually modified any entries */
>         if (unmapped)
>                 flush_tlb_range(walk->vma, start, end);
>
> Obviously I don't think that will work correctly now given we have to
> read the dirty bits and clear the PTE atomically. I assume it was
> originally written this way for some sort of performance reason.

Thanks for pointing this out.  If there were parallel page table
operations such as mprotect() or munmap(), the delayed TLB flushing
mechanism here may have some problem.  Please take a look at the
comments of flush_tlb_batched_pending() and TLB flush batching
implementation in try_to_unmap_one().  We may need to flush TLB with
page table lock held or use a mechanism similar to that in
try_to_unmap_one().
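
For reference, the batching in try_to_unmap_one() has roughly this
shape (paraphrased; set_tlb_ubc_flush_pending() is internal to
mm/rmap.c, so this only illustrates the mechanism):

	if (should_defer_flush(mm, flags)) {
		/* clear now and record a pending flush for this mm */
		pteval = ptep_get_and_clear(mm, address, pvmw.pte);
		set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
	} else {
		/* flush immediately, while the page table lock is held */
		pteval = ptep_clear_flush(vma, address, pvmw.pte);
	}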

Best Regards,
Huang, Ying


* Re: [PATCH 1/2] mm/migrate_device.c: Copy pte dirty bit to page
  2022-08-16  6:37     ` huang ying
@ 2022-08-17  1:27       ` Alistair Popple
  0 siblings, 0 replies; 14+ messages in thread
From: Alistair Popple @ 2022-08-17  1:27 UTC (permalink / raw)
  To: huang ying
  Cc: linux-mm, akpm, linux-kernel, Sierra Guiza, Alejandro (Alex),
	Felix Kuehling, Jason Gunthorpe, John Hubbard, David Hildenbrand,
	Ralph Campbell, Matthew Wilcox, Karol Herbst, Lyude Paul,
	Ben Skeggs, Logan Gunthorpe, linuxram, paulus, Peter Xu


huang ying <huang.ying.caritas@gmail.com> writes:

> On Tue, Aug 16, 2022 at 10:33 AM Alistair Popple <apopple@nvidia.com> wrote:
>>
>>
>> huang ying <huang.ying.caritas@gmail.com> writes:
>>
>> > Hi, Alistair,
>> >
>> > On Fri, Aug 12, 2022 at 1:23 PM Alistair Popple <apopple@nvidia.com> wrote:
>
> [snip]
>
>>
>> > And I don't know why we use ptep_get_and_clear() to clear PTE if
>> > (!anon_exclusive).  Why don't we need to flush the TLB?
>>
>> We do the TLB flush at the end if anything was modified:
>>
>>         /* Only flush the TLB if we actually modified any entries */
>>         if (unmapped)
>>                 flush_tlb_range(walk->vma, start, end);
>>
>> Obviously I don't think that will work correctly now given we have to
>> read the dirty bits and clear the PTE atomically. I assume it was
>> originally written this way for some sort of performance reason.
>
> Thanks for pointing this out.  If there were parallel page table
> operations such as mprotect() or munmap(), the delayed TLB flushing
> mechanism here may have some problem.  Please take a look at the
> comments of flush_tlb_batched_pending() and TLB flush batching
> implementation in try_to_unmap_one().  We may need to flush TLB with
> page table lock held or use a mechanism similar to that in
> try_to_unmap_one().

Thanks for the pointers. I agree there is likely also a problem here
with the delayed TLB flushing. v2 of this patch deals with this by
always flushing the TLB using ptep_clear_flush(), similar to how
try_to_migrate_one() works. It looks like it could be worth
investigating using batched TLB flushing for both this and
try_to_migrate(), but I will leave that for a future optimisation.
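
Concretely, v2 replaces the deferred range flush with a per-pte
clear-and-flush, roughly (a sketch; the actual v2 patch may differ in
detail):

	/* before: clear without flushing, one range flush at the end */
	pte = ptep_get_and_clear(mm, addr, ptep);
	/* ... later: if (unmapped) flush_tlb_range(vma, start, end); */

	/*
	 * after (v2): clear and flush together under the PTL, as
	 * try_to_migrate_one() does, so the harvested pte value is final
	 * and no stale TLB entry remains.
	 */
	pte = ptep_clear_flush(vma, addr, ptep);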

> Best Regards,
> Huang, Ying
