* [RFC PATCH 0/1] pagemap: report swap location for shared pages
@ 2021-07-14 15:24 Tiberiu Georgescu
  2021-07-14 15:24 ` [RFC PATCH 1/1] " Tiberiu Georgescu
  2021-07-14 16:01 ` [RFC PATCH 0/1] " Peter Xu
  0 siblings, 2 replies; 8+ messages in thread
From: Tiberiu Georgescu @ 2021-07-14 15:24 UTC
  To: akpm, peterx, catalin.marinas, peterz, chinwen.chang, linmiaohe,
	jannh, apopple, christian.brauner, ebiederm, adobriyan,
	songmuchun, axboe, linux-kernel, linux-fsdevel, linux-mm
  Cc: ivan.teterevkov, florian.schmidt, carl.waldspurger, Tiberiu Georgescu

When a page allocated using the MAP_SHARED flag is swapped out, its pagemap
entry is cleared. In many cases, there is no difference between swapped-out
shared pages and newly allocated, non-dirty pages in the pagemap interface.

Example pagemap-test code (Tested on Kernel Version 5.14-rc1):

	#define NPAGES (256)
	/* map 1MiB shared memory */
	size_t pagesize = getpagesize();
	char *p = mmap(NULL, pagesize * NPAGES, PROT_READ | PROT_WRITE,
			   MAP_ANONYMOUS | MAP_SHARED, -1, 0);
	/* Dirty new pages. */
	for (size_t i = 0; i < NPAGES; i++)
		p[i * pagesize] = i;

Run the above program in a small memory cgroup, so that its pages are
forced out to swap:

	# Initialise the cgroup and run the program
	$ echo 512K > foo/memory.limit_in_bytes
	$ echo 60 > foo/memory.swappiness
	$ cgexec -g memory:foo ./pagemap-test

Check the pagemap report. This is an example of the output produced by the
current kernel:

	$ dd if=/proc/$PID/pagemap ibs=8 skip=$(($VADDR / $PAGESIZE)) count=$COUNT | hexdump -C
	00000000  00 00 00 00 00 00 80 00  00 00 00 00 00 00 80 00  |................|
	*
	00000710  e1 6b 06 00 00 00 80 a1  9e eb 06 00 00 00 80 a1  |.k..............|
	00000720  6b ee 06 00 00 00 80 a1  a5 a4 05 00 00 00 80 a1  |k...............|
	00000730  5c bf 06 00 00 00 80 a1  90 b6 06 00 00 00 80 a1  |\...............|

The first pagemap entries are reported as zeroes, which suggests that the
pages have never been allocated, when in fact they have been swapped out.
At most, bit 55 (PTE is Soft-Dirty) may be set for all pages of the shared
VMA, indicating that the pages have been accessed at some point, but
nothing else is reported (no frame location and no indication of presence
in swap).
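
For reference, here is how such entries can be read and decoded from user
space. This is a minimal sketch, assuming the entry layout documented in
Documentation/admin-guide/mm/pagemap.rst (note that the PFN field reads as
zero for unprivileged users):

	#include <fcntl.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <unistd.h>

	/* Print the pagemap entry of one virtual address of this process. */
	static void dump_pagemap_entry(uintptr_t vaddr)
	{
		long pagesize = sysconf(_SC_PAGESIZE);
		int fd = open("/proc/self/pagemap", O_RDONLY);
		uint64_t pme = 0;

		if (fd < 0)
			return;
		pread(fd, &pme, sizeof(pme), (vaddr / pagesize) * sizeof(pme));
		close(fd);

		printf("present:    %d\n", (int)((pme >> 63) & 1));
		printf("swapped:    %d\n", (int)((pme >> 62) & 1));
		printf("file/shm:   %d\n", (int)((pme >> 61) & 1));
		printf("soft-dirty: %d\n", (int)((pme >> 55) & 1));
		/* Bits 0-54: PFN if present; swap type (bits 0-4) and
		 * swap offset (bits 5-54) if swapped. */
		printf("frame:      %#llx\n",
		       (unsigned long long)(pme & ((1ULL << 55) - 1)));
	}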

This patch addresses the behaviour by modifying pte_to_pagemap_entry() to
make use of the XArray associated with the file backing the virtual memory
area struct passed as an argument. The XArray tracks the location of each
virtual page: in the page cache, in the swap cache, or on disk. If the
page is in either of the caches, the original implementation still works.
If not, the missing information is retrieved from the XArray.

The root cause of the missing functionality is that the PTE of a shared
page is cleared when the page is swapped out.  Please take a look at the
proposed patch. I would appreciate it if you could verify a couple of
points:

1. Why do swappable, non-syncable shared pages have their PTEs cleared
   when they are swapped out? Why does the behaviour differ so much
   between MAP_SHARED and MAP_PRIVATE pages? What are the origins of this
   approach?

2. PM_SOFT_DIRTY and PM_UFFD_WP are two flags that seem to get lost once
   the shared page is swapped out. Is there any other way to retrieve
   their value in the proposed patch, other than ensuring these flags are
   set, when necessary, in the PTE?

Kind regards,
Tibi

Tiberiu Georgescu (1):
  pagemap: report swap location for shared pages

 fs/proc/task_mmu.c | 37 +++++++++++++++++++++++++++++--------
 1 file changed, 29 insertions(+), 8 deletions(-)

-- 
2.32.0



* [RFC PATCH 1/1] pagemap: report swap location for shared pages
  2021-07-14 15:24 [RFC PATCH 0/1] pagemap: report swap location for shared pages Tiberiu Georgescu
@ 2021-07-14 15:24 ` Tiberiu Georgescu
  2021-07-14 16:08   ` Peter Xu
  2021-07-14 16:01 ` [RFC PATCH 0/1] " Peter Xu
  1 sibling, 1 reply; 8+ messages in thread
From: Tiberiu Georgescu @ 2021-07-14 15:24 UTC
  To: akpm, peterx, catalin.marinas, peterz, chinwen.chang, linmiaohe,
	jannh, apopple, christian.brauner, ebiederm, adobriyan,
	songmuchun, axboe, linux-kernel, linux-fsdevel, linux-mm
  Cc: ivan.teterevkov, florian.schmidt, carl.waldspurger, Tiberiu Georgescu

When a page allocated using the MAP_SHARED flag is swapped out, its pagemap
entry is cleared. In many cases, there is no difference between swapped-out
shared pages and newly allocated, non-dirty pages in the pagemap interface.

This patch addresses the behaviour by modifying pte_to_pagemap_entry() to
make use of the XArray associated with the file backing the virtual memory
area struct passed as an argument. The XArray tracks the location of each
virtual page: in the page cache, in the swap cache, or on disk. If the
page is in either of the caches, the original implementation still works.
If not, the missing information is retrieved from the XArray.

Co-developed-by: Florian Schmidt <florian.schmidt@nutanix.com>
Signed-off-by: Florian Schmidt <florian.schmidt@nutanix.com>
Co-developed-by: Carl Waldspurger <carl.waldspurger@nutanix.com>
Signed-off-by: Carl Waldspurger <carl.waldspurger@nutanix.com>
Co-developed-by: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
Signed-off-by: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
Signed-off-by: Tiberiu Georgescu <tiberiu.georgescu@nutanix.com>
---
 fs/proc/task_mmu.c | 37 +++++++++++++++++++++++++++++--------
 1 file changed, 29 insertions(+), 8 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index eb97468dfe4c..b17c8aedd32e 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1359,12 +1359,25 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,
 	return err;
 }
 
+static void *get_xa_entry_at_vma_addr(struct vm_area_struct *vma,
+		unsigned long addr)
+{
+	struct inode *inode = file_inode(vma->vm_file);
+	struct address_space *mapping = inode->i_mapping;
+	pgoff_t offset = linear_page_index(vma, addr);
+
+	return xa_load(&mapping->i_pages, offset);
+}
+
 static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
 		struct vm_area_struct *vma, unsigned long addr, pte_t pte)
 {
 	u64 frame = 0, flags = 0;
 	struct page *page = NULL;
 
+	if (vma->vm_flags & VM_SOFTDIRTY)
+		flags |= PM_SOFT_DIRTY;
+
 	if (pte_present(pte)) {
 		if (pm->show_pfn)
 			frame = pte_pfn(pte);
@@ -1374,13 +1387,22 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
 			flags |= PM_SOFT_DIRTY;
 		if (pte_uffd_wp(pte))
 			flags |= PM_UFFD_WP;
-	} else if (is_swap_pte(pte)) {
+	} else if (is_swap_pte(pte) || shmem_file(vma->vm_file)) {
 		swp_entry_t entry;
-		if (pte_swp_soft_dirty(pte))
-			flags |= PM_SOFT_DIRTY;
-		if (pte_swp_uffd_wp(pte))
-			flags |= PM_UFFD_WP;
-		entry = pte_to_swp_entry(pte);
+		if (is_swap_pte(pte)) {
+			entry = pte_to_swp_entry(pte);
+			if (pte_swp_soft_dirty(pte))
+				flags |= PM_SOFT_DIRTY;
+			if (pte_swp_uffd_wp(pte))
+				flags |= PM_UFFD_WP;
+		} else {
+			void *xa_entry = get_xa_entry_at_vma_addr(vma, addr);
+
+			if (xa_is_value(xa_entry))
+				entry = radix_to_swp_entry(xa_entry);
+			else
+				goto out;
+		}
 		if (pm->show_pfn)
 			frame = swp_type(entry) |
 				(swp_offset(entry) << MAX_SWAPFILES_SHIFT);
@@ -1393,9 +1415,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
 		flags |= PM_FILE;
 	if (page && page_mapcount(page) == 1)
 		flags |= PM_MMAP_EXCLUSIVE;
-	if (vma->vm_flags & VM_SOFTDIRTY)
-		flags |= PM_SOFT_DIRTY;
 
+out:
 	return make_pme(frame, flags);
 }
 
-- 
2.32.0



* Re: [RFC PATCH 0/1] pagemap: report swap location for shared pages
  2021-07-14 15:24 [RFC PATCH 0/1] pagemap: report swap location for shared pages Tiberiu Georgescu
  2021-07-14 15:24 ` [RFC PATCH 1/1] " Tiberiu Georgescu
@ 2021-07-14 16:01 ` Peter Xu
  1 sibling, 0 replies; 8+ messages in thread
From: Peter Xu @ 2021-07-14 16:01 UTC
  To: Tiberiu Georgescu
  Cc: akpm, catalin.marinas, peterz, chinwen.chang, linmiaohe, jannh,
	apopple, christian.brauner, ebiederm, adobriyan, songmuchun,
	axboe, linux-kernel, linux-fsdevel, linux-mm, ivan.teterevkov,
	florian.schmidt, carl.waldspurger, Hugh Dickins,
	Andrea Arcangeli

On Wed, Jul 14, 2021 at 03:24:25PM +0000, Tiberiu Georgescu wrote:
> When a page allocated using the MAP_SHARED flag is swapped out, its pagemap
> entry is cleared. In many cases, there is no difference between swapped-out
> shared pages and newly allocated, non-dirty pages in the pagemap interface.
> 
> Example pagemap-test code (Tested on Kernel Version 5.14-rc1):
> 
> 	#define NPAGES (256)
> 	/* map 1MiB shared memory */
> 	size_t pagesize = getpagesize();
> 	char *p = mmap(NULL, pagesize * NPAGES, PROT_READ | PROT_WRITE,
> 			   MAP_ANONYMOUS | MAP_SHARED, -1, 0);
> 	/* Dirty new pages. */
> 	for (size_t i = 0; i < NPAGES; i++)
> 		p[i * pagesize] = i;
> 
> Run the above program in a small cgroup, which allows swapping:
> 
> 	/* Initialise cgroup & run a program */
> 	$ echo 512K > foo/memory.limit_in_bytes
> 	$ echo 60 > foo/memory.swappiness
> 	$ cgexec -g memory:foo ./pagemap-test
> 
> Check the pagemap report. This is an example of the current expected output:
> 
> 	$ dd if=/proc/$PID/pagemap ibs=8 skip=$(($VADDR / $PAGESIZE)) count=$COUNT | hexdump -C
> 	00000000  00 00 00 00 00 00 80 00  00 00 00 00 00 00 80 00  |................|
> 	*
> 	00000710  e1 6b 06 00 00 00 80 a1  9e eb 06 00 00 00 80 a1  |.k..............|
> 	00000720  6b ee 06 00 00 00 80 a1  a5 a4 05 00 00 00 80 a1  |k...............|
> 	00000730  5c bf 06 00 00 00 80 a1  90 b6 06 00 00 00 80 a1  |\...............|
> 
> The first pagemap entries are reported as zeroes, indicating the pages have
> never been allocated while they have actually been swapped out. It is
> possible for bit 55 (PTE is Soft-Dirty) to be set on all pages of the
> shared VMA, indicating some access to the page, but nothing else (frame
> location, presence in swap or otherwise).
> 
> This patch addresses the behaviour and modifies pte_to_pagemap_entry() to
> make use of the XArray associated with the virtual memory area struct
> passed as an argument. The XArray contains the location of virtual pages in
> the page cache, swap cache or on disk. If they are on either of the caches,
> then the original implementation still works. If not, then the missing
> information will be retrieved from the XArray.
> 
> The root cause of the missing functionality is that the PTE for the page
> itself is cleared when a swap out occurs on a shared page.  Please take a
> look at the proposed patch. I would appreciate it if you could verify a
> couple of points:
> 
> 1. Why do swappable and non-syncable shared pages have their PTEs cleared
>    when they are swapped out ? Why does the behaviour differ so much
>    between MAP_SHARED and MAP_PRIVATE pages? What are the origins of the
>    approach?

My understanding is that Linux mm treats file-backed memory differently,
and MAP_SHARED memory is one such kind.  For these memories, ptes can be
dropped at any time, because the contents can be reloaded from the page
cache when faulted in again.

Anonymous private memory cannot do that, so it keeps everything within
the ptes, including the swap entry.
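
A quick way to see this in action without a cgroup is to page a shared
page out explicitly. A minimal sketch, assuming a 5.4+ kernel with
MADV_PAGEOUT and swap enabled (the reclaim is only a hint, so the kernel
may skip it):

	#include <string.h>
	#include <sys/mman.h>

	/* Illustrative only: dirty one MAP_SHARED page, then ask the kernel
	 * to reclaim it.  Once reclaimed, the pte is dropped, so the page's
	 * pagemap entry carries no frame or swap information (pre-patch),
	 * yet the data is still recoverable on the next access. */
	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
		       MAP_ANONYMOUS | MAP_SHARED, -1, 0);
	memset(p, 0x5a, 4096);
	madvise(p, 4096, MADV_PAGEOUT);	/* hint: swap the page out */
	/* ... inspect /proc/self/pagemap here ... */
	char c = p[0];			/* fault the page back in */
	(void)c;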

> 
> 2. PM_SOFT_DIRTY and PM_UFFD_WP are two flags that seem to get lost once
>    the shared page is swapped out. Is there any other way to retrieve
>    their value in the proposed patch, other than ensuring these flags are
>    set, when necessary, in the PTE?

uffd-wp has no problem with dropping them, because uffd-wp does not yet
support shmem.  Shmem support has been posted upstream but is still under
review:

https://lore.kernel.org/lkml/20210527201927.29586-1-peterx@redhat.com/

After that work they'll persist, and we won't have an issue using uffd-wp
with shmem swapping; the pagemap part is done in patch 25 of 27:

https://lore.kernel.org/lkml/20210527202340.32306-1-peterx@redhat.com/

However, I agree that soft-dirty still seems to be broken here.

(Cc Hugh and Andrea too)

Thanks,

-- 
Peter Xu



* Re: [RFC PATCH 1/1] pagemap: report swap location for shared pages
  2021-07-14 15:24 ` [RFC PATCH 1/1] " Tiberiu Georgescu
@ 2021-07-14 16:08   ` Peter Xu
  2021-07-14 16:24     ` David Hildenbrand
  2021-07-15  9:48     ` Tiberiu Georgescu
  0 siblings, 2 replies; 8+ messages in thread
From: Peter Xu @ 2021-07-14 16:08 UTC
  To: Tiberiu Georgescu
  Cc: akpm, catalin.marinas, peterz, chinwen.chang, linmiaohe, jannh,
	apopple, christian.brauner, ebiederm, adobriyan, songmuchun,
	axboe, linux-kernel, linux-fsdevel, linux-mm, ivan.teterevkov,
	florian.schmidt, carl.waldspurger, Hugh Dickins,
	Andrea Arcangeli

On Wed, Jul 14, 2021 at 03:24:26PM +0000, Tiberiu Georgescu wrote:
> When a page allocated using the MAP_SHARED flag is swapped out, its pagemap
> entry is cleared. In many cases, there is no difference between swapped-out
> shared pages and newly allocated, non-dirty pages in the pagemap interface.
> 
> This patch addresses the behaviour and modifies pte_to_pagemap_entry() to
> make use of the XArray associated with the virtual memory area struct
> passed as an argument. The XArray contains the location of virtual pages
> in the page cache, swap cache or on disk. If they are on either of the
> caches, then the original implementation still works. If not, then the
> missing information will be retrieved from the XArray.
> 
> Co-developed-by: Florian Schmidt <florian.schmidt@nutanix.com>
> Signed-off-by: Florian Schmidt <florian.schmidt@nutanix.com>
> Co-developed-by: Carl Waldspurger <carl.waldspurger@nutanix.com>
> Signed-off-by: Carl Waldspurger <carl.waldspurger@nutanix.com>
> Co-developed-by: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
> Signed-off-by: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
> Signed-off-by: Tiberiu Georgescu <tiberiu.georgescu@nutanix.com>
> ---
>  fs/proc/task_mmu.c | 37 +++++++++++++++++++++++++++++--------
>  1 file changed, 29 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index eb97468dfe4c..b17c8aedd32e 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1359,12 +1359,25 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,
>  	return err;
>  }
>  
> +static void *get_xa_entry_at_vma_addr(struct vm_area_struct *vma,
> +		unsigned long addr)
> +{
> +	struct inode *inode = file_inode(vma->vm_file);
> +	struct address_space *mapping = inode->i_mapping;
> +	pgoff_t offset = linear_page_index(vma, addr);
> +
> +	return xa_load(&mapping->i_pages, offset);
> +}
> +
>  static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
>  		struct vm_area_struct *vma, unsigned long addr, pte_t pte)
>  {
>  	u64 frame = 0, flags = 0;
>  	struct page *page = NULL;
>  
> +	if (vma->vm_flags & VM_SOFTDIRTY)
> +		flags |= PM_SOFT_DIRTY;
> +
>  	if (pte_present(pte)) {
>  		if (pm->show_pfn)
>  			frame = pte_pfn(pte);
> @@ -1374,13 +1387,22 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
>  			flags |= PM_SOFT_DIRTY;
>  		if (pte_uffd_wp(pte))
>  			flags |= PM_UFFD_WP;
> -	} else if (is_swap_pte(pte)) {
> +	} else if (is_swap_pte(pte) || shmem_file(vma->vm_file)) {
>  		swp_entry_t entry;
> -		if (pte_swp_soft_dirty(pte))
> -			flags |= PM_SOFT_DIRTY;
> -		if (pte_swp_uffd_wp(pte))
> -			flags |= PM_UFFD_WP;
> -		entry = pte_to_swp_entry(pte);
> +		if (is_swap_pte(pte)) {
> +			entry = pte_to_swp_entry(pte);
> +			if (pte_swp_soft_dirty(pte))
> +				flags |= PM_SOFT_DIRTY;
> +			if (pte_swp_uffd_wp(pte))
> +				flags |= PM_UFFD_WP;
> +		} else {
> +			void *xa_entry = get_xa_entry_at_vma_addr(vma, addr);
> +
> +			if (xa_is_value(xa_entry))
> +				entry = radix_to_swp_entry(xa_entry);
> +			else
> +				goto out;
> +		}
>  		if (pm->show_pfn)
>  			frame = swp_type(entry) |
>  				(swp_offset(entry) << MAX_SWAPFILES_SHIFT);
> @@ -1393,9 +1415,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
>  		flags |= PM_FILE;
>  	if (page && page_mapcount(page) == 1)
>  		flags |= PM_MMAP_EXCLUSIVE;
> -	if (vma->vm_flags & VM_SOFTDIRTY)
> -		flags |= PM_SOFT_DIRTY;

IMHO moving this to the top of the function will only work for the
initial iteration, and it won't really help anything: soft-dirty should
always be used in pair with a write of value "4" to clear_refs first,
otherwise all pages will be marked soft-dirty and the pagemap data is
meaningless.

After the "write 4" op, VM_SOFTDIRTY will be cleared, and I expect the
test case to see all zeros again even with this patch.

I think one way to fix this is to do something similar to uffd-wp: we
leave a marker in the pte showing that the pte was soft-dirtied even if
the page is swapped out.  However, we don't have a mechanism for that yet
in current Linux, and the uffd-wp series is the first one trying to
introduce something like that.
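
For reference, the usual cycle is (a sketch of the sequence described in
Documentation/admin-guide/mm/soft-dirty.rst):

	#include <fcntl.h>
	#include <unistd.h>

	/* 1) Clear the soft-dirty bits (and VM_SOFTDIRTY) of this task. */
	int fd = open("/proc/self/clear_refs", O_WRONLY);
	write(fd, "4", 1);
	close(fd);
	/* 2) Let the task run for a while.
	 * 3) Read /proc/self/pagemap and test bit 55 for each page: only
	 *    pages written since step 1 should have it set. */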

Thanks,

-- 
Peter Xu



* Re: [RFC PATCH 1/1] pagemap: report swap location for shared pages
  2021-07-14 16:08   ` Peter Xu
@ 2021-07-14 16:24     ` David Hildenbrand
  2021-07-14 16:30       ` David Hildenbrand
  2021-07-15  9:48     ` Tiberiu Georgescu
  1 sibling, 1 reply; 8+ messages in thread
From: David Hildenbrand @ 2021-07-14 16:24 UTC
  To: Peter Xu, Tiberiu Georgescu
  Cc: akpm, catalin.marinas, peterz, chinwen.chang, linmiaohe, jannh,
	apopple, christian.brauner, ebiederm, adobriyan, songmuchun,
	axboe, linux-kernel, linux-fsdevel, linux-mm, ivan.teterevkov,
	florian.schmidt, carl.waldspurger, Hugh Dickins,
	Andrea Arcangeli

On 14.07.21 18:08, Peter Xu wrote:
> On Wed, Jul 14, 2021 at 03:24:26PM +0000, Tiberiu Georgescu wrote:
>> When a page allocated using the MAP_SHARED flag is swapped out, its pagemap
>> entry is cleared. In many cases, there is no difference between swapped-out
>> shared pages and newly allocated, non-dirty pages in the pagemap interface.
>>
>> This patch addresses the behaviour and modifies pte_to_pagemap_entry() to
>> make use of the XArray associated with the virtual memory area struct
>> passed as an argument. The XArray contains the location of virtual pages
>> in the page cache, swap cache or on disk. If they are on either of the
>> caches, then the original implementation still works. If not, then the
>> missing information will be retrieved from the XArray.
>>
>> Co-developed-by: Florian Schmidt <florian.schmidt@nutanix.com>
>> Signed-off-by: Florian Schmidt <florian.schmidt@nutanix.com>
>> Co-developed-by: Carl Waldspurger <carl.waldspurger@nutanix.com>
>> Signed-off-by: Carl Waldspurger <carl.waldspurger@nutanix.com>
>> Co-developed-by: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
>> Signed-off-by: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
>> Signed-off-by: Tiberiu Georgescu <tiberiu.georgescu@nutanix.com>
>> ---
>>   fs/proc/task_mmu.c | 37 +++++++++++++++++++++++++++++--------
>>   1 file changed, 29 insertions(+), 8 deletions(-)
>>
>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> index eb97468dfe4c..b17c8aedd32e 100644
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
>> @@ -1359,12 +1359,25 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,
>>   	return err;
>>   }
>>   
>> +static void *get_xa_entry_at_vma_addr(struct vm_area_struct *vma,
>> +		unsigned long addr)
>> +{
>> +	struct inode *inode = file_inode(vma->vm_file);
>> +	struct address_space *mapping = inode->i_mapping;
>> +	pgoff_t offset = linear_page_index(vma, addr);
>> +
>> +	return xa_load(&mapping->i_pages, offset);
>> +}
>> +
>>   static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
>>   		struct vm_area_struct *vma, unsigned long addr, pte_t pte)
>>   {
>>   	u64 frame = 0, flags = 0;
>>   	struct page *page = NULL;
>>   
>> +	if (vma->vm_flags & VM_SOFTDIRTY)
>> +		flags |= PM_SOFT_DIRTY;
>> +
>>   	if (pte_present(pte)) {
>>   		if (pm->show_pfn)
>>   			frame = pte_pfn(pte);
>> @@ -1374,13 +1387,22 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
>>   			flags |= PM_SOFT_DIRTY;
>>   		if (pte_uffd_wp(pte))
>>   			flags |= PM_UFFD_WP;
>> -	} else if (is_swap_pte(pte)) {
>> +	} else if (is_swap_pte(pte) || shmem_file(vma->vm_file)) {
>>   		swp_entry_t entry;
>> -		if (pte_swp_soft_dirty(pte))
>> -			flags |= PM_SOFT_DIRTY;
>> -		if (pte_swp_uffd_wp(pte))
>> -			flags |= PM_UFFD_WP;
>> -		entry = pte_to_swp_entry(pte);
>> +		if (is_swap_pte(pte)) {
>> +			entry = pte_to_swp_entry(pte);
>> +			if (pte_swp_soft_dirty(pte))
>> +				flags |= PM_SOFT_DIRTY;
>> +			if (pte_swp_uffd_wp(pte))
>> +				flags |= PM_UFFD_WP;
>> +		} else {
>> +			void *xa_entry = get_xa_entry_at_vma_addr(vma, addr);
>> +
>> +			if (xa_is_value(xa_entry))
>> +				entry = radix_to_swp_entry(xa_entry);
>> +			else
>> +				goto out;
>> +		}
>>   		if (pm->show_pfn)
>>   			frame = swp_type(entry) |
>>   				(swp_offset(entry) << MAX_SWAPFILES_SHIFT);
>> @@ -1393,9 +1415,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
>>   		flags |= PM_FILE;
>>   	if (page && page_mapcount(page) == 1)
>>   		flags |= PM_MMAP_EXCLUSIVE;
>> -	if (vma->vm_flags & VM_SOFTDIRTY)
>> -		flags |= PM_SOFT_DIRTY;
> 
> IMHO moving this to the entry will only work for the initial iteration, however
> it won't really help anything, as soft-dirty should always be used in pair with
> clear_refs written with value "4" first otherwise all pages will be marked
> soft-dirty then the pagemap data is meaningless.
> 
> After the "write 4" op VM_SOFTDIRTY will be cleared and I expect the test case
> to see all zeros again even with the patch.
> 
> I think one way to fix this is to do something similar to uffd-wp: we leave a
> marker in pte showing that this is soft-dirtied pte even if swapped out.

What exactly does such a pte look like? Simply pte_none() with another
bit set?

> However we don't have a mechanism for that yet in current linux, and the
> uffd-wp series is the first one trying to introduce something like that.

Can you give me a pointer? I'm very interested in learning how to 
identify this case.

-- 
Thanks,

David / dhildenb



* Re: [RFC PATCH 1/1] pagemap: report swap location for shared pages
  2021-07-14 16:24     ` David Hildenbrand
@ 2021-07-14 16:30       ` David Hildenbrand
  2021-07-14 17:12         ` Peter Xu
  0 siblings, 1 reply; 8+ messages in thread
From: David Hildenbrand @ 2021-07-14 16:30 UTC
  To: Peter Xu, Tiberiu Georgescu
  Cc: akpm, catalin.marinas, peterz, chinwen.chang, linmiaohe, jannh,
	apopple, christian.brauner, ebiederm, adobriyan, songmuchun,
	axboe, linux-kernel, linux-fsdevel, linux-mm, ivan.teterevkov,
	florian.schmidt, carl.waldspurger, Hugh Dickins,
	Andrea Arcangeli

On 14.07.21 18:24, David Hildenbrand wrote:
> On 14.07.21 18:08, Peter Xu wrote:
>> On Wed, Jul 14, 2021 at 03:24:26PM +0000, Tiberiu Georgescu wrote:
>>> When a page allocated using the MAP_SHARED flag is swapped out, its pagemap
>>> entry is cleared. In many cases, there is no difference between swapped-out
>>> shared pages and newly allocated, non-dirty pages in the pagemap interface.
>>>
>>> This patch addresses the behaviour and modifies pte_to_pagemap_entry() to
>>> make use of the XArray associated with the virtual memory area struct
>>> passed as an argument. The XArray contains the location of virtual pages
>>> in the page cache, swap cache or on disk. If they are on either of the
>>> caches, then the original implementation still works. If not, then the
>>> missing information will be retrieved from the XArray.
>>>
>>> Co-developed-by: Florian Schmidt <florian.schmidt@nutanix.com>
>>> Signed-off-by: Florian Schmidt <florian.schmidt@nutanix.com>
>>> Co-developed-by: Carl Waldspurger <carl.waldspurger@nutanix.com>
>>> Signed-off-by: Carl Waldspurger <carl.waldspurger@nutanix.com>
>>> Co-developed-by: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
>>> Signed-off-by: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
>>> Signed-off-by: Tiberiu Georgescu <tiberiu.georgescu@nutanix.com>
>>> ---
>>>    fs/proc/task_mmu.c | 37 +++++++++++++++++++++++++++++--------
>>>    1 file changed, 29 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>>> index eb97468dfe4c..b17c8aedd32e 100644
>>> --- a/fs/proc/task_mmu.c
>>> +++ b/fs/proc/task_mmu.c
>>> @@ -1359,12 +1359,25 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,
>>>    	return err;
>>>    }
>>>    
>>> +static void *get_xa_entry_at_vma_addr(struct vm_area_struct *vma,
>>> +		unsigned long addr)
>>> +{
>>> +	struct inode *inode = file_inode(vma->vm_file);
>>> +	struct address_space *mapping = inode->i_mapping;
>>> +	pgoff_t offset = linear_page_index(vma, addr);
>>> +
>>> +	return xa_load(&mapping->i_pages, offset);
>>> +}
>>> +
>>>    static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
>>>    		struct vm_area_struct *vma, unsigned long addr, pte_t pte)
>>>    {
>>>    	u64 frame = 0, flags = 0;
>>>    	struct page *page = NULL;
>>>    
>>> +	if (vma->vm_flags & VM_SOFTDIRTY)
>>> +		flags |= PM_SOFT_DIRTY;
>>> +
>>>    	if (pte_present(pte)) {
>>>    		if (pm->show_pfn)
>>>    			frame = pte_pfn(pte);
>>> @@ -1374,13 +1387,22 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
>>>    			flags |= PM_SOFT_DIRTY;
>>>    		if (pte_uffd_wp(pte))
>>>    			flags |= PM_UFFD_WP;
>>> -	} else if (is_swap_pte(pte)) {
>>> +	} else if (is_swap_pte(pte) || shmem_file(vma->vm_file)) {
>>>    		swp_entry_t entry;
>>> -		if (pte_swp_soft_dirty(pte))
>>> -			flags |= PM_SOFT_DIRTY;
>>> -		if (pte_swp_uffd_wp(pte))
>>> -			flags |= PM_UFFD_WP;
>>> -		entry = pte_to_swp_entry(pte);
>>> +		if (is_swap_pte(pte)) {
>>> +			entry = pte_to_swp_entry(pte);
>>> +			if (pte_swp_soft_dirty(pte))
>>> +				flags |= PM_SOFT_DIRTY;
>>> +			if (pte_swp_uffd_wp(pte))
>>> +				flags |= PM_UFFD_WP;
>>> +		} else {
>>> +			void *xa_entry = get_xa_entry_at_vma_addr(vma, addr);
>>> +
>>> +			if (xa_is_value(xa_entry))
>>> +				entry = radix_to_swp_entry(xa_entry);
>>> +			else
>>> +				goto out;
>>> +		}
>>>    		if (pm->show_pfn)
>>>    			frame = swp_type(entry) |
>>>    				(swp_offset(entry) << MAX_SWAPFILES_SHIFT);
>>> @@ -1393,9 +1415,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
>>>    		flags |= PM_FILE;
>>>    	if (page && page_mapcount(page) == 1)
>>>    		flags |= PM_MMAP_EXCLUSIVE;
>>> -	if (vma->vm_flags & VM_SOFTDIRTY)
>>> -		flags |= PM_SOFT_DIRTY;
>>
>> IMHO moving this to the entry will only work for the initial iteration, however
>> it won't really help anything, as soft-dirty should always be used in pair with
>> clear_refs written with value "4" first otherwise all pages will be marked
>> soft-dirty then the pagemap data is meaningless.
>>
>> After the "write 4" op VM_SOFTDIRTY will be cleared and I expect the test case
>> to see all zeros again even with the patch.
>>
>> I think one way to fix this is to do something similar to uffd-wp: we leave a
>> marker in pte showing that this is soft-dirtied pte even if swapped out.
> 
> How exactly does such a pte look like? Simply pte_none() with another
> bit set?
> 
>> However we don't have a mechanism for that yet in current linux, and the
>> uffd-wp series is the first one trying to introduce something like that.
> 
> Can you give me a pointer? I'm very interested in learning how to
> identify this case.
> 

I assume it's 
https://lore.kernel.org/lkml/20210527202117.30689-1-peterx@redhat.com/

-- 
Thanks,

David / dhildenb



* Re: [RFC PATCH 1/1] pagemap: report swap location for shared pages
  2021-07-14 16:30       ` David Hildenbrand
@ 2021-07-14 17:12         ` Peter Xu
  0 siblings, 0 replies; 8+ messages in thread
From: Peter Xu @ 2021-07-14 17:12 UTC
  To: David Hildenbrand
  Cc: Tiberiu Georgescu, akpm, catalin.marinas, peterz, chinwen.chang,
	linmiaohe, jannh, apopple, christian.brauner, ebiederm,
	adobriyan, songmuchun, axboe, linux-kernel, linux-fsdevel,
	linux-mm, ivan.teterevkov, florian.schmidt, carl.waldspurger,
	Hugh Dickins, Andrea Arcangeli

On Wed, Jul 14, 2021 at 06:30:05PM +0200, David Hildenbrand wrote:
> On 14.07.21 18:24, David Hildenbrand wrote:
> > On 14.07.21 18:08, Peter Xu wrote:
> > > On Wed, Jul 14, 2021 at 03:24:26PM +0000, Tiberiu Georgescu wrote:
> > > > When a page allocated using the MAP_SHARED flag is swapped out, its pagemap
> > > > entry is cleared. In many cases, there is no difference between swapped-out
> > > > shared pages and newly allocated, non-dirty pages in the pagemap interface.
> > > > 
> > > > This patch addresses the behaviour and modifies pte_to_pagemap_entry() to
> > > > make use of the XArray associated with the virtual memory area struct
> > > > passed as an argument. The XArray contains the location of virtual pages
> > > > in the page cache, swap cache or on disk. If they are on either of the
> > > > caches, then the original implementation still works. If not, then the
> > > > missing information will be retrieved from the XArray.
> > > > 
> > > > Co-developed-by: Florian Schmidt <florian.schmidt@nutanix.com>
> > > > Signed-off-by: Florian Schmidt <florian.schmidt@nutanix.com>
> > > > Co-developed-by: Carl Waldspurger <carl.waldspurger@nutanix.com>
> > > > Signed-off-by: Carl Waldspurger <carl.waldspurger@nutanix.com>
> > > > Co-developed-by: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
> > > > Signed-off-by: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
> > > > Signed-off-by: Tiberiu Georgescu <tiberiu.georgescu@nutanix.com>
> > > > ---
> > > >    fs/proc/task_mmu.c | 37 +++++++++++++++++++++++++++++--------
> > > >    1 file changed, 29 insertions(+), 8 deletions(-)
> > > > 
> > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > > index eb97468dfe4c..b17c8aedd32e 100644
> > > > --- a/fs/proc/task_mmu.c
> > > > +++ b/fs/proc/task_mmu.c
> > > > @@ -1359,12 +1359,25 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,
> > > >    	return err;
> > > >    }
> > > > +static void *get_xa_entry_at_vma_addr(struct vm_area_struct *vma,
> > > > +		unsigned long addr)
> > > > +{
> > > > +	struct inode *inode = file_inode(vma->vm_file);
> > > > +	struct address_space *mapping = inode->i_mapping;
> > > > +	pgoff_t offset = linear_page_index(vma, addr);
> > > > +
> > > > +	return xa_load(&mapping->i_pages, offset);
> > > > +}
> > > > +
> > > >    static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
> > > >    		struct vm_area_struct *vma, unsigned long addr, pte_t pte)
> > > >    {
> > > >    	u64 frame = 0, flags = 0;
> > > >    	struct page *page = NULL;
> > > > +	if (vma->vm_flags & VM_SOFTDIRTY)
> > > > +		flags |= PM_SOFT_DIRTY;
> > > > +
> > > >    	if (pte_present(pte)) {
> > > >    		if (pm->show_pfn)
> > > >    			frame = pte_pfn(pte);
> > > > @@ -1374,13 +1387,22 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
> > > >    			flags |= PM_SOFT_DIRTY;
> > > >    		if (pte_uffd_wp(pte))
> > > >    			flags |= PM_UFFD_WP;
> > > > -	} else if (is_swap_pte(pte)) {
> > > > +	} else if (is_swap_pte(pte) || shmem_file(vma->vm_file)) {
> > > >    		swp_entry_t entry;
> > > > -		if (pte_swp_soft_dirty(pte))
> > > > -			flags |= PM_SOFT_DIRTY;
> > > > -		if (pte_swp_uffd_wp(pte))
> > > > -			flags |= PM_UFFD_WP;
> > > > -		entry = pte_to_swp_entry(pte);
> > > > +		if (is_swap_pte(pte)) {
> > > > +			entry = pte_to_swp_entry(pte);
> > > > +			if (pte_swp_soft_dirty(pte))
> > > > +				flags |= PM_SOFT_DIRTY;
> > > > +			if (pte_swp_uffd_wp(pte))
> > > > +				flags |= PM_UFFD_WP;
> > > > +		} else {
> > > > +			void *xa_entry = get_xa_entry_at_vma_addr(vma, addr);
> > > > +
> > > > +			if (xa_is_value(xa_entry))
> > > > +				entry = radix_to_swp_entry(xa_entry);
> > > > +			else
> > > > +				goto out;
> > > > +		}
> > > >    		if (pm->show_pfn)
> > > >    			frame = swp_type(entry) |
> > > >    				(swp_offset(entry) << MAX_SWAPFILES_SHIFT);
> > > > @@ -1393,9 +1415,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
> > > >    		flags |= PM_FILE;
> > > >    	if (page && page_mapcount(page) == 1)
> > > >    		flags |= PM_MMAP_EXCLUSIVE;
> > > > -	if (vma->vm_flags & VM_SOFTDIRTY)
> > > > -		flags |= PM_SOFT_DIRTY;
> > > 
> > > IMHO moving this to the entry will only work for the initial iteration, however
> > > it won't really help anything, as soft-dirty should always be used in pair with
> > > clear_refs written with value "4" first otherwise all pages will be marked
> > > soft-dirty then the pagemap data is meaningless.
> > > 
> > > After the "write 4" op VM_SOFTDIRTY will be cleared and I expect the test case
> > > to see all zeros again even with the patch.
> > > 
> > > I think one way to fix this is to do something similar to uffd-wp: we leave a
> > > marker in pte showing that this is soft-dirtied pte even if swapped out.
> > 
> > How exactly does such a pte look like? Simply pte_none() with another
> > bit set?

Yes, something like that.  The pte can be defined at will, as long as it
is never used elsewhere.

> > 
> > > However we don't have a mechanism for that yet in current linux, and the
> > > uffd-wp series is the first one trying to introduce something like that.
> > 
> > Can you give me a pointer? I'm very interested in learning how to
> > identify this case.
> > 
> 
> I assume it's
> https://lore.kernel.org/lkml/20210527202117.30689-1-peterx@redhat.com/

Yes.

-- 
Peter Xu



* Re: [RFC PATCH 1/1] pagemap: report swap location for shared pages
  2021-07-14 16:08   ` Peter Xu
  2021-07-14 16:24     ` David Hildenbrand
@ 2021-07-15  9:48     ` Tiberiu Georgescu
  1 sibling, 0 replies; 8+ messages in thread
From: Tiberiu Georgescu @ 2021-07-15  9:48 UTC
  To: Peter Xu
  Cc: akpm, catalin.marinas, peterz, chinwen.chang, linmiaohe, jannh,
	apopple, christian.brauner, ebiederm, adobriyan, songmuchun,
	axboe, linux-kernel, linux-fsdevel, linux-mm, Ivan Teterevkov,
	Florian Schmidt, Carl Waldspurger [C],
	Hugh Dickins, Andrea Arcangeli


> On 14 Jul 2021, at 17:08, Peter Xu <peterx@redhat.com> wrote:
> 
> On Wed, Jul 14, 2021 at 03:24:26PM +0000, Tiberiu Georgescu wrote:
>> 
>> static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
>> 		struct vm_area_struct *vma, unsigned long addr, pte_t pte)
>> {
>> 	u64 frame = 0, flags = 0;
>> 	struct page *page = NULL;
>> 
>> +	if (vma->vm_flags & VM_SOFTDIRTY)
>> +		flags |= PM_SOFT_DIRTY;
>> +
>> 	if (pte_present(pte)) {
>> 		if (pm->show_pfn)
>> 			frame = pte_pfn(pte);
>> @@ -1374,13 +1387,22 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
>> 			flags |= PM_SOFT_DIRTY;
>> 		if (pte_uffd_wp(pte))
>> 			flags |= PM_UFFD_WP;
>> -	} else if (is_swap_pte(pte)) {
>> +	} else if (is_swap_pte(pte) || shmem_file(vma->vm_file)) {
>> 		swp_entry_t entry;
>> -		if (pte_swp_soft_dirty(pte))
>> -			flags |= PM_SOFT_DIRTY;
>> -		if (pte_swp_uffd_wp(pte))
>> -			flags |= PM_UFFD_WP;
>> -		entry = pte_to_swp_entry(pte);
>> +		if (is_swap_pte(pte)) {
>> +			entry = pte_to_swp_entry(pte);
>> +			if (pte_swp_soft_dirty(pte))
>> +				flags |= PM_SOFT_DIRTY;
>> +			if (pte_swp_uffd_wp(pte))
>> +				flags |= PM_UFFD_WP;
>> +		} else {
>> +			void *xa_entry = get_xa_entry_at_vma_addr(vma, addr);
>> +
>> +			if (xa_is_value(xa_entry))
>> +				entry = radix_to_swp_entry(xa_entry);
>> +			else
>> +				goto out;
>> +		}
>> 		if (pm->show_pfn)
>> 			frame = swp_type(entry) |
>> 				(swp_offset(entry) << MAX_SWAPFILES_SHIFT);
>> @@ -1393,9 +1415,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
>> 		flags |= PM_FILE;
>> 	if (page && page_mapcount(page) == 1)
>> 		flags |= PM_MMAP_EXCLUSIVE;
>> -	if (vma->vm_flags & VM_SOFTDIRTY)
>> -		flags |= PM_SOFT_DIRTY;
> 
> IMHO moving this to the entry will only work for the initial iteration, however
> it won't really help anything, as soft-dirty should always be used in pair with
> clear_refs written with value "4" first otherwise all pages will be marked
> soft-dirty then the pagemap data is meaningless.
> 
> After the "write 4" op VM_SOFTDIRTY will be cleared and I expect the test case
> to see all zeros again even with the patch.

Indeed, the SOFT_DIRTY bit gets cleared and does not get set again when
we dirty the page and swap it out. However, the pagemap entries are not
completely zeroed out. The patch mostly deals with adding the swap frame
offset to the pagemap entries of swappable, non-syncable pages, even when
they are MAP_SHARED.
Example output post-patch, after writing 4 to clear_refs and dirtying the pages:
        
        $ dd if=/proc/$PID/pagemap ibs=8 skip=$(($VADDR / $PAGESIZE)) count=256 | hexdump -C
        00000000  80 13 01 00 00 00 00 40  a0 13 01 00 00 00 00 40  |.......@.......@|
        ...........more swapped-out entries............
        000005e0  e0 2a 01 00 00 00 00 40  00 2b 01 00 00 00 00 40  |.*.....@.+.....@|
        000005f0  20 2b 01 00 00 00 00 40  40 2b 01 00 00 00 00 40  | +.....@@+.....@|
        00000600  72 6c 1d 00 00 00 80 a1  c1 34 12 00 00 00 80 a1  |rl.......4......|
        ...........more in-memory entries............
        000007f0  3c 21 18 00 00 00 80 a1  69 ec 17 00 00 00 80 a1  |<!......i.......|
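
For illustration, here is the first qword above (bytes
80 13 01 00 00 00 00 40, little-endian, i.e. 0x4000000000011380) decoded
with the same bit layout; a sketch only:

	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		uint64_t pme = 0x4000000000011380ULL;	/* first entry above */

		printf("swapped:    %d\n", (int)((pme >> 62) & 1)); /* 1 */
		printf("soft-dirty: %d\n", (int)((pme >> 55) & 1)); /* 0: lost, as noted */
		printf("swap type:  %u\n", (unsigned)(pme & 0x1f)); /* bits 0-4 */
		printf("swap off:   %#llx\n",	/* bits 5-54, here 0x89c */
		       (unsigned long long)((pme & ((1ULL << 55) - 1)) >> 5));
		return 0;
	}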

You may find the pre-patch example output in the RFC cover letter, for reference:
https://lkml.org/lkml/2021/7/14/594

> I think one way to fix this is to do something similar to uffd-wp: we leave a
> marker in pte showing that this is soft-dirtied pte even if swapped out.
> However we don't have a mechanism for that yet in current linux, and the
> uffd-wp series is the first one trying to introduce something like that.

I am taking a look at the uffd-wp patch today. I hope it gets upstreamed
soon, so that I can adapt one of its mechanisms to keep track of the
SOFT_DIRTY bit on the PTE after swap-out.

Kind regards,
Tibi

