linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/1] pagemap: swap location for shared pages
@ 2021-07-30 16:08 Tiberiu A Georgescu
  2021-07-30 16:08 ` [PATCH 1/1] pagemap: report " Tiberiu A Georgescu
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Tiberiu A Georgescu @ 2021-07-30 16:08 UTC (permalink / raw)
  To: akpm, viro, peterx, david, christian.brauner, ebiederm,
	adobriyan, songmuchun, axboe, vincenzo.frascino, catalin.marinas,
	peterz, chinwen.chang, linmiaohe, jannh, apopple, linux-kernel,
	linux-fsdevel, linux-mm
  Cc: ivan.teterevkov, florian.schmidt, carl.waldspurger,
	jonathan.davies, Tiberiu A Georgescu

This patch follows up on a previous RFC:
20210714152426.216217-1-tiberiu.georgescu@nutanix.com

When a page allocated using the MAP_SHARED flag is swapped out, its pagemap
entry is cleared. In many cases, there is no difference between swapped-out
shared pages and newly allocated, non-dirty pages in the pagemap interface.

Example pagemap-test code (Tested on Kernel Version 5.14-rc3):
    #define NPAGES (256)
    /* map 1MiB shared memory */
    size_t pagesize = getpagesize();
    char *p = mmap(NULL, pagesize * NPAGES, PROT_READ | PROT_WRITE,
    		   MAP_ANONYMOUS | MAP_SHARED, -1, 0);
    /* Dirty new pages. */
    for (int i = 0; i < NPAGES; i++)
    	p[i * pagesize] = i;

Run the above program in a small cgroup, which causes swapping:
    /* Initialise cgroup & run a program */
    $ echo 512K > foo/memory.limit_in_bytes
    $ echo 60 > foo/memory.swappiness
    $ cgexec -g memory:foo ./pagemap-test

Check the pagemap report. Example of the current expected output:
    $ dd if=/proc/$PID/pagemap ibs=8 skip=$(($VADDR / $PAGESIZE)) count=$COUNT | hexdump -C
    00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
    *
    00000710  e1 6b 06 00 00 00 00 a1  9e eb 06 00 00 00 00 a1  |.k..............|
    00000720  6b ee 06 00 00 00 00 a1  a5 a4 05 00 00 00 00 a1  |k...............|
    00000730  5c bf 06 00 00 00 00 a1  90 b6 06 00 00 00 00 a1  |\...............|

The first pagemap entries are reported as zeroes, suggesting the pages
have never been allocated, even though they have actually been swapped
out.
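The 64-bit entry layout is documented in Documentation/admin-guide/mm/pagemap.rst.
A minimal userspace sketch for decoding the entries in the hexdump above (the
helper names are ours, not kernel API):

```c
#include <stdint.h>

/* Bit layout of a 64-bit pagemap entry, as documented in
 * Documentation/admin-guide/mm/pagemap.rst. */
#define PM_PFN_MASK     ((UINT64_C(1) << 55) - 1)
#define PM_SOFT_DIRTY   (UINT64_C(1) << 55)
#define PM_MMAP_EXCL    (UINT64_C(1) << 56)
#define PM_UFFD_WP      (UINT64_C(1) << 57)
#define PM_FILE         (UINT64_C(1) << 61)
#define PM_SWAP         (UINT64_C(1) << 62)
#define PM_PRESENT      (UINT64_C(1) << 63)

static int pme_present(uint64_t pme) { return !!(pme & PM_PRESENT); }
static int pme_swapped(uint64_t pme) { return !!(pme & PM_SWAP); }

/* If present: bits 0-54 hold the PFN. */
static uint64_t pme_pfn(uint64_t pme) { return pme & PM_PFN_MASK; }

/* If swapped: bits 0-4 hold the swap type, bits 5-54 the swap offset. */
static unsigned pme_swap_type(uint64_t pme) { return pme & 0x1f; }
static uint64_t pme_swap_offset(uint64_t pme)
{
	return (pme & PM_PFN_MASK) >> 5;
}
```

For example, the first non-zero entry above (`e1 6b 06 00 00 00 00 a1`, read
little-endian as 0xa100000000066be1) decodes as present, file/shared-anon and
exclusively mapped, with PFN 0x66be1. A swapped-out page should instead have
bit 62 set plus the swap type and offset; those are the entries that come back
as all zeroes pre-patch.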

This patch addresses the behaviour and modifies pte_to_pagemap_entry() to
make use of the XArray associated with the virtual memory area struct
passed as an argument. The XArray contains the location of virtual pages in
the page cache, swap cache or on disk. If they are in either of the caches,
then the original implementation still works. If not, then the missing
information will be retrieved from the XArray.
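The XArray can tell page pointers apart from swap information because swap
entries are stored as tagged "value entries". A userspace sketch of that
tagging scheme, mirroring the encoding in include/linux/xarray.h (names copied
for illustration; this is not the kernel implementation itself):

```c
/* A "value entry" stores a small integer in a pointer-sized slot by
 * shifting it left and setting the low bit, which can never be set in
 * a real (aligned) pointer. */
static void *xa_mk_value(unsigned long v)
{
	return (void *)((v << 1) | 1);
}

static int xa_is_value(const void *entry)
{
	return ((unsigned long)entry & 1) != 0;
}

static unsigned long xa_to_value(const void *entry)
{
	return (unsigned long)entry >> 1;
}
```

In pte_to_pagemap_entry(), a non-value entry is a struct page pointer (the
page is still in the page or swap cache, so the original code path applies),
while a value entry packs the swap entry that radix_to_swp_entry() unpacks.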

Performance
============
I measured the performance of the patch on a single socket Xeon E5-2620
machine, with 128GiB of RAM and 128GiB of swap storage. These were the
steps taken:

  1. Run example pagemap-test code on a cgroup
    a. Set up cgroup with limit_in_bytes=4GiB and swappiness=60;
    b. allocate 16GiB (about 4 million pages);
    c. dirty 0%, 50% or 100% of pages;
    d. do this for both private and shared memory.
  2. Run `dd if=<PAGEMAP> ibs=8 skip=$(($VADDR / $PAGESIZE)) count=4194304`
     for each possible configuration above
    a.  3 times for warm up;
    b. 10 times to measure performance.
       Use `time` or another performance measuring tool.

Results (averaged over 10 iterations):
               +--------+------------+------------+
               | dirty% |  pre patch | post patch |
               +--------+------------+------------+
 private|anon  |     0% |      8.15s |      8.40s |
               |    50% |     11.83s |     12.19s |
               |   100% |     12.37s |     12.20s |
               +--------+------------+------------+
  shared|anon  |     0% |      8.17s |      8.18s |
               |    50% | (*) 10.43s |     37.43s |
               |   100% | (*) 10.20s |     38.59s |
               +--------+------------+------------+

(*): reminder that pre-patch produces incorrect pagemap entries for swapped
     out pages.

From run to run the above results are stable (mostly <1% standard error).

The amount of time it takes for a full read of the pagemap depends on the
granularity used by dd to read the pagemap file. Even though the access is
sequential, the script only reads 8 bytes at a time, running pagemap_read()
COUNT times (one time for each page in a 16GiB area).

To reduce overhead for large amounts of sequential access, we can use
batching: make dd read multiple pagemap entries at a time, allowing the
kernel to make optimisations and yield more throughput.
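A batched reader can also be written directly, rather than one 8-byte read per
page as `dd ibs=8` does: fetch up to BATCH entries per pread() so the kernel
walks a whole run of page tables per syscall. A sketch (names and sizes are
illustrative):

```c
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

enum { BATCH = 512 };

/* Count swapped-out pages (bit 62 set) in npages pagemap entries,
 * starting at page index first_page, reading BATCH entries at a time. */
static long count_swapped(const char *pagemap, uint64_t first_page,
			  uint64_t npages)
{
	uint64_t buf[BATCH];
	long swapped = 0;
	int fd = open(pagemap, O_RDONLY);

	if (fd < 0)
		return -1;

	for (uint64_t i = 0; i < npages; i += BATCH) {
		uint64_t n = npages - i < BATCH ? npages - i : BATCH;
		off_t off = (off_t)((first_page + i) * sizeof(uint64_t));
		ssize_t got = pread(fd, buf, n * sizeof(uint64_t), off);

		if (got <= 0)
			break;
		for (ssize_t j = 0; j < got / (ssize_t)sizeof(uint64_t); j++)
			if (buf[j] & (UINT64_C(1) << 62)) /* swapped */
				swapped++;
	}
	close(fd);
	return swapped;
}
```

Pointing this at /proc/$PID/pagemap with a large BATCH gives the same entries
as the dd invocations below, with one syscall per BATCH pages.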

Performance in real time (seconds) of
`dd if=<PAGEMAP> ibs=8*$BATCH skip=$(($VADDR / $PAGESIZE / $BATCH))
count=$((4194304 / $BATCH))`:
+---------------------------------+ +---------------------------------+
|     Shared, Anon, 50% dirty     | |     Shared, Anon, 100% dirty    |
+-------+------------+------------+ +-------+------------+------------+
| Batch |  Pre-patch | Post-patch | | Batch |  Pre-patch | Post-patch |
+-------+------------+------------+ +-------+------------+------------+
|     1 | (*) 10.43s |     37.43s | |     1 | (*) 10.20s |     38.59s |
|     2 | (*)  5.25s |     18.77s | |     2 | (*)  5.15s |     19.37s |
|     4 | (*)  2.63s |      9.42s | |     4 | (*)  2.63s |      9.74s |
|     8 | (*)  1.38s |      4.80s | |     8 | (*)  1.35s |      4.94s |
|    16 | (*)  0.73s |      2.46s | |    16 | (*)  0.72s |      2.54s |
|    32 | (*)  0.40s |      1.31s | |    32 | (*)  0.41s |      1.34s |
|    64 | (*)  0.25s |      0.72s | |    64 | (*)  0.24s |      0.74s |
|   128 | (*)  0.16s |      0.43s | |   128 | (*)  0.16s |      0.44s |
|   256 | (*)  0.12s |      0.28s | |   256 | (*)  0.12s |      0.29s |
|   512 | (*)  0.10s |      0.21s | |   512 | (*)  0.10s |      0.22s |
|  1024 | (*)  0.10s |      0.20s | |  1024 | (*)  0.10s |      0.21s |
+-------+------------+------------+ +-------+------------+------------+

To conclude, in order to make the most of the underlying mechanisms of
pagemap and the XArray, one should use batching to achieve better
performance.

Future Work
============

Note: there are PTE flags which currently do not survive the swap out when
the page is shmem: SOFT_DIRTY and UFFD_WP.

A solution for saving the state of the UFFD_WP flag has been proposed by
Peter Xu in the patch linked below. The concept and mechanism proposed
could be extended to include the SOFT_DIRTY bit as well:
20210715201422.211004-1-peterx@redhat.com
Our patches are mostly orthogonal.

Kind regards,
Tibi

Tiberiu A Georgescu (1):
  pagemap: report swap location for shared pages

 fs/proc/task_mmu.c | 38 ++++++++++++++++++++++++++++++--------
 1 file changed, 30 insertions(+), 8 deletions(-)

-- 
2.32.0.380.geb27b338a3


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/1] pagemap: report swap location for shared pages
  2021-07-30 16:08 [PATCH 0/1] pagemap: swap location for shared pages Tiberiu A Georgescu
@ 2021-07-30 16:08 ` Tiberiu A Georgescu
  2021-07-30 17:28 ` [PATCH 0/1] pagemap: " Eric W. Biederman
  2021-08-04 18:33 ` Peter Xu
  2 siblings, 0 replies; 13+ messages in thread
From: Tiberiu A Georgescu @ 2021-07-30 16:08 UTC (permalink / raw)
  To: akpm, viro, peterx, david, christian.brauner, ebiederm,
	adobriyan, songmuchun, axboe, vincenzo.frascino, catalin.marinas,
	peterz, chinwen.chang, linmiaohe, jannh, apopple, linux-kernel,
	linux-fsdevel, linux-mm
  Cc: ivan.teterevkov, florian.schmidt, carl.waldspurger,
	jonathan.davies, Tiberiu A Georgescu

When a page allocated using the MAP_SHARED flag is swapped out, its pagemap
entry is cleared. In many cases, there is no difference between swapped-out
shared pages and newly allocated, non-dirty pages in the pagemap interface.

This patch addresses the behaviour and modifies pte_to_pagemap_entry() to
make use of the XArray associated with the virtual memory area struct
passed as an argument. The XArray contains the location of virtual pages
in the page cache, swap cache or on disk. If they are in either of the
caches, then the original implementation still works. If not, then the
missing information will be retrieved from the XArray.

Signed-off-by: Tiberiu A Georgescu <tiberiu.georgescu@nutanix.com>
Reviewed-by: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
Reviewed-by: Florian Schmidt <florian.schmidt@nutanix.com>
Reviewed-by: Carl Waldspurger <carl.waldspurger@nutanix.com>
Reviewed-by: Jonathan Davies <jonathan.davies@nutanix.com>
---
 fs/proc/task_mmu.c | 38 ++++++++++++++++++++++++++++++--------
 1 file changed, 30 insertions(+), 8 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index eb97468dfe4c..894148f800be 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1359,12 +1359,25 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,
 	return err;
 }
 
+static void *get_xa_entry_at_vma_addr(struct vm_area_struct *vma,
+		unsigned long addr)
+{
+	struct inode *inode = file_inode(vma->vm_file);
+	struct address_space *mapping = inode->i_mapping;
+	pgoff_t offset = linear_page_index(vma, addr);
+
+	return xa_load(&mapping->i_pages, offset);
+}
+
 static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
 		struct vm_area_struct *vma, unsigned long addr, pte_t pte)
 {
 	u64 frame = 0, flags = 0;
 	struct page *page = NULL;
 
+	if (vma->vm_flags & VM_SOFTDIRTY)
+		flags |= PM_SOFT_DIRTY;
+
 	if (pte_present(pte)) {
 		if (pm->show_pfn)
 			frame = pte_pfn(pte);
@@ -1374,13 +1387,23 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
 			flags |= PM_SOFT_DIRTY;
 		if (pte_uffd_wp(pte))
 			flags |= PM_UFFD_WP;
-	} else if (is_swap_pte(pte)) {
+	} else { /* Could be in swap */
 		swp_entry_t entry;
-		if (pte_swp_soft_dirty(pte))
-			flags |= PM_SOFT_DIRTY;
-		if (pte_swp_uffd_wp(pte))
-			flags |= PM_UFFD_WP;
-		entry = pte_to_swp_entry(pte);
+		if (is_swap_pte(pte)) {
+			if (pte_swp_soft_dirty(pte))
+				flags |= PM_SOFT_DIRTY;
+			if (pte_swp_uffd_wp(pte))
+				flags |= PM_UFFD_WP;
+			entry = pte_to_swp_entry(pte);
+		} else if (shmem_file(vma->vm_file)) {
+			void *xa_entry = get_xa_entry_at_vma_addr(vma, addr);
+
+			if (!xa_is_value(xa_entry)) /* Not a swap offset */
+				goto out;
+			entry = radix_to_swp_entry(xa_entry);
+		} else {
+			goto out;
+		}
 		if (pm->show_pfn)
 			frame = swp_type(entry) |
 				(swp_offset(entry) << MAX_SWAPFILES_SHIFT);
@@ -1393,9 +1416,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
 		flags |= PM_FILE;
 	if (page && page_mapcount(page) == 1)
 		flags |= PM_MMAP_EXCLUSIVE;
-	if (vma->vm_flags & VM_SOFTDIRTY)
-		flags |= PM_SOFT_DIRTY;
 
+out:
 	return make_pme(frame, flags);
 }
 
-- 
2.32.0.380.geb27b338a3



* Re: [PATCH 0/1] pagemap: swap location for shared pages
  2021-07-30 16:08 [PATCH 0/1] pagemap: swap location for shared pages Tiberiu A Georgescu
  2021-07-30 16:08 ` [PATCH 1/1] pagemap: report " Tiberiu A Georgescu
@ 2021-07-30 17:28 ` Eric W. Biederman
  2021-08-02 12:20   ` Tiberiu Georgescu
  2021-08-04 18:33 ` Peter Xu
  2 siblings, 1 reply; 13+ messages in thread
From: Eric W. Biederman @ 2021-07-30 17:28 UTC (permalink / raw)
  To: Tiberiu A Georgescu
  Cc: akpm, viro, peterx, david, christian.brauner, adobriyan,
	songmuchun, axboe, vincenzo.frascino, catalin.marinas, peterz,
	chinwen.chang, linmiaohe, jannh, apopple, linux-kernel,
	linux-fsdevel, linux-mm, ivan.teterevkov, florian.schmidt,
	carl.waldspurger, jonathan.davies

Tiberiu A Georgescu <tiberiu.georgescu@nutanix.com> writes:

> This patch follows up on a previous RFC:
> 20210714152426.216217-1-tiberiu.georgescu@nutanix.com
>
> When a page allocated using the MAP_SHARED flag is swapped out, its pagemap
> entry is cleared. In many cases, there is no difference between swapped-out
> shared pages and newly allocated, non-dirty pages in the pagemap
> interface.

What is the point?

You say a shared swapped out page is the same as a clean shared page
and you are exactly correct.  What is the point in knowing a shared
page was swapped out?  What is the gain?

I tried to understand the point by looking at your numbers below
and everything I could see looked worse post patch.  

Eric



> Example pagemap-test code (Tested on Kernel Version 5.14-rc3):
>     #define NPAGES (256)
>     /* map 1MiB shared memory */
>     size_t pagesize = getpagesize();
>     char *p = mmap(NULL, pagesize * NPAGES, PROT_READ | PROT_WRITE,
>     		   MAP_ANONYMOUS | MAP_SHARED, -1, 0);
>     /* Dirty new pages. */
>     for (int i = 0; i < NPAGES; i++)
>     	p[i * pagesize] = i;
>
> Run the above program in a small cgroup, which causes swapping:
>     /* Initialise cgroup & run a program */
>     $ echo 512K > foo/memory.limit_in_bytes
>     $ echo 60 > foo/memory.swappiness
>     $ cgexec -g memory:foo ./pagemap-test
>
> Check the pagemap report. Example of the current expected output:
>     $ dd if=/proc/$PID/pagemap ibs=8 skip=$(($VADDR / $PAGESIZE)) count=$COUNT | hexdump -C
>     00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>     *
>     00000710  e1 6b 06 00 00 00 00 a1  9e eb 06 00 00 00 00 a1  |.k..............|
>     00000720  6b ee 06 00 00 00 00 a1  a5 a4 05 00 00 00 00 a1  |k...............|
>     00000730  5c bf 06 00 00 00 00 a1  90 b6 06 00 00 00 00 a1  |\...............|
>
> The first pagemap entries are reported as zeroes, indicating the pages have
> never been allocated while they have actually been swapped out.
>
> This patch addresses the behaviour and modifies pte_to_pagemap_entry() to
> make use of the XArray associated with the virtual memory area struct
> passed as an argument. The XArray contains the location of virtual pages in
> the page cache, swap cache or on disk. If they are in either of the caches,
> then the original implementation still works. If not, then the missing
> information will be retrieved from the XArray.
>
> Performance
> ============
> I measured the performance of the patch on a single socket Xeon E5-2620
> machine, with 128GiB of RAM and 128GiB of swap storage. These were the
> steps taken:
>
>   1. Run example pagemap-test code on a cgroup
>     a. Set up cgroup with limit_in_bytes=4GiB and swappiness=60;
>     b. allocate 16GiB (about 4 million pages);
>     c. dirty 0,50 or 100% of pages;
>     d. do this for both private and shared memory.
>   2. Run `dd if=<PAGEMAP> ibs=8 skip=$(($VADDR / $PAGESIZE)) count=4194304`
>      for each possible configuration above
>     a.  3 times for warm up;
>     b. 10 times to measure performance.
>        Use `time` or another performance measuring tool.
>
> Results (averaged over 10 iterations):
>                +--------+------------+------------+
>                | dirty% |  pre patch | post patch |
>                +--------+------------+------------+
>  private|anon  |     0% |      8.15s |      8.40s |
>                |    50% |     11.83s |     12.19s |
>                |   100% |     12.37s |     12.20s |
>                +--------+------------+------------+
>   shared|anon  |     0% |      8.17s |      8.18s |
>                |    50% | (*) 10.43s |     37.43s |
>                |   100% | (*) 10.20s |     38.59s |
>                +--------+------------+------------+
>
> (*): reminder that pre-patch produces incorrect pagemap entries for swapped
>      out pages.
>
> From run to run the above results are stable (mostly <1% stderr).
>
> The amount of time it takes for a full read of the pagemap depends on the
> granularity used by dd to read the pagemap file. Even though the access is
> sequential, the script only reads 8 bytes at a time, running pagemap_read()
> COUNT times (one time for each page in a 16GiB area).
>
> To reduce overhead, we can use batching for large amounts of sequential
> access. We can make dd read multiple page entries at a time,
> allowing the kernel to make optimisations and yield more throughput.
>
> Performance in real time (seconds) of
> `dd if=<PAGEMAP> ibs=8*$BATCH skip=$(($VADDR / $PAGESIZE / $BATCH))
> count=$((4194304 / $BATCH))`:
> +---------------------------------+ +---------------------------------+
> |     Shared, Anon, 50% dirty     | |     Shared, Anon, 100% dirty    |
> +-------+------------+------------+ +-------+------------+------------+
> | Batch |  Pre-patch | Post-patch | | Batch |  Pre-patch | Post-patch |
> +-------+------------+------------+ +-------+------------+------------+
> |     1 | (*) 10.43s |     37.43s | |     1 | (*) 10.20s |     38.59s |
> |     2 | (*)  5.25s |     18.77s | |     2 | (*)  5.15s |     19.37s |
> |     4 | (*)  2.63s |      9.42s | |     4 | (*)  2.63s |      9.74s |
> |     8 | (*)  1.38s |      4.80s | |     8 | (*)  1.35s |      4.94s |
> |    16 | (*)  0.73s |      2.46s | |    16 | (*)  0.72s |      2.54s |
> |    32 | (*)  0.40s |      1.31s | |    32 | (*)  0.41s |      1.34s |
> |    64 | (*)  0.25s |      0.72s | |    64 | (*)  0.24s |      0.74s |
> |   128 | (*)  0.16s |      0.43s | |   128 | (*)  0.16s |      0.44s |
> |   256 | (*)  0.12s |      0.28s | |   256 | (*)  0.12s |      0.29s |
> |   512 | (*)  0.10s |      0.21s | |   512 | (*)  0.10s |      0.22s |
> |  1024 | (*)  0.10s |      0.20s | |  1024 | (*)  0.10s |      0.21s |
> +-------+------------+------------+ +-------+------------+------------+
>
> To conclude, in order to make the most of the underlying mechanisms of
> pagemap and xarray, one should be using batching to achieve better
> performance.
>
> Future Work
> ============
>
> Note: there are PTE flags which currently do not survive the swap out when
> the page is shmem: SOFT_DIRTY and UFFD_WP.
>
> A solution for saving the state of the UFFD_WP flag has been proposed by
> Peter Xu in the patch linked below. The concept and mechanism proposed
> could be extended to include the SOFT_DIRTY bit as well:
> 20210715201422.211004-1-peterx@redhat.com
> Our patches are mostly orthogonal.
>
> Kind regards,
> Tibi
>
> Tiberiu A Georgescu (1):
>   pagemap: report swap location for shared pages
>
>  fs/proc/task_mmu.c | 38 ++++++++++++++++++++++++++++++--------
>  1 file changed, 30 insertions(+), 8 deletions(-)


* Re: [PATCH 0/1] pagemap: swap location for shared pages
  2021-07-30 17:28 ` [PATCH 0/1] pagemap: " Eric W. Biederman
@ 2021-08-02 12:20   ` Tiberiu Georgescu
  0 siblings, 0 replies; 13+ messages in thread
From: Tiberiu Georgescu @ 2021-08-02 12:20 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, viro, Peter Xu, David Hildenbrand,
	christian.brauner, adobriyan, songmuchun, axboe,
	vincenzo.frascino, catalin.marinas, peterz, chinwen.chang,
	Miaohe Lin, jannh, Alistair Popple, linux-kernel, linux-fsdevel,
	linux-mm, Ivan Teterevkov, Florian Schmidt, Carl Waldspurger [C],
	Jonathan Davies


> On 30 Jul 2021, at 18:28, Eric W. Biederman <ebiederm@xmission.com> wrote:
> 
> Tiberiu A Georgescu <tiberiu.georgescu@nutanix.com> writes:
> 
>> This patch follows up on a previous RFC:
>> 20210714152426.216217-1-tiberiu.georgescu@nutanix.com
>> 
>> When a page allocated using the MAP_SHARED flag is swapped out, its pagemap
>> entry is cleared. In many cases, there is no difference between swapped-out
>> shared pages and newly allocated, non-dirty pages in the pagemap
>> interface.
> 
> What is the point?

The reason why this patch is important has been discussed in my RFC
patch and on this thread:
https://lore.kernel.org/lkml/20210715201651.212134-1-peterx@redhat.com/.
The most relevant reply should be Ivan's:
https://lore.kernel.org/lkml/CY4PR0201MB3460E372956C0E1B8D33F904E9E39@CY4PR0201MB3460.namprd02.prod.outlook.com/

In short, this swap information helps us enhance live migration in some cases.
> 
> You say a shared swapped out page is the same as a clean shared page
> and you are exactly correct.  What is the point in knowing a shared
> page was swapped out?  What is the gain?
> 
What I meant was that shared swapped out pages and clean shared pages
have their ptes identical pre-patch. I understand they are somewhat similar
concepts when it comes to file shared pages, where swapping is done
directly on the disk.

Our case focuses on anonymous pages and shared pages with identical
underlying behaviour (like pages allocated using memfd). These pages are
cleared once their lifetime ends, and the difference between allocated
but uninitialised pages and dirty pages that have been swapped out is
significant, as the latter could still contain usable data.

The post-patch pagemap entry now contains the swap type and offset for
swapped out pages, regardless of whether the page is private or shared.
This, by definition of the pagemap, should be the correct behaviour.

> I tried to understand the point by looking at your numbers below
> and everything I could see looked worse post patch.

Indeed, the numbers are mostly bigger post-patch. It is a tradeoff between
correctness and performance. However, the overhead is not inconvenient for
sparse, single accesses, and it can be reduced significantly by batching.
In future work, the performance can be improved by leveraging a mechanism 
proposed by Peter Xu: Special PTEs:
https://lore.kernel.org/lkml/20210715201422.211004-1-peterx@redhat.com/

The main concern of the RFC was that the xarray check would significantly
slow down the handling of empty pages. Thankfully, we see only a small
overhead when no allocated shared page is dirtied.

> 
> Eric
> 
Hope I was able to clarify a few things. Now, having laid out the context,
please have another look at my proposed patch.

Thank you,
Tibi


* Re: [PATCH 0/1] pagemap: swap location for shared pages
  2021-07-30 16:08 [PATCH 0/1] pagemap: swap location for shared pages Tiberiu A Georgescu
  2021-07-30 16:08 ` [PATCH 1/1] pagemap: report " Tiberiu A Georgescu
  2021-07-30 17:28 ` [PATCH 0/1] pagemap: " Eric W. Biederman
@ 2021-08-04 18:33 ` Peter Xu
  2021-08-04 18:49   ` David Hildenbrand
  2 siblings, 1 reply; 13+ messages in thread
From: Peter Xu @ 2021-08-04 18:33 UTC (permalink / raw)
  To: Tiberiu A Georgescu
  Cc: akpm, viro, david, christian.brauner, ebiederm, adobriyan,
	songmuchun, axboe, vincenzo.frascino, catalin.marinas, peterz,
	chinwen.chang, linmiaohe, jannh, apopple, linux-kernel,
	linux-fsdevel, linux-mm, ivan.teterevkov, florian.schmidt,
	carl.waldspurger, jonathan.davies

Hi, Tiberiu,

On Fri, Jul 30, 2021 at 04:08:25PM +0000, Tiberiu A Georgescu wrote:
> This patch follows up on a previous RFC:
> 20210714152426.216217-1-tiberiu.georgescu@nutanix.com
> 
> When a page allocated using the MAP_SHARED flag is swapped out, its pagemap
> entry is cleared. In many cases, there is no difference between swapped-out
> shared pages and newly allocated, non-dirty pages in the pagemap interface.
> 
> Example pagemap-test code (Tested on Kernel Version 5.14-rc3):
>     #define NPAGES (256)
>     /* map 1MiB shared memory */
>     size_t pagesize = getpagesize();
>     char *p = mmap(NULL, pagesize * NPAGES, PROT_READ | PROT_WRITE,
>     		   MAP_ANONYMOUS | MAP_SHARED, -1, 0);
>     /* Dirty new pages. */
>     for (int i = 0; i < NPAGES; i++)
>     	p[i * pagesize] = i;
> 
> Run the above program in a small cgroup, which causes swapping:
>     /* Initialise cgroup & run a program */
>     $ echo 512K > foo/memory.limit_in_bytes
>     $ echo 60 > foo/memory.swappiness
>     $ cgexec -g memory:foo ./pagemap-test
> 
> Check the pagemap report. Example of the current expected output:
>     $ dd if=/proc/$PID/pagemap ibs=8 skip=$(($VADDR / $PAGESIZE)) count=$COUNT | hexdump -C
>     00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>     *
>     00000710  e1 6b 06 00 00 00 00 a1  9e eb 06 00 00 00 00 a1  |.k..............|
>     00000720  6b ee 06 00 00 00 00 a1  a5 a4 05 00 00 00 00 a1  |k...............|
>     00000730  5c bf 06 00 00 00 00 a1  90 b6 06 00 00 00 00 a1  |\...............|
> 
> The first pagemap entries are reported as zeroes, indicating the pages have
> never been allocated while they have actually been swapped out.
> 
> This patch addresses the behaviour and modifies pte_to_pagemap_entry() to
> make use of the XArray associated with the virtual memory area struct
> passed as an argument. The XArray contains the location of virtual pages in
> the page cache, swap cache or on disk. If they are in either of the caches,
> then the original implementation still works. If not, then the missing
> information will be retrieved from the XArray.
> 
> Performance
> ============
> I measured the performance of the patch on a single socket Xeon E5-2620
> machine, with 128GiB of RAM and 128GiB of swap storage. These were the
> steps taken:
> 
>   1. Run example pagemap-test code on a cgroup
>     a. Set up cgroup with limit_in_bytes=4GiB and swappiness=60;
>     b. allocate 16GiB (about 4 million pages);
>     c. dirty 0,50 or 100% of pages;
>     d. do this for both private and shared memory.
>   2. Run `dd if=<PAGEMAP> ibs=8 skip=$(($VADDR / $PAGESIZE)) count=4194304`
>      for each possible configuration above
>     a.  3 times for warm up;
>     b. 10 times to measure performance.
>        Use `time` or another performance measuring tool.
> 
> Results (averaged over 10 iterations):
>                +--------+------------+------------+
>                | dirty% |  pre patch | post patch |
>                +--------+------------+------------+
>  private|anon  |     0% |      8.15s |      8.40s |
>                |    50% |     11.83s |     12.19s |
>                |   100% |     12.37s |     12.20s |
>                +--------+------------+------------+
>   shared|anon  |     0% |      8.17s |      8.18s |
>                |    50% | (*) 10.43s |     37.43s |
>                |   100% | (*) 10.20s |     38.59s |
>                +--------+------------+------------+
> 
> (*): reminder that pre-patch produces incorrect pagemap entries for swapped
>      out pages.
> 
> From run to run the above results are stable (mostly <1% stderr).
> 
> The amount of time it takes for a full read of the pagemap depends on the
> granularity used by dd to read the pagemap file. Even though the access is
> sequential, the script only reads 8 bytes at a time, running pagemap_read()
> COUNT times (one time for each page in a 16GiB area).
> 
> To reduce overhead, we can use batching for large amounts of sequential
> access. We can make dd read multiple page entries at a time,
> allowing the kernel to make optimisations and yield more throughput.
> 
> Performance in real time (seconds) of
> `dd if=<PAGEMAP> ibs=8*$BATCH skip=$(($VADDR / $PAGESIZE / $BATCH))
> count=$((4194304 / $BATCH))`:
> +---------------------------------+ +---------------------------------+
> |     Shared, Anon, 50% dirty     | |     Shared, Anon, 100% dirty    |
> +-------+------------+------------+ +-------+------------+------------+
> | Batch |  Pre-patch | Post-patch | | Batch |  Pre-patch | Post-patch |
> +-------+------------+------------+ +-------+------------+------------+
> |     1 | (*) 10.43s |     37.43s | |     1 | (*) 10.20s |     38.59s |
> |     2 | (*)  5.25s |     18.77s | |     2 | (*)  5.15s |     19.37s |
> |     4 | (*)  2.63s |      9.42s | |     4 | (*)  2.63s |      9.74s |
> |     8 | (*)  1.38s |      4.80s | |     8 | (*)  1.35s |      4.94s |
> |    16 | (*)  0.73s |      2.46s | |    16 | (*)  0.72s |      2.54s |
> |    32 | (*)  0.40s |      1.31s | |    32 | (*)  0.41s |      1.34s |
> |    64 | (*)  0.25s |      0.72s | |    64 | (*)  0.24s |      0.74s |
> |   128 | (*)  0.16s |      0.43s | |   128 | (*)  0.16s |      0.44s |
> |   256 | (*)  0.12s |      0.28s | |   256 | (*)  0.12s |      0.29s |
> |   512 | (*)  0.10s |      0.21s | |   512 | (*)  0.10s |      0.22s |
> |  1024 | (*)  0.10s |      0.20s | |  1024 | (*)  0.10s |      0.21s |
> +-------+------------+------------+ +-------+------------+------------+
> 
> To conclude, in order to make the most of the underlying mechanisms of
> pagemap and xarray, one should be using batching to achieve better
> performance.

So what I'm still a bit worried is whether it will regress some existing users.
Note that existing users can try to read pagemap in their own way; we can't
expect all the userspaces to change their behavior due to a kernel change.

Meanwhile, from the numbers, it seems to show a 4x slowdown due to looking up
the page cache no matter the size of ibs=.  IOW I don't see a good way to avoid
that overhead, so no way to have the userspace run as fast as before.

Also note that it's not only affecting the PM_SWAP users; it potentially
affects all the /proc/pagemap users as long as there is file-backed memory in
the read region of the pagemap, which is very common.

That's why I think if we want to persist it, we should still consider starting
from the pte marker idea.

I do plan to move the pte marker idea forward unless that'll be NACKed upstream
for some other reason, because that seems to be the only way for uffd-wp to
support file based memories; no matter with a new swp type or with special swap
pte.  I am even thinking about whether I should propose that with PM_SWAP first
because that seems to be a simpler scenario than uffd-wp (which will get the
rest uffd-wp patches involved then), then we can have a shared infrastructure.
But haven't thought deeper than that.

Thanks,

-- 
Peter Xu



* Re: [PATCH 0/1] pagemap: swap location for shared pages
  2021-08-04 18:33 ` Peter Xu
@ 2021-08-04 18:49   ` David Hildenbrand
  2021-08-04 19:17     ` Peter Xu
  0 siblings, 1 reply; 13+ messages in thread
From: David Hildenbrand @ 2021-08-04 18:49 UTC (permalink / raw)
  To: Peter Xu, Tiberiu A Georgescu
  Cc: akpm, viro, christian.brauner, ebiederm, adobriyan, songmuchun,
	axboe, vincenzo.frascino, catalin.marinas, peterz, chinwen.chang,
	linmiaohe, jannh, apopple, linux-kernel, linux-fsdevel, linux-mm,
	ivan.teterevkov, florian.schmidt, carl.waldspurger,
	jonathan.davies

On 04.08.21 20:33, Peter Xu wrote:
> Hi, Tiberiu,
> 
> On Fri, Jul 30, 2021 at 04:08:25PM +0000, Tiberiu A Georgescu wrote:
>> This patch follows up on a previous RFC:
>> 20210714152426.216217-1-tiberiu.georgescu@nutanix.com
>>
>> When a page allocated using the MAP_SHARED flag is swapped out, its pagemap
>> entry is cleared. In many cases, there is no difference between swapped-out
>> shared pages and newly allocated, non-dirty pages in the pagemap interface.
>>
>> Example pagemap-test code (Tested on Kernel Version 5.14-rc3):
>>      #define NPAGES (256)
>>      /* map 1MiB shared memory */
>>      size_t pagesize = getpagesize();
>>      char *p = mmap(NULL, pagesize * NPAGES, PROT_READ | PROT_WRITE,
>>      		   MAP_ANONYMOUS | MAP_SHARED, -1, 0);
>>      /* Dirty new pages. */
>>      for (int i = 0; i < NPAGES; i++)
>>      	p[i * pagesize] = i;
>>
>> Run the above program in a small cgroup, which causes swapping:
>>      /* Initialise cgroup & run a program */
>>      $ echo 512K > foo/memory.limit_in_bytes
>>      $ echo 60 > foo/memory.swappiness
>>      $ cgexec -g memory:foo ./pagemap-test
>>
>> Check the pagemap report. Example of the current expected output:
>>      $ dd if=/proc/$PID/pagemap ibs=8 skip=$(($VADDR / $PAGESIZE)) count=$COUNT | hexdump -C
>>      00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>>      *
>>      00000710  e1 6b 06 00 00 00 00 a1  9e eb 06 00 00 00 00 a1  |.k..............|
>>      00000720  6b ee 06 00 00 00 00 a1  a5 a4 05 00 00 00 00 a1  |k...............|
>>      00000730  5c bf 06 00 00 00 00 a1  90 b6 06 00 00 00 00 a1  |\...............|
>>
>> The first pagemap entries are reported as zeroes, indicating the pages have
>> never been allocated while they have actually been swapped out.
>>
>> This patch addresses the behaviour and modifies pte_to_pagemap_entry() to
>> make use of the XArray associated with the virtual memory area struct
>> passed as an argument. The XArray contains the location of virtual pages in
>> the page cache, swap cache or on disk. If they are in either of the caches,
>> then the original implementation still works. If not, then the missing
>> information will be retrieved from the XArray.
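Roughly, the lookup described above corresponds to the following kernel-space sketch (pseudocode following the idioms of fs/proc/task_mmu.c; it assumes shmem stores swapped-out pages as XArray value entries, and pgoff/frame/flags are illustrative locals, not the literal patch):

```
/* sketch: when the PTE is empty, consult the shmem mapping's XArray */
struct address_space *mapping = file_inode(vma->vm_file)->i_mapping;
void *entry = xa_load(&mapping->i_pages, pgoff);

if (xa_is_value(entry)) {
	/* not in the page cache: the value entry encodes the swap location */
	swp_entry_t swp = radix_to_swp_entry(entry);

	frame = swp_type(swp) | (swp_offset(swp) << MAX_SWAPFILES_SHIFT);
	flags |= PM_SWAP;
}
```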
>>
>> Performance
>> ============
>> I measured the performance of the patch on a single socket Xeon E5-2620
>> machine, with 128GiB of RAM and 128GiB of swap storage. These were the
>> steps taken:
>>
>>    1. Run example pagemap-test code on a cgroup
>>      a. Set up cgroup with limit_in_bytes=4GiB and swappiness=60;
>>      b. allocate 16GiB (about 4 million pages);
>>      c. dirty 0, 50 or 100% of the pages;
>>      d. do this for both private and shared memory.
>>    2. Run `dd if=<PAGEMAP> ibs=8 skip=$(($VADDR / $PAGESIZE)) count=4194304`
>>       for each possible configuration above
>>      a.  3 times for warm up;
>>      b. 10 times to measure performance.
>>         Use `time` or another performance measuring tool.
>>
>> Results (averaged over 10 iterations):
>>                 +--------+------------+------------+
>>                 | dirty% |  pre patch | post patch |
>>                 +--------+------------+------------+
>>   private|anon  |     0% |      8.15s |      8.40s |
>>                 |    50% |     11.83s |     12.19s |
>>                 |   100% |     12.37s |     12.20s |
>>                 +--------+------------+------------+
>>    shared|anon  |     0% |      8.17s |      8.18s |
>>                 |    50% | (*) 10.43s |     37.43s |
>>                 |   100% | (*) 10.20s |     38.59s |
>>                 +--------+------------+------------+
>>
>> (*): reminder that pre-patch produces incorrect pagemap entries for swapped
>>       out pages.
>>
>> From run to run, the above results are stable (mostly <1% standard error).
>>
>> The amount of time it takes for a full read of the pagemap depends on the
>> granularity used by dd to read the pagemap file. Even though the access is
>> sequential, the script only reads 8 bytes at a time, running pagemap_read()
>> COUNT times (one time for each page in a 16GiB area).
>>
>> To reduce overhead, we can use batching for large amounts of sequential
>> access. We can make dd read multiple page entries at a time,
>> allowing the kernel to make optimisations and yield more throughput.
>>
>> Performance in real time (seconds) of
>> `dd if=<PAGEMAP> ibs=8*$BATCH skip=$(($VADDR / $PAGESIZE / $BATCH))
>> count=$((4194304 / $BATCH))`:
>> +---------------------------------+ +---------------------------------+
>> |     Shared, Anon, 50% dirty     | |     Shared, Anon, 100% dirty    |
>> +-------+------------+------------+ +-------+------------+------------+
>> | Batch |  Pre-patch | Post-patch | | Batch |  Pre-patch | Post-patch |
>> +-------+------------+------------+ +-------+------------+------------+
>> |     1 | (*) 10.43s |     37.43s | |     1 | (*) 10.20s |     38.59s |
>> |     2 | (*)  5.25s |     18.77s | |     2 | (*)  5.15s |     19.37s |
>> |     4 | (*)  2.63s |      9.42s | |     4 | (*)  2.63s |      9.74s |
>> |     8 | (*)  1.38s |      4.80s | |     8 | (*)  1.35s |      4.94s |
>> |    16 | (*)  0.73s |      2.46s | |    16 | (*)  0.72s |      2.54s |
>> |    32 | (*)  0.40s |      1.31s | |    32 | (*)  0.41s |      1.34s |
>> |    64 | (*)  0.25s |      0.72s | |    64 | (*)  0.24s |      0.74s |
>> |   128 | (*)  0.16s |      0.43s | |   128 | (*)  0.16s |      0.44s |
>> |   256 | (*)  0.12s |      0.28s | |   256 | (*)  0.12s |      0.29s |
>> |   512 | (*)  0.10s |      0.21s | |   512 | (*)  0.10s |      0.22s |
>> |  1024 | (*)  0.10s |      0.20s | |  1024 | (*)  0.10s |      0.21s |
>> +-------+------------+------------+ +-------+------------+------------+
>>
>> To conclude, in order to make the most of the underlying mechanisms of
>> pagemap and the XArray, one should use batching to achieve better
>> performance.
> 
> So what I'm still a bit worried about is whether it will regress some existing users.
> Note that existing users can try to read pagemap in their own way; we can't
> expect all the userspaces to change their behavior due to a kernel change.

Then let's provide a way to enable the new behavior for a process if we 
don't find another way to extract that information. I would actually 
prefer finding a different interface for that, because with such things 
the "pagemap" no longer expresses which pages are currently mapped. 
Shared memory is weird.

> 
> Meanwhile, the numbers seem to show a 4x slowdown due to looking up
> the page cache no matter the size of ibs=.  IOW I don't see a good way to avoid
> that overhead, so no way to have the userspace run as fast as before.
> 
> Also note that it's not only affecting the PM_SWAP users; it potentially
> affects all /proc/<pid>/pagemap users as long as there is file-backed memory in
> the region of pagemap being read, which is very likely to happen.
> 
> That's why I think if we want to persist it, we should still consider starting
> from the pte marker idea.

TBH, I tend to really dislike the PTE marker idea. IMHO, we shouldn't 
store any state information regarding shared memory in per-process page 
tables: it just doesn't make too much sense.

And this is similar to the SOFTDIRTY or UFFD_WP bits: this information 
actually belongs to the shared file ("did *someone* write to this page", 
"is *someone* interested in changes to that page", "is there 
something"). I know, that screams for a completely different design with 
respect to these features.

I guess we start learning the hard way that shared memory is just 
different and requires different interfaces than per-process page table 
interfaces we have (pagemap, userfaultfd).

I didn't have time to explore any alternatives yet, but I wonder if 
tracking such stuff per an actual fd/memfd and not via process page 
tables is actually the right and clean approach. There are certainly 
many issues to solve, but conceptually to me it feels more natural to 
have these shared memory features not mangled into process page tables.

> 
> I do plan to move the pte marker idea forward unless that'll be NACKed upstream
> for some other reason, because that seems to be the only way for uffd-wp to
> support file-based memory, whether with a new swp type or with a special swap
> pte.  I am even thinking about whether I should propose that with PM_SWAP
> first, because that seems to be a simpler scenario than uffd-wp (which would
> get the rest of the uffd-wp patches involved), so that we can have a shared
> infrastructure.  But I haven't thought deeper than that.
> 
> Thanks,
> 


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/1] pagemap: swap location for shared pages
  2021-08-04 18:49   ` David Hildenbrand
@ 2021-08-04 19:17     ` Peter Xu
  2021-08-11 16:15       ` David Hildenbrand
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Xu @ 2021-08-04 19:17 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Tiberiu A Georgescu, akpm, viro, christian.brauner, ebiederm,
	adobriyan, songmuchun, axboe, vincenzo.frascino, catalin.marinas,
	peterz, chinwen.chang, linmiaohe, jannh, apopple, linux-kernel,
	linux-fsdevel, linux-mm, ivan.teterevkov, florian.schmidt,
	carl.waldspurger, jonathan.davies

On Wed, Aug 04, 2021 at 08:49:14PM +0200, David Hildenbrand wrote:
> TBH, I tend to really dislike the PTE marker idea. IMHO, we shouldn't store
> any state information regarding shared memory in per-process page tables: it
> just doesn't make too much sense.
> 
> And this is similar to SOFTDIRTY or UFFD_WP bits: this information actually
> belongs to the shared file ("did *someone* write to this page", "is
> *someone* interested into changes to that page", "is there something"). I
> know, that screams for a completely different design in respect to these
> features.
> 
> I guess we start learning the hard way that shared memory is just different
> and requires different interfaces than per-process page table interfaces we
> have (pagemap, userfaultfd).
> 
> I didn't have time to explore any alternatives yet, but I wonder if tracking
> such stuff per an actual fd/memfd and not via process page tables is
> actually the right and clean approach. There are certainly many issues to
> solve, but conceptually to me it feels more natural to have these shared
> memory features not mangled into process page tables.

Yes, we can explore all the possibilities, I'm totally fine with it.

I just want to say that I still don't think that, just because there's a page
cache, we must put all the page-relevant things into the page cache.

They're shared by processes, but each process can still have its own way to
describe its relationship to that page in the cache; to me it's as simple as
"we allow process A to write to page cache page P" while "we don't allow
process B to write to the same page", like the write bit.

Swap information is indeed tricky because it's shared by all the processes, but
so far I still see it as a doable approach as long as it can solve the problem.

I am not against the approach mentioned in this patch either - but I still
share my concerns, as that's not normally what we do with existing interfaces.

Thanks,

-- 
Peter Xu



* Re: [PATCH 0/1] pagemap: swap location for shared pages
  2021-08-04 19:17     ` Peter Xu
@ 2021-08-11 16:15       ` David Hildenbrand
  2021-08-11 16:17         ` David Hildenbrand
  2021-08-11 18:25         ` Peter Xu
  0 siblings, 2 replies; 13+ messages in thread
From: David Hildenbrand @ 2021-08-11 16:15 UTC (permalink / raw)
  To: Peter Xu
  Cc: Tiberiu A Georgescu, akpm, viro, christian.brauner, ebiederm,
	adobriyan, songmuchun, axboe, vincenzo.frascino, catalin.marinas,
	peterz, chinwen.chang, linmiaohe, jannh, apopple, linux-kernel,
	linux-fsdevel, linux-mm, ivan.teterevkov, florian.schmidt,
	carl.waldspurger, jonathan.davies

On 04.08.21 21:17, Peter Xu wrote:
> On Wed, Aug 04, 2021 at 08:49:14PM +0200, David Hildenbrand wrote:
>> TBH, I tend to really dislike the PTE marker idea. IMHO, we shouldn't store
>> any state information regarding shared memory in per-process page tables: it
>> just doesn't make too much sense.
>>
>> And this is similar to SOFTDIRTY or UFFD_WP bits: this information actually
>> belongs to the shared file ("did *someone* write to this page", "is
>> *someone* interested into changes to that page", "is there something"). I
>> know, that screams for a completely different design in respect to these
>> features.
>>
>> I guess we start learning the hard way that shared memory is just different
>> and requires different interfaces than per-process page table interfaces we
>> have (pagemap, userfaultfd).
>>
>> I didn't have time to explore any alternatives yet, but I wonder if tracking
>> such stuff per an actual fd/memfd and not via process page tables is
>> actually the right and clean approach. There are certainly many issues to
>> solve, but conceptually to me it feels more natural to have these shared
>> memory features not mangled into process page tables.
> 
> Yes, we can explore all the possibilities, I'm totally fine with it.
> 
> I just want to say I still don't think when there's page cache then we must put
> all the page-relevant things into the page cache.

[sorry for the late reply]

Right, but for the case of shared, swapped out pages, the information is 
already there, in the page cache :)

> 
> They're shared by processes, but process can still have its own way to describe
> the relationship to that page in the cache, to me it's as simple as "we allow
> process A to write to page cache P", while "we don't allow process B to write
> to the same page" like the write bit.

The issue I'm having with uffd-wp as it was proposed for shared memory is 
that there is hardly a sane use case where we would *want* it to work 
that way.

A UFFD-WP flag in a page table for shared memory means "please notify 
once this process modifies the shared memory (via page tables, not via 
any other fd modification)". Do we have an example application where 
these semantics make sense and don't over-complicate the whole 
approach? I don't know any, thus I'm asking dumb questions :)


For background snapshots in QEMU the flow would currently be like this, 
assuming all processes have the shared guest memory mapped.

1. Background snapshot preparation: QEMU requests all processes
    to uffd-wp the range
a) All processes register a uffd handler on guest RAM
b) All processes fault in all guest memory (essentially populating all
    memory): with a uffd-WP extension we might be able to get rid of
    that, I remember you were working on that.
c) All processes uffd-WP the range to set the bit in their page table

2. Background snapshot runs:
a) A process either receives a UFFD-WP event and forwards it to QEMU or
    QEMU polls all other processes for UFFD events.
b) QEMU writes the to-be-changed page to the migration stream.
c) QEMU triggers all processes to un-protect the page and wake up any
    waiters. All processes clear the uffd-WP bit in their page tables.

3. Background snapshot completes:
a) All processes unregister the uffd handler


Now imagine something like this:

1. Background snapshot preparation:
a) QEMU registers a UFFD-WP handler on a *memfd file* that corresponds
    to guest memory.
b) QEMU uffd-wp's the whole file

2. Background snapshot runs:
a) QEMU receives a UFFD-WP event.
b) QEMU writes the to-be-changed page to the migration stream.
c) QEMU un-protects the page and wakes up any waiters.

3. Background snapshot completes:
a) QEMU unregisters the uffd handler


Wouldn't that be much nicer and much easier to handle? Yes, it is much 
harder to implement because such an infrastructure does not exist yet, 
and it most probably wouldn't be called uffd anymore, because we are 
dealing with file access. But this way, it would actually be super easy 
to use the feature across multiple processes and eventually to even 
catch other file modifications.

Again, I am not sure if uffd-wp or softdirty make too much sense in 
general when applied to shmem. But I'm happy to learn more.

-- 
Thanks,

David / dhildenb



* Re: [PATCH 0/1] pagemap: swap location for shared pages
  2021-08-11 16:15       ` David Hildenbrand
@ 2021-08-11 16:17         ` David Hildenbrand
  2021-08-11 18:25         ` Peter Xu
  1 sibling, 0 replies; 13+ messages in thread
From: David Hildenbrand @ 2021-08-11 16:17 UTC (permalink / raw)
  To: Peter Xu
  Cc: Tiberiu A Georgescu, akpm, viro, christian.brauner, ebiederm,
	adobriyan, songmuchun, axboe, vincenzo.frascino, catalin.marinas,
	peterz, chinwen.chang, linmiaohe, jannh, apopple, linux-kernel,
	linux-fsdevel, linux-mm, ivan.teterevkov, florian.schmidt,
	carl.waldspurger, jonathan.davies

On 11.08.21 18:15, David Hildenbrand wrote:
> On 04.08.21 21:17, Peter Xu wrote:
>> On Wed, Aug 04, 2021 at 08:49:14PM +0200, David Hildenbrand wrote:
>>> TBH, I tend to really dislike the PTE marker idea. IMHO, we shouldn't store
>>> any state information regarding shared memory in per-process page tables: it
>>> just doesn't make too much sense.
>>>
>>> And this is similar to SOFTDIRTY or UFFD_WP bits: this information actually
>>> belongs to the shared file ("did *someone* write to this page", "is
>>> *someone* interested into changes to that page", "is there something"). I
>>> know, that screams for a completely different design in respect to these
>>> features.
>>>
>>> I guess we start learning the hard way that shared memory is just different
>>> and requires different interfaces than per-process page table interfaces we
>>> have (pagemap, userfaultfd).
>>>
>>> I didn't have time to explore any alternatives yet, but I wonder if tracking
>>> such stuff per an actual fd/memfd and not via process page tables is
>>> actually the right and clean approach. There are certainly many issues to
>>> solve, but conceptually to me it feels more natural to have these shared
>>> memory features not mangled into process page tables.
>>
>> Yes, we can explore all the possibilities, I'm totally fine with it.
>>
>> I just want to say I still don't think when there's page cache then we must put
>> all the page-relevant things into the page cache.
> 
> [sorry for the late reply]
> 
> Right, but for the case of shared, swapped out pages, the information is
> already there, in the page cache :)
> 
>>
>> They're shared by processes, but process can still have its own way to describe
>> the relationship to that page in the cache, to me it's as simple as "we allow
>> process A to write to page cache P", while "we don't allow process B to write
>> to the same page" like the write bit.
> 
> The issue I'm having uffd-wp as it was proposed for shared memory is
> that there is hardly a sane use case where we would *want* it to work
> that way.
> 
> A UFFD-WP flag in a page table for shared memory means "please notify
> once this process modifies the shared memory (via page tables, not via
> any other fd modification)". Do we have an example application where
> these semantics makes sense and don't over-complicate the whole
> approach? I don't know any, thus I'm asking dumb questions :)
> 
> 
> For background snapshots in QEMU the flow would currently be like this,
> assuming all processes have the shared guest memory mapped.
> 
> 1. Background snapshot preparation: QEMU requests all processes
>      to uffd-wp the range
> a) All processes register a uffd handler on guest RAM
> b) All processes fault in all guest memory (essentially populating all
>      memory): with a uffd-WP extensions we might be able to get rid of
>      that, I remember you were working on that.
> c) All processes uffd-WP the range to set the bit in their page table
> 
> 2. Background snapshot runs:
> a) A process either receives a UFFD-WP event and forwards it to QEMU or
>      QEMU polls all other processes for UFFD events.
> b) QEMU writes the to-be-changed page to the migration stream.
> c) QEMU triggers all processes to un-protect the page and wake up any
>      waiters. All processes clear the uffd-WP bit in their page tables.

Oh, and I forgot: whenever we save any page to the migration stream, we 
have to trigger all processes to un-protect it.


-- 
Thanks,

David / dhildenb



* Re: [PATCH 0/1] pagemap: swap location for shared pages
  2021-08-11 16:15       ` David Hildenbrand
  2021-08-11 16:17         ` David Hildenbrand
@ 2021-08-11 18:25         ` Peter Xu
  2021-08-11 18:41           ` David Hildenbrand
  1 sibling, 1 reply; 13+ messages in thread
From: Peter Xu @ 2021-08-11 18:25 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Tiberiu A Georgescu, akpm, viro, christian.brauner, ebiederm,
	adobriyan, songmuchun, axboe, vincenzo.frascino, catalin.marinas,
	peterz, chinwen.chang, linmiaohe, jannh, apopple, linux-kernel,
	linux-fsdevel, linux-mm, ivan.teterevkov, florian.schmidt,
	carl.waldspurger, jonathan.davies

On Wed, Aug 11, 2021 at 06:15:37PM +0200, David Hildenbrand wrote:
> On 04.08.21 21:17, Peter Xu wrote:
> > On Wed, Aug 04, 2021 at 08:49:14PM +0200, David Hildenbrand wrote:
> > > TBH, I tend to really dislike the PTE marker idea. IMHO, we shouldn't store
> > > any state information regarding shared memory in per-process page tables: it
> > > just doesn't make too much sense.
> > > 
> > > And this is similar to SOFTDIRTY or UFFD_WP bits: this information actually
> > > belongs to the shared file ("did *someone* write to this page", "is
> > > *someone* interested into changes to that page", "is there something"). I
> > > know, that screams for a completely different design in respect to these
> > > features.
> > > 
> > > I guess we start learning the hard way that shared memory is just different
> > > and requires different interfaces than per-process page table interfaces we
> > > have (pagemap, userfaultfd).
> > > 
> > > I didn't have time to explore any alternatives yet, but I wonder if tracking
> > > such stuff per an actual fd/memfd and not via process page tables is
> > > actually the right and clean approach. There are certainly many issues to
> > > solve, but conceptually to me it feels more natural to have these shared
> > > memory features not mangled into process page tables.
> > 
> > Yes, we can explore all the possibilities, I'm totally fine with it.
> > 
> > I just want to say I still don't think when there's page cache then we must put
> > all the page-relevant things into the page cache.
> 
> [sorry for the late reply]
> 
> Right, but for the case of shared, swapped out pages, the information is
> already there, in the page cache :)
> 
> > 
> > They're shared by processes, but process can still have its own way to describe
> > the relationship to that page in the cache, to me it's as simple as "we allow
> > process A to write to page cache P", while "we don't allow process B to write
> > to the same page" like the write bit.
> 
> The issue I'm having uffd-wp as it was proposed for shared memory is that
> there is hardly a sane use case where we would *want* it to work that way.
> 
> A UFFD-WP flag in a page table for shared memory means "please notify once
> this process modifies the shared memory (via page tables, not via any other
> fd modification)". Do we have an example application where these semantics
> makes sense and don't over-complicate the whole approach? I don't know any,
> thus I'm asking dumb questions :)
> 
> 
> For background snapshots in QEMU the flow would currently be like this,
> assuming all processes have the shared guest memory mapped.
> 
> 1. Background snapshot preparation: QEMU requests all processes
>    to uffd-wp the range
> a) All processes register a uffd handler on guest RAM

To be explicit: not a handler; just register with uffd-wp and pass over the fd
to the main process.

> b) All processes fault in all guest memory (essentially populating all
>    memory): with a uffd-WP extensions we might be able to get rid of
>    that, I remember you were working on that.
> c) All processes uffd-WP the range to set the bit in their page table
> 
> 2. Background snapshot runs:
> a) A process either receives a UFFD-WP event and forwards it to QEMU or
>    QEMU polls all other processes for UFFD events.
> b) QEMU writes the to-be-changed page to the migration stream.
> c) QEMU triggers all processes to un-protect the page and wake up any
>    waiters. All processes clear the uffd-WP bit in their page tables.
> 
> 3. Background snapshot completes:
> a) All processes unregister the uffd handler
> 
> 
> Now imagine something like this:
> 
> 1. Background snapshot preparation:
> a) QEMU registers a UFFD-WP handler on a *memfd file* that corresponds
>    to guest memory.
> b) QEMU uffd-wp's the whole file
> 
> 2. Background snapshot runs:
> a) QEMU receives a UFFD-WP event.
> b) QEMU writes the to-be-changed page to the migration stream.
> c) QEMU un-protect the page and wake up any waiters.
> 
> 3. Background snapshot completes:
> a) QEMU unregister the uffd handler
> 
> 
> Wouldn't that be much nicer and much easier to handle? Yes, it is much
> harder to implement because such an infrastructure does not exist yet, and
> it most probably wouldn't be called uffd anymore, because we are dealing
> with file access. But this way, it would actually be super easy to use the
> feature across multiple processes and eventually to even catch other file
> modifications.

I can totally understand how you see this.  We've discussed that before,
haven't we? About the ideal world. :)

It would be great if this could work out, I hope so.  So far I'm not that
ambitious, and as I said, I don't know whether there will be other concerns
when it goes into the page cache layer, or when it's a behavior of multiple
processes where one of them can rule the others without them being notified of it.

Even if we want to go that way, I think we should first come up with some way
to describe the domains that one uffd-wp registered file should behave upon.
It shouldn't be "any process touching this file".

One quick example in my mind: if a malicious process wants to stop another
daemon process, it becomes easier as long as the malicious process can delete a
file that the daemon reads/writes and replace it with a uffd-wp-registered
shmem file (or maybe just a regular file on a file system, if your proposal
will naturally work on them).  The problem is, is it really "legal" to be able
to stop the daemon from running like that?

I also don't know the initial concept behind uffd's design and why it was
designed at the pte level.  Avoiding vma manipulation should have been a major
factor, but I can't say I understand all of the reasons.  Not sure whether
Andrea has any input here.

That's why I think the current uffd can still make sense with per-process
concepts, and we can keep it that way.  When registering uffd-wp, yes, we need
to do that for multiple processes, but it also means each process is fully
aware that this is happening, so it's kind of verified that this is wanted
behavior for that process.  It'll happen with fewer "surprises", and smells safer.

I don't think that won't work out.  It may require all the processes to
support the uffd-wp APIs and cooperate, but so far that's how it should work
for me in a safe and self-contained way.  Say, every process should be aware
of what's going to happen on blocked page faults.

> 
> Again, I am not sure if uffd-wp or softdirty make too much sense in general
> when applied to shmem. But I'm happy to learn more.

Me too, and I'm more than glad to find out whether the page cache idea would
be welcomed or whether I'm just wrong about it.  Until I understand more about
this, I still think the per-process, fd-based solution of uffd makes sense.

Thanks,

-- 
Peter Xu



* Re: [PATCH 0/1] pagemap: swap location for shared pages
  2021-08-11 18:25         ` Peter Xu
@ 2021-08-11 18:41           ` David Hildenbrand
  2021-08-11 19:54             ` Peter Xu
  0 siblings, 1 reply; 13+ messages in thread
From: David Hildenbrand @ 2021-08-11 18:41 UTC (permalink / raw)
  To: Peter Xu
  Cc: Tiberiu A Georgescu, akpm, viro, christian.brauner, ebiederm,
	adobriyan, songmuchun, axboe, vincenzo.frascino, catalin.marinas,
	peterz, chinwen.chang, linmiaohe, jannh, apopple, linux-kernel,
	linux-fsdevel, linux-mm, ivan.teterevkov, florian.schmidt,
	carl.waldspurger, jonathan.davies

On 11.08.21 20:25, Peter Xu wrote:
> On Wed, Aug 11, 2021 at 06:15:37PM +0200, David Hildenbrand wrote:
>> On 04.08.21 21:17, Peter Xu wrote:
>>> On Wed, Aug 04, 2021 at 08:49:14PM +0200, David Hildenbrand wrote:
>>>> TBH, I tend to really dislike the PTE marker idea. IMHO, we shouldn't store
>>>> any state information regarding shared memory in per-process page tables: it
>>>> just doesn't make too much sense.
>>>>
>>>> And this is similar to SOFTDIRTY or UFFD_WP bits: this information actually
>>>> belongs to the shared file ("did *someone* write to this page", "is
>>>> *someone* interested into changes to that page", "is there something"). I
>>>> know, that screams for a completely different design in respect to these
>>>> features.
>>>>
>>>> I guess we start learning the hard way that shared memory is just different
>>>> and requires different interfaces than per-process page table interfaces we
>>>> have (pagemap, userfaultfd).
>>>>
>>>> I didn't have time to explore any alternatives yet, but I wonder if tracking
>>>> such stuff per an actual fd/memfd and not via process page tables is
>>>> actually the right and clean approach. There are certainly many issues to
>>>> solve, but conceptually to me it feels more natural to have these shared
>>>> memory features not mangled into process page tables.
>>>
>>> Yes, we can explore all the possibilities, I'm totally fine with it.
>>>
>>> I just want to say I still don't think when there's page cache then we must put
>>> all the page-relevant things into the page cache.
>>
>> [sorry for the late reply]
>>
>> Right, but for the case of shared, swapped out pages, the information is
>> already there, in the page cache :)
>>
>>>
>>> They're shared by processes, but process can still have its own way to describe
>>> the relationship to that page in the cache, to me it's as simple as "we allow
>>> process A to write to page cache P", while "we don't allow process B to write
>>> to the same page" like the write bit.
>>
>> The issue I'm having uffd-wp as it was proposed for shared memory is that
>> there is hardly a sane use case where we would *want* it to work that way.
>>
>> A UFFD-WP flag in a page table for shared memory means "please notify once
>> this process modifies the shared memory (via page tables, not via any other
>> fd modification)". Do we have an example application where these semantics
>> makes sense and don't over-complicate the whole approach? I don't know any,
>> thus I'm asking dumb questions :)
>>
>>
>> For background snapshots in QEMU the flow would currently be like this,
>> assuming all processes have the shared guest memory mapped.
>>
>> 1. Background snapshot preparation: QEMU requests all processes
>>     to uffd-wp the range
>> a) All processes register a uffd handler on guest RAM
> 
> To be explicit: not a handler; just register with uffd-wp and pass over the fd
> to the main process.

Good point.

> 
>> b) All processes fault in all guest memory (essentially populating all
>>     memory): with a uffd-WP extensions we might be able to get rid of
>>     that, I remember you were working on that.
>> c) All processes uffd-WP the range to set the bit in their page table
>>
>> 2. Background snapshot runs:
>> a) A process either receives a UFFD-WP event and forwards it to QEMU or
>>     QEMU polls all other processes for UFFD events.
>> b) QEMU writes the to-be-changed page to the migration stream.
>> c) QEMU triggers all processes to un-protect the page and wake up any
>>     waiters. All processes clear the uffd-WP bit in their page tables.
>>
>> 3. Background snapshot completes:
>> a) All processes unregister the uffd handler
>>
>>
>> Now imagine something like this:
>>
>> 1. Background snapshot preparation:
>> a) QEMU registers a UFFD-WP handler on a *memfd file* that corresponds
>>     to guest memory.
>> b) QEMU uffd-wp's the whole file
>>
>> 2. Background snapshot runs:
>> a) QEMU receives a UFFD-WP event.
>> b) QEMU writes the to-be-changed page to the migration stream.
>> c) QEMU un-protect the page and wake up any waiters.
>>
>> 3. Background snapshot completes:
>> a) QEMU unregister the uffd handler
>>
>>
>> Wouldn't that be much nicer and much easier to handle? Yes, it is much
>> harder to implement because such an infrastructure does not exist yet, and
>> it most probably wouldn't be called uffd anymore, because we are dealing
>> with file access. But this way, it would actually be super easy to use the
>> feature across multiple processes and eventually to even catch other file
>> modifications.
> 
> I can totally understand how you see this.  We've discussed about that, isn't
> it? About the ideal worlds. :)

Well, let's dream big :)

> 
> It would be great if this can work out, I hope so.  So far I'm not that
> ambicious, and as I said, I don't know whether there will be other concerns
> when it goes into the page cache layer, and when it's a behavior of multiple
> processes where one of them can rule others without others being notice of it.
> 
> Even if we want to go that way, I think we should first come up with some way
> to describe the domains that one uffd-wp registered file should behave upon.
> It shouldn't be "any process touching this file".
> 
> One quick example in my mind is when a malicious process wants to stop another
> daemon process: it'll be easy as long as the malicious process can delete a
> file that the daemon reads/writes and replace it with a shmem file with uffd-wp
> registered (or maybe just a regular file on a file system, if your proposal will
> naturally work on them).  The problem is, is it really "legal" to be able to
> stop the daemon from running like that?

Good question, I'd imagine e.g., file sealing could forbid uffd (or
however it is called) registration on a file, and there would have to be
a way to reject files that have uffd registered. But it's certainly a
valid concern - and it raises the question of *what* we actually want to
apply such a concept to. Random files? Random memfds? Most probably not.
Special memfds created with an ALLOW_UFFD flag? Sounds like a good idea.

> 
> I also don't know the initial concept behind uffd's design and why it's
> designed at the pte level.  Avoiding vma manipulation should be a major factor, but I
> can't say I understand all of it.  Not sure whether Andrea has any input here.

AFAIU originally a) avoid signal handler madness, b) avoid VMA
modifications and c) avoid taking the mmap lock for writing (well, that
didn't work out completely for uffd-wp for now IIRC).

> 
> That's why I think the current uffd can still make sense with per-process concepts,
> kept that way.  When registering uffd-wp, yes, we need to do that for
> multiple processes, but it also means each process is fully aware that this is
> happening, so it's kind of verified that this is wanted behavior for that
> process.  It'll happen with fewer "surprises", and smells safer.
> 
> I don't think that won't work out.  It may require all the processes to
> support the uffd-wp APIs and cooperate, but so far that's how it should work for me
> in a safe and self-contained way.  Say, every process should be aware of what's
> going to happen on blocked page faults.

That's a valid concern, although I wonder if it can just be handled via 
specially marked memfds ("this memfd might get a uffd handler registered 
later").

>>
>> Again, I am not sure if uffd-wp or softdirty make too much sense in general
>> when applied to shmem. But I'm happy to learn more.
> 
> Me too, I'm more than glad to learn whether the page cache idea would be
> welcomed or whether I'm just wrong about it.  Until I understand more things around
> this, so far I still think the per-process, fd-based solution of uffd
> still makes sense.

I'd be curious about applications where the per-process approach would 
actually solve something a per-fd approach couldn't solve. Maybe there 
are some that I just can't envision.

(using shmem for a single process only isn't a use case I consider 
important :) )

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/1] pagemap: swap location for shared pages
  2021-08-11 18:41           ` David Hildenbrand
@ 2021-08-11 19:54             ` Peter Xu
  2021-08-11 20:13               ` David Hildenbrand
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Xu @ 2021-08-11 19:54 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Tiberiu A Georgescu, akpm, viro, christian.brauner, ebiederm,
	adobriyan, songmuchun, axboe, vincenzo.frascino, catalin.marinas,
	peterz, chinwen.chang, linmiaohe, jannh, apopple, linux-kernel,
	linux-fsdevel, linux-mm, ivan.teterevkov, florian.schmidt,
	carl.waldspurger, jonathan.davies, Andrea Arcangeli

On Wed, Aug 11, 2021 at 08:41:32PM +0200, David Hildenbrand wrote:
> On 11.08.21 20:25, Peter Xu wrote:
> > On Wed, Aug 11, 2021 at 06:15:37PM +0200, David Hildenbrand wrote:
> > > On 04.08.21 21:17, Peter Xu wrote:
> > > > On Wed, Aug 04, 2021 at 08:49:14PM +0200, David Hildenbrand wrote:
> > > > > TBH, I tend to really dislike the PTE marker idea. IMHO, we shouldn't store
> > > > > any state information regarding shared memory in per-process page tables: it
> > > > > just doesn't make too much sense.
> > > > > 
> > > > > And this is similar to SOFTDIRTY or UFFD_WP bits: this information actually
> > > > > belongs to the shared file ("did *someone* write to this page", "is
> > > > > *someone* interested into changes to that page", "is there something"). I
> > > > > know, that screams for a completely different design in respect to these
> > > > > features.
> > > > > 
> > > > > I guess we start learning the hard way that shared memory is just different
> > > > > and requires different interfaces than per-process page table interfaces we
> > > > > have (pagemap, userfaultfd).
> > > > > 
> > > > > I didn't have time to explore any alternatives yet, but I wonder if tracking
> > > > > such stuff per an actual fd/memfd and not via process page tables is
> > > > > actually the right and clean approach. There are certainly many issues to
> > > > > solve, but conceptually to me it feels more natural to have these shared
> > > > > memory features not mangled into process page tables.
> > > > 
> > > > Yes, we can explore all the possibilities, I'm totally fine with it.
> > > > 
> > > > I just want to say I still don't think when there's page cache then we must put
> > > > all the page-relevant things into the page cache.
> > > 
> > > [sorry for the late reply]
> > > 
> > > Right, but for the case of shared, swapped out pages, the information is
> > > already there, in the page cache :)
> > > 
> > > > 
> > > > They're shared by processes, but a process can still have its own way to describe
> > > > its relationship to that page in the cache; to me it's as simple as "we allow
> > > > process A to write to page cache P", while "we don't allow process B to write
> > > > to the same page", like the write bit.
> > > 
> > > The issue I'm having with uffd-wp as it was proposed for shared memory is that
> > > there is hardly a sane use case where we would *want* it to work that way.
> > > 
> > > A UFFD-WP flag in a page table for shared memory means "please notify once
> > > this process modifies the shared memory (via page tables, not via any other
> > > fd modification)". Do we have an example application where these semantics
> > > make sense and don't over-complicate the whole approach? I don't know any,
> > > thus I'm asking dumb questions :)
> > > 
> > > 
> > > For background snapshots in QEMU the flow would currently be like this,
> > > assuming all processes have the shared guest memory mapped.
> > > 
> > > 1. Background snapshot preparation: QEMU requests all processes
> > >     to uffd-wp the range
> > > a) All processes register a uffd handler on guest RAM
> > 
> > To be explicit: not a handler; just register with uffd-wp and pass over the fd
> > to the main process.
> 
> Good point.
> 
> > 
> > > b) All processes fault in all guest memory (essentially populating all
> > >     memory): with a uffd-WP extension we might be able to get rid of
> > >     that; I remember you were working on that.
> > > c) All processes uffd-WP the range to set the bit in their page table
> > > 
> > > 2. Background snapshot runs:
> > > a) A process either receives a UFFD-WP event and forwards it to QEMU or
> > >     QEMU polls all other processes for UFFD events.
> > > b) QEMU writes the to-be-changed page to the migration stream.
> > > c) QEMU triggers all processes to un-protect the page and wake up any
> > >     waiters. All processes clear the uffd-WP bit in their page tables.
> > > 
> > > 3. Background snapshot completes:
> > > a) All processes unregister the uffd handler
> > > 
> > > 
> > > Now imagine something like this:
> > > 
> > > 1. Background snapshot preparation:
> > > a) QEMU registers a UFFD-WP handler on a *memfd file* that corresponds
> > >     to guest memory.
> > > b) QEMU uffd-wp's the whole file
> > > 
> > > 2. Background snapshot runs:
> > > a) QEMU receives a UFFD-WP event.
> > > b) QEMU writes the to-be-changed page to the migration stream.
> > > c) QEMU un-protects the page and wakes up any waiters.
> > > 
> > > 3. Background snapshot completes:
> > > a) QEMU unregisters the uffd handler
> > > 
> > > 
> > > Wouldn't that be much nicer and much easier to handle? Yes, it is much
> > > harder to implement because such an infrastructure does not exist yet, and
> > > it most probably wouldn't be called uffd anymore, because we are dealing
> > > with file access. But this way, it would actually be super easy to use the
> > > feature across multiple processes and eventually to even catch other file
> > > modifications.
> > 
> > I can totally understand how you see this.  We've discussed that before,
> > haven't we? About ideal worlds. :)
> 
> Well, let's dream big :)
> 
> > 
> > It would be great if this could work out, I hope so.  So far I'm not that
> > ambitious, and as I said, I don't know whether there will be other concerns
> > when it goes into the page cache layer, and when it's a behavior of multiple
> > processes where one of them can rule the others without them being notified of it.
> > 
> > Even if we want to go that way, I think we should first come up with some way
> > to describe the domain that one uffd-wp-registered file should operate on.
> > It shouldn't be "any process touching this file".
> > 
> > One quick example in my mind is when a malicious process wants to stop another
> > daemon process: it'll be easy as long as the malicious process can delete a
> > file that the daemon reads/writes and replace it with a shmem file with uffd-wp
> > registered (or maybe just a regular file on a file system, if your proposal will
> > naturally work on them).  The problem is, is it really "legal" to be able to
> > stop the daemon from running like that?
> 
> Good question, I'd imagine e.g., file sealing could forbid uffd (or however
> it is called) registration on a file, and there would have to be a way to
> reject files that have uffd registered. But it's certainly a valid concern -
> and it raises the question of *what* we actually want to apply such a
> concept to. Random files? Random memfds? Most probably not. Special memfds
> created with an ALLOW_UFFD flag? Sounds like a good idea.

Note that when daemons open files, they may not be aware of what's underneath
and will just read the file directly.  The attacker could still create the file
with uffd-wp enabled using whatever flag we introduce.

> 
> > 
> > I also don't know the initial concept behind uffd's design and why it's
> > designed at the pte level.  Avoiding vma manipulation should be a major factor, but I
> > can't say I understand all of it.  Not sure whether Andrea has any input here.
> 
> AFAIU originally a) avoid signal handler madness, b) avoid VMA
> modifications and c) avoid taking the mmap lock for writing (well, that didn't
> work out completely for uffd-wp for now IIRC).

Nadav fixed that; it takes the read lock now, just like when it was introduced.
Please see mwriteprotect_range() and commit 6ce64428d62026a10c.

I wouldn't say the message queue (vs. signal handling) is because uffd is pte-based;
that seems to be an orthogonal design decision.  But yeah, I agree that's
something better than using signal handlers.

> 
> > 
> > That's why I think the current uffd can still make sense with per-process concepts,
> > kept that way.  When registering uffd-wp, yes, we need to do that for
> > multiple processes, but it also means each process is fully aware that this is
> > happening, so it's kind of verified that this is wanted behavior for that
> > process.  It'll happen with fewer "surprises", and smells safer.
> > 
> > I don't think that won't work out.  It may require all the processes to
> > support the uffd-wp APIs and cooperate, but so far that's how it should work for me
> > in a safe and self-contained way.  Say, every process should be aware of what's
> > going to happen on blocked page faults.
> 
> That's a valid concern, although I wonder if it can just be handled via
> specially marked memfds ("this memfd might get a uffd handler registered
> later").

Yes, please see my above concern.  So I think we at least reached consensus on:
(1) that idea is already not userfaultfd but something else; what that is, is
still to be defined.  And, (2) it definitely needs further thought and
context to support its validity and safety.  uffd already got people worried
about safety; that's why all the uffd SELinux work and the unprivileged_userfaultfd
sysctl came to mainline.  We'd wish good luck to the new concept!

OTOH, the whole uffd idea is already in mainline.  It has the limitation of
requiring all processes to be reworked to support uffd-wp, but the same already
happened for MISSING messages and our QE is testing those: that's what we do
with e.g. postcopy-migrating vhost-user enabled OVS-DPDK - we pass over a uffd
registered in missing mode and let QEMU handle the page faults.  So it's a bit
complicated but it should work.  And I hope you can also agree we don't need to
block uffd before that idea settles.

The pte markers idea needs comments; that's about implementation, and it'll be
great to have comments there or even a NACK (better with a better suggestion,
though :).  But the original idea of uffd being pte-based has never changed.

> 
> > > 
> > > Again, I am not sure if uffd-wp or softdirty make too much sense in general
> > > when applied to shmem. But I'm happy to learn more.
> > 
> > Me too, I'm more than glad to learn whether the page cache idea would be
> > welcomed or whether I'm just wrong about it.  Until I understand more things around
> > this, so far I still think the per-process, fd-based solution of uffd
> > still makes sense.
> 
> I'd be curious about applications where the per-process approach would
> actually solve something a per-fd approach couldn't solve. Maybe there are
> some that I just can't envision.

Right, that's a good point.

Actually it could be something like virtio-mem, where some process shouldn't
have write privilege while we still allow some other process to write to the
shmem.  Something like that.

> 
> (using shmem for a single process only isn't a use case I consider important
> :) )

Do you still remember the discussion about "having QEMU start to use memfd and
shmem by default"? :)

shmem is hard but it's indeed useful in many cases, even for a single process.
For example, shmem-based VMs can do a local binary update without migrating guest
RAM (because memory is shared between the old/new binaries!).  To me it's always
been a valid request to enable both shmem and write protection.

[It seems Andrea was not in the loop; adding him for real]

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/1] pagemap: swap location for shared pages
  2021-08-11 19:54             ` Peter Xu
@ 2021-08-11 20:13               ` David Hildenbrand
  0 siblings, 0 replies; 13+ messages in thread
From: David Hildenbrand @ 2021-08-11 20:13 UTC (permalink / raw)
  To: Peter Xu
  Cc: Tiberiu A Georgescu, akpm, viro, christian.brauner, ebiederm,
	adobriyan, songmuchun, axboe, vincenzo.frascino, catalin.marinas,
	peterz, chinwen.chang, linmiaohe, jannh, apopple, linux-kernel,
	linux-fsdevel, linux-mm, ivan.teterevkov, florian.schmidt,
	carl.waldspurger, jonathan.davies, Andrea Arcangeli

>>
>> Good question, I'd imagine e.g., file sealing could forbid uffd (or however
>> it is called) registration on a file, and there would have to be a way to
>> reject files that have uffd registered. But it's certainly a valid concern -
>> and it raises the question of *what* we actually want to apply such a
>> concept to. Random files? Random memfds? Most probably not. Special memfds
>> created with an ALLOW_UFFD flag? Sounds like a good idea.
> 
> Note that when daemons open files, they may not be aware of what's underneath
> and will just read the file directly.  The attacker could still create the file
> with uffd-wp enabled using whatever flag we introduce.

Right, but we could, for example, use a prctl to make a process opt
in to opening possibly-uffd-wp-protected files at all. I guess securing
that aspect shouldn't be a hard nut to crack. At least to my thinking.

> 
>>
>>>
>>> I also don't know the initial concept behind uffd's design and why it's
>>> designed at the pte level.  Avoiding vma manipulation should be a major factor, but I
>>> can't say I understand all of it.  Not sure whether Andrea has any input here.
>>
>> AFAIU originally a) avoid signal handler madness, b) avoid VMA
>> modifications and c) avoid taking the mmap lock for writing (well, that didn't
>> work out completely for uffd-wp for now IIRC).
> 
> Nadav fixed that; it takes the read lock now, just like when it was introduced.
> Please see mwriteprotect_range() and commit 6ce64428d62026a10c.

Oh, rings a bell, thanks!

>>
>>>
>>> That's why I think the current uffd can still make sense with per-process concepts,
>>> kept that way.  When registering uffd-wp, yes, we need to do that for
>>> multiple processes, but it also means each process is fully aware that this is
>>> happening, so it's kind of verified that this is wanted behavior for that
>>> process.  It'll happen with fewer "surprises", and smells safer.
>>>
>>> I don't think that won't work out.  It may require all the processes to
>>> support the uffd-wp APIs and cooperate, but so far that's how it should work for me
>>> in a safe and self-contained way.  Say, every process should be aware of what's
>>> going to happen on blocked page faults.
>>
>> That's a valid concern, although I wonder if it can just be handled via
>> specially marked memfds ("this memfd might get a uffd handler registered
>> later").
> 
> Yes, please see my above concern.  So I think we at least reached consensus on:
> (1) that idea is already not userfaultfd but something else; what that is, is
> still to be defined.  And, (2) it definitely needs further thought and
> context to support its validity and safety.  uffd already got people worried
> about safety; that's why all the uffd SELinux work and the unprivileged_userfaultfd
> sysctl came to mainline.  We'd wish good luck to the new concept!

Sure, whenever you introduce random ever-lasting delays, we have to be
very careful about what we support. Even if that means not supporting some
ioctls for such a special memfd (hello secretmem :)).

> 
> OTOH, the whole uffd idea is already in mainline.  It has the limitation of
> requiring all processes to be reworked to support uffd-wp, but the same already
> happened for MISSING messages and our QE is testing those: that's what we do
> with e.g. postcopy-migrating vhost-user enabled OVS-DPDK - we pass over a uffd
> registered in missing mode and let QEMU handle the page faults.  So it's a bit
> complicated but it should work.  And I hope you can also agree we don't need to
> block uffd before that idea settles.

Let's phrase it that way: instead of extending something that just 
doesn't fit cleanly and feels kind of hackish (see my approach to 
teaching QEMU background snapshots above), I'd much rather see something 
clean and actually performant for the use cases I am aware of.

That doesn't mean that your current uffd-wp approach on shmem is all bad
IMHO (well, I make no decisions either way :) ); I'd just like us to
eventually find an approach to handle this cleanly, instead of
trying to solve problems we might not have to solve after all (pte
markers) when things are done differently.

> 
> The pte markers idea needs comments; that's about implementation, and it'll be
> great to have comments there or even a NACK (better with a better suggestion,
> though :).  But the original idea of uffd being pte-based has never changed.

Right, I hope some other people can comment. If we want to go down that
path for uffd-wp, pte markers make sense. I'm not convinced we want them
to handle swapped shared pages, but that discussion is better off in
your posting.


>>
>>>>
>>>> Again, I am not sure if uffd-wp or softdirty make too much sense in general
>>>> when applied to shmem. But I'm happy to learn more.
>>>
>>> Me too, I'm more than glad to learn whether the page cache idea would be
>>> welcomed or whether I'm just wrong about it.  Until I understand more things around
>>> this, so far I still think the per-process, fd-based solution of uffd
>>> still makes sense.
>>
>> I'd be curious about applications where the per-process approach would
>> actually solve something a per-fd approach couldn't solve. Maybe there are
>> some that I just can't envision.
> 
> Right, that's a good point.
> 
> Actually it could be something like virtio-mem, where some process shouldn't
> have write privilege while we still allow some other process to write to the
> shmem.  Something like that.

With virtio-mem, you most probably wouldn't want anybody writing to it, 
at least from what I can tell. But I understand the rough idea -- just 
that you cannot enforce something on another process that doesn't play 
along (at least with the current uffd-wp approach! you could with an 
fd-based approach).

> 
>>
>> (using shmem for a single process only isn't a use case I consider important
>> :) )
> 
> Do you still remember the discussion about "having QEMU start to use memfd and
> shmem by default"? :)

Oh yes :)

> 
> shmem is hard but it's indeed useful in many cases, even for a single process.
> For example, shmem-based VMs can do a local binary update without migrating guest
> RAM (because memory is shared between the old/new binaries!).  To me it's always
> been a valid request to enable both shmem and write protection.

Right, but it would also just work with an fd-based approach. (well, 
unless we're dealing with shared anonymous RAM, but that is just some 
weird stuff for really exotic use cases)


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2021-08-11 20:14 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-30 16:08 [PATCH 0/1] pagemap: swap location for shared pages Tiberiu A Georgescu
2021-07-30 16:08 ` [PATCH 1/1] pagemap: report " Tiberiu A Georgescu
2021-07-30 17:28 ` [PATCH 0/1] pagemap: " Eric W. Biederman
2021-08-02 12:20   ` Tiberiu Georgescu
2021-08-04 18:33 ` Peter Xu
2021-08-04 18:49   ` David Hildenbrand
2021-08-04 19:17     ` Peter Xu
2021-08-11 16:15       ` David Hildenbrand
2021-08-11 16:17         ` David Hildenbrand
2021-08-11 18:25         ` Peter Xu
2021-08-11 18:41           ` David Hildenbrand
2021-08-11 19:54             ` Peter Xu
2021-08-11 20:13               ` David Hildenbrand
