linux-kernel.vger.kernel.org archive mirror
* mm: unnecessary COW phenomenon
@ 2021-10-13 22:42 Nadav Amit
  2021-10-14  5:10 ` Peter Xu
  0 siblings, 1 reply; 3+ messages in thread
From: Nadav Amit @ 2021-10-13 22:42 UTC
  To: Andrea Arcangeli, Peter Xu; +Cc: Linux-MM, LKML

Andrea, Peter, others,

I encountered many unnecessary COW operations on my development kernel
(based on Linux 5.13), which I did not see a report about and I am not
sure how to solve. Any advice would be appreciated.

Commit 09854ba94c6aa ("mm: do_wp_page() simplification") prevents the reuse of
a page on write-protect fault if page_count(page) != 1. In that case,
wp_page_reuse() is not used and instead the page is COW'd by
wp_page_copy(). wp_page_copy() is obviously much more expensive, not only
because of the copying, but also because it requires a TLB flush and
potentially a TLB shootdown.
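
A minimal userspace model of the decision described above may help. It is not
kernel code: struct model_page, enum wp_action and model_wp_fault() are
hypothetical names, and the only thing modeled is the page_count(page) != 1
test.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page: only the reference count matters. */
struct model_page {
	int refcount;
};

enum wp_action {
	WP_REUSE,	/* keep the page, make the PTE writable (wp_page_reuse() path) */
	WP_COPY		/* allocate and copy a new page (wp_page_copy() path) */
};

/* Models only the page_count(page) != 1 test; everything else is omitted. */
static enum wp_action model_wp_fault(const struct model_page *page)
{
	assert(page != NULL);
	return page->refcount == 1 ? WP_REUSE : WP_COPY;
}
```

In this model any extra reference, however short-lived, is enough to push the
fault onto the copy path.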

The scenario I encountered happens when I use userfaultfd, but presumably it
might happen regardless of userfaultfd (perhaps swap device with
SWP_SYNCHRONOUS_IO). It involves two page faults: one that maps a new
anonymous page as read-only and a second write-protect fault that happens
shortly after on the same page. In this case the page count is almost always
elevated and therefore a COW is needed.

[ The specific scenario that I have is as follows: I map a page to the
monitored process using UFFDIO_COPY (actually a variant I am working on) as
write-protected. Then, shortly after, a write access to the page triggers a
page fault. The uffd monitor quickly resolves the page fault using
UFFDIO_WRITEPROTECT. The kernel keeps the page write-protected in the page
tables but marks it logically as uffd-unprotected, and the page fault is
retried. The retry triggers a COW. ]
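
For reference, the monitor-side step in this scenario could look roughly like
the sketch below. It assumes kernel headers that provide UFFDIO_WRITEPROTECT
(Linux 5.7 or later); uffd_unprotect() is a hypothetical helper name, and the
UFFDIO_COPY variant mentioned above is not shown.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/*
 * Clear uffd write protection on [addr, addr + len) and wake the faulting
 * thread. Returns 0 on success, -1 with errno set on failure.
 */
static int uffd_unprotect(int uffd, uint64_t addr, uint64_t len)
{
	struct uffdio_writeprotect wp;

	assert(len > 0);
	memset(&wp, 0, sizeof(wp));
	wp.range.start = addr;
	wp.range.len = len;
	wp.mode = 0;	/* UFFDIO_WRITEPROTECT_MODE_WP not set: remove protection */

	return ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
}
```

The monitor would call this with the page-aligned fault address taken from the
uffd_msg it read from the userfaultfd file descriptor.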

It turns out that the elevated page count is due to the caching of the page in
the local LRU cache (by lru_cache_add(), which is called by
lru_cache_add_inactive_or_unevictable() in the case of userfaultfd). Since the
first fault happened shortly before the second write-protect fault, the LRU
cache had not yet been drained, so the page count was not decreased and a COW
was needed.

Calling lru_add_drain() during this flow resolves the issue most of the time.
Obviously, it needs to be called on the core that allocated (i.e., faulted
in) the page initially to work. It is possible to do it conditionally only if
the page-count is greater than 1.
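
The effect of the per-CPU cache reference and of a conditional drain can be
illustrated with a small userspace model. All names here (struct
model_lru_cache, model_lru_add(), model_lru_drain(),
model_wp_fault_with_drain()) are hypothetical, the cache size is arbitrary, and
the model ignores locking and everything else the real LRU code does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_CACHE_SLOTS 15	/* arbitrary size for this model */

struct model_page {
	int refcount;
};

/* Stand-in for one CPU's local LRU cache. */
struct model_lru_cache {
	struct model_page *slots[MODEL_CACHE_SLOTS];
	size_t used;
};

/* Drop the references held by this cache, as a drain would. */
static void model_lru_drain(struct model_lru_cache *cache)
{
	for (size_t i = 0; i < cache->used; i++)
		cache->slots[i]->refcount--;
	cache->used = 0;
}

/* Queue a page; the cache holds an extra reference until it is drained. */
static void model_lru_add(struct model_lru_cache *cache, struct model_page *page)
{
	assert(cache != NULL && page != NULL);
	if (cache->used == MODEL_CACHE_SLOTS)
		model_lru_drain(cache);
	page->refcount++;
	cache->slots[cache->used++] = page;
}

/* Returns true if the page can be reused in place, false if a COW is needed. */
static bool model_wp_fault_with_drain(struct model_lru_cache *local,
				      struct model_page *page)
{
	if (page->refcount > 1)
		model_lru_drain(local);	/* conditional drain of this CPU's cache */
	return page->refcount == 1;
}
```

If the page was queued on the local cache, the drain brings the count back to
1 and the page is reused. If it was queued on another CPU's cache, the local
drain does not help, which matches "most of the time" above.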

My questions to you (if I may) are:

1. Am I missing something?
2. Should it happen in other cases, specifically SWP_SYNCHRONOUS_IO?
3. Do you have a better solution?



* Re: mm: unnecessary COW phenomenon
  2021-10-13 22:42 mm: unnecessary COW phenomenon Nadav Amit
@ 2021-10-14  5:10 ` Peter Xu
  2021-11-10 10:47   ` Nadav Amit
  0 siblings, 1 reply; 3+ messages in thread
From: Peter Xu @ 2021-10-14  5:10 UTC
  To: Nadav Amit; +Cc: Andrea Arcangeli, Linux-MM, LKML

On Wed, Oct 13, 2021 at 03:42:08PM -0700, Nadav Amit wrote:
> Andrea, Peter, others,

Hi, Nadav,

> 
> I encountered many unnecessary COW operations on my development kernel
> (based on Linux 5.13), which I did not see a report about and I am not
> sure how to solve. Any advice would be appreciated.
> 
> Commit 09854ba94c6aa ("mm: do_wp_page() simplification") prevents the reuse of
> a page on write-protect fault if page_count(page) != 1. In that case,
> wp_page_reuse() is not used and instead the page is COW'd by
> wp_page_copy(). wp_page_copy() is obviously much more expensive, not only
> because of the copying, but also because it requires a TLB flush and
> potentially a TLB shootdown.
> 
> The scenario I encountered happens when I use userfaultfd, but presumably it
> might happen regardless of userfaultfd (perhaps swap device with
> SWP_SYNCHRONOUS_IO). It involves two page faults: one that maps a new
> anonymous page as read-only and a second write-protect fault that happens
> shortly after on the same page. In this case the page count is almost always
> elevated and therefore a COW is needed.
> 
> [ The specific scenario that I have is as follows: I map a page to the
> monitored process using UFFDIO_COPY (actually a variant I am working on) as
> write-protected. Then, shortly after, a write access to the page triggers a
> page fault. The uffd monitor quickly resolves the page fault using
> UFFDIO_WRITEPROTECT. The kernel keeps the page write-protected in the page
> tables but marks it logically as uffd-unprotected, and the page fault is
> retried. The retry triggers a COW. ]
> 
> It turns out that the elevated page count is due to the caching of the page in
> the local LRU cache (by lru_cache_add(), which is called by
> lru_cache_add_inactive_or_unevictable() in the case of userfaultfd). Since the
> first fault happened shortly before the second write-protect fault, the LRU
> cache had not yet been drained, so the page count was not decreased and a COW
> was needed.
> 
> Calling lru_add_drain() during this flow resolves the issue most of the time.
> Obviously, it needs to be called on the core that allocated (i.e., faulted
> in) the page initially to work. It is possible to do it conditionally only if
> the page-count is greater than 1.
> 
> My questions to you (if I may) are:
> 
> 1. Am I missing something?

I agree with your analysis.  I hadn't even noticed that lru_cache_add() makes
it very likely to trigger the COW in your uffd use case (and also for swap),
but that's indeed something that could happen with the current page reuse
logic in do_wp_page(), afaiu.

> 2. Should it happen in other cases, specifically SWP_SYNCHRONOUS_IO?

Frankly I don't know why SWP_SYNCHRONOUS_IO matters here, as that seems to me a
flag to tell whether the swap device is fast on IO, so that swapping can be done
synchronously and skip the swap cache.  E.g., I think normal swapping could have
a similar issue too, as long as in do_swap_page() the reuse_swap_page() call is
either not triggered (which means it's a read fault) or returns false
(which means there's more than 1 map+swap count).

> 3. Do you have a better solution?

What you suggested as "conditionally lru draining in fault path" seems okay,
but that does look like yet another band-aid to the page reuse logic..
Meanwhile sorry I don't have anything better in mind.  Andrea proposed the
mapcount unshare solution [1] (I believe you should be aware of it now; it
definitely needs some time reading if you didn't follow that previously...) and
that definitely can resolve this issue too; it's just that upstream hasn't
reached a consensus on that, so the page reuse is kept the current way,
depending on refcount rather than mapcount.

[1] https://github.com/aagit/aa/tree/mapcount_unshare

Thanks,

-- 
Peter Xu



* Re: mm: unnecessary COW phenomenon
  2021-10-14  5:10 ` Peter Xu
@ 2021-11-10 10:47   ` Nadav Amit
  0 siblings, 0 replies; 3+ messages in thread
From: Nadav Amit @ 2021-11-10 10:47 UTC
  To: Peter Xu; +Cc: Andrea Arcangeli, Linux-MM, LKML, David Hildenbrand



> On Oct 13, 2021, at 10:10 PM, Peter Xu <peterx@redhat.com> wrote:
> 
> On Wed, Oct 13, 2021 at 03:42:08PM -0700, Nadav Amit wrote:
>> Andrea, Peter, others,
> 
> Hi, Nadav,
> 
>> 
>> I encountered many unnecessary COW operations on my development kernel
>> (based on Linux 5.13), which I did not see a report about and I am not
>> sure how to solve. Any advice would be appreciated.
>> 
>> Commit 09854ba94c6aa ("mm: do_wp_page() simplification") prevents the reuse of
>> a page on write-protect fault if page_count(page) != 1. In that case,
>> wp_page_reuse() is not used and instead the page is COW'd by
>> wp_page_copy(). wp_page_copy() is obviously much more expensive, not only
>> because of the copying, but also because it requires a TLB flush and
>> potentially a TLB shootdown.
>> 
>> The scenario I encountered happens when I use userfaultfd, but presumably it
>> might happen regardless of userfaultfd (perhaps swap device with
>> SWP_SYNCHRONOUS_IO). It involves two page faults: one that maps a new
>> anonymous page as read-only and a second write-protect fault that happens
>> shortly after on the same page. In this case the page count is almost always
>> elevated and therefore a COW is needed.
>> 

[ snip ]

>> 
>> It turns out that the elevated page count is due to the caching of the page in
>> the local LRU cache (by lru_cache_add(), which is called by
>> lru_cache_add_inactive_or_unevictable() in the case of userfaultfd). Since the
>> first fault happened shortly before the second write-protect fault, the LRU
>> cache had not yet been drained, so the page count was not decreased and a COW
>> was needed.
>> 
>> Calling lru_add_drain() during this flow resolves the issue most of the time.
>> Obviously, it needs to be called on the core that allocated (i.e., faulted
>> in) the page initially to work. It is possible to do it conditionally only if
>> the page-count is greater than 1.
> 
> I agree with your analysis.  I hadn't even noticed that lru_cache_add() makes
> it very likely to trigger the COW in your uffd use case (and also for swap),
> but that's indeed something that could happen with the current page reuse
> logic in do_wp_page(), afaiu.

Just an update for the record based on an offline correspondence with Andrea
and Peter, who were very helpful (thanks!)

I could not come up with a non-hacky solution just for this problem. While it
is possible to drain the LRU conditionally, it is admittedly a hack with some
downsides.

The aforementioned issue - unnecessary TLB flush (or even shootdown) on COW
operations - is not limited to userfaultfd and not even to
SWP_SYNCHRONOUS_IO. It seems that whenever swap is set up on a very
low-latency device (e.g., pmem, zram), the unnecessary COW might happen and
impact performance negatively.

I created a small test to verify the impact of the phenomenon (the test code
is below). Swap is set up on an emulated pmem device and the test is then run
with:

	./forceswap 2 100000 1

The benchmark runs 100k rounds in which a page is accessed first for read,
then for write, and then the page is paged out using MADV_PAGEOUT. Each of the
two accesses causes a page fault. The test only measures the time of the second
access, which should include the wp page fault. I also measured the delta
in "nr_tlb_remote_flush" from /proc/vmstat.
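
One way to take that delta from C, sketched under the assumption that
/proc/vmstat consists of "name value" lines; read_vmstat_counter() is a
hypothetical helper, not part of the test program below. The
nr_tlb_remote_flush counter is only present with CONFIG_DEBUG_TLBFLUSH=y.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Scan "name value" lines from an open stream (e.g. /proc/vmstat) and store
 * the value of the requested counter. Returns 0 if found, -1 otherwise.
 */
static int read_vmstat_counter(FILE *f, const char *name,
			       unsigned long long *out)
{
	char key[64];
	unsigned long long value;

	assert(f != NULL && name != NULL && out != NULL);
	while (fscanf(f, "%63s %llu", key, &value) == 2) {
		if (strcmp(key, name) == 0) {
			*out = value;
			return 0;
		}
	}
	return -1;
}
```

Opening /proc/vmstat with fopen() and calling the helper once before and once
after the benchmark loop, then subtracting the two values, gives the delta.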

The results are:

kernel     commit         cycles/op   nr_tlb_remote_flush
---------------------------------------------------------
v5.8       bcf876870b95        1606                300000
mainline   cb690f5238d7       10534                399935


As shown, the write-protect fault in mainline takes ~6.5x as long, which
is explained by the COW operation, visible as the extra TLB shootdowns
(nr_tlb_remote_flush). On bare-metal this overhead should be lower, yet if
the number of threads is higher the overhead would increase.
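
The arithmetic behind these figures, as a small sketch: slowdown() and
extra_flushes_per_op() are hypothetical helper names, and the inputs are the
numbers reported above for 100000 rounds.

```c
#include <assert.h>

/* Ratio of per-operation cost between two kernels. */
static double slowdown(double baseline_cycles, double new_cycles)
{
	assert(baseline_cycles > 0.0);
	return new_cycles / baseline_cycles;
}

/* Additional remote TLB flushes per benchmark round. */
static double extra_flushes_per_op(unsigned long base, unsigned long now,
				   unsigned long nops)
{
	assert(nops > 0);
	return ((double)now - (double)base) / (double)nops;
}
```

slowdown(1606, 10534) is about 6.56, and
extra_flushes_per_op(300000, 399935, 100000) is about 1.0, i.e. roughly one
additional remote flush per round on top of the three seen on v5.8.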

I also tried to collect the number of IOs, but for some reason
they do not show up in /sys/dev/block/X/stat for pmem.

[ Some config details:
  KVM VM running on Haswell.
  host: max-freq; kvm_intel's ple_gap=0; 2MB pages.
  VM: mitigations=off idle=poll. Kernel compiled with
  CONFIG_DEBUG_TLBFLUSH=y and CONFIG_BLK_DEV_PMEM=y. ]

-- >8 --

#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <sys/mman.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SIZE		(4096)
#define MAX_THREADS		(50)

volatile int stop = 0;
unsigned long nops;

void* thread_start(void *arg)
{
	while (!stop) {
		asm volatile ("pause" ::: "memory");
	}

	return (void*)NULL;
}

static inline uint64_t rdtscp(void)
{
	uint64_t rax, rdx, aux;

	asm volatile ("rdtscp\n" : "=a" (rax), "=d" (rdx), "=c" (aux) : : );
	return (rdx << 32) + rax;
}

int main(int argc, char *argv[])
{
	int r, nthreads, npages, j;
	unsigned long i;
	pthread_attr_t attr;
	pthread_t thread_ids[MAX_THREADS];
	void *res;
	volatile char *p, c;
	uint64_t time = 0;

	if (argc < 4) {
		fprintf(stderr, "usage: %s [nthreads] [nops] [npages]\n", argv[0]);
		exit(-1);
	}

	r = pthread_attr_init(&attr);
	if (r != 0) {
		fprintf(stderr, "error setting attributes %d\n", r);
		exit(-1);
	}

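	nthreads = atoi(argv[1]);
	nops = strtoull(argv[2], NULL, 0);
	npages = atoi(argv[3]);

	/* nthreads includes the main thread; nthreads - 1 spinners are created. */
	if (nthreads < 1 || nthreads > MAX_THREADS + 1 || nops == 0 || npages < 1) {
		fprintf(stderr, "invalid arguments\n");
		exit(-1);
	}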

	for (i = 0; i < nthreads - 1; i++) {
		r = pthread_create(&thread_ids[i], &attr, &thread_start, NULL);
		if (r != 0) {
			fprintf(stderr, "error creating thread %d\n", r);
			exit(-1);
		}
	}

	p = (volatile char*)mmap(0, PAGE_SIZE * npages, PROT_READ|PROT_WRITE,
				 MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		exit(-1);
	}

	for (i = 0; i < nops; i++) {
		if (madvise((void *)p, PAGE_SIZE * npages, MADV_PAGEOUT)) {
			perror("madvise");
			exit(-1);
		}

		for (j = 0; j < npages; j++) {
			c = p[j * PAGE_SIZE];
			c++;
			time -= rdtscp();
			p[j * PAGE_SIZE] = c;
			time += rdtscp();
		}
	}
	stop = 1;
	for (i = 0; i < nthreads - 1; i++) {
		r = pthread_join(thread_ids[i], &res);
		if (r != 0) {
			fprintf(stderr, "error join\n");
			exit(-1);
		}
	}
	printf("time: %llu\n", (unsigned long long)(time / nops));
	return 0;
}

