Re: mm: unnecessary COW phenomenon

From: Nadav Amit <nadav.amit@gmail.com>
To: Peter Xu <peterx@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>,
	Linux-MM <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>,
	David Hildenbrand <david@redhat.com>
Subject: Re: mm: unnecessary COW phenomenon
Date: Wed, 10 Nov 2021 02:47:33 -0800
Message-ID: <0480D692-D9B2-429A-9A88-9BBA1331AC3A@gmail.com>
In-Reply-To: <YWe7x5DK0sMDskYE@t490s>

> On Oct 13, 2021, at 10:10 PM, Peter Xu <peterx@redhat.com> wrote:
> 
> On Wed, Oct 13, 2021 at 03:42:08PM -0700, Nadav Amit wrote:
>> Andrea, Peter, others,
> 
> Hi, Nadav,
> 
>> 
>> I encountered many unnecessary COW operations on my development kernel
>> (based on Linux 5.13), which I did not see a report about and I am not
>> sure how to solve. Any advice would be appreciated.
>> 
>> Commit 09854ba94c6aa ("mm: do_wp_page() simplification") prevents the reuse of
>> a page on write-protect fault if page_count(page) != 1. In that case,
>> wp_page_reuse() is not used and instead the page is COW'd by
>> wp_page_copy(). wp_page_copy() is obviously much more expensive, not only
>> because of the copying, but also because it requires a TLB flush and
>> potentially a TLB shootdown.
>> 
>> The scenario I encountered happens when I use userfaultfd, but presumably it
>> might happen regardless of userfaultfd (perhaps swap device with
>> SWP_SYNCHRONOUS_IO). It involves two page faults: one that maps a new
>> anonymous page as read-only and a second write-protect fault that happens
>> shortly after on the same page. In this case the page count is almost always
>> elevated and therefore a COW is needed.
>> 

[ snip ]

>> 
>> It turns out that the elevated page count is due to the caching of the page in
>> the local LRU cache (by lru_cache_add() which is called by
>> lru_cache_add_inactive_or_unevictable() in the userfaultfd case). Since the
>> first fault happened shortly before the second write-protect fault, the LRU
>> cache was still not drained, so the page count was not decreased and a COW is
>> needed.
>> 
>> Calling lru_add_drain() during this flow resolves the issue most of the time.
>> Obviously, it needs to be called on the core that allocated (i.e., faulted
>> in) the page initially to work. It is possible to do it conditionally only if
>> the page-count is greater than 1.
> 
> I agree with your analysis.  I didn't even notice the lru_cache_add() can cause
> it very likely to trigger the COW in your uffd use case (and also for swap),
> but that's indeed something could happen with the current page reuse logic in
> do_wp_page(), afaiu.

Just an update for the record based on an offline correspondence with Andrea
and Peter, who were very helpful (thanks!).

I could not come up with a non-hacky solution just for this problem. While it
is possible to drain the LRU conditionally, it is admittedly a hack with some
downsides.

The aforementioned issue - an unnecessary TLB flush (or even shootdown) on COW
operations - is not limited to userfaultfd, and not even to
SWP_SYNCHRONOUS_IO. It seems that whenever swap is set up on a very
low-latency device (e.g., pmem, zram), the unnecessary COW might happen and
hurt performance.

I created a small test to measure the impact of the phenomenon (the test code
is below). Swap is set up on an emulated pmem device and the test is run with:

	./forceswap 2 100000 1

The benchmark runs 100k rounds in which a page is accessed first for read,
then for write, and then paged out using MADV_PAGEOUT. Each of the two
accesses causes a page fault. The test only measures the time of the second
access, which should include the write-protect page fault. I also measured
the delta in "nr_tlb_remote_flush" from /proc/vmstat.

The results are:

kernel		commit		cycles/op	nr_tlb_remote_flush
-------------------------------------------------------------------
v5.8		bcf876870b95	1606		300000
mainline	cb690f5238d7	10534		399935

As shown, the write-protect fault in mainline takes ~6.5x as long, which
is explained by the COW operation, visible as the extra TLB shootdowns
(nr_tlb_remote_flush). On bare-metal this overhead should be lower,
yet with a higher number of threads the overhead would increase.

I also tried to collect the number of I/Os, but for some reason they do not
show up in /sys/dev/block/X/stat for pmem.

[ Some config details:
  KVM VM running on Haswell.
  host: max-freq; kvm_intel's ple_gap=0; 2MB pages.
  VM: mitigations=off idle=poll. Kernel compiled with
  CONFIG_DEBUG_TLBFLUSH=y and CONFIG_BLK_DEV_PMEM=y. ]

-- >8 --

#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <sys/mman.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SIZE		(4096)
#define MAX_THREADS		(50)

volatile int stop = 0;
unsigned long nops;

void* thread_start(void *arg)
{
	while (!stop) {
		asm volatile ("pause" ::: "memory");
	}

	return (void*)NULL;
}

/* Read the TSC. rdtscp also writes IA32_TSC_AUX to ecx, which is discarded. */
static inline uint64_t rdtscp(void)
{
	uint64_t rax, rdx, aux;

	asm volatile ("rdtscp\n" : "=a" (rax), "=d" (rdx), "=c" (aux) : : );
	return (rdx << 32) + rax;
}

int main(int argc, char *argv[])
{
	int r, nthreads, npages, j;
	unsigned long i;
	pthread_attr_t attr;
	pthread_t thread_ids[MAX_THREADS];
	void *res;
	volatile char *p, c;
	uint64_t time = 0;

	if (argc < 4) {
		fprintf(stderr, "usage: %s [nthreads] [nops] [npages]\n", argv[0]);
		exit(-1);
	}

	r = pthread_attr_init(&attr);
	if (r != 0) {
		fprintf(stderr, "error setting attributes %d\n", r);
		exit(-1);
	}

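	nthreads = atoi(argv[1]);
	nops = strtoull(argv[2], NULL, 0);
	npages = atoi(argv[3]);

	/* Reject values that would overflow thread_ids[] or divide by zero. */
	if (nthreads < 1 || nthreads > MAX_THREADS + 1 || nops == 0 ||
	    npages < 1) {
		fprintf(stderr, "invalid arguments\n");
		exit(-1);
	}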

	for (i = 0; i < nthreads - 1; i++) {
		r = pthread_create(&thread_ids[i], &attr, &thread_start, NULL);
		if (r != 0) {
			fprintf(stderr, "error creating thread %d\n", r);
			exit(-1);
		}
	}

	p = (volatile char*)mmap(0, PAGE_SIZE * npages, PROT_READ|PROT_WRITE,
				 MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		exit(-1);
	}

	for (i = 0; i < nops; i++) {
		if (madvise((void *)p, PAGE_SIZE * npages, MADV_PAGEOUT)) {
			perror("madvise");
			exit(-1);
		}

		for (j = 0; j < npages; j++) {
			/* Read access: faults the page back in. */
			c = p[j * PAGE_SIZE];
			c++;
			/* Write access: only this part is timed. */
			time -= rdtscp();
			p[j * PAGE_SIZE] = c;
			time += rdtscp();
		}
	}
	stop = 1;
	for (i = 0; i < nthreads - 1; i++) {
		r = pthread_join(thread_ids[i], &res);
		if (r != 0) {
			fprintf(stderr, "error join\n");
			exit(-1);
		}
	}
	printf("time: %lu\n", (unsigned long)(time / nops));
	return 0;
}