Re: bug: data corruption introduced by commit 83d116c53058 ("mm: fix double page fault on arm64 if PTE_AF is cleared")

From: Jeff Moyer <jmoyer@redhat.com>
To: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Justin He <Justin.He@arm.com>,
	 Catalin Marinas <Catalin.Marinas@arm.com>,
	 "Kirill A.Shutemov" <kirill.shutemov@linux.intel.com>,
	 "linux-mm\@kvack.org" <linux-mm@kvack.org>
Subject: Re: bug: data corruption introduced by commit 83d116c53058 ("mm: fix double page fault on arm64 if PTE_AF is cleared")
Date: Tue, 11 Feb 2020 17:01:40 -0500	[thread overview]
Message-ID: <x49d0akokaj.fsf@segfault.boston.devel.redhat.com> (raw)
In-Reply-To: <20200211214439.v6bowbkyienrwtdd@box> (Kirill A. Shutemov's message of "Wed, 12 Feb 2020 00:44:39 +0300")

[-- Attachment #1: Type: text/plain, Size: 2500 bytes --]

"Kirill A. Shutemov" <kirill@shutemov.name> writes:

> On Tue, Feb 11, 2020 at 12:55:50PM -0500, Jeff Moyer wrote:
>> "Kirill A. Shutemov" <kirill@shutemov.name> writes:
>> 
>> > On Tue, Feb 11, 2020 at 11:44:06AM -0500, Jeff Moyer wrote:
>> >> Hi, Justin,
>> >> 
>> >> Justin He <Justin.He@arm.com> writes:
>> >> >> Thanks for the report. But this commit 83d116c53058 doesn't add the
>> >> >> new clear_page code path. Besides the pte_mkyoung part, It just refines
>> >> >> the codes(no functional change) and add a WARN_ON_ONCE to indicate
>> >> >> there is any obscure case before.
>> >> >
>> >> > I can't reproduce it with your provided test file on my arm64 qemu with
>> >> > a pmem device.
>> >> > Could you do me a favor that just revert 83d116c53058 but keep that
>> >> > WARN_ON_ONCE after clear_page()? Is there any difference?
>> >> > Thanks for your help
>> >> 
>> >> Below is the patch I used to put the WARN_ON_ONCE after the clear_page,
>> >> just to be sure that's what you intended.  So with 83d116c53058
>> >> reverted, and the below patch applied, the WARN_ON_ONCE does not
>> >> trigger.
>> >
>> > I cannot explain this. There is no locking to prevent the same scenario
>> > before. It might be an timing difference.
>> >
>> > Could try to put a delay before the copy to make race window larger?
>> 
>> I reverted my change to the reproducer, and now it triggers the warning.
>
> I'm not sure I follow.
>
> My understanding is that you failed to reproduce the issue with
> 83d116c53058 reverted and WARN_ON_ONCE() placed.

I failed to reproduce the issue with the test code I provided in this
email thread.  However, if I simply use the original t_mmap_cow_race
from xfstests, I can trigger the WARN_ON_ONCE.  There is no need to
insert a delay in the kernel.

Does that make sense?

> My ask was to try to put some mdelay() just before
> __copy_from_user_inatomic(). The mdelay() may help with reproducing the
> issue on the old code.
>
> If the bug still fails to reproduce I may misunderstand the source of the
> bug and need to look further.

I understand your request.  Inserting a udelay(10) with the code I
provided does not trigger the warning.  However, see above.

I'm including the unmodified t_mmap_cow_race.c code here for your
convenience.  This is the code that triggers the warning with
83d116c53058 reverted, and the WARN_ON_ONCE added.

Let me know if I'm stilling confusing you.  :)

Cheers,
Jeff

[-- Attachment #2: t_mmap_cow_race.c --]
[-- Type: text/plain, Size: 2284 bytes --]

// SPDX-License-Identifier: GPL-2.0
/* Copyright (c) 2017 Intel Corporation. */
#include <errno.h>
#include <fcntl.h>
#include <libgen.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define MiB(a) ((a)*1024*1024)
#define NUM_THREADS 2

void err_exit(char *op)
{
	fprintf(stderr, "%s: %s\n", op, strerror(errno));
	exit(1);
}

void worker_fn(void *ptr)
{
	char *data = (char *)ptr;
	volatile int a;
	int i, err;

	for (i = 0; i < 10; i++) {
		a = data[0];
		data[0] = a;

		err = madvise(data, MiB(2), MADV_DONTNEED);
		if (err < 0)
			err_exit("madvise");

		/* Mix up the thread timings to encourage the race. */
		err = usleep(rand() % 100);
		if (err < 0)
			err_exit("usleep");
	}
}

int main(int argc, char *argv[])
{
	pthread_t thread[NUM_THREADS];
	int i, j, fd, err;
	char *data;

	if (argc < 2) {
		printf("Usage: %s <file>\n", basename(argv[0]));
		exit(0);
	}

	fd = open(argv[1], O_RDWR|O_CREAT, S_IRUSR|S_IWUSR);
	if (fd < 0)
		err_exit("fd");

	/* This allows us to map a huge page. */
	ftruncate(fd, 0);
	ftruncate(fd, MiB(2));

	/*
	 * First we set up a shared mapping.  Our write will (hopefully) get
	 * the filesystem to give us a 2MiB huge page DAX mapping.  We will
	 * then use this 2MiB page for our private mapping race.
	 */
	data = mmap(NULL, MiB(2), PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
	if (data == MAP_FAILED)
		err_exit("shared mmap");

	data[0] = 1;

	err = munmap(data, MiB(2));
	if (err < 0)
		err_exit("shared munmap");

	for (i = 0; i < 500; i++) {
		data = mmap(NULL, MiB(2), PROT_READ|PROT_WRITE, MAP_PRIVATE,
				fd, 0);
		if (data == MAP_FAILED)
			err_exit("private mmap");

		for (j = 0; j < NUM_THREADS; j++) {
			err = pthread_create(&thread[j], NULL,
					(void*)&worker_fn, data);
			if (err)
				err_exit("pthread_create");
		}

		for (j = 0; j < NUM_THREADS; j++) {
			err = pthread_join(thread[j], NULL);
			if (err)
				err_exit("pthread_join");
		}

		err = munmap(data, MiB(2));
		if (err < 0)
			err_exit("private munmap");
	}

	err = close(fd);
	if (err < 0)
		err_exit("close");

	return 0;
}