From: Minchan Kim <minchan@kernel.org>
To: Nadav Amit <nadav.amit@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>, Andy Lutomirski <luto@kernel.org>,
	"open list:MEMORY MANAGEMENT" <linux-mm@kvack.org>
Subject: Re: Potential race in TLB flush batching?
Date: Thu, 27 Jul 2017 08:40:25 +0900
Message-ID: <20170726234025.GA4491@bbox>
In-Reply-To: <A300D14C-D7EE-4A26-A7CF-A7643F1A61BA@gmail.com>

Hello Nadav,

On Wed, Jul 26, 2017 at 12:18:37PM -0700, Nadav Amit wrote:
> Mel Gorman <mgorman@suse.de> wrote:
> 
> > On Wed, Jul 26, 2017 at 02:43:06PM +0900, Minchan Kim wrote:
> >>> I'm relying on the fact you are the madv_free author to determine if
> >>> it's really necessary. The race in question is CPU 0 running madv_free
> >>> and updating some PTEs while CPU 1 is also running madv_free and looking
> >>> at the same PTEs. CPU 1 may have writable TLB entries for a page but fail
> >>> the pte_dirty check (because CPU 0 has updated it already) and potentially
> >>> fail to flush. Hence, when madv_free on CPU 1 returns, there are still
> >>> potentially writable TLB entries and the underlying PTE is still present
> >>> so that a subsequent write does not necessarily propagate the dirty bit
> >>> to the underlying PTE any more. Reclaim at some unknown time in the future
> >>> may then see that the PTE is still clean and discard the page even though
> >>> a write has happened in the meantime. I think this is possible but I could
> >>> have missed some protection in madv_free that prevents it happening.
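To make the interleaving concrete, the problematic schedule is:

	CPU 0 (madv_free)		CPU 1 (madv_free, then the app write)
	-----------------		-------------------------------------
	clears the dirty bit
	in the PTE
					pte_dirty() check fails (CPU 0 already
					cleaned it), so the TLB flush is
					skipped and madv_free returns
					app writes through the still-writable
					stale TLB entry; the dirty bit is not
					set in the PTE again
	reclaim later sees a clean PTE and discards the page,
	losing the write
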
> >> 
> >> Thanks for the detail. You didn't miss anything. It can happen, and then
> >> it's really a bug. IOW, if an application writes something after madv_free,
> >> it must see the written value, not zero.
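
In userspace terms, the guarantee is the following (a minimal sketch of
the expected semantics, assuming a kernel and libc that expose MADV_FREE;
not part of the original report):

#include <assert.h>
#include <sys/mman.h>

int main(void)
{
	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	assert(p != MAP_FAILED);
	p[0] = 9;			/* dirty the page */
	madvise(p, 4096, MADV_FREE);	/* kernel may now discard the page */
	p[0] = 7;			/* a write after MADV_FREE cancels the free */
	assert(p[0] == 7);		/* must never observe 0 here */
	return 0;
}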
> >> 
> >> How about adding [set|clear]_tlb_flush_pending to the TLB batching interface?
> >> With it, when tlb_finish_mmu is called, we can know that we skipped the flush
> >> but a flush is still pending, so we flush forcefully to cover the madv_dontneed
> >> as well as the madv_free scenario.
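
In code, the proposed shape is roughly this (my sketch, not the final
patch; the real hooks live around the mmu_gather API in mm/memory.c):

	struct mmu_gather tlb;

	tlb_gather_mmu(&tlb, mm, start, end);
	set_tlb_flush_pending(mm);	/* visible to concurrent PTL holders */

	/* ... clean/clear PTEs, batching pages into the gather ... */

	tlb_finish_mmu(&tlb, start, end);	/* flush; forced if another
						   flush is pending, then the
						   pending count is cleared
						   (see the diff below) */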
> > 
> > I *think* this is ok as it's simply more expensive on the KSM side in
> > the event of a race, but no other harmful change is made assuming that
> > KSM is the only race-prone user. The check for mm_tlb_flush_pending also
> > happens under the PTL so there should be sufficient protection from the
> > mm struct update being visible at the right time.
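
For reference, the KSM-side check being discussed has roughly this shape
(my sketch, assuming a write_protect_page()-like context that already
holds the PTL; not the actual patch):

	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
	if (pte_write(*ptep) || pte_dirty(*ptep) ||
			mm_tlb_flush_pending(mm)) {
		/* a batched flush may still be in flight, so do a
		 * flushing clear instead of trusting the clean/RO PTE */
		entry = ptep_clear_flush(vma, addr, ptep);
		...
	}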
> > 
> > Using the test program from "mm: Always flush VMA ranges affected by
> > zap_page_range v2", check whether it handles the madvise case as well, as
> > that would give some degree of safety. Make sure it's tested against 4.13-rc2
> > instead of mmotm which already includes the madv_dontneed fix. If yours
> > works for both then it supersedes the mmotm patch.
> > 
> > It would also be interesting if Nadav would use his slowdown hack to see
> > if he can still force the corruption.
> 
> The proposed fix for the KSM side is likely to work (I will try later), but
> on the tlb_finish_mmu() side, I think there is a problem, since if any TLB
> flush is performed by tlb_flush_mmu(), flush_tlb_mm_range() will not be
> executed. This means that tlb_finish_mmu() may flush one TLB entry but
> leave another one stale.

Okay, I will change that part like this to avoid the partial flush problem.

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 1c42d69490e4..87d0ebac6605 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -529,10 +529,13 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
  * The barriers below prevent the compiler from re-ordering the instructions
  * around the memory barriers that are already present in the code.
  */
-static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
+static inline int mm_tlb_flush_pending(struct mm_struct *mm)
 {
+	int nr_pending;
+
 	barrier();
-	return atomic_read(&mm->tlb_flush_pending) > 0;
+	nr_pending = atomic_read(&mm->tlb_flush_pending);
+	return nr_pending;
 }
 static inline void set_tlb_flush_pending(struct mm_struct *mm)
 {
diff --git a/mm/memory.c b/mm/memory.c
index d5c5e6497c70..b5320e96ec51 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -286,11 +286,15 @@ bool tlb_flush_mmu(struct mmu_gather *tlb)
 void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
 {
 	struct mmu_gather_batch *batch, *next;
-	bool flushed = tlb_flush_mmu(tlb);
 
+	if (!tlb->fullmm && !tlb->need_flush_all &&
+			mm_tlb_flush_pending(tlb->mm) > 1) {
+		tlb->start = min(start, tlb->start);
+		tlb->end = max(end, tlb->end);
+	}
+
+	tlb_flush_mmu(tlb);
 	clear_tlb_flush_pending(tlb->mm);
-	if (!flushed && mm_tlb_flush_pending(tlb->mm))
-		flush_tlb_mm_range(tlb->mm, start, end, 0UL);
 
 	/* keep the page table cache within bounds */
 	check_pgt_cache();
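
The effect is that the flush can no longer be skipped while another
flusher is in flight: when the pending count shows a second concurrent
flush, the gather range is widened to cover the full range passed to
tlb_finish_mmu(), so the single tlb_flush_mmu() call covers everything,
instead of relying on the old conditional flush_tlb_mm_range() fallback,
which was skipped whenever tlb_flush_mmu() had flushed anything at all.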
> 
> Note also that the use of set/clear_tlb_flush_pending() is only applicable
> following my pending fix that changes the pending indication from bool to
> atomic_t.

Sure, I saw it in the current mmots. Without your good work, my patch would
never work. :) Thanks for the heads up.

> 
> For the record, here is my test, followed by the patch to add latency. There
> are some magic numbers that may not apply to your system (I got tired of
> trying to time the system). If you run the test in a VM, the pause-loop
> exiting can potentially prevent the issue from appearing.

Thanks for sharing. I will try it, too.

> 
> --
> 
> #include <stdio.h>
> #include <stdlib.h>
> #include <pthread.h>
> #include <string.h>
> #include <assert.h>
> #include <unistd.h>
> #include <fcntl.h>
> #include <stdint.h>
> #include <stdbool.h>
> #include <sys/mman.h>
> #include <sys/types.h>
> 
> #define PAGE_SIZE		(4096)
> #define N_PAGES			(65536ull * 16)
> 
> #define CHANGED_VAL		(7)
> #define BASE_VAL		(9)
> 
> #define max(a,b) \
> 	({ __typeof__ (a) _a = (a); \
> 	  __typeof__ (b) _b = (b); \
> 	  _a > _b ? _a : _b; })
> 
> #define STEP_HELPERS_RUN	(1)
> #define STEP_DONTNEED_DONE	(2)
> #define STEP_ACCESS_PAUSED	(4)
> 
> volatile int sync_step = STEP_ACCESS_PAUSED;
> volatile char *p;
> int dirty_fd, ksm_sharing_fd, ksm_run_fd;
> uint64_t soft_dirty_time, madvise_time, soft_dirty_delta, madvise_delta;
> 
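> /* Read the 64-bit x86 time-stamp counter; used for coarse timing below. */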
> static inline unsigned long rdtsc()
> {
> 	unsigned long hi, lo;
> 
> 	__asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
> 	 return lo | (hi << 32);
> }
> 
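> /* Busy-wait for roughly the given number of TSC cycles. */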
> static inline void wait_rdtsc(unsigned long cycles)
> {
> 	unsigned long tsc = rdtsc();
> 
> 	while (rdtsc() - tsc < cycles)
> 		__asm__ __volatile__ ("rep nop" ::: "memory");
> }
> 
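> /* Unmerge all KSM pages ("2" -> ksm/run) and wait for pages_sharing to drop to zero. */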
> static void break_sharing(void)
> {
> 	char buf[20];
> 
> 	pwrite(ksm_run_fd, "2", 1, 0);
> 
> 	printf("waiting for page sharing to be broken\n");
> 	do {
> 		pread(ksm_sharing_fd, buf, sizeof(buf), 0);
> 	} while (strtoul(buf, NULL, 10));
> }
> 
> 
> static inline void wait_step(unsigned int step)
> {
> 	while (!(sync_step & step))
> 		asm volatile ("rep nop":::"memory");
> }
> 
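> /* After a tuned delay, MADV_FREE the whole region and time the call. */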
> static void *big_madvise_thread(void *ign)
> {
> 	while (1) {
> 		uint64_t tsc;
> 
> 		wait_step(STEP_HELPERS_RUN);
> 		wait_rdtsc(madvise_delta);
> 		tsc = rdtsc();
> 		madvise((void*)p, PAGE_SIZE * N_PAGES, MADV_FREE);
> 		madvise_time = rdtsc() - tsc;
> 		sync_step = STEP_DONTNEED_DONE;
> 	}
> }
> 
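> /* After a tuned delay, clear the soft-dirty bits ("4" -> clear_refs),
>  * which forces a TLB flush of the whole mm. */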
> static void *soft_dirty_thread(void *ign)
> {
> 	while (1) {
> 		int r;
> 		uint64_t tsc;
> 
> 		wait_step(STEP_HELPERS_RUN | STEP_DONTNEED_DONE);
> 		wait_rdtsc(soft_dirty_delta);
> 
> 		tsc = rdtsc();
> 		r = pwrite(dirty_fd, "4", 1, 0);
> 		assert(r == 1);
> 		soft_dirty_time = rdtsc() - tsc;
> 		wait_step(STEP_DONTNEED_DONE);
> 		sync_step = STEP_ACCESS_PAUSED;
> 	}
> }
> 
> int main(void)
> {
> 	pthread_t aux_thread, aux_thread2;
> 	char pathname[256];
> 	long i;
> 	volatile char c;
> 
> 	sprintf(pathname, "/proc/%d/clear_refs", getpid());
> 	dirty_fd = open(pathname, O_RDWR);
> 
> 	ksm_sharing_fd = open("/sys/kernel/mm/ksm/pages_sharing", O_RDONLY);
> 	assert(ksm_sharing_fd >= 0);
> 
> 	ksm_run_fd = open("/sys/kernel/mm/ksm/run", O_RDWR);
> 	assert(ksm_run_fd >= 0);
> 
> 	pwrite(ksm_run_fd, "0", 1, 0);
> 
> 	p = mmap(0, PAGE_SIZE * N_PAGES, PROT_READ|PROT_WRITE,
> 		 MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> 	assert(p != MAP_FAILED);
> 	madvise((void*)p, PAGE_SIZE * N_PAGES, MADV_MERGEABLE);
> 
> 	memset((void*)p, BASE_VAL, PAGE_SIZE * 2);
> 	for (i = 2; i < N_PAGES; i++)
> 		c = p[PAGE_SIZE * i];
> 
> 	pthread_create(&aux_thread, NULL, big_madvise_thread, NULL);
> 	pthread_create(&aux_thread2, NULL, soft_dirty_thread, NULL);
> 
> 	while (1) {
> 		break_sharing();
> 		*(p + 64) = BASE_VAL;		// cache in TLB and break KSM
> 		pwrite(ksm_run_fd, "1", 1, 0);
> 
> 		wait_rdtsc(0x8000000ull);
> 		sync_step = STEP_HELPERS_RUN;
> 		wait_rdtsc(0x4000000ull);
> 
> 		*(p+64) = CHANGED_VAL;
> 
> 		wait_step(STEP_ACCESS_PAUSED);		// wait for TLB to be flushed
> 		if (*(p+64) != CHANGED_VAL ||
> 		    *(p + PAGE_SIZE + 64) == CHANGED_VAL) {
> 			printf("KSM error\n");
> 			exit(EXIT_FAILURE);
> 		}
> 
> 		printf("No failure yet\n");
> 
> 		soft_dirty_delta = max(0, (long)madvise_time - (long)soft_dirty_time);
> 		madvise_delta = max(0, (long)soft_dirty_time - (long)madvise_time);
> 	}
> }
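
For anyone else trying the reproducer: it needs to be built with -pthread
and run as root (it writes the /sys/kernel/mm/ksm knobs), and per the note
above, bare metal is preferable to a VM.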
> 
> -- 8< --
> 
> Subject: [PATCH] TLB flush delay to trigger failure
> 
> ---
>  fs/proc/task_mmu.c | 2 ++
>  mm/ksm.c           | 2 ++
>  mm/madvise.c       | 2 ++
>  3 files changed, 6 insertions(+)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 520802da059c..c13259251210 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -16,6 +16,7 @@
>  #include <linux/mmu_notifier.h>
>  #include <linux/page_idle.h>
>  #include <linux/shmem_fs.h>
> +#include <linux/delay.h>
>  
>  #include <asm/elf.h>
>  #include <linux/uaccess.h>
> @@ -1076,6 +1077,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
>  		walk_page_range(0, mm->highest_vm_end, &clear_refs_walk);
>  		if (type == CLEAR_REFS_SOFT_DIRTY)
>  			mmu_notifier_invalidate_range_end(mm, 0, -1);
> +		msleep(5);
>  		flush_tlb_mm(mm);
>  		up_read(&mm->mmap_sem);
>  out_mm:
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 216184af0e19..317adbb48b0f 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -39,6 +39,7 @@
>  #include <linux/freezer.h>
>  #include <linux/oom.h>
>  #include <linux/numa.h>
> +#include <linux/delay.h>
>  
>  #include <asm/tlbflush.h>
>  #include "internal.h"
> @@ -960,6 +961,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
>  	mmun_end   = addr + PAGE_SIZE;
>  	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
>  
> +	msleep(5);
>  	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
>  	if (!pte_same(*ptep, orig_pte)) {
>  		pte_unmap_unlock(ptep, ptl);
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 25b78ee4fc2c..e4c852360f2c 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -23,6 +23,7 @@
>  #include <linux/swapops.h>
>  #include <linux/shmem_fs.h>
>  #include <linux/mmu_notifier.h>
> +#include <linux/delay.h>
>  
>  #include <asm/tlb.h>
>  
> @@ -472,6 +473,7 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
>  	mmu_notifier_invalidate_range_start(mm, start, end);
>  	madvise_free_page_range(&tlb, vma, start, end);
>  	mmu_notifier_invalidate_range_end(mm, start, end);
> +	msleep(5);
>  	tlb_finish_mmu(&tlb, start, end);
>  
>  	return 0;
