From: Minchan Kim <minchan@kernel.org>
To: Mel Gorman <mgorman@suse.de>
Cc: Nadav Amit <nadav.amit@gmail.com>,
	Andy Lutomirski <luto@kernel.org>,
	"open list:MEMORY MANAGEMENT" <linux-mm@kvack.org>
Subject: Re: Potential race in TLB flush batching?
Date: Wed, 26 Jul 2017 14:43:06 +0900
Message-ID: <20170726054306.GA11100@bbox>
In-Reply-To: <20170725100722.2dxnmgypmwnrfawp@suse.de>

On Tue, Jul 25, 2017 at 11:10:06AM +0100, Mel Gorman wrote:
> On Tue, Jul 25, 2017 at 06:11:15PM +0900, Minchan Kim wrote:
> > On Tue, Jul 25, 2017 at 09:51:32AM +0100, Mel Gorman wrote:
> > > On Tue, Jul 25, 2017 at 04:37:48PM +0900, Minchan Kim wrote:
> > > > > Ok, as you say you have reproduced this with corruption, I would suggest
> > > > > one path for dealing with it although you'll need to pass it by the
> > > > > original authors.
> > > > > 
> > > > > When unmapping ranges, there is a check for dirty PTEs in
> > > > > zap_pte_range() that forces a flush for dirty PTEs which aims to avoid
> > > > > writable stale PTEs from CPU0 in a scenario like you laid out above.
> > > > > 
> > > > > madvise_free misses a similar class of check so I'm adding Minchan Kim
> > > > > to the cc as the original author of much of that code. Minchan Kim will
> > > > > need to confirm but it appears that two modifications would be required.
> > > > > The first should pass in the mmu_gather structure to
> > > > > madvise_free_pte_range (at minimum) and force flush the TLB under the
> > > > > PTL if a dirty PTE is encountered. The second is that it should consider
> > > > 
> > > > OTL: I couldn't read this lengthy discussion in full, so I may have missed something.
> > > > 
> > > > About MADV_FREE, I do not understand why it should flush the TLB
> > > > in the MADV_FREE context. MADV_FREE's semantics allow a "write
> > > > (ie, dirty)", so if another thread running in parallel with a stale
> > > > pte does a "store" that makes the pte dirty, it's okay, since
> > > > try_to_unmap_one in shrink_page_list catches the dirty bit.
> > > > 
> > > 
> > > In try_to_unmap_one it's fine. It's not necessarily fine in KSM.
> > > Given that the key is that data corruption is avoided, you could
> > > argue, with a comment, that madv_free doesn't necessarily have to
> > > flush it as long as KSM does, even if it's clean due to batching.
> > 
> > Yes, I think it should be done on the side that has the concern.
> > Maybe mm_struct can carry a flag which indicates that someone is
> > doing TLB batching, and then the KSM side can flush based on that
> > flag. It would reduce unnecessary flushing.
> > 
> 
> If you're confident that it's only necessary on the KSM side to avoid the
> problem then I'm ok with that. Update KSM in that case with a comment
> explaining the madv_free race and why the flush is unconditionally
> necessary. madv_free only came up because it was a critical part of having
> KSM miss a TLB flush.
> 
> > > Like madvise(), madv_free can potentially return with a stale PTE visible
> > > to the caller that observed a pte_none at the time of madv_free and uses
> > > a stale PTE that potentially allows a lost write. It's debatable whether
> > 
> > That is the part I cannot understand.
> > How does it lose "the write"? MADV_FREE doesn't discard the memory,
> > so the write should eventually make it to the page.
> > Could you tell me more?
> > 
> 
> I'm relying on the fact you are the madv_free author to determine if
> it's really necessary. The race in question is CPU 0 running madv_free
> and updating some PTEs while CPU 1 is also running madv_free and looking
> at the same PTEs. CPU 1 may have writable TLB entries for a page but fail
> the pte_dirty check (because CPU 0 has updated it already) and potentially
> fail to flush. Hence, when madv_free on CPU 1 returns, there are still
> potentially writable TLB entries and the underlying PTE is still present
> so that a subsequent write does not necessarily propagate the dirty bit
> to the underlying PTE any more. Reclaim at some unknown time in the future
> may then see that the PTE is still clean and discard the page even though
> a write has happened in the meantime. I think this is possible but I could
> have missed some protection in madv_free that prevents it happening.

Thanks for the detail. You didn't miss anything. It can happen, and then
it's really a bug. IOW, if the application writes something after
madv_free, it must see the written value, not zero.
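
To make the failure mode concrete, here is a minimal user-space sketch
of the contract being violated (a hypothetical illustration of the
scenario, not code from this thread; error handling is elided and a
libc that exposes MADV_FREE is assumed):

#include <assert.h>
#include <pthread.h>
#include <string.h>
#include <sys/mman.h>

#define LEN	(1UL << 20)

static char *buf;

static void *madv_free_thread(void *arg)
{
	/*
	 * Two threads batch PTE updates on the same range; with the
	 * race above, one of them can return without a TLB flush.
	 */
	madvise(buf, LEN, MADV_FREE);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	buf = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	memset(buf, 1, LEN);		/* dirty the pages */

	pthread_create(&t1, NULL, madv_free_thread, NULL);
	pthread_create(&t2, NULL, madv_free_thread, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);

	buf[0] = 2;	/* write after MADV_FREE: page is in use again */

	/* ... reclaim may discard the page if the PTE looks clean ... */

	assert(buf[0] == 2);	/* must hold; with the bug it can read 0 */
	return 0;
}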

How about adding [set|clear]_tlb_flush_pending to the TLB batching
interface? With it, when tlb_finish_mmu is called, we can know that we
skipped our own flush while another flush is still pending, and flush
forcefully to avoid the madv_dontneed as well as the madv_free scenario.

Also, KSM can then learn about it through mm_tlb_flush_pending?
If this is acceptable, we need to look into making soft dirty use
[set|clear]_tlb_flush_pending or the TLB gathering API as well.
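
For reference, mm_struct already carries a tlb_flush_pending flag for
the mprotect/NUMA-balancing race, with helpers along these lines (a
sketch paraphrased from memory of the 4.12-era include/linux/mm_types.h,
not an exact quote; they may still be guarded by CONFIG_NUMA_BALANCING,
in which case generic mmu_gather code would need them made
unconditional):

/* paraphrased sketch, not verbatim kernel source */
static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
{
	barrier();
	return mm->tlb_flush_pending;
}

static inline void set_tlb_flush_pending(struct mm_struct *mm)
{
	mm->tlb_flush_pending = true;
	/*
	 * Make the flag visible before the PTE updates done under the
	 * page table lock.
	 */
	smp_mb__before_spinlock();
}

static inline void clear_tlb_flush_pending(struct mm_struct *mm)
{
	barrier();
	mm->tlb_flush_pending = false;
}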

To show my intention:

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 8afa4335e5b2..fffd4d86d0c4 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -113,7 +113,7 @@ struct mmu_gather {
 #define HAVE_GENERIC_MMU_GATHER
 
 void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long start, unsigned long end);
-void tlb_flush_mmu(struct mmu_gather *tlb);
+bool tlb_flush_mmu(struct mmu_gather *tlb);
 void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start,
 							unsigned long end);
 extern bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
diff --git a/mm/ksm.c b/mm/ksm.c
index 4dc92f138786..0fbbd5d234d5 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1037,8 +1037,9 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 	if (WARN_ONCE(!pvmw.pte, "Unexpected PMD mapping?"))
 		goto out_unlock;
 
-	if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) ||
-	    (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte))) {
+	if ((pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) ||
+	    (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte))) ||
+		mm_tlb_flush_pending(mm)) {
 		pte_t entry;
 
 		swapped = PageSwapCache(page);
diff --git a/mm/memory.c b/mm/memory.c
index ea9f28e44b81..d5c5e6497c70 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -239,12 +239,13 @@ void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long
 	tlb->page_size = 0;
 
 	__tlb_reset_range(tlb);
+	set_tlb_flush_pending(tlb->mm);
 }
 
-static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
+static bool tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 {
 	if (!tlb->end)
-		return;
+		return false;
 
 	tlb_flush(tlb);
 	mmu_notifier_invalidate_range(tlb->mm, tlb->start, tlb->end);
@@ -252,6 +253,7 @@ static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 	tlb_table_flush(tlb);
 #endif
 	__tlb_reset_range(tlb);
+	return true;
 }
 
 static void tlb_flush_mmu_free(struct mmu_gather *tlb)
@@ -265,10 +267,16 @@ static void tlb_flush_mmu_free(struct mmu_gather *tlb)
 	tlb->active = &tlb->local;
 }
 
-void tlb_flush_mmu(struct mmu_gather *tlb)
+/*
+ * Returns true if a TLB flush really happened.
+ */
+bool tlb_flush_mmu(struct mmu_gather *tlb)
 {
-	tlb_flush_mmu_tlbonly(tlb);
+	bool ret;
+
+	ret = tlb_flush_mmu_tlbonly(tlb);
 	tlb_flush_mmu_free(tlb);
+	return ret;
 }
 
 /* tlb_finish_mmu
@@ -278,8 +286,11 @@ void tlb_flush_mmu(struct mmu_gather *tlb)
 void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
 {
 	struct mmu_gather_batch *batch, *next;
+	bool flushed = tlb_flush_mmu(tlb);
 
-	tlb_flush_mmu(tlb);
+	clear_tlb_flush_pending(tlb->mm);
+	if (!flushed && mm_tlb_flush_pending(tlb->mm))
+		flush_tlb_mm_range(tlb->mm, start, end, 0UL);
 
 	/* keep the page table cache within bounds */
 	check_pgt_cache();
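
To spell out the window the pending flag is meant to close, an
editorial sketch of the interleaving (mine, not from the thread):

/*
 * CPU 0 (madv_free)                 CPU 1 (madv_free)
 * tlb_gather_mmu()
 *   set_tlb_flush_pending()
 * cleans a dirty PTE and
 * queues a deferred flush
 *                                   tlb_gather_mmu()
 *                                   sees the already-clean PTE and
 *                                   queues nothing (tlb->end == 0)
 *                                   tlb_finish_mmu()
 *                                     tlb_flush_mmu() returns false,
 *                                     but a flush is still pending,
 *                                     so flush forcefully
 * tlb_finish_mmu() -> flushes
 */

Two caveats worth noting: with tlb_flush_pending as a plain bool,
clearing our own indication in tlb_finish_mmu before testing
mm_tlb_flush_pending would also wipe out a concurrent caller's
still-pending indication, so a counter (say, a hypothetical
[inc|dec]_tlb_flush_pending) may turn out to be necessary; and
flush_tlb_mm_range is x86-specific, so the generic code would
presumably want an arch-neutral flush there.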


> 
> -- 
> Mel Gorman
> SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>


Thread overview: 70+ messages
2017-07-11  0:52 Potential race in TLB flush batching? Nadav Amit
2017-07-11  6:41 ` Mel Gorman
2017-07-11  7:30   ` Nadav Amit
2017-07-11  9:29     ` Mel Gorman
2017-07-11 10:40       ` Nadav Amit
2017-07-11 13:20         ` Mel Gorman
2017-07-11 14:58           ` Andy Lutomirski
2017-07-11 15:53             ` Mel Gorman
2017-07-11 17:23               ` Andy Lutomirski
2017-07-11 19:18                 ` Mel Gorman
2017-07-11 20:06                   ` Nadav Amit
2017-07-11 21:09                     ` Mel Gorman
2017-07-11 20:09                   ` Mel Gorman
2017-07-11 21:52                     ` Mel Gorman
2017-07-11 22:27                       ` Nadav Amit
2017-07-11 22:34                         ` Nadav Amit
2017-07-12  8:27                         ` Mel Gorman
2017-07-12 23:27                           ` Nadav Amit
2017-07-12 23:36                             ` Andy Lutomirski
2017-07-12 23:42                               ` Nadav Amit
2017-07-13  5:38                                 ` Andy Lutomirski
2017-07-13 16:05                                   ` Nadav Amit
2017-07-13 16:06                                     ` Andy Lutomirski
2017-07-13  6:07                             ` Mel Gorman
2017-07-13 16:08                               ` Andy Lutomirski
2017-07-13 17:07                                 ` Mel Gorman
2017-07-13 17:15                                   ` Andy Lutomirski
2017-07-13 18:23                                     ` Mel Gorman
2017-07-14 23:16                               ` Nadav Amit
2017-07-15 15:55                                 ` Mel Gorman
2017-07-15 16:41                                   ` Andy Lutomirski
2017-07-17  7:49                                     ` Mel Gorman
2017-07-18 21:28                                   ` Nadav Amit
2017-07-19  7:41                                     ` Mel Gorman
2017-07-19 19:41                                       ` Nadav Amit
2017-07-19 19:58                                         ` Mel Gorman
2017-07-19 20:20                                           ` Nadav Amit
2017-07-19 21:47                                             ` Mel Gorman
2017-07-19 22:19                                               ` Nadav Amit
2017-07-19 22:59                                                 ` Mel Gorman
2017-07-19 23:39                                                   ` Nadav Amit
2017-07-20  7:43                                                     ` Mel Gorman
2017-07-22  1:19                                                       ` Nadav Amit
2017-07-24  9:58                                                         ` Mel Gorman
2017-07-24 19:46                                                           ` Nadav Amit
2017-07-25  7:37                                                           ` Minchan Kim
2017-07-25  8:51                                                             ` Mel Gorman
2017-07-25  9:11                                                               ` Minchan Kim
2017-07-25 10:10                                                                 ` Mel Gorman
2017-07-26  5:43                                                                   ` Minchan Kim [this message]
2017-07-26  9:22                                                                     ` Mel Gorman
2017-07-26 19:18                                                                       ` Nadav Amit
2017-07-26 23:40                                                                         ` Minchan Kim
2017-07-27  0:09                                                                           ` Nadav Amit
2017-07-27  0:34                                                                             ` Minchan Kim
2017-07-27  0:48                                                                               ` Nadav Amit
2017-07-27  1:13                                                                                 ` Nadav Amit
2017-07-27  7:04                                                                                   ` Minchan Kim
2017-07-27  7:21                                                                                     ` Mel Gorman
2017-07-27 16:04                                                                                       ` Nadav Amit
2017-07-27 17:36                                                                                         ` Mel Gorman
2017-07-26 23:44                                                                       ` Minchan Kim
2017-07-11 22:07                   ` Andy Lutomirski
2017-07-11 22:33                     ` Mel Gorman
2017-07-14  7:00                     ` Benjamin Herrenschmidt
2017-07-14  8:31                       ` Mel Gorman
2017-07-14  9:02                         ` Benjamin Herrenschmidt
2017-07-14  9:27                           ` Mel Gorman
2017-07-14 22:21                             ` Andy Lutomirski
2017-07-11 16:22           ` Nadav Amit
