Date: Thu, 27 Jul 2017 16:04:20 +0900
From: Minchan Kim
Subject: Re: Potential race in TLB flush batching?
Message-ID: <20170727070420.GA1052@bbox>
References: <20170725091115.GA22920@bbox>
 <20170725100722.2dxnmgypmwnrfawp@suse.de>
 <20170726054306.GA11100@bbox>
 <20170726092228.pyjxamxweslgaemi@suse.de>
 <20170726234025.GA4491@bbox>
 <60FF1876-AC4F-49BB-BC36-A144C3B6EA9E@gmail.com>
 <20170727003434.GA537@bbox>
 <77AFE0A4-FE3D-4E05-B248-30ADE2F184EF@gmail.com>
To: Nadav Amit
Cc: Mel Gorman, Andy Lutomirski, "open list:MEMORY MANAGEMENT"

On Wed, Jul 26, 2017 at 06:13:15PM -0700, Nadav Amit wrote:
> Nadav Amit wrote:
> 
> > Minchan Kim wrote:
> > 
> >> On Wed, Jul 26, 2017 at 05:09:09PM -0700, Nadav Amit wrote:
> >>> Minchan Kim wrote:
> >>> 
> >>>> Hello Nadav,
> >>>> 
> >>>> On Wed, Jul 26, 2017 at 12:18:37PM -0700, Nadav Amit wrote:
> >>>>> Mel Gorman wrote:
> >>>>> 
> >>>>>> On Wed, Jul 26, 2017 at 02:43:06PM +0900, Minchan Kim wrote:
> >>>>>>>> I'm relying on the fact you are the madv_free author to determine if
> >>>>>>>> it's really necessary. The race in question is CPU 0 running madv_free
> >>>>>>>> and updating some PTEs while CPU 1 is also running madv_free and looking
> >>>>>>>> at the same PTEs. CPU 1 may have writable TLB entries for a page but fail
> >>>>>>>> the pte_dirty check (because CPU 0 has updated it already) and potentially
> >>>>>>>> fail to flush. Hence, when madv_free on CPU 1 returns, there are still
> >>>>>>>> potentially writable TLB entries and the underlying PTE is still present
> >>>>>>>> so that a subsequent write does not necessarily propagate the dirty bit
> >>>>>>>> to the underlying PTE any more. Reclaim at some unknown time in the future
> >>>>>>>> may then see that the PTE is still clean and discard the page even though
> >>>>>>>> a write has happened in the meantime. I think this is possible but I could
> >>>>>>>> have missed some protection in madv_free that prevents it happening.
> >>>>>>> 
> >>>>>>> Thanks for the detail. You didn't miss anything. It can happen, and then
> >>>>>>> it's really a bug. IOW, if the application writes something after
> >>>>>>> madv_free, it must see the written value, not zero.
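To make that expectation concrete, here is a tiny user-space illustration of the
semantics being relied on. This is only my sketch: it does not reproduce the race
(that needs the concurrent madv_free plus memory pressure described above), it is
not the zap_page_range test program Mel mentions below, and it assumes a kernel
and libc that expose MADV_FREE.

/* Illustration only: a write after MADV_FREE must remain visible, because
 * the write re-dirties the page and cancels the "free" hint. */
#include <assert.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 4096;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	assert(p != MAP_FAILED);
	memset(p, 0xaa, len);		/* dirty the page */
	madvise(p, len, MADV_FREE);	/* page may be discarded while it stays clean */
	p[0] = 0x55;			/* ...but a later write must stick */

	/* Reclaim must never drop the re-dirtied page, so this must not
	 * observe zero even under memory pressure. */
	assert(p[0] == 0x55);
	return 0;
}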
> >>>>>>> 
> >>>>>>> How about adding [set|clear]_tlb_flush_pending in the tlb batching
> >>>>>>> interface? With it, when tlb_finish_mmu is called, we can know that we
> >>>>>>> skipped the flush but a flush is still pending, so flush forcefully to
> >>>>>>> avoid the madv_dontneed as well as the madv_free scenario.
> >>>>>> 
> >>>>>> I *think* this is ok as it's simply more expensive on the KSM side in
> >>>>>> the event of a race, but no other harmful change is made, assuming that
> >>>>>> KSM is the only race-prone case. The check for mm_tlb_flush_pending also
> >>>>>> happens under the PTL so there should be sufficient protection from the
> >>>>>> mm struct update being visible at the right time.
> >>>>>> 
> >>>>>> Check, using the test program from "mm: Always flush VMA ranges affected
> >>>>>> by zap_page_range v2", whether it handles the madvise case as well, as
> >>>>>> that would give some degree of safety. Make sure it's tested against
> >>>>>> 4.13-rc2 instead of mmotm, which already includes the madv_dontneed fix.
> >>>>>> If yours works for both then it supersedes the mmotm patch.
> >>>>>> 
> >>>>>> It would also be interesting if Nadav would use his slowdown hack to see
> >>>>>> if he can still force the corruption.
> >>>>> 
> >>>>> The proposed fix for the KSM side is likely to work (I will try later), but
> >>>>> on the tlb_finish_mmu() side, I think there is a problem, since if any TLB
> >>>>> flush is performed by tlb_flush_mmu(), flush_tlb_mm_range() will not be
> >>>>> executed. This means that tlb_finish_mmu() may flush one TLB entry, leave
> >>>>> another one stale and not flush it.
> >>>> 
> >>>> Okay, I will change that part like this to avoid the partial flush problem.
> >>>> 
> >>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> >>>> index 1c42d69490e4..87d0ebac6605 100644
> >>>> --- a/include/linux/mm_types.h
> >>>> +++ b/include/linux/mm_types.h
> >>>> @@ -529,10 +529,13 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
> >>>>   * The barriers below prevent the compiler from re-ordering the instructions
> >>>>   * around the memory barriers that are already present in the code.
> >>>>   */
> >>>> -static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
> >>>> +static inline int mm_tlb_flush_pending(struct mm_struct *mm)
> >>>>  {
> >>>> +	int nr_pending;
> >>>> +
> >>>>  	barrier();
> >>>> -	return atomic_read(&mm->tlb_flush_pending) > 0;
> >>>> +	nr_pending = atomic_read(&mm->tlb_flush_pending);
> >>>> +	return nr_pending;
> >>>>  }
> >>>>  static inline void set_tlb_flush_pending(struct mm_struct *mm)
> >>>>  {
> >>>> diff --git a/mm/memory.c b/mm/memory.c
> >>>> index d5c5e6497c70..b5320e96ec51 100644
> >>>> --- a/mm/memory.c
> >>>> +++ b/mm/memory.c
> >>>> @@ -286,11 +286,15 @@ bool tlb_flush_mmu(struct mmu_gather *tlb)
> >>>>  void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
> >>>>  {
> >>>>  	struct mmu_gather_batch *batch, *next;
> >>>> -	bool flushed = tlb_flush_mmu(tlb);
> >>>> 
> >>>> +	if (!tlb->fullmm && !tlb->need_flush_all &&
> >>>> +	    mm_tlb_flush_pending(tlb->mm) > 1) {
> >>> 
> >>> I saw you noticed my comment about the access of the flag without a lock. I
> >>> must say it feels strange that a memory barrier would be needed here, but
> >>> that is what I understood from the documentation.
> >> 
> >> I saw your recent barriers fix patch, too:
> >> [PATCH v2 2/2] mm: migrate: fix barriers around tlb_flush_pending
> >> 
> >> As I commented there, I hope to use the below here without having to be
> >> aware of the complex barrier stuff. Instead, mm_tlb_flush_pending should
> >> call the right barrier inside.
> >> 
> >> mm_tlb_flush_pending(tlb->mm, false:no-pte-locked) > 1
> > 
> > I will address it in v3.
> > 
> > 
> >>>> +		tlb->start = min(start, tlb->start);
> >>>> +		tlb->end = max(end, tlb->end);
> >>> 
> >>> Err… You open-code mmu_gather, which is arch-specific. It appears that all
> >>> of them have start and end members, but not need_flush_all. Besides, I am not
> >> 
> >> When I see tlb_gather_mmu, which is not arch-specific, it initializes
> >> need_flush_all to zero, so it should not be harmful even though some
> >> architectures don't set the flag.
> >> Please correct me if I missed something.
> > 
> > Oh.. my bad. I missed the fact that this code is under “#ifdef
> > HAVE_GENERIC_MMU_GATHER”. But that means that arch-specific tlb_finish_mmu()
> > implementations (s390, arm) may need to be modified as well.
> > 
> >>> sure whether they regard start and end the same way.
> >> 
> >> I understand your worry, but my patch takes the wider range via min/max,
> >> so I cannot imagine how it breaks. While looking at the code, I found
> >> __tlb_adjust_range, so it is better to use that rather than open-coding it.
> >> 
> >> 
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index b5320e96ec51..b23188daa396 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -288,10 +288,8 @@ void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long e
> >>  	struct mmu_gather_batch *batch, *next;
> >> 
> >>  	if (!tlb->fullmm && !tlb->need_flush_all &&
> >> -	    mm_tlb_flush_pending(tlb->mm) > 1) {
> >> -		tlb->start = min(start, tlb->start);
> >> -		tlb->end = max(end, tlb->end);
> >> -	}
> >> +	    mm_tlb_flush_pending(tlb->mm) > 1)
> >> +		__tlb_adjust_range(tlb->mm, start, end - start);
> >> 
> >>  	tlb_flush_mmu(tlb);
> >>  	clear_tlb_flush_pending(tlb->mm);
> > 
> > This one is better, especially as I now understand it is only for the
> > generic MMU gather (which I missed before).
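For reference, the generic helper mentioned above just widens the range tracked by
the mmu_gather. As far as I recall, the 4.13-era definition in
include/asm-generic/tlb.h looks roughly like this (quoted from memory, so treat it
as a sketch rather than the exact source):

static inline void __tlb_adjust_range(struct mmu_gather *tlb,
				      unsigned long address,
				      unsigned int range_size)
{
	/* Grow the pending flush window to cover [address, address + range_size). */
	tlb->start = min(tlb->start, address);
	tlb->end = max(tlb->end, address + range_size);
}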
> 
> There is one issue I forgot: pte_accessible() on x86 regards
> mm_tlb_flush_pending() as an indication of NUMA migration. But now the code
> does not make much sense:
> 
> 	if ((pte_flags(a) & _PAGE_PROTNONE) &&
> 			mm_tlb_flush_pending(mm))
> 
> Either we remove the _PAGE_PROTNONE check or we need to use the atomic field
> to count pending flushes due to migration and due to other reasons
> separately. The first option is safer, but Mel objected to it because of the
> performance implications. The second one requires some thought on how to
> build a single counter for multiple reasons and avoid a potential overflow.
> 
> Thoughts?
> 

I'm really new to autoNUMA, so I'm not sure I understand your concern.
Is it that, with more places bumping the pending count, autoNUMA performance
might be hurt? If so, wouldn't the _PAGE_PROTNONE check above filter out most
of the cases?

Maybe Mel could answer.
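For reference, the x86 pte_accessible() under discussion looks roughly like this
in 4.13 (arch/x86/include/asm/pgtable.h, reproduced from memory, so treat it as a
sketch). The second branch is the one whose meaning becomes blurry once every
batched flush, and not only NUMA migration, raises the pending counter:

static inline bool pte_accessible(struct mm_struct *mm, pte_t a)
{
	if (pte_flags(a) & _PAGE_PRESENT)
		return true;

	/*
	 * A PROT_NONE PTE installed for NUMA balancing may still be
	 * cached in some TLB until the pending flush completes, so it
	 * is still treated as accessible here.
	 */
	if ((pte_flags(a) & _PAGE_PROTNONE) &&
			mm_tlb_flush_pending(mm))
		return true;

	return false;
}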