Re: Potential race in TLB flush batching?

From: Nadav Amit <nadav.amit@gmail.com>
To: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>, Andy Lutomirski <luto@kernel.org>,
	"open list:MEMORY MANAGEMENT" <linux-mm@kvack.org>
Subject: Re: Potential race in TLB flush batching?
Date: Wed, 26 Jul 2017 18:13:15 -0700	[thread overview]
Message-ID: <AACB7A95-A1E1-4ACD-812F-BD9F8F564FD7@gmail.com> (raw)
In-Reply-To: <77AFE0A4-FE3D-4E05-B248-30ADE2F184EF@gmail.com>

Nadav Amit <nadav.amit@gmail.com> wrote:

> Minchan Kim <minchan@kernel.org> wrote:
> 
>> On Wed, Jul 26, 2017 at 05:09:09PM -0700, Nadav Amit wrote:
>>> Minchan Kim <minchan@kernel.org> wrote:
>>> 
>>>> Hello Nadav,
>>>> 
>>>> On Wed, Jul 26, 2017 at 12:18:37PM -0700, Nadav Amit wrote:
>>>>> Mel Gorman <mgorman@suse.de> wrote:
>>>>> 
>>>>>> On Wed, Jul 26, 2017 at 02:43:06PM +0900, Minchan Kim wrote:
>>>>>>>> I'm relying on the fact you are the madv_free author to determine if
>>>>>>>> it's really necessary. The race in question is CPU 0 running madv_free
>>>>>>>> and updating some PTEs while CPU 1 is also running madv_free and looking
>>>>>>>> at the same PTEs. CPU 1 may have writable TLB entries for a page but fail
>>>>>>>> the pte_dirty check (because CPU 0 has updated it already) and potentially
>>>>>>>> fail to flush. Hence, when madv_free on CPU 1 returns, there are still
>>>>>>>> potentially writable TLB entries and the underlying PTE is still present
>>>>>>>> so that a subsequent write does not necessarily propagate the dirty bit
>>>>>>>> to the underlying PTE any more. Reclaim at some unknown time at the future
>>>>>>>> may then see that the PTE is still clean and discard the page even though
>>>>>>>> a write has happened in the meantime. I think this is possible but I could
>>>>>>>> have missed some protection in madv_free that prevents it happening.
>>>>>>> 
>>>>>>> Thanks for the detail. You didn't miss anything. It can happen and then
>>>>>>> it's really bug. IOW, if application does write something after madv_free,
>>>>>>> it must see the written value, not zero.
>>>>>>> 
>>>>>>> How about adding [set|clear]_tlb_flush_pending in tlb batchin interface?
>>>>>>> With it, when tlb_finish_mmu is called, we can know we skip the flush
>>>>>>> but there is pending flush, so flush focefully to avoid madv_dontneed
>>>>>>> as well as madv_free scenario.
>>>>>> 
>>>>>> I *think* this is ok as it's simply more expensive on the KSM side in
>>>>>> the event of a race but no other harmful change is made assuming that
>>>>>> KSM is the only race-prone. The check for mm_tlb_flush_pending also
>>>>>> happens under the PTL so there should be sufficient protection from the
>>>>>> mm struct update being visible at teh right time.
>>>>>> 
>>>>>> Check using the test program from "mm: Always flush VMA ranges affected
>>>>>> by zap_page_range v2" if it handles the madvise case as well as that
>>>>>> would give some degree of safety. Make sure it's tested against 4.13-rc2
>>>>>> instead of mmotm which already includes the madv_dontneed fix. If yours
>>>>>> works for both then it supersedes the mmotm patch.
>>>>>> 
>>>>>> It would also be interesting if Nadav would use his slowdown hack to see
>>>>>> if he can still force the corruption.
>>>>> 
>>>>> The proposed fix for the KSM side is likely to work (I will try later), but
>>>>> on the tlb_finish_mmu() side, I think there is a problem, since if any TLB
>>>>> flush is performed by tlb_flush_mmu(), flush_tlb_mm_range() will not be
>>>>> executed. This means that tlb_finish_mmu() may flush one TLB entry, leave
>>>>> another one stale and not flush it.
>>>> 
>>>> Okay, I will change that part like this to avoid partial flush problem.
>>>> 
>>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>>>> index 1c42d69490e4..87d0ebac6605 100644
>>>> --- a/include/linux/mm_types.h
>>>> +++ b/include/linux/mm_types.h
>>>> @@ -529,10 +529,13 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
>>>> * The barriers below prevent the compiler from re-ordering the instructions
>>>> * around the memory barriers that are already present in the code.
>>>> */
>>>> -static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
>>>> +static inline int mm_tlb_flush_pending(struct mm_struct *mm)
>>>> {
>>>> +	int nr_pending;
>>>> +
>>>> 	barrier();
>>>> -	return atomic_read(&mm->tlb_flush_pending) > 0;
>>>> +	nr_pending = atomic_read(&mm->tlb_flush_pending);
>>>> +	return nr_pending;
>>>> }
>>>> static inline void set_tlb_flush_pending(struct mm_struct *mm)
>>>> {
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index d5c5e6497c70..b5320e96ec51 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -286,11 +286,15 @@ bool tlb_flush_mmu(struct mmu_gather *tlb)
>>>> void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
>>>> {
>>>> 	struct mmu_gather_batch *batch, *next;
>>>> -	bool flushed = tlb_flush_mmu(tlb);
>>>> 
>>>> +	if (!tlb->fullmm && !tlb->need_flush_all &&
>>>> +			mm_tlb_flush_pending(tlb->mm) > 1) {
>>> 
>>> I saw you noticed my comment about the access of the flag without a lock. I
>>> must say it feels strange that a memory barrier would be needed here, but
>>> that what I understood from the documentation.
>> 
>> I saw your recent barriers fix patch, too.
>> [PATCH v2 2/2] mm: migrate: fix barriers around tlb_flush_pending
>> 
>> As I commented out in there, I hope to use below here without being
>> aware of complex barrier stuff. Instead, mm_tlb_flush_pending should
>> call the right barrier inside.
>> 
>>       mm_tlb_flush_pending(tlb->mm, false:no-pte-locked) > 1
> 
> I will address it in v3.
> 
> 
>>>> +		tlb->start = min(start, tlb->start);
>>>> +		tlb->end = max(end, tlb->end);
>>> 
>>> Err… You open-code mmu_gather which is arch-specific. It appears that all of
>>> them have start and end members, but not need_flush_all. Besides, I am not
>> 
>> When I see tlb_gather_mmu which is not arch-specific, it intializes
>> need_flush_all to zero so it would be no harmful although some of
>> architecture doesn't set the flag.
>> Please correct me if I miss something.
> 
> Oh.. my bad. I missed the fact that this code is under “#ifdef
> HAVE_GENERIC_MMU_GATHER”. But that means that arch-specific tlb_finish_mmu()
> implementations (s390, arm) may need to be modified as well.
> 
>>> sure whether they regard start and end the same way.
>> 
>> I understand your worry but my patch takes longer range by min/max
>> so I cannot imagine how it breaks. During looking the code, I found
>> __tlb_adjust_range so better to use it rather than open-code.
>> 
>> 
>> diff --git a/mm/memory.c b/mm/memory.c
>> index b5320e96ec51..b23188daa396 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -288,10 +288,8 @@ void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long e
>> 	struct mmu_gather_batch *batch, *next;
>> 
>> 	if (!tlb->fullmm && !tlb->need_flush_all &&
>> -			mm_tlb_flush_pending(tlb->mm) > 1) {
>> -		tlb->start = min(start, tlb->start);
>> -		tlb->end = max(end, tlb->end);
>> -	}
>> +			mm_tlb_flush_pending(tlb->mm) > 1)
>> +		__tlb_adjust_range(tlb->mm, start, end - start);
>> 
>> 	tlb_flush_mmu(tlb);
>> 	clear_tlb_flush_pending(tlb->mm);
> 
> This one is better, especially as I now understand it is only for the
> generic MMU gather (which I missed before).

There is one issue I forgot: pte_accessible() on x86 regards
mm_tlb_flush_pending() as an indication for NUMA migration. But now the code
does not make too much sense:

        if ((pte_flags(a) & _PAGE_PROTNONE) &&
                        mm_tlb_flush_pending(mm))

Either we remove the _PAGE_PROTNONE check or we need to use the atomic field
to count separately pending flushes due to migration and due to other
reasons. The first option is safer, but Mel objected to it, because of the
performance implications. The second one requires some thought on how to
build a single counter for multiple reasons and avoid a potential overflow.

Thoughts?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>