On 14/01/2019 14:19, Vinayak Menon wrote:
> On 1/11/2019 9:13 PM, Vinayak Menon wrote:
>> Hi Laurent,
>>
>> We are observing an issue with speculative page fault with the
>> following test code on ARM64 (4.14 kernel, 8 cores).
>
>
> With the patch below, we don't hit the issue.
>
> From: Vinayak Menon
> Date: Mon, 14 Jan 2019 16:06:34 +0530
> Subject: [PATCH] mm: flush stale tlb entries on speculative write fault
>
> It is observed that the following scenario results in
> threads A and B of process 1 blocking on pthread_mutex_lock
> forever after a few iterations.
>
> CPU 1                   CPU 2                    CPU 3
> Process 1,              Process 1,               Process 1,
> Thread A                Thread B                 Thread C
>
> while (1) {             while (1) {              while(1) {
> pthread_mutex_lock(l)   pthread_mutex_lock(l)    fork
> pthread_mutex_unlock(l) pthread_mutex_unlock(l)  }
> }                       }
>
> When, from thread C, copy_one_pte write-protects the parent pte
> (of lock l), stale tlb entries can exist with write permissions
> on at least one of the CPUs. This can create a problem if one
> of the threads A or B hits the write fault. Though dup_mmap calls
> flush_tlb_mm after copy_page_range, since the speculative page fault
> path does not take mmap_sem, it can proceed to fix a fault soon
> after CPU 3 does ptep_set_wrprotect. But the CPU with the stale tlb
> entry can still modify old_page even after it is copied to
> new_page by wp_page_copy, thus causing a corruption.

Nice catch and thanks for your investigation!

There is a real synchronization issue here between copy_page_range()
and the speculative page fault handler. I didn't get it on PowerVM
since the TLB is flushed when arch_leave_lazy_mmu_mode() is called in
copy_page_range(), but I can now reproduce it when running on x86_64.

> Signed-off-by: Vinayak Menon
> ---
>  mm/memory.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 52080e4..1ea168ff 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4507,6 +4507,13 @@ int __handle_speculative_fault(struct mm_struct *mm, unsigned long address,
>                 return VM_FAULT_RETRY;
>         }
>
> +       /*
> +        * Discard tlb entries created before ptep_set_wrprotect
> +        * in copy_one_pte
> +        */
> +       if (flags & FAULT_FLAG_WRITE && !pte_write(vmf.orig_pte))
> +               flush_tlb_page(vmf.vma, address);
> +
>         mem_cgroup_oom_enable();
>         ret = handle_pte_fault(&vmf);
>         mem_cgroup_oom_disable();

Your patch fixes the race, but I'm wondering about the cost of these
TLB flushes. Here we are flushing on a per-page basis (architectures
like x86_64 are smarter and flush more pages), but a TLB flush is now
requested each time a COW page is touched for the first time by a
write. I think this could have a noticeable performance impact.

Another option would be to flush the range in copy_pte_range() before
unlocking the page table lock (a rough sketch of that idea follows at
the end of this mail). This will flush entries that flush_tlb_mm()
would later flush anyway in dup_mmap(), but it will only be done once
per COW VMA per fork.

I tried the attached patch which seems to fix the issue on x86_64.
Could you please give it a try on arm64?

Thanks,
Laurent.
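
For illustration, here is a rough sketch of the copy_pte_range() idea
mentioned above. To be clear, this is NOT the attached patch, only a
sketch written against a 4.14-era mm/memory.c: the hunk placement is
approximate, and it assumes is_cow_mapping() and flush_tlb_range() are
usable at that point in the file.

--- a/mm/memory.c
+++ b/mm/memory.c
@@ copy_pte_range(): function prologue (context abbreviated) @@
 static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
                   pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
                   unsigned long addr, unsigned long end)
 {
+       unsigned long start = addr;     /* start of the range being copied */
@@ copy_pte_range(): after the copy loop, before dropping the ptl @@
        arch_leave_lazy_mmu_mode();
+       /*
+        * Flush the copied range while the page table locks are still
+        * held, so that a speculative write fault racing with fork
+        * cannot keep using a stale writable TLB entry for a pte that
+        * copy_one_pte() just write-protected.
+        */
+       if (is_cow_mapping(vma->vm_flags))
+               flush_tlb_range(vma, start, addr);
        spin_unlock(src_ptl);

Note that copy_pte_range() copies the ptes in batches, so "addr" is the
end of the current batch when we reach the unlock; flushing start..addr
may re-flush part of an earlier batch, which is harmless but could be
tightened.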