[v12,18/31] mm: protect against PTE changes done by dup_mmap()
diff mbox series

Message ID 20190416134522.17540-19-ldufour@linux.ibm.com
State New, archived
Headers show
Series
  • Speculative page faults
Related show

Commit Message

Laurent Dufour April 16, 2019, 1:45 p.m. UTC
Vinayak Menon and Ganesh Mahendran reported that the following scenario may
lead to thread being blocked due to data corruption:

    CPU 1                   CPU 2                    CPU 3
    Process 1,              Process 1,               Process 1,
    Thread A                Thread B                 Thread C

    while (1) {             while (1) {              while(1) {
    pthread_mutex_lock(l)   pthread_mutex_lock(l)    fork
    pthread_mutex_unlock(l) pthread_mutex_unlock(l)  }
    }                       }

In the details this happens because :

    CPU 1                CPU 2                       CPU 3
    fork()
    copy_pte_range()
      set PTE rdonly
    got to next VMA...
     .                   PTE is seen rdonly          PTE still writable
     .                   thread is writing to page
     .                   -> page fault
     .                     copy the page             Thread writes to page
     .                      .                        -> no page fault
     .                     update the PTE
     .                     flush TLB for that PTE
   flush TLB                                        PTE are now rdonly

So the write done by the CPU 3 is interfering with the page copy operation
done by CPU 2, leading to the data corruption.

To avoid this we mark all the VMA involved in the COW mechanism as changing
by calling vm_write_begin(). This ensures that the speculative page fault
handler will not try to handle a fault on these pages.
The marker is set until the TLB is flushed, ensuring that all the CPUs will
now see the PTE as not writable.
Once the TLB is flush, the marker is removed by calling vm_write_end().

The variable last is used to keep tracked of the latest VMA marked to
handle the error path where part of the VMA may have been marked.

Since multiple VMA from the same mm may have the sequence count increased
during this process, the use of the vm_raw_write_begin/end() is required to
avoid lockdep false warning messages.

Reported-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
Reported-by: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
---
 kernel/fork.c | 30 ++++++++++++++++++++++++++++--
 1 file changed, 28 insertions(+), 2 deletions(-)

Comments

Jerome Glisse April 22, 2019, 8:32 p.m. UTC | #1
On Tue, Apr 16, 2019 at 03:45:09PM +0200, Laurent Dufour wrote:
> Vinayak Menon and Ganesh Mahendran reported that the following scenario may
> lead to thread being blocked due to data corruption:
> 
>     CPU 1                   CPU 2                    CPU 3
>     Process 1,              Process 1,               Process 1,
>     Thread A                Thread B                 Thread C
> 
>     while (1) {             while (1) {              while(1) {
>     pthread_mutex_lock(l)   pthread_mutex_lock(l)    fork
>     pthread_mutex_unlock(l) pthread_mutex_unlock(l)  }
>     }                       }
> 
> In the details this happens because :
> 
>     CPU 1                CPU 2                       CPU 3
>     fork()
>     copy_pte_range()
>       set PTE rdonly
>     got to next VMA...
>      .                   PTE is seen rdonly          PTE still writable
>      .                   thread is writing to page
>      .                   -> page fault
>      .                     copy the page             Thread writes to page
>      .                      .                        -> no page fault
>      .                     update the PTE
>      .                     flush TLB for that PTE
>    flush TLB                                        PTE are now rdonly

Should the fork be on CPU3 to be consistant with the top thing (just to
make it easier to read and go from one to the other as thread can move
from one CPU to another).

> 
> So the write done by the CPU 3 is interfering with the page copy operation
> done by CPU 2, leading to the data corruption.
> 
> To avoid this we mark all the VMA involved in the COW mechanism as changing
> by calling vm_write_begin(). This ensures that the speculative page fault
> handler will not try to handle a fault on these pages.
> The marker is set until the TLB is flushed, ensuring that all the CPUs will
> now see the PTE as not writable.
> Once the TLB is flush, the marker is removed by calling vm_write_end().
> 
> The variable last is used to keep tracked of the latest VMA marked to
> handle the error path where part of the VMA may have been marked.
> 
> Since multiple VMA from the same mm may have the sequence count increased
> during this process, the use of the vm_raw_write_begin/end() is required to
> avoid lockdep false warning messages.
> 
> Reported-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
> Reported-by: Vinayak Menon <vinmenon@codeaurora.org>
> Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>

A minor comment (see below)

Reviewed-by: Jérome Glisse <jglisse@redhat.com>

> ---
>  kernel/fork.c | 30 ++++++++++++++++++++++++++++--
>  1 file changed, 28 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/fork.c b/kernel/fork.c
> index f8dae021c2e5..2992d2c95256 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -462,7 +462,7 @@ EXPORT_SYMBOL(free_task);
>  static __latent_entropy int dup_mmap(struct mm_struct *mm,
>  					struct mm_struct *oldmm)
>  {
> -	struct vm_area_struct *mpnt, *tmp, *prev, **pprev;
> +	struct vm_area_struct *mpnt, *tmp, *prev, **pprev, *last = NULL;
>  	struct rb_node **rb_link, *rb_parent;
>  	int retval;
>  	unsigned long charge;
> @@ -581,8 +581,18 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>  		rb_parent = &tmp->vm_rb;
>  
>  		mm->map_count++;
> -		if (!(tmp->vm_flags & VM_WIPEONFORK))
> +		if (!(tmp->vm_flags & VM_WIPEONFORK)) {
> +			if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT)) {
> +				/*
> +				 * Mark this VMA as changing to prevent the
> +				 * speculative page fault hanlder to process
> +				 * it until the TLB are flushed below.
> +				 */
> +				last = mpnt;
> +				vm_raw_write_begin(mpnt);
> +			}
>  			retval = copy_page_range(mm, oldmm, mpnt);
> +		}
>  
>  		if (tmp->vm_ops && tmp->vm_ops->open)
>  			tmp->vm_ops->open(tmp);
> @@ -595,6 +605,22 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>  out:
>  	up_write(&mm->mmap_sem);
>  	flush_tlb_mm(oldmm);
> +
> +	if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT)) {

You do not need to check for CONFIG_SPECULATIVE_PAGE_FAULT as last
will always be NULL if it is not enabled but maybe the compiler will
miss the optimization opportunity if you only have the for() loop
below.

> +		/*
> +		 * Since the TLB has been flush, we can safely unmark the
> +		 * copied VMAs and allows the speculative page fault handler to
> +		 * process them again.
> +		 * Walk back the VMA list from the last marked VMA.
> +		 */
> +		for (; last; last = last->vm_prev) {
> +			if (last->vm_flags & VM_DONTCOPY)
> +				continue;
> +			if (!(last->vm_flags & VM_WIPEONFORK))
> +				vm_raw_write_end(last);
> +		}
> +	}
> +
>  	up_write(&oldmm->mmap_sem);
>  	dup_userfaultfd_complete(&uf);
>  fail_uprobe_end:
> -- 
> 2.21.0
>
Laurent Dufour April 24, 2019, 10:33 a.m. UTC | #2
Le 22/04/2019 à 22:32, Jerome Glisse a écrit :
> On Tue, Apr 16, 2019 at 03:45:09PM +0200, Laurent Dufour wrote:
>> Vinayak Menon and Ganesh Mahendran reported that the following scenario may
>> lead to thread being blocked due to data corruption:
>>
>>      CPU 1                   CPU 2                    CPU 3
>>      Process 1,              Process 1,               Process 1,
>>      Thread A                Thread B                 Thread C
>>
>>      while (1) {             while (1) {              while(1) {
>>      pthread_mutex_lock(l)   pthread_mutex_lock(l)    fork
>>      pthread_mutex_unlock(l) pthread_mutex_unlock(l)  }
>>      }                       }
>>
>> In the details this happens because :
>>
>>      CPU 1                CPU 2                       CPU 3
>>      fork()
>>      copy_pte_range()
>>        set PTE rdonly
>>      got to next VMA...
>>       .                   PTE is seen rdonly          PTE still writable
>>       .                   thread is writing to page
>>       .                   -> page fault
>>       .                     copy the page             Thread writes to page
>>       .                      .                        -> no page fault
>>       .                     update the PTE
>>       .                     flush TLB for that PTE
>>     flush TLB                                        PTE are now rdonly
> 
> Should the fork be on CPU3 to be consistant with the top thing (just to
> make it easier to read and go from one to the other as thread can move
> from one CPU to another).

Sure, this is quite confusing this way ;)

>>
>> So the write done by the CPU 3 is interfering with the page copy operation
>> done by CPU 2, leading to the data corruption.
>>
>> To avoid this we mark all the VMA involved in the COW mechanism as changing
>> by calling vm_write_begin(). This ensures that the speculative page fault
>> handler will not try to handle a fault on these pages.
>> The marker is set until the TLB is flushed, ensuring that all the CPUs will
>> now see the PTE as not writable.
>> Once the TLB is flush, the marker is removed by calling vm_write_end().
>>
>> The variable last is used to keep tracked of the latest VMA marked to
>> handle the error path where part of the VMA may have been marked.
>>
>> Since multiple VMA from the same mm may have the sequence count increased
>> during this process, the use of the vm_raw_write_begin/end() is required to
>> avoid lockdep false warning messages.
>>
>> Reported-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
>> Reported-by: Vinayak Menon <vinmenon@codeaurora.org>
>> Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
> 
> A minor comment (see below)
> 
> Reviewed-by: Jérome Glisse <jglisse@redhat.com>

Thanks for the review Jérôme.

>> ---
>>   kernel/fork.c | 30 ++++++++++++++++++++++++++++--
>>   1 file changed, 28 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index f8dae021c2e5..2992d2c95256 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -462,7 +462,7 @@ EXPORT_SYMBOL(free_task);
>>   static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>   					struct mm_struct *oldmm)
>>   {
>> -	struct vm_area_struct *mpnt, *tmp, *prev, **pprev;
>> +	struct vm_area_struct *mpnt, *tmp, *prev, **pprev, *last = NULL;
>>   	struct rb_node **rb_link, *rb_parent;
>>   	int retval;
>>   	unsigned long charge;
>> @@ -581,8 +581,18 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>   		rb_parent = &tmp->vm_rb;
>>   
>>   		mm->map_count++;
>> -		if (!(tmp->vm_flags & VM_WIPEONFORK))
>> +		if (!(tmp->vm_flags & VM_WIPEONFORK)) {
>> +			if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT)) {
>> +				/*
>> +				 * Mark this VMA as changing to prevent the
>> +				 * speculative page fault hanlder to process
>> +				 * it until the TLB are flushed below.
>> +				 */
>> +				last = mpnt;
>> +				vm_raw_write_begin(mpnt);
>> +			}
>>   			retval = copy_page_range(mm, oldmm, mpnt);
>> +		}
>>   
>>   		if (tmp->vm_ops && tmp->vm_ops->open)
>>   			tmp->vm_ops->open(tmp);
>> @@ -595,6 +605,22 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>   out:
>>   	up_write(&mm->mmap_sem);
>>   	flush_tlb_mm(oldmm);
>> +
>> +	if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT)) {
> 
> You do not need to check for CONFIG_SPECULATIVE_PAGE_FAULT as last
> will always be NULL if it is not enabled but maybe the compiler will
> miss the optimization opportunity if you only have the for() loop
> below.

I didn't check for the generated code, perhaps the compiler will be 
optimize that correctly.
This being said, I think the if block is better for the code 
readability, highlighting that this block is only needed in the case of SPF.

>> +		/*
>> +		 * Since the TLB has been flush, we can safely unmark the
>> +		 * copied VMAs and allows the speculative page fault handler to
>> +		 * process them again.
>> +		 * Walk back the VMA list from the last marked VMA.
>> +		 */
>> +		for (; last; last = last->vm_prev) {
>> +			if (last->vm_flags & VM_DONTCOPY)
>> +				continue;
>> +			if (!(last->vm_flags & VM_WIPEONFORK))
>> +				vm_raw_write_end(last);
>> +		}
>> +	}
>> +
>>   	up_write(&oldmm->mmap_sem);
>>   	dup_userfaultfd_complete(&uf);
>>   fail_uprobe_end:
>> -- 
>> 2.21.0
>>
>

Patch
diff mbox series

diff --git a/kernel/fork.c b/kernel/fork.c
index f8dae021c2e5..2992d2c95256 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -462,7 +462,7 @@  EXPORT_SYMBOL(free_task);
 static __latent_entropy int dup_mmap(struct mm_struct *mm,
 					struct mm_struct *oldmm)
 {
-	struct vm_area_struct *mpnt, *tmp, *prev, **pprev;
+	struct vm_area_struct *mpnt, *tmp, *prev, **pprev, *last = NULL;
 	struct rb_node **rb_link, *rb_parent;
 	int retval;
 	unsigned long charge;
@@ -581,8 +581,18 @@  static __latent_entropy int dup_mmap(struct mm_struct *mm,
 		rb_parent = &tmp->vm_rb;
 
 		mm->map_count++;
-		if (!(tmp->vm_flags & VM_WIPEONFORK))
+		if (!(tmp->vm_flags & VM_WIPEONFORK)) {
+			if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT)) {
+				/*
+				 * Mark this VMA as changing to prevent the
+				 * speculative page fault hanlder to process
+				 * it until the TLB are flushed below.
+				 */
+				last = mpnt;
+				vm_raw_write_begin(mpnt);
+			}
 			retval = copy_page_range(mm, oldmm, mpnt);
+		}
 
 		if (tmp->vm_ops && tmp->vm_ops->open)
 			tmp->vm_ops->open(tmp);
@@ -595,6 +605,22 @@  static __latent_entropy int dup_mmap(struct mm_struct *mm,
 out:
 	up_write(&mm->mmap_sem);
 	flush_tlb_mm(oldmm);
+
+	if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT)) {
+		/*
+		 * Since the TLB has been flush, we can safely unmark the
+		 * copied VMAs and allows the speculative page fault handler to
+		 * process them again.
+		 * Walk back the VMA list from the last marked VMA.
+		 */
+		for (; last; last = last->vm_prev) {
+			if (last->vm_flags & VM_DONTCOPY)
+				continue;
+			if (!(last->vm_flags & VM_WIPEONFORK))
+				vm_raw_write_end(last);
+		}
+	}
+
 	up_write(&oldmm->mmap_sem);
 	dup_userfaultfd_complete(&uf);
 fail_uprobe_end: