linux-arm-kernel.lists.infradead.org archive mirror
* about the ptep_set_access_flags() for hardware AF/DBM
@ 2019-10-27  9:56 FF
  2019-10-28 18:43 ` Catalin Marinas
  0 siblings, 1 reply; 5+ messages in thread
From: FF @ 2019-10-27  9:56 UTC (permalink / raw)
  To: linux-arm-kernel, catalin.marinas, julien.grall, will.deacon,
	mark.rutland, steve.capper


Hi all,

I came across commit 66dbd6e61a52 ("arm64: Implement ptep_set_access_flags() for hardware AF/DBM").
In this patch, the author describes an interesting race involving hardware AF/DBM.

Here is the scenario:
A more complex situation is possible when all CPUs support hardware
   AF/DBM:

   a) Initial state: shareable + writable vma and pte_none(pte)
   b) Read fault taken by two threads of the same process on different
      CPUs
   c) CPU0 takes the mmap_sem and proceeds to handling the fault. It
      eventually reaches do_set_pte() which sets a writable + clean pte.
      CPU0 releases the mmap_sem
   d) CPU1 acquires the mmap_sem and proceeds to handle_pte_fault(). The
      pte entry it reads is present, writable and clean and it continues
      to pte_mkyoung()
   e) CPU1 calls ptep_set_access_flags()

   If between (d) and (e) the hardware (another CPU) updates the dirty
   state (clears PTE_RDONLY), CPU1 will override the PTE_RDONLY bit
   marking the entry clean again.

My questions are:
1. In step (a), it says the initial state is a shareable + writable vma
with pte_none(pte). Let's suppose this is an anonymous mapping created
via the mmap() API.

In that case vma->vm_flags should be VM_READ | VM_WRITE | VM_SHARED.
vm_get_page_prot() converts these flags into pte attributes via the
kernel's protection_map[] array, so here the result should be __S011
(PAGE_SHARED). For PAGE_SHARED, PTE_WRITE is set, so PTE_DBM is set,
but PTE_RDONLY should be zero, right?
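
For reference, my understanding is that the lookup is just an index into
protection_map[] (a minimal sketch of the generic mm/mmap.c helper from
around that era, written from memory, so treat the details as an
assumption):

	/* mm/mmap.c (sketch): translate vm_flags into pte attributes */
	pgprot_t vm_get_page_prot(unsigned long vm_flags)
	{
		return __pgprot(pgprot_val(protection_map[vm_flags &
				(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)]) |
				pgprot_val(arch_vm_get_page_prot(vm_flags)));
	}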

In step (c), CPU0 takes the read fault and handles it; for an anonymous
mapping it would call do_anonymous_page() and map the zero page. I am
not sure what a 'clean' pte means here, but at this point PTE_RDONLY is
zero, which means this pte is writable.

When CPU2 writes to this memory, the hardware updates the dirty state
by clearing PTE_RDONLY. But my question is: PTE_RDONLY is already zero
throughout steps (a)-(d), so why would CPU1 override the PTE_RDONLY bit
and mark the entry clean again?

In that case, why would it trigger the "racy dirty state clearing" in
set_pte_at()? I see that pte_dirty() checks both the software and the
hardware dirty state, so in this case, is it that software has not set
the PTE_DIRTY bit? And as for the hardware dirty check, PTE_RDONLY is
always zero.

#define pte_dirty(pte)		(pte_sw_dirty(pte) || pte_hw_dirty(pte))
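
For context, my reading of the arm64 helpers behind this check (a sketch
from arch/arm64/include/asm/pgtable.h as I remember it; please correct
me if the definitions differ):

	/* hardware-dirty: DBM (PTE_WRITE) set and PTE_RDONLY cleared by hw */
	#define pte_hw_dirty(pte)	(pte_write(pte) && !(pte_val(pte) & PTE_RDONLY))
	/* software-dirty: the kernel set PTE_DIRTY explicitly */
	#define pte_sw_dirty(pte)	(!!(pte_val(pte) & PTE_DIRTY))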

Could you point out what I am missing?

Best
Ben



_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel


* Re: about the ptep_set_access_flags() for hardware AF/DBM
  2019-10-27  9:56 about the ptep_set_access_flags() for hardware AF/DBM FF
@ 2019-10-28 18:43 ` Catalin Marinas
  2019-10-29  0:54   ` FF
  0 siblings, 1 reply; 5+ messages in thread
From: Catalin Marinas @ 2019-10-28 18:43 UTC (permalink / raw)
  To: FF
  Cc: mark.rutland, julien.grall, will.deacon, linux-arm-kernel, steve.capper

On Sun, Oct 27, 2019 at 05:56:24PM +0800, FF wrote:
> I came across commit 66dbd6e61a52 ("arm64: Implement ptep_set_access_flags() for hardware AF/DBM").
> In this patch, the author describes an interesting race involving hardware AF/DBM.
> 
> Here is the scenario:
> A more complex situation is possible when all CPUs support hardware
>    AF/DBM:
> 
>    a) Initial state: shareable + writable vma and pte_none(pte)
>    b) Read fault taken by two threads of the same process on different
>       CPUs
>    c) CPU0 takes the mmap_sem and proceeds to handling the fault. It
>       eventually reaches do_set_pte() which sets a writable + clean pte.
>       CPU0 releases the mmap_sem
>    d) CPU1 acquires the mmap_sem and proceeds to handle_pte_fault(). The
>       pte entry it reads is present, writable and clean and it continues
>       to pte_mkyoung()
>    e) CPU1 calls ptep_set_access_flags()
> 
>    If between (d) and (e) the hardware (another CPU) updates the dirty
>    state (clears PTE_RDONLY), CPU1 will override the PTE_RDONLY bit
>    marking the entry clean again.
> 
> My questions are:
> 1. In step (a), it says the initial state is a shareable + writable vma
> with pte_none(pte). Let's suppose this is an anonymous mapping created
> via the mmap() API.

What I had in mind at the time was a file mapping rather than an
anonymous one (vma_is_anonymous() is false for shared mappings).

> In that case vma->vm_flags should be VM_READ | VM_WRITE | VM_SHARED.
> vm_get_page_prot() converts these flags into pte attributes via the
> kernel's protection_map[] array, so here the result should be __S011
> (PAGE_SHARED). For PAGE_SHARED, PTE_WRITE is set, so PTE_DBM is set,
> but PTE_RDONLY should be zero, right?

PAGE_SHARED is indeed writable but how it ends up in the pte depends on
the mapping. For a shared memory mapping, I think you do get a writable
entry on a read fault.

For file mappings, the writable attribute is cleared from vm_page_prot
via the vma_set_page_prot() function because vma_wants_writenotify() is
true. Filesystems normally want to track which pages have been dirtied
so they can write them back.
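
Roughly (a sketch of the mm/mmap.c helper from memory; treat the exact
shape as an assumption):

	/* mm/mmap.c (sketch): drop VM_SHARED from the prot lookup when the
	 * kernel wants write notifications for this mapping */
	void vma_set_page_prot(struct vm_area_struct *vma)
	{
		unsigned long vm_flags = vma->vm_flags;

		vma->vm_page_prot = vm_pgprot_modify(vma->vm_page_prot, vm_flags);
		if (vma_wants_writenotify(vma)) {
			vm_flags &= ~VM_SHARED;
			vma->vm_page_prot = vm_pgprot_modify(vma->vm_page_prot,
							     vm_flags);
		}
	}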

> In step (c), CPU0 takes the read fault and handles it; for an
> anonymous mapping it would call do_anonymous_page() and map the zero
> page. I am not sure what a 'clean' pte means here, but at this point
> PTE_RDONLY is zero, which means this pte is writable.

Note that we can't invoke do_anonymous_page() for VM_SHARED mappings.
This is only for private mappings. If you look at mmap_region(), the vma
is not set up as anonymous if MAP_SHARED is given but as a shmem mapping.

> When CPU2 writes to this memory, the hardware updates the dirty state
> by clearing PTE_RDONLY. But my question is: PTE_RDONLY is already zero
> throughout steps (a)-(d), so why would CPU1 override the PTE_RDONLY
> bit and mark the entry clean again?

As I said above, this scenario is for shared file mappings where you do
get a PTE_RDONLY set for clean mappings.

-- 
Catalin

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel


* Re:Re: about the ptep_set_access_flags() for hardware AF/DBM
  2019-10-28 18:43 ` Catalin Marinas
@ 2019-10-29  0:54   ` FF
  2019-10-29 12:11     ` Catalin Marinas
  0 siblings, 1 reply; 5+ messages in thread
From: FF @ 2019-10-29  0:54 UTC (permalink / raw)
  To: Catalin Marinas, runninglinuxkernel
  Cc: mark.rutland, steve.capper, will.deacon, runninglinuxkernel,
	julien.grall, linux-arm-kernel



At 2019-10-29 02:43:03, "Catalin Marinas" <catalin.marinas@arm.com> wrote:
>On Sun, Oct 27, 2019 at 05:56:24PM +0800, FF wrote:
>> I came across commit 66dbd6e61a52 ("arm64: Implement ptep_set_access_flags() for hardware AF/DBM").
>> In this patch, the author describes an interesting race involving hardware AF/DBM.
>> 
>> Here is the scenario:
>> A more complex situation is possible when all CPUs support hardware
>>    AF/DBM:
>> 
>>    a) Initial state: shareable + writable vma and pte_none(pte)
>>    b) Read fault taken by two threads of the same process on different
>>       CPUs
>>    c) CPU0 takes the mmap_sem and proceeds to handling the fault. It
>>       eventually reaches do_set_pte() which sets a writable + clean pte.
>>       CPU0 releases the mmap_sem
>>    d) CPU1 acquires the mmap_sem and proceeds to handle_pte_fault(). The
>>       pte entry it reads is present, writable and clean and it continues
>>       to pte_mkyoung()
>>    e) CPU1 calls ptep_set_access_flags()
>> 
>>    If between (d) and (e) the hardware (another CPU) updates the dirty
>>    state (clears PTE_RDONLY), CPU1 will override the PTE_RDONLY bit
>>    marking the entry clean again.
>> 
>> My questions are:
>> 1. In step (a), it says the initial state is a shareable + writable vma
>> with pte_none(pte). Let's suppose this is an anonymous mapping created
>> via the mmap() API.
>
>What I had in mind at the time was a file mapping rather than an
>anonymous one (vma_is_anonymous() is false for shared mappings).
>
>> In that case vma->vm_flags should be VM_READ | VM_WRITE | VM_SHARED.
>> vm_get_page_prot() converts these flags into pte attributes via the
>> kernel's protection_map[] array, so here the result should be __S011
>> (PAGE_SHARED). For PAGE_SHARED, PTE_WRITE is set, so PTE_DBM is set,
>> but PTE_RDONLY should be zero, right?
>
>PAGE_SHARED is indeed writable but how it ends up in the pte depends on
>the mapping. For a shared memory mapping, I think you do get a writable
>entry on a read fault.
>
>For file mappings, the writable attribute is cleared from vm_page_prot
>via the vma_set_page_prot() function because vma_wants_writenotify() is
>true. Filesystems normally want to track which pages have been dirtied
>so they can write them back.
>
>> In step (c), CPU0 takes the read fault and handles it; for an
>> anonymous mapping it would call do_anonymous_page() and map the zero
>> page. I am not sure what a 'clean' pte means here, but at this point
>> PTE_RDONLY is zero, which means this pte is writable.
>
>Note that we can't invoke do_anonymous_page() for VM_SHARED mappings.
>This is only for private mappings. If you look at mmap_region(), the vma
>is not set up as anonymous if MAP_SHARED is given but as a shmem mapping.
>
>> When CPU2 writes to this memory, the hardware updates the dirty state
>> by clearing PTE_RDONLY. But my question is: PTE_RDONLY is already zero
>> throughout steps (a)-(d), so why would CPU1 override the PTE_RDONLY
>> bit and mark the entry clean again?
>
>As I said above, this scenario is for shared file mappings where you do
>get a PTE_RDONLY set for clean mappings.
>
>-- 
>Catalin


Hi Catalin,

Thanks for pointing that out.
I want to elaborate on the scenario. I see that the first fix for
ptep_set_access_flags() for hardware AF/DBM landed in Linux 4.7-rc1:
commit 66dbd6e6 ("arm64: Implement ptep_set_access_flags() for hardware AF/DBM").

I think the issue exists in Linux 4.6, so let's assume we are looking at the Linux 4.6 source code.

1. Initial phase: we create a shareable + writable file mapping via the mmap() API; the filesystem is ext4.
   
   In do_mmap(), vm_flags is set to VM_READ | VM_WRITE | VM_SHARED.
   In mmap_region()->vma_set_page_prot(), some shared mappings want their pages marked read-only to track write events,
   so VM_SHARED is cleared for the page-protection lookup, and the pte attributes come from protection_map[] as __P011.
   
   In Linux 4.6, __P011 is PAGE_COPY:
   #define PAGE_COPY		__pgprot(_PAGE_DEFAULT | PTE_USER | PTE_NG | PTE_PXN | PTE_UXN)
   
   For PAGE_COPY, PTE_RDONLY and PTE_WRITE (DBM) are both zero.
   So the flags used for the lookup are VM_READ | VM_WRITE.
   
2. Thread 1 on CPU0 wants to write to this page, so a page fault is triggered.
   In handle_pte_fault()->do_fault()->do_shared_fault(), a new page-cache page is allocated, and do_set_pte()
   calls "maybe_mkwrite(pte_mkdirty(entry), vma)" to build the pte entry.
   So the pte attributes should include PTE_DIRTY | PTE_WRITE.
   
3. Thread 2 on CPU1 also wants to read this page; the fault was taken before Thread 1 had created the pte, so a page fault happens.
   At pte_offset_map(), it finds that the pte has already been created by Thread 1, so it goes straight to:
   
   entry = pte_mkyoung(entry);
   ptep_set_access_flags()
   
   ptep_set_access_flags() calls set_pte_at() to install the pte. But in the set_pte_at() function:
   
   	if (pte_present(pte)) {
		if (pte_sw_dirty(pte) && pte_write(pte))
			pte_val(pte) &= ~PTE_RDONLY;
		else
			pte_val(pte) |= PTE_RDONLY;
		if (pte_user(pte) && pte_exec(pte) && !pte_special(pte))
			__sync_icache_dcache(pte, addr);
	}
	
	it will clear the PTE_RDONLY bit, because PTE_DIRTY | PTE_WRITE are set in our scenario.
	Conversely, who would ever clear the PTE_DIRTY bit?

So I am quite confused by the scenario in the commit log of "arm64: Implement ptep_set_access_flags() for hardware AF/DBM".
Could you point out what I am missing?

Best
Ben


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel


* Re: Re: about the ptep_set_access_flags() for hardware AF/DBM
  2019-10-29  0:54   ` FF
@ 2019-10-29 12:11     ` Catalin Marinas
  2019-10-29 14:04       ` FF
  0 siblings, 1 reply; 5+ messages in thread
From: Catalin Marinas @ 2019-10-29 12:11 UTC (permalink / raw)
  To: FF
  Cc: mark.rutland, steve.capper, runninglinuxkernel, will.deacon,
	julien.grall, linux-arm-kernel

Hi Ben,

On Tue, Oct 29, 2019 at 08:54:38AM +0800, FF wrote:
> >On Sun, Oct 27, 2019 at 05:56:24PM +0800, FF wrote:
> >> Here is the scenario:
> >> A more complex situation is possible when all CPUs support hardware
> >>    AF/DBM:
> >> 
> >>    a) Initial state: shareable + writable vma and pte_none(pte)
> >>    b) Read fault taken by two threads of the same process on different
> >>       CPUs
> >>    c) CPU0 takes the mmap_sem and proceeds to handling the fault. It
> >>       eventually reaches do_set_pte() which sets a writable + clean pte.
> >>       CPU0 releases the mmap_sem
> >>    d) CPU1 acquires the mmap_sem and proceeds to handle_pte_fault(). The
> >>       pte entry it reads is present, writable and clean and it continues
> >>       to pte_mkyoung()
> >>    e) CPU1 calls ptep_set_access_flags()
> >> 
> >>    If between (d) and (e) the hardware (another CPU) updates the dirty
> >>    state (clears PTE_RDONLY), CPU1 will override the PTE_RDONLY bit
> >>    marking the entry clean again.
[...]
> I want to elaborate on the scenario. I see that the first fix for
> ptep_set_access_flags() for hardware AF/DBM landed in Linux 4.7-rc1:
> commit 66dbd6e6 ("arm64: Implement ptep_set_access_flags() for
> hardware AF/DBM").

What are you trying to solve? ptep_set_access_flags() being atomic is
not any worse. Do you think we wouldn't need this patch?

> I think the issue exists in Linux 4.6, so let's assume we are looking
> at the Linux 4.6 source code.
> 
> 1. Initial phase: we create a shareable + writable file mapping via
>    the mmap() API; the filesystem is ext4.

See more below but I think we may need shm instead of a file mapping to
trigger the race (which, BTW, is rather theoretical; I haven't seen it
in practice).

>    In do_mmap(), vm_flags is set to VM_READ | VM_WRITE | VM_SHARED.
>    In mmap_region()->vma_set_page_prot(), some shared mappings want
>    their pages marked read-only to track write events, so VM_SHARED is
>    cleared for the page-protection lookup, and the pte attributes come
>    from protection_map[] as __P011.
>    
>    In Linux 4.6, __P011 is PAGE_COPY:
>    #define PAGE_COPY		__pgprot(_PAGE_DEFAULT | PTE_USER | PTE_NG | PTE_PXN | PTE_UXN)
>    
>    For PAGE_COPY, PTE_RDONLY and PTE_WRITE (DBM) are both zero.
>    So the flags used for the lookup are VM_READ | VM_WRITE.

While you are right that PAGE_COPY has PTE_RDONLY and PTE_WRITE zero,
set_pte_at() in 4.6 sets PTE_RDONLY if !PTE_WRITE. So the resulting
mapping in the page table is read-only.

Anyway I think with vma_wants_writenotify() we can't trigger this race
since it's a purely read-only fault (requires kernel notification). What
we need is to end up with a writable+clean entry which means VM_SHARED
set leading to PAGE_SHARED attributes which have PTE_WRITE/DBM set. Note
that set_pte_at() in 4.6 would mark the page as PTE_RDONLY since
pte_sw_dirty() is false.

> 2. Thread 1 on CPU0 wants to write to this page, so a page fault is
>    triggered. In handle_pte_fault()->do_fault()->do_shared_fault(), a
>    new page-cache page is allocated, and do_set_pte() calls
>    "maybe_mkwrite(pte_mkdirty(entry), vma)" to build the pte entry.
>    So the pte attributes should include PTE_DIRTY | PTE_WRITE.

Yes but the scenario I had in mind was a read fault here rather than
write which would set a PAGE_SHARED attributes ending up with
PTE_WRITE|PTE_RDONLY (PTE_WRITE is the PTE_DBM bit).

> 3. Thread 2 on CPU1 also wants to read this page; the fault was taken
>    before Thread 1 had created the pte, so a page fault happens. At
>    pte_offset_map(), it finds that the pte has already been created by
>    Thread 1, so it goes straight to:
>    
>    entry = pte_mkyoung(entry);
>    ptep_set_access_flags()
>    
>    ptep_set_access_flags() calls set_pte_at() to install the pte. But
>    in the set_pte_at() function:
>    
>    	if (pte_present(pte)) {
> 		if (pte_sw_dirty(pte) && pte_write(pte))
> 			pte_val(pte) &= ~PTE_RDONLY;
> 		else
> 			pte_val(pte) |= PTE_RDONLY;
> 		if (pte_user(pte) && pte_exec(pte) && !pte_special(pte))
> 			__sync_icache_dcache(pte, addr);
> 	}
> 	
> 	it will clear the PTE_RDONLY bit, because PTE_DIRTY | PTE_WRITE
> 	are set in our scenario. Conversely, who would ever clear the
> 	PTE_DIRTY bit?

Correct for your scenario but not if point 2 is a read.

> So I am quite confused by the scenario in the commit log of "arm64:
> Implement ptep_set_access_flags() for hardware AF/DBM". Could you
> point out what I am missing?

If point 2 is a read fault, that goes via do_read_fault() and the pte
ends up as clean with PTE_WRITE|PTE_RDONLY set since it's not
pte_sw_dirty() (checked by set_pte_at()).

Thread 2 on CPU1 would end up calling ptep_set_access_flags() on a
read-only pte with DBM set because it took a read fault (same as Thread
1).

The problem appears if a Thread 3 on CPU2 performs a write access in
parallel with point 3 above. CPU2 sees the pte as valid, RDONLY and DBM
set, and proceeds to clearing the RDONLY bit in hardware. CPU1 then
overrides the PTE_RDONLY bit if ptep_set_access_flags() is not atomic.

Now you need to find a vm_operations_struct that allows shared, writable
and clean mappings and does not set .page_mkwrite (shm_vm_ops is one).
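
For reference, before that commit arm64 fell back to the generic
implementation, which is a plain compare-then-write with no atomicity
against the hardware DBM update (a sketch of mm/pgtable-generic.c from
memory; treat the exact shape as an assumption):

	int ptep_set_access_flags(struct vm_area_struct *vma,
				  unsigned long address, pte_t *ptep,
				  pte_t entry, int dirty)
	{
		int changed = !pte_same(*ptep, entry);
		if (changed) {
			/* a hardware PTE_RDONLY clear between the read of
			 * *ptep and this write is silently overwritten */
			set_pte_at(vma->vm_mm, address, ptep, entry);
			flush_tlb_fix_spurious_fault(vma, address);
		}
		return changed;
	}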

-- 
Catalin

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel


* Re:Re: Re: about the ptep_set_access_flags() for hardware AF/DBM
  2019-10-29 12:11     ` Catalin Marinas
@ 2019-10-29 14:04       ` FF
  0 siblings, 0 replies; 5+ messages in thread
From: FF @ 2019-10-29 14:04 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: mark.rutland, steve.capper, runninglinuxkernel, will.deacon,
	julien.grall, linux-arm-kernel



At 2019-10-29 20:11:54, "Catalin Marinas" <catalin.marinas@arm.com> wrote:
>Hi Ben,
>
>On Tue, Oct 29, 2019 at 08:54:38AM +0800, FF wrote:
>> >On Sun, Oct 27, 2019 at 05:56:24PM +0800, FF wrote:
>> >> Here is the scenario:
>> >> A more complex situation is possible when all CPUs support hardware
>> >>    AF/DBM:
>> >> 
>> >>    a) Initial state: shareable + writable vma and pte_none(pte)
>> >>    b) Read fault taken by two threads of the same process on different
>> >>       CPUs
>> >>    c) CPU0 takes the mmap_sem and proceeds to handling the fault. It
>> >>       eventually reaches do_set_pte() which sets a writable + clean pte.
>> >>       CPU0 releases the mmap_sem
>> >>    d) CPU1 acquires the mmap_sem and proceeds to handle_pte_fault(). The
>> >>       pte entry it reads is present, writable and clean and it continues
>> >>       to pte_mkyoung()
>> >>    e) CPU1 calls ptep_set_access_flags()
>> >> 
>> >>    If between (d) and (e) the hardware (another CPU) updates the dirty
>> >>    state (clears PTE_RDONLY), CPU1 will override the PTE_RDONLY bit
>> >>    marking the entry clean again.
>[...]
>> I want to elaborate on the scenario. I see that the first fix for
>> ptep_set_access_flags() for hardware AF/DBM landed in Linux 4.7-rc1:
>> commit 66dbd6e6 ("arm64: Implement ptep_set_access_flags() for
>> hardware AF/DBM").
>
>What are you trying to solve? ptep_set_access_flags() being atomic is
>not any worse. Do you think we wouldn't need this patch?
>
>> I think the issue exists in Linux 4.6, so let's assume we are looking
>> at the Linux 4.6 source code.
>> 
>> 1. Initial phase: we create a shareable + writable file mapping via
>>    the mmap() API; the filesystem is ext4.
>
>See more below but I think we may need shm instead of a file mapping to
>trigger the race (which, BTW, is rather theoretical; I haven't seen it
>in practice).
>
>>    In do_mmap(), vm_flags is set to VM_READ | VM_WRITE | VM_SHARED.
>>    In mmap_region()->vma_set_page_prot(), some shared mappings want
>>    their pages marked read-only to track write events, so VM_SHARED is
>>    cleared for the page-protection lookup, and the pte attributes come
>>    from protection_map[] as __P011.
>>    
>>    In Linux 4.6, __P011 is PAGE_COPY:
>>    #define PAGE_COPY		__pgprot(_PAGE_DEFAULT | PTE_USER | PTE_NG | PTE_PXN | PTE_UXN)
>>    
>>    For PAGE_COPY, PTE_RDONLY and PTE_WRITE (DBM) are both zero.
>>    So the flags used for the lookup are VM_READ | VM_WRITE.
>
>While you are right that PAGE_COPY has PTE_RDONLY and PTE_WRITE zero,
>set_pte_at() in 4.6 sets PTE_RDONLY if !PTE_WRITE. So the resulting
>mapping in the page table is read-only.
>
>Anyway I think with vma_wants_writenotify() we can't trigger this race
>since it's a purely read-only fault (requires kernel notification). What
>we need is to end up with a writable+clean entry which means VM_SHARED
>set leading to PAGE_SHARED attributes which have PTE_WRITE/DBM set. Note
>that set_pte_at() in 4.6 would mark the page as PTE_RDONLY since
>pte_sw_dirty() is false.
>
>> 2. Thread 1 on CPU0 wants to write to this page, so a page fault is
>>    triggered. In handle_pte_fault()->do_fault()->do_shared_fault(), a
>>    new page-cache page is allocated, and do_set_pte() calls
>>    "maybe_mkwrite(pte_mkdirty(entry), vma)" to build the pte entry.
>>    So the pte attributes should include PTE_DIRTY | PTE_WRITE.
>
>Yes but the scenario I had in mind was a read fault here rather than
>write which would set a PAGE_SHARED attributes ending up with
>PTE_WRITE|PTE_RDONLY (PTE_WRITE is the PTE_DBM bit).
>
>> 3. Thread 2 on CPU1 also wants to read this page; the fault was taken
>>    before Thread 1 had created the pte, so a page fault happens. At
>>    pte_offset_map(), it finds that the pte has already been created by
>>    Thread 1, so it goes straight to:
>>    
>>    entry = pte_mkyoung(entry);
>>    ptep_set_access_flags()
>>    
>>    ptep_set_access_flags() calls set_pte_at() to install the pte. But
>>    in the set_pte_at() function:
>>    
>>    	if (pte_present(pte)) {
>> 		if (pte_sw_dirty(pte) && pte_write(pte))
>> 			pte_val(pte) &= ~PTE_RDONLY;
>> 		else
>> 			pte_val(pte) |= PTE_RDONLY;
>> 		if (pte_user(pte) && pte_exec(pte) && !pte_special(pte))
>> 			__sync_icache_dcache(pte, addr);
>> 	}
>> 	
>> 	it will clear the PTE_RDONLY bit, because PTE_DIRTY | PTE_WRITE
>> 	are set in our scenario. Conversely, who would ever clear the
>> 	PTE_DIRTY bit?
>
>Correct for your scenario but not if point 2 is a read.
>
>> So I am quite confused by the scenario in the commit log of "arm64:
>> Implement ptep_set_access_flags() for hardware AF/DBM". Could you
>> point out what I am missing?
>
>If point 2 is a read fault, that goes via do_read_fault() and the pte
>ends up as clean with PTE_WRITE|PTE_RDONLY set since it's not
>pte_sw_dirty() (checked by set_pte_at()).
>
>Thread 2 on CPU1 would end up calling ptep_set_access_flags() on a
>read-only pte with DBM set because it took a read fault (same as Thread
>1).
>
>The problem appears if a Thread 3 on CPU2 performs a write access in
>parallel with point 3 above. CPU2 sees the pte as valid, RDONLY and DBM
>set, and proceeds to clearing the RDONLY bit in hardware. CPU1 then
>overrides the PTE_RDONLY bit if ptep_set_access_flags() is not atomic.
>
>Now you need to find a vm_operations_struct that allows shared, writable
>and clean mappings and does not set .page_mkwrite (shm_vm_ops is one).
>
>-- 
>Catalin

Hi Catalin,

Cool! Thanks for pointing that out. You are right, we should use shmem to reproduce this issue.
Let me walk through the scenario on Linux 4.6.

1. Initial phase: we create a shareable + writable shared anonymous (shmem) mapping via the mmap() API.
   
   In do_mmap(), vm_flags is set to VM_READ | VM_WRITE | VM_SHARED.
   Since shmem does not want write notification, vm_get_page_prot() takes the pte attributes from protection_map[] as __S011.
   
   In Linux 4.6, __S011 is PAGE_SHARED:

  #define __S011  PAGE_SHARED
  #define PAGE_SHARED		__pgprot(_PAGE_DEFAULT | PTE_USER | PTE_NG | PTE_PXN | PTE_UXN | PTE_WRITE)
   
   For PAGE_SHARED, PTE_RDONLY is zero, but PTE_WRITE (DBM) is 1.
   So the flags used for the lookup are VM_READ | VM_WRITE | VM_SHARED.
   
2. Thread 1 on CPU0 wants to read this page, so a page fault is triggered.
   In handle_pte_fault()->do_fault()->do_read_fault(), a new page-cache page is allocated, and do_set_pte()
   installs the entry, which goes through set_pte_at() in Linux 4.6:
   
   		if (pte_sw_dirty(pte) && pte_write(pte))
			pte_val(pte) &= ~PTE_RDONLY;
		else
			pte_val(pte) |= PTE_RDONLY;

	In this scenario PTE_DIRTY is not set, so set_pte_at() sets PTE_RDONLY.
   
   So the pte attributes should be: PTE_WRITE | PTE_RDONLY.
   
3. Thread 2 on CPU1 also wants to read this page; the fault was taken before Thread 1 had created the pte, so a page fault happens.
   At pte_offset_map(), it finds that the pte has already been created by Thread 1, so it goes straight to:
   
   entry = pte_mkyoung(entry);
   ptep_set_access_flags()
  
   If, in the interval between pte_mkyoung() and ptep_set_access_flags(), another thread T3 writes to this page, the hardware DBM
   mechanism clears PTE_RDONLY automatically.
   
   Then T2 executes ptep_set_access_flags(), which sets PTE_RDONLY back, so we lose the dirty status.
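
For completeness, my reading of commit 66dbd6e61a52 is that the fix
closes this window by doing the update with a load-exclusive/
store-exclusive loop, so a concurrent hardware PTE_RDONLY clear is never
overwritten. A condensed sketch (from memory, not the verbatim commit):

	int ptep_set_access_flags(struct vm_area_struct *vma,
				  unsigned long address, pte_t *ptep,
				  pte_t entry, int dirty)
	{
		pteval_t old_pteval;
		unsigned int tmp;

		if (pte_same(*ptep, entry))
			return 0;

		/* only preserve the access flags and write permission */
		pte_val(entry) &= PTE_AF | PTE_WRITE | PTE_DIRTY;

		/* the asm below clears PTE_RDONLY, so set it back for a
		 * clean entry */
		if (!pte_write(entry) || !pte_sw_dirty(entry))
			pte_val(entry) |= PTE_RDONLY;

		/* atomically clear PTE_RDONLY and OR in the new flags */
		asm volatile("1:	ldxr	%0, %2\n"
		"	and	%0, %0, %3		// clear PTE_RDONLY\n"
		"	orr	%0, %0, %4		// set flags\n"
		"	stxr	%w1, %0, %2\n"
		"	cbnz	%w1, 1b\n"
		: "=&r" (old_pteval), "=&r" (tmp), "+Q" (pte_val(*ptep))
		: "L" (~PTE_RDONLY), "r" (pte_val(entry)));

		flush_tlb_fix_spurious_fault(vma, address);
		return 1;
	}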

Thanks, Catalin, amazing!
   

Best
Ben shushu

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

