* Huge page(contiguous bit) slow down
@ 2018-09-18  3:02 Zhang, Lei
  2018-09-18 11:33 ` Will Deacon
  0 siblings, 1 reply; 13+ messages in thread
From: Zhang, Lei @ 2018-09-18  3:02 UTC (permalink / raw)
  To: linux-arm-kernel

I found a slowdown problem when using huge pages on an arm64 chip.
I think the cause is a bug in huge_ptep_set_access_flags() when it
handles contiguous ptes.
Could you review and merge my patch?


Incident:
A multi-threaded process takes page faults again and again, so the PC at
EL0 never advances to the next instruction.

Multiple threads fault at the same time on the same VA, which can slow
down or hang the process. A race on the pte update can occur when all of
the following conditions hold:
1. Multiple threads are running.
2. All threads are loading/storing to the same huge page.
3. The huge page consists of contiguous ptes (2MiB page = 64KiB page x 32).
4. The huge page is mapped without the MAP_POPULATE flag.

Mechanism:
The pte update takes four steps:

step1: create the new pte value
step2: zero the pte in ptep_get_and_clear
step3: flush the TLB in flush_tlb_range
step4: set the pte in set_pte_at (the pte becomes non-zero again)
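
In code, the window on the contiguous path looks roughly like this (a
simplified sketch assembled from the call tree below; the variable names
follow huge_ptep_set_access_flags(), this is not the exact source):

	/* step2 + step3: all ncontig ptes become zero and the TLB is flushed */
	orig_pte = get_clear_flush(vma->vm_mm, addr, ptep, pgsize, ncontig);

	/*
	 * Window: another thread faulting on this huge page here sees a
	 * zero pte and takes a fresh page fault.
	 */

	/* step4: write the updated ptes back out */
	for (i = 0; i < ncontig; i++, ptep++, addr += pgsize, pfn += dpfn)
		set_pte_at(vma->vm_mm, addr, ptep, pfn_pte(pfn, hugeprot));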

While thread1 is between step2 and step4, thread2 accesses the same huge
page, which raises a new page fault in thread2. Then, while thread2 is
between step2 and step4, thread1 retries its access to the same page,
which raises a new page fault in thread1. The threads repeat this cycle
again and again.

On the other hand, if the pte is not a contiguous pte, the slowdown or
hang does not occur, because before step2 (clearing the pte) the code
uses pte_same() to check whether the correct pte has already been
installed by another thread.
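
For reference, the early-out on the non-contiguous path has roughly this
shape (a sketch of the check, not the exact kernel source):

	pte_t old = READ_ONCE(*ptep);

	/*
	 * Another thread already installed the pte we want: nothing is
	 * cleared, so no refault.
	 */
	if (pte_same(old, pte))
		return 0;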

Call tree information:
hugetlb_fault
  huge_ptep_set_access_flags
    ptep_set_access_flags  -> (not called on the contiguous pte path)
      pte_same
    ptep_get_and_clear
    flush_tlb_range
    set_pte_at

So our patch applies the same check not only to non-contiguous ptes but
also to contiguous ptes.

-----------------------------
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -332,6 +332,9 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
        if (!pte_cont(pte))
                return ptep_set_access_flags(vma, addr, ptep, pte, dirty);

+       if(pte_same(pte, READ_ONCE(*ptep)))
+               return 0;
+
        ncontig = find_num_contig(vma->vm_mm, addr, ptep, &pgsize);
        dpfn = pgsize >> PAGE_SHIFT;

Best Regards
Lei Zhang
Email: zhang.lei at jp.fujitsu.com
FUJITSU LIMITED.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Huge page(contiguous bit) slow down
  2018-09-18  3:02 Huge page(contiguous bit) slow down Zhang, Lei
@ 2018-09-18 11:33 ` Will Deacon
  2018-09-18 14:58   ` Catalin Marinas
  2018-09-18 15:18   ` Punit Agrawal
  0 siblings, 2 replies; 13+ messages in thread
From: Will Deacon @ 2018-09-18 11:33 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Lei Zhang,

Thanks for the report and the initial diagnosis.

On Tue, Sep 18, 2018 at 03:02:17AM +0000, Zhang, Lei wrote:
> I found a slowdown problem when using huge pages on an arm64 chip.
> I think the cause is a bug in huge_ptep_set_access_flags() when it
> handles contiguous ptes.
> Could you review and merge my patch?

Before we get to that, here are some things that might help you when
reporting problems in the future:

1. Please keep your line lengths to <= 80 columns, since this makes the
   email easier to read. (I've reflowed your text in my reply)

2. Please try to CC the right maintainers on your report. If you're not
   sure who they are, then you can run ./scripts/get_maintainer.pl:

   $ ./scripts/get_maintainer.pl arch/arm64/mm/hugetlbpage.c

   identifies me and Catalin, for example. You can also use git blame,
   which shows that Steve and Punit wrote a lot of this function. I've
   added these people to CC.

3. If you have a patch that you'd like to be merged, you'll need a
   commit message that includes your "Signed-off-by:" line. You can use
   git format-patch to generate this, but you should also have a look at
   Documentation/process/submitting-patches.rst.

4. Please always mention the kernel version that you're seeing problems
   with, in case we've applied fixes to the problematic area in the
   meantime.

> Incident:
> A multi-threaded process takes page faults again and again, so the PC at
> EL0 never advances to the next instruction.
> 
> Multiple threads fault at the same time on the same VA, which can slow
> down or hang the process. A race on the pte update can occur when all of
> the following conditions hold:
> 1. Multiple threads are running.
> 2. All threads are loading/storing to the same huge page.
> 3. The huge page consists of contiguous ptes (2MiB page = 64KiB page x 32).
> 4. The huge page is mapped without the MAP_POPULATE flag.
> 
> Mechanism:
> The pte update takes four steps:
> 
> step1: create the new pte value
> step2: zero the pte in ptep_get_and_clear
> step3: flush the TLB in flush_tlb_range
> step4: set the pte in set_pte_at (the pte becomes non-zero again)
> 
> While thread1 is between step2 and step4, thread2 accesses the same huge
> page, which raises a new page fault in thread2. Then, while thread2 is
> between step2 and step4, thread1 retries its access to the same page,
> which raises a new page fault in thread1. The threads repeat this cycle
> again and again.

Hmm, yes, I can see how this happens. Whilst the mmap_sem should serialise
the faults, the contig hugepage code always clears the pte as part of the
BBM sequence, so we can get stuck in a fault cycle.

> On the other hand, if the pte is not a contiguous pte, the slowdown or
> hang does not occur, because before step2 (clearing the pte) the code
> uses pte_same() to check whether the correct pte has already been
> installed by another thread.
> 
> Call tree information:
> hugetlb_fault
>   huge_ptep_set_access_flags
>     ptep_set_access_flags  -> (not called on the contiguous pte path)
>       pte_same
>     ptep_get_and_clear
>     flush_tlb_range
>     set_pte_at
> 
> So our patch applies the same check not only to non-contiguous ptes but
> also to contiguous ptes.
> 
> -----------------------------
> --- a/arch/arm64/mm/hugetlbpage.c
> +++ b/arch/arm64/mm/hugetlbpage.c
> @@ -332,6 +332,9 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
>         if (!pte_cont(pte))
>                 return ptep_set_access_flags(vma, addr, ptep, pte, dirty);
> 
> +       if(pte_same(pte, READ_ONCE(*ptep)))
> +               return 0;
> +

This broadly seems to follow the non-contiguous code, but I wonder if we
can then drop the subsequent pte_same() check on this path and always return
1 when we actually update the entries?

Will

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Huge page(contiguous bit) slow down
  2018-09-18 11:33 ` Will Deacon
@ 2018-09-18 14:58   ` Catalin Marinas
  2018-09-18 15:16     ` Will Deacon
  2018-09-18 15:18   ` Punit Agrawal
  1 sibling, 1 reply; 13+ messages in thread
From: Catalin Marinas @ 2018-09-18 14:58 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Sep 18, 2018 at 12:33:01PM +0100, Will Deacon wrote:
> On Tue, Sep 18, 2018 at 03:02:17AM +0000, Zhang, Lei wrote:
> > --- a/arch/arm64/mm/hugetlbpage.c
> > +++ b/arch/arm64/mm/hugetlbpage.c
> > @@ -332,6 +332,9 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
> >         if (!pte_cont(pte))
> >                 return ptep_set_access_flags(vma, addr, ptep, pte, dirty);
> > 
> > +       if(pte_same(pte, READ_ONCE(*ptep)))
> > +               return 0;
> > +
> 
> This broadly seems to follow the non-contiguous code, but I wonder if we
> can then drop the subsequent pte_same() check on this path and always return
> 1 when we actually update the entries?

I don't remember why we went for first clearing and then checking
pte_same() (maybe Steve knows) but I think we can leave pte_same()
outside the get_clear_flush()/set_pte_at() block. This code is executed
with the mmap_sem taken, so there shouldn't be any race on the
individual ptes.

-- 
Catalin

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Huge page(contiguous bit) slow down
  2018-09-18 14:58   ` Catalin Marinas
@ 2018-09-18 15:16     ` Will Deacon
  2018-09-18 16:09       ` Catalin Marinas
  2018-09-18 20:26       ` Steve Capper
  0 siblings, 2 replies; 13+ messages in thread
From: Will Deacon @ 2018-09-18 15:16 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Sep 18, 2018 at 03:58:32PM +0100, Catalin Marinas wrote:
> On Tue, Sep 18, 2018 at 12:33:01PM +0100, Will Deacon wrote:
> > On Tue, Sep 18, 2018 at 03:02:17AM +0000, Zhang, Lei wrote:
> > > --- a/arch/arm64/mm/hugetlbpage.c
> > > +++ b/arch/arm64/mm/hugetlbpage.c
> > > @@ -332,6 +332,9 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
> > >         if (!pte_cont(pte))
> > >                 return ptep_set_access_flags(vma, addr, ptep, pte, dirty);
> > > 
> > > +       if(pte_same(pte, READ_ONCE(*ptep)))
> > > +               return 0;
> > > +
> > 
> > This broadly seems to follow the non-contiguous code, but I wonder if we
> > can then drop the subsequent pte_same() check on this path and always return
> > 1 when we actually update the entries?
> 
> I don't remember why we went for first clearing and then checking
> pte_same() (maybe Steve knows) but I think we can leave pte_same()
> outside the get_clear_flush()/set_pte_at() block. This code is executed
> with the mmap_sem taken, so there shouldn't be any race on the
> individual ptes.

I suspect it's just to avoid the additional load of the page-table entry,
since we still have to use get_clear_flush() even with this change.

One thing I don't really grok is the interaction between the contiguous
hint and HW_AFDBM. Is it possible for us to be e.g. halfway through the
set_pte_at() loop and then for the hardware to perform atomic PTE updates
for entries later in the loop? If so, we've got a race and need to use
cmpxchg() like we do for the non-contiguous code.
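
(By "use cmpxchg()" I mean an update loop of roughly this shape; a
simplified sketch, not the exact non-contiguous code, with PTE_AF
standing in for whichever bits we want to fold in:

	pteval_t old, new;

	do {
		old = READ_ONCE(pte_val(*ptep));
		new = old | PTE_AF;	/* fold in the bits we want to set */
	} while (cmpxchg_relaxed(&pte_val(*ptep), old, new) != old);

so a concurrent hardware AF/DBM update between the read and the write
is never lost.)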

Will

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Huge page(contiguous bit) slow down
  2018-09-18 11:33 ` Will Deacon
  2018-09-18 14:58   ` Catalin Marinas
@ 2018-09-18 15:18   ` Punit Agrawal
  1 sibling, 0 replies; 13+ messages in thread
From: Punit Agrawal @ 2018-09-18 15:18 UTC (permalink / raw)
  To: linux-arm-kernel

Will Deacon <will.deacon@arm.com> writes:

> Hi Lei Zhang,
>
> Thanks for the report and the initial diagnosis.
>
> On Tue, Sep 18, 2018 at 03:02:17AM +0000, Zhang, Lei wrote:
>> I found a slowdown problem when using huge pages on an arm64 chip.
>> I think the cause is a bug in huge_ptep_set_access_flags() when it
>> handles contiguous ptes.
>> Could you review and merge my patch?
>
> Before we get to that, here are some things that might help you when
> reporting problems in the future:
>
> 1. Please keep your line lengths to <= 80 columns, since this makes the
>    email easier to read. (I've reflowed your text in my reply)
>
> 2. Please try to CC the right maintainers on your report. If you're not
>    sure who they are, then you can run ./scripts/get_maintainer.pl:
>
>    $ ./scripts/get_maintainer.pl arch/arm64/mm/hugetlbpage.c
>
>    identifies me and Catalin, for example. You can also use git blame,
>    which shows that Steve and Punit wrote a lot of this function. I've
>    added these people to CC.
>
> 3. If you have a patch that you'd like to be merged, you'll need a
>    commit message that includes your "Signed-off-by:" line. You can use
>    git format-patch to generate this, but you should also have a look at
>    Documentation/process/submitting-patches.rst.
>
> 4. Please always mention the kernel version that you're seeing problems
>    with, in case we've applied fixes to the problematic area in the
>    meantime.
>
>> Incident:
>> A multi-threaded process takes page faults again and again, so the PC at
>> EL0 never advances to the next instruction.
>> 
>> Multiple threads fault at the same time on the same VA, which can slow
>> down or hang the process. A race on the pte update can occur when all of
>> the following conditions hold:
>> 1. Multiple threads are running.
>> 2. All threads are loading/storing to the same huge page.
>> 3. The huge page consists of contiguous ptes (2MiB page = 64KiB page x 32).
>> 4. The huge page is mapped without the MAP_POPULATE flag.
>> 
>> Mechanism:
>> The pte update takes four steps:
>> 
>> step1: create the new pte value
>> step2: zero the pte in ptep_get_and_clear
>> step3: flush the TLB in flush_tlb_range
>> step4: set the pte in set_pte_at (the pte becomes non-zero again)
>> 
>> While thread1 is between step2 and step4, thread2 accesses the same huge
>> page, which raises a new page fault in thread2. Then, while thread2 is
>> between step2 and step4, thread1 retries its access to the same page,
>> which raises a new page fault in thread1. The threads repeat this cycle
>> again and again.
>
> Hmm, yes, I can see how this happens. Whilst the mmap_sem should serialise
> the faults, the contig hugepage code always clears the pte as part of the
> BBM sequence, so we can get stuck in a fault cycle.

A similar issue was reported/fixed for stage 2 fault handling recently
as well. Instead of userspace, vcpus were stuck in the fault cycle
there.

See commit 976d34e2dab1 ("KVM: arm/arm64: Skip updating PTE entry if no
change"), and commit 86658b819cd0 ("KVM: arm/arm64: Skip updating PMD
entry if no change").

[...]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Huge page(contiguous bit) slow down
  2018-09-18 15:16     ` Will Deacon
@ 2018-09-18 16:09       ` Catalin Marinas
  2018-09-18 16:14         ` Will Deacon
  2018-09-18 20:26       ` Steve Capper
  1 sibling, 1 reply; 13+ messages in thread
From: Catalin Marinas @ 2018-09-18 16:09 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Sep 18, 2018 at 04:16:26PM +0100, Will Deacon wrote:
> On Tue, Sep 18, 2018 at 03:58:32PM +0100, Catalin Marinas wrote:
> > On Tue, Sep 18, 2018 at 12:33:01PM +0100, Will Deacon wrote:
> > > On Tue, Sep 18, 2018 at 03:02:17AM +0000, Zhang, Lei wrote:
> > > > --- a/arch/arm64/mm/hugetlbpage.c
> > > > +++ b/arch/arm64/mm/hugetlbpage.c
> > > > @@ -332,6 +332,9 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
> > > >         if (!pte_cont(pte))
> > > >                 return ptep_set_access_flags(vma, addr, ptep, pte, dirty);
> > > > 
> > > > +       if(pte_same(pte, READ_ONCE(*ptep)))
> > > > +               return 0;
> > > > +
> > > 
> > > This broadly seems to follow the non-contiguous code, but I wonder if we
> > > can then drop the subsequent pte_same() check on this path and always return
> > > 1 when we actually update the entries?
> > 
> > I don't remember why we went for first clearing and then checking
> > pte_same() (maybe Steve knows) but I think we can leave pte_same()
> > outside the get_clear_flush()/set_pte_at() block. This code is executed
> > with the mmap_sem taken, so there shouldn't be any race on the
> > individual ptes.
> 
> I suspect it's just to avoid the additional load of the page-table entry,
> since we still have to use get_clear_flush() even with this change.
> 
> One thing I don't really grok is the interaction between the contiguous
> hint and HW_AFDBM. Is it possible for us to be e.g. halfway through the
> set_pte_at() loop and then for the hardware to perform atomic PTE updates
> for entries later in the loop? If so, we've got a race and need to use
> cmpxchg() like we do for the non-contiguous code.

With the current code, no, since get_clear_flush() sets all of them to
0, so no hardware updates before set_pte_at().

-- 
Catalin

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Huge page(contiguous bit) slow down
  2018-09-18 16:09       ` Catalin Marinas
@ 2018-09-18 16:14         ` Will Deacon
  2018-09-18 17:11           ` Catalin Marinas
  0 siblings, 1 reply; 13+ messages in thread
From: Will Deacon @ 2018-09-18 16:14 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Sep 18, 2018 at 05:09:28PM +0100, Catalin Marinas wrote:
> On Tue, Sep 18, 2018 at 04:16:26PM +0100, Will Deacon wrote:
> > On Tue, Sep 18, 2018 at 03:58:32PM +0100, Catalin Marinas wrote:
> > > On Tue, Sep 18, 2018 at 12:33:01PM +0100, Will Deacon wrote:
> > > > On Tue, Sep 18, 2018 at 03:02:17AM +0000, Zhang, Lei wrote:
> > > > > --- a/arch/arm64/mm/hugetlbpage.c
> > > > > +++ b/arch/arm64/mm/hugetlbpage.c
> > > > > @@ -332,6 +332,9 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
> > > > >         if (!pte_cont(pte))
> > > > >                 return ptep_set_access_flags(vma, addr, ptep, pte, dirty);
> > > > > 
> > > > > +       if(pte_same(pte, READ_ONCE(*ptep)))
> > > > > +               return 0;
> > > > > +
> > > > 
> > > > This broadly seems to follow the non-contiguous code, but I wonder if we
> > > > can then drop the subsequent pte_same() check on this path and always return
> > > > 1 when we actually update the entries?
> > > 
> > > I don't remember why we went for first clearing and then checking
> > > pte_same() (maybe Steve knows) but I think we can leave pte_same()
> > > outside the get_clear_flush()/set_pte_at() block. This code is executed
> > > with the mmap_sem taken, so there shouldn't be any race on the
> > > individual ptes.
> > 
> > I suspect it's just to avoid the additional load of the page-table entry,
> > since we still have to use get_clear_flush() even with this change.
> > 
> > One thing I don't really grok is the interaction between the contiguous
> > hint and HW_AFDBM. Is it possible for us to be e.g. halfway through the
> > set_pte_at() loop and then for the hardware to perform atomic PTE updates
> > for entries later in the loop? If so, we've got a race and need to use
> > cmpxchg() like we do for the non-contiguous code.
> 
> With the current code, no, since get_clear_flush() sets all of them to
> 0, so no hardware updates before set_pte_at().

The case I'm concerned about is when we've set_pte_at() half of the mapping,
though. At this point, a CPU can get a translation via one of the entries
that we've put down, and it's not clear to me whether this could establish
a contiguous TLB entry which could then result in access/dirty updates to
PTEs that we haven't yet written out.

Will

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Huge page(contiguous bit) slow down
  2018-09-18 16:14         ` Will Deacon
@ 2018-09-18 17:11           ` Catalin Marinas
  2018-09-18 20:30             ` Steve Capper
  2018-09-19 10:29             ` Will Deacon
  0 siblings, 2 replies; 13+ messages in thread
From: Catalin Marinas @ 2018-09-18 17:11 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Sep 18, 2018 at 05:14:33PM +0100, Will Deacon wrote:
> On Tue, Sep 18, 2018 at 05:09:28PM +0100, Catalin Marinas wrote:
> > On Tue, Sep 18, 2018 at 04:16:26PM +0100, Will Deacon wrote:
> > > On Tue, Sep 18, 2018 at 03:58:32PM +0100, Catalin Marinas wrote:
> > > > On Tue, Sep 18, 2018 at 12:33:01PM +0100, Will Deacon wrote:
> > > > > On Tue, Sep 18, 2018 at 03:02:17AM +0000, Zhang, Lei wrote:
> > > > > > --- a/arch/arm64/mm/hugetlbpage.c
> > > > > > +++ b/arch/arm64/mm/hugetlbpage.c
> > > > > > @@ -332,6 +332,9 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
> > > > > >         if (!pte_cont(pte))
> > > > > >                 return ptep_set_access_flags(vma, addr, ptep, pte, dirty);
> > > > > > 
> > > > > > +       if(pte_same(pte, READ_ONCE(*ptep)))
> > > > > > +               return 0;
> > > > > > +
> > > > > 
> > > > > This broadly seems to follow the non-contiguous code, but I wonder if we
> > > > > can then drop the subsequent pte_same() check on this path and always return
> > > > > 1 when we actually update the entries?
> > > > 
> > > > I don't remember why we went for first clearing and then checking
> > > > pte_same() (maybe Steve knows) but I think we can leave pte_same()
> > > > outside the get_clear_flush()/set_pte_at() block. This code is executed
> > > > with the mmap_sem taken, so there shouldn't be any race on the
> > > > individual ptes.
> > > 
> > > I suspect it's just to avoid the additional load of the page-table entry,
> > > since we still have to use get_clear_flush() even with this change.
> > > 
> > > One thing I don't really grok is the interaction between the contiguous
> > > hint and HW_AFDBM. Is it possible for us to be e.g. halfway through the
> > > set_pte_at() loop and then for the hardware to perform atomic PTE updates
> > > for entries later in the loop? If so, we've got a race and need to use
> > > cmpxchg() like we do for the non-contiguous code.
> > 
> > With the current code, no, since get_clear_flush() sets all of them to
> > 0, so no hardware updates before set_pte_at().
> 
> The case I'm concerned about is when we've set_pte_at() half of the mapping,
> though. At this point, a CPU can get a translation via one of the entries
> that we've put down, and it's not clear to me whether this could establish
> a contiguous TLB entry which could then result in access/dirty updates to
> PTEs that we haven't yet written out.

Ah, I see what you meant. The actual updates are not done based on the
TLB information but rather the CPU performs a read-modify-write of the
entry (when an existing TLB entry would give fault; for writes as we
don't cache the no access flag in the TLB).

The AF case is simpler as the hardware doesn't cache an AF=0 pte in the
TLB. For DBM, we could indeed have a contiguous writable+clean (DBM=1,
AP[2] = 1) entry covering not yet written ptes. The ARM ARM describes
that the AP[2] bit is updated (cleared) atomically only if the DBM bit
is set in the pte:

  If, for a write access, the PE finds that a cached copy of the
  descriptor in a TLB had the DBM bit set to 1 and the AP[2] or S2AP[1]
  bit set to the value that forbids writes, then the PE must check that
  the cached copy is not stale with regard to the descriptor entry in
  memory, and if necessary perform an atomic read-modify-write update of
  the descriptor in memory. This applies if the cached copy of the
  descriptor in a TLB is either:
  - A stage 1 descriptor in which DBM has the value 1 and AP[2] has the
    value 1.
  - A stage 2 descriptor in which DBM has the value 1 and S2AP[1] has
    the value 0.

(and there are some further notes in the ARM ARM; I think we don't
have an issue here)

-- 
Catalin

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Huge page(contiguous bit) slow down
  2018-09-18 15:16     ` Will Deacon
  2018-09-18 16:09       ` Catalin Marinas
@ 2018-09-18 20:26       ` Steve Capper
  1 sibling, 0 replies; 13+ messages in thread
From: Steve Capper @ 2018-09-18 20:26 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Sep 18, 2018 at 04:16:26PM +0100, Will Deacon wrote:
> On Tue, Sep 18, 2018 at 03:58:32PM +0100, Catalin Marinas wrote:
> > On Tue, Sep 18, 2018 at 12:33:01PM +0100, Will Deacon wrote:
> > > On Tue, Sep 18, 2018 at 03:02:17AM +0000, Zhang, Lei wrote:
> > > > --- a/arch/arm64/mm/hugetlbpage.c
> > > > +++ b/arch/arm64/mm/hugetlbpage.c
> > > > @@ -332,6 +332,9 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
> > > >         if (!pte_cont(pte))
> > > >                 return ptep_set_access_flags(vma, addr, ptep, pte, dirty);
> > > > 
> > > > +       if(pte_same(pte, READ_ONCE(*ptep)))
> > > > +               return 0;
> > > > +
> > > 
> > > This broadly seems to follow the non-contiguous code, but I wonder if we
> > > can then drop the subsequent pte_same() check on this path and always return
> > > 1 when we actually update the entries?
> > 
> > I don't remember why we went for first clearing and then checking
> > pte_same() (maybe Steve knows) but I think we can leave pte_same()
> > outside the get_clear_flush()/set_pte_at() block. This code is executed
> > with the mmap_sem taken, so there shouldn't be any race on the
> > individual ptes.
> 
> I suspect it's just to avoid the additional load of the page-table entry,
> since we still have to use get_clear_flush() even with this change.
> 
> One thing I don't really grok is the interaction between the contiguous
> hint and HW_AFDBM. Is it possible for us to be e.g. halfway through the
> set_pte_at() loop and then for the hardware to perform atomic PTE updates
> for entries later in the loop? If so, we've got a race and need to use
> cmpxchg() like we do for the non-contiguous code.

Hi,
Yes, this was precisely the case; I think I "over-optimised" the code
path in huge_ptep_set_access_flags().

I think it would be better to check first to see if a change is needed
in a similar manner as Lei Zhang's patch.

One thing worth considering is that some ptes in the contiguous range
may be dirty, so rather than just:
 if(pte_same(pte, READ_ONCE(*ptep)))
                 return 0;

We may instead want to check all the ptes in the contiguous range for
dirty ones?
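
Something like the following, perhaps (an untested sketch; it reads each
entry with huge_ptep_get() and folds any dirty bit into orig_pte):

	/* collect dirty state from the whole contiguous range */
	for (i = 0; i < ncontig; i++)
		if (pte_dirty(huge_ptep_get(ptep + i)))
			orig_pte = huge_pte_mkdirty(orig_pte);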

Cheers,
-- 
Steve

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Huge page(contiguous bit) slow down
  2018-09-18 17:11           ` Catalin Marinas
@ 2018-09-18 20:30             ` Steve Capper
  2018-09-19 10:29             ` Will Deacon
  1 sibling, 0 replies; 13+ messages in thread
From: Steve Capper @ 2018-09-18 20:30 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Sep 18, 2018 at 06:11:34PM +0100, Catalin Marinas wrote:
> On Tue, Sep 18, 2018 at 05:14:33PM +0100, Will Deacon wrote:
> > On Tue, Sep 18, 2018 at 05:09:28PM +0100, Catalin Marinas wrote:
> > > On Tue, Sep 18, 2018 at 04:16:26PM +0100, Will Deacon wrote:
> > > > On Tue, Sep 18, 2018 at 03:58:32PM +0100, Catalin Marinas wrote:
> > > > > On Tue, Sep 18, 2018 at 12:33:01PM +0100, Will Deacon wrote:
> > > > > > On Tue, Sep 18, 2018 at 03:02:17AM +0000, Zhang, Lei wrote:
> > > > > > > --- a/arch/arm64/mm/hugetlbpage.c
> > > > > > > +++ b/arch/arm64/mm/hugetlbpage.c
> > > > > > > @@ -332,6 +332,9 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
> > > > > > >         if (!pte_cont(pte))
> > > > > > >                 return ptep_set_access_flags(vma, addr, ptep, pte, dirty);
> > > > > > > 
> > > > > > > +       if(pte_same(pte, READ_ONCE(*ptep)))
> > > > > > > +               return 0;
> > > > > > > +
> > > > > > 
> > > > > > This broadly seems to follow the non-contiguous code, but I wonder if we
> > > > > > can then drop the subsequent pte_same() check on this path and always return
> > > > > > 1 when we actually update the entries?
> > > > > 
> > > > > I don't remember why we went for first clearing and then checking
> > > > > pte_same() (maybe Steve knows) but I think we can leave pte_same()
> > > > > outside the get_clear_flush()/set_pte_at() block. This code is executed
> > > > > with the mmap_sem taken, so there shouldn't be any race on the
> > > > > individual ptes.
> > > > 
> > > > I suspect it's just to avoid the additional load of the page-table entry,
> > > > since we still have to use get_clear_flush() even with this change.
> > > > 
> > > > One thing I don't really grok is the interaction between the contiguous
> > > > hint and HW_AFDBM. Is it possible for us to be e.g. halfway through the
> > > > set_pte_at() loop and then for the hardware to perform atomic PTE updates
> > > > for entries later in the loop? If so, we've got a race and need to use
> > > > cmpxchg() like we do for the non-contiguous code.
> > > 
> > > With the current code, no, since get_clear_flush() sets all of them to
> > > 0, so no hardware updates before set_pte_at().
> > 
> > The case I'm concerned about is when we've set_pte_at() half of the mapping,
> > though. At this point, a CPU can get a translation via one of the entries
> > that we've put down, and it's not clear to me whether this could establish
> > a contiguous TLB entry which could then result in access/dirty updates to
> > PTEs that we haven't yet written out.
> 
> Ah, I see what you meant. The actual updates are not done based on the
> TLB information but rather the CPU performs a read-modify-write of the
> entry (when an existing TLB entry would give fault; for writes as we
> don't cache the no access flag in the TLB).
> 
> The AF case is simpler as the hardware doesn't cache an AF=0 pte in the
> TLB. For DBM, we could indeed have a contiguous writable+clean (DBM=1,
> AP[2] = 1) entry covering not yet written ptes. The ARM ARM describes
> that the AP[2] bit is updated (cleared) atomically only if the DBM bit
> is set in the pte:
> 
>   If, for a write access, the PE finds that a cached copy of the
>   descriptor in a TLB had the DBM bit set to 1 and the AP[2] or S2AP[1]
>   bit set to the value that forbids writes, then the PE must check that
>   the cached copy is not stale with regard to the descriptor entry in
>   memory, and if necessary perform an atomic read-modify-write update of
>   the descriptor in memory. This applies if the cached copy of the
>   descriptor in a TLB is either:
>   - A stage 1 descriptor in which DBM has the value 1 and AP[2] has the
>     value 1.
>   - A stage 2 descriptor in which DBM has the value 1 and S2AP[1] has
>     the value 0.
> 
> (and there are some further notes in the ARM ARM; I think we don't
> have an issue here)
> 

FWIW, all hugetlb ptes are young so AF should always be set for present
hugetlb pages.

Cheers,
-- 
Steve

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Huge page(contiguous bit) slow down
  2018-09-18 17:11           ` Catalin Marinas
  2018-09-18 20:30             ` Steve Capper
@ 2018-09-19 10:29             ` Will Deacon
  2018-09-19 15:37               ` Steve Capper
  1 sibling, 1 reply; 13+ messages in thread
From: Will Deacon @ 2018-09-19 10:29 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Sep 18, 2018 at 06:11:34PM +0100, Catalin Marinas wrote:
> On Tue, Sep 18, 2018 at 05:14:33PM +0100, Will Deacon wrote:
> > On Tue, Sep 18, 2018 at 05:09:28PM +0100, Catalin Marinas wrote:
> > > On Tue, Sep 18, 2018 at 04:16:26PM +0100, Will Deacon wrote:
> > > > One thing I don't really grok is the interaction between the contiguous
> > > > hint and HW_AFDBM. Is it possible for us to be e.g. halfway through the
> > > > set_pte_at() loop and then for the hardware to perform atomic PTE updates
> > > > for entries later in the loop? If so, we've got a race and need to use
> > > > cmpxchg() like we do for the non-contiguous code.
> > > 
> > > With the current code, no, since get_clear_flush() sets all of them to
> > > 0, so no hardware updates before set_pte_at().
> > 
> > The case I'm concerned about is when we've set_pte_at() half of the mapping,
> > though. At this point, a CPU can get a translation via one of the entries
> > that we've put down, and it's not clear to me whether this could establish
> > a contiguous TLB entry which could then result in access/dirty updates to
> > PTEs that we haven't yet written out.
> 
> Ah, I see what you meant. The actual updates are not done based on the
> TLB information but rather the CPU performs a read-modify-write of the
> entry (when an existing TLB entry would give fault; for writes as we
> don't cache the no access flag in the TLB).
> 
> The AF case is simpler as the hardware doesn't cache an AF=0 pte in the
> TLB. For DBM, we could indeed have a contiguous writable+clean (DBM=1,
> AP[2] = 1) entry covering not yet written ptes. The ARM ARM describes
> that the AP[2] bit is updated (cleared) atomically only if the DBM bit
> is set in the pte:
> 
>   If, for a write access, the PE finds that a cached copy of the
>   descriptor in a TLB had the DBM bit set to 1 and the AP[2] or S2AP[1]
>   bit set to the value that forbids writes, then the PE must check that
>   the cached copy is not stale with regard to the descriptor entry in
>   memory, and if necessary perform an atomic read-modify-write update of
>   the descriptor in memory. This applies if the cached copy of the
>   descriptor in a TLB is either:
>   - A stage 1 descriptor in which DBM has the value 1 and AP[2] has the
>     value 1.
>   - A stage 2 descriptor in which DBM has the value 1 and S2AP[1] has
>     the value 0.
> 
> (and there are some further notes in the ARM ARM; I think we don't
> have an issue here)

Thanks for the clarification, I also think we're alright here. The concern
would be if the CPU can, for example, always check the first PTE in the
contiguous range yet perform an atomic update on a different one. I have no
idea why a CPU would do this, so I think the architecture text can just be
interpreted a bit broadly there and I'll see if I can raise a clarification.

However, as you mention off-list, there is a problem with the proposed
solution of using pte_same() only on the first entry of the range. If a
CPU ignores the contiguous hint (as it is permitted to do), then we could
repeatedly fault on a different PTE within the range (for example, if we
are using software dirty-bit) and get stuck.

Something like the diff below might help, but I haven't tested it. What do
you think?

Will

--->8

diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 192b3ba07075..edae6774132d 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -324,7 +324,7 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
 			       unsigned long addr, pte_t *ptep,
 			       pte_t pte, int dirty)
 {
-	int ncontig, i, changed = 0;
+	int ncontig, i;
 	size_t pgsize = 0;
 	unsigned long pfn = pte_pfn(pte), dpfn;
 	pgprot_t hugeprot;
@@ -336,9 +336,14 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
 	ncontig = find_num_contig(vma->vm_mm, addr, ptep, &pgsize);
 	dpfn = pgsize >> PAGE_SHIFT;
 
+	for (i = 0; i < ncontig; i++)
+		if (!pte_same(huge_ptep_get(ptep + i), pte))
+			break;
+
+	if (i == ncontig)
+		return 0;
+
 	orig_pte = get_clear_flush(vma->vm_mm, addr, ptep, pgsize, ncontig);
-	if (!pte_same(orig_pte, pte))
-		changed = 1;
 
 	/* Make sure we don't lose the dirty state */
 	if (pte_dirty(orig_pte))
@@ -348,7 +353,7 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
 	for (i = 0; i < ncontig; i++, ptep++, addr += pgsize, pfn += dpfn)
 		set_pte_at(vma->vm_mm, addr, ptep, pfn_pte(pfn, hugeprot));
 
-	return changed;
+	return 1;
 }
 
 void huge_ptep_set_wrprotect(struct mm_struct *mm,

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Huge page(contiguous bit) slow down
  2018-09-19 10:29             ` Will Deacon
@ 2018-09-19 15:37               ` Steve Capper
  2018-09-19 16:05                 ` Will Deacon
  0 siblings, 1 reply; 13+ messages in thread
From: Steve Capper @ 2018-09-19 15:37 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Sep 19, 2018 at 11:29:37AM +0100, Will Deacon wrote:
> On Tue, Sep 18, 2018 at 06:11:34PM +0100, Catalin Marinas wrote:
> > On Tue, Sep 18, 2018 at 05:14:33PM +0100, Will Deacon wrote:
> > > On Tue, Sep 18, 2018 at 05:09:28PM +0100, Catalin Marinas wrote:
> > > > On Tue, Sep 18, 2018 at 04:16:26PM +0100, Will Deacon wrote:
> > > > > One thing I don't really grok is the interaction between the contiguous
> > > > > hint and HW_AFDBM. Is it possible for us to be e.g. halfway through the
> > > > > set_pte_at() loop and then for the hardware to perform atomic PTE updates
> > > > > for entries later in the loop? If so, we've got a race and need to use
> > > > > cmpxchg() like we do for the non-contiguous code.
> > > > 
> > > > With the current code, no, since get_clear_flush() sets all of them to
> > > > 0, so no hardware updates before set_pte_at().
> > > 
> > > The case I'm concerned about is when we've set_pte_at() half of the mapping,
> > > though. At this point, a CPU can get a translation via one of the entries
> > > that we've put down, and it's not clear to me whether this could establish
> > > a contiguous TLB entry which could then result in access/dirty updates to
> > > PTEs that we haven't yet written out.
> > 
> > Ah, I see what you meant. The actual updates are not done based on the
> > TLB information but rather the CPU performs a read-modify-write of the
> > entry (when an existing TLB entry would give fault; for writes as we
> > don't cache the no access flag in the TLB).
> > 
> > The AF case is simpler as the hardware doesn't cache an AF=0 pte in the
> > TLB. For DBM, we could indeed have a contiguous writable+clean (DBM=1,
> > AP[2] = 1) entry covering not yet written ptes. The ARM ARM describes
> > that the AP[2] bit is updated (cleared) atomically only if the DBM bit
> > is set in the pte:
> > 
> >   If, for a write access, the PE finds that a cached copy of the
> >   descriptor in a TLB had the DBM bit set to 1 and the AP[2] or S2AP[1]
> >   bit set to the value that forbids writes, then the PE must check that
> >   the cached copy is not stale with regard to the descriptor entry in
> >   memory, and if necessary perform an atomic read-modify-write update of
> >   the descriptor in memory. This applies if the cached copy of the
> >   descriptor in a TLB is either:
> >   - A stage 1 descriptor in which DBM has the value 1 and AP[2] has the
> >     value 1.
> >   - A stage 2 descriptor in which DBM has the value 1 and S2AP[1] has
> >     the value 0.
> > 
> > (and there are some further notes in the ARM ARM; I think we don't
> > have an issue here)
> 
> Thanks for the clarification, I also think we're alright here. The concern
> would be if the CPU can, for example, always check the first PTE in the
> contiguous range yet perform an atomic update on a different one. I have no
> idea why a CPU would do this, so I think the architecture text can just be
> interpreted a bit broadly there and I'll see if I can raise a clarification.
> 
> However, as you mention off-list, there is a problem with the proposed
> solution of using pte_same() only on the first entry of the range. If a
> CPU ignores the contiguous hint (as it is permitted to do), then we could
> repeatedly fault on a different PTE within the range (for example, if we
> are using software dirty-bit) and get stuck.
> 
> Something like the diff below might help, but I haven't tested it. What do
> you think?
> 
> Will

Hi Will,

> 
> --->8
> 
> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
> index 192b3ba07075..edae6774132d 100644
> --- a/arch/arm64/mm/hugetlbpage.c
> +++ b/arch/arm64/mm/hugetlbpage.c
> @@ -324,7 +324,7 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
>  			       unsigned long addr, pte_t *ptep,
>  			       pte_t pte, int dirty)
>  {
> -	int ncontig, i, changed = 0;
> +	int ncontig, i;
>  	size_t pgsize = 0;
>  	unsigned long pfn = pte_pfn(pte), dpfn;
>  	pgprot_t hugeprot;
> @@ -336,9 +336,14 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
>  	ncontig = find_num_contig(vma->vm_mm, addr, ptep, &pgsize);
>  	dpfn = pgsize >> PAGE_SHIFT;
>  
> +	for (i = 0; i < ncontig; i++)
> +		if (!pte_same(huge_ptep_get(ptep + i), pte))
> +			break;
> +

This comparison will always fail as our pfns will be different. I have a work
in progress fix (a separate function to go through the contiguous range
to preserve the dirty information) that I'll test and post shortly.
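
The expected value would have to be rebuilt per entry, roughly like this
(an untested sketch, with pfn, dpfn and hugeprot set up as in the
existing function):

	for (i = 0; i < ncontig; i++, pfn += dpfn)
		if (!pte_same(huge_ptep_get(ptep + i),
			      pfn_pte(pfn, hugeprot)))
			break;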

Cheers,
-- 
Steve

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Huge page(contiguous bit) slow down
  2018-09-19 15:37               ` Steve Capper
@ 2018-09-19 16:05                 ` Will Deacon
  0 siblings, 0 replies; 13+ messages in thread
From: Will Deacon @ 2018-09-19 16:05 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Sep 19, 2018 at 04:37:08PM +0100, Steve Capper wrote:
> On Wed, Sep 19, 2018 at 11:29:37AM +0100, Will Deacon wrote:
> > diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
> > index 192b3ba07075..edae6774132d 100644
> > --- a/arch/arm64/mm/hugetlbpage.c
> > +++ b/arch/arm64/mm/hugetlbpage.c
> > @@ -324,7 +324,7 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
> >  			       unsigned long addr, pte_t *ptep,
> >  			       pte_t pte, int dirty)
> >  {
> > -	int ncontig, i, changed = 0;
> > +	int ncontig, i;
> >  	size_t pgsize = 0;
> >  	unsigned long pfn = pte_pfn(pte), dpfn;
> >  	pgprot_t hugeprot;
> > @@ -336,9 +336,14 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
> >  	ncontig = find_num_contig(vma->vm_mm, addr, ptep, &pgsize);
> >  	dpfn = pgsize >> PAGE_SHIFT;
> >  
> > +	for (i = 0; i < ncontig; i++)
> > +		if (!pte_same(huge_ptep_get(ptep + i), pte))
> > +			break;
> > +
> 
> This comparison will always fail as our pfns will be different. I have a work
> in progress fix (a separate function to go through the contiguous range
> to preserve the dirty information) that I'll test and post shortly.

Ha, yes of course. I forgot about the most useful part of the pte!
I'll take a look at your patch when you post it.

Will

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2018-09-19 16:05 UTC | newest]

Thread overview: 13+ messages
-- links below jump to the message on this page --
2018-09-18  3:02 Huge page(contiguous bit) slow down Zhang, Lei
2018-09-18 11:33 ` Will Deacon
2018-09-18 14:58   ` Catalin Marinas
2018-09-18 15:16     ` Will Deacon
2018-09-18 16:09       ` Catalin Marinas
2018-09-18 16:14         ` Will Deacon
2018-09-18 17:11           ` Catalin Marinas
2018-09-18 20:30             ` Steve Capper
2018-09-19 10:29             ` Will Deacon
2018-09-19 15:37               ` Steve Capper
2018-09-19 16:05                 ` Will Deacon
2018-09-18 20:26       ` Steve Capper
2018-09-18 15:18   ` Punit Agrawal
