diff for duplicates of <9FE19350E8A7EE45B64D8D63D368C8966B847F54@SHSMSX101.ccr.corp.intel.com>
Hi Laurent,
Regression tests for the v11 patch series have been run; some regressions were found by LKP-tools (Linux Kernel Performance)
tested on an Intel 4-socket Skylake platform. This time only the cases which had been run and found regressions on the
V9 patch series were tested.
The regression result is sorted by the metric will-it-scale.per_thread_ops.
Benchmark: will-it-scale
Download link: https://github.com/antonblanchard/will-it-scale/tree/master
Metrics:
 will-it-scale.per_process_ops=processes/nr_cpu
 will-it-scale.per_thread_ops=threads/nr_cpu
 test box: lkp-skl-4sp1(nr_cpu=192,memory=768G)
THP: enable / disable
nr_task:100%
1. Regressions:
a). Enable THP
testcase                        base      change    head      metric
page_fault3/enable THP          10519     -20.5%    836       will-it-scale.per_thread_ops
page_fault2/enable THP          8281      -18.8%    6728      will-it-scale.per_thread_ops
brk1/enable THP                 998475    -2.2%     976893    will-it-scale.per_process_ops
context_switch1/enable THP      223910    -1.3%     220930    will-it-scale.per_process_ops
context_switch1/enable THP      233722    -1.0%     231288    will-it-scale.per_thread_ops
b). Disable THP
page_fault3/disable THP         10856     -23.1%    8344      will-it-scale.per_thread_ops
page_fault2/disable THP         8147      -18.8%    6613      will-it-scale.per_thread_ops
brk1/disable THP                957       -7.9%     881       will-it-scale.per_thread_ops
context_switch1/disable THP     237006    -2.2%     231907    will-it-scale.per_thread_ops
brk1/disable THP                997317    -2.0%     977778    will-it-scale.per_process_ops
page_fault3/disable THP         467454    -1.8%     459251    will-it-scale.per_process_ops
context_switch1/disable THP     224431    -1.3%     221567    will-it-scale.per_process_ops
Notes: for the above test result values, higher is better.
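The "change" column in the tables above is the usual relative delta between the base and head measurements. A minimal sketch (not part of LKP-tools, purely illustrative) of that computation:

```python
# Reproduce the "change" column from a base and head measurement,
# as in the will-it-scale tables above.
def percent_change(base, head):
    """Relative change of head vs. base, in percent (negative = regression)."""
    return (head - base) / base * 100.0

# page_fault2/enable THP: base 8281, head 6728 -> about -18.8%
change = percent_change(8281, 6728)
print(f"{change:+.1f}%")
```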
Best regards
Haiyan Song
________________________________________
From: owner-linux-mm@kvack.org [owner-linux-mm@kvack.org] on behalf of Laurent Dufour [ldufour@linux.vnet.ibm.com]
Sent: Monday, May 28, 2018 4:54 PM
To: Song, HaiyanX
Cc: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi; linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; x86@kernel.org
Subject: Re: [PATCH v11 00/26] Speculative page faults
On 28/05/2018 10:22, Haiyan Song wrote:
Do you plan to give this V11 a run ?
> On Mon, May 28, 2018 at 09:51:34AM +0200, Laurent Dufour wrote:
>> On 28/05/2018 07:23, Song, HaiyanX wrote:
>>>
>>> Some regressions and improvements were found by LKP-tools (Linux Kernel Performance) on the V9 patch series
>>> tested on Intel 4s Skylake platform.
>>
>> Hi,
>>
>> Thanks for reporting these benchmark results, but you mentioned the "V9 patch
>> series" while responding to the v11 header series...
>> Were these tests done on v9 or v11 ?
>>
>> Laurent.
>>
>>>
>>> The regression result is sorted by the metric will-it-scale.per_thread_ops.
>>> Branch: Laurent-Dufour/Speculative-page-faults/20180316-151833 (V9 patch series)
>>> Commit id:
>>> base commit: d55f34411b1b126429a823d06c3124c16283231f
>>> head commit: 0355322b3577eeab7669066df42c550a56801110
>>> Download link:
>>> https://github.com/antonblanchard/will-it-scale/tree/master/tests
>>> Metrics:
>>> will-it-scale.per_process_ops=processes/nr_cpu
>>> will-it-scale.per_thread_ops=threads/nr_cpu
>>> test box: lkp-skl-4sp1(nr_cpu=192,memory=768G)
>>> THP: enable / disable
>>> nr_task: 100%
>>>
>>> 1. Regressions:
>>> a) THP enabled:
>>> testcase                       base      change    head      metric
>>> page_fault3/ enable THP        10092     -17.5%    8323      will-it-scale.per_thread_ops
>>> page_fault2/ enable THP        8300      -17.2%    6869      will-it-scale.per_thread_ops
>>> brk1/ enable THP               957.67    -7.6%     885       will-it-scale.per_thread_ops
>>> page_fault3/ enable THP        172821    -5.3%     163692    will-it-scale.per_process_ops
>>> signal1/ enable THP            9125      -3.2%     8834      will-it-scale.per_process_ops
>>>
>>> b) THP disabled:
>>> testcase                       base      change    head      metric
>>> page_fault3/ disable THP       10107     -19.1%    8180      will-it-scale.per_thread_ops
>>> page_fault2/ disable THP       8432      -17.8%    6931      will-it-scale.per_thread_ops
>>> context_switch1/ disable THP   215389    -6.8%     200776    will-it-scale.per_thread_ops
>>> brk1/ disable THP              939.67    -6.6%     877.33    will-it-scale.per_thread_ops
>>> page_fault3/ disable THP       173145    -4.7%     165064    will-it-scale.per_process_ops
>>> signal1/ disable THP           9162      -3.9%     8802      will-it-scale.per_process_ops
>>>
>>> 2. Improvements:
>>> a) THP enabled:
>>> testcase                       base      change    head      metric
>>> malloc1/ enable THP            66.33     +469.8%   383.67    will-it-scale.per_thread_ops
>>> writeseek3/ enable THP         2531      +4.5%     2646      will-it-scale.per_thread_ops
>>> signal1/ enable THP            989.33    +2.8%     1016      will-it-scale.per_thread_ops
>>>
>>> b) THP disabled:
>>> testcase                       base      change    head      metric
>>> malloc1/ disable THP           90.33     +417.3%   467.33    will-it-scale.per_thread_ops
>>> read2/ disable THP             58934     +39.2%    82060     will-it-scale.per_thread_ops
>>> page_fault1/ disable THP       8607      +36.4%    11736     will-it-scale.per_thread_ops
>>> read1/ disable THP             314063    +12.7%    353934    will-it-scale.per_thread_ops
>>> writeseek3/ disable THP        2452      +12.5%    2759      will-it-scale.per_thread_ops
>>> signal1/ disable THP           971.33    +5.5%     1024      will-it-scale.per_thread_ops
>>>
>>> Notes: for the above values in the "change" column, a higher value means that the related testcase result
>>> on head commit is better than that on base commit for this benchmark.
>>>
>>>
>>> Haiyan Song
>>>
>>> ________________________________________
>>> From: owner-linux-mm@kvack.org [owner-linux-mm@kvack.org] on behalf of Laurent Dufour [ldufour@linux.vnet.ibm.com]
>>> Sent: Thursday, May 17, 2018 7:06 PM
>>> To: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi
>>> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; x86@kernel.org
>>> Subject: [PATCH v11 00/26] Speculative page faults
>>>
>>> This is a port on kernel 4.17 of the work done by Peter Zijlstra to handle
>>> page fault without holding the mm semaphore [1].
>>>
>>> The idea is to try to handle user space page faults without holding the
>>> mmap_sem. This should allow better concurrency for massively threaded
>>> process since the page fault handler will not wait for other threads' memory
>>> layout changes to be done, assuming that the change is done in another part
>>> of the process's memory space. This type of page fault is named a speculative
>>> page fault. If the speculative page fault fails because concurrency is
>>> detected or because the underlying PMD or PTE tables are not yet allocated, it
>>> fails its processing and a classic page fault is then tried.
>>>
>>> The speculative page fault (SPF) has to look for the VMA matching the fault
>>> address without holding the mmap_sem; this is done by introducing a rwlock
>>> which protects the access to the mm_rb tree. Previously this was done using
>>> SRCU, but it was introducing a lot of scheduling to process the VMA's
>>> freeing operation, which was hitting the performance by 20% as reported by
>>> Kemi Wang [2]. Using a rwlock to protect access to the mm_rb tree is
>>> limiting the locking contention to these operations, which are expected to
>>> be in O(log n) order. In addition, to ensure that the VMA is not freed in
>>> our back a reference count is added and 2 services (get_vma() and
>>> put_vma()) are introduced to handle the reference count. Once a VMA is
>>> fetched from the RB tree using get_vma(), it must be later freed using
>>> put_vma(). I can no longer see the overhead I previously got while running
>>> the will-it-scale benchmark.
>>>
>>> The VMA's attributes checked during the speculative page fault processing
>>> have to be protected against parallel changes. This is done by using a per
>>> VMA sequence lock. This sequence lock allows the speculative page fault
>>> handler to fast check for parallel changes in progress and to abort the
>>> speculative page fault in that case.
>>>
>>> Once the VMA has been found, the speculative page fault handler checks
>>> the VMA's attributes to verify whether the page fault can be handled
>>> correctly or not. Thus, the VMA is protected through a sequence lock which
>>> allows fast detection of concurrent VMA changes. If such a change is
>>> detected, the speculative page fault is aborted and a *classic* page fault
>>> is tried. VMA sequence locking is added when the VMA attributes which are
>>> checked during the page fault are modified.
>>>
>>> When the PTE is fetched, the VMA is checked to see if it has been changed,
>>> so once the page table is locked, the VMA is valid, so any other changes
>>> leading to touching this PTE will need to lock the page table, so no
>>> parallel change is possible at this time.
>>>
>>> The locking of the PTE is done with interrupts disabled, this allows
>>> checking for the PMD to ensure that there is not an ongoing collapsing
>>> operation. Since khugepaged first sets the PMD to pmd_none and then
>>> waits for the other CPUs to have caught the IPI interrupt, if the pmd is
>>> valid at the time the PTE is locked, we have the guarantee that the
>>> collapsing operation will have to wait on the PTE lock to move forward.
>>> This allows the SPF handler to map the PTE safely. If the PMD value is
>>> different from the one recorded at the beginning of the SPF operation, the
>>> classic page fault handler will be called to handle the operation while
>>> holding the mmap_sem. As the PTE lock is done with interrupts disabled,
>>> the lock is taken using spin_trylock() to avoid deadlock when handling a
>>> page fault while a TLB invalidate is requested by another CPU holding the
>>> PTE.
>>>
>>> In pseudo code, this could be seen as:
>>> speculative_page_fault()
>>> {
>>> vma = get_vma()
>>> check vma sequence count
>>> check vma's support
>>> disable interrupt
>>> check vma sequence count
>>> handle_pte_fault(vma)
>>> ..
>>> page = alloc_page()
>>> pte_map_lock()
>>> disable interrupt
>>> abort if sequence counter has changed
>>> abort if pmd or pte has changed
>>> pte map and lock
>>> enable interrupt
>>> goto done
>>> again:
>>> lock(mmap_sem)
>>> vma = find_vma();
>>> handle_pte_fault(vma);
>>> if retry
>>> unlock(mmap_sem)
>>> handle fault error
>>> }
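The abort-on-changed-sequence-counter step at the heart of the pseudo code above can be illustrated with a toy sequence lock. This is a sketch only (plain Python; the kernel uses per-VMA seqcounts in C, and the SPF handler aborts to the classic path rather than retrying in a loop):

```python
import threading

class SeqLock:
    """Toy sequence lock: readers detect an overlapping write and retry."""
    def __init__(self):
        self.seq = 0              # even: no writer active; odd: write in progress
        self._writer = threading.Lock()

    def write(self, fn):
        with self._writer:
            self.seq += 1         # odd: write in progress
            fn()
            self.seq += 1         # even again: write complete

    def read(self, fn):
        while True:
            start = self.seq
            if start % 2:         # writer currently active: try again
                continue
            result = fn()
            if self.seq == start: # no write overlapped: result is consistent
                return result
            # else: concurrent change detected; here we retry, whereas the
            # SPF handler would abort and fall back to a classic page fault

vma_attrs = {"flags": "rw"}       # stand-in for the checked VMA attributes
lock = SeqLock()
print(lock.read(lambda: vma_attrs["flags"]))
```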
>>>
>>> Support for THP is not done because when checking for the PMD, we can be
>>> confused by an in progress collapsing operation done by khugepaged. The
>>> issue is that pmd_none() could be true either if the PMD is not already
>>> populated or if the underlying PTEs are in the process of being collapsed. So we
>>> cannot safely allocate a PMD if pmd_none() is true.
>>>
>>> This series adds a new software performance event named 'speculative-faults'
>>> or 'spf'. It counts the number of successful page fault events handled
>>> speculatively. When recording 'faults,spf' events, the faults one is
>>> counting the total number of page fault events while 'spf' is only counting
>>> the part of the faults processed speculatively.
>>>
>>> There are some trace events introduced by this series. They allow
>>> identifying why the page faults were not processed speculatively. This
>>> doesn't take into account the faults generated by a monothreaded process,
>>> which are directly processed while holding the mmap_sem. These trace events are
>>> grouped in a system named 'pagefault', they are:
>>> - pagefault:spf_vma_changed : if the VMA has been changed in our back
>>> - pagefault:spf_vma_noanon : the vma->anon_vma field was not yet set.
>>> following arguments :
>>> $ perf stat -e 'faults,spf,pagefault:*' <command>
>>>
>>> There is also a dedicated vmstat counter showing the number of successful
>>> page faults handled speculatively. It can be seen this way:
>>> $ grep speculative_pgfault /proc/vmstat
>>>
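The grep above reads a "name value" pair out of /proc/vmstat. A small illustrative sketch of the same lookup in script form (the counter name comes from the mail; the sample values are invented):

```python
# Parse a /proc/vmstat-style buffer (one "name value" pair per line)
# and pick out the speculative page fault counter. Values are made up.
sample = """\
pgfault 123456
pgmajfault 42
speculative_pgfault 98765
"""

def vmstat_counter(text, name):
    """Return the integer value for the named counter, or None if absent."""
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if key == name:
            return int(value)
    return None

print(vmstat_counter(sample, "speculative_pgfault"))
```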
>>> This series builds on top of v4.16-mmotm-2018-04-13-17-28 and is functional
>>> on x86, PowerPC and arm64.
>>>
>>> ---------------------
>>> Real Workload results
>>>
>>> As mentioned in a previous email, we did unofficial runs using a "popular
>>> in-memory multithreaded database product" on a 176-core SMT8 Power system
>>> which showed a 30% improvement in the number of transactions processed per
>>> second. This run was done on the v6 series, but the changes introduced in
>>> this new version should not impact the performance boost seen.
>>>
>>> Here are the perf data captured during 2 of these runs on top of the v8
>>> faults 89.418 101.364 +13%
>>> spf n/a 97.989
>>>
>>> With the SPF kernel, most of the page faults were processed in a speculative
>>> way.
>>>
>>> Ganesh Mahendran had backported the series on top of a 4.9 kernel and gave
>>> it a try on an android device. He reported that the application launch time
>>> was improved on average by 6%, and for large applications (~100 threads) by
>>> 20%.
>>>
>>> Here are the launch times Ganesh measured on Android 8.0 on top of a Qcom
>>> 0 pagefault:spf_pmd_changed
>>>
>>> Very few speculative page faults were recorded as most of the processes
>>> involved are monothreaded (it seems that on this architecture some threads
>>> were created during the kernel build processing).
>>>
>>> Here are the kernbench results on an 80-CPU Power8 system:
>>> 0 pagefault:spf_vma_access
>>> 0 pagefault:spf_pmd_changed
>>>
>>> Most of the processes involved are monothreaded so SPF is not activated but
>>> there is no impact on the performance.
>>>
>>> Ebizzy:
>>> -------
>>> The test counts the number of records per second it can manage; the
>>> higher, the better. I run it like this: 'ebizzy -mTt <nrcpus>'. To get
>>> consistent results I repeated the test 100 times and measured the average
>>> result. The number is the records processed per second; the higher, the
>>> better.
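The repeat-and-average methodology described above is just a mean over per-run results. A trivial sketch (run count and values here are invented, not the mail's data):

```python
# Average repeated benchmark runs, as described for the ebizzy test.
def average(runs):
    """Mean of a list of per-run results."""
    return sum(runs) / len(runs)

records_per_sec = [91234, 90877, 91510]   # hypothetical per-run results
print(round(average(records_per_sec)))
```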
>>>
>>> BASE SPF delta
>>> 0 pagefault:spf_vma_access
>>> 0 pagefault:spf_pmd_changed
>>>
>>> In ebizzy's case most of the page faults were handled in a speculative way,
>>> leading to the ebizzy performance boost.
>>>
>>> ------------------
>>> Changes since v10 (https://lkml.org/lkml/2018/4/17/572):
>>>  - Accounted for all review feedback from Punit Agrawal, Ganesh Mahendran
>>> and Minchan Kim, hopefully.
>>> - Remove unneeded check on CONFIG_SPECULATIVE_PAGE_FAULT in
>>> __do_page_fault().
>>> of aborting the speculative page fault handling. Dropping the now
>>> useless
>>> trace event pagefault:spf_pte_lock.
>>>  - No longer try to reuse the fetched VMA during the speculative page fault
>>>    handling when retrying is needed. This adds a lot of complexity and
>>>    additional tests done didn't show a significant performance improvement.
>>> - Convert IS_ENABLED(CONFIG_NUMA) back to #ifdef due to build error.
>>>
>>> [1] http://linux-kernel.2935.n7.nabble.com/RFC-PATCH-0-6-Another-go-at-speculative-page-faults-tt965642.html#none
>>> [2] https://patchwork.kernel.org/patch/9999687/
>>>
>>>
>>> mm/internal.h | 20 ++
>>> mm/khugepaged.c | 5 +
>>> mm/madvise.c | 6 +-
>>> mm/memory.c | 612 +++++++++++++++++++++++++++++-----
>>> mm/mempolicy.c | 51 ++-
>>> mm/migrate.c | 6 +-
>>> mm/mlock.c | 13 +-
">>> Commit id:\n",
">>> base commit: d55f34411b1b126429a823d06c3124c16283231f\n",
">>> head commit: 0355322b3577eeab7669066df42c550a56801110\n",
@@ -165,47 +195,72 @@
">>> Download link:\n",
">>> https://github.com/antonblanchard/will-it-scale/tree/master/tests\n",
">>> Metrics:\n",
- ">>> will-it-scale.per_process_ops=processes/nr_cpu\n",
- ">>> will-it-scale.per_thread_ops=threads/nr_cpu\n",
- ">>> test box: lkp-skl-4sp1(nr_cpu=192,memory=768G)\n",
+ ">>> will-it-scale.per_process_ops=3Dprocesses/nr_cpu\n",
+ ">>> will-it-scale.per_thread_ops=3Dthreads/nr_cpu\n",
+ ">>> test box: lkp-skl-4sp1(nr_cpu=3D192,memory=3D768G)\n",
">>> THP: enable / disable\n",
">>> nr_task: 100%\n",
">>>\n",
">>> 1. Regressions:\n",
">>> a) THP enabled:\n",
- ">>> testcase base change head metric\n",
- ">>> page_fault3/ enable THP 10092 -17.5% 8323 will-it-scale.per_thread_ops\n",
- ">>> page_fault2/ enable THP 8300 -17.2% 6869 will-it-scale.per_thread_ops\n",
- ">>> brk1/ enable THP 957.67 -7.6% 885 will-it-scale.per_thread_ops\n",
- ">>> page_fault3/ enable THP 172821 -5.3% 163692 will-it-scale.per_process_ops\n",
- ">>> signal1/ enable THP 9125 -3.2% 8834 will-it-scale.per_process_ops\n",
+ ">>> testcase base change head =\n",
+ " metric\n",
+ ">>> page_fault3/ enable THP 10092 -17.5% 8323 =\n",
+ " will-it-scale.per_thread_ops\n",
+ ">>> page_fault2/ enable THP 8300 -17.2% 6869 =\n",
+ " will-it-scale.per_thread_ops\n",
+ ">>> brk1/ enable THP 957.67 -7.6% 885 =\n",
+ " will-it-scale.per_thread_ops\n",
+ ">>> page_fault3/ enable THP 172821 -5.3% 163692 =\n",
+ " will-it-scale.per_process_ops\n",
+ ">>> signal1/ enable THP 9125 -3.2% 8834 =\n",
+ " will-it-scale.per_process_ops\n",
">>>\n",
">>> b) THP disabled:\n",
- ">>> testcase base change head metric\n",
- ">>> page_fault3/ disable THP 10107 -19.1% 8180 will-it-scale.per_thread_ops\n",
- ">>> page_fault2/ disable THP 8432 -17.8% 6931 will-it-scale.per_thread_ops\n",
- ">>> context_switch1/ disable THP 215389 -6.8% 200776 will-it-scale.per_thread_ops\n",
- ">>> brk1/ disable THP 939.67 -6.6% 877.33 will-it-scale.per_thread_ops\n",
- ">>> page_fault3/ disable THP 173145 -4.7% 165064 will-it-scale.per_process_ops\n",
- ">>> signal1/ disable THP 9162 -3.9% 8802 will-it-scale.per_process_ops\n",
+ ">>> testcase base change head =\n",
+ " metric\n",
+ ">>> page_fault3/ disable THP 10107 -19.1% 8180 =\n",
+ " will-it-scale.per_thread_ops\n",
+ ">>> page_fault2/ disable THP 8432 -17.8% 6931 =\n",
+ " will-it-scale.per_thread_ops\n",
+ ">>> context_switch1/ disable THP 215389 -6.8% 200776 =\n",
+ " will-it-scale.per_thread_ops\n",
+ ">>> brk1/ disable THP 939.67 -6.6% 877.33=\n",
+ " will-it-scale.per_thread_ops\n",
+ ">>> page_fault3/ disable THP 173145 -4.7% 165064 =\n",
+ " will-it-scale.per_process_ops\n",
+ ">>> signal1/ disable THP 9162 -3.9% 8802 =\n",
+ " will-it-scale.per_process_ops\n",
">>>\n",
">>> 2. Improvements:\n",
">>> a) THP enabled:\n",
- ">>> testcase base change head metric\n",
- ">>> malloc1/ enable THP 66.33 +469.8% 383.67 will-it-scale.per_thread_ops\n",
- ">>> writeseek3/ enable THP 2531 +4.5% 2646 will-it-scale.per_thread_ops\n",
- ">>> signal1/ enable THP 989.33 +2.8% 1016 will-it-scale.per_thread_ops\n",
+ ">>> testcase base change head =\n",
+ " metric\n",
+ ">>> malloc1/ enable THP 66.33 +469.8% 383.67=\n",
+ " will-it-scale.per_thread_ops\n",
+ ">>> writeseek3/ enable THP 2531 +4.5% 2646 =\n",
+ " will-it-scale.per_thread_ops\n",
+ ">>> signal1/ enable THP 989.33 +2.8% 1016 =\n",
+ " will-it-scale.per_thread_ops\n",
">>>\n",
">>> b) THP disabled:\n",
- ">>> testcase base change head metric\n",
- ">>> malloc1/ disable THP 90.33 +417.3% 467.33 will-it-scale.per_thread_ops\n",
- ">>> read2/ disable THP 58934 +39.2% 82060 will-it-scale.per_thread_ops\n",
- ">>> page_fault1/ disable THP 8607 +36.4% 11736 will-it-scale.per_thread_ops\n",
- ">>> read1/ disable THP 314063 +12.7% 353934 will-it-scale.per_thread_ops\n",
- ">>> writeseek3/ disable THP 2452 +12.5% 2759 will-it-scale.per_thread_ops\n",
- ">>> signal1/ disable THP 971.33 +5.5% 1024 will-it-scale.per_thread_ops\n",
- ">>>\n",
- ">>> Notes: for above values in column \"change\", the higher value means that the related testcase result\n",
+ ">>> testcase base change head =\n",
+ " metric\n",
+ ">>> malloc1/ disable THP 90.33 +417.3% 467.33=\n",
+ " will-it-scale.per_thread_ops\n",
+ ">>> read2/ disable THP 58934 +39.2% 82060 =\n",
+ " will-it-scale.per_thread_ops\n",
+ ">>> page_fault1/ disable THP 8607 +36.4% 11736 =\n",
+ " will-it-scale.per_thread_ops\n",
+ ">>> read1/ disable THP 314063 +12.7% 353934 =\n",
+ " will-it-scale.per_thread_ops\n",
+ ">>> writeseek3/ disable THP 2452 +12.5% 2759 =\n",
+ " will-it-scale.per_thread_ops\n",
+ ">>> signal1/ disable THP 971.33 +5.5% 1024 =\n",
+ " will-it-scale.per_thread_ops\n",
+ ">>>\n",
+ ">>> Notes: for above values in column \"change\", the higher value means that=\n",
+ " the related testcase result\n",
">>> on head commit is better than that on base commit for this benchmark.\n",
">>>\n",
">>>\n",
@@ -213,75 +268,112 @@
">>> Haiyan Song\n",
">>>\n",
">>> ________________________________________\n",
- ">>> From: owner-linux-mm\@kvack.org [owner-linux-mm\@kvack.org] on behalf of Laurent Dufour [ldufour\@linux.vnet.ibm.com]\n",
+ ">>> From: owner-linux-mm\@kvack.org [owner-linux-mm\@kvack.org] on behalf of =\n",
+ "Laurent Dufour [ldufour\@linux.vnet.ibm.com]\n",
">>> Sent: Thursday, May 17, 2018 7:06 PM\n",
- ">>> To: akpm\@linux-foundation.org; mhocko\@kernel.org; peterz\@infradead.org; kirill\@shutemov.name; ak\@linux.intel.com; dave\@stgolabs.net; jack\@suse.cz; Matthew Wilcox; khandual\@linux.vnet.ibm.com; aneesh.kumar\@linux.vnet.ibm.com; benh\@kernel.crashing.org; mpe\@ellerman.id.au; paulus\@samba.org; Thomas Gleixner; Ingo Molnar; hpa\@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work\@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi\n",
- ">>> Cc: linux-kernel\@vger.kernel.org; linux-mm\@kvack.org; haren\@linux.vnet.ibm.com; npiggin\@gmail.com; bsingharora\@gmail.com; paulmck\@linux.vnet.ibm.com; Tim Chen; linuxppc-dev\@lists.ozlabs.org; x86\@kernel.org\n",
+ ">>> To: akpm\@linux-foundation.org; mhocko\@kernel.org; peterz\@infradead.org;=\n",
+ " kirill\@shutemov.name; ak\@linux.intel.com; dave\@stgolabs.net; jack\@suse.cz;=\n",
+ " Matthew Wilcox; khandual\@linux.vnet.ibm.com; aneesh.kumar\@linux.vnet.ibm.c=\n",
+ "om; benh\@kernel.crashing.org; mpe\@ellerman.id.au; paulus\@samba.org; Thomas =\n",
+ "Gleixner; Ingo Molnar; hpa\@zytor.com; Will Deacon; Sergey Senozhatsky; serg=\n",
+ "ey.senozhatsky.work\@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, =\n",
+ "Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minch=\n",
+ "an Kim; Punit Agrawal; vinayak menon; Yang Shi\n",
+ ">>> Cc: linux-kernel\@vger.kernel.org; linux-mm\@kvack.org; haren\@linux.vnet.=\n",
+ "ibm.com; npiggin\@gmail.com; bsingharora\@gmail.com; paulmck\@linux.vnet.ibm.c=\n",
+ "om; Tim Chen; linuxppc-dev\@lists.ozlabs.org; x86\@kernel.org\n",
">>> Subject: [PATCH v11 00/26] Speculative page faults\n",
">>>\n",
- ">>> This is a port on kernel 4.17 of the work done by Peter Zijlstra to handle\n",
+ ">>> This is a port on kernel 4.17 of the work done by Peter Zijlstra to han=\n",
+ "dle\n",
">>> page fault without holding the mm semaphore [1].\n",
">>>\n",
">>> The idea is to try to handle user space page faults without holding the\n",
">>> mmap_sem. This should allow better concurrency for massively threaded\n",
- ">>> process since the page fault handler will not wait for other threads memory\n",
- ">>> layout change to be done, assuming that this change is done in another part\n",
- ">>> of the process's memory space. This type page fault is named speculative\n",
- ">>> page fault. If the speculative page fault fails because of a concurrency is\n",
- ">>> detected or because underlying PMD or PTE tables are not yet allocating, it\n",
+ ">>> process since the page fault handler will not wait for other threads me=\n",
+ "mory\n",
+ ">>> layout change to be done, assuming that this change is done in another =\n",
+ "part\n",
+ ">>> of the process's memory space. This type page fault is named speculativ=\n",
+ "e\n",
+ ">>> page fault. If the speculative page fault fails because of a concurrenc=\n",
+ "y is\n",
+ ">>> detected or because underlying PMD or PTE tables are not yet allocating=\n",
+ ", it\n",
">>> is failing its processing and a classic page fault is then tried.\n",
">>>\n",
- ">>> The speculative page fault (SPF) has to look for the VMA matching the fault\n",
- ">>> address without holding the mmap_sem, this is done by introducing a rwlock\n",
- ">>> which protects the access to the mm_rb tree. Previously this was done using\n",
+ ">>> The speculative page fault (SPF) has to look for the VMA matching the f=\n",
+ "ault\n",
+ ">>> address without holding the mmap_sem, this is done by introducing a rwl=\n",
+ "ock\n",
+ ">>> which protects the access to the mm_rb tree. Previously this was done u=\n",
+ "sing\n",
">>> SRCU but it was introducing a lot of scheduling to process the VMA's\n",
- ">>> freeing operation which was hitting the performance by 20% as reported by\n",
+ ">>> freeing operation which was hitting the performance by 20% as reported =\n",
+ "by\n",
">>> Kemi Wang [2]. Using a rwlock to protect access to the mm_rb tree is\n",
- ">>> limiting the locking contention to these operations which are expected to\n",
- ">>> be in a O(log n) order. In addition to ensure that the VMA is not freed in\n",
+ ">>> limiting the locking contention to these operations which are expected =\n",
+ "to\n",
+ ">>> be in a O(log n) order. In addition to ensure that the VMA is not freed=\n",
+ " in\n",
">>> our back a reference count is added and 2 services (get_vma() and\n",
">>> put_vma()) are introduced to handle the reference count. Once a VMA is\n",
">>> fetched from the RB tree using get_vma(), it must be later freed using\n",
">>> put_vma(). I can't see anymore the overhead I got while will-it-scale\n",
">>> benchmark anymore.\n",
">>>\n",
- ">>> The VMA's attributes checked during the speculative page fault processing\n",
- ">>> have to be protected against parallel changes. This is done by using a per\n",
+ ">>> The VMA's attributes checked during the speculative page fault processi=\n",
+ "ng\n",
+ ">>> have to be protected against parallel changes. This is done by using a =\n",
+ "per\n",
">>> VMA sequence lock. This sequence lock allows the speculative page fault\n",
">>> handler to fast check for parallel changes in progress and to abort the\n",
">>> speculative page fault in that case.\n",
">>>\n",
- ">>> Once the VMA has been found, the speculative page fault handler would check\n",
- ">>> for the VMA's attributes to verify that the page fault has to be handled\n",
- ">>> correctly or not. Thus, the VMA is protected through a sequence lock which\n",
+ ">>> Once the VMA has been found, the speculative page fault handler would c=\n",
+ "heck\n",
+ ">>> for the VMA's attributes to verify that the page fault has to be handle=\n",
+ "d\n",
+ ">>> correctly or not. Thus, the VMA is protected through a sequence lock wh=\n",
+ "ich\n",
">>> allows fast detection of concurrent VMA changes. If such a change is\n",
- ">>> detected, the speculative page fault is aborted and a *classic* page fault\n",
- ">>> is tried. VMA sequence lockings are added when VMA attributes which are\n",
+ ">>> detected, the speculative page fault is aborted and a *classic* page fa=\n",
+ "ult\n",
+ ">>> is tried. VMA sequence lockings are added when VMA attributes which ar=\n",
+ "e\n",
">>> checked during the page fault are modified.\n",
">>>\n",
- ">>> When the PTE is fetched, the VMA is checked to see if it has been changed,\n",
- ">>> so once the page table is locked, the VMA is valid, so any other changes\n",
+ ">>> When the PTE is fetched, the VMA is checked to see if it has been chang=\n",
+ "ed,\n",
+ ">>> so once the page table is locked, the VMA is valid, so any other change=\n",
+ "s\n",
">>> leading to touching this PTE will need to lock the page table, so no\n",
">>> parallel change is possible at this time.\n",
">>>\n",
">>> The locking of the PTE is done with interrupts disabled, this allows\n",
">>> checking for the PMD to ensure that there is not an ongoing collapsing\n",
- ">>> operation. Since khugepaged is firstly set the PMD to pmd_none and then is\n",
- ">>> waiting for the other CPU to have caught the IPI interrupt, if the pmd is\n",
+ ">>> operation. Since khugepaged is firstly set the PMD to pmd_none and then=\n",
+ " is\n",
+ ">>> waiting for the other CPU to have caught the IPI interrupt, if the pmd =\n",
+ "is\n",
">>> valid at the time the PTE is locked, we have the guarantee that the\n",
">>> collapsing operation will have to wait on the PTE lock to move forward.\n",
">>> This allows the SPF handler to map the PTE safely. If the PMD value is\n",
- ">>> different from the one recorded at the beginning of the SPF operation, the\n",
+ ">>> different from the one recorded at the beginning of the SPF operation, =\n",
+ "the\n",
">>> classic page fault handler will be called to handle the operation while\n",
- ">>> holding the mmap_sem. As the PTE lock is done with the interrupts disabled,\n",
- ">>> the lock is done using spin_trylock() to avoid dead lock when handling a\n",
- ">>> page fault while a TLB invalidate is requested by another CPU holding the\n",
+ ">>> holding the mmap_sem. As the PTE lock is done with the interrupts disab=\n",
+ "led,\n",
+ ">>> the lock is done using spin_trylock() to avoid dead lock when handling =\n",
+ "a\n",
+ ">>> page fault while a TLB invalidate is requested by another CPU holding t=\n",
+ "he\n",
">>> PTE.\n",
">>>\n",
">>> In pseudo code, this could be seen as:\n",
">>> speculative_page_fault()\n",
">>> {\n",
- ">>> vma = get_vma()\n",
+ ">>> vma =3D get_vma()\n",
">>> check vma sequence count\n",
">>> check vma's support\n",
">>> disable interrupt\n",
@@ -292,10 +384,11 @@
">>> check vma sequence count\n",
">>> handle_pte_fault(vma)\n",
">>> ..\n",
- ">>> page = alloc_page()\n",
+ ">>> page =3D alloc_page()\n",
">>> pte_map_lock()\n",
">>> disable interrupt\n",
- ">>> abort if sequence counter has changed\n",
+ ">>> abort if sequence counter has chang=\n",
+ "ed\n",
">>> abort if pmd or pte has changed\n",
">>> pte map and lock\n",
">>> enable interrupt\n",
@@ -311,7 +404,7 @@
">>> goto done\n",
">>> again:\n",
">>> lock(mmap_sem)\n",
- ">>> vma = find_vma();\n",
+ ">>> vma =3D find_vma();\n",
">>> handle_pte_fault(vma);\n",
">>> if retry\n",
">>> unlock(mmap_sem)\n",
@@ -320,22 +413,27 @@
">>> handle fault error\n",
">>> }\n",
">>>\n",
- ">>> Support for THP is not done because when checking for the PMD, we can be\n",
+ ">>> Support for THP is not done because when checking for the PMD, we can b=\n",
+ "e\n",
">>> confused by an in progress collapsing operation done by khugepaged. The\n",
">>> issue is that pmd_none() could be true either if the PMD is not already\n",
- ">>> populated or if the underlying PTE are in the way to be collapsed. So we\n",
+ ">>> populated or if the underlying PTE are in the way to be collapsed. So w=\n",
+ "e\n",
">>> cannot safely allocate a PMD if pmd_none() is true.\n",
">>>\n",
- ">>> This series add a new software performance event named 'speculative-faults'\n",
+ ">>> This series add a new software performance event named 'speculative-fau=\n",
+ "lts'\n",
">>> or 'spf'. It counts the number of successful page fault event handled\n",
">>> speculatively. When recording 'faults,spf' events, the faults one is\n",
- ">>> counting the total number of page fault events while 'spf' is only counting\n",
+ ">>> counting the total number of page fault events while 'spf' is only coun=\n",
+ "ting\n",
">>> the part of the faults processed speculatively.\n",
">>>\n",
">>> There are some trace events introduced by this series. They allow\n",
">>> identifying why the page faults were not processed speculatively. This\n",
">>> doesn't take in account the faults generated by a monothreaded process\n",
- ">>> which directly processed while holding the mmap_sem. This trace events are\n",
+ ">>> which directly processed while holding the mmap_sem. This trace events =\n",
+ "are\n",
">>> grouped in a system named 'pagefault', they are:\n",
">>> - pagefault:spf_vma_changed : if the VMA has been changed in our back\n",
">>> - pagefault:spf_vma_noanon : the vma->anon_vma field was not yet set.\n",
@@ -348,20 +446,26 @@
">>> following arguments :\n",
">>> \$ perf stat -e 'faults,spf,pagefault:*' <command>\n",
">>>\n",
- ">>> There is also a dedicated vmstat counter showing the number of successful\n",
+ ">>> There is also a dedicated vmstat counter showing the number of successf=\n",
+ "ul\n",
">>> page fault handled speculatively. I can be seen this way:\n",
">>> \$ grep speculative_pgfault /proc/vmstat\n",
">>>\n",
- ">>> This series builds on top of v4.16-mmotm-2018-04-13-17-28 and is functional\n",
+ ">>> This series builds on top of v4.16-mmotm-2018-04-13-17-28 and is functi=\n",
+ "onal\n",
">>> on x86, PowerPC and arm64.\n",
">>>\n",
">>> ---------------------\n",
">>> Real Workload results\n",
">>>\n",
- ">>> As mentioned in previous email, we did non official runs using a \"popular\n",
- ">>> in memory multithreaded database product\" on 176 cores SMT8 Power system\n",
- ">>> which showed a 30% improvements in the number of transaction processed per\n",
- ">>> second. This run has been done on the v6 series, but changes introduced in\n",
+ ">>> As mentioned in previous email, we did non official runs using a \"popul=\n",
+ "ar\n",
+ ">>> in memory multithreaded database product\" on 176 cores SMT8 Power syste=\n",
+ "m\n",
+ ">>> which showed a 30% improvements in the number of transaction processed =\n",
+ "per\n",
+ ">>> second. This run has been done on the v6 series, but changes introduced=\n",
+ " in\n",
">>> this new version should not impact the performance boost seen.\n",
">>>\n",
">>> Here are the perf data captured during 2 of these runs on top of the v8\n",
@@ -370,12 +474,16 @@
">>> faults 89.418 101.364 +13%\n",
">>> spf n/a 97.989\n",
">>>\n",
- ">>> With the SPF kernel, most of the page fault were processed in a speculative\n",
+ ">>> With the SPF kernel, most of the page fault were processed in a specula=\n",
+ "tive\n",
">>> way.\n",
">>>\n",
- ">>> Ganesh Mahendran had backported the series on top of a 4.9 kernel and gave\n",
- ">>> it a try on an android device. He reported that the application launch time\n",
- ">>> was improved in average by 6%, and for large applications (~100 threads) by\n",
+ ">>> Ganesh Mahendran had backported the series on top of a 4.9 kernel and g=\n",
+ "ave\n",
+ ">>> it a try on an android device. He reported that the application launch =\n",
+ "time\n",
+ ">>> was improved in average by 6%, and for large applications (~100 threads=\n",
+ ") by\n",
">>> 20%.\n",
">>>\n",
">>> Here are the launch time Ganesh mesured on Android 8.0 on top of a Qcom\n",
@@ -448,7 +556,8 @@
">>> 0 pagefault:spf_pmd_changed\n",
">>>\n",
">>> Very few speculative page faults were recorded as most of the processes\n",
- ">>> involved are monothreaded (sounds that on this architecture some threads\n",
+ ">>> involved are monothreaded (sounds that on this architecture some thread=\n",
+ "s\n",
">>> were created during the kernel build processing).\n",
">>>\n",
">>> Here are the kerbench results on a 80 CPUs Power8 system:\n",
@@ -483,15 +592,18 @@
">>> 0 pagefault:spf_vma_access\n",
">>> 0 pagefault:spf_pmd_changed\n",
">>>\n",
- ">>> Most of the processes involved are monothreaded so SPF is not activated but\n",
+ ">>> Most of the processes involved are monothreaded so SPF is not activated=\n",
+ " but\n",
">>> there is no impact on the performance.\n",
">>>\n",
">>> Ebizzy:\n",
">>> -------\n",
- ">>> The test is counting the number of records per second it can manage, the\n",
+ ">>> The test is counting the number of records per second it can manage, th=\n",
+ "e\n",
">>> higher is the best. I run it like this 'ebizzy -mTt <nrcpus>'. To get\n",
">>> consistent result I repeated the test 100 times and measure the average\n",
- ">>> result. The number is the record processes per second, the higher is the\n",
+ ">>> result. The number is the record processes per second, the higher is th=\n",
+ "e\n",
">>> best.\n",
">>>\n",
">>> BASE SPF delta\n",
@@ -518,12 +630,14 @@
">>> 0 pagefault:spf_vma_access\n",
">>> 0 pagefault:spf_pmd_changed\n",
">>>\n",
- ">>> In ebizzy's case most of the page fault were handled in a speculative way,\n",
+ ">>> In ebizzy's case most of the page fault were handled in a speculative w=\n",
+ "ay,\n",
">>> leading the ebizzy performance boost.\n",
">>>\n",
">>> ------------------\n",
">>> Changes since v10 (https://lkml.org/lkml/2018/4/17/572):\n",
- ">>> - Accounted for all review feedbacks from Punit Agrawal, Ganesh Mahendran\n",
+ ">>> - Accounted for all review feedbacks from Punit Agrawal, Ganesh Mahend=\n",
+ "ran\n",
">>> and Minchan Kim, hopefully.\n",
">>> - Remove unneeded check on CONFIG_SPECULATIVE_PAGE_FAULT in\n",
">>> __do_page_fault().\n",
@@ -532,12 +646,15 @@
">>> of aborting the speculative page fault handling. Dropping the now\n",
">>> useless\n",
">>> trace event pagefault:spf_pte_lock.\n",
- ">>> - No more try to reuse the fetched VMA during the speculative page fault\n",
+ ">>> - No more try to reuse the fetched VMA during the speculative page fau=\n",
+ "lt\n",
">>> handling when retrying is needed. This adds a lot of complexity and\n",
- ">>> additional tests done didn't show a significant performance improvement.\n",
+ ">>> additional tests done didn't show a significant performance improvem=\n",
+ "ent.\n",
">>> - Convert IS_ENABLED(CONFIG_NUMA) back to #ifdef due to build error.\n",
">>>\n",
- ">>> [1] http://linux-kernel.2935.n7.nabble.com/RFC-PATCH-0-6-Another-go-at-speculative-page-faults-tt965642.html#none\n",
+ ">>> [1] http://linux-kernel.2935.n7.nabble.com/RFC-PATCH-0-6-Another-go-at-=\n",
+ "speculative-page-faults-tt965642.html#none\n",
">>> [2] https://patchwork.kernel.org/patch/9999687/\n",
">>>\n",
">>>\n",
@@ -600,7 +717,8 @@
">>> mm/internal.h | 20 ++\n",
">>> mm/khugepaged.c | 5 +\n",
">>> mm/madvise.c | 6 +-\n",
- ">>> mm/memory.c | 612 +++++++++++++++++++++++++++++-----\n",
+ ">>> mm/memory.c | 612 ++++++++++++++++++++++++++=\n",
+ "+++-----\n",
">>> mm/mempolicy.c | 51 ++-\n",
">>> mm/migrate.c | 6 +-\n",
">>> mm/mlock.c | 13 +-\n",
@@ -628,4 +746,4 @@
">"
]
-617f24814c1dbc7638771bcb400081ca23fdfb2df0de6e5f2ba6ea5d5c4710de
+d5198e20fe696e1225cc0346b895da1b6a5df97e863d1ae8cf803c02fe2a3f0a