From: Aaron Lu <aaron.lu@intel.com>
To: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org,
    Andrew Morton <akpm@linux-foundation.org>,
    Huang Ying <ying.huang@intel.com>,
    Dave Hansen <dave.hansen@intel.com>,
    Kemi Wang <kemi.wang@intel.com>,
    Tim Chen <tim.c.chen@linux.intel.com>,
    Andi Kleen <ak@linux.intel.com>,
    Mel Gorman <mgorman@techsingularity.net>,
    Matthew Wilcox <willy@infradead.org>,
    David Rientjes <rientjes@google.com>
Subject: Re: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock
Date: Tue, 6 Mar 2018 20:27:33 +0800
Message-ID: <20180306122733.GA9664@intel.com>
In-Reply-To: <bdec481f-b402-64b6-75b0-350b370f3eac@suse.cz>

On Tue, Mar 06, 2018 at 08:55:57AM +0100, Vlastimil Babka wrote:
> On 03/05/2018 12:41 PM, Aaron Lu wrote:
> > On Fri, Mar 02, 2018 at 06:55:25PM +0100, Vlastimil Babka wrote:
> >> On 03/01/2018 03:00 PM, Michal Hocko wrote:
> >>>
> >>> I am really surprised that this has such a big impact.
> >>
> >> It's even stranger to me. Struct page is 64 bytes these days, exactly a
> >> a cache line. Unless that changed, Intel CPUs prefetched a "buddy" cache
> >> line (that forms an aligned 128 bytes block with the one we touch).
> >> Which is exactly a order-0 buddy struct page! Maybe that implicit
> >> prefetching stopped at L2 and explicit goes all the way to L1, can't
> >
> > The Intel Architecture Optimization Manual section 7.3.2 says:
> >
> > prefetchT0 - fetch data into all cache levels
> > Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer
> > microarchitectures: 1st, 2nd and 3rd level cache.
> >
> > prefetchT2 - fetch data into 2nd and 3rd level caches (identical to
> > prefetchT1)
> > Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer
> > microarchitectures: 2nd and 3rd level cache.
> >
> > prefetchNTA - fetch data into non-temporal cache close to the processor,
> > minimizing cache pollution
> > Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer
> > microarchitectures: must fetch into 3rd level cache with fast replacement.
> >
> > I tried 'prefetcht0' and 'prefetcht2' instead of the default
> > 'prefetchNTA' on a 2 sockets Intel Skylake, the two ended up with about
> > the same performance number as prefetchNTA. I had expected prefetchT0 to
> > deliver a better score if it was indeed due to L1D since prefetchT2 will
> > not place data into L1 while prefetchT0 will, but looks like it is not
> > the case here.
> >
> > It feels more like the buddy cacheline isn't in any level of the caches
> > without prefetch for some reason.
>
> So the adjacent line prefetch might be disabled? Could you check bios or
> the MSR mentioned in
> https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors

root@lkp-bdw-ep2 ~# rdmsr 0x1a4
0

Looks like this feature isn't disabled (the doc you linked says value 1
means disable).

> >> remember. Would that make such a difference? It would be nice to do some
> >> perf tests with cache counters to see what is really going on...
> >
> > Compare prefetchT2 to no-prefetch, I saw these metrics change:
> >
> > no-prefetch change prefetchT2      metrics
> >          \                  \
> >     stddev             stddev
> > ------------------------------------------------------------------------
> >       0.18    +0.0       0.18      perf-stat.branch-miss-rate%
> >  8.268e+09   +3.8%  8.585e+09      perf-stat.branch-misses
> >  2.333e+10   +4.7%  2.443e+10      perf-stat.cache-misses
> >  2.402e+11   +5.0%  2.522e+11      perf-stat.cache-references
> >       3.52   -1.1%       3.48      perf-stat.cpi
> >       0.02    -0.0       0.01 ±3%  perf-stat.dTLB-load-miss-rate%
> >  8.677e+08   -7.3%  8.048e+08 ±3%  perf-stat.dTLB-load-misses
> >       1.18    +0.0       1.19      perf-stat.dTLB-store-miss-rate%
> >  2.359e+10   +6.0%  2.502e+10      perf-stat.dTLB-store-misses
> >  1.979e+12   +5.0%  2.078e+12      perf-stat.dTLB-stores
> >  6.126e+09  +10.1%  6.745e+09 ±3%  perf-stat.iTLB-load-misses
> >       3464   -8.4%       3172 ±3%  perf-stat.instructions-per-iTLB-miss
> >       0.28   +1.1%       0.29      perf-stat.ipc
> >  2.929e+09   +5.1%  3.077e+09      perf-stat.minor-faults
> >  9.244e+09   +4.7%  9.681e+09      perf-stat.node-loads
> >  2.491e+08   +5.8%  2.634e+08      perf-stat.node-store-misses
> >  6.472e+09   +6.1%  6.869e+09      perf-stat.node-stores
> >  2.929e+09   +5.1%  3.077e+09      perf-stat.page-faults
> >    2182469   -4.2%    2090977      perf-stat.path-length
> >
> > Not sure if this is useful though...
>
> Looks like most stats increased in absolute values as the work done
> increased and this is a time-limited benchmark? Although number of

Yes it is.

> instructions (calculated from itlb misses and insns-per-itlb-miss) shows
> less than 1% increase, so dunno. And the improvement comes from reduced
> dTLB-load-misses? That makes no sense for order-0 buddy struct pages
> which always share a page. And the memmap mapping should use huge pages.

THP is disabled to stress order 0 pages (should have mentioned this in
patch's description, sorry about this).

> BTW what is path-length?

It's the instruction path length: the number of machine code instructions
required to execute a section of a computer program.
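
The instruction-count estimate in the quoted text (iTLB misses multiplied by instructions per iTLB miss) can be reproduced from the table. What follows is a minimal sketch in Python and is not part of the original mail: the function and variable names are made up for illustration, and the inputs are the rounded values from the perf-stat table, so the result is only approximate.

```python
# Rough cross-check of the "less than 1% increase" estimate, using the
# rounded perf-stat values quoted in the table above.
# All names here are illustrative only; they do not come from the kernel or lkp.


def estimated_instructions(itlb_load_misses, instructions_per_itlb_miss):
    """Estimate total instructions as misses * (instructions per miss)."""
    return itlb_load_misses * instructions_per_itlb_miss


def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# perf-stat.iTLB-load-misses and perf-stat.instructions-per-iTLB-miss
no_prefetch = estimated_instructions(6.126e9, 3464)
prefetch_t2 = estimated_instructions(6.745e9, 3172)

insn_change = percent_change(no_prefetch, prefetch_t2)

# perf-stat.path-length, taken directly from the table
path_length_change = percent_change(2182469, 2090977)

print(f"estimated instructions, no-prefetch: {no_prefetch:.4e}")
print(f"estimated instructions, prefetchT2:  {prefetch_t2:.4e}")
print(f"instruction count change: {insn_change:+.2f}%")
print(f"path-length change:       {path_length_change:+.2f}%")
```

With these rounded inputs the estimated instruction count rises by less than 1%, in line with the remark in the quoted text, while the path-length change agrees with the -4.2% shown in the table.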