From: Aaron Lu <aaron.lu@intel.com>
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Huang Ying <ying.huang@intel.com>,
	Dave Hansen <dave.hansen@intel.com>,
	Kemi Wang <kemi.wang@intel.com>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	Andi Kleen <ak@linux.intel.com>,
	Michal Hocko <mhocko@suse.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	Mel Gorman <mgorman@techsingularity.net>,
	Matthew Wilcox <willy@infradead.org>,
	David Rientjes <rientjes@google.com>
Subject: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock
Date: Thu, 1 Mar 2018 14:28:45 +0800
Message-ID: <20180301062845.26038-4-aaron.lu@intel.com>
In-Reply-To: <20180301062845.26038-1-aaron.lu@intel.com>

When a page is freed back to the global pool, its buddy will be checked
to see if it's possible to do a merge. This requires accessing buddy's
page structure and that access could take a long time if it's cache cold.

This patch adds a prefetch to the to-be-freed page's buddy outside of
zone->lock in the hope that accessing buddy's page structure later under
zone->lock will be faster. Since we *always* do buddy merging and check
an order-0 page's buddy to try to merge it when it goes into the main
allocator, the cacheline will always come in, i.e. the prefetched data
will never be unused.

In the meantime, there are two concerns:
1 the prefetch could potentially evict existing cachelines, especially
  for L1D cache since it is not huge;
2 there is some additional instruction overhead, namely calculating
  buddy pfn twice.

For 1, it's hard to say: this microbenchmark shows good results, but the
actual benefit of this patch will be workload/CPU dependent;
For 2, since the calculation is an XOR on two local variables, it's
expected in many cases that the cycles spent will be offset by reduced
memory latency later. This is especially true for NUMA machines where
multiple CPUs are contending on zone->lock and the most time-consuming
part under zone->lock is the wait for the 'struct page' cachelines of the
to-be-freed pages and their buddies.

Test with will-it-scale/page_fault1 full load:

kernel      Broadwell(2S)   Skylake(2S)    Broadwell(4S)   Skylake(4S)
v4.16-rc2+   9034215        7971818        13667135        15677465
patch2/3     9536374 +5.6%  8314710 +4.3%  14070408 +3.0%  16675866 +6.4%
this patch  10338868 +8.4%  8544477 +2.8%  14839808 +5.5%  17155464 +2.9%
Note: this patch's performance improvement percent is against patch2/3.

[changelog stolen from Dave Hansen and Mel Gorman's comments]
https://lkml.org/lkml/2018/1/24/551

Suggested-by: Ying Huang <ying.huang@intel.com>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
 mm/page_alloc.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dafdcdec9c1f..1d838041931e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1141,6 +1141,9 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			batch_free = count;
 
 		do {
+			unsigned long pfn, buddy_pfn;
+			struct page *buddy;
+
 			page = list_last_entry(list, struct page, lru);
 			/* must delete to avoid corrupting pcp list */
 			list_del(&page->lru);
@@ -1150,6 +1153,18 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 				continue;
 
 			list_add_tail(&page->lru, &head);
+
+			/*
+			 * We are going to put the page back to the global
+			 * pool, prefetch its buddy to speed up later access
+			 * under zone->lock. It is believed the overhead of
+			 * calculating buddy_pfn here can be offset by reduced
+			 * memory latency later.
+			 */
+			pfn = page_to_pfn(page);
+			buddy_pfn = __find_buddy_pfn(pfn, 0);
+			buddy = page + (buddy_pfn - pfn);
+			prefetch(buddy);
 		} while (--count && --batch_free && !list_empty(list));
 	}
 
-- 
2.14.3
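
A note on the arithmetic in the second hunk: for order 0 the buddy pfn differs from the page's own pfn only in bit 0, so the buddy's struct page sits immediately before or after the page's own descriptor. Below is a minimal userspace sketch of that idea, not the kernel code. The demo_* names and struct demo_page are made up for illustration, the sketch assumes the page descriptors live in one flat array, and it uses the GCC/Clang builtin __builtin_prefetch as a stand-in for the kernel's prefetch().

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct page; only its size matters here. */
struct demo_page {
	unsigned long flags;
	void *lru_next;
	void *lru_prev;
};

/* Buddy of a 2^order block: flip bit 'order' of the pfn (a single XOR). */
static unsigned long demo_find_buddy_pfn(unsigned long pfn, unsigned int order)
{
	assert(order < sizeof(unsigned long) * CHAR_BIT);
	return pfn ^ (1UL << order);
}

/*
 * Given the descriptor of page 'pfn' inside a flat descriptor array,
 * locate the order-0 buddy's descriptor and issue a prefetch hint for it.
 * The signed delta keeps the pointer arithmetic well defined in ISO C
 * when the buddy has the lower pfn.
 */
static struct demo_page *demo_prefetch_buddy(struct demo_page *page,
					     unsigned long pfn)
{
	unsigned long buddy_pfn = demo_find_buddy_pfn(pfn, 0);
	ptrdiff_t delta = buddy_pfn > pfn ?
			  (ptrdiff_t)(buddy_pfn - pfn) :
			  -(ptrdiff_t)(pfn - buddy_pfn);
	struct demo_page *buddy = page + delta;

	/* Assumption: GCC/Clang builtin; it is a hint and never faults. */
	__builtin_prefetch(buddy);
	return buddy;
}
```

With a four-entry array, demo_prefetch_buddy(&pages[2], 2) returns &pages[3] and demo_prefetch_buddy(&pages[3], 3) returns &pages[2]. The prefetch is only a hint, so the returned pointer is the same whether or not the cacheline was actually brought in.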