* [PATCH 1/2] free_pcppages_bulk: do not hold lock when picking pages to free

From: Aaron Lu <aaron.lu@intel.com>
Date: 2018-01-24 2:30 UTC
To: linux-mm, linux-kernel
Cc: Andrew Morton, Huang Ying, Dave Hansen, Kemi Wang, Tim Chen,
    Andi Kleen, Michal Hocko, Vlastimil Babka, Mel Gorman

When freeing a batch of pages from Per-CPU-Pages (PCP) back to buddy,
the zone->lock is held and then pages are chosen from PCP's migratetype
list. There is actually no need to do this 'choose' part under the lock:
these are PCP pages, the only CPU that can touch them is us, and irqs
are disabled.

Moving this part outside the lock reduces lock hold time and improves
performance. Test with will-it-scale/page_fault1 full load:

kernel      Broadwell(2S)  Skylake(2S)    Broadwell(4S)   Skylake(4S)
v4.15-rc4    9037332        8000124        13642741        15728686
this patch   9608786 +6.3%  8368915 +4.6%  14042169 +2.9%  17433559 +10.8%

What the test does is: start $nr_cpu processes, each of which repeatedly
does the following for 5 minutes:
1 mmap 128M anonymous space;
2 write access to that space;
3 munmap.
The score is the aggregated iteration count.

https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
 mm/page_alloc.c | 33 +++++++++++++++++++--------------
 1 file changed, 19 insertions(+), 14 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4093728f292e..a076f754dac1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1113,12 +1113,12 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	int migratetype = 0;
 	int batch_free = 0;
 	bool isolated_pageblocks;
+	struct list_head head;
+	struct page *page, *tmp;

-	spin_lock(&zone->lock);
-	isolated_pageblocks = has_isolate_pageblock(zone);
+	INIT_LIST_HEAD(&head);

 	while (count) {
-		struct page *page;
 		struct list_head *list;

 		/*
@@ -1140,26 +1140,31 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			batch_free = count;

 		do {
-			int mt;	/* migratetype of the to-be-freed page */
-
 			page = list_last_entry(list, struct page, lru);
 			/* must delete as __free_one_page list manipulates */
 			list_del(&page->lru);

-			mt = get_pcppage_migratetype(page);
-			/* MIGRATE_ISOLATE page should not go to pcplists */
-			VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
-			/* Pageblock could have been isolated meanwhile */
-			if (unlikely(isolated_pageblocks))
-				mt = get_pageblock_migratetype(page);
-
 			if (bulkfree_pcp_prepare(page))
 				continue;

-			__free_one_page(page, page_to_pfn(page), zone, 0, mt);
-			trace_mm_page_pcpu_drain(page, 0, mt);
+			list_add_tail(&page->lru, &head);
 		} while (--count && --batch_free && !list_empty(list));
 	}
+
+	spin_lock(&zone->lock);
+	isolated_pageblocks = has_isolate_pageblock(zone);
+
+	list_for_each_entry_safe(page, tmp, &head, lru) {
+		int mt = get_pcppage_migratetype(page);
+		/* MIGRATE_ISOLATE page should not go to pcplists */
+		VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
+		/* Pageblock could have been isolated meanwhile */
+		if (unlikely(isolated_pageblocks))
+			mt = get_pageblock_migratetype(page);
+
+		__free_one_page(page, page_to_pfn(page), zone, 0, mt);
+		trace_mm_page_pcpu_drain(page, 0, mt);
+	}
 	spin_unlock(&zone->lock);
 }
--
2.14.3
* [PATCH 2/2] free_pcppages_bulk: prefetch buddy while not holding lock

From: Aaron Lu <aaron.lu@intel.com>
Date: 2018-01-24 2:30 UTC
To: linux-mm, linux-kernel
Cc: Andrew Morton, Huang Ying, Dave Hansen, Kemi Wang, Tim Chen,
    Andi Kleen, Michal Hocko, Vlastimil Babka, Mel Gorman

When a page is freed back to the global pool, its buddy will be checked
to see if a merge is possible. This requires accessing the buddy's page
structure, and that access can take a long time if it is cache cold.

This patch adds a prefetch of the to-be-freed page's buddy outside of
zone->lock, in the hope that accessing the buddy's page structure later
under zone->lock will be faster.

Test with will-it-scale/page_fault1 full load:

kernel      Broadwell(2S)  Skylake(2S)    Broadwell(4S)   Skylake(4S)
v4.15-rc4    9037332        8000124        13642741        15728686
patch1/2     9608786 +6.3%  8368915 +4.6%  14042169 +2.9%  17433559 +10.8%
this patch  10462292 +8.9%  8602889 +2.8%  14802073 +5.4%  17624575 +1.1%

Note: this patch's improvement percentage is relative to patch 1/2.

Suggested-by: Ying Huang <ying.huang@intel.com>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
 mm/page_alloc.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a076f754dac1..9ef084d41708 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1140,6 +1140,9 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			batch_free = count;

 		do {
+			unsigned long pfn, buddy_pfn;
+			struct page *buddy;
+
 			page = list_last_entry(list, struct page, lru);
 			/* must delete as __free_one_page list manipulates */
 			list_del(&page->lru);
@@ -1148,6 +1151,16 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 				continue;

 			list_add_tail(&page->lru, &head);
+
+			/*
+			 * We are going to put the page back to
+			 * the global pool, prefetch its buddy to
+			 * speed up later access under zone->lock.
+			 */
+			pfn = page_to_pfn(page);
+			buddy_pfn = __find_buddy_pfn(pfn, 0);
+			buddy = page + (buddy_pfn - pfn);
+			prefetch(buddy);
 		} while (--count && --batch_free && !list_empty(list));
 	}
--
2.14.3
* Re: [PATCH 2/2] free_pcppages_bulk: prefetch buddy while not holding lock

From: Mel Gorman
Date: 2018-01-24 16:43 UTC
To: Aaron Lu
Cc: linux-mm, linux-kernel, Andrew Morton, Huang Ying, Dave Hansen,
    Kemi Wang, Tim Chen, Andi Kleen, Michal Hocko, Vlastimil Babka

On Wed, Jan 24, 2018 at 10:30:50AM +0800, Aaron Lu wrote:
> When a page is freed back to the global pool, its buddy will be checked
> to see if a merge is possible. This requires accessing the buddy's page
> structure, and that access can take a long time if it is cache cold.
>
> This patch adds a prefetch of the to-be-freed page's buddy outside of
> zone->lock, in the hope that accessing the buddy's page structure later
> under zone->lock will be faster.
>
> Test with will-it-scale/page_fault1 full load:
>
> kernel      Broadwell(2S)  Skylake(2S)    Broadwell(4S)   Skylake(4S)
> v4.15-rc4    9037332        8000124        13642741        15728686
> patch1/2     9608786 +6.3%  8368915 +4.6%  14042169 +2.9%  17433559 +10.8%
> this patch  10462292 +8.9%  8602889 +2.8%  14802073 +5.4%  17624575 +1.1%
>
> Note: this patch's improvement percentage is relative to patch 1/2.
>

I'm less convinced by this for a microbenchmark. Prefetch has not been a
universal win in the past, and we cannot be sure that it's a good idea on
all architectures, or that it doesn't have other side-effects such as
consuming memory bandwidth for data we don't need or evicting cache-hot
data for buddy information that is not used. Furthermore, we end up doing
some calculations twice without any guarantee that the prefetch can
offset the cost.

It's not a strong enough opinion to outright NAK it, but I'm not ACKing
it either.

--
Mel Gorman
SUSE Labs
* Re: [PATCH 2/2] free_pcppages_bulk: prefetch buddy while not holding lock

From: Dave Hansen
Date: 2018-01-24 16:57 UTC
To: Mel Gorman, Aaron Lu
Cc: linux-mm, linux-kernel, Andrew Morton, Huang Ying, Kemi Wang,
    Tim Chen, Andi Kleen, Michal Hocko, Vlastimil Babka

On 01/24/2018 08:43 AM, Mel Gorman wrote:
> I'm less convinced by this for a microbenchmark. Prefetch has not been a
> universal win in the past and we cannot be sure that it's a good idea on
> all architectures or doesn't have other side-effects such as consuming
> memory bandwidth for data we don't need or evicting cache hot data for
> buddy information that is not used.

I had the same reaction.

But, I think this case is special. We *always* do buddy merging (well,
before the next patch in the series is applied) and check an order-0
page's buddy to try to merge it when it goes into the main allocator.
So, the cacheline will always come in.

IOW, I don't think this has the same downsides normally associated with
prefetch() since the data is always used.
* Re: [PATCH 2/2] free_pcppages_bulk: prefetch buddy while not holding lock

From: Mel Gorman
Date: 2018-01-24 18:19 UTC
To: Dave Hansen
Cc: Aaron Lu, linux-mm, linux-kernel, Andrew Morton, Huang Ying,
    Kemi Wang, Tim Chen, Andi Kleen, Michal Hocko, Vlastimil Babka

On Wed, Jan 24, 2018 at 08:57:43AM -0800, Dave Hansen wrote:
> On 01/24/2018 08:43 AM, Mel Gorman wrote:
> > I'm less convinced by this for a microbenchmark. Prefetch has not been a
> > universal win in the past and we cannot be sure that it's a good idea on
> > all architectures or doesn't have other side-effects such as consuming
> > memory bandwidth for data we don't need or evicting cache hot data for
> > buddy information that is not used.
>
> I had the same reaction.
>
> But, I think this case is special. We *always* do buddy merging (well,
> before the next patch in the series is applied) and check an order-0
> page's buddy to try to merge it when it goes into the main allocator.
> So, the cacheline will always come in.
>
> IOW, I don't think this has the same downsides normally associated with
> prefetch() since the data is always used.

That doesn't side-step the fact that the calculations are done twice in
the free_pcppages_bulk path, and there is no guarantee that one prefetch
in the list of pages being freed will not evict a previous prefetch due
to collisions. At least on the machine I'm writing this from, the
prefetches necessary for a standard drain are 1/16th of the L1D cache,
so some collisions/evictions are possible. We're doing definite work in
one path on the chance it'll still be cache-resident when it's
recalculated.

I suspect that only a microbenchmark doing very large amounts of frees
(or a large munmap or exit) will notice, and the costs of a large
munmap/exit are so high that the prefetch will be a negligible saving.

--
Mel Gorman
SUSE Labs
* Re: [PATCH 2/2] free_pcppages_bulk: prefetch buddy while not holding lock

From: Dave Hansen
Date: 2018-01-24 19:23 UTC
To: Mel Gorman
Cc: Aaron Lu, linux-mm, linux-kernel, Andrew Morton, Huang Ying,
    Kemi Wang, Tim Chen, Andi Kleen, Michal Hocko, Vlastimil Babka

On 01/24/2018 10:19 AM, Mel Gorman wrote:
>> IOW, I don't think this has the same downsides normally associated with
>> prefetch() since the data is always used.
> That doesn't side-step the calculations are done twice in the
> free_pcppages_bulk path and there is no guarantee that one prefetch
> in the list of pages being freed will not evict a previous prefetch
> due to collisions.

Fair enough. The description here could probably use some touchups to
explicitly spell out the downsides.

I do agree with you that there is no guarantee that this will be
resident in the cache before use. In fact, it might be entertaining to
see if we can show the extra conflicts in the L1 from this change, given
a large enough PCP batch size.

But, this isn't just about the L1. If the results of the prefetch()
stay in *ANY* cache, then the memory bandwidth impact of this change is
still zero. You'll have a lot harder time arguing that we're likely to
see L2/L3 evictions in this path for our typical PCP batch sizes.

Do you want to see some analysis for less-frequent PCP frees? We could
pretty easily instrument the latency doing normal-ish things to see if
we can measure a benefit from this rather than a tight-loop micro.
* Re: [PATCH 2/2] free_pcppages_bulk: prefetch buddy while not holding lock

From: Mel Gorman
Date: 2018-01-24 21:12 UTC
To: Dave Hansen
Cc: Aaron Lu, linux-mm, linux-kernel, Andrew Morton, Huang Ying,
    Kemi Wang, Tim Chen, Andi Kleen, Michal Hocko, Vlastimil Babka

On Wed, Jan 24, 2018 at 11:23:49AM -0800, Dave Hansen wrote:
> On 01/24/2018 10:19 AM, Mel Gorman wrote:
> >> IOW, I don't think this has the same downsides normally associated with
> >> prefetch() since the data is always used.
> > That doesn't side-step the calculations are done twice in the
> > free_pcppages_bulk path and there is no guarantee that one prefetch
> > in the list of pages being freed will not evict a previous prefetch
> > due to collisions.
>
> Fair enough. The description here could probably use some touchups to
> explicitly spell out the downsides.
>

It would be preferable. As I said, I'm not explicitly NAKing this, but
it might push someone else over the edge into an outright ACK. I think
patch 1 should go ahead as-is unconditionally as I see no reason to hold
that one back.

I would suggest adding the detail in the changelog that a prefetch will
potentially evict an earlier prefetch from the L1 cache, but it is
expected the data would still be preserved in an L2 or L3 cache. Further
note that while there is some additional instruction overhead, the data
must be fetched eventually, and it's expected in many cases that cycles
spent early will be offset by reduced memory latency later. Finally,
note that the actual benefit will be workload/CPU dependent.

Also consider adding a comment above the actual prefetch, because it
deserves one; otherwise it looks like a fast path is being sprinkled
with magic dust from the prefetch fairy.

> I do agree with you that there is no guarantee that this will be
> resident in the cache before use. In fact, it might be entertaining to
> see if we can show the extra conflicts in the L1 given from this change
> given a large enough PCP batch size.
>

Well, I wouldn't bother worrying about different PCP batch sizes. In
typical operation it's going to be the pcp->batch size. Even if you were
dumping the entire PCP due to a drain, it's still going to be less than
many L1 sizes on x86 at least, and those drains are usually in the
context of a much larger operation where the overhead of the buddy
calculations will be negligible in comparison.

> But, this isn't just about the L1. If the results of the prefetch()
> stay in *ANY* cache, then the memory bandwidth impact of this change is
> still zero. You'll have a lot harder time arguing that we're likely to
> see L2/L3 evictions in this path for our typical PCP batch sizes.
>

s/lot harder time/impossible without making crap up/

> Do you want to see some analysis for less-frequent PCP frees?

Actually, no. While it would be interesting to see, it's a waste of your
time. Less-frequent PCP frees mean that the impact of the extra cycles
is *also* marginal, and a PCP drain must fetch the data eventually.

> We could
> pretty easily instrument the latency doing normal-ish things to see if
> we can measure a benefit from this rather than a tight-loop micro.

I believe the only realistic scenario where it's going to be detectable
is a network-intensive application on very high speed networks, where
the cost of the alloc/free paths tends to be more noticeable. I suspect
anything else will be so far into the noise that it'll be unnoticeable.

--
Mel Gorman
SUSE Labs
* [PATCH v2 2/2] free_pcppages_bulk: prefetch buddy while not holding lock

From: Aaron Lu
Date: 2018-01-25 7:25 UTC
To: Mel Gorman
Cc: Dave Hansen, linux-mm, linux-kernel, Andrew Morton, Huang Ying,
    Kemi Wang, Tim Chen, Andi Kleen, Michal Hocko, Vlastimil Babka

When a page is freed back to the global pool, its buddy will be checked
to see if a merge is possible. This requires accessing the buddy's page
structure, and that access can take a long time if it is cache cold.

This patch adds a prefetch of the to-be-freed page's buddy outside of
zone->lock, in the hope that accessing the buddy's page structure later
under zone->lock will be faster.

Since we *always* do buddy merging and check an order-0 page's buddy to
try to merge it when it goes into the main allocator, the cacheline will
always come in, i.e. the prefetched data will never be unused.

In the meantime, there is the concern that a prefetch could evict
existing cachelines. This can be true for the L1D cache, since it is not
huge. However, considering that the prefetch instruction used is
prefetchnta, which will only store the data in L2 for "Pentium 4 and
Intel Xeon processors" according to Intel's instruction set reference,
it is not likely to cause cache pollution. Other architectures may have
this cache pollution problem though.

There is also some additional instruction overhead, namely calculating
the buddy pfn twice. Since the calculation is a XOR on two local
variables, it's expected in many cases that cycles spent will be offset
by reduced memory latency later. This is especially true for NUMA
machines, where multiple CPUs are contending on zone->lock and the most
time-consuming part under zone->lock is the wait for the 'struct page'
cachelines of the to-be-freed pages and their buddies.

Test with will-it-scale/page_fault1 full load:

kernel      Broadwell(2S)  Skylake(2S)    Broadwell(4S)   Skylake(4S)
v4.15-rc4    9037332        8000124        13642741        15728686
patch1/2     9608786 +6.3%  8368915 +4.6%  14042169 +2.9%  17433559 +10.8%
this patch  10462292 +8.9%  8602889 +2.8%  14802073 +5.4%  17624575 +1.1%

Note: this patch's improvement percentage is relative to patch 1/2.
Please also note that the actual benefit of this patch will be
workload/CPU dependent.

[changelog stolen from Dave Hansen and Mel Gorman's comments]
https://lkml.org/lkml/2018/1/24/551

Suggested-by: Ying Huang <ying.huang@intel.com>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
v2: update changelog according to Dave Hansen and Mel Gorman's comments.
    Add more comments in code to explain why prefetch is done, as
    requested by Mel Gorman.

 mm/page_alloc.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c9e5ded39b16..6566a4b5b124 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1138,6 +1138,9 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			batch_free = count;

 		do {
+			unsigned long pfn, buddy_pfn;
+			struct page *buddy;
+
 			page = list_last_entry(list, struct page, lru);
 			/* must delete as __free_one_page list manipulates */
 			list_del(&page->lru);
@@ -1146,6 +1149,18 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 				continue;

 			list_add_tail(&page->lru, &head);
+
+			/*
+			 * We are going to put the page back to the global
+			 * pool, prefetch its buddy to speed up later access
+			 * under zone->lock. It is believed the overhead of
+			 * calculating buddy_pfn here can be offset by reduced
+			 * memory latency later.
+			 */
+			pfn = page_to_pfn(page);
+			buddy_pfn = __find_buddy_pfn(pfn, 0);
+			buddy = page + (buddy_pfn - pfn);
+			prefetch(buddy);
 		} while (--count && --batch_free && !list_empty(list));
 	}
--
2.14.3
* Re: [PATCH 1/2] free_pcppages_bulk: do not hold lock when picking pages to free

From: Mel Gorman
Date: 2018-01-24 16:40 UTC
To: Aaron Lu
Cc: linux-mm, linux-kernel, Andrew Morton, Huang Ying, Dave Hansen,
    Kemi Wang, Tim Chen, Andi Kleen, Michal Hocko, Vlastimil Babka

On Wed, Jan 24, 2018 at 10:30:49AM +0800, Aaron Lu wrote:
> When freeing a batch of pages from Per-CPU-Pages (PCP) back to buddy,
> the zone->lock is held and then pages are chosen from PCP's migratetype
> list. There is actually no need to do this 'choose' part under the
> lock: these are PCP pages, the only CPU that can touch them is us, and
> irqs are disabled.
>
> Moving this part outside the lock reduces lock hold time and improves
> performance. Test with will-it-scale/page_fault1 full load:
>
> kernel      Broadwell(2S)  Skylake(2S)    Broadwell(4S)   Skylake(4S)
> v4.15-rc4    9037332        8000124        13642741        15728686
> this patch   9608786 +6.3%  8368915 +4.6%  14042169 +2.9%  17433559 +10.8%
>
> What the test does is: start $nr_cpu processes, each of which repeatedly
> does the following for 5 minutes:
> 1 mmap 128M anonymous space;
> 2 write access to that space;
> 3 munmap.
> The score is the aggregated iteration count.
>
> https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c
>
> Signed-off-by: Aaron Lu <aaron.lu@intel.com>
> ---
>  mm/page_alloc.c | 33 +++++++++++++++++++--------------
>  1 file changed, 19 insertions(+), 14 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4093728f292e..a076f754dac1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1113,12 +1113,12 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  	int migratetype = 0;
>  	int batch_free = 0;
>  	bool isolated_pageblocks;
> +	struct list_head head;
> +	struct page *page, *tmp;
>
> -	spin_lock(&zone->lock);
> -	isolated_pageblocks = has_isolate_pageblock(zone);
> +	INIT_LIST_HEAD(&head);
>

Declare head as LIST_HEAD(head) and avoid INIT_LIST_HEAD. Otherwise I
think this is safe.

Acked-by: Mel Gorman <mgorman@techsingularity.net>

--
Mel Gorman
SUSE Labs
* [PATCH v2 1/2] free_pcppages_bulk: do not hold lock when picking pages to free

From: Aaron Lu
Date: 2018-01-25 7:21 UTC
To: Mel Gorman
Cc: linux-mm, linux-kernel, Andrew Morton, Huang Ying, Dave Hansen,
    Kemi Wang, Tim Chen, Andi Kleen, Michal Hocko, Vlastimil Babka

When freeing a batch of pages from Per-CPU-Pages (PCP) back to buddy,
the zone->lock is held and then pages are chosen from PCP's migratetype
list. There is actually no need to do this 'choose' part under the lock:
these are PCP pages, the only CPU that can touch them is us, and irqs
are disabled.

Moving this part outside the lock reduces lock hold time and improves
performance. Test with will-it-scale/page_fault1 full load:

kernel      Broadwell(2S)  Skylake(2S)    Broadwell(4S)   Skylake(4S)
v4.15-rc4    9037332        8000124        13642741        15728686
this patch   9608786 +6.3%  8368915 +4.6%  14042169 +2.9%  17433559 +10.8%

What the test does is: start $nr_cpu processes, each of which repeatedly
does the following for 5 minutes:
1 mmap 128M anonymous space;
2 write access to that space;
3 munmap.
The score is the aggregated iteration count.

https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c

Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
v2: use LIST_HEAD(head) as suggested by Mel Gorman.

 mm/page_alloc.c | 33 ++++++++++++++++++---------------
 1 file changed, 18 insertions(+), 15 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4093728f292e..c9e5ded39b16 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1113,12 +1113,10 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	int migratetype = 0;
 	int batch_free = 0;
 	bool isolated_pageblocks;
-
-	spin_lock(&zone->lock);
-	isolated_pageblocks = has_isolate_pageblock(zone);
+	struct page *page, *tmp;
+	LIST_HEAD(head);

 	while (count) {
-		struct page *page;
 		struct list_head *list;

 		/*
@@ -1140,26 +1138,31 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			batch_free = count;

 		do {
-			int mt;	/* migratetype of the to-be-freed page */
-
 			page = list_last_entry(list, struct page, lru);
 			/* must delete as __free_one_page list manipulates */
 			list_del(&page->lru);

-			mt = get_pcppage_migratetype(page);
-			/* MIGRATE_ISOLATE page should not go to pcplists */
-			VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
-			/* Pageblock could have been isolated meanwhile */
-			if (unlikely(isolated_pageblocks))
-				mt = get_pageblock_migratetype(page);
-
 			if (bulkfree_pcp_prepare(page))
 				continue;

-			__free_one_page(page, page_to_pfn(page), zone, 0, mt);
-			trace_mm_page_pcpu_drain(page, 0, mt);
+			list_add_tail(&page->lru, &head);
 		} while (--count && --batch_free && !list_empty(list));
 	}
+
+	spin_lock(&zone->lock);
+	isolated_pageblocks = has_isolate_pageblock(zone);
+
+	list_for_each_entry_safe(page, tmp, &head, lru) {
+		int mt = get_pcppage_migratetype(page);
+		/* MIGRATE_ISOLATE page should not go to pcplists */
+		VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
+		/* Pageblock could have been isolated meanwhile */
+		if (unlikely(isolated_pageblocks))
+			mt = get_pageblock_migratetype(page);
+
+		__free_one_page(page, page_to_pfn(page), zone, 0, mt);
+		trace_mm_page_pcpu_drain(page, 0, mt);
+	}
 	spin_unlock(&zone->lock);
 }
--
2.14.3
* Re: [PATCH v2 1/2] free_pcppages_bulk: do not hold lock when picking pages to free

From: Mel Gorman
Date: 2018-02-15 12:06 UTC
To: Aaron Lu
Cc: linux-mm, linux-kernel, Andrew Morton, Huang Ying, Dave Hansen,
    Kemi Wang, Tim Chen, Andi Kleen, Michal Hocko, Vlastimil Babka

On Thu, Jan 25, 2018 at 03:21:44PM +0800, Aaron Lu wrote:
> When freeing a batch of pages from Per-CPU-Pages (PCP) back to buddy,
> the zone->lock is held and then pages are chosen from PCP's migratetype
> list. There is actually no need to do this 'choose' part under the
> lock: these are PCP pages, the only CPU that can touch them is us, and
> irqs are disabled.
>
> Moving this part outside the lock reduces lock hold time and improves
> performance. Test with will-it-scale/page_fault1 full load:
>
> kernel      Broadwell(2S)  Skylake(2S)    Broadwell(4S)   Skylake(4S)
> v4.15-rc4    9037332        8000124        13642741        15728686
> this patch   9608786 +6.3%  8368915 +4.6%  14042169 +2.9%  17433559 +10.8%
>
> What the test does is: start $nr_cpu processes, each of which repeatedly
> does the following for 5 minutes:
> 1 mmap 128M anonymous space;
> 2 write access to that space;
> 3 munmap.
> The score is the aggregated iteration count.
>
> https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c
>
> Acked-by: Mel Gorman <mgorman@techsingularity.net>
> Signed-off-by: Aaron Lu <aaron.lu@intel.com>

It looks like this series may have gotten lost because it was embedded
within an existing thread, or else it was the proximity to the merge
window. I suggest a rebase, retest and resubmit unless there was some
major objection that I missed. Patch 1 is fine by me at least. I never
explicitly acked patch 2, but I've no major objection to it; I'm just a
tad uncomfortable with prefetch magic sauce in general.

--
Mel Gorman
SUSE Labs
* Re: [PATCH v2 1/2] free_pcppages_bulk: do not hold lock when picking pages to free

From: Aaron Lu
Date: 2018-02-23 1:37 UTC
To: Mel Gorman
Cc: linux-mm, linux-kernel, Andrew Morton, Huang Ying, Dave Hansen,
    Kemi Wang, Tim Chen, Andi Kleen, Michal Hocko, Vlastimil Babka

On Thu, Feb 15, 2018 at 12:06:08PM +0000, Mel Gorman wrote:
> On Thu, Jan 25, 2018 at 03:21:44PM +0800, Aaron Lu wrote:
> > When freeing a batch of pages from Per-CPU-Pages (PCP) back to buddy,
> > the zone->lock is held and then pages are chosen from PCP's migratetype
> > list. There is actually no need to do this 'choose' part under the
> > lock: these are PCP pages, the only CPU that can touch them is us, and
> > irqs are disabled.
> >
> > Moving this part outside the lock reduces lock hold time and improves
> > performance. Test with will-it-scale/page_fault1 full load:
> >
> > kernel      Broadwell(2S)  Skylake(2S)    Broadwell(4S)   Skylake(4S)
> > v4.15-rc4    9037332        8000124        13642741        15728686
> > this patch   9608786 +6.3%  8368915 +4.6%  14042169 +2.9%  17433559 +10.8%
> >
> > What the test does is: start $nr_cpu processes, each of which repeatedly
> > does the following for 5 minutes:
> > 1 mmap 128M anonymous space;
> > 2 write access to that space;
> > 3 munmap.
> > The score is the aggregated iteration count.
> >
> > https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c
> >
> > Acked-by: Mel Gorman <mgorman@techsingularity.net>
> > Signed-off-by: Aaron Lu <aaron.lu@intel.com>
>
> It looks like this series may have gotten lost because it was embedded
> within an existing thread or else it was the proximity to the merge
> window. I suggest a rebase, retest and resubmit unless there was some
> major objection that I missed. Patch 1 is fine by me at least. I never
> explicitly acked patch 2 but I've no major objection to it, just am a tad
> uncomfortable with prefetch magic sauce in general.

Thanks for the suggestion. I just got back from vacation and will send
it out once I have collected all the required data.
* Re: [PATCH v2 1/2] free_pcppages_bulk: do not hold lock when picking pages to free

From: Matthew Wilcox
Date: 2018-02-15 12:46 UTC
To: Aaron Lu
Cc: Mel Gorman, linux-mm, linux-kernel, Andrew Morton, Huang Ying,
    Dave Hansen, Kemi Wang, Tim Chen, Andi Kleen, Michal Hocko,
    Vlastimil Babka

On Thu, Jan 25, 2018 at 03:21:44PM +0800, Aaron Lu wrote:
> When freeing a batch of pages from Per-CPU-Pages (PCP) back to buddy,
> the zone->lock is held and then pages are chosen from PCP's migratetype
> list. There is actually no need to do this 'choose' part under the
> lock: these are PCP pages, the only CPU that can touch them is us, and
> irqs are disabled.

I have no objection to this patch. If you're looking for ideas for
future improvement, though, I wonder whether using a LIST_HEAD is the
best way to store these pages temporarily. If you batch them into a
pagevec and then free the entire pagevec, the CPU should be a little
faster scanning a short array than walking a linked list.

It would also put a hard boundary on how long zone->lock is held, as
you'd drop it and go back for another batch after 15 pages. That might
be bad, of course.

Another minor change I'd like to see is free_pcppages_bulk updating
pcp->count itself; all of the callers do it currently.
* Re: [PATCH v2 1/2] free_pcppages_bulk: do not hold lock when picking pages to free 2018-02-15 12:46 ` Matthew Wilcox @ 2018-02-15 14:55 ` Mel Gorman 2018-02-23 1:42 ` Aaron Lu 1 sibling, 0 replies; 19+ messages in thread From: Mel Gorman @ 2018-02-15 14:55 UTC (permalink / raw) To: Matthew Wilcox Cc: Aaron Lu, linux-mm, linux-kernel, Andrew Morton, Huang Ying, Dave Hansen, Kemi Wang, Tim Chen, Andi Kleen, Michal Hocko, Vlastimil Babka On Thu, Feb 15, 2018 at 04:46:44AM -0800, Matthew Wilcox wrote: > On Thu, Jan 25, 2018 at 03:21:44PM +0800, Aaron Lu wrote: > > When freeing a batch of pages from Per-CPU-Pages(PCP) back to buddy, > > the zone->lock is held and then pages are chosen from PCP's migratetype > > list. While there is actually no need to do this 'choose part' under > > lock since it's PCP pages, the only CPU that can touch them is us and > > irq is also disabled. > > I have no objection to this patch. If you're looking for ideas for > future improvement though, I wonder whether using a LIST_HEAD is the > best way to store these pages temporarily. If you batch them into a > pagevec and then free the entire pagevec, the CPU should be a little > faster scanning a short array than walking a linked list. > > It would also puts a hard boundary on how long zone->lock is held, as > you'd drop it and go back for another batch after 15 pages. That might > be bad, of course. > It's not a guaranteed win. You're trading a list traversal for increased stack usage of 128 bytes (unless you stick a static one in the pgdat and incur a cache miss penalty instead or a per-cpu pagevec and increase memory consumption) and 2 spin lock acquire/releases in the common case which may or may not be contended. It might make more sense if the PCPs themselves were statically sized but that would limit tuning options and increase the per-cpu footprint of the pcp structures. 
Maybe I'm missing something obvious and it really would be a universal win but right now I find it hard to imagine that it's a win. > Another minor change I'd like to see is free_pcpages_bulk updating > pcp->count itself; all of the callers do it currently. > That should be reasonable, it's not even particularly difficult. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v2 1/2] free_pcppages_bulk: do not hold lock when picking pages to free 2018-02-15 12:46 ` Matthew Wilcox 2018-02-15 14:55 ` Mel Gorman @ 2018-02-23 1:42 ` Aaron Lu 1 sibling, 0 replies; 19+ messages in thread From: Aaron Lu @ 2018-02-23 1:42 UTC (permalink / raw) To: Matthew Wilcox Cc: Mel Gorman, linux-mm, linux-kernel, Andrew Morton, Huang Ying, Dave Hansen, Kemi Wang, Tim Chen, Andi Kleen, Michal Hocko, Vlastimil Babka On Thu, Feb 15, 2018 at 04:46:44AM -0800, Matthew Wilcox wrote: > On Thu, Jan 25, 2018 at 03:21:44PM +0800, Aaron Lu wrote: > > When freeing a batch of pages from Per-CPU-Pages(PCP) back to buddy, > > the zone->lock is held and then pages are chosen from PCP's migratetype > > list. While there is actually no need to do this 'choose part' under > > lock since it's PCP pages, the only CPU that can touch them is us and > > irq is also disabled. > > I have no objection to this patch. If you're looking for ideas for > future improvement though, I wonder whether using a LIST_HEAD is the > best way to store these pages temporarily. If you batch them into a > pagevec and then free the entire pagevec, the CPU should be a little > faster scanning a short array than walking a linked list. Thanks for the suggestion. > It would also puts a hard boundary on how long zone->lock is held, as > you'd drop it and go back for another batch after 15 pages. That might > be bad, of course. Yes, that's a concern. As Mel responded in another email, I think I'll just keep using a list here. > > Another minor change I'd like to see is free_pcpages_bulk updating > pcp->count itself; all of the callers do it currently. Sounds good, I'll prepare a separate patch for this, thanks! ^ permalink raw reply [flat|nested] 19+ messages in thread
* RFC: eliminate zone->lock contention for will-it-scale/page_fault1 on big server 2018-01-24 2:30 [PATCH 1/2] free_pcppages_bulk: do not hold lock when picking pages to free Aaron Lu 2018-01-24 2:30 ` [PATCH 2/2] free_pcppages_bulk: prefetch buddy while not holding lock Aaron Lu 2018-01-24 16:40 ` [PATCH 1/2] free_pcppages_bulk: do not hold lock when picking pages to free Mel Gorman @ 2018-02-05 5:30 ` Aaron Lu 2018-02-05 5:31 ` [RFC PATCH 1/2] __free_one_page: skip merge for order-0 page unless compaction is in progress Aaron Lu 2018-02-05 5:32 ` [RFC PATCH 2/2] rmqueue_bulk: avoid touching page structures under zone->lock Aaron Lu 2 siblings, 2 replies; 19+ messages in thread From: Aaron Lu @ 2018-02-05 5:30 UTC (permalink / raw) To: linux-mm, linux-kernel Cc: Andrew Morton, Huang Ying, Dave Hansen, Kemi Wang, Tim Chen, Andi Kleen, Michal Hocko, Vlastimil Babka, Mel Gorman, Daniel Jordan In addition to the two patches, there are two more patches that I would like to get some feedback on. The two patches are more radical: the 3rd deals with free path zone->lock contention by avoiding doing any merge for order0 pages while the 4th deals with allocation path zone->lock contention by taking pcp->batch pages off the free_area order0 list without the need to iterate the list. Both patches are developed based on the observation that "the most time consuming part of operations under zone->lock is cache misses on struct page". The 3rd patch may be controversial but doesn't have a correctness problem; the 4th is in an early stage and serves only as a proof-of-concept. Your comments are appreciated, thanks. ^ permalink raw reply [flat|nested] 19+ messages in thread
* [RFC PATCH 1/2] __free_one_page: skip merge for order-0 page unless compaction is in progress 2018-02-05 5:30 ` RFC: eliminate zone->lock contention for will-it-scale/page_fault1 on big server Aaron Lu @ 2018-02-05 5:31 ` Aaron Lu 2018-02-05 22:17 ` Dave Hansen 2018-02-05 5:32 ` [RFC PATCH 2/2] rmqueue_bulk: avoid touching page structures under zone->lock Aaron Lu 1 sibling, 1 reply; 19+ messages in thread From: Aaron Lu @ 2018-02-05 5:31 UTC (permalink / raw) To: linux-mm, linux-kernel Cc: Andrew Morton, Huang Ying, Dave Hansen, Kemi Wang, Tim Chen, Andi Kleen, Michal Hocko, Vlastimil Babka, Mel Gorman, Daniel Jordan Running the will-it-scale/page_fault1 process mode workload on a 2-socket Intel Skylake server showed severe contention on zone->lock: as high as about 80% (43% on the allocation path and 38% on the free path) of CPU cycles are burnt spinning. With perf, the most time consuming part inside that lock on the free path is cache misses on page structures, mostly on the to-be-freed page's buddy due to merging. One way to avoid this overhead is to not do any merging at all for order-0 pages and leave the need for high order pages to compaction. With this approach, zone->lock contention on the free path dropped to 4%, but the allocation side still has as high as 43% lock contention. In the meantime, the dropped lock contention on the free side doesn't translate into a performance increase; instead, it is consumed by increased contention on the per-node lru_lock (which rose from 2% to 33%). One concern with this approach is how much impact it has on high order page allocation, like for order-9 pages. I ran the stress-highalloc workload on a Haswell desktop (8 CPUs/4G memory) some time ago and it showed a similar success rate for the vanilla and patched kernels, both at 74%. Note that it did take longer to finish the test: 244s vs 218s. 
1 vanilla Attempted allocations: 1917 Failed allocs: 494 Success allocs: 1423 % Success: 74 Duration alloctest pass: 218s 2 no_merge_in_buddy Attempted allocations: 1917 Failed allocs: 497 Success allocs: 1420 % Success: 74 Duration alloctest pass: 244s The above test was done with the default --ms-delay=100, which means there is a delay of 100ms between page allocations. If I set the delay to 1ms, the success rate of this patch will drop to 36% while vanilla could maintain at about 70%. Though 1ms delay may not be practical, it indeed shows the possible impact of this patch on high order page allocation. The next patch deals with allocation path zone->lock contention. Suggested-by: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Aaron Lu <aaron.lu@intel.com> --- mm/compaction.c | 28 ++++++++++++++ mm/internal.h | 15 +++++++- mm/page_alloc.c | 116 ++++++++++++++++++++++++++++++++++++-------------------- 3 files changed, 117 insertions(+), 42 deletions(-) diff --git a/mm/compaction.c b/mm/compaction.c index 10cd757f1006..b53c4d420533 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -669,6 +669,28 @@ static bool too_many_isolated(struct zone *zone) return isolated > (inactive + active) / 2; } +static int merge_page(struct zone *zone, struct page *page, unsigned long pfn) +{ + int order = 0; + unsigned long buddy_pfn = __find_buddy_pfn(pfn, order); + struct page *buddy = page + (buddy_pfn - pfn); + + /* Only do merging if the merge skipped page's buddy is also free */ + if (PageBuddy(buddy)) { + int mt = get_pageblock_migratetype(page); + unsigned long flags; + + spin_lock_irqsave(&zone->lock, flags); + if (likely(page_merge_skipped(page))) { + do_merge(zone, page, mt); + order = page_order(page); + } + spin_unlock_irqrestore(&zone->lock, flags); + } + + return order; +} + /** * isolate_migratepages_block() - isolate all migrate-able pages within * a single pageblock @@ -777,6 +799,12 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long 
low_pfn, */ if (PageBuddy(page)) { unsigned long freepage_order = page_order_unsafe(page); + /* + * If the page didn't do merging on free time, now do + * it since we are doing compaction. + */ + if (page_merge_skipped(page)) + freepage_order = merge_page(zone, page, low_pfn); /* * Without lock, we cannot be sure that what we got is diff --git a/mm/internal.h b/mm/internal.h index e6bd35182dae..d2b0ac02d459 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -228,9 +228,22 @@ int find_suitable_fallback(struct free_area *area, unsigned int order, static inline unsigned int page_order(struct page *page) { /* PageBuddy() must be checked by the caller */ - return page_private(page); + return page_private(page) & ~(1 << 16); } +/* + * This function returns if the page is in buddy but didn't do any merging + * for performance reason. This function only makes sense if PageBuddy(page) + * is also true. The caller should hold zone->lock for this function to return + * correct value, or it can handle invalid values gracefully. + */ +static inline bool page_merge_skipped(struct page *page) +{ + return PageBuddy(page) && (page->private & (1 << 16)); +} + +void do_merge(struct zone *zone, struct page *page, int migratetype); + /* * Like page_order(), but for callers who cannot afford to hold the zone lock. * PageBuddy() should be checked first by the caller to minimize race window, diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 2ac7fa97dd55..9497c8c5f808 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -781,49 +781,14 @@ static inline int page_is_buddy(struct page *page, struct page *buddy, return 0; } -/* - * Freeing function for a buddy system allocator. - * - * The concept of a buddy system is to maintain direct-mapped table - * (containing bit values) for memory blocks of various "orders". 
- * The bottom level table contains the map for the smallest allocatable - * units of memory (here, pages), and each level above it describes - * pairs of units from the levels below, hence, "buddies". - * At a high level, all that happens here is marking the table entry - * at the bottom level available, and propagating the changes upward - * as necessary, plus some accounting needed to play nicely with other - * parts of the VM system. - * At each level, we keep a list of pages, which are heads of continuous - * free pages of length of (1 << order) and marked with _mapcount - * PAGE_BUDDY_MAPCOUNT_VALUE. Page's order is recorded in page_private(page) - * field. - * So when we are allocating or freeing one, we can derive the state of the - * other. That is, if we allocate a small block, and both were - * free, the remainder of the region must be split into blocks. - * If a block is freed, and its buddy is also free, then this - * triggers coalescing into a block of larger size. - * - * -- nyc - */ - -static inline void __free_one_page(struct page *page, - unsigned long pfn, - struct zone *zone, unsigned int order, - int migratetype) +static void inline __do_merge(struct page *page, unsigned int order, + struct zone *zone, int migratetype) { + unsigned int max_order = min_t(unsigned int, MAX_ORDER, pageblock_order + 1); + unsigned long pfn = page_to_pfn(page); unsigned long combined_pfn; unsigned long uninitialized_var(buddy_pfn); struct page *buddy; - unsigned int max_order; - - max_order = min_t(unsigned int, MAX_ORDER, pageblock_order + 1); - - VM_BUG_ON(!zone_is_initialized(zone)); - VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page); - - VM_BUG_ON(migratetype == -1); - if (likely(!is_migrate_isolate(migratetype))) - __mod_zone_freepage_state(zone, 1 << order, migratetype); VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page); VM_BUG_ON_PAGE(bad_range(zone, page), page); @@ -879,8 +844,6 @@ static inline void __free_one_page(struct page *page, } 
done_merging: - set_page_order(page, order); - /* * If this is not the largest possible page, check if the buddy * of the next-highest order is free. If it is, it's possible @@ -905,9 +868,80 @@ static inline void __free_one_page(struct page *page, list_add(&page->lru, &zone->free_area[order].free_list[migratetype]); out: + set_page_order(page, order); zone->free_area[order].nr_free++; } +void do_merge(struct zone *zone, struct page *page, int migratetype) +{ + VM_BUG_ON(page_order(page) != 0); + + list_del(&page->lru); + zone->free_area[0].nr_free--; + rmv_page_order(page); + + __do_merge(page, 0, zone, migratetype); +} + +static inline bool should_skip_merge(struct zone *zone, unsigned int order) +{ +#ifdef CONFIG_COMPACTION + return !zone->compact_considered && !order; +#else + return false; +#endif +} + +/* + * Freeing function for a buddy system allocator. + * + * The concept of a buddy system is to maintain direct-mapped table + * (containing bit values) for memory blocks of various "orders". + * The bottom level table contains the map for the smallest allocatable + * units of memory (here, pages), and each level above it describes + * pairs of units from the levels below, hence, "buddies". + * At a high level, all that happens here is marking the table entry + * at the bottom level available, and propagating the changes upward + * as necessary, plus some accounting needed to play nicely with other + * parts of the VM system. + * At each level, we keep a list of pages, which are heads of continuous + * free pages of length of (1 << order) and marked with _mapcount + * PAGE_BUDDY_MAPCOUNT_VALUE. Page's order is recorded in page_private(page) + * field. + * So when we are allocating or freeing one, we can derive the state of the + * other. That is, if we allocate a small block, and both were + * free, the remainder of the region must be split into blocks. 
+ * If a block is freed, and its buddy is also free, then this + * triggers coalescing into a block of larger size. + * + * -- nyc + */ +static inline void __free_one_page(struct page *page, + unsigned long pfn, + struct zone *zone, unsigned int order, + int migratetype) +{ + VM_BUG_ON(!zone_is_initialized(zone)); + VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page); + + VM_BUG_ON(migratetype == -1); + if (likely(!is_migrate_isolate(migratetype))) + __mod_zone_freepage_state(zone, 1 << order, migratetype); + + if (should_skip_merge(zone, order)) { + list_add(&page->lru, &zone->free_area[order].free_list[migratetype]); + /* + * 1 << 16 set on page->private to indicate this order0 + * page skipped merging during free time + */ + set_page_order(page, order | (1 << 16)); + zone->free_area[order].nr_free++; + return; + } + + __do_merge(page, order, zone, migratetype); +} + /* * A bad page could be due to a number of fields. Instead of multiple branches, * try and check multiple fields with one check. The caller must do a detailed -- 2.14.3 ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [RFC PATCH 1/2] __free_one_page: skip merge for order-0 page unless compaction is in progress 2018-02-05 5:31 ` [RFC PATCH 1/2] __free_one_page: skip merge for order-0 page unless compaction is in progress Aaron Lu @ 2018-02-05 22:17 ` Dave Hansen 0 siblings, 0 replies; 19+ messages in thread From: Dave Hansen @ 2018-02-05 22:17 UTC (permalink / raw) To: Aaron Lu, linux-mm, linux-kernel Cc: Andrew Morton, Huang Ying, Kemi Wang, Tim Chen, Andi Kleen, Michal Hocko, Vlastimil Babka, Mel Gorman, Daniel Jordan On 02/04/2018 09:31 PM, Aaron Lu wrote: > Running will-it-scale/page_fault1 process mode workload on a 2 sockets > Intel Skylake server showed severe lock contention of zone->lock, as > high as about 80%(43% on allocation path and 38% on free path) CPU > cycles are burnt spinning. With perf, the most time consuming part inside > that lock on free path is cache missing on page structures, mostly on > the to-be-freed page's buddy due to merging. > > One way to avoid this overhead is not do any merging at all for order-0 > pages and leave the need for high order pages to compaction. I think the RFC here is: we *know* this hurts high-order allocations and Aaron demonstrated that it does make the latency worse. But, unexpectedly, it didn't totally crater them. So, is the harm to large allocations worth the performance benefit afforded to smaller ones by this patch? How would we make a decision on something like that? If nothing else, this would make a nice companion topic to Daniel Jordan's "lru_lock scalability" proposal for LSF/MM. ^ permalink raw reply [flat|nested] 19+ messages in thread
* [RFC PATCH 2/2] rmqueue_bulk: avoid touching page structures under zone->lock 2018-02-05 5:30 ` RFC: eliminate zone->lock contention for will-it-scale/page_fault1 on big server Aaron Lu 2018-02-05 5:31 ` [RFC PATCH 1/2] __free_one_page: skip merge for order-0 page unless compaction is in progress Aaron Lu @ 2018-02-05 5:32 ` Aaron Lu 1 sibling, 0 replies; 19+ messages in thread From: Aaron Lu @ 2018-02-05 5:32 UTC (permalink / raw) To: linux-mm, linux-kernel Cc: Andrew Morton, Huang Ying, Dave Hansen, Kemi Wang, Tim Chen, Andi Kleen, Michal Hocko, Vlastimil Babka, Mel Gorman, Daniel Jordan Profiling on an Intel Skylake server shows the most time consuming part under zone->lock on the allocation path is accessing the to-be-returned pages' struct page in rmqueue_bulk() and its child functions. We do not really need to touch all those to-be-returned pages under zone->lock; we only need to move them off the order-0 free_list and adjust area->nr_free under the lock. Other operations on the page structure, like rmv_page_order(page) etc., could be done outside zone->lock. So if it's possible to know the 1st and the last page structure of the pcp->batch pages in the free_list, we can achieve the above without needing to touch all the page structures in between. The problem is, the free pages are linked in a doubly linked list, so we only know where the head and tail are, but not the Nth element in the list. Assume the order-0 mt=Movable free_list has 7 pages available:

head <-> p7 <-> p6 <-> p5 <-> p4 <-> p3 <-> p2 <-> p1

One experiment I have done here is to add a new list for it, say a cluster list, which links every pcp->batch-th page in the free_list. Take pcp->batch=3 as an example; we have:

free_list: head <-> p7 <-> p6 <-> p5 <-> p4 <-> p3 <-> p2 <-> p1
cluster_list: head <--------> p6 <---------------> p3

Let's call p6-p4 a cluster; similarly, p3-p1 is another cluster. 
Then every time rmqueue_bulk() is called to get 3 pages, we will iterate the cluster_list first. If the cluster list is not empty, we can quickly locate the first and last page, p6 and p4 in this case (p4 is retrieved by checking p6's next on the cluster_list and then checking p3's prev on the free_list). This way, we can reduce the need to touch all the page structures in between under zone->lock. Note: a common pcp->batch should be 31 since it is the default PCP batch number. With this change, on a 2-socket Skylake server under the will-it-scale/page_fault1 full load test, zone->lock contention is gone, lru_lock contention rose to 70% and performance increased by 16.7% compared to vanilla. There are some fundamental problems with this patch though:
1 When compaction occurs, the number of pages in a cluster could be less than predefined; this makes "1 cluster can satisfy the request" not true any more. For this reason, the patch currently requires no compaction to happen;
2 When new pages are freed to the order-0 free_list, they could merge with their buddies and that would also leave fewer pages in a cluster. Thus, no merging for order-0 is required for this patch to work;
3 Similarly, when fallback allocation happens, the same problem could happen again.
Considering the above listed problems, this patch can only serve as a POC that cache misses are the most time consuming operation on big servers. Your comments on a possible way to overcome them are greatly appreciated. 
Suggested-by: Ying Huang <ying.huang@intel.com> Signed-off-by: Aaron Lu <aaron.lu@intel.com> --- include/linux/mm_types.h | 43 ++++++++++-------- include/linux/mmzone.h | 7 +++ mm/page_alloc.c | 114 ++++++++++++++++++++++++++++++++++++++++++++--- 3 files changed, 141 insertions(+), 23 deletions(-) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index cfd0ac4e5e0e..e7aee48a224a 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -43,26 +43,33 @@ struct page { /* First double word block */ unsigned long flags; /* Atomic flags, some possibly * updated asynchronously */ - union { - struct address_space *mapping; /* If low bit clear, points to - * inode address_space, or NULL. - * If page mapped as anonymous - * memory, low bit is set, and - * it points to anon_vma object - * or KSM private structure. See - * PAGE_MAPPING_ANON and - * PAGE_MAPPING_KSM. - */ - void *s_mem; /* slab first object */ - atomic_t compound_mapcount; /* first tail page */ - /* page_deferred_list().next -- second tail page */ - }; - /* Second double word */ union { - pgoff_t index; /* Our offset within mapping. */ - void *freelist; /* sl[aou]b first free object */ - /* page_deferred_list().prev -- second tail page */ + struct { + union { + struct address_space *mapping; /* If low bit clear, points to + * inode address_space, or NULL. + * If page mapped as anonymous + * memory, low bit is set, and + * it points to anon_vma object + * or KSM private structure. See + * PAGE_MAPPING_ANON and + * PAGE_MAPPING_KSM. + */ + void *s_mem; /* slab first object */ + atomic_t compound_mapcount; /* first tail page */ + /* page_deferred_list().next -- second tail page */ + }; + + /* Second double word */ + union { + pgoff_t index; /* Our offset within mapping. 
*/ + void *freelist; /* sl[aou]b first free object */ + /* page_deferred_list().prev -- second tail page */ + }; + }; + + struct list_head cluster; }; union { diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 67f2e3c38939..3f1451213184 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -355,6 +355,12 @@ enum zone_type { #ifndef __GENERATING_BOUNDS_H +struct order0_cluster { + struct list_head list[MIGRATE_PCPTYPES]; + unsigned long offset[MIGRATE_PCPTYPES]; + int batch; +}; + struct zone { /* Read-mostly fields */ @@ -459,6 +465,7 @@ struct zone { /* free areas of different sizes */ struct free_area free_area[MAX_ORDER]; + struct order0_cluster order0_cluster; /* zone flags, see below */ unsigned long flags; diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 9497c8c5f808..3eaafe597a66 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -736,6 +736,7 @@ static inline void rmv_page_order(struct page *page) { __ClearPageBuddy(page); set_page_private(page, 0); + BUG_ON(page->cluster.next); } /* @@ -793,6 +794,9 @@ static void inline __do_merge(struct page *page, unsigned int order, VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page); VM_BUG_ON_PAGE(bad_range(zone, page), page); + /* order0 merge doesn't work yet */ + BUG_ON(!order); + continue_merging: while (order < max_order - 1) { buddy_pfn = __find_buddy_pfn(pfn, order); @@ -883,6 +887,19 @@ void do_merge(struct zone *zone, struct page *page, int migratetype) __do_merge(page, 0, zone, migratetype); } +static inline void add_to_order0_free_list(struct page *page, struct zone *zone, int mt) +{ + struct order0_cluster *cluster = &zone->order0_cluster; + + list_add(&page->lru, &zone->free_area[0].free_list[mt]); + + /* If this is the pcp->batch(th) page, link it to the cluster list */ + if (mt < MIGRATE_PCPTYPES && !(++cluster->offset[mt] % cluster->batch)) { + list_add(&page->cluster, &cluster->list[mt]); + cluster->offset[mt] = 0; + } +} + static inline bool should_skip_merge(struct 
zone *zone, unsigned int order) { #ifdef CONFIG_COMPACTION @@ -929,7 +946,7 @@ static inline void __free_one_page(struct page *page, __mod_zone_freepage_state(zone, 1 << order, migratetype); if (should_skip_merge(zone, order)) { - list_add(&page->lru, &zone->free_area[order].free_list[migratetype]); + add_to_order0_free_list(page, zone, migratetype); /* * 1 << 16 set on page->private to indicate this order0 * page skipped merging during free time @@ -1732,7 +1749,10 @@ static inline void expand(struct zone *zone, struct page *page, if (set_page_guard(zone, &page[size], high, migratetype)) continue; - list_add(&page[size].lru, &area->free_list[migratetype]); + if (high) + list_add(&page[size].lru, &area->free_list[migratetype]); + else + add_to_order0_free_list(&page[size], zone, migratetype); area->nr_free++; set_page_order(&page[size], high); } @@ -1881,6 +1901,11 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order, list_del(&page->lru); rmv_page_order(page); area->nr_free--; + if (!current_order && migratetype < MIGRATE_PCPTYPES) { + BUG_ON(!zone->order0_cluster.offset[migratetype]); + BUG_ON(page->cluster.next); + zone->order0_cluster.offset[migratetype]--; + } expand(zone, page, order, current_order, area, migratetype); set_pcppage_migratetype(page, migratetype); return page; @@ -1968,8 +1993,13 @@ static int move_freepages(struct zone *zone, } order = page_order(page); - list_move(&page->lru, + if (order) { + list_move(&page->lru, &zone->free_area[order].free_list[migratetype]); + } else { + __list_del_entry(&page->lru); + add_to_order0_free_list(page, zone, migratetype); + } page += 1 << order; pages_moved += 1 << order; } @@ -2118,7 +2148,12 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page, single_page: area = &zone->free_area[current_order]; - list_move(&page->lru, &area->free_list[start_type]); + if (current_order) + list_move(&page->lru, &area->free_list[start_type]); + else { + __list_del_entry(&page->lru); 
+ add_to_order0_free_list(page, zone, start_type); + } } /* @@ -2379,6 +2414,45 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype) return page; } +static noinline int rmqueue_bulk_cluster(struct zone *zone, unsigned int order, + unsigned long count, struct list_head *list, + int migratetype) +{ + struct list_head *cluster_head; + struct page *head, *tail; + + cluster_head = &zone->order0_cluster.list[migratetype]; + head = list_first_entry_or_null(cluster_head, struct page, cluster); + if (!head) + return 0; + + if (head->cluster.next == cluster_head) + tail = list_last_entry(&zone->free_area[0].free_list[migratetype], struct page, lru); + else { + struct page *tmp = list_entry(head->cluster.next, struct page, cluster); + tail = list_entry(tmp->lru.prev, struct page, lru); + } + + zone->free_area[0].nr_free -= count; + + /* Remove the page from the cluster list */ + list_del(&head->cluster); + /* Restore the two page fields */ + head->cluster.next = head->cluster.prev = NULL; + + /* Take the pcp->batch pages off free_area list */ + tail->lru.next->prev = head->lru.prev; + head->lru.prev->next = tail->lru.next; + + /* Attach them to list */ + head->lru.prev = list; + list->next = &head->lru; + tail->lru.next = list; + list->prev = &tail->lru; + + return 1; +} + /* * Obtain a specified number of elements from the buddy allocator, all under * a single hold of the lock, for efficiency. Add them to the supplied list. 
@@ -2391,6 +2465,28 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order, int i, alloced = 0; spin_lock(&zone->lock); + if (count == zone->order0_cluster.batch && + rmqueue_bulk_cluster(zone, order, count, list, migratetype)) { + struct page *page, *tmp; + spin_unlock(&zone->lock); + + i = alloced = count; + list_for_each_entry_safe(page, tmp, list, lru) { + rmv_page_order(page); + set_pcppage_migratetype(page, migratetype); + + if (unlikely(check_pcp_refill(page))) { + list_del(&page->lru); + alloced--; + continue; + } + if (is_migrate_cma(get_pcppage_migratetype(page))) + __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, + -(1 << order)); + } + goto done_alloc; + } + for (i = 0; i < count; ++i) { struct page *page = __rmqueue(zone, order, migratetype); if (unlikely(page == NULL)) @@ -2415,7 +2511,9 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order, __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, -(1 << order)); } + spin_unlock(&zone->lock); +done_alloc: /* * i pages were removed from the buddy list even if some leak due * to check_pcp_refill failing so adjust NR_FREE_PAGES based @@ -2423,7 +2521,6 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order, * pages added to the pcp list. */ __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order)); - spin_unlock(&zone->lock); return alloced; } @@ -5451,6 +5548,10 @@ static void __meminit zone_init_free_lists(struct zone *zone) for_each_migratetype_order(order, t) { INIT_LIST_HEAD(&zone->free_area[order].free_list[t]); zone->free_area[order].nr_free = 0; + if (!order && t < MIGRATE_PCPTYPES) { + INIT_LIST_HEAD(&zone->order0_cluster.list[t]); + zone->order0_cluster.offset[t] = 0; + } } } @@ -5488,6 +5589,9 @@ static int zone_batchsize(struct zone *zone) * and the other with pages of the other colors. 
*/ batch = rounddown_pow_of_two(batch + batch/2) - 1; + if (batch < 1) + batch = 1; + zone->order0_cluster.batch = batch; return batch; -- 2.14.3 ^ permalink raw reply related [flat|nested] 19+ messages in thread
end of thread, other threads:[~2018-02-23 1:41 UTC | newest] Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2018-01-24 2:30 [PATCH 1/2] free_pcppages_bulk: do not hold lock when picking pages to free Aaron Lu 2018-01-24 2:30 ` [PATCH 2/2] free_pcppages_bulk: prefetch buddy while not holding lock Aaron Lu 2018-01-24 16:43 ` Mel Gorman 2018-01-24 16:57 ` Dave Hansen 2018-01-24 18:19 ` Mel Gorman 2018-01-24 19:23 ` Dave Hansen 2018-01-24 21:12 ` Mel Gorman 2018-01-25 7:25 ` [PATCH v2 " Aaron Lu 2018-01-24 16:40 ` [PATCH 1/2] free_pcppages_bulk: do not hold lock when picking pages to free Mel Gorman 2018-01-25 7:21 ` [PATCH v2 " Aaron Lu 2018-02-15 12:06 ` Mel Gorman 2018-02-23 1:37 ` Aaron Lu 2018-02-15 12:46 ` Matthew Wilcox 2018-02-15 14:55 ` Mel Gorman 2018-02-23 1:42 ` Aaron Lu 2018-02-05 5:30 ` RFC: eliminate zone->lock contention for will-it-scale/page_fault1 on big server Aaron Lu 2018-02-05 5:31 ` [RFC PATCH 1/2] __free_one_page: skip merge for order-0 page unless compaction is in progress Aaron Lu 2018-02-05 22:17 ` Dave Hansen 2018-02-05 5:32 ` [RFC PATCH 2/2] rmqueue_bulk: avoid touching page structures under zone->lock Aaron Lu