From: "Figo.zhang"
Date: Tue, 20 Mar 2018 15:29:33 -0700
Subject: Re: [RFC PATCH v2 3/4] mm/rmqueue_bulk: alloc without touching individual page structure
In-Reply-To: <20180320085452.24641-4-aaron.lu@intel.com>
To: Aaron Lu
Cc: Linux MM, LKML, Andrew Morton, Huang Ying, Dave Hansen, Kemi Wang, Tim Chen, Andi Kleen, Michal Hocko, Vlastimil Babka, Mel Gorman, Matthew Wilcox, Daniel Jordan

2018-03-20 1:54 GMT-07:00 Aaron Lu <aaron.lu@intel.com>:
> Profile on Intel Skylake server shows the most time consuming part
> under zone->lock on allocation path is accessing those to-be-returned
> page's "struct page" on the free_list inside zone->lock. One explanation
> is, different CPUs are releasing pages to the head of free_list and
> those page's 'struct page' may very well be cache cold for the allocating
> CPU when it grabs these pages from free_list' head. The purpose here
> is to avoid touching these pages one by one inside zone->lock.
>
> One idea is, we just take the requested number of pages off free_list
> with something like list_cut_position() and then adjust nr_free of
> free_area accordingly inside zone->lock and other operations like
> clearing PageBuddy flag for these pages are done outside of zone->lock.
>

Sounds good!

So your idea is to reduce the lock contention in rmqueue_bulk() by
splitting the order-0 freelist into two lists, one that can be handled
without zone->lock and one that still needs zone->lock?

It also seems that holding zone->lock across the whole loop in
rmqueue_bulk() is a rather coarse lock granularity; why not change it
like this instead?

static int rmqueue_bulk(struct zone *zone, unsigned int order,
			unsigned long count, struct list_head *list,
			int migratetype, bool cold)
{
	int i;

	for (i = 0; i < count; ++i) {
		struct page *page;

		spin_lock(&zone->lock);
		page = __rmqueue(zone, order, migratetype);
		spin_unlock(&zone->lock);
		...
	}

	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));

	return i;
}

>
> list_cut_position() needs to know where to cut, that's what the new
> 'struct cluster' meant to provide. All pages on order 0's free_list
> belongs to a cluster so when a number of pages is needed, the cluster
> to which head page of free_list belongs is checked and then tail page
> of the cluster could be found. With tail page, list_cut_position() can
> be used to drop the cluster off free_list. The 'struct cluster' also has
> 'nr' to tell how many pages this cluster has so nr_free of free_area can
> be adjusted inside the lock too.
>
> This caused a race window though: from the moment zone->lock is dropped
> till these pages' PageBuddy flags get cleared, these pages are not in
> buddy but still have PageBuddy flag set.
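
If I read the description right, the allocation side then looks roughly
like the sketch below. This is only my own illustration of the text above,
not the real patch code (the function name take_order0_cluster() and the
locals are made up; the actual implementation is further down in the patch):

static int take_order0_cluster(struct zone *zone, int mt,
			       struct list_head *list)
{
	struct list_head *head = &zone->free_area[0].free_list[mt];
	struct list_head tmp;
	struct page *page, *first;
	struct cluster *c;
	int nr;

	spin_lock(&zone->lock);
	first = list_first_entry_or_null(head, struct page, lru);
	if (!first || !first->cluster) {
		spin_unlock(&zone->lock);
		return 0;
	}

	/* cut the whole cluster off free_list in one operation */
	c = first->cluster;
	nr = c->nr;
	list_cut_position(&tmp, head, &c->tail->lru);
	list_splice_tail(&tmp, list);
	zone->free_area[0].nr_free -= nr;
	spin_unlock(&zone->lock);

	/*
	 * The pages still have PageBuddy set here although they are no
	 * longer on free_list -- this is the race window the changelog
	 * talks about.
	 */
	list_for_each_entry(page, list, lru)
		rmv_page_order(page);

	return nr;
}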
>
> This doesn't cause problems for users that access buddy pages through
> free_list. But there are other users, like move_freepages() which is
> used to move a pageblock pages from one migratetype to another in
> fallback allocation path, will test PageBuddy flag of a page derived
> from PFN. The end result could be that for pages in the race window,
> they are moved back to free_list of another migratetype. For this
> reason, a synchronization function zone_wait_cluster_alloc() is
> introduced to wait till all pages are in correct state. This function
> is meant to be called with zone->lock held, so after this function
> returns, we do not need to worry about new pages becoming racy state.
>
> Another user is compaction, where it will scan a pageblock for
> migratable candidates. In this process, pages derived from PFN will
> be checked for PageBuddy flag to decide if it is a merge skipped page.
> To avoid a racy page getting merged back into buddy, the
> zone_wait_and_disable_cluster_alloc() function is introduced to:
> 1 disable clustered allocation by increasing zone->cluster.disable_depth;
> 2 wait till the race window pass by calling zone_wait_cluster_alloc().
> This function is also meant to be called with zone->lock held so after
> it returns, all pages are in correct state and no more cluster alloc
> will be attempted till zone_enable_cluster_alloc() is called to decrease
> zone->cluster.disable_depth.
>
> The two patches could eliminate zone->lock contention entirely but at
> the same time, pgdat->lru_lock contention rose to 82%. Final performance
> increased about 8.3%.
>
> Suggested-by: Ying Huang <ying.huang@intel.com>
> Signed-off-by: Aaron Lu <aaron.lu@intel.com>
> ---
>  Documentation/vm/struct_page_field |   5 +
>  include/linux/mm_types.h           |   2 +
>  include/linux/mmzone.h             |  35 +++++
>  mm/compaction.c                    |   4 +
>  mm/internal.h                      |  34 +++++
>  mm/page_alloc.c                    | 288 +++++++++++++++++++++++++++++++++--
>  6 files changed, 360 insertions(+), 8 deletions(-)
>
> diff --git a/Documentation/vm/struct_page_field b/Documentation/vm/struct_page_field
> index 1ab6c19ccc7a..bab738ea4e0a 100644
> --- a/Documentation/vm/struct_page_field
> +++ b/Documentation/vm/struct_page_field
> @@ -3,3 +3,8 @@ Used to indicate this page skipped merging when added to buddy. This
>  field only makes sense if the page is in Buddy and is order zero.
>  It's a bug if any higher order pages in Buddy has this field set.
>  Shares space with index.
> +
> +cluster:
> +Order 0 Buddy pages are grouped in cluster on free_list to speed up
> +allocation. This field stores the cluster pointer for them.
> +Shares space with mapping.
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > index 7edc4e102a8e..49fe9d755a7c 100644 > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -84,6 +84,8 @@ struct page { > void *s_mem; /* slab first object */ > atomic_t compound_mapcount; /* first tail page */ > /* page_deferred_list().next -- second tail page */ > + > + struct cluster *cluster; /* order 0 cluster this > page belongs to */ > }; > > /* Second double word */ > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index 7522a6987595..09ba9d3cc385 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -355,6 +355,40 @@ enum zone_type { > > #ifndef __GENERATING_BOUNDS_H > > +struct cluster { > + struct page *tail; /* tail page of the cluster */ > + int nr; /* how many pages are in this cluster */ > +}; > + > +struct order0_cluster { > + /* order 0 cluster array, dynamically allocated */ > + struct cluster *array; > + /* > + * order 0 cluster array length, also used to indicate if cluster > + * allocation is enabled for this zone(cluster allocation is > disabled > + * for small zones whose batch size is smaller than 1, like DMA > zone) > + */ > + int len; > + /* > + * smallest position from where we search for an > + * empty cluster from the cluster array > + */ > + int zero_bit; > + /* bitmap used to quickly locate an empty cluster from cluster > array */ > + unsigned long *bitmap; > + > + /* disable cluster allocation to avoid new pages becoming racy > state. */ > + unsigned long disable_depth; > + > + /* > + * used to indicate if there are pages allocated in cluster mode > + * still in racy state. Caller with zone->lock held could use > helper > + * function zone_wait_cluster_alloc() to wait all such pages to > exit > + * the race window. > + */ > + atomic_t in_progress; > +}; > + > struct zone { > /* Read-mostly fields */ > > @@ -459,6 +493,7 @@ struct zone { > > /* free areas of different sizes */ > struct free_area free_area[MAX_ORDER]; > + struct order0_cluster cluster; > > /* zone flags, see below */ > unsigned long flags; > diff --git a/mm/compaction.c b/mm/compaction.c > index fb9031fdca41..e71fa82786a1 100644 > --- a/mm/compaction.c > +++ b/mm/compaction.c > @@ -1601,6 +1601,8 @@ static enum compact_result compact_zone(struct zone > *zone, struct compact_contro > > migrate_prep_local(); > > + zone_wait_and_disable_cluster_alloc(zone); > + > while ((ret = compact_finished(zone, cc)) == COMPACT_CONTINUE) { > int err; > > @@ -1699,6 +1701,8 @@ static enum compact_result compact_zone(struct zone > *zone, struct compact_contro > zone->compact_cached_free_pfn = free_pfn; > } > > + zone_enable_cluster_alloc(zone); > + > count_compact_events(COMPACTMIGRATE_SCANNED, > cc->total_migrate_scanned); > count_compact_events(COMPACTFREE_SCANNED, cc->total_free_scanned); > > diff --git a/mm/internal.h b/mm/internal.h > index 2bfbaae2d835..1b0535af1b49 100644 > --- a/mm/internal.h > +++ b/mm/internal.h > @@ -557,12 +557,46 @@ static inline bool can_skip_merge(struct zone *zone, > int order) > if (order) > return false; > > + /* > + * Clustered allocation is only disabled when high-order pages > + * are needed, e.g. in compaction and CMA alloc, so we should > + * also skip merging in that case. 
> + */ > + if (zone->cluster.disable_depth) > + return false; > + > return true; > } > + > +static inline void zone_wait_cluster_alloc(struct zone *zone) > +{ > + while (atomic_read(&zone->cluster.in_progress)) > + cpu_relax(); > +} > + > +static inline void zone_wait_and_disable_cluster_alloc(struct zone *zone) > +{ > + unsigned long flags; > + spin_lock_irqsave(&zone->lock, flags); > + zone->cluster.disable_depth++; > + zone_wait_cluster_alloc(zone); > + spin_unlock_irqrestore(&zone->lock, flags); > +} > + > +static inline void zone_enable_cluster_alloc(struct zone *zone) > +{ > + unsigned long flags; > + spin_lock_irqsave(&zone->lock, flags); > + zone->cluster.disable_depth--; > + spin_unlock_irqrestore(&zone->lock, flags); > +} > #else /* CONFIG_COMPACTION */ > static inline bool can_skip_merge(struct zone *zone, int order) > { > return false; > } > +static inline void zone_wait_cluster_alloc(struct zone *zone) {} > +static inline void zone_wait_and_disable_cluster_alloc(struct zone > *zone) {} > +static inline void zone_enable_cluster_alloc(struct zone *zone) {} > #endif /* CONFIG_COMPACTION */ > #endif /* __MM_INTERNAL_H */ > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index eb78014dfbde..ac93833a2877 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -746,6 +746,82 @@ static inline void set_page_order(struct page *page, > unsigned int order) > __SetPageBuddy(page); > } > > +static inline struct cluster *new_cluster(struct zone *zone, int nr, > + struct page *tail) > +{ > + struct order0_cluster *cluster = &zone->cluster; > + int n = find_next_zero_bit(cluster->bitmap, cluster->len, > cluster->zero_bit); > + if (n == cluster->len) { > + printk_ratelimited("node%d zone %s cluster used up\n", > + zone->zone_pgdat->node_id, zone->name); > + return NULL; > + } > + cluster->zero_bit = n; > + set_bit(n, cluster->bitmap); > + cluster->array[n].nr = nr; > + cluster->array[n].tail = tail; > + return &cluster->array[n]; > +} > + > +static inline struct cluster *add_to_cluster_common(struct page *page, > + struct zone *zone, struct page *neighbor) > +{ > + struct cluster *c; > + > + if (neighbor) { > + int batch = this_cpu_ptr(zone->pageset)->pcp.batch; > + c = neighbor->cluster; > + if (c && c->nr < batch) { > + page->cluster = c; > + c->nr++; > + return c; > + } > + } > + > + c = new_cluster(zone, 1, page); > + if (unlikely(!c)) > + return NULL; > + > + page->cluster = c; > + return c; > +} > + > +/* > + * Add this page to the cluster where the previous head page belongs. > + * Called after page is added to free_list(and becoming the new head). > + */ > +static inline void add_to_cluster_head(struct page *page, struct zone > *zone, > + int order, int mt) > +{ > + struct page *neighbor; > + > + if (order || !zone->cluster.len) > + return; > + > + neighbor = page->lru.next == &zone->free_area[0].free_list[mt] ? > + NULL : list_entry(page->lru.next, struct page, lru); > + add_to_cluster_common(page, zone, neighbor); > +} > + > +/* > + * Add this page to the cluster where the previous tail page belongs. > + * Called after page is added to free_list(and becoming the new tail). > + */ > +static inline void add_to_cluster_tail(struct page *page, struct zone > *zone, > + int order, int mt) > +{ > + struct page *neighbor; > + struct cluster *c; > + > + if (order || !zone->cluster.len) > + return; > + > + neighbor = page->lru.prev == &zone->free_area[0].free_list[mt] ? 
> + NULL : list_entry(page->lru.prev, struct page, lru); > + c = add_to_cluster_common(page, zone, neighbor); > + c->tail = page; > +} > + > static inline void add_to_buddy_common(struct page *page, struct zone > *zone, > unsigned int order, int mt) > { > @@ -765,6 +841,7 @@ static inline void add_to_buddy_head(struct page > *page, struct zone *zone, > { > add_to_buddy_common(page, zone, order, mt); > list_add(&page->lru, &zone->free_area[order].free_list[mt]); > + add_to_cluster_head(page, zone, order, mt); > } > > static inline void add_to_buddy_tail(struct page *page, struct zone *zone, > @@ -772,6 +849,7 @@ static inline void add_to_buddy_tail(struct page > *page, struct zone *zone, > { > add_to_buddy_common(page, zone, order, mt); > list_add_tail(&page->lru, &zone->free_area[order].free_list[mt]); > + add_to_cluster_tail(page, zone, order, mt); > } > > static inline void rmv_page_order(struct page *page) > @@ -780,9 +858,29 @@ static inline void rmv_page_order(struct page *page) > set_page_private(page, 0); > } > > +/* called before removed from free_list */ > +static inline void remove_from_cluster(struct page *page, struct zone > *zone) > +{ > + struct cluster *c = page->cluster; > + if (!c) > + return; > + > + page->cluster = NULL; > + c->nr--; > + if (!c->nr) { > + int bit = c - zone->cluster.array; > + c->tail = NULL; > + clear_bit(bit, zone->cluster.bitmap); > + if (bit < zone->cluster.zero_bit) > + zone->cluster.zero_bit = bit; > + } else if (page == c->tail) > + c->tail = list_entry(page->lru.prev, struct page, lru); > +} > + > static inline void remove_from_buddy(struct page *page, struct zone *zone, > unsigned int order) > { > + remove_from_cluster(page, zone); > list_del(&page->lru); > zone->free_area[order].nr_free--; > rmv_page_order(page); > @@ -2025,6 +2123,17 @@ static int move_freepages(struct zone *zone, > if (num_movable) > *num_movable = 0; > > + /* > + * Cluster alloced pages may have their PageBuddy flag unclear yet > + * after dropping zone->lock in rmqueue_bulk() and steal here could > + * move them back to free_list. So it's necessary to wait till all > + * those pages have their flags properly cleared. > + * > + * We do not need to disable cluster alloc though since we already > + * held zone->lock and no allocation could happen. 
> + */ > + zone_wait_cluster_alloc(zone); > + > for (page = start_page; page <= end_page;) { > if (!pfn_valid_within(page_to_pfn(page))) { > page++; > @@ -2049,8 +2158,10 @@ static int move_freepages(struct zone *zone, > } > > order = page_order(page); > + remove_from_cluster(page, zone); > list_move(&page->lru, > &zone->free_area[order].free_list[migratetype]); > + add_to_cluster_head(page, zone, order, migratetype); > page += 1 << order; > pages_moved += 1 << order; > } > @@ -2199,7 +2310,9 @@ static void steal_suitable_fallback(struct zone > *zone, struct page *page, > > single_page: > area = &zone->free_area[current_order]; > + remove_from_cluster(page, zone); > list_move(&page->lru, &area->free_list[start_type]); > + add_to_cluster_head(page, zone, current_order, start_type); > } > > /* > @@ -2460,6 +2573,145 @@ __rmqueue(struct zone *zone, unsigned int order, > int migratetype) > return page; > } > > +static int __init zone_order0_cluster_init(void) > +{ > + struct zone *zone; > + > + for_each_zone(zone) { > + int len, mt, batch; > + unsigned long flags; > + struct order0_cluster *cluster; > + > + if (!managed_zone(zone)) > + continue; > + > + /* no need to enable cluster allocation for batch<=1 zone > */ > + preempt_disable(); > + batch = this_cpu_ptr(zone->pageset)->pcp.batch; > + preempt_enable(); > + if (batch <= 1) > + continue; > + > + cluster = &zone->cluster; > + /* FIXME: possible overflow of int type */ > + len = DIV_ROUND_UP(zone->managed_pages, batch); > + cluster->array = vzalloc(len * sizeof(struct cluster)); > + if (!cluster->array) > + return -ENOMEM; > + cluster->bitmap = vzalloc(DIV_ROUND_UP(len, BITS_PER_LONG) > * > + sizeof(unsigned long)); > + if (!cluster->bitmap) > + return -ENOMEM; > + > + spin_lock_irqsave(&zone->lock, flags); > + cluster->len = len; > + for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) { > + struct page *page; > + list_for_each_entry_reverse(page, > + &zone->free_area[0].free_list[mt], > lru) > + add_to_cluster_head(page, zone, 0, mt); > + } > + spin_unlock_irqrestore(&zone->lock, flags); > + } > + > + return 0; > +} > +subsys_initcall(zone_order0_cluster_init); > + > +static inline int __rmqueue_bulk_cluster(struct zone *zone, unsigned long > count, > + struct list_head *list, > int mt) > +{ > + struct list_head *head = &zone->free_area[0].free_list[mt]; > + int nr = 0; > + > + while (nr < count) { > + struct page *head_page; > + struct list_head *tail, tmp_list; > + struct cluster *c; > + int bit; > + > + head_page = list_first_entry_or_null(head, struct page, > lru); > + if (!head_page || !head_page->cluster) > + break; > + > + c = head_page->cluster; > + tail = &c->tail->lru; > + > + /* drop the cluster off free_list and attach to list */ > + list_cut_position(&tmp_list, head, tail); > + list_splice_tail(&tmp_list, list); > + > + nr += c->nr; > + zone->free_area[0].nr_free -= c->nr; > + > + /* this cluster is empty now */ > + c->tail = NULL; > + c->nr = 0; > + bit = c - zone->cluster.array; > + clear_bit(bit, zone->cluster.bitmap); > + if (bit < zone->cluster.zero_bit) > + zone->cluster.zero_bit = bit; > + } > + > + return nr; > +} > + > +static inline int rmqueue_bulk_cluster(struct zone *zone, unsigned int > order, > + unsigned long count, struct list_head > *list, > + int migratetype) > +{ > + int alloced; > + struct page *page; > + > + /* > + * Cluster alloc races with merging so don't try cluster alloc > when we > + * can't skip merging. 
Note that can_skip_merge() keeps the same > return > + * value from here till all pages have their flags properly > processed, > + * i.e. the end of the function where in_progress is incremented, > even > + * we have dropped the lock in the middle because the only place > that > + * can change can_skip_merge()'s return value is compaction code > and > + * compaction needs to wait on in_progress. > + */ > + if (!can_skip_merge(zone, 0)) > + return 0; > + > + /* Cluster alloc is disabled, mostly compaction is already in > progress */ > + if (zone->cluster.disable_depth) > + return 0; > + > + /* Cluster alloc is disabled for this zone */ > + if (unlikely(!zone->cluster.len)) > + return 0; > + > + alloced = __rmqueue_bulk_cluster(zone, count, list, migratetype); > + if (!alloced) > + return 0; > + > + /* > + * Cache miss on page structure could slow things down > + * dramatically so accessing these alloced pages without > + * holding lock for better performance. > + * > + * Since these pages still have PageBuddy set, there is a race > + * window between now and when PageBuddy is cleared for them > + * below. Any operation that would scan a pageblock and check > + * PageBuddy(page), e.g. compaction, will need to wait till all > + * such pages are properly processed. in_progress is used for > + * such purpose so increase it now before dropping the lock. > + */ > + atomic_inc(&zone->cluster.in_progress); > + spin_unlock(&zone->lock); > + > + list_for_each_entry(page, list, lru) { > + rmv_page_order(page); > + page->cluster = NULL; > + set_pcppage_migratetype(page, migratetype); > + } > + atomic_dec(&zone->cluster.in_progress); > + > + return alloced; > +} > + > /* > * Obtain a specified number of elements from the buddy allocator, all > under > * a single hold of the lock, for efficiency. Add them to the supplied > list. > @@ -2469,17 +2721,23 @@ static int rmqueue_bulk(struct zone *zone, > unsigned int order, > unsigned long count, struct list_head *list, > int migratetype) > { > - int i, alloced = 0; > + int i, alloced; > + struct page *page, *tmp; > > spin_lock(&zone->lock); > - for (i = 0; i < count; ++i) { > - struct page *page = __rmqueue(zone, order, migratetype); > + alloced = rmqueue_bulk_cluster(zone, order, count, list, > migratetype); > + if (alloced > 0) { > + if (alloced >= count) > + goto out; > + else > + spin_lock(&zone->lock); > + } > + > + for (; alloced < count; alloced++) { > + page = __rmqueue(zone, order, migratetype); > if (unlikely(page == NULL)) > break; > > - if (unlikely(check_pcp_refill(page))) > - continue; > - > /* > * Split buddy pages returned by expand() are received > here in > * physical page order. The page is added to the tail of > @@ -2491,7 +2749,18 @@ static int rmqueue_bulk(struct zone *zone, unsigned > int order, > * pages are ordered properly. > */ > list_add_tail(&page->lru, list); > - alloced++; > + } > + spin_unlock(&zone->lock); > + > +out: > + i = alloced; > + list_for_each_entry_safe(page, tmp, list, lru) { > + if (unlikely(check_pcp_refill(page))) { > + list_del(&page->lru); > + alloced--; > + continue; > + } > + > if (is_migrate_cma(get_pcppage_migratetype(page))) > __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, > -(1 << order)); > @@ -2504,7 +2773,6 @@ static int rmqueue_bulk(struct zone *zone, unsigned > int order, > * pages added to the pcp list. 
>          */
>         __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
> -       spin_unlock(&zone->lock);
>         return alloced;
>  }
>
> @@ -7744,6 +8012,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>         unsigned long outer_start, outer_end;
>         unsigned int order;
>         int ret = 0;
> +       struct zone *zone = page_zone(pfn_to_page(start));
>
>         struct compact_control cc = {
>                 .nr_migratepages = 0,
> @@ -7786,6 +8055,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>         if (ret)
>                 return ret;
>
> +       zone_wait_and_disable_cluster_alloc(zone);
>         /*
>          * In case of -EBUSY, we'd like to know which page causes problem.
>          * So, just fall through. test_pages_isolated() has a tracepoint
> @@ -7868,6 +8138,8 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>  done:
>         undo_isolate_page_range(pfn_max_align_down(start),
>                                 pfn_max_align_up(end), migratetype);
> +
> +       zone_enable_cluster_alloc(zone);
>         return ret;
>  }
>
> --
> 2.14.3
>
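
To make sure I follow the synchronization story, here is the protocol
condensed from the changelog and the helpers in the patch above (nothing
new, just a restatement for clarity):

/*
 * Allocation side (rmqueue_bulk_cluster), entered with zone->lock held:
 *
 *	atomic_inc(&zone->cluster.in_progress);
 *	spin_unlock(&zone->lock);
 *	...clear PageBuddy and page->cluster for every taken page...
 *	atomic_dec(&zone->cluster.in_progress);
 *
 * PFN-based users of PageBuddy (move_freepages(), compaction,
 * alloc_contig_range()) must not act on pages inside that window,
 * so they either just wait for it to drain ...
 */
static inline void zone_wait_cluster_alloc(struct zone *zone)
{
	while (atomic_read(&zone->cluster.in_progress))
		cpu_relax();
}

/* ... or additionally forbid new cluster allocations while they run: */
static inline void zone_wait_and_disable_cluster_alloc(struct zone *zone)
{
	unsigned long flags;

	spin_lock_irqsave(&zone->lock, flags);
	zone->cluster.disable_depth++;	/* new cluster allocs bail out   */
	zone_wait_cluster_alloc(zone);	/* drain the current race window */
	spin_unlock_irqrestore(&zone->lock, flags);
}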