Subject: Re: [PATCH v2] mm: Proactive compaction
From: Vlastimil Babka
To: Nitin Gupta, Mel Gorman, Michal Hocko
Cc: Matthew Wilcox, Andrew Morton, Mike Kravetz, linux-kernel, linux-mm,
 Linux API, Joonsoo Kim, David Rientjes
Date: Wed, 4 Mar 2020 19:04:29 +0100
In-Reply-To: <20200302213343.2712-1-nigupta@nvidia.com>
References: <20200302213343.2712-1-nigupta@nvidia.com>

+CC linux-api (new tunable) and more folks who have discussed compaction
before

On 3/2/20 10:33 PM, Nitin Gupta wrote:
> For some applications we need to allocate almost all memory as
> hugepages. However, on a running system, higher-order allocations can
> fail if the memory is fragmented. The Linux kernel currently does
> on-demand compaction as we request more hugepages, but this style of
> compaction incurs very high latency.
> Experiments with one-time full memory compaction (followed by hugepage
> allocations) show that the kernel is able to restore a highly fragmented
> memory state to a fairly compacted memory state within <1 sec for a 32G
> system. Such data suggests that a more proactive compaction can help us
> allocate a large fraction of memory as hugepages while keeping allocation
> latencies low.

Yeah, we should have something like this; currently kcompactd does only
very limited work.

> For a more proactive compaction, the approach taken here is to define
> a per-node tunable called 'proactiveness', which dictates bounds for
> external fragmentation for HUGETLB_PAGE_ORDER pages which kcompactd
> should try to maintain.
>
> The tunable is exposed through sysfs:
>
>   /sys/kernel/mm/compaction/node-n/proactiveness
>
> The value of this tunable is used to determine low and high thresholds
> for external fragmentation wrt HUGETLB_PAGE_ORDER order.
>
> Note that the previous version of this patch [1] was found to introduce
> too many tunables (per-order extfrag{low,high}); this one reduces them
> to just one per-node proactiveness value. Also, the new tunable is an
> opaque value instead of asking for specific bounds of "external
> fragmentation", which would have been difficult to estimate. The
> internal interpretation of this opaque value allows for future
> fine-tuning.

I guess we can live with that single tunable, like we have swappiness.
The per-order thresholds would be too much indeed. But does it have to be
per-node? Wouldn't a single one be enough?

> Currently, we use a simple translation from this tunable to [low, high]
> "proactive compaction score" thresholds (low=100-proactiveness,
> high=low+10%). The score for a node is defined as the weighted mean of
> per-zone external fragmentation wrt HUGETLB_PAGE_ORDER. A zone's
> present_pages determines its weight. Proactive compaction is triggered
> when a node's score exceeds its high threshold value and continues till
> it reaches its low value.
>
> To periodically check per-node score, we reuse per-node kcompactd
> threads, which are woken up every few milliseconds to check the same. If

A few milliseconds would be excessive. The code seems to define that as
500, which is somewhat better. Should there also be a backoff, though, if
it finds out it has nothing to do? And perhaps if kswapd is running on the
node, kcompactd should wait so they don't interfere?

> a node's score exceeds its high threshold (as derived from the
> user-provided proactiveness value), proactive compaction is started till
> its score reaches its low threshold value. By default, proactiveness is
> set to 0 (=> low=100%, high=100%) for all nodes.

Maybe we can go with a non-0 yet conservative default after some more
testing.

> This patch is largely based on ideas from Michal Hocko posted here:
> https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
>
> Performance data
> ================
>
> System: x86_64, 1T RAM, 80 CPU threads.
> Kernel: 5.6.0-rc3 + this patch
>
>   echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
>   echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
>
> Before starting the driver, the system was fragmented from a userspace
> program that allocates all memory and then, for each 2M aligned section,
> frees 3/4 of the base pages using munmap. The workload is mainly
> anonymous userspace pages, which are easy to move around.
> I intentionally avoided unmovable pages in this test to see how much
> latency we incur when hugepage allocations hit direct compaction.
>
> 1. Kernel hugepage allocation latencies
>
> With the system in such a fragmented state, a kernel driver then
> allocates as many hugepages as possible and measures allocation latency:
>
> (all latency values are in microseconds)
>
> - With vanilla 5.6.0-rc3
>
>   echo 0 | sudo tee /sys/kernel/mm/compaction/node-*/proactiveness
>
>   percentile  latency
>   ––––––––––  –––––––
>            5     7894
>           10     9496
>           25    12561
>           30    15295
>           40    18244
>           50    21229
>           60    27556
>           75    30147
>           80    31047
>           90    32859
>           95    33799
>
>   Total 2M hugepages allocated = 383859 (749G worth of hugepages out of
>   762G total free => 98% of free memory could be allocated as hugepages)
>
> - With 5.6.0-rc3 + this patch, with proactiveness=20
>
>   echo 20 | sudo tee /sys/kernel/mm/compaction/node-*/proactiveness
>
>   percentile  latency
>   ––––––––––  –––––––
>            5        2
>           10        2
>           25        3
>           30        3
>           40        3
>           50        4
>           60        4
>           75        4
>           80        4
>           90        5
>           95      429
>
>   Total 2M hugepages allocated = 11120 (21.7G worth of hugepages out of
>   25G total free => 98% of free memory could be allocated as hugepages)

By your description it seems to be a one-time fragmentation event followed
by a one-time stream of allocations? So kcompactd probably did the
proactive work just once? That works as a smoke test, but what I guess
will be more important is behavior under more complex workloads, where we
should also check the vmstat compact_daemon* stats and possibly also the
kcompactd kthreads' CPU utilization.

> 2. Java heap allocation
>
> First fragment memory using the same method as for (1).
>
> With memory in a fragmented state, run:
>
>   /usr/bin/time \
>     java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch
>
> to allocate 700G of Java heap using hugepages.
>
> - With vanilla 5.6.0-rc3
>
>   17.39user 1666.48system 27:37.89elapsed
>
> - With 5.6.0-rc3 + this patch, with proactiveness=20
>
>   8.35user 194.58system 3:19.62elapsed
>
> Elapsed time remains around 3:15 as proactiveness is further increased.

Similar comment as for the above test. But that's a large reduction, so I
wonder how much time kcompactd took for this defragmentation. Was there a
comparably large delay between the fragmentation and the Java application
to let kcompactd do the job? Or was it that much more efficient than the
one-compaction-per-single-hugepage that the Java page faulting caused?

>
> Backoff behavior
> ================
>
> The above workloads produce a memory state which is easy to compact.
> However, if memory is filled with unmovable pages, proactive compaction
> should essentially back off. To test this aspect:
>
> - Created a kernel driver that allocates almost all memory as hugepages
>   followed by freeing the first 3/4 of each hugepage.
> - Set proactiveness=40 (for all nodes).
> - Note that proactive_compact_node() is deferred the maximum number of
>   times, with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
>   (=> ~30 seconds between retries).
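
(For the record, if I read the backoff path correctly: proactive_defer gets
set to 1 << COMPACT_MAX_DEFER_SHIFT, i.e. 64 assuming the current mainline
value of 6, and one deferred wakeup is consumed per
HPAGE_FRAG_CHECK_INTERVAL_MSEC = 500 ms, so the next proactive attempt is
roughly 64 * 0.5 s = 32 s away, which matches the "~30 seconds between
retries" above.)
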
>
> [1] https://patchwork.kernel.org/patch/11098289/
>
> Signed-off-by: Nitin Gupta
> To: Mel Gorman
> To: Michal Hocko
> To: Vlastimil Babka
> CC: Matthew Wilcox
> CC: Andrew Morton
> CC: Mike Kravetz
> CC: linux-kernel
> CC: linux-mm

I haven't studied the code in detail yet, but leaving the mail for the
reference of the new CCs.

Thanks,
Vlastimil

> ---
> Changelog v2 vs v1:
>  - Introduce per-node and per-zone "proactive compaction score". This
>    score is compared against watermarks which are set according to the
>    user-provided proactiveness value.
>  - Separate code-paths for proactive compaction from targeted compaction,
>    i.e. where pgdat->kcompactd_max_order is non-zero.
>  - Renamed hpage_compaction_effort -> proactiveness. In future we may
>    use more than extfrag wrt hugepage size to determine the proactive
>    compaction score.
> ---
>  include/linux/compaction.h |  10 ++
>  mm/compaction.c            | 242 ++++++++++++++++++++++++++++++++++++-
>  mm/internal.h              |   1 +
>  mm/page_alloc.c            |   1 +
>  mm/vmstat.c                |  12 ++
>  5 files changed, 260 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index 4b898cdbdf05..c98f45107164 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -60,6 +60,15 @@ enum compact_result {
>  
>  struct alloc_context; /* in mm/internal.h */
>  
> +// "node-%d"
> +#define COMPACTION_STATE_NAME_LEN 16
> +// Per-node compaction state
> +struct compaction_state {
> +        int node_id;
> +        unsigned int proactiveness;
> +        char name[COMPACTION_STATE_NAME_LEN];
> +};
> +
>  /*
>   * Number of free order-0 pages that should be available above given watermark
>   * to make sure compaction has reasonable chance of not running out of free
> @@ -90,6 +99,7 @@ extern int sysctl_compaction_handler(struct ctl_table *table, int write,
>  extern int sysctl_extfrag_threshold;
>  extern int sysctl_compact_unevictable_allowed;
>  
> +extern int extfrag_for_order(struct zone *zone, unsigned int order);
>  extern int fragmentation_index(struct zone *zone, unsigned int order);
>  extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
>                  unsigned int order, unsigned int alloc_flags,
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 672d3c78c6ab..d906ccfedce0 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -25,6 +25,10 @@
>  #include
>  #include "internal.h"
>  
> +#ifdef CONFIG_COMPACTION
> +static struct compaction_state compaction_states[MAX_NUMNODES];
> +#endif
> +
>  #ifdef CONFIG_COMPACTION
>  static inline void count_compact_event(enum vm_event_item item)
>  {
> @@ -50,6 +54,8 @@ static inline void count_compact_events(enum vm_event_item item, long delta)
>  #define pageblock_start_pfn(pfn)        block_start_pfn(pfn, pageblock_order)
>  #define pageblock_end_pfn(pfn)          block_end_pfn(pfn, pageblock_order)
>  
> +static const int HPAGE_FRAG_CHECK_INTERVAL_MSEC = 500;
> +
>  static unsigned long release_freepages(struct list_head *freelist)
>  {
>          struct page *page, *next;
> @@ -1846,6 +1852,51 @@ static inline bool is_via_compact_memory(int order)
>          return order == -1;
>  }
>  
> +static int proactive_compaction_score_zone(struct zone *zone)
> +{
> +        unsigned long score;
> +
> +        score = zone->present_pages *
> +                        extfrag_for_order(zone, HUGETLB_PAGE_ORDER);
> +        score = div64_ul(score,
> +                        node_present_pages(zone->zone_pgdat->node_id) + 1);
> +        return score;
> +}
> +
> +static int proactive_compaction_score_node(pg_data_t *pgdat)
> +{
> +        unsigned long score = 0;
> +        int zoneid;
> +
> +        for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
> +                struct zone *zone;
> +
> +                zone = &pgdat->node_zones[zoneid];
> +                score += proactive_compaction_score_zone(zone);
> +        }
> +
> +        return score;
> +}
> +
> +static int proactive_compaction_score_wmark(pg_data_t *pgdat, bool low)
> +{
> +        int wmark_low;
> +
> +        wmark_low = 100 - compaction_states[pgdat->node_id].proactiveness;
> +        return low ? wmark_low : min(wmark_low + 10, 100);
> +}
> +
> +static bool should_proactive_compact_node(pg_data_t *pgdat)
> +{
> +        int wmark_high;
> +
> +        if (!compaction_states[pgdat->node_id].proactiveness)
> +                return false;
> +
> +        wmark_high = proactive_compaction_score_wmark(pgdat, false);
> +        return proactive_compaction_score_node(pgdat) > wmark_high;
> +}
> +
>  static enum compact_result __compact_finished(struct compact_control *cc)
>  {
>          unsigned int order;
> @@ -1872,6 +1923,19 @@ static enum compact_result __compact_finished(struct compact_control *cc)
>                  return COMPACT_PARTIAL_SKIPPED;
>          }
>  
> +        if (cc->proactive_compaction) {
> +                int score, wmark_low;
> +
> +                score = proactive_compaction_score_zone(cc->zone);
> +                wmark_low = proactive_compaction_score_wmark(
> +                                cc->zone->zone_pgdat, true);
> +                if (score > wmark_low)
> +                        ret = COMPACT_CONTINUE;
> +                else
> +                        ret = COMPACT_SUCCESS;
> +                goto out;
> +        }
> +
>          if (is_via_compact_memory(cc->order))
>                  return COMPACT_CONTINUE;
>  
> @@ -1930,6 +1994,7 @@ static enum compact_result __compact_finished(struct compact_control *cc)
>                  }
>          }
>  
> +out:
>          if (cc->contended || fatal_signal_pending(current))
>                  ret = COMPACT_CONTENDED;
>  
> @@ -2301,6 +2366,7 @@ static enum compact_result compact_zone_order(struct zone *zone, int order,
>                  .alloc_flags = alloc_flags,
>                  .classzone_idx = classzone_idx,
>                  .direct_compaction = true,
> +                .proactive_compaction = false,
>                  .whole_zone = (prio == MIN_COMPACT_PRIORITY),
>                  .ignore_skip_hint = (prio == MIN_COMPACT_PRIORITY),
>                  .ignore_block_suitable = (prio == MIN_COMPACT_PRIORITY)
> @@ -2404,6 +2470,34 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
>          return rc;
>  }
>  
> +/* Compact all zones within a node according to proactiveness */
> +static void proactive_compact_node(pg_data_t *pgdat)
> +{
> +        int zoneid;
> +        struct zone *zone;
> +        struct compact_control cc = {
> +                .order = -1,
> +                .mode = MIGRATE_SYNC_LIGHT,
> +                .ignore_skip_hint = true,
> +                .whole_zone = true,
> +                .gfp_mask = GFP_KERNEL,
> +                .direct_compaction = false,
> +                .proactive_compaction = true,
> +        };
> +
> +        for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
> +                zone = &pgdat->node_zones[zoneid];
> +                if (!populated_zone(zone))
> +                        continue;
> +
> +                cc.zone = zone;
> +
> +                compact_zone(&cc, NULL);
> +
> +                VM_BUG_ON(!list_empty(&cc.freepages));
> +                VM_BUG_ON(!list_empty(&cc.migratepages));
> +        }
> +}
>  
>  /* Compact all zones within a node */
>  static void compact_node(int nid)
> @@ -2417,9 +2511,10 @@ static void compact_node(int nid)
>                  .ignore_skip_hint = true,
>                  .whole_zone = true,
>                  .gfp_mask = GFP_KERNEL,
> +                .direct_compaction = false,
> +                .proactive_compaction = false,
>          };
>  
> -
>          for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
>  
>                  zone = &pgdat->node_zones[zoneid];
> @@ -2492,6 +2587,118 @@ void compaction_unregister_node(struct node *node)
>  }
>  #endif /* CONFIG_SYSFS && CONFIG_NUMA */
>  
> +#ifdef CONFIG_SYSFS
> +
> +#define COMPACTION_ATTR_RO(_name) \
> +        static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
> +
> +#define COMPACTION_ATTR(_name) \
> +        static struct kobj_attribute _name##_attr = \
> +                __ATTR(_name, 0644, _name##_show, _name##_store)
> +
> +static struct kobject *compaction_kobj;
> +static struct kobject *compaction_kobjs[MAX_NUMNODES];
> +
> +static struct compaction_state *kobj_to_compaction_state(struct kobject *kobj)
> +{
> +        int node;
> +
> +        for_each_online_node(node) {
> +                if (compaction_kobjs[node] == kobj)
> +                        return &compaction_states[node];
> +        }
> +
> +        return NULL;
> +}
> +
> +static ssize_t proactiveness_store(struct kobject *kobj,
> +                struct kobj_attribute *attr, const char *buf, size_t count)
> +{
> +        int err;
> +        unsigned long input;
> +        struct compaction_state *c = kobj_to_compaction_state(kobj);
> +
> +        err = kstrtoul(buf, 10, &input);
> +        if (err)
> +                return err;
> +        if (input > 100)
> +                return -EINVAL;
> +
> +        c->proactiveness = input;
> +        return count;
> +}
> +
> +static ssize_t proactiveness_show(struct kobject *kobj,
> +                struct kobj_attribute *attr, char *buf)
> +{
> +        struct compaction_state *c = kobj_to_compaction_state(kobj);
> +
> +        return sprintf(buf, "%u\n", c->proactiveness);
> +}
> +
> +COMPACTION_ATTR(proactiveness);
> +
> +static struct attribute *compaction_attrs[] = {
> +        &proactiveness_attr.attr,
> +        NULL,
> +};
> +
> +static const struct attribute_group compaction_attr_group = {
> +        .attrs = compaction_attrs,
> +};
> +
> +static int compaction_sysfs_add_node(struct compaction_state *c,
> +                struct kobject *parent, struct kobject **compaction_kobjs,
> +                const struct attribute_group *compaction_attr_group)
> +{
> +        int retval;
> +
> +        compaction_kobjs[c->node_id] =
> +                kobject_create_and_add(c->name, parent);
> +        if (!compaction_kobjs[c->node_id])
> +                return -ENOMEM;
> +
> +        retval = sysfs_create_group(compaction_kobjs[c->node_id],
> +                        compaction_attr_group);
> +        if (retval)
> +                kobject_put(compaction_kobjs[c->node_id]);
> +
> +        return retval;
> +}
> +
> +static void __init compaction_sysfs_init(void)
> +{
> +        struct compaction_state *c;
> +        int err, node;
> +
> +        compaction_kobj = kobject_create_and_add("compaction", mm_kobj);
> +        if (!compaction_kobj)
> +                return;
> +
> +        for_each_online_node(node) {
> +                c = &compaction_states[node];
> +                err = compaction_sysfs_add_node(c, compaction_kobj,
> +                                compaction_kobjs,
> +                                &compaction_attr_group);
> +                if (err)
> +                        pr_err("compaction: Unable to add state %s", c->name);
> +        }
> +}
> +
> +static void __init compaction_init(void)
> +{
> +        int node;
> +
> +        for_each_online_node(node) {
> +                struct compaction_state *c = &compaction_states[node];
> +
> +                c->node_id = node;
> +                c->proactiveness = 0;
> +                snprintf(c->name, COMPACTION_STATE_NAME_LEN, "node-%d", node);
> +        }
> +}
> +#endif
> +
>  static inline bool kcompactd_work_requested(pg_data_t *pgdat)
>  {
>          return pgdat->kcompactd_max_order > 0 || kthread_should_stop();
> @@ -2532,6 +2739,8 @@ static void kcompactd_do_work(pg_data_t *pgdat)
>                  .mode = MIGRATE_SYNC_LIGHT,
>                  .ignore_skip_hint = false,
>                  .gfp_mask = GFP_KERNEL,
> +                .direct_compaction = false,
> +                .proactive_compaction = false,
>          };
>          trace_mm_compaction_kcompactd_wake(pgdat->node_id, cc.order,
>                                                          cc.classzone_idx);
> @@ -2629,6 +2838,7 @@ static int kcompactd(void *p)
>  {
>          pg_data_t *pgdat = (pg_data_t*)p;
>          struct task_struct *tsk = current;
> +        unsigned int proactive_defer = 0;
>  
>          const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
>  
> @@ -2644,12 +2854,29 @@ static int kcompactd(void *p)
>                  unsigned long pflags;
>  
>                  trace_mm_compaction_kcompactd_sleep(pgdat->node_id);
> -                wait_event_freezable(pgdat->kcompactd_wait,
> -                                kcompactd_work_requested(pgdat));
> +                if (wait_event_freezable_timeout(pgdat->kcompactd_wait,
> +                        kcompactd_work_requested(pgdat),
> +                        msecs_to_jiffies(HPAGE_FRAG_CHECK_INTERVAL_MSEC))) {
>  
> -                psi_memstall_enter(&pflags);
> -                kcompactd_do_work(pgdat);
> -                psi_memstall_leave(&pflags);
> +                        psi_memstall_enter(&pflags);
> +                        kcompactd_do_work(pgdat);
> +                        psi_memstall_leave(&pflags);
> +                        continue;
> +                }
> +
> +                if (should_proactive_compact_node(pgdat)) {
> +                        unsigned int prev_score, score;
> +
> +                        if (proactive_defer) {
> +                                proactive_defer--;
> +                                continue;
> +                        }
> +                        prev_score = proactive_compaction_score_node(pgdat);
> +                        proactive_compact_node(pgdat);
> +                        score = proactive_compaction_score_node(pgdat);
> +                        proactive_defer = score < prev_score ?
> +                                        0 : 1 << COMPACT_MAX_DEFER_SHIFT;
> +                }
>          }
>  
>          return 0;
> @@ -2726,6 +2953,9 @@ static int __init kcompactd_init(void)
>                  return ret;
>          }
>  
> +        compaction_init();
> +        compaction_sysfs_init();
> +
>          for_each_node_state(nid, N_MEMORY)
>                  kcompactd_run(nid);
>          return 0;
> diff --git a/mm/internal.h b/mm/internal.h
> index 3cf20ab3ca01..e66bafd6c7a2 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -203,6 +203,7 @@ struct compact_control {
>          bool no_set_skip_hint;          /* Don't mark blocks for skipping */
>          bool ignore_block_suitable;     /* Scan blocks considered unsuitable */
>          bool direct_compaction;         /* False from kcompactd or /proc/... */
> +        bool proactive_compaction;      /* kcompactd proactive compaction */
>          bool whole_zone;                /* Whole zone should/has been scanned */
>          bool contended;                 /* Signal lock or sched contention */
>          bool rescan;                    /* Rescanning the same pageblock */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3c4eb750a199..e92c706e93ee 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -8402,6 +8402,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>                  .ignore_skip_hint = true,
>                  .no_set_skip_hint = true,
>                  .gfp_mask = current_gfp_context(gfp_mask),
> +                .proactive_compaction = false,
>          };
>          INIT_LIST_HEAD(&cc.migratepages);
>  
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 78d53378db99..70d724122643 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1074,6 +1074,18 @@ static int __fragmentation_index(unsigned int order, struct contig_page_info *in
>          return 1000 - div_u64( (1000+(div_u64(info->free_pages * 1000ULL, requested))), info->free_blocks_total);
>  }
>  
> +int extfrag_for_order(struct zone *zone, unsigned int order)
> +{
> +        struct contig_page_info info;
> +
> +        fill_contig_page_info(zone, order, &info);
> +        if (info.free_pages == 0)
> +                return 0;
> +
> +        return (info.free_pages - (info.free_blocks_suitable << order)) * 100
> +                        / info.free_pages;
> +}
> +
>  /* Same as __fragmentation index but allocs contig_page_info on stack */
>  int fragmentation_index(struct zone *zone, unsigned int order)
>  {
>
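
P.S. In case it is useful for testing, below is a minimal userspace sketch
for poking the proposed interface: it writes a per-node proactiveness value
and reads it back. The sysfs path and the 0-100 range are taken from this
patch; the node/value defaults and the error handling are only illustrative.

  /* proactiveness.c - set and read back the proposed per-node tunable */
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
          char path[128];
          unsigned int node = 0, value = 20, readback;
          FILE *f;

          if (argc > 1)
                  node = atoi(argv[1]);
          if (argc > 2)
                  value = atoi(argv[2]);
          if (value > 100) {
                  fprintf(stderr, "proactiveness must be 0-100\n");
                  return 1;
          }

          snprintf(path, sizeof(path),
                   "/sys/kernel/mm/compaction/node-%u/proactiveness", node);

          /* Write the new value (needs root and a kernel with this series). */
          f = fopen(path, "w");
          if (!f || fprintf(f, "%u\n", value) < 0) {
                  perror(path);
                  return 1;
          }
          fclose(f);

          /* Read it back to confirm the store path accepted it. */
          f = fopen(path, "r");
          if (!f || fscanf(f, "%u", &readback) != 1) {
                  perror(path);
                  return 1;
          }
          fclose(f);

          printf("node %u proactiveness = %u\n", node, readback);
          return 0;
  }

Compile with "gcc -o proactiveness proactiveness.c", run e.g.
"./proactiveness 0 20" as root, and then watch the compact_daemon* counters
in /proc/vmstat while the workload runs.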