From: Hillf Danton <hdanton@sina.com>
To: linux-mm <linux-mm@kvack.org>
Cc: Andrew Morton, linux-kernel, Chris Down, Tejun Heo,
	Roman Gushchin, Michal Hocko, Johannes Weiner, Shakeel Butt,
	Matthew Wilcox, Minchan Kim, Mel Gorman, Hillf Danton
Subject: [RFC v2] memcg: add memcg lru for page reclaiming
Date: Sat, 26 Oct 2019 19:07:45 +0800
Message-Id: <20191026110745.12956-1-hdanton@sina.com>

Currently soft limit reclaim (slr) is frozen, see
Documentation/admin-guide/cgroup-v2.rst for reasons. This work adds a
memcg hook into kswapd's logic to bypass slr, paving a brick for its
cleanup later.

After b23afb93d317 ("memcg: punt high overage reclaim to
return-to-userland path"), high limit breachers are reclaimed one
after another, spiraling up through the memcg hierarchy, before
returning to userspace. We cannot add the new hook unless that
reclaiming can be deferred a bit further, until kswapd becomes
active. It can be deferred, however, because a high limit breach
looks benign in the absence of memory pressure, and in the presence
of pressure kswapd will ensure it is reclaimed soon.
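For reference, the spiral-up being deferred is the reclaim_high()
loop below, a simplified sketch built from the context lines of the
hunks in this patch (context only, not part of the patch): starting
at the breacher, every level still above its high limit is reclaimed
from, all the way up the hierarchy, before returning to userspace.

	static void reclaim_high(struct mem_cgroup *memcg,
				 unsigned int nr_pages,
				 gfp_t gfp_mask)
	{
		do {
			/* skip levels already within their high limit */
			if (page_counter_read(&memcg->memory) <= memcg->high)
				continue;
			memcg_memory_event(memcg, MEMCG_HIGH);
			/* rip nr_pages from this level, then walk up */
			try_to_free_mem_cgroup_pages(memcg, nr_pages,
						     gfp_mask, true);
		} while ((memcg = parent_mem_cgroup(memcg)));
	}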
To delay reclaiming, the spiral-up is broken into two halves: the top
half, which only rips the first victim, and the bottom half, which
only queues the victim's first ancestor for later processing. The
deferral can be ignored if we are already under memory pressure;
otherwise the work is done after the bottom half. We then need a FIFO
list to facilitate queuing up breachers and ripping them in round
robin once kswapd starts working; it is essentially a simple copy of
the page lru.
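A sketch of the intended call flow, inferred from the hunks below
(names as in the patch):

	/*
	 * Charge path, on return to userspace:
	 *   reclaim_high(breacher)
	 *     -> rips the first victim (the breacher itself)
	 *     -> memcg_add_lru(first breaching ancestor)
	 *        queues it on the global FIFO and returns
	 *
	 * kswapd, once active (shrink_zones()/balance_pgdat()):
	 *   mem_cgroup_reclaim_high()
	 *     -> memcg_pinch_lru() pops queued memcgs in FIFO order,
	 *        skipping any that no longer breach their high limit
	 *     -> schedule_work(&memcg->high_work) to reclaim from the
	 *        popped one, which in turn requeues its own ancestor
	 */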
The new hook is not added without addressing another of slr's
problems, though the first one was already solved by the current
spiral-up. Overreclaim is solved by ripping MEMCG_CHARGE_BATCH pages
at a time from a victim, which is the current high work behavior too.

V2 is based on next-20191025.

Changes since v1
- drop MEMCG_LRU
- add hook into kswapd's logic to bypass slr

Changes since v0
- add MEMCG_LRU in init/Kconfig
- drop changes in mm/vmscan.c
- make memcg lru work in parallel to slr

Cc: Chris Down
Cc: Tejun Heo
Cc: Roman Gushchin
Cc: Michal Hocko
Cc: Johannes Weiner
Cc: Shakeel Butt
Cc: Matthew Wilcox
Cc: Minchan Kim
Cc: Mel Gorman
Signed-off-by: Hillf Danton
---

--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -222,6 +222,8 @@ struct mem_cgroup {
 	/* Upper bound of normal memory consumption range */
 	unsigned long high;
 
+	struct list_head lru_node;
+
 	/* Range enforcement for interrupt charges */
 	struct work_struct high_work;
 
@@ -740,6 +742,8 @@ static inline void mod_lruvec_page_state
 	local_irq_restore(flags);
 }
 
+void mem_cgroup_reclaim_high(void);
+
 unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 						gfp_t gfp_mask,
 						unsigned long *total_scanned);
@@ -1126,6 +1130,10 @@ static inline void __mod_lruvec_slab_sta
 	__mod_node_page_state(page_pgdat(page), idx, val);
 }
 
+static inline void mem_cgroup_reclaim_high(void)
+{
+}
+
 static inline
 unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat,
 					    int order, gfp_t gfp_mask,
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2332,14 +2332,57 @@ static int memcg_hotplug_cpu_dead(unsign
 	return 0;
 }
 
+static DEFINE_SPINLOCK(memcg_lru_lock);
+static LIST_HEAD(memcg_lru);	/* a copy of page lru */
+
+static void memcg_add_lru(struct mem_cgroup *memcg)
+{
+	spin_lock_irq(&memcg_lru_lock);
+	if (list_empty(&memcg->lru_node))
+		list_add_tail(&memcg->lru_node, &memcg_lru);
+	spin_unlock_irq(&memcg_lru_lock);
+}
+
+static struct mem_cgroup *memcg_pinch_lru(void)
+{
+	struct mem_cgroup *memcg, *next;
+
+	spin_lock_irq(&memcg_lru_lock);
+
+	list_for_each_entry_safe(memcg, next, &memcg_lru, lru_node) {
+		list_del_init(&memcg->lru_node);
+
+		if (page_counter_read(&memcg->memory) > memcg->high) {
+			spin_unlock_irq(&memcg_lru_lock);
+			return memcg;
+		}
+	}
+	spin_unlock_irq(&memcg_lru_lock);
+
+	return NULL;
+}
+
+void mem_cgroup_reclaim_high(void)
+{
+	struct mem_cgroup *memcg = memcg_pinch_lru();
+
+	if (memcg)
+		schedule_work(&memcg->high_work);
+}
+
 static void reclaim_high(struct mem_cgroup *memcg,
 			 unsigned int nr_pages,
 			 gfp_t gfp_mask)
 {
+	struct mem_cgroup *victim = memcg;
 	do {
 		if (page_counter_read(&memcg->memory) <= memcg->high)
 			continue;
 		memcg_memory_event(memcg, MEMCG_HIGH);
+		if (victim != memcg) {
+			memcg_add_lru(memcg);
+			return;
+		}
 		try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true);
 	} while ((memcg = parent_mem_cgroup(memcg)));
 }
@@ -5055,6 +5098,7 @@ static struct mem_cgroup *mem_cgroup_all
 	if (memcg_wb_domain_init(memcg, GFP_KERNEL))
 		goto fail;
 
+	INIT_LIST_HEAD(&memcg->lru_node);
 	INIT_WORK(&memcg->high_work, high_work_func);
 	memcg->last_scanned_node = MAX_NUMNODES;
 	INIT_LIST_HEAD(&memcg->oom_notify);
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2996,6 +2996,9 @@ static void shrink_zones(struct zonelist
 		if (zone->zone_pgdat == last_pgdat)
 			continue;
 
+		mem_cgroup_reclaim_high();
+		continue;
+
 		/*
 		 * This steals pages from memory cgroups over softlimit
 		 * and returns the number of reclaimed pages and
@@ -3690,12 +3693,16 @@ restart:
 	if (sc.priority < DEF_PRIORITY - 2)
 		sc.may_writepage = 1;
 
+	mem_cgroup_reclaim_high();
+	goto soft_limit_reclaim_end;
+
 	/* Call soft limit reclaim before calling shrink_node. */
 	sc.nr_scanned = 0;
 	nr_soft_scanned = 0;
 	nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(pgdat, sc.order,
 						sc.gfp_mask,
 						&nr_soft_scanned);
 	sc.nr_reclaimed += nr_soft_reclaimed;
+soft_limit_reclaim_end:
 
 	/*
 	 * There should be no need to raise the scanning priority if
--
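P.S. For context on the MEMCG_CHARGE_BATCH-sized ripping mentioned
above: the work scheduled by mem_cgroup_reclaim_high() is the existing
high work, which in the tree this is based on looks roughly like the
following (quoted from memory as a sketch, not part of this patch):

	/* mm/memcontrol.c: existing handler behind memcg->high_work */
	static void high_work_func(struct work_struct *work)
	{
		struct mem_cgroup *memcg;

		memcg = container_of(work, struct mem_cgroup, high_work);
		/* rip one batch at a time from the queued victim */
		reclaim_high(memcg, MEMCG_CHARGE_BATCH, GFP_KERNEL);
	}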