From: Johannes Weiner
To: linux-mm@kvack.org
Cc: Rik van Riel, Minchan Kim, Michal Hocko, Andrew Morton, Joonsoo Kim,
    linux-kernel@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH 12/14] mm: vmscan: determine anon/file pressure balance at the reclaim root
Date: Wed, 20 May 2020 19:25:23 -0400
Message-Id: <20200520232525.798933-13-hannes@cmpxchg.org>
In-Reply-To: <20200520232525.798933-1-hannes@cmpxchg.org>
References: <20200520232525.798933-1-hannes@cmpxchg.org>

We split the LRU lists into anon and file, and we rebalance the scan
pressure between them when one of them begins thrashing: if the file
cache experiences workingset refaults, we increase the pressure on
anonymous pages; if the workload is stalled on swapins, we increase the
pressure on the file cache instead.

With cgroups and their nested LRU lists, we currently don't do this
correctly. While recursive cgroup reclaim establishes a relative LRU
order among the pages of all involved cgroups, LRU pressure balancing
is done on an individual cgroup LRU level. As a result, when one cgroup
is thrashing on the filesystem cache while a sibling may have cold
anonymous pages, pressure doesn't get equalized between them.

This patch moves the LRU balancing decision to the root of reclaim - the
same level where the LRU order is established.

It does this by tracking LRU cost recursively, so that every level of
the cgroup tree knows the aggregate LRU cost of all memory within its
domain.
When the page scanner calculates the scan balance for any given
individual cgroup's LRU list, it uses the values from the ancestor
cgroup that initiated the reclaim cycle.

If one sibling is then thrashing on the cache, it will tip the pressure
balance inside its ancestors, and the next hierarchical reclaim
iteration will go more after the anon pages in the tree.

Signed-off-by: Johannes Weiner
---
 include/linux/memcontrol.h | 13 ++++++++++++
 mm/swap.c                  | 32 ++++++++++++++++++++++++-----
 mm/vmscan.c                | 41 ++++++++++++++++----------------------
 3 files changed, 57 insertions(+), 29 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 32a0b4d47540..d982c80da157 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1303,6 +1303,19 @@ static inline void dec_lruvec_page_state(struct page *page,
 	mod_lruvec_page_state(page, idx, -1);
 }
 
+static inline struct lruvec *parent_lruvec(struct lruvec *lruvec)
+{
+	struct mem_cgroup *memcg;
+
+	memcg = lruvec_memcg(lruvec);
+	if (!memcg)
+		return NULL;
+	memcg = parent_mem_cgroup(memcg);
+	if (!memcg)
+		return NULL;
+	return mem_cgroup_lruvec(memcg, lruvec_pgdat(lruvec));
+}
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
diff --git a/mm/swap.c b/mm/swap.c
index 2ff91656dea2..3d8aa46c47ff 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -266,11 +266,33 @@ void lru_note_cost(struct page *page)
 {
 	struct lruvec *lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
 
-	/* Record new data point */
-	if (page_is_file_lru(page))
-		lruvec->file_cost++;
-	else
-		lruvec->anon_cost++;
+	do {
+		unsigned long lrusize;
+
+		/* Record cost event */
+		if (page_is_file_lru(page))
+			lruvec->file_cost++;
+		else
+			lruvec->anon_cost++;
+
+		/*
+		 * Decay previous events
+		 *
+		 * Because workloads change over time (and to avoid
+		 * overflow) we keep these statistics as a floating
+		 * average, which ends up weighing recent refaults
+		 * more than old ones.
+		 */
+		lrusize = lruvec_page_state(lruvec, NR_INACTIVE_ANON) +
+			  lruvec_page_state(lruvec, NR_ACTIVE_ANON) +
+			  lruvec_page_state(lruvec, NR_INACTIVE_FILE) +
+			  lruvec_page_state(lruvec, NR_ACTIVE_FILE);
+
+		if (lruvec->file_cost + lruvec->anon_cost > lrusize / 4) {
+			lruvec->file_cost /= 2;
+			lruvec->anon_cost /= 2;
+		}
+	} while ((lruvec = parent_lruvec(lruvec)));
 }
 
 static void __activate_page(struct page *page, struct lruvec *lruvec,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e7e6868bcbf7..1487ff5d4698 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -79,6 +79,12 @@ struct scan_control {
 	 */
 	struct mem_cgroup *target_mem_cgroup;
 
+	/*
+	 * Scan pressure balancing between anon and file LRUs
+	 */
+	unsigned long anon_cost;
+	unsigned long file_cost;
+
 	/* Can active pages be deactivated as part of reclaim? */
 #define DEACTIVATE_ANON 1
 #define DEACTIVATE_FILE 2
@@ -2231,10 +2237,8 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	int swappiness = mem_cgroup_swappiness(memcg);
 	u64 fraction[2];
 	u64 denominator = 0;	/* gcc */
-	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	unsigned long anon_prio, file_prio;
 	enum scan_balance scan_balance;
-	unsigned long anon, file;
 	unsigned long totalcost;
 	unsigned long ap, fp;
 	enum lru_list lru;
@@ -2285,7 +2289,6 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	}
 
 	scan_balance = SCAN_FRACT;
-
 	/*
 	 * Calculate the pressure balance between anon and file pages.
 	 *
@@ -2300,30 +2303,12 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	anon_prio = swappiness;
 	file_prio = 200 - anon_prio;
 
-	/*
-	 * Because workloads change over time (and to avoid overflow)
-	 * we keep these statistics as a floating average, which ends
-	 * up weighing recent refaults more than old ones.
-	 */
-
-	anon = lruvec_lru_size(lruvec, LRU_ACTIVE_ANON, MAX_NR_ZONES) +
-		lruvec_lru_size(lruvec, LRU_INACTIVE_ANON, MAX_NR_ZONES);
-	file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES) +
-		lruvec_lru_size(lruvec, LRU_INACTIVE_FILE, MAX_NR_ZONES);
-
-	spin_lock_irq(&pgdat->lru_lock);
-	totalcost = lruvec->anon_cost + lruvec->file_cost;
-	if (unlikely(totalcost > (anon + file) / 4)) {
-		lruvec->anon_cost /= 2;
-		lruvec->file_cost /= 2;
-		totalcost /= 2;
-	}
+	totalcost = sc->anon_cost + sc->file_cost;
 	ap = anon_prio * (totalcost + 1);
-	ap /= lruvec->anon_cost + 1;
+	ap /= sc->anon_cost + 1;
 
 	fp = file_prio * (totalcost + 1);
-	fp /= lruvec->file_cost + 1;
-	spin_unlock_irq(&pgdat->lru_lock);
+	fp /= sc->file_cost + 1;
 
 	fraction[0] = ap;
 	fraction[1] = fp;
@@ -2679,6 +2664,14 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	nr_reclaimed = sc->nr_reclaimed;
 	nr_scanned = sc->nr_scanned;
 
+	/*
+	 * Determine the scan balance between anon and file LRUs.
+	 */
+	spin_lock_irq(&pgdat->lru_lock);
+	sc->anon_cost = target_lruvec->anon_cost;
+	sc->file_cost = target_lruvec->file_cost;
+	spin_unlock_irq(&pgdat->lru_lock);
+
 	/*
 	 * Target desirable inactive:active list ratios for the anon
 	 * and file LRU lists.
-- 
2.26.2
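
For readers following along outside a kernel tree, here is a rough
user-space sketch of the two ideas in this patch: charging each cost
event to every ancestor with decay, and computing ap/fp from the
reclaim root's counters. It is not kernel code. struct model_lruvec,
model_note_cost() and model_scan_balance() are hypothetical names
invented for illustration, the lrusize field stands in for the
lruvec_page_state() sums, and locking is left out entirely.

```c
#include <stddef.h>

/* Hypothetical user-space stand-in for the kernel's struct lruvec. */
struct model_lruvec {
	struct model_lruvec *parent;	/* NULL at the root of the tree */
	unsigned long anon_cost;
	unsigned long file_cost;
	unsigned long lrusize;		/* assumed: pages on this level's LRUs */
};

/*
 * Charge one cost event to @lruvec and to each of its ancestors,
 * halving both counters at any level where their sum exceeds a
 * quarter of that level's LRU size (mirrors the lru_note_cost() loop).
 */
static void model_note_cost(struct model_lruvec *lruvec, int file)
{
	do {
		if (file)
			lruvec->file_cost++;
		else
			lruvec->anon_cost++;

		if (lruvec->file_cost + lruvec->anon_cost >
		    lruvec->lrusize / 4) {
			lruvec->file_cost /= 2;
			lruvec->anon_cost /= 2;
		}
	} while ((lruvec = lruvec->parent));
}

/*
 * Compute the anon and file pressure values from the counters of
 * @target, using the same arithmetic as the ap/fp lines in
 * get_scan_count(). @swappiness is expected to be in the range 0..200.
 */
static void model_scan_balance(const struct model_lruvec *target,
			       unsigned long swappiness,
			       unsigned long *ap, unsigned long *fp)
{
	unsigned long totalcost = target->anon_cost + target->file_cost;

	*ap = swappiness * (totalcost + 1) / (target->anon_cost + 1);
	*fp = (200 - swappiness) * (totalcost + 1) / (target->file_cost + 1);
}
```

As a worked case under these assumptions: take a root of 4000 pages
with two children A and B of 2000 pages each, and charge 100 file cost
events to A. No level reaches its decay threshold, so the root ends up
with file_cost = 100 and anon_cost = 0. With swappiness 60, the root's
counters give ap = 6060 and fp = 140, whereas B's own (empty) counters
would give ap = 60 and fp = 140. Using the root's values therefore
steers reclaim toward anon pages across the tree, including in B.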