References: <20220309021230.721028-1-yuzhao@google.com>
 <20220309021230.721028-7-yuzhao@google.com>
 <87wnguwif3.fsf@yhuang6-desk2.ccr.corp.intel.com>
In-Reply-To: <87wnguwif3.fsf@yhuang6-desk2.ccr.corp.intel.com>
From: Yu Zhao
Date: Wed, 16 Mar 2022 01:54:41 -0600
Subject: Re: [PATCH v9 06/14] mm: multi-gen LRU: minimal implementation
To: "Huang, Ying"
Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Aneesh Kumar,
 Catalin Marinas, Dave Hansen, Hillf Danton, Jens Axboe, Jesse Barnes,
 Johannes Weiner, Jonathan Corbet, Matthew Wilcox, Mel Gorman,
 Michael Larabel, Michal Hocko, Mike Rapoport, Rik van Riel,
 Vlastimil Babka, Will Deacon, Linux ARM, "open list:DOCUMENTATION",
 linux-kernel, Linux-MM, Kernel Page Reclaim v2,
 "the arch/x86 maintainers", Brian Geffon, Jan Alexander Steffens,
 Oleksandr Natalenko, Steven Barrett, Suleiman Souhlal, Daniel Byrne,
 Donald Carr, Holger Hoffstätte, Konstantin Kharlamov, Shuang Zhai,
 Sofia Trinh, Vaibhav Jain
Content-Type: text/plain; charset="UTF-8"

On Tue, Mar 15, 2022 at 11:55 PM Huang, Ying wrote:
>
> Hi, Yu,
>
> Yu Zhao writes:
>
> [snip]
>
> >
> > +static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc)
> > +{
> > +	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> > +	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> > +
> > +	if (!can_demote(pgdat->node_id, sc) &&
> > +	    mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH)
> > +		return 0;
> > +
> > +	return mem_cgroup_swappiness(memcg);
> > +}
> > +
>
> We have tested v9 for memory tiering system, the demotion works now even
> without swap devices configured. Thanks!

Admittedly I didn't test it :) So thanks for testing -- I'm glad to
hear it didn't fall apart.

> And we found that the demotion (page reclaiming on DRAM nodes) speed is
> lower than the original implementation.

This sounds like an improvement to me, assuming the initial hot/cold
memory placements were similar for both the baseline and MGLRU.

Correct me if I'm wrong: since demotion is driven by promotion, lower
demotion speed means hot and cold pages were sorted out to DRAM and
AEP at a faster speed, hence an improvement.

# promotion path (MGLRU disabled first, then enabled):
numa_hint_faults    498301236
numa_pages_migrated 152650705

numa_hint_faults    494583387
numa_pages_migrated 34165992

# demotion path (MGLRU disabled first, then enabled):
pgsteal_anon 153798203
pgsteal_file 33

pgsteal_anon 32701576
pgsteal_file 33

The hint faults are similar but MGLRU has far fewer pages migrated --
my guess is it demoted far fewer hot/warm pages and therefore led to
less work on the promotion path.

> The workload itself is just a
> memory accessing micro-benchmark with Gauss distribution. It is run on
> a system with DRAM and PMEM. Initially, quite some hot pages are placed
> in PMEM and quite some cold pages are placed in DRAM. Then the page
> placement optimizing mechanism based on NUMA balancing will try to
> promote some hot pages from PMEM node to DRAM node.

My understanding seems to be correct?

> If the DRAM node
> near full (reach high watermark), kswapd of the DRAM node will be woke
> up to demote (reclaim) some cold DRAM pages to PMEM. Because quite some
> pages on DRAM is very cold (not accessed for at least several seconds),
> the benchmark performance will be better if demotion speed is faster.

I'm confused. It seems to me demotion speed is irrelevant. The time to
reach the equilibrium is what we want to measure.

> Some data comes from /proc/vmstat and perf-profile is as follows.
>
> From /proc/vmstat, it seems that the page scanned and page demoted is
> much less with MGLRU enabled. The pgdemote_kswapd / pgscan_kswapd is
> 5.22 times higher with MGLRU enabled than that with MGLRU disabled. I
> think this shows the value of direct page table scanning.

Can't disagree :)

> From perf-profile, the CPU cycles for kswapd is same. But less pages
> are demoted (reclaimed) with MGLRU. And it appears that the total page
> table scanning time of MGLRU is longer if we compare walk_page_range
> (1.97%, MGLRU enabled) and page_referenced (0.54%, MGLRU disabled)?

It's possible if the address space is very large and sparse.
But once MGLRU warms up, it should detect it and fall back to
page_referenced().

> Because we only demote (reclaim) from DRAM nodes, but not demote
> (reclaim) from PMEM nodes and bloom filter doesn't work well enough?

The bloom filters are per lruvec. So this should affect them.

> One thing that may be not friendly for bloom filter is that some virtual
> pages may change their resident nodes because of demotion/promotion.

Yes, it's possible.

> Can you teach me to how interpret these data for MGLRU? Or can you
> point me to the other/better data for MGLRU?

You are the expert :)

My current understanding is that this is an improvement. IOW, with
MGLRU, DRAM (hot) <-> AEP (cold) reached equilibrium a lot faster.

> MGLRU disabled via: echo -n 0 > /sys/kernel/mm/lru_gen/enabled
> --------------------------------------------------------------
>
> /proc/vmstat:
>
> pgactivate 1767172340
> pgdeactivate 1740111896
> pglazyfree 0
> pgfault 583875828
> pgmajfault 0
> pglazyfreed 0
> pgrefill 1740111896
> pgreuse 22626572
> pgsteal_kswapd 153796237
> pgsteal_direct 1999
> pgdemote_kswapd 153796237
> pgdemote_direct 1999
> pgscan_kswapd 2055504891
> pgscan_direct 1999
> pgscan_direct_throttle 0
> pgscan_anon 2055356614
> pgscan_file 150276
> pgsteal_anon 153798203
> pgsteal_file 33
> zone_reclaim_failed 0
> pginodesteal 0
> slabs_scanned 82761
> kswapd_inodesteal 0
> kswapd_low_wmark_hit_quickly 2960
> kswapd_high_wmark_hit_quickly 17732
> pageoutrun 21583
> pgrotated 0
> drop_pagecache 0
> drop_slab 0
> oom_kill 0
> numa_pte_updates 515994024
> numa_huge_pte_updates 154
> numa_hint_faults 498301236
> numa_hint_faults_local 121109067
> numa_pages_migrated 152650705
> pgmigrate_success 307213704
> pgmigrate_fail 39
> thp_migration_success 93
> thp_migration_fail 0
> thp_migration_split 0
>
> perf-profile:
>
> kswapd.kthread.ret_from_fork: 2.86
> balance_pgdat.kswapd.kthread.ret_from_fork: 2.86
> shrink_node.balance_pgdat.kswapd.kthread.ret_from_fork: 2.85
> shrink_lruvec.shrink_node.balance_pgdat.kswapd.kthread: 2.76
> shrink_inactive_list.shrink_lruvec.shrink_node.balance_pgdat.kswapd: 1.9
> shrink_page_list.shrink_inactive_list.shrink_lruvec.shrink_node.balance_pgdat: 1.52
> shrink_active_list.shrink_lruvec.shrink_node.balance_pgdat.kswapd: 0.85
> migrate_pages.shrink_page_list.shrink_inactive_list.shrink_lruvec.shrink_node: 0.79
> page_referenced.shrink_page_list.shrink_inactive_list.shrink_lruvec.shrink_node: 0.54
>
>
> MGLRU enabled via: echo -n 7 > /sys/kernel/mm/lru_gen/enabled
> -------------------------------------------------------------
>
> /proc/vmstat:
>
> pgactivate 47212585
> pgdeactivate 0
> pglazyfree 0
> pgfault 580056521
> pgmajfault 0
> pglazyfreed 0
> pgrefill 6911868880
> pgreuse 25108929
> pgsteal_kswapd 32701609
> pgsteal_direct 0
> pgdemote_kswapd 32701609
> pgdemote_direct 0
> pgscan_kswapd 83582770
> pgscan_direct 0
> pgscan_direct_throttle 0
> pgscan_anon 83549777
> pgscan_file 32993
> pgsteal_anon 32701576
> pgsteal_file 33
> zone_reclaim_failed 0
> pginodesteal 0
> slabs_scanned 84829
> kswapd_inodesteal 0
> kswapd_low_wmark_hit_quickly 313
> kswapd_high_wmark_hit_quickly 5262
> pageoutrun 5895
> pgrotated 0
> drop_pagecache 0
> drop_slab 0
> oom_kill 0
> numa_pte_updates 512084786
> numa_huge_pte_updates 198
> numa_hint_faults 494583387
> numa_hint_faults_local 129411334
> numa_pages_migrated 34165992
> pgmigrate_success 67833977
> pgmigrate_fail 7
> thp_migration_success 135
> thp_migration_fail 0
> thp_migration_split 0
>
> perf-profile:
>
> kswapd.kthread.ret_from_fork: 2.86
> balance_pgdat.kswapd.kthread.ret_from_fork: 2.86
> lru_gen_age_node.balance_pgdat.kswapd.kthread.ret_from_fork: 1.97
> walk_page_range.try_to_inc_max_seq.lru_gen_age_node.balance_pgdat.kswapd: 1.97
> shrink_node.balance_pgdat.kswapd.kthread.ret_from_fork: 0.89
> evict_folios.lru_gen_shrink_lruvec.shrink_lruvec.shrink_node.balance_pgdat: 0.89
> scan_folios.evict_folios.lru_gen_shrink_lruvec.shrink_lruvec.shrink_node: 0.66
>
> Best Regards,
> Huang, Ying
>
> [snip]
>
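
For anyone who wants to redo the arithmetic behind the ratios discussed
above, here is a minimal userspace sketch in C. It is an illustration
only: the struct and helper names (vmstat_sample, demote_per_scan,
times_larger) are hypothetical and not kernel APIs, and the counter
values are simply copied from the two /proc/vmstat dumps quoted above.

```c
#include <stdint.h>

/* Hypothetical container for the three counters used below. */
struct vmstat_sample {
	uint64_t pgscan_kswapd;
	uint64_t pgdemote_kswapd;
	uint64_t numa_pages_migrated;
};

/* Values copied from the quoted dump taken with MGLRU disabled. */
static const struct vmstat_sample mglru_off = {
	.pgscan_kswapd       = 2055504891ULL,
	.pgdemote_kswapd     = 153796237ULL,
	.numa_pages_migrated = 152650705ULL,
};

/* Values copied from the quoted dump taken with MGLRU enabled. */
static const struct vmstat_sample mglru_on = {
	.pgscan_kswapd       = 83582770ULL,
	.pgdemote_kswapd     = 32701609ULL,
	.numa_pages_migrated = 34165992ULL,
};

/* Fraction of the pages scanned by kswapd that ended up demoted. */
static double demote_per_scan(const struct vmstat_sample *s)
{
	return (double)s->pgdemote_kswapd / (double)s->pgscan_kswapd;
}

/* How many times larger a is than b. */
static double times_larger(double a, double b)
{
	return a / b;
}
```

With these numbers, demote_per_scan() comes out at roughly 0.075 with
MGLRU disabled and roughly 0.391 with it enabled, so
times_larger(on, off) is about 5.2, in line with the figure quoted
above. Applying times_larger() to numa_pages_migrated (disabled over
enabled) gives about 4.5, which is the drop in promotion work mentioned
earlier.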