Date: Mon, 15 Mar 2021 22:45:18 -0600
From: Yu Zhao
To: "Huang, Ying"
Cc: linux-mm@kvack.org, Alex Shi, Andrew Morton, Dave Hansen, Hillf Danton,
	Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Mel Gorman, Michal Hocko,
	Roman Gushchin, Vlastimil Babka, Wei Yang, Yang Shi,
	linux-kernel@vger.kernel.org, page-reclaim@google.com
Subject: Re: [PATCH v1 10/14] mm: multigenerational lru: core
References: <87im5rsvd8.fsf@yhuang6-desk1.ccr.corp.intel.com>
In-Reply-To: <87im5rsvd8.fsf@yhuang6-desk1.ccr.corp.intel.com>

On Tue, Mar 16, 2021 at 10:08:51AM +0800, Huang, Ying wrote:
> Yu Zhao writes:
>
> [snip]
>
> > +/* Main function used by foreground, background and user-triggered aging. */
> > +static bool walk_mm_list(struct lruvec *lruvec, unsigned long next_seq,
> > +			 struct scan_control *sc, int swappiness)
> > +{
> > +	bool last;
> > +	struct mm_struct *mm = NULL;
> > +	int nid = lruvec_pgdat(lruvec)->node_id;
> > +	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> > +	struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
> > +
> > +	VM_BUG_ON(next_seq > READ_ONCE(lruvec->evictable.max_seq));
> > +
> > +	/*
> > +	 * For each walk of the mm list of a memcg, we decrement the priority
> > +	 * of its lruvec. For each walk of memcgs in kswapd, we increment the
> > +	 * priorities of all lruvecs.
> > +	 *
> > +	 * So if this lruvec has a higher priority (smaller value), it means
> > +	 * other concurrent reclaimers (global or memcg reclaim) have walked
> > +	 * its mm list.
> > +	 * Skip it for this priority to balance the pressure on
> > +	 * all memcgs.
> > +	 */
> > +#ifdef CONFIG_MEMCG
> > +	if (!mem_cgroup_disabled() && !cgroup_reclaim(sc) &&
> > +	    sc->priority > atomic_read(&lruvec->evictable.priority))
> > +		return false;
> > +#endif
> > +
> > +	do {
> > +		last = get_next_mm(lruvec, next_seq, swappiness, &mm);
> > +		if (mm)
> > +			walk_mm(lruvec, mm, swappiness);
> > +
> > +		cond_resched();
> > +	} while (mm);
>
> It appears that we need to scan the whole address space of multiple
> processes in this loop?
>
> If so, I have some concerns about the duration of the function. Do you
> have some number of the distribution of the duration of the function?
> And may be the number of mm_struct and the number of pages scanned.
>
> In comparison, in the traditional LRU algorithm, for each round, only a
> small subset of the whole physical memory is scanned.

Reasonable concerns, and insightful too. We are sensitive to direct
reclaim latency, and we tuned another path carefully so that direct
reclaims virtually don't hit this path :)

Some numbers from the cover letter first:
  In addition, direct reclaim latency is reduced by 22% at 99th
  percentile and the number of refaults is reduced 7%. These metrics are
  important to phones and laptops as they are correlated to user
  experience.

And "another path" is the background aging in kswapd:
  age_active_anon()
    age_lru_gens()
      try_walk_mm_list()
        /* try to spread pages out across spread+1 generations */
        if (old_and_young[0] >= old_and_young[1] * spread &&
            min_nr_gens(max_seq, min_seq, swappiness) > max(spread, MIN_NR_GENS))
                return;

        walk_mm_list(lruvec, max_seq, sc, swappiness);

By default, spread = 2, which makes kswapd slightly more aggressive
than direct reclaim for our use cases. This can be entirely disabled
by setting spread to 0, for workloads that don't care about direct
reclaim latency, or set to larger values, for workloads that are more
sensitive to it than ours.
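
To make the condition in try_walk_mm_list() easier to play with, here is a
rough userspace model of it. This is only a sketch, not the patch code: the
function name should_walk_mm_list() and its flat parameters are made up for
illustration, "old" and "young" stand in for old_and_young[0] and
old_and_young[1] (as the array name suggests), nr_gens stands in for the
result of min_nr_gens(), and the value of MIN_NR_GENS is assumed.

```c
#include <stdbool.h>

/* Assumed value, for illustration only; the real constant is in the patch. */
#define MIN_NR_GENS 2UL

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

/*
 * Hypothetical model of the early return in try_walk_mm_list().
 *
 * Returns false when the background walk would be skipped: old pages
 * already outnumber young pages by a factor of at least "spread" and
 * there are more generations than max(spread, MIN_NR_GENS).
 * Returns true when the walk would proceed.
 */
static bool should_walk_mm_list(unsigned long old, unsigned long young,
				unsigned long spread, unsigned long nr_gens)
{
	if (old >= young * spread && nr_gens > max_ul(spread, MIN_NR_GENS))
		return false;

	return true;
}
```

In this model, should_walk_mm_list(100, 10, 2, 4) is false because the pages
are already spread out, while should_walk_mm_list(10, 10, 2, 4) is true. With
spread = 0 the first comparison always holds, so the walk is skipped whenever
there are more than MIN_NR_GENS generations.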

It's worth noting that walk_mm_list() is multithreaded -- reclaiming
threads can work on different mm_structs on the same list
concurrently. We do occasionally see this function in direct reclaims
on over-overcommitted systems, i.e., when kswapd CPU usage is 100%.
Under the same condition, we saw the current page reclaim livelock and
trigger hardware watchdog timeouts (our hardware watchdog is set to
2 hours) many times.
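
The idea of several reclaiming threads sharing one list can be sketched in
userspace with pthreads as below. Everything here is a stand-in chosen for
illustration and not how the patch implements it: struct mm_list,
claim_next_mm(), walker() and walk_concurrently() are hypothetical names, a
mutex-protected cursor replaces whatever get_next_mm() does internally, and
bumping a counter replaces walk_mm().

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_MMS		64
#define MAX_WALKERS	16

/* Hypothetical stand-in for a per-memcg mm list; not the kernel structure. */
struct mm_list {
	pthread_mutex_t lock;
	int nr_mms;		/* number of entries on the list */
	int next;		/* index of the next unclaimed entry */
	int walked[MAX_MMS];	/* how many times each entry was walked */
};

/* Claim the next unclaimed entry; returns -1 once the list is exhausted. */
static int claim_next_mm(struct mm_list *list)
{
	int idx = -1;

	pthread_mutex_lock(&list->lock);
	if (list->next < list->nr_mms)
		idx = list->next++;
	pthread_mutex_unlock(&list->lock);

	return idx;
}

/* One reclaiming thread: claim and walk entries until none are left. */
static void *walker(void *arg)
{
	struct mm_list *list = arg;
	int idx;

	/* Only the thread that claimed idx ever writes walked[idx]. */
	while ((idx = claim_next_mm(list)) >= 0)
		list->walked[idx]++;	/* stand-in for walk_mm() */

	return NULL;
}

/*
 * Run nr_threads walkers over a list of nr_mms entries. Returns true if
 * every entry was walked exactly once, false on bad arguments, if no
 * thread could be started, or if any entry was missed or walked twice.
 */
static bool walk_concurrently(int nr_threads, int nr_mms)
{
	struct mm_list list = { .nr_mms = nr_mms };
	pthread_t threads[MAX_WALKERS];
	int started = 0;

	if (nr_threads < 1 || nr_threads > MAX_WALKERS ||
	    nr_mms < 0 || nr_mms > MAX_MMS)
		return false;

	pthread_mutex_init(&list.lock, NULL);

	for (int i = 0; i < nr_threads; i++) {
		if (pthread_create(&threads[i], NULL, walker, &list) != 0)
			break;
		started++;
	}

	/* Joining makes every walker's writes visible to the checks below. */
	for (int i = 0; i < started; i++)
		pthread_join(threads[i], NULL);

	pthread_mutex_destroy(&list.lock);

	if (started == 0)
		return false;

	for (int i = 0; i < nr_mms; i++)
		if (list.walked[i] != 1)
			return false;

	return true;
}
```

For example, walk_concurrently(4, 32) has four threads split 32 entries among
themselves, and no entry is walked twice because each one is claimed under
the lock before it is touched. Build with -pthread where the C library
requires it.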