Date: Fri, 14 Feb 2020 11:45:41 +0100
From: Michal Hocko
To: Tim Chen
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Dave Hansen,
 Dan Williams, Huang Ying
Subject: Re: [Lsf-pc] [LSF/MM TOPIC] Memory cgroups, whether you like it or not
Message-ID: <20200214104541.GT31689@dhcp22.suse.cz>

On Wed 05-02-20 10:34:57, Tim Chen wrote:
> Topic: Memory cgroups, whether you like it or not
>
> 1. Memory cgroup counters scalability
>
> Recently, benchmark teams at Intel were running some bare-metal
> benchmarks. To our great surprise, we saw lots of memcg activity in
> the profiles. When we asked the benchmark team, they did not even
> realize they were using memory cgroups. They were fond of running all
> their benchmarks in containers that just happened to use memory cgroups
> by default. What were previously problems only for memory cgroup users
> are quickly becoming a problem for everyone.
>
> There are mem cgroup counters that are read in page management paths
> which scale poorly when read. These counters are per-cpu based and
> need to be summed over all CPUs to get the overall value for the mem
> cgroup in the lruvec_page_state_local() function. This led to
> scalability problems on systems with large numbers of CPUs. For
> example, we've seen 14+% kernel CPU time consumed in
> snapshot_refaults(). We have also encountered a similar issue recently
> when computing the lru_size[1].
>
> We would like to do some brainstorming to see if there are ways to make
> such accounting more scalable. For example, not all usages of such
> counters need precise counts, and some approximate counts that are
> updated lazily can be used.

Please make sure to prepare numbers based on the current upstream kernel
so that we have some grounds to base the discussion on. Ideally post
them into the email.

> [1] https://lore.kernel.org/linux-mm/a64eecf1-81d4-371f-ff6d-1cb057bd091c@linux.intel.com/
>
> 2. Tiered memory accounting and management
>
> Traditionally, all RAM is DRAM. Some DRAM might be closer/faster than
> others, but a byte of media has about the same cost whether it is close
> or far. With new memory tiers such as High-Bandwidth Memory or
> Persistent Memory, however, there is a choice between fast/expensive
> and slow/cheap. The current memory cgroups still live in the old
> model: there is only one set of limits, which implies that all memory
> has the same cost. We would like to extend memory cgroups to
> comprehend different memory tiers and give users a way to choose a mix
> between fast/expensive and slow/cheap.
>
> We would like to propose that, for systems with multiple memory tiers,
> we add per memory cgroup accounting of each cgroup's usage of top tier
> memory. Such top tier memory is a precious resource for which it makes
> sense to impose soft limits. We can then start to actively demote the
> top tier memory used by cgroups that exceed their allowance when the
> system experiences memory pressure in the top tier memory.
A similar topic was discussed last year. See https://lwn.net/Articles/787326/

> There is existing infrastructure for memory soft limit per cgroup that
> we can leverage to implement such a scheme. We would like to find out
> if this approach makes sense for people working on systems with
> multiple memory tiers.

Soft limit is dead.

-- 
Michal Hocko
SUSE Labs
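
As a rough, self-contained illustration of the tradeoff raised under
topic 1 above (an exact per-CPU counter that sums every CPU's slot on
each read versus a lazily folded, slightly stale counter), here is a
minimal userspace C sketch. It is not the memcg or vmstat
implementation; NR_CPUS, STAT_BATCH and the function names below are
illustrative placeholders only.

/*
 * Minimal userspace sketch (not the memcg code) contrasting two counter
 * schemes: an "exact" per-CPU counter whose read sums every CPU's slot,
 * and a batched "approximate" counter where each CPU folds its local
 * delta into a shared atomic once it exceeds a threshold, so reads are
 * O(1) but may be slightly stale.
 */
#include <stdatomic.h>
#include <stdio.h>

#define NR_CPUS    64   /* illustrative value, not tied to any real machine */
#define STAT_BATCH 32   /* max per-CPU drift before folding into the total */

struct pcpu_counter {
	long local[NR_CPUS];    /* per-CPU slots, updated without shared contention */
	atomic_long total;      /* lazily folded global value (approximate scheme) */
};

/* Exact scheme: a write touches only the caller's slot... */
static void exact_add(struct pcpu_counter *c, int cpu, long delta)
{
	c->local[cpu] += delta;
}

/* ...but every read walks all CPUs, which is what hurts on large machines. */
static long exact_read(const struct pcpu_counter *c)
{
	long sum = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += c->local[cpu];
	return sum;
}

/*
 * Approximate scheme: fold into the shared total only when the local
 * delta exceeds STAT_BATCH, bounding both contention and read error.
 */
static void approx_add(struct pcpu_counter *c, int cpu, long delta)
{
	long x = c->local[cpu] + delta;

	if (x > STAT_BATCH || x < -STAT_BATCH) {
		atomic_fetch_add(&c->total, x);
		x = 0;
	}
	c->local[cpu] = x;
}

/* O(1) read; can lag the true value by at most NR_CPUS * STAT_BATCH. */
static long approx_read(struct pcpu_counter *c)
{
	return atomic_load(&c->total);
}

int main(void)
{
	struct pcpu_counter exact = { { 0 } };
	struct pcpu_counter approx = { { 0 } };

	/* Simulate 100 single-page updates on every CPU. */
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		for (int i = 0; i < 100; i++) {
			exact_add(&exact, cpu, 1);
			approx_add(&approx, cpu, 1);
		}

	printf("exact read:       %ld\n", exact_read(&exact));
	printf("approximate read: %ld (bounded staleness)\n", approx_read(&approx));
	return 0;
}

Compiled with any C11 compiler, the exact read walks all NR_CPUS slots
while the approximate read is a single atomic load that may lag by at
most NR_CPUS * STAT_BATCH updates; whether that bounded error is
acceptable for readers like snapshot_refaults() is essentially the
question the topic raises.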