From: Shakeel Butt
Date: Tue, 30 Mar 2021 17:28:06 -0700
Subject: Re: [RFC PATCH 00/15] Use obj_cgroup APIs to charge the LRU pages
To: Johannes Weiner
Cc: Muchun Song, Greg Thelen, Roman Gushchin, Michal Hocko, Andrew Morton,
 Vladimir Davydov, LKML, Linux MM, Xiongchun duan
References: <20210330101531.82752-1-songmuchun@bytedance.com>

On Tue, Mar 30, 2021 at 2:10 PM Johannes Weiner wrote:
> [...]
> > The main concern I have with *just* reparenting LRU pages is that for
> > the long running systems, the root memcg will become a dumping ground.
> > In addition a job running multiple times on a machine will see
> > inconsistent memory usage if it re-accesses the file pages which were
> > reparented to the root memcg.
>
> I don't understand how Muchun's patches are supposed to *change* the
> behavior the way you are describing it. This IS today's behavior.
>
> We have hierarchical accounting, and a page that belongs to a leaf
> cgroup will automatically belong to all its parents.
>
> Further, if you delete a cgroup today, the abandoned cache will stay
> physically linked to that cgroup, but that zombie group no longer acts
> as a control structure: it imposes no limit and no protection; the
> pages will be reclaimed as if it WERE linked to the parent.
>
> For all intents and purposes, when you delete a cgroup today, its
> remaining pages ARE dumped onto the parent.
>
> The only difference is that today they pointlessly pin the leaf cgroup
> as a holding vessel - which is then round-robin'd from the parent
> during reclaim in order to pretend that all these child pages actually
> ARE linked to the parent's LRU list.
>
> Remember how we used to have every page physically linked to multiple
> lrus? The leaf cgroup and the root?
>
> All pages always belong to the (virtual) LRU list of all ancestor
> cgroups. The only thing Muchun changes is that they no longer pin a
> cgroup that has no semantical meaning anymore (because it's neither
> visible to the user nor exerts any control over the pages anymore).

Indeed you are right. Even if the physical representation of the tree
has changed, the logical picture remains the same.

[Subconsciously I was sad that we will lose the information about the
origin memcg of the page for debugging purposes, but then I thought that
if we really need it, we can just add that metadata to the obj_cgroup
object. So, never mind.]

> Maybe I'm missing something that either you or Roman can explain to
> me. But this series looks like a (rare) pure win.
>
> Whether you like the current semantics is a separate discussion IMO.
>
> > Please note that I do agree with the mentioned problem and we do see
> > this issue in our fleet. Internally we have a "memcg mount option"
> > feature which couples a file system with a memcg, and all file pages
> > allocated on that file system will be charged to that memcg. Multiple
> > instances (concurrent or subsequent) of the job will use that file
> > system (with a dedicated memcg) without leaving the zombies behind. I
> > am not pushing for this solution as it comes with its own intricacies
> > (e.g. if a memcg coupled with a file system has a limit, the oom
> > behavior would be awkward, and therefore internally we don't put a
> > limit on such memcgs). Though I want this to be part of the discussion.
>
> Right, you disconnect memory from the tasks that are allocating it,
> and so you can't assign culpability when you need to.
>
> OOM is one thing, but there are also CPU cycles and IO bandwidth
> consumed during reclaim.

We didn't really have any issue regarding CPU or IO, but that might be
due to our unique setup (i.e. no local disk).

> > I think the underlying reasons behind this issue are:
> >
> > 1) Filesystem shared by disjoint jobs.
> > 2) For job-dedicated filesystems, the lifetime of the filesystem is
> > different from the lifetime of the job.
>
> There is also the case of deleting a cgroup just to recreate it right
> after for the same job. Many job managers do this on restart right now
> - like systemd, and what we're using in our fleet. This seems
> avoidable by recycling a group for another instance of the same job.

I was bundling the scenario you mentioned with (2), i.e. the filesystem
persists across multiple subsequent instances of the same job.

> Sharing is a more difficult discussion. If you access a page that you
> share with another cgroup, it may or may not be subject to your own or
> your buddy's memory limits. The only limit it is guaranteed to be
> subjected to is that of your parent. So one thing I could imagine is,
> instead of having a separate cgroup outside the hierarchy, we would
> reparent live pages the second they are accessed from a foreign
> cgroup. And reparent them until you reach the first common ancestor.
>
> This way, when you mount a filesystem shared by two jobs, you can put
> them into a joint subtree, and the root level of this subtree captures
> all the memory (as well as the reclaim CPU and IO) used by the two
> jobs - the private portions and the shared portions - and doesn't make
> them the liability of jobs in the system that DON'T share the same fs.

I will give more thought to this idea and see where it goes.

> But again, this is a useful discussion to have, but I don't quite see
> why it's relevant to Muchun's patches. They're purely an optimization.
>
> So I'd like to clear that up first before going further.

I think we are on the same page, i.e. these patches change the physical
representation of the memcg tree, but logically it remains the same, and
they fix the zombie memcg issue.
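
To make the "physical vs. logical representation" point concrete, here is a
rough sketch of the obj_cgroup indirection being discussed. This is not the
actual code from the series; the struct layouts, the origin_memcg field (the
debugging metadata idea mentioned earlier in this mail), and reparent_objcg()
are purely illustrative.

/*
 * Simplified illustration only. The idea: an LRU page is charged through
 * an obj_cgroup handle instead of holding a direct pointer to its memcg.
 * Offlining a memcg then only redirects the handle to the parent, so the
 * dead memcg is no longer pinned by every page that was charged to it.
 */

struct mem_cgroup;

struct obj_cgroup {
	struct mem_cgroup *memcg;        /* current owner, updated on reparent */
	struct mem_cgroup *origin_memcg; /* hypothetical debugging aid: original charger */
};

struct charged_page {
	struct obj_cgroup *objcg;        /* the page points at the handle, not the memcg */
};

/* On offlining a memcg, move its obj_cgroup handle(s) under @parent. */
static void reparent_objcg(struct obj_cgroup *objcg, struct mem_cgroup *parent)
{
	/* Every page pointing at @objcg now effectively belongs to @parent. */
	objcg->memcg = parent;
}

In reality the reparenting has to synchronize with concurrent charge and
uncharge paths (reference counting, RCU, locking), which the sketch above
deliberately ignores.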