Date: Wed, 20 Oct 2021 15:02:42 -0400
From: "Theodore Ts'o" <tytso@mit.edu>
To: Michal Hocko
Cc: Mina Almasry, Roman Gushchin, Shakeel Butt, Greg Thelen,
	Johannes Weiner, Hugh Dickins, Tejun Heo, Linux-MM,
	"open list:FILESYSTEMS (VFS and infrastructure)",
	cgroups@vger.kernel.org, riel@surriel.com
Subject: Re: [RFC Proposal] Deterministic memcg charging for shared memory

On Wed, Oct 20, 2021 at 11:09:07AM +0200, Michal Hocko wrote:
> > 3. We would need to extend this functionality to other file systems of
> > persistent disk, then mount that file system with 'memcg=<path to
> > shared library memcg>'. Jobs can then use the shared library and any
> > memory allocated due to loading the shared library is charged to a
> > dedicated memcg, and not charged to the job using the shared library.
>
> This is more of a question for fs people. My understanding is rather
> limited so I cannot even imagine all the possible setups but just from
> a very high level understanding bind mounts can get really interesting.
> Can those disagree on the memcg?
>
> I am pretty sure I didn't get to think through this very deeply, my gut
> feeling tells me that this will open many interesting questions and I am
> not sure whether it solves more problems than it introduces at this moment.
> I would be really curious what others think about this.

My understanding of the proposal is that the mount option would be on
the superblock, and would not be a per-bind-mount option, a la the ro
mount option.  In other words, the designation of the target memcg to
which all tmpfs files would be charged would be something stored in
the struct super.

I'm also going to assume that the only thing that gets charged is
memory for files that are backed on the tmpfs.  So for example, if
there is a MAP_PRIVATE mapping, the base page would have been charged
to the target memcg when the file was originally created.  However,
if the process tries to modify a private mapping, the page allocated
on the copy-on-write would get charged to the process's memcg, and
not to the tmpfs's target memcg.

If we make these simplifying assumptions, then it should be fairly
simple.  Essentially, the model is that whenever we do the file
system equivalent of "block allocation", the tmpfs file system
charges all of the pages allocated for that file system to the target
memcg.  That's pretty straightforward, and is pretty easy to model
and anticipate.

In fact, if the only use case were #3 (shared libraries and language
runtimes), this workload could be accommodated without needing any
kernel changes.  This could be done by simply having the setup
process run in the "target memcg", and having it copy all of the
shared libraries and runtime files into the tmpfs at setup time.
Each file's memory would then be charged to the memcg which first
allocated it, and that would be the setup memcg.  When the Kubernetes
containers that use these shared libraries and language runtimes map
those pages read-only into their task processes, the pages won't get
charged to the task containers, since those tmpfs pages were already
charged to the setup memcg.

And I *do* believe that it's much easier to anticipate how much
memory will be used by these shared files.  With this scheme, we no
longer need to give each task container enough memory quota to cover
the case where it happens to be the first container to start running,
and so gets charged for all of the shared memory while all of the
other containers freeload off it.  Today, every container's memcg has
to be sized for the chance that *it* is the first one launched.

Cheers,

					- Ted
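
P.S.  To make the setup-time workaround concrete, here is a rough
sketch of what the setup process might look like.  It assumes cgroup
v2 and a tmpfs that is already mounted; the memcg path, source
directory, and mountpoint are all made up for the example.  Note that
this needs no new kernel support: the pages get charged to the setup
memcg simply because that memcg is the first to allocate them.

    /* setup.c: populate the shared tmpfs from inside the dedicated
     * memcg, so that every tmpfs page allocated for these files is
     * charged to that memcg rather than to whichever task container
     * happens to touch the files first. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
            /* cgroup v2: move this process into the dedicated memcg */
            FILE *f = fopen("/sys/fs/cgroup/shared-libs/cgroup.procs",
                            "w");

            if (!f || fprintf(f, "%d\n", getpid()) < 0 ||
                fclose(f) == EOF) {
                    perror("cgroup.procs");
                    return 1;
            }

            /* Every tmpfs page allocated by this copy is charged to
             * the shared-libs memcg. */
            if (system("cp -a /opt/shared-runtimes/. /mnt/shared-tmpfs/"))
                    return 1;
            return 0;
    }

And the MAP_PRIVATE charging split I described above would look
roughly like this from a task container's point of view (the file
name is again illustrative):

    #include <fcntl.h>
    #include <sys/mman.h>

    int main(void)
    {
            int fd = open("/mnt/shared-tmpfs/libfoo.so", O_RDONLY);
            if (fd < 0)
                    return 1;

            /* PROT_WRITE on a read-only fd is fine for MAP_PRIVATE */
            char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE, fd, 0);
            if (p == MAP_FAILED)
                    return 1;

            char c = p[0];  /* read: uses the shared tmpfs page,
                             * already charged to the target memcg */
            p[0] = c + 1;   /* write: forces a private copy-on-write
                             * page, charged to *this* process's
                             * memcg */
            return 0;
    }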