From: Chengguang Xu <cgxu519@mykernel.net>
To: "Amir Goldstein" <amir73il@gmail.com>
Cc: "linux-unionfs" <linux-unionfs@vger.kernel.org>,
	"miklos" <miklos@szeredi.hu>
Subject: Re: Inode limitation for overlayfs
Date: Sun, 29 Mar 2020 22:19:02 +0800
Message-ID: <17126a9038c.d17770b728105.8827100903005997785@mykernel.net>
In-Reply-To: <CAOQ4uxj-6upXZkAKVDocuLSwveO8hb5_pdbd=_0zbRx2UD9gsg@mail.gmail.com>

 ---- On Fri, 2020-03-27 17:45:37 Amir Goldstein <amir73il@gmail.com> wrote ----
 > On Fri, Mar 27, 2020 at 8:18 AM Chengguang Xu <cgxu519@mykernel.net> wrote:
 > >
 > >  ---- On Thu, 2020-03-26 15:34:13 Amir Goldstein <amir73il@gmail.com> wrote ----
 > >  > On Thu, Mar 26, 2020 at 7:45 AM Chengguang Xu <cgxu519@mykernel.net> wrote:
 > >  > >
 > >  > > Hello,
 > >  > >
 > >  > > In container use cases, in order to prevent particular containers from exhausting inodes on the host file system, we would like to add an inode limit for containers.
 > >  > > However, the current solution for limiting inodes is based on project quota in the underlying filesystem, so it also counts deleted files (whiteouts, i.e. char-type files) in overlay's upper layer.
 > >  > > Even worse, users may delete some lower-layer files to get more free inodes, but the result will be the opposite (consuming more inodes, since every deletion creates a whiteout).
 > >  > >
 > >  > > This is somewhat different compared to the disk size limit for overlayfs, so I think maybe we can add a limit option just for new files in overlayfs. What do you think?
 > 
 > You are saying above that the goal is to prevent inode exhaustion on the
 > host file system, but you want to allow containers to modify and delete an
 > unlimited number of lower files, thus allowing inode exhaustion. I don't
 > see the logic in that.
 > 

End users do not understand kernel internals very well, so we want to minimize
the differences in the user experience inside a container as much as possible.
From our point of view, consuming more inodes by deleting lower files is an
inherent feature of overlayfs; it is not caused by abnormal user behavior.
However, we do have to limit a malicious user program that endlessly creates
new files until the host's inodes are exhausted.


 > Even if we only count new files and present this information on df -i
 > how would users be able to free up inodes when they hit the limit?
 > How would they know which inodes to delete?
 > 
 > >  >
 > >  > The questions are where do we store the accounting and how do we maintain them.
 > >  > An answer to those questions could be - in the inode index:
 > >  >
 > >  > Currently, with nfs_export=on, there is already an index dir containing:
 > >  > - 1 hardlink per copied up non-dir inode
 > >  > - 1 directory per copied-up directory
 > >  > - 1 whiteout per whiteout in upperdir (not a hardlink)
 > >  >
 > >
 > > Hi Amir,
 > >
 > > Thanks for the quick response and detailed information.
 > >
 > > I think the simplest way is just to store the accounting info in memory (maybe in s_fs_info).
 > > At first I thought of doing this only for the container use case; for containers it would be
 > > enough, because the upper layer is always empty when the container starts and is destroyed
 > > when the container ends.
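
For example, a minimal sketch of what I have in mind (struct ovl_fs is
real, but the two fields and the helper below are hypothetical, not
existing overlayfs code):

/* Hypothetical per-mount accounting kept in s_fs_info. */
struct ovl_fs {
        /* ... existing fields ... */
        long inode_limit;               /* 0 == unlimited */
        atomic_long_t nr_new_inodes;    /* files created via overlayfs */
};

/* Called before creating a new object in upperdir. */
static int ovl_reserve_inode(struct super_block *sb)
{
        struct ovl_fs *ofs = sb->s_fs_info;

        if (!ofs->inode_limit)
                return 0;
        if (atomic_long_inc_return(&ofs->nr_new_inodes) > ofs->inode_limit) {
                atomic_long_dec(&ofs->nr_new_inodes);
                return -ENOSPC;
        }
        return 0;
}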
 > 
 > That is not a concept that overlayfs is currently aware of.
 > *If* the concept is acceptable and you do implement a feature intended for this
 > special use case, you should verify at mount time that upperdir is empty.
 > 
 > >
 > > Adding meta info to the index dir is a better solution for the general use case, but it seems
 > > more complicated, and I'm not sure whether other use cases are affected by this problem.
 > > Suggestions?
 > 
 > docker already supports container storage quota using project quotas
 > on upperdir (I implemented it).
 > Seems like a very natural extension to also limit no. of inodes.
 > The problem, as you wrote above, is that project quotas
 > "also count deleted files (whiteouts, i.e. char-type files) in overlay's upper layer."
 > My suggestion to you was a way to account for the whiteouts separately,
 > so you can deduct them from the total inode count.
 > If you are saying my suggestion is complicated, perhaps you did not
 > understand it.
 > 

I think the key point here is the inode count of whiteouts. I would like to
propose sharing the same inode among different whiteout files, so that we can
significantly reduce the number of inodes consumed by whiteouts. After that,
I think we can implement a normal per-container inode limit, just like the
block limit.
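
As a rough illustration of the idea (a sketch only; it assumes a per-mount
"master" whiteout that later whiteouts are hardlinked to, and the
ofs->whiteout_master field does not exist today):

/*
 * Sketch of whiteout inode sharing: instead of creating a fresh 0:0
 * char device for every deleted file, keep one long-lived whiteout
 * per overlay mount and hardlink new whiteouts to it, so all
 * whiteouts share a single inode.
 */
static int ovl_create_shared_whiteout(struct ovl_fs *ofs,
                                      struct inode *dir,
                                      struct dentry *dentry)
{
        if (!ofs->whiteout_master) {
                /* First whiteout on this mount: create a real one. */
                int err = ovl_do_whiteout(dir, dentry);

                if (err)
                        return err;
                ofs->whiteout_master = dget(dentry);
                return 0;
        }
        /* Later whiteouts just add a link to the shared inode. */
        return ovl_do_link(ofs->whiteout_master, dir, dentry);
}

Of course, a real implementation would need a fallback to creating a fresh
whiteout when the underlying filesystem's maximum link count is reached.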


 > >
 > >
 > >  > We can also make this behavior independent of nfs_export feature.
 > >  > In the past, I proposed the option index=all for this behavior.
 > >  >
 > >  > On mount, in ovl_indexdir_cleanup(), the index entries for file/dir/whiteout
 > >  > can be counted and then maintained on index add/remove.
 > >  >
 > >  > Now if you combine that with project quotas on upper/work dir, you get:
 > >  > <Total upper/work inodes> = <pure upper inodes> +
 > >  >                             <non-dir index count> +
 > >  >                             2 * <dir index count> +
 > >  >                             2 * <whiteout index count>
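
To make the arithmetic concrete: with 100 pure upper inodes, 50 non-dir index
entries, 10 dir index entries and 30 whiteout index entries, project quota
would report 100 + 50 + 2*10 + 2*30 = 230 inodes, and since overlayfs knows
the three index counts, the 100 pure upper inodes can be derived from the
quota total.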
 > >
 > > I'm not clear about the exact relationship between those indexes and nfs_export,
 > 
 > nfs_export feature requires index_all, but we did not have a reason (yet) to
 > add an option to enable index_all without enabling nfs_export:
 > 
 > /* Index all files on copy up. For now only enabled for NFS export */
 > bool ovl_index_all(struct super_block *sb)
 > {
 >         struct ovl_fs *ofs = sb->s_fs_info;
 > 
 >         return ofs->config.nfs_export && ofs->config.index;
 > }
 > 
 > > but if possible I would prefer separate switches for each index function and a master
 > > switch (index=all) to enable all the index functions at the same time.
 > >
 > 
 > FYI, index_all stands for "index all modified/deleted lower files (and dirs)"
 > At the moment, the only limitation of nfs_export=on that could be relaxed with
 > index=all is that nfs_export=on is mutually exclusive with metacopy=on.
 > index=all will not have this limitation.
 > 
 > >  >
 > >  > Assuming that you know the total from project quotas and the index counts
 > >  > from overlayfs, you can calculate total pure upper.
 > >  >
 > >  > Now you *can* implement an upper inode quota within overlayfs, but you
 > >  > can also do that without changing overlayfs at all, assuming you can
 > >  > allow some slack in quota enforcement -
 > >  > periodically scan the index dir and adjust project quota limits.
 > >
 > > Dynamically changing the inode limit looks too complicated to implement in the management
 > > system, and having different quota limits during the lifetime of the same container may
 > > confuse sysadmins. So I still hope to solve this problem at the overlayfs layer.
 > >
 > 
 > To me that sounds like shifting complexity from the system to the kernel
 > for not a good enough reason, and losing flexibility.
 > You are proposing a heuristic solution anyway, because it is inherently
 > not immune to DoS by a malicious container that does rm -rf *.
 > So to me it makes more sense to deal with that logic at the container
 > management level, where more heuristics can be applied, for example:
 > allow adding up to X new files, modifying %Y of the lower files and
 > deleting %Z of the lower files.
 > 
 > Note that container management does not have to adjust project quota
 > limits periodically; it only needs to re-calculate and adjust project quota
 > limits when the user gets an out-of-quota warning.
 > I believe there are already mechanisms in Linux quota management to
 > notify management software of quota limit expiry in order to take action,
 > but I am not that familiar with those mechanisms.
 > 
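
To make sure I understand the userspace side, the adjustment could look
roughly like this (a sketch using quotactl(2); adjust_inode_limit and the
idea of obtaining the whiteout count from overlayfs are hypothetical):

#include <stdio.h>
#include <sys/types.h>
#include <sys/quota.h>

#ifndef PRJQUOTA
#define PRJQUOTA 2      /* project quota type, from <linux/quota.h> */
#endif

/*
 * Raise the project inode limits by 2 inodes per whiteout, so deleted
 * lower files are not charged against the container's real budget.
 * The whiteout count would have to come from overlayfs somehow.
 */
static int adjust_inode_limit(const char *blkdev, int proj_id,
                              unsigned long long base_limit,
                              unsigned long long nr_whiteouts)
{
        struct dqblk dq;

        if (quotactl(QCMD(Q_GETQUOTA, PRJQUOTA), blkdev, proj_id,
                     (caddr_t)&dq) < 0) {
                perror("Q_GETQUOTA");
                return -1;
        }
        dq.dqb_ihardlimit = base_limit + 2 * nr_whiteouts;
        dq.dqb_isoftlimit = dq.dqb_ihardlimit;
        dq.dqb_valid = QIF_ILIMITS;     /* only update inode limits */
        if (quotactl(QCMD(Q_SETQUOTA, PRJQUOTA), blkdev, proj_id,
                     (caddr_t)&dq) < 0) {
                perror("Q_SETQUOTA");
                return -1;
        }
        return 0;
}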

If overlayfs could export its statistics (whiteout count, etc.) to user space,
that would be very helpful, and the limit would be easy to implement in
management software based on that detailed information.
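
For instance, something along these lines (purely hypothetical; overlayfs
exports nothing like this today, and the counter fields are made up):

/*
 * Hypothetical sketch: expose per-mount index/whiteout counters to
 * user space through a debugfs file registered at mount time with
 * debugfs_create_file("ovl_stats", 0444, NULL, ofs, &ovl_stats_fops).
 */
static int ovl_stats_show(struct seq_file *m, void *v)
{
        struct ovl_fs *ofs = m->private;

        seq_printf(m, "whiteouts:\t%ld\n",
                   atomic_long_read(&ofs->nr_whiteouts));
        seq_printf(m, "dir indexes:\t%ld\n",
                   atomic_long_read(&ofs->nr_dir_indexes));
        seq_printf(m, "non-dir indexes:\t%ld\n",
                   atomic_long_read(&ofs->nr_nondir_indexes));
        return 0;
}
DEFINE_SHOW_ATTRIBUTE(ovl_stats);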


Thanks,
cgxu


