* Question about XFS_MAXINUMBER
@ 2018-03-16 14:05 Amir Goldstein
  2018-03-16 17:59 ` Amir Goldstein
  2018-03-16 22:24 ` Dave Chinner
  0 siblings, 2 replies; 20+ messages in thread
From: Amir Goldstein @ 2018-03-16 14:05 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, Miklos Szeredi, overlayfs

Hi guys,

I am trying to get a lower bound for unused inode number MSB on
a mounted xfs super block, so I can publish it on struct super_block.

This doesn't need to be a tight lower bound, but it needs to be
a lower bound that cannot change with growfs nor when
remounting with different options (i.e. inode64).

This is needed for overlayfs to be able to use the unused upper bits
for overlayfs inode number namespace (see [1]).

I realize that for a given agcount, a "soft" lower bound of unused
upper bits is agno_log-agblklog-inopblog, which makes the "hard"
lower bound 32-agblklog-inopblog, so I think I can use this number.

I was staring at this definition and tried to figure out where this
absolute limit of 56 used bits came from:
#define XFS_MAXINUMBER ((xfs_ino_t)((1ULL << 56) - 1ULL))

Is this number really correct? If yes, then where does the constraint
on maximum 56 bits come from?

Thanks,
Amir.

[1] https://marc.info/?l=linux-unionfs&m=151007386419753&w=2

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Question about XFS_MAXINUMBER 2018-03-16 14:05 Question about XFS_MAXINUMBER Amir Goldstein @ 2018-03-16 17:59 ` Amir Goldstein 2018-03-16 22:24 ` Dave Chinner 1 sibling, 0 replies; 20+ messages in thread From: Amir Goldstein @ 2018-03-16 17:59 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs, Miklos Szeredi, overlayfs On Fri, Mar 16, 2018 at 4:05 PM, Amir Goldstein <amir73il@gmail.com> wrote: > Hi guys, > > I am trying to get a lower bound for unused inode number MSB on > a mounted xfs super block, so I can publish it on struct super_block. > > This doesn't need to be a tight lower bound, but it needs to be > a loewr bound that cannot change with growfs nor when > remounting with different options (i.e. inode64). > > This is needed for overlayfs to be able to use the unused upper bits > for overlayfs inode number namespace (see [1]). > > I realize that for a given agcount, a "soft" lower bound of unused > upper bits is agno_log-agblklog-inopblog, which makes the "hard" Hmm, I copied that typo from the comment in xfs_format.h. Unless I am missing something the amount of unused upper bits is 64 - agno_log - agblklog - inopblog. Hence the "hard" limit below: > lower bound 32-agblklog-inopblog, so I think I can use this number. > > I was staring at this definition and tried to figure out where this > absolute limit of 56 used bits came from: > #define XFS_MAXINUMBER ((xfs_ino_t)((1ULL << 56) - 1ULL)) > > Is this number really correct? If yes, then where does the constrain > on maximum 56 bits come from? > > Thanks, > Amir. > > [1] https://marc.info/?l=linux-unionfs&m=151007386419753&w=2 ^ permalink raw reply [flat|nested] 20+ messages in thread
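The bit budget described in the message above can be written out as plain arithmetic. This is only a sketch: the helper names are hypothetical (they are not XFS functions), and it assumes the inode number is laid out as AG number, then block within the AG, then inode within the block, with the AG number never exceeding 32 bits and agblklog + inopblog never exceeding 32.

```c
/*
 * Hypothetical helpers, not part of XFS. They only restate the
 * arithmetic from the thread for a 64-bit inode number laid out as
 * | agno | block in AG | inode in block |.
 */

/* Bits in use when the current AG count needs agno_log bits. */
static inline unsigned int ino_bits_used(unsigned int agno_log,
					 unsigned int agblklog,
					 unsigned int inopblog)
{
	return agno_log + agblklog + inopblog;
}

/* "Soft" bound: unused upper bits for the current agcount. */
static inline unsigned int ino_bits_unused_soft(unsigned int agno_log,
						unsigned int agblklog,
						unsigned int inopblog)
{
	return 64 - ino_bits_used(agno_log, agblklog, inopblog);
}

/*
 * "Hard" bound: assume growfs may push the AG number to a full 32 bits.
 * Assumes agblklog + inopblog <= 32, otherwise the result wraps.
 */
static inline unsigned int ino_bits_unused_hard(unsigned int agblklog,
						unsigned int inopblog)
{
	return 32 - agblklog - inopblog;
}
```

With illustrative values agno_log = 2, agblklog = 28 and inopblog = 3, the soft bound is 31 unused bits and the hard bound is 1. The hard bound is the only one that cannot shrink when the filesystem grows.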
* Re: Question about XFS_MAXINUMBER 2018-03-16 14:05 Question about XFS_MAXINUMBER Amir Goldstein 2018-03-16 17:59 ` Amir Goldstein @ 2018-03-16 22:24 ` Dave Chinner 2018-03-17 5:40 ` Miklos Szeredi 1 sibling, 1 reply; 20+ messages in thread From: Dave Chinner @ 2018-03-16 22:24 UTC (permalink / raw) To: Amir Goldstein; +Cc: Darrick J. Wong, linux-xfs, Miklos Szeredi, overlayfs On Fri, Mar 16, 2018 at 04:05:22PM +0200, Amir Goldstein wrote: > Hi guys, > > I am trying to get a lower bound for unused inode number MSB on > a mounted xfs super block, so I can publish it on struct super_block. Sorry, what? The inode number is owned by the filesystem - nobody should be touching it or making assumptions they can screw with it in any way. > This doesn't need to be a tight lower bound, but it needs to be > a loewr bound that cannot change with growfs nor when > remounting with different options (i.e. inode64). > > This is needed for overlayfs to be able to use the unused upper bits > for overlayfs inode number namespace (see [1]). SO you're assuming that filesystems don't ever encode information into their inode numbers. I've already got plans to use a bunch of the unused upper bits in the inode number internally in XFS for subvolumes, and ISTR that Darrick was mulling a use for some of them a while back, too... > I realize that for a given agcount, a "soft" lower bound of unused > upper bits is agno_log-agblklog-inopblog, which makes the "hard" > lower bound 32-agblklog-inopblog, so I think I can use this number. > > I was staring at this definition and tried to figure out where this > absolute limit of 56 used bits came from: > #define XFS_MAXINUMBER ((xfs_ino_t)((1ULL << 56) - 1ULL)) > > Is this number really correct? If yes, then where does the constrain > on maximum 56 bits come from? Yes, 56 bits is the current maximum *physical* inode number - the inode number is currently a physical representation of the location on disk. 
56 bits is needed to represent inodes in 2^63 bytes of
physical space.

Off the top of my head, it works out something like this for
a 512 byte inode, 4k block size filesystem:

	bits	range	meaning
	6	0-63	inode # in chunk
	7-22	1TB	block offset in AG of inode 0
			blkspag / bsize / inopblk
			2^30 / 2^12 / 2^3 = 2^15
	23-55	AGNO	AG number

The breakdown of bits changes for different inode and block sizes,
but the worst case comes out somewhere around 56 bits...

*but*

#define NULLFSINO ((xfs_ino_t)-1)

is a valid inode number on disk, indicating that the field is not
holding an inode number. The MSB indicates the inode number is a
"virtual" inode number, holding some special significance that is
not directly a physical inode number. Hence we actually use all 64
bits of the inode number on disk, and hence there are no free bits
in the inode number for anyone outside XFS to use.

IOWs, I think your plan is DOA because we already use the entire 64
bit space in the inode number field and have plans for the "unused
bits" already in motion....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply [flat|nested] 20+ messages in thread
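To make "the inode number is a physical representation of the location on disk" concrete, here is a minimal sketch of packing and unpacking such a number. The struct and function names are hypothetical and this is not the XFS code. It assumes the agno / agblklog / inopblog split mentioned earlier in the thread, and that both shift widths are below 64.

```c
#include <stdint.h>

/* Hypothetical geometry holder; field names follow the thread's terms. */
struct ino_geom {
	unsigned int agblklog;	/* log2 of blocks per AG, rounded up */
	unsigned int inopblog;	/* log2 of inodes per block */
};

/* The three physical coordinates encoded in an inode number. */
struct ino_parts {
	uint64_t agno;		/* allocation group number */
	uint64_t agbno;		/* block offset within the AG */
	uint64_t offset;	/* inode index within the block */
};

/* Combine the coordinates into one 64-bit inode number. */
static inline uint64_t ino_pack(const struct ino_geom *g,
				const struct ino_parts *p)
{
	return (p->agno << (g->agblklog + g->inopblog)) |
	       (p->agbno << g->inopblog) |
	       p->offset;
}

/* Recover the coordinates by masking and shifting. */
static inline struct ino_parts ino_unpack(const struct ino_geom *g,
					  uint64_t ino)
{
	struct ino_parts p;

	p.offset = ino & ((1ULL << g->inopblog) - 1);
	p.agbno = (ino >> g->inopblog) & ((1ULL << g->agblklog) - 1);
	p.agno = ino >> (g->agblklog + g->inopblog);
	return p;
}
```

For example, with agblklog = 16 and inopblog = 3, the triple (agno 5, agbno 100, offset 7) packs to 2622247. Any bit above the agno field is only "unused" for as long as the geometry and the filesystem's own plans leave it that way, which is the point Dave makes above.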
* Re: Question about XFS_MAXINUMBER 2018-03-16 22:24 ` Dave Chinner @ 2018-03-17 5:40 ` Miklos Szeredi 2018-03-17 7:56 ` Amir Goldstein 2018-03-17 8:04 ` Dave Chinner 0 siblings, 2 replies; 20+ messages in thread From: Miklos Szeredi @ 2018-03-17 5:40 UTC (permalink / raw) To: Dave Chinner; +Cc: Amir Goldstein, Darrick J. Wong, linux-xfs, overlayfs On Fri, Mar 16, 2018 at 11:24 PM, Dave Chinner <david@fromorbit.com> wrote: > On Fri, Mar 16, 2018 at 04:05:22PM +0200, Amir Goldstein wrote: >> Hi guys, >> >> I am trying to get a lower bound for unused inode number MSB on >> a mounted xfs super block, so I can publish it on struct super_block. > > Sorry, what? > > The inode number is owned by the filesystem - nobody should be > touching it or making assumptions they can screw with it in any way. > >> This doesn't need to be a tight lower bound, but it needs to be >> a loewr bound that cannot change with growfs nor when >> remounting with different options (i.e. inode64). >> >> This is needed for overlayfs to be able to use the unused upper bits >> for overlayfs inode number namespace (see [1]). > > SO you're assuming that filesystems don't ever encode information > into their inode numbers. I've already got plans to use a bunch of > the unused upper bits in the inode number internally in XFS for > subvolumes, and ISTR that Darrick was mulling a use for some of > them a while back, too... > >> I realize that for a given agcount, a "soft" lower bound of unused >> upper bits is agno_log-agblklog-inopblog, which makes the "hard" >> lower bound 32-agblklog-inopblog, so I think I can use this number. >> >> I was staring at this definition and tried to figure out where this >> absolute limit of 56 used bits came from: >> #define XFS_MAXINUMBER ((xfs_ino_t)((1ULL << 56) - 1ULL)) >> >> Is this number really correct? If yes, then where does the constrain >> on maximum 56 bits come from? 
> > Yes, 56 bits is the current maximum *physical* inode number - the > inode number is currently a physical representation of the location > on disk. 56 bits is needed to represent inodes in 2^63 bytes of > physical space. > > Off the top of my head, it works out something like this for a > a 512 byte inode, 4k block size filesystem: > > bits range meaning > 6 0-63 inode # in chunk > 7-22 1TB block offset in AG of inode 0 > blkspag / bsize / inopblk > 2^30 / 2^12 / 2^3 = 2^15 > 23-55 AGNO AG number > > The breakdown of bits change for different inode and block sizes, > but the worse case comes out somewhere around 56 bits... > > *but* > > #define NULLFSINO ((xfs_ino_t)-1) > > is a valid inode number on disk, indicating that the field is not > holding an inode number. the MSB indicates the inode number is a > "virtual" inode number, holding some special significance that is > not directly a physical inode number. Hence we actually use all 64 > bits of the inode number on disk, and hence there are no free bits > in the inode number for anyone outside XFS to use. > > IOWs, I think your plan is DOA because we already use the entire 64 > bit space in the inode number field and have plans for the "unused > bits" already in motion.... We don't care about internal or on-disk use. Does that still make it DOA? I ask, because we've thought long and hard about what to do for multiplexing inum space in overlayfs, and found no other sane options. Ideas welcome, of course. Thanks, Miklos ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Question about XFS_MAXINUMBER 2018-03-17 5:40 ` Miklos Szeredi @ 2018-03-17 7:56 ` Amir Goldstein 2018-03-17 21:28 ` Dave Chinner 2018-03-17 8:04 ` Dave Chinner 1 sibling, 1 reply; 20+ messages in thread From: Amir Goldstein @ 2018-03-17 7:56 UTC (permalink / raw) To: Miklos Szeredi; +Cc: Dave Chinner, Darrick J. Wong, linux-xfs, overlayfs On Sat, Mar 17, 2018 at 7:40 AM, Miklos Szeredi <miklos@szeredi.hu> wrote: > On Fri, Mar 16, 2018 at 11:24 PM, Dave Chinner <david@fromorbit.com> wrote: >> On Fri, Mar 16, 2018 at 04:05:22PM +0200, Amir Goldstein wrote: >>> Hi guys, >>> >>> I am trying to get a lower bound for unused inode number MSB on >>> a mounted xfs super block, so I can publish it on struct super_block. >> >> Sorry, what? >> >> The inode number is owned by the filesystem - nobody should be >> touching it or making assumptions they can screw with it in any way. >> Let me clarify with the simplest example: With overlay of 2 layers, lower and upper on 2 different xfs fs assuming that stat(2) from xfs will not be using the 63 MSB: On stat(2) of an overlay upper inode we want to return: st_dev = <overlay anon bdev> st_ino = <real upper st_ino> On stat(2) of an overlay lower inode we want to return: st_dev = <overlay anon bdev> st_ino = <real lower st_ino> | 1 << 63 Now for ext4 this is always safe to do and we find that automatically due to the fact that ext4 uses the default encode_fh generic 32bit inode encoding. For xfs this should also be safe, but we don't want to whitelist xfs by name/magic, so we want xfs to publish the max amount of bits exposed to user with stat(2)/getdents(3). Recently, I became aware of an nfsd use case that also looks at inode->i_ino, so we may want to also be able to assume max_ino_bits also applies to inode->i_ino, but if you tell us to stay clear of inode->i_ino, then we can always use stat.st_ino. Thanks, Amir. ^ permalink raw reply [flat|nested] 20+ messages in thread
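The two-layer example above generalizes to reserving the top few bits of st_ino for a layer index. The following is a rough sketch under that assumption, not the overlayfs implementation. `mux_ino` and its parameters are hypothetical names, and a `false` return stands for "fall back to whatever is done today".

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical sketch: place a layer index in the top xino_bits of
 * st_ino, provided the real inode number leaves those bits clear.
 * Returns false when multiplexing is not possible.
 */
static inline bool mux_ino(uint64_t real_ino, unsigned int layer,
			   unsigned int xino_bits, uint64_t *out)
{
	unsigned int shift;

	/* No reserved bits, or nothing left for the real inode number. */
	if (xino_bits == 0 || xino_bits >= 64)
		return false;
	shift = 64 - xino_bits;

	/* The layer index must fit in the reserved bits. */
	if (((uint64_t)layer >> xino_bits) != 0)
		return false;

	/* The real inode number must not already use the reserved bits. */
	if ((real_ino >> shift) != 0)
		return false;

	*out = real_ino | ((uint64_t)layer << shift);
	return true;
}
```

With xino_bits = 1, an upper inode 10 (layer 0) stays 10, while a lower inode 10 (layer 1) becomes 10 | 1 << 63, matching the example in the message. A real inode number that already has bit 63 set is rejected rather than silently aliased.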
* Re: Question about XFS_MAXINUMBER 2018-03-17 7:56 ` Amir Goldstein @ 2018-03-17 21:28 ` Dave Chinner 2018-03-18 6:21 ` Amir Goldstein 0 siblings, 1 reply; 20+ messages in thread From: Dave Chinner @ 2018-03-17 21:28 UTC (permalink / raw) To: Amir Goldstein; +Cc: Miklos Szeredi, Darrick J. Wong, linux-xfs, overlayfs On Sat, Mar 17, 2018 at 09:56:19AM +0200, Amir Goldstein wrote: > On Sat, Mar 17, 2018 at 7:40 AM, Miklos Szeredi <miklos@szeredi.hu> wrote: > > On Fri, Mar 16, 2018 at 11:24 PM, Dave Chinner <david@fromorbit.com> wrote: > >> On Fri, Mar 16, 2018 at 04:05:22PM +0200, Amir Goldstein wrote: > >>> Hi guys, > >>> > >>> I am trying to get a lower bound for unused inode number MSB on > >>> a mounted xfs super block, so I can publish it on struct super_block. > >> > >> Sorry, what? > >> > >> The inode number is owned by the filesystem - nobody should be > >> touching it or making assumptions they can screw with it in any way. > >> > > Let me clarify with the simplest example: > > With overlay of 2 layers, lower and upper on 2 different xfs fs > assuming that stat(2) from xfs will not be using the 63 MSB: > > On stat(2) of an overlay upper inode we want to return: > st_dev = <overlay anon bdev> > st_ino = <real upper st_ino> > > On stat(2) of an overlay lower inode we want to return: > st_dev = <overlay anon bdev> > st_ino = <real lower st_ino> | 1 << 63 > > Now for ext4 this is always safe to do and we find that automatically > due to the fact that ext4 uses the default encode_fh generic 32bit > inode encoding. > > For xfs this should also be safe, but we don't want to whitelist xfs > by name/magic, so we want xfs to publish the max amount of bits > exposed to user with stat(2)/getdents(3). > > Recently, I became aware of an nfsd use case that also looks > at inode->i_ino, so we may want to also be able to assume > max_ino_bits also applies to inode->i_ino, but if you tell us to > stay clear of inode->i_ino, then we can always use stat.st_ino. 
> > Thanks, > Amir. > On Sat, Mar 17, 2018 at 10:24:39AM +0200, Amir Goldstein wrote: > On Sat, Mar 17, 2018 at 10:04 AM, Dave Chinner <david@fromorbit.com> wrote: > > On Sat, Mar 17, 2018 at 06:40:23AM +0100, Miklos Szeredi wrote: > [...] > >> I ask, because we've thought long and hard about what to do for > >> multiplexing inum space in overlayfs, and found no other sane options. > >> Ideas welcome, of course. > > > > Why do you need to "multiplex" the inum space? perhaps you'd do > > better to start with a description of why you want to play games > > with inode numbers, rather than just posting a patch to steal bits > > from other filesytem inode number spaces.... > > > > I think this patch perhaps explains best what we want to do: > https://marc.info/?l=linux-unionfs&m=151007386219743&w=2 > > I had already given a simple example in an earlier response. So, I'll quote that here: > > > On stat(2) of an overlay upper inode we want to return: > > > st_dev = <overlay anon bdev> > > > st_ino = <real upper st_ino> > > > > > > On stat(2) of an overlay lower inode we want to return: > > > st_dev = <overlay anon bdev> > > > st_ino = <real lower st_ino> | 1 << 63 This makes no sense to me - this implies the inode number changes on copy-up, and .... > As the the "why" question, we have several requirements for > overlay inode numbers: > 1. st_ino is persistent > 2. st_ino/st_dev pair is unique in the system > 3. st_ino is consistent with d_ino > 4. st_ino doesn't change on copy up > 5. st_dev is uniform across all overlay inodes .... this means requierment #4 isn't met, even on the same filesystem. IOWs, if overlay has already met #4 on the same filesystem, then there is a persistent mapping between lower and upper inodes (Req. #1) that maps the upper inode # to the lower inode #. That has to be overlay information, because the underlying filesystem doesn't store it. And because the lower inode/dev is unique, then req. 2 is met, too. 
FWIW, req 5 is badly worded - st_dev is uniform across all inodes in
a single overlay filesystem, not all overlay inodes.

> With upstream overlayfs we meet all requirements above for
> the case of all underlying layers on the same fs, by using a real
> underlying inode st_ino and the overlay st_dev.

Yeah, that's what I thought. So why can't you do exactly the same
thing for different underlying filesystems? You've already got a
mapping between upper and lower inode numbers, why can't that map
across different superblocks? Why do you need special "inode number
bits" exposed to userspace to identify upper->lower inode
mappings that overlay should already have a persistent mapping
mechanism for?

> With the 'xino' patch set [1], we can meet all requirements above
> also for the case of underlying layers on different fs, by multiplexing
> the inum space, as long as we know about unused high ino bits.

Your example makes no sense to me - I don't see how adding extra
bits to the lower inode number allows you to meet requirement #4,
nor why presenting "st_ino = <real upper st_ino>" for inodes that
have been copied up is being done, because that violates requirement
#4....

> The ovl-xino branch already has the xfs patch (not yet posted) to publish
> max_ino_bits.

That has no explanation of why you need to screw with inode number
bits, either. It's all mechanism, and there's zero explanation of
what problem it solves.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Question about XFS_MAXINUMBER 2018-03-17 21:28 ` Dave Chinner @ 2018-03-18 6:21 ` Amir Goldstein 2018-03-18 23:02 ` Dave Chinner 0 siblings, 1 reply; 20+ messages in thread From: Amir Goldstein @ 2018-03-18 6:21 UTC (permalink / raw) To: Dave Chinner; +Cc: Miklos Szeredi, Darrick J. Wong, linux-xfs, overlayfs On Sat, Mar 17, 2018 at 11:28 PM, Dave Chinner <david@fromorbit.com> wrote: > On Sat, Mar 17, 2018 at 09:56:19AM +0200, Amir Goldstein wrote: >> On Sat, Mar 17, 2018 at 7:40 AM, Miklos Szeredi <miklos@szeredi.hu> wrote: >> > On Fri, Mar 16, 2018 at 11:24 PM, Dave Chinner <david@fromorbit.com> wrote: >> >> On Fri, Mar 16, 2018 at 04:05:22PM +0200, Amir Goldstein wrote: >> >>> Hi guys, >> >>> >> >>> I am trying to get a lower bound for unused inode number MSB on >> >>> a mounted xfs super block, so I can publish it on struct super_block. >> >> >> >> Sorry, what? >> >> >> >> The inode number is owned by the filesystem - nobody should be >> >> touching it or making assumptions they can screw with it in any way. >> >> >> >> Let me clarify with the simplest example: >> >> With overlay of 2 layers, lower and upper on 2 different xfs fs >> assuming that stat(2) from xfs will not be using the 63 MSB: >> >> On stat(2) of an overlay upper inode we want to return: >> st_dev = <overlay anon bdev> >> st_ino = <real upper st_ino> >> >> On stat(2) of an overlay lower inode we want to return: >> st_dev = <overlay anon bdev> >> st_ino = <real lower st_ino> | 1 << 63 >> >> Now for ext4 this is always safe to do and we find that automatically >> due to the fact that ext4 uses the default encode_fh generic 32bit >> inode encoding. >> >> For xfs this should also be safe, but we don't want to whitelist xfs >> by name/magic, so we want xfs to publish the max amount of bits >> exposed to user with stat(2)/getdents(3). 
>> >> Recently, I became aware of an nfsd use case that also looks >> at inode->i_ino, so we may want to also be able to assume >> max_ino_bits also applies to inode->i_ino, but if you tell us to >> stay clear of inode->i_ino, then we can always use stat.st_ino. >> >> Thanks, >> Amir. >> > > On Sat, Mar 17, 2018 at 10:24:39AM +0200, Amir Goldstein wrote: >> On Sat, Mar 17, 2018 at 10:04 AM, Dave Chinner <david@fromorbit.com> wrote: >> > On Sat, Mar 17, 2018 at 06:40:23AM +0100, Miklos Szeredi wrote: >> [...] >> >> I ask, because we've thought long and hard about what to do for >> >> multiplexing inum space in overlayfs, and found no other sane options. >> >> Ideas welcome, of course. >> > >> > Why do you need to "multiplex" the inum space? perhaps you'd do >> > better to start with a description of why you want to play games >> > with inode numbers, rather than just posting a patch to steal bits >> > from other filesytem inode number spaces.... >> > >> >> I think this patch perhaps explains best what we want to do: >> https://marc.info/?l=linux-unionfs&m=151007386219743&w=2 >> >> I had already given a simple example in an earlier response. > > So, I'll quote that here: > >> > > On stat(2) of an overlay upper inode we want to return: >> > > st_dev = <overlay anon bdev> >> > > st_ino = <real upper st_ino> >> > > >> > > On stat(2) of an overlay lower inode we want to return: >> > > st_dev = <overlay anon bdev> >> > > st_ino = <real lower st_ino> | 1 << 63 > > This makes no sense to me - this implies the inode number changes on > copy-up, and .... > I tried to keep the example simple, but failed to mention that lower and upper refer to different file, say foo and bar. I should have mentioned that "foo" is a pure upper - a file that was created as upper and let's suppose the real ino of "foo" in upper fs is 10. And let's suppose that the real ino of "bar" on lower fs is also 10, which is possible when lower fs is a different fs than upper fs. 
>> As to the "why" question, we have several requirements for
>> overlay inode numbers:
>> 1. st_ino is persistent
>> 2. st_ino/st_dev pair is unique in the system
>> 3. st_ino is consistent with d_ino
>> 4. st_ino doesn't change on copy up
>> 5. st_dev is uniform across all overlay inodes
>
> .... this means requirement #4 isn't met, even on the same
> filesystem.
>
> IOWs, if overlay has already met #4 on the same filesystem, then
> there is a persistent mapping between lower and upper inodes (Req.
> #1) that maps the upper inode # to the lower inode #. That has to be
> overlay information, because the underlying filesystem doesn't store

Correct. #4 is met because we keep track of "copy up origin" by storing
the lower inode file handle in the "origin" xattr of the copied up file.
Therefore, for an upper file that originated in a lower file we will use
the real lower multiplexed ino across copy up and across mount cycle.

> it. And because the lower inode/dev is unique, then req. 2 is met,
> too.
>

Correct. But notice that overlay does not use the real st_dev.
If it did, that would break the requirement that the real fs st_ino/st_dev
pair is unique in the system. So for non-samefs, overlay uses a different
anon bdev for each layer to satisfy #2, but breaks #5.

> FWIW, req 5 is badly worded - st_dev is uniform across all inodes in
> a single overlay filesystem, not all overlay inodes.
>

Correct. FYI, #5 has never been met for non-samefs.
What overlayfs does now is meet #5 for directory inodes
(to make find -xdev happy) at the cost of trading off #1.

>> With upstream overlayfs we meet all requirements above for
>> the case of all underlying layers on the same fs, by using a real
>> underlying inode st_ino and the overlay st_dev.
>
> Yeah, that's what I thought. So why can't you do exactly the same
> thing for different underlying filesystems? You've already got a
> mapping between upper and lower inode numbers, why can't that map
> across different superblocks?
> Why do you need special "inode number
> bits" exposed to userspace to identify upper->lower inode
> mappings that overlay should already have a persistent mapping
> mechanism for?

Because a real pure upper inode and a lower inode can have the same
inode number and we want to multiplex our way out of this collision.

Note that we do NOT maintain a data structure for looking up used
lower/upper inode numbers, nor do we want to maintain a persistent
data structure for persistent overlay inode numbers that map to
real underlying inodes. AFAIK, aufs can use a small db for its 'xino'
feature. This is something that we wish to avoid.

>
>> With the 'xino' patch set [1], we can meet all requirements above
>> also for the case of underlying layers on different fs, by multiplexing
>> the inum space, as long as we know about unused high ino bits.
>
> Your example makes no sense to me - I don't see how adding extra
> bits to the lower inode number allows you to meet requirement #4,
> nor why presenting "st_ino = <real upper st_ino>" for inodes that
> have been copied up is being done, because that violates requirement
> #4....

The example was miscommunicated. I hope I was able to make the
problem clear now.

>
>> The ovl-xino branch already has the xfs patch (not yet posted) to publish
>> max_ino_bits.
>
> That has no explanation of why you need to screw with inode number
> bits, either. It's all mechanism, and there's zero explanation of
> what problem it solves.
>

It's true. The explanation is now scattered in previous patches, which
incrementally fixed the samefs case and improved the non-samefs case.
I think currently, the most documented version can be found in this
new helper:
https://github.com/amir73il/linux/blob/overlayfs-devel/fs/overlayfs/inode.c#L62
but I will make sure to add proper full documentation, including the
requirements and how they are met, in the next version I post.

Please let me know if I missed something and if the motivation is still
not clear.

Thanks!
Amir.
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Question about XFS_MAXINUMBER 2018-03-18 6:21 ` Amir Goldstein @ 2018-03-18 23:02 ` Dave Chinner 2018-03-19 4:03 ` Amir Goldstein 0 siblings, 1 reply; 20+ messages in thread From: Dave Chinner @ 2018-03-18 23:02 UTC (permalink / raw) To: Amir Goldstein; +Cc: Miklos Szeredi, Darrick J. Wong, linux-xfs, overlayfs On Sun, Mar 18, 2018 at 08:21:16AM +0200, Amir Goldstein wrote: > On Sat, Mar 17, 2018 at 11:28 PM, Dave Chinner <david@fromorbit.com> wrote: > > On Sat, Mar 17, 2018 at 09:56:19AM +0200, Amir Goldstein wrote: > >> On Sat, Mar 17, 2018 at 7:40 AM, Miklos Szeredi <miklos@szeredi.hu> wrote: > >> > On Fri, Mar 16, 2018 at 11:24 PM, Dave Chinner <david@fromorbit.com> wrote: > >> >> On Fri, Mar 16, 2018 at 04:05:22PM +0200, Amir Goldstein wrote: > >> >>> Hi guys, > >> >>> > >> >>> I am trying to get a lower bound for unused inode number MSB on > >> >>> a mounted xfs super block, so I can publish it on struct super_block. > >> >> > >> >> Sorry, what? > >> >> > >> >> The inode number is owned by the filesystem - nobody should be > >> >> touching it or making assumptions they can screw with it in any way. > >> >> > >> > >> Let me clarify with the simplest example: > >> > >> With overlay of 2 layers, lower and upper on 2 different xfs fs > >> assuming that stat(2) from xfs will not be using the 63 MSB: > >> > >> On stat(2) of an overlay upper inode we want to return: > >> st_dev = <overlay anon bdev> > >> st_ino = <real upper st_ino> > >> > >> On stat(2) of an overlay lower inode we want to return: > >> st_dev = <overlay anon bdev> > >> st_ino = <real lower st_ino> | 1 << 63 [....] > I should have mentioned that "foo" is a pure upper - a file that was created > as upper and let's suppose the real ino of "foo" in upper fs is 10. > And let's suppose that the real ino of "bar" on lower fs is also 10, which is > possible when lower fs is a different fs than upper fs. Ok, so to close the loop. 
The problem is that overlay has no inode
number space of its own, nor does it have any persistent inode
number mapping scheme. Hence overlay has no way of providing users
with a consistent, unique {dev,ino #} tuple to userspace when its
different directories lie on different filesystems.

[....]

> > across different superblocks? Why do you need special "inode number
> > bits" exposed to userspace to identify upper->lower inode
> > mappings that overlay should already have a persistent mapping
> > mechanism for?
>
> Because a real pure upper inode and a lower inode can have the same
> inode number and we want to multiplex our way out of this collision.
>
> Note that we do NOT maintain a data structure for looking up used
> lower/upper inode numbers, nor do we want to maintain a persistent
> data structure for persistent overlay inode numbers that map to
> real underlying inodes. AFAIK, aufs can use a small db for its 'xino'
> feature. This is something that we wish to avoid.

So instead of maintaining your own data structure to provide the
necessary guarantees, the solution is to steal bits from the
underlying filesystem inode numbers on the assumption that they will
never use them?

What happens when a user upgrades their kernel, the underlying fs
changes all its inode numbers because it's done some virtual
mapping thing for, say, having different inode number ranges for
separate mount namespaces? And so instead of having N bits of free
inode number space before upgrade, it now has zero? How will overlay
react to this sort of change, given it could expose duplicate inode
numbers....

Quite frankly, I think this "steal bits from the underlying
filesystems" mechanism is a recipe for trouble. If you want to play
these games, you get to keep all the broken bits when filesystems
change the number of available bits.

Given that overlay has a persistent inode numbering problem, why
doesn't overlay just allocate and store its own inode numbers and
other required persistent state in an xattr?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Question about XFS_MAXINUMBER
  2018-03-18 23:02 ` Dave Chinner
@ 2018-03-19 4:03 ` Amir Goldstein
  2018-03-19 8:42 ` Miklos Szeredi
  2018-03-20 1:47 ` Dave Chinner
  0 siblings, 2 replies; 20+ messages in thread
From: Amir Goldstein @ 2018-03-19 4:03 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Miklos Szeredi, Darrick J. Wong, linux-xfs, overlayfs, linux-fsdevel

This thread has come to a point where I should have included fsdevel
a while ago, so CCing fsdevel.
For those interested in previous episodes:
https://marc.info/?l=linux-xfs&m=152120912822207&w=2

On Mon, Mar 19, 2018 at 1:02 AM, Dave Chinner <david@fromorbit.com> wrote:
> [....]
>
>> I should have mentioned that "foo" is a pure upper - a file that was created
>> as upper and let's suppose the real ino of "foo" in upper fs is 10.
>> And let's suppose that the real ino of "bar" on lower fs is also 10, which is
>> possible when lower fs is a different fs than upper fs.
>
> Ok, so to close the loop. The problem is that overlay has no inode
> number space of its own, nor does it have any persistent inode
> number mapping scheme. Hence overlay has no way of providing users
> with a consistent, unique {dev,ino #} tuple to userspace when its
> different directories lie on different filesystems.
>

Yes.

[...]

>> Because a real pure upper inode and a lower inode can have the same
>> inode number and we want to multiplex our way out of this collision.
>>
>> Note that we do NOT maintain a data structure for looking up used
>> lower/upper inode numbers, nor do we want to maintain a persistent
>> data structure for persistent overlay inode numbers that map to
>> real underlying inodes. AFAIK, aufs can use a small db for its 'xino'
>> feature. This is something that we wish to avoid.
>
> So instead of maintaining your own data structure to provide the
> necessary guarantees, the solution is to steal bits from the
> underlying filesystem inode numbers on the assumption that they will
> never use them?
>

Well, it is not an assumption if the filesystem is inclined to publish
s_max_ino_bits, which is not that different in concept from publishing
s_maxbytes and s_max_links, which are also limitations in the current
kernel/sb that could be lifted in the future.

> What happens when a user upgrades their kernel, the underlying fs
> changes all its inode numbers because it's done some virtual
> mapping thing for, say, having different inode number ranges for
> separate mount namespaces? And so instead of having N bits of free
> inode number space before upgrade, it now has zero? How will overlay
> react to this sort of change, given it could expose duplicate inode
> numbers....

After a kernel upgrade, the filesystem would set s_max_ino_bits to 64 or
not set it at all, and then overlayfs will not use the high bits and
will fall back to what it does today.

But if we want to bring practical arguments from the containers world
into the picture, IMO it is far more likely that existing container
solutions would benefit from overlayfs inode number multiplexing than
they would from inode number mapping by the filesystem for different
mount namespaces.

>
> Quite frankly, I think this "steal bits from the underlying
> filesystems" mechanism is a recipe for trouble. If you want to play
> these games, you get to keep all the broken bits when filesystems
> change the number of available bits.
>

I don't see that as a problem. I would say there are a fair number
of users out there using containers with overlayfs. Do you realize
that the majority of those users are settling for things like:
no directory rename, breaking hardlinks on copy up.
Those are "features" of overlayfs that have been fixed in recent
kernels, but are only now on their way to distro kernels and not yet
enabled by container runtimes.

Container admins already make the choice of underlying filesystem
consciously to get the best from overlayfs, and I would expect
that they will soon be opting in for xfs+reflink because of that
conscious choice.
If ever xfs decides to change inode numbers address space on kernel upgrade without users opting in for it, I would be surprised, but I should also hope that xfs would at least leave a choice for users to opt-out of this behavior and that is what container admins would do. Heck, for all I care, users could also opt-in for unused inode bits explicitly (e.g. -o inode56) if you are concerned about letting go of those upper bits implicitly. My patch set already provides the capability for users to declare with overlay -o xino that enough upper bits are available (i.e. because user knows the underlying fs and its true practical limits). But the feature will be much more useful if users didn't have to do that. > Given that overlay has a persistent inode numbering problem, why > doesn't overlay just allocate and store its own inode numbers and > other required persistent state in an xattr? > First, this is not as simple as it sounds. If you have a huge number of readonly files in multiple lower layers, it makes no sense to scan them all on overlay mount to discover which inode numbers are free to use and it makes no sense either to create a persistent mapping for every lower file accessed in that case. And there are other problematic factors with this sort of scheme. Second, and this may be a revolutionary argument, I would like to believe that we are all working together for a "greater good". Sure, xfs developers strive to perfect and enhance xfs and overlayfs developers strive to perfect and enhance overlayfs. But when there is an opportunity for synergy between subsystems, one should consider the best solution as a whole and IMHO, the solution of filesystem declaring already unused ino bits is the best solution as a whole. xfs is not required to declare s_max_ino_bits for all eternity, only for this specific super block instance, in this specific kernel. Thanks, Amir. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Question about XFS_MAXINUMBER 2018-03-19 4:03 ` Amir Goldstein @ 2018-03-19 8:42 ` Miklos Szeredi 2018-03-20 1:47 ` Dave Chinner 1 sibling, 0 replies; 20+ messages in thread From: Miklos Szeredi @ 2018-03-19 8:42 UTC (permalink / raw) To: Amir Goldstein Cc: Dave Chinner, Darrick J. Wong, linux-xfs, overlayfs, linux-fsdevel On Mon, Mar 19, 2018 at 5:03 AM, Amir Goldstein <amir73il@gmail.com> wrote: > On Mon, Mar 19, 2018 at 1:02 AM, Dave Chinner <david@fromorbit.com> wrote: >> Given that overlay has a persistent inode numbering problem, why >> doesn't overlay just allocate and store it's own inode numbers and >> other required persistent state in an xattr? > > First, this is not as simple as it sounds. > If you have a huge number of readonly files in multiple lower layers, > it makes no sense to scan them all on overlay mount to discover which > inode numbers are free to use and it make no sense either to create > a persistent mapping for every lower file accessed in that case. > And there are other problematic factors with this sort of scheme. Such as when all layers are read-only. Where do we store the persistent inode numbers in that case? > > Second, and this may be a revolutionary argument, I would like to > believe that we are all working together for a "greater good". > Sure, xfs developers strive to perfect and enhance xfs and overlayfs > developers strive to perfect and enhance overlayfs. > But when there is an opportunity for synergy between subsystems, > one should consider the best solution as a whole and IMHO, > the solution of filesystem declaring already unused ino bits > is the best solution as a whole. xfs is not required to declare > s_max_ino_bits for all eternity, only for this specific super block > instance, in this specific kernel. The "specific kernel" part requires clarification. We do promise backward compatibility when upgrading the kernel, and silently increasing s_max_ino_bits on a kernel upgrade would break that promise. 
Could be backed by a feature flag. And unlimited use could be the default, people have learned to live with needing special features for overlayfs. And I do agree with Amir, that the "mine all mine" philosophy isn't necessarily the right one. In normal cases overlayfs would just use one or two bits of the inumber space. While Amir's current patch keeps the layer index in the spare bits, it is sufficient to hold an "fs index" that is incremented when a new superblock is encountered during enumeration of layers. The number of different fs instances used for creating an overlay is unlikely to be large, so for all practical purposes a few (4-6) bits should be enough. Thanks, Miklos ^ permalink raw reply [flat|nested] 20+ messages in thread
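The "fs index" idea above can be illustrated with a small userspace sketch. This is only an illustration under assumptions: `struct fs_table`, `fs_index_get()` and `MAX_FS` are hypothetical names, superblocks are represented by opaque pointers, and none of this is taken from the actual overlayfs patches.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_FS 16	/* hypothetical cap: 16 entries fit in 4 bits */

/* Hypothetical table of distinct superblocks seen while enumerating layers. */
struct fs_table {
	const void *sb[MAX_FS];	/* opaque superblock identities */
	size_t count;		/* number of distinct superblocks so far */
};

/*
 * Return the index for sb. A superblock that was already seen keeps its
 * index; a new one gets the next free index. Returns -1 when the table
 * is full, in which case the caller cannot multiplex.
 */
static int fs_index_get(struct fs_table *t, const void *sb)
{
	assert(t != NULL);

	for (size_t i = 0; i < t->count; i++)
		if (t->sb[i] == sb)
			return (int)i;

	if (t->count == MAX_FS)
		return -1;

	t->sb[t->count] = sb;
	return (int)t->count++;
}
```

With this scheme, several layers on the same filesystem share one index, so the number of bits needed depends on the number of distinct filesystems rather than on the number of layers.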
* Re: Question about XFS_MAXINUMBER 2018-03-19 4:03 ` Amir Goldstein 2018-03-19 8:42 ` Miklos Szeredi @ 2018-03-20 1:47 ` Dave Chinner 2018-03-20 6:29 ` Amir Goldstein 2018-03-20 9:32 ` Miklos Szeredi 1 sibling, 2 replies; 20+ messages in thread From: Dave Chinner @ 2018-03-20 1:47 UTC (permalink / raw) To: Amir Goldstein Cc: Miklos Szeredi, Darrick J. Wong, linux-xfs, overlayfs, linux-fsdevel On Mon, Mar 19, 2018 at 06:03:30AM +0200, Amir Goldstein wrote: > This thread has come to a point where I should have included fsdevel a > while ago, > so CCing fsdevel. For those interested in previous episodes: > https://marc.info/?l=linux-xfs&m=152120912822207&w=2 > > On Mon, Mar 19, 2018 at 1:02 AM, Dave Chinner <david@fromorbit.com> wrote: > > [....] > > > >> I should have mentioned that "foo" is a pure upper - a file that was created > >> as upper and let's suppose the real ino of "foo" in upper fs is 10. > >> And let's suppose that the real ino of "bar" on lower fs is also 10, which is > >> possible when lower fs is a different fs than upper fs. > > > > Ok, so to close the loop. The problem is that overlay has no inode > > number space of it's own, nor does it have any persistent inode > > number mapping scheme. Hence overlay has no way of providing users > > with a consistent, unique {dev,ino #} tuple to userspace when it's > > different directories lie on different filesystems. > > > > Yes. > > [...] > >> Because real pure upper inode and lower inode can have the same > >> inode number and we want to multiplex our way our of this collision. > >> > >> Note that we do NOT maintain a data structure for looking up used > >> lower/upper inode numbers, nor do we want to maintain a persistent > >> data structure for persistent overlay inode numbers that map to > >> real underlying inodes. AFAIK, aufs can use a small db for it's 'xino' > >> feature. This is something that we wish to avoid. 
> > > > So instead of maintaining your own data structure to provide the > > necessary guarantees, the solution is to steal bits from the > > underlying filesystem inode numbers on the assumption they will > > never use them? > > > > Well, it is not an assumption if filesystem is inclined to publish > s_max_ino_bits, which is not that different in concept from publishing > s_maxbytes and s_max_links, which are also limitations in current > kernel/sb that could be lifted in the future. It is different, because you're expecting to be able to publish persistent user visible information based on it. If we change s_max_ino_bits in the underlying filesystem, then overlay inode numbers change and that can cause all sorts of problems with things like filehandles, backups that use dev/inode number tuples to detect identical files, etc. i.e. there's a heap of downstream impacts of changing inode numbers. If we have to publish s_max_ino_bits to the VFS, we essentially fix the ABI of the user visible inode number the filesystem publishes. IOWs, we effectively can't change it without breaking external users. I suspect you don't realise we already expose the full 64 bit inode number space completely to userspace through other ABIs, e.g. the bulkstat ioctls. We've already got applications that use the XFS inode number as a 64 bit value both to and from the kernel (e.g. xfs_dump, file handle encoding, etc), so the idea that we can now take bits back from what we've already agreed to expose to userspace is fraught with problems. That's the problem I see here - it's not that we /can't/ implement s_max_ino_bits, the problem is that once we publish it we can't change it because it will cause random breakage of applications using it. And because we've already effectively published it to userspace applications as s_max_ino_bits = 64, there's no scope for movement at all. 
> Do you realize that the majority of those users are settling for things > like: no directory rename, breaking hardlinks on copy up. > Those are "features" of overlayfs that have been fixed in recent kernels, > but only now on their way to distro kernels and not yet enabled > by container runtimes. > > Container admins already make the choice of underlying fileystem > concisely to get the best from overlayfs and I would expect that > they will soon be opting in for xfs+reflink because of that concience > choice. If ever xfs decides to change inode numbers address space > on kernel upgrade without users opting in for it, We've done this many times in the past. e.g. we changed the default inode allocation policy from inode32 to inode64 back in 2012. That means users, on kernel upgrade, silently went from 32 bit inodes to 64 bit inodes. We've done this because of the fact that the *filesystem owns the entire inode number space* and as long as we don't change individual inode numbers that users see for a specific inode, we can do whatever we want inside that inode number space. > > Given that overlay has a persistent inode numbering problem, why > > doesn't overlay just allocate and store it's own inode numbers and > > other required persistent state in an xattr? > > > > First, this is not as simple as it sounds. Sure, just like s_max_ino_bits is not as simple as it sounds. If we want to explicitly reserve part of the inode number space for other layers to use for their own purposes, then we need to explicitly and persistently support that in the underlying filesystem. That means mkfs, repair, db, growfs, etc all need to understand that inode numbers have a size limit and do the right thing... That makes it an opt-in configuration that we can test and support without having to care about overlay implementations or backwards compatibility across applications on existing filesystems. 
> Second, and this may be a revolutionary argument, I would like to > believe that we are all working together for a "greater good". I don't say no for the fun of saying no. I say no because I think something is a bad idea. Just because I say no doesn't mean I don't want to solve the problem. It just means that I think the solution being presented is a bad idea and we need to explore the problem space for a more robust solution. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Question about XFS_MAXINUMBER 2018-03-20 1:47 ` Dave Chinner @ 2018-03-20 6:29 ` Amir Goldstein 2018-03-20 8:04 ` Ian Kent 2018-03-20 13:08 ` Dave Chinner 2018-03-20 9:32 ` Miklos Szeredi 1 sibling, 2 replies; 20+ messages in thread From: Amir Goldstein @ 2018-03-20 6:29 UTC (permalink / raw) To: Dave Chinner Cc: Miklos Szeredi, Darrick J. Wong, linux-xfs, overlayfs, linux-fsdevel On Tue, Mar 20, 2018 at 3:47 AM, Dave Chinner <david@fromorbit.com> wrote: > On Mon, Mar 19, 2018 at 06:03:30AM +0200, Amir Goldstein wrote: [...] >> Well, it is not an assumption if filesystem is inclined to publish >> s_max_ino_bits, which is not that different in concept from publishing >> s_maxbytes and s_max_links, which are also limitations in current >> kernel/sb that could be lifted in the future. > > It is different, because you're expecting to be able to publish > persistent user visible information based on it. > > If we change s_max_ino_bits in the underlying filesystem, then > overlay inode numbers change and that can cause all sorts of problem > with things like filehandles, backups that use dev/inode number > tuples to detect identical files, etc. i.e. there's a heap of > downstream impacts of changing inode numbers. If we have to > publish s_max_ino_bits to the VFS, we essentially fix the ABI of the > user visible inode number the filesysetm publishes. IOWs, we > effectively can't change it without breaking external users. > You are right. > I suspect you don't realise we already expose the full 64 bit > inode number space completely to userspace through other ABIs. e.g. > the bulkstat ioctls. We've already got applications that use the XFS > inode number as a 64 bit value both to and from the kernel (e.g. > xfs_dump, file handle encoding, etc), so the idea that we can now > take bits back from what we've already agreed to expose to userspace > is fraught with problems. I'm sorry. There must be something I am missing. 
Are users exposed to high ino bits via xfs tools other than NULLFSINO NULLAGINO? If they are then I did not find where. And w.r.t to NULLINO (-1), that ino is not exposed via getattr() and readdir(), so not a problem for overlayfs. > > That's the problem I see here - it's not that we /can't/ implement > s_max_ino_bits, the problem is that once we publish it we can't > change it because it will cause random breakage of applications > using it. And because we've already effectively published it to > userspace applications as s_max_ino_bits = 64, there's no scope for > movement at all. > Agreed. So we can add an explicit compat feature bit to declare that user would like to limit future use of high ino bits on his fs. Makes me wonder, how come there is no feature to block "inode64" mount option, so user can declare he wishes to keep the fs fully compatible for mounting on 32bit systems? [...] > We've done this many times in the past. e.g. we changed the default > inode allocation policy from inode32 to inode64 back in 2012. That > means users, on kernel upgrade, silently went from 32 bit inodes to > 64 bit inodes. We've done this because of the fact that the > *filesystem owns the entire inode number space* and as long as we > don't change individual inode numbers that users see for a specific > inode, we can do whatever we want inside that inode number space. > Right. My main point is that, unless I am missing something, never in xfs history, was a non NULL inode number exposed to user with high 8 bits used, so at least forward/backward compat for "inode56" feature is not going to be a big challenge. >> > Given that overlay has a persistent inode numbering problem, why >> > doesn't overlay just allocate and store it's own inode numbers and >> > other required persistent state in an xattr? >> > >> >> First, this is not as simple as it sounds. > > Sure, just like s_max_ino_bits is not as simple as it sounds. 
It never is ;-) > > If we want to explicitly reserve part of the inode number space for > other layers to use for their own purposes, then we need to > explicitly and persistently support that in the underlying > filesystem. That means mkfs, repair, db, growfs, etc all need to > understand that inode numbers have a size limit and do the right > thing... > > That makes it an opt-in configuration that we can test and support > without having to care about overlay implementations or backwards > compatibility across applications on existing filesystems. > OK. I'll work on a proposal. >> Second, and this may be a revolutionary argument, I would like to >> believe that we are all working together for a "greater good". > > I don't say no for the fun of saying no. I say no because I think > something is a bad idea. Just because I say no doesn't mean I don't > don't want to solve the problem. It just means that I think the > solution being presented is a bad idea and we need to explore the > problem space for a more robust solution. > And I do appreciate the time you've put into understanding the overlayfs problem and explaining the problems with my current proposal. Thanks, Amir. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Question about XFS_MAXINUMBER 2018-03-20 6:29 ` Amir Goldstein @ 2018-03-20 8:04 ` Ian Kent 2018-03-20 8:57 ` Amir Goldstein 2018-03-20 9:20 ` Miklos Szeredi 2018-03-20 13:08 ` Dave Chinner 1 sibling, 2 replies; 20+ messages in thread From: Ian Kent @ 2018-03-20 8:04 UTC (permalink / raw) To: Amir Goldstein, Dave Chinner Cc: Miklos Szeredi, Darrick J. Wong, linux-xfs, overlayfs, linux-fsdevel Hi Amir, Miklos, On 20/03/18 14:29, Amir Goldstein wrote: > > And I do appreciate the time you've put into understanding the overlayfs > problem and explaining the problems with my current proposal. > For a while now I've been wondering why overlayfs is keen to avoid using a local, persistent, inode number mapping cache? Sure there can be subtle problems with them but there are problems with other alternatives too. Ian ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Question about XFS_MAXINUMBER 2018-03-20 8:04 ` Ian Kent @ 2018-03-20 8:57 ` Amir Goldstein 2018-03-20 10:18 ` Ian Kent 2018-03-20 9:20 ` Miklos Szeredi 1 sibling, 1 reply; 20+ messages in thread From: Amir Goldstein @ 2018-03-20 8:57 UTC (permalink / raw) To: Ian Kent Cc: Dave Chinner, Miklos Szeredi, Darrick J. Wong, linux-xfs, overlayfs, linux-fsdevel On Tue, Mar 20, 2018 at 10:04 AM, Ian Kent <raven@themaw.net> wrote: > Hi Amir, Miklos, > > On 20/03/18 14:29, Amir Goldstein wrote: >> >> And I do appreciate the time you've put into understanding the overlayfs >> problem and explaining the problems with my current proposal. >> > > For a while now I've been wondering why overlayfs is keen to avoid using > a local, persistent, inode number mapping cache? > A local persistent inode map is a more complex solution. If you remove re-factoring, my patch set adds less than 100 lines of code and it solves the problem for many real world setups. A more complex solution needs a use case in the real world to justify it over a less complex solution. I am not saying we can avoid the complex solution forever, but so far, I did not yet see the requests from users to justify it. > Sure there can be subtle problems with them but there are problems with > other alternatives too. > There is a difference between "not applicable" and "problematic" The -o xino solution is not applicable to all setups, but I am not aware of any problems with this solution. Even without underlying filesystem declaring number of used ino bit, user can declare this with overlayfs mount option, so practically, the problem for overlayfs over xfs is already solved. The discussion about a VFS API for max_ino_bits is to make users life easier, but the API is not required to fix the overlayfs inode number problem. Thanks, Amir. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Question about XFS_MAXINUMBER 2018-03-20 8:57 ` Amir Goldstein @ 2018-03-20 10:18 ` Ian Kent 0 siblings, 0 replies; 20+ messages in thread From: Ian Kent @ 2018-03-20 10:18 UTC (permalink / raw) To: Amir Goldstein Cc: Dave Chinner, Miklos Szeredi, Darrick J. Wong, linux-xfs, overlayfs, linux-fsdevel On 20/03/18 16:57, Amir Goldstein wrote: > On Tue, Mar 20, 2018 at 10:04 AM, Ian Kent <raven@themaw.net> wrote: >> Hi Amir, Miklos, >> >> On 20/03/18 14:29, Amir Goldstein wrote: >>> >>> And I do appreciate the time you've put into understanding the overlayfs >>> problem and explaining the problems with my current proposal. >>> >> >> For a while now I've been wondering why overlayfs is keen to avoid using >> a local, persistent, inode number mapping cache? >> > > A local persistent inode map is a more complex solution. > If you remove re-factoring, my patch set adds less than 100 lines of code > and it solves the problem for many real world setups. > A more complex solution needs a use case in the real world to justify > it over a less complex solution. Indeed, it is significantly more complex. Ian ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Question about XFS_MAXINUMBER 2018-03-20 8:04 ` Ian Kent 2018-03-20 8:57 ` Amir Goldstein @ 2018-03-20 9:20 ` Miklos Szeredi 1 sibling, 0 replies; 20+ messages in thread From: Miklos Szeredi @ 2018-03-20 9:20 UTC (permalink / raw) To: Ian Kent Cc: Amir Goldstein, Dave Chinner, Darrick J. Wong, linux-xfs, overlayfs, linux-fsdevel On Tue, Mar 20, 2018 at 9:04 AM, Ian Kent <raven@themaw.net> wrote: > Hi Amir, Miklos, > > On 20/03/18 14:29, Amir Goldstein wrote: >> >> And I do appreciate the time you've put into understanding the overlayfs >> problem and explaining the problems with my current proposal. >> > > For a while now I've been wondering why overlayfs is keen to avoid using > a local, persistent, inode number mapping cache? Think of overlayfs as a normal filesystem, except it's not backed by a block device, but instead one or more read-only directory tree and optionally one writable directory tree. There's a twist, however: when not mounted, you are allowed to change the backing directories. This is a really important feature of overlayfs. So where does the initial mapping come from (overlay is never started from scratch, like a newly formatted filesystem)? And what happens when layers are modified and we encounter unmapped inode numbers? In both cases we must either create/update the mapping before mount, or update the mapping on lookup. Creating/updating the mapping up-front means a really high startup cost, which can be amortized if the layers are guaranteed not to change outside of the overlay. Updating a persistent mapping on lookup means having to do sync writes on lookup, which can be very detrimental to performance. If all layers are read-only, this scheme falls apart, since we've nowhere to write the persistent mapping. Or we can just say, screw the persistency and store the mapping on e.g. tmpfs. Performance-wise that's much better, but then we fail to provide the guarantees about inode numbers (e.g. NFS export won't work properly). 
In my opinion it's much less about simplicity of implementation as about quality of implementation. Ideas for fixing the above issues are welcome. Thanks, Miklos ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Question about XFS_MAXINUMBER 2018-03-20 6:29 ` Amir Goldstein 2018-03-20 8:04 ` Ian Kent @ 2018-03-20 13:08 ` Dave Chinner 1 sibling, 0 replies; 20+ messages in thread From: Dave Chinner @ 2018-03-20 13:08 UTC (permalink / raw) To: Amir Goldstein Cc: Miklos Szeredi, Darrick J. Wong, linux-xfs, overlayfs, linux-fsdevel On Tue, Mar 20, 2018 at 08:29:35AM +0200, Amir Goldstein wrote: > On Tue, Mar 20, 2018 at 3:47 AM, Dave Chinner <david@fromorbit.com> wrote: > > On Mon, Mar 19, 2018 at 06:03:30AM +0200, Amir Goldstein wrote: > [...] > >> Well, it is not an assumption if filesystem is inclined to publish > >> s_max_ino_bits, which is not that different in concept from publishing > >> s_maxbytes and s_max_links, which are also limitations in current > >> kernel/sb that could be lifted in the future. > > > > It is different, because you're expecting to be able to publish > > persistent user visible information based on it. > > > > If we change s_max_ino_bits in the underlying filesystem, then > > overlay inode numbers change and that can cause all sorts of problem > > with things like filehandles, backups that use dev/inode number > > tuples to detect identical files, etc. i.e. there's a heap of > > downstream impacts of changing inode numbers. If we have to > > publish s_max_ino_bits to the VFS, we essentially fix the ABI of the > > user visible inode number the filesysetm publishes. IOWs, we > > effectively can't change it without breaking external users. > > > > You are right. > > > I suspect you don't realise we already expose the full 64 bit > > inode number space completely to userspace through other ABIs. e.g. > > the bulkstat ioctls. We've already got applications that use the XFS > > inode number as a 64 bit value both to and from the kernel (e.g. > > xfs_dump, file handle encoding, etc), so the idea that we can now > > take bits back from what we've already agreed to expose to userspace > > is fraught with problems. > > I'm sorry. 
There must be something I am missing. > Are users exposed to high ino bits via xfs tools other than NULLFSINO > NULLAGINO? If they are then I did not find where. > And w.r.t. NULLINO (-1), that ino is not exposed via getattr() and readdir(), > so not a problem for overlayfs. Bulkstat exposes the on-disk inode number directly to userspace, and other ioctls take those inode numbers back in as ioctl parameters (e.g. as bulkstat iteration cookies) and as part of userspace constructed filehandles (i.e. in libhandle, xfs_fsr, xfsdump, etc). The filehandles are explicitly encoded with 64 bit inode numbers.... > > That's the problem I see here - it's not that we /can't/ implement > > s_max_ino_bits, the problem is that once we publish it we can't > > change it because it will cause random breakage of applications > > using it. And because we've already effectively published it to > > userspace applications as s_max_ino_bits = 64, there's no scope for > > movement at all. > > > > Agreed. So we can add an explicit compat feature bit to declare that user > would like to limit future use of high ino bits on his fs. > Makes me wonder, how come there is no feature to block "inode64" > mount option, so user can declare he wishes to keep the fs fully > compatible for mounting on 32bit systems? Because inode64 was the original mechanism for allocating inodes. inode32 was introduced years after XFS was first shipped. You need to go ask the old Irix engineers why they implemented inode32 as a mount option and not an on-disk feature flag and created the mess that is the inode32 mount option. These days, inode32 reads 64 bit inodes just fine - it just can't create new 64 bit inode numbers. And if you *really* still need only 32 bit inodes in this day and age, there's that old xfs_reno tool: http://xfs.org/index.php/Unfinished_work#The_xfs_reno_tool Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Question about XFS_MAXINUMBER 2018-03-20 1:47 ` Dave Chinner 2018-03-20 6:29 ` Amir Goldstein @ 2018-03-20 9:32 ` Miklos Szeredi 1 sibling, 0 replies; 20+ messages in thread From: Miklos Szeredi @ 2018-03-20 9:32 UTC (permalink / raw) To: Dave Chinner Cc: Amir Goldstein, Darrick J. Wong, linux-xfs, overlayfs, linux-fsdevel On Tue, Mar 20, 2018 at 2:47 AM, Dave Chinner <david@fromorbit.com> wrote: >> Second, and this may be a revolutionary argument, I would like to >> believe that we are all working together for a "greater good". > > I don't say no for the fun of saying no. I say no because I think > something is a bad idea. Just because I say no doesn't mean I don't > don't want to solve the problem. It just means that I think the > solution being presented is a bad idea and we need to explore the > problem space for a more robust solution. Totally agreed, let's do that. I've presented the issues I see with creating a generic (i.e. non-multiplexing) inode number mapping for overlayfs in answer to Ian's mail. Do you see a way this problem can be solved without those issues? Thanks, Miklos ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Question about XFS_MAXINUMBER 2018-03-17 5:40 ` Miklos Szeredi 2018-03-17 7:56 ` Amir Goldstein @ 2018-03-17 8:04 ` Dave Chinner 2018-03-17 8:24 ` Amir Goldstein 1 sibling, 1 reply; 20+ messages in thread From: Dave Chinner @ 2018-03-17 8:04 UTC (permalink / raw) To: Miklos Szeredi; +Cc: Amir Goldstein, Darrick J. Wong, linux-xfs, overlayfs On Sat, Mar 17, 2018 at 06:40:23AM +0100, Miklos Szeredi wrote: > On Fri, Mar 16, 2018 at 11:24 PM, Dave Chinner <david@fromorbit.com> wrote: > > On Fri, Mar 16, 2018 at 04:05:22PM +0200, Amir Goldstein wrote: > >> Hi guys, > >> > >> I am trying to get a lower bound for unused inode number MSB on > >> a mounted xfs super block, so I can publish it on struct super_block. > > > > Sorry, what? > > > > The inode number is owned by the filesystem - nobody should be > > touching it or making assumptions they can screw with it in any way. > > > >> This doesn't need to be a tight lower bound, but it needs to be > >> a loewr bound that cannot change with growfs nor when > >> remounting with different options (i.e. inode64). > >> > >> This is needed for overlayfs to be able to use the unused upper bits > >> for overlayfs inode number namespace (see [1]). > > > > SO you're assuming that filesystems don't ever encode information > > into their inode numbers. I've already got plans to use a bunch of > > the unused upper bits in the inode number internally in XFS for > > subvolumes, and ISTR that Darrick was mulling a use for some of > > them a while back, too... > > > >> I realize that for a given agcount, a "soft" lower bound of unused > >> upper bits is agno_log-agblklog-inopblog, which makes the "hard" > >> lower bound 32-agblklog-inopblog, so I think I can use this number. > >> > >> I was staring at this definition and tried to figure out where this > >> absolute limit of 56 used bits came from: > >> #define XFS_MAXINUMBER ((xfs_ino_t)((1ULL << 56) - 1ULL)) > >> > >> Is this number really correct? 
If yes, then where does the constraint
> >> on maximum 56 bits come from?
> >
> > Yes, 56 bits is the current maximum *physical* inode number - the
> > inode number is currently a physical representation of the location
> > on disk. 56 bits is needed to represent inodes in 2^63 bytes of
> > physical space.
> >
> > Off the top of my head, it works out something like this for a
> > 512 byte inode, 4k block size filesystem:
> >
> >   bits    range   meaning
> >   6       0-63    inode # in chunk
> >   7-22    1TB     block offset in AG of inode 0
> >                   blkspag / bsize / inopblk
> >                   2^30 / 2^12 / 2^3 = 2^15
> >   23-55   AGNO    AG number
> >
> > The breakdown of bits changes for different inode and block sizes,
> > but the worst case comes out somewhere around 56 bits...
> >
> > *but*
> >
> > #define NULLFSINO ((xfs_ino_t)-1)
> >
> > is a valid inode number on disk, indicating that the field is not
> > holding an inode number. The MSB indicates the inode number is a
> > "virtual" inode number, holding some special significance that is
> > not directly a physical inode number. Hence we actually use all 64
> > bits of the inode number on disk, and hence there are no free bits
> > in the inode number for anyone outside XFS to use.
> >
> > IOWs, I think your plan is DOA because we already use the entire 64
> > bit space in the inode number field and have plans for the "unused
> > bits" already in motion....
>
> We don't care about internal or on-disk use.
>
> Does that still make it DOA?

Yes, because we reserve the full 64 bits for internal filesystem use.
Just because we aren't using them right now doesn't mean we'll never
use them.

> I ask, because we've thought long and hard about what to do for
> multiplexing inum space in overlayfs, and found no other sane options.
> Ideas welcome, of course.

Why do you need to "multiplex" the inum space?
Perhaps you'd do better to start with a description of why you want to play games with inode numbers, rather than just posting a patch to steal bits from other filesystem inode number spaces.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 20+ messages in thread
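As a rough illustration of the physical layout discussed in this message, the sketch below composes an inode number from an AG number, a block offset within the AG and an inode offset within the block, and computes the "soft" and "hard" unused-bit bounds from earlier in the thread (64 - agno_log - agblklog - inopblog and 32 - agblklog - inopblog). The struct and function names are hypothetical and are not the kernel's; the "hard" bound assumes, as in Amir's reasoning, that the AG number can grow to at most 32 bits. Only the `(1ULL << 56) - 1` value is taken from the quoted XFS_MAXINUMBER definition.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical geometry holder; the field names follow the logs named in the thread. */
struct ino_geom {
	unsigned int inopblog;	/* log2(inodes per block) */
	unsigned int agblklog;	/* log2(blocks per AG), rounded up */
	unsigned int agno_log;	/* log2(AG count), rounded up */
};

/* Same value as the XFS_MAXINUMBER definition quoted in the thread. */
#define MAXINUMBER ((uint64_t)((1ULL << 56) - 1ULL))

/* Compose a physical inode number: AG number | block in AG | inode in block. */
static uint64_t ino_compose(const struct ino_geom *g, uint64_t agno,
			    uint64_t agbno, uint64_t offset)
{
	assert(g->agblklog + g->inopblog < 64);
	assert(offset < (1ULL << g->inopblog));
	assert(agbno < (1ULL << g->agblklog));

	return (agno << (g->agblklog + g->inopblog)) |
	       (agbno << g->inopblog) | offset;
}

/* "Soft" bound: upper bits unused with the current AG count. */
static unsigned int ino_unused_bits_soft(const struct ino_geom *g)
{
	return 64 - g->agno_log - g->agblklog - g->inopblog;
}

/* "Hard" bound: assumes the AG number may grow to a full 32 bits. */
static unsigned int ino_unused_bits_hard(const struct ino_geom *g)
{
	return 32 - g->agblklog - g->inopblog;
}

/* True when ino does not exceed the 56-bit maximum physical inode number. */
static int ino_fits_maxinumber(uint64_t ino)
{
	return ino <= MAXINUMBER;
}
```

Note that this sketch only covers physical inode numbers. It says nothing about values such as NULLFSINO, which is the point Dave makes about the full 64 bits being in use.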
* Re: Question about XFS_MAXINUMBER 2018-03-17 8:04 ` Dave Chinner @ 2018-03-17 8:24 ` Amir Goldstein 0 siblings, 0 replies; 20+ messages in thread From: Amir Goldstein @ 2018-03-17 8:24 UTC (permalink / raw) To: Dave Chinner; +Cc: Miklos Szeredi, Darrick J. Wong, linux-xfs, overlayfs

On Sat, Mar 17, 2018 at 10:04 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Sat, Mar 17, 2018 at 06:40:23AM +0100, Miklos Szeredi wrote:
[...]
>> I ask, because we've thought long and hard about what to do for
>> multiplexing inum space in overlayfs, and found no other sane options.
>> Ideas welcome, of course.
>
> Why do you need to "multiplex" the inum space? perhaps you'd do
> better to start with a description of why you want to play games
> with inode numbers, rather than just posting a patch to steal bits
> from other filesystem inode number spaces....
>

I think this patch perhaps explains best what we want to do:
https://marc.info/?l=linux-unionfs&m=151007386219743&w=2

I had already given a simple example in an earlier response.

As to the "why" question, we have several requirements for overlay
inode numbers:

1. st_ino is persistent
2. st_ino/st_dev pair is unique in the system
3. st_ino is consistent with d_ino
4. st_ino doesn't change on copy up
5. st_dev is uniform across all overlay inodes

With upstream overlayfs we meet all requirements above for the case
of all underlying layers on the same fs, by using a real underlying inode
st_ino and the overlay st_dev.

With the 'xino' patch set [1], we can meet all requirements above also
for the case of underlying layers on different fs, by multiplexing
the inum space, as long as we know about unused high ino bits.

The ovl-xino branch already has the xfs patch (not yet posted) to
publish max_ino_bits.

Cheers!
Amir.

[1] https://github.com/amir73il/linux/commits/ovl-xino

^ permalink raw reply [flat|nested] 20+ messages in thread
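To make "multiplexing the inum space" concrete, here is a minimal userspace sketch. It is not the code from the xino patch set: `xino_map()` and its parameters are hypothetical, and it only shows the general idea of placing a small per-filesystem index in high bits that the underlying filesystem is known not to use, and refusing to do so when the real inode number already occupies those bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: pack fs_index into the top xino_bits of a 64-bit
 * inode number. Returns false when fs_index does not fit or when real_ino
 * already uses the reserved high bits; the caller then has to fall back
 * to a non-multiplexed scheme.
 */
static bool xino_map(uint64_t real_ino, unsigned int fs_index,
		     unsigned int xino_bits, uint64_t *out)
{
	assert(xino_bits > 0 && xino_bits < 64);
	unsigned int shift = 64 - xino_bits;

	if (((uint64_t)fs_index >> xino_bits) != 0)
		return false;	/* index needs more than xino_bits */
	if ((real_ino >> shift) != 0)
		return false;	/* high bits already in use */

	*out = ((uint64_t)fs_index << shift) | real_ino;
	return true;
}
```

Taking the example from earlier in the thread, an upper file and a lower file that both have real inode number 10 on different filesystems map to different values once they carry different indices, for instance with 8 reserved bits: `xino_map(10, 0, 8, &a)` and `xino_map(10, 1, 8, &b)` produce distinct results.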