* Question about XFS_MAXINUMBER
@ 2018-03-16 14:05 Amir Goldstein
  2018-03-16 17:59 ` Amir Goldstein
  2018-03-16 22:24 ` Dave Chinner
  0 siblings, 2 replies; 20+ messages in thread
From: Amir Goldstein @ 2018-03-16 14:05 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, Miklos Szeredi, overlayfs

Hi guys,

I am trying to get a lower bound for unused inode number MSB on
a mounted xfs super block, so I can publish it on struct super_block.

This doesn't need to be a tight lower bound, but it needs to be
a lower bound that cannot change with growfs nor when
remounting with different options (i.e. inode64).

This is needed for overlayfs to be able to use the unused upper bits
for overlayfs inode number namespace (see [1]).

I realize that for a given agcount, a "soft" lower bound of unused
upper bits is agno_log-agblklog-inopblog, which makes the "hard"
lower bound 32-agblklog-inopblog, so I think I can use this number.
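
A minimal sketch of that arithmetic, assuming the inode number is laid out
from the top down as AG number, block within AG, inode within block, so that
the unused upper bits are 64 minus the three field widths. The helper names
are hypothetical and are not taken from the XFS sources; the 32-bit ceiling
on the AG number width is the one implied by the "hard" bound above.

```c
#include <assert.h>

/* Hypothetical helper: bits left above the AG number field of a 64-bit inode number. */
static unsigned int unused_ino_bits(unsigned int agno_log,
				    unsigned int agblklog,
				    unsigned int inopblog)
{
	unsigned int used = agno_log + agblklog + inopblog;

	/* The three fields together cannot exceed the 64-bit inode number. */
	assert(used <= 64);
	return 64 - used;
}

/* "Hard" bound: assume the AG number may grow to a full 32 bits (e.g. after growfs). */
static unsigned int unused_ino_bits_hard(unsigned int agblklog,
					 unsigned int inopblog)
{
	return unused_ino_bits(32, agblklog, inopblog);
}
```

With agblklog = 20 and inopblog = 3, for example, the hard bound works out
to 32 - 20 - 3 = 9 bits.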

I was staring at this definition and tried to figure out where this
absolute limit of 56 used bits came from:
 #define XFS_MAXINUMBER          ((xfs_ino_t)((1ULL << 56) - 1ULL))

Is this number really correct? If yes, then where does the constraint
on maximum 56 bits come from?

Thanks,
Amir.

[1] https://marc.info/?l=linux-unionfs&m=151007386419753&w=2


* Re: Question about XFS_MAXINUMBER
  2018-03-16 14:05 Question about XFS_MAXINUMBER Amir Goldstein
@ 2018-03-16 17:59 ` Amir Goldstein
  2018-03-16 22:24 ` Dave Chinner
  1 sibling, 0 replies; 20+ messages in thread
From: Amir Goldstein @ 2018-03-16 17:59 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, Miklos Szeredi, overlayfs

On Fri, Mar 16, 2018 at 4:05 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> Hi guys,
>
> I am trying to get a lower bound for unused inode number MSB on
> a mounted xfs super block, so I can publish it on struct super_block.
>
> This doesn't need to be a tight lower bound, but it needs to be
> a loewr bound that cannot change with growfs nor when
> remounting with different options (i.e. inode64).
>
> This is needed for overlayfs to be able to use the unused upper bits
> for overlayfs inode number namespace (see [1]).
>
> I realize that for a given agcount, a "soft" lower bound of unused
> upper bits is agno_log-agblklog-inopblog, which makes the "hard"

Hmm, I copied that typo from the comment in xfs_format.h.
Unless I am missing something the amount of unused upper bits
is 64 - agno_log - agblklog - inopblog. Hence the "hard" limit below:

> lower bound 32-agblklog-inopblog, so I think I can use this number.
>
> I was staring at this definition and tried to figure out where this
> absolute limit of 56 used bits came from:
>  #define XFS_MAXINUMBER          ((xfs_ino_t)((1ULL << 56) - 1ULL))
>
> Is this number really correct? If yes, then where does the constrain
> on maximum 56 bits come from?
>
> Thanks,
> Amir.
>
> [1] https://marc.info/?l=linux-unionfs&m=151007386419753&w=2


* Re: Question about XFS_MAXINUMBER
  2018-03-16 14:05 Question about XFS_MAXINUMBER Amir Goldstein
  2018-03-16 17:59 ` Amir Goldstein
@ 2018-03-16 22:24 ` Dave Chinner
  2018-03-17  5:40   ` Miklos Szeredi
  1 sibling, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2018-03-16 22:24 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Darrick J. Wong, linux-xfs, Miklos Szeredi, overlayfs

On Fri, Mar 16, 2018 at 04:05:22PM +0200, Amir Goldstein wrote:
> Hi guys,
> 
> I am trying to get a lower bound for unused inode number MSB on
> a mounted xfs super block, so I can publish it on struct super_block.

Sorry, what?

The inode number is owned by the filesystem - nobody should be
touching it or making assumptions they can screw with it in any way.

> This doesn't need to be a tight lower bound, but it needs to be
> a loewr bound that cannot change with growfs nor when
> remounting with different options (i.e. inode64).
>
> This is needed for overlayfs to be able to use the unused upper bits
> for overlayfs inode number namespace (see [1]).

So you're assuming that filesystems don't ever encode information
into their inode numbers. I've already got plans to use a bunch of
the unused upper bits in the inode number internally in XFS for
subvolumes, and ISTR that Darrick was mulling a use for some of
them a while back, too...

> I realize that for a given agcount, a "soft" lower bound of unused
> upper bits is agno_log-agblklog-inopblog, which makes the "hard"
> lower bound 32-agblklog-inopblog, so I think I can use this number.
> 
> I was staring at this definition and tried to figure out where this
> absolute limit of 56 used bits came from:
>  #define XFS_MAXINUMBER          ((xfs_ino_t)((1ULL << 56) - 1ULL))
> 
> Is this number really correct? If yes, then where does the constrain
> on maximum 56 bits come from?

Yes, 56 bits is the current maximum *physical* inode number - the
inode number is currently a physical representation of the location
on disk. 56 bits is needed to represent inodes in 2^63 bytes of
physical space.

Off the top of my head, it works out something like this for a
512 byte inode, 4k block size filesystem:

bits		range		meaning
6		0-63		inode # in chunk
7-22		1TB		block offset in AG of inode 0
				blkspag / bsize / inopblk
				2^30 / 2^12 / 2^3 = 2^15
23-55		AGNO		AG number

The breakdown of bits changes for different inode and block sizes,
but the worst case comes out somewhere around 56 bits...
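
The layout in the table can be sketched as a pair of shifts and masks, using
the field names from the original question (inopblog, agblklog). This is an
illustration only: split_ino() and struct ino_parts are hypothetical names,
the field widths are passed in as parameters rather than read from a
superblock, and the in-kernel macros may differ in detail.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical decomposition of a physical inode number. */
struct ino_parts {
	uint64_t agno;		/* allocation group number */
	uint64_t agbno;		/* block offset within the AG */
	uint64_t offset;	/* inode index within the block */
};

static struct ino_parts split_ino(uint64_t ino, unsigned int inopblog,
				  unsigned int agblklog)
{
	struct ino_parts p;

	/* Keep every shift count strictly below the width of uint64_t. */
	assert(inopblog + agblklog < 64);

	p.offset = ino & ((UINT64_C(1) << inopblog) - 1);
	p.agbno = (ino >> inopblog) & ((UINT64_C(1) << agblklog) - 1);
	p.agno = ino >> (inopblog + agblklog);
	return p;
}
```

Whatever is left above the AG number field after this split is the "unused"
space the original question is asking about.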

*but*

#define NULLFSINO ((xfs_ino_t)-1)

is a valid inode number on disk, indicating that the field is not
holding an inode number. The MSB indicates the inode number is a
"virtual" inode number, holding some special significance that is
not directly a physical inode number.  Hence we actually use all 64
bits of the inode number on disk, and hence there are no free bits
in the inode number for anyone outside XFS to use.
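
To make the point concrete, the sketch below compares the two definitions
quoted in this thread. The typedef of xfs_ino_t as uint64_t is an assumption
made so the fragment stands alone, and ino_in_physical_range() is a
hypothetical helper, not an XFS function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t xfs_ino_t;	/* assumption: stand-in for the kernel type */

#define XFS_MAXINUMBER	((xfs_ino_t)((1ULL << 56) - 1ULL))
#define NULLFSINO	((xfs_ino_t)-1)

/* NULLFSINO has every bit set, so it lies outside the 56-bit physical range. */
static_assert(NULLFSINO > XFS_MAXINUMBER, "NULLFSINO must not look physical");

/* Hypothetical check: does this value fit in the physical inode number range? */
static bool ino_in_physical_range(xfs_ino_t ino)
{
	return ino <= XFS_MAXINUMBER;
}
```

A consumer that ORs a tag into bit 63 would therefore produce values that
overlap the range in which NULLFSINO already lives.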

IOWs, I think your plan is DOA because we already use the entire 64
bit space in the inode number field and have plans for the "unused
bits" already in motion....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Question about XFS_MAXINUMBER
  2018-03-16 22:24 ` Dave Chinner
@ 2018-03-17  5:40   ` Miklos Szeredi
  2018-03-17  7:56     ` Amir Goldstein
  2018-03-17  8:04     ` Dave Chinner
  0 siblings, 2 replies; 20+ messages in thread
From: Miklos Szeredi @ 2018-03-17  5:40 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Amir Goldstein, Darrick J. Wong, linux-xfs, overlayfs

On Fri, Mar 16, 2018 at 11:24 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Fri, Mar 16, 2018 at 04:05:22PM +0200, Amir Goldstein wrote:
>> Hi guys,
>>
>> I am trying to get a lower bound for unused inode number MSB on
>> a mounted xfs super block, so I can publish it on struct super_block.
>
> Sorry, what?
>
> The inode number is owned by the filesystem - nobody should be
> touching it or making assumptions they can screw with it in any way.
>
>> This doesn't need to be a tight lower bound, but it needs to be
>> a loewr bound that cannot change with growfs nor when
>> remounting with different options (i.e. inode64).
>>
>> This is needed for overlayfs to be able to use the unused upper bits
>> for overlayfs inode number namespace (see [1]).
>
> SO you're assuming that filesystems don't ever encode information
> into their inode numbers. I've already got plans to use a bunch of
> the unused upper bits in the inode number internally in XFS for
> subvolumes, and ISTR that Darrick was mulling a use for some of
> them a while back, too...
>
>> I realize that for a given agcount, a "soft" lower bound of unused
>> upper bits is agno_log-agblklog-inopblog, which makes the "hard"
>> lower bound 32-agblklog-inopblog, so I think I can use this number.
>>
>> I was staring at this definition and tried to figure out where this
>> absolute limit of 56 used bits came from:
>>  #define XFS_MAXINUMBER          ((xfs_ino_t)((1ULL << 56) - 1ULL))
>>
>> Is this number really correct? If yes, then where does the constrain
>> on maximum 56 bits come from?
>
> Yes, 56 bits is the current maximum *physical* inode number - the
> inode number is currently a physical representation of the location
> on disk. 56 bits is needed to represent inodes in 2^63 bytes of
> physical space.
>
> Off the top of my head, it works out something like this for a
> a 512 byte inode, 4k block size filesystem:
>
> bits            range           meaning
> 6               0-63            inode # in chunk
> 7-22            1TB             block offset in AG of inode 0
>                                 blkspag / bsize / inopblk
>                                 2^30 / 2^12 / 2^3 = 2^15
> 23-55           AGNO            AG number
>
> The breakdown of bits change for different inode and block sizes,
> but the worse case comes out somewhere around 56 bits...
>
> *but*
>
> #define NULLFSINO ((xfs_ino_t)-1)
>
> is a valid inode number on disk, indicating that the field is not
> holding an inode number. the MSB indicates the inode number is a
> "virtual" inode number, holding some special significance that is
> not directly a physical inode number.  Hence we actually use all 64
> bits of the inode number on disk, and hence there are no free bits
> in the inode number for anyone outside XFS to use.
>
> IOWs, I think your plan is DOA because we already use the entire 64
> bit space in the inode number field and have plans for the "unused
> bits" already in motion....

We don't care about internal or on-disk use.

Does that still make it DOA?

I ask, because we've thought long and hard about what to do for
multiplexing inum space in overlayfs, and found no other sane options.
Ideas welcome, of course.

Thanks,
Miklos


* Re: Question about XFS_MAXINUMBER
  2018-03-17  5:40   ` Miklos Szeredi
@ 2018-03-17  7:56     ` Amir Goldstein
  2018-03-17 21:28       ` Dave Chinner
  2018-03-17  8:04     ` Dave Chinner
  1 sibling, 1 reply; 20+ messages in thread
From: Amir Goldstein @ 2018-03-17  7:56 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Dave Chinner, Darrick J. Wong, linux-xfs, overlayfs

On Sat, Mar 17, 2018 at 7:40 AM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> On Fri, Mar 16, 2018 at 11:24 PM, Dave Chinner <david@fromorbit.com> wrote:
>> On Fri, Mar 16, 2018 at 04:05:22PM +0200, Amir Goldstein wrote:
>>> Hi guys,
>>>
>>> I am trying to get a lower bound for unused inode number MSB on
>>> a mounted xfs super block, so I can publish it on struct super_block.
>>
>> Sorry, what?
>>
>> The inode number is owned by the filesystem - nobody should be
>> touching it or making assumptions they can screw with it in any way.
>>

Let me clarify with the simplest example:

With overlay of 2 layers, lower and upper on 2 different xfs fs
assuming that stat(2) from xfs will not be using bit 63 (the MSB):

On stat(2) of an overlay upper inode we want to return:
  st_dev = <overlay anon bdev>
  st_ino = <real upper st_ino>

On stat(2) of an overlay lower inode we want to return:
  st_dev = <overlay anon bdev>
  st_ino = <real lower st_ino> | 1 << 63

Now for ext4 this is always safe to do and we find that automatically
due to the fact that ext4 uses the default encode_fh generic 32bit
inode encoding.

For xfs this should also be safe, but we don't want to whitelist xfs
by name/magic, so we want xfs to publish the max amount of bits
exposed to the user with stat(2)/getdents(2).

Recently, I became aware of an nfsd use case that also looks
at inode->i_ino, so we may want to also be able to assume
max_ino_bits also applies to inode->i_ino, but if you tell us to
stay clear of inode->i_ino, then we can always use stat.st_ino.

Thanks,
Amir.


* Re: Question about XFS_MAXINUMBER
  2018-03-17  5:40   ` Miklos Szeredi
  2018-03-17  7:56     ` Amir Goldstein
@ 2018-03-17  8:04     ` Dave Chinner
  2018-03-17  8:24       ` Amir Goldstein
  1 sibling, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2018-03-17  8:04 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Amir Goldstein, Darrick J. Wong, linux-xfs, overlayfs

On Sat, Mar 17, 2018 at 06:40:23AM +0100, Miklos Szeredi wrote:
> On Fri, Mar 16, 2018 at 11:24 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Fri, Mar 16, 2018 at 04:05:22PM +0200, Amir Goldstein wrote:
> >> Hi guys,
> >>
> >> I am trying to get a lower bound for unused inode number MSB on
> >> a mounted xfs super block, so I can publish it on struct super_block.
> >
> > Sorry, what?
> >
> > The inode number is owned by the filesystem - nobody should be
> > touching it or making assumptions they can screw with it in any way.
> >
> >> This doesn't need to be a tight lower bound, but it needs to be
> >> a loewr bound that cannot change with growfs nor when
> >> remounting with different options (i.e. inode64).
> >>
> >> This is needed for overlayfs to be able to use the unused upper bits
> >> for overlayfs inode number namespace (see [1]).
> >
> > SO you're assuming that filesystems don't ever encode information
> > into their inode numbers. I've already got plans to use a bunch of
> > the unused upper bits in the inode number internally in XFS for
> > subvolumes, and ISTR that Darrick was mulling a use for some of
> > them a while back, too...
> >
> >> I realize that for a given agcount, a "soft" lower bound of unused
> >> upper bits is agno_log-agblklog-inopblog, which makes the "hard"
> >> lower bound 32-agblklog-inopblog, so I think I can use this number.
> >>
> >> I was staring at this definition and tried to figure out where this
> >> absolute limit of 56 used bits came from:
> >>  #define XFS_MAXINUMBER          ((xfs_ino_t)((1ULL << 56) - 1ULL))
> >>
> >> Is this number really correct? If yes, then where does the constrain
> >> on maximum 56 bits come from?
> >
> > Yes, 56 bits is the current maximum *physical* inode number - the
> > inode number is currently a physical representation of the location
> > on disk. 56 bits is needed to represent inodes in 2^63 bytes of
> > physical space.
> >
> > Off the top of my head, it works out something like this for a
> > a 512 byte inode, 4k block size filesystem:
> >
> > bits            range           meaning
> > 6               0-63            inode # in chunk
> > 7-22            1TB             block offset in AG of inode 0
> >                                 blkspag / bsize / inopblk
> >                                 2^30 / 2^12 / 2^3 = 2^15
> > 23-55           AGNO            AG number
> >
> > The breakdown of bits change for different inode and block sizes,
> > but the worse case comes out somewhere around 56 bits...
> >
> > *but*
> >
> > #define NULLFSINO ((xfs_ino_t)-1)
> >
> > is a valid inode number on disk, indicating that the field is not
> > holding an inode number. the MSB indicates the inode number is a
> > "virtual" inode number, holding some special significance that is
> > not directly a physical inode number.  Hence we actually use all 64
> > bits of the inode number on disk, and hence there are no free bits
> > in the inode number for anyone outside XFS to use.
> >
> > IOWs, I think your plan is DOA because we already use the entire 64
> > bit space in the inode number field and have plans for the "unused
> > bits" already in motion....
> 
> We don't care about internal or on-disk use.
> 
> Does that still make it DOA?

Yes, because we reserve the full 64 bits for internal filesystem
use. Just because we aren't using them right now doesn't mean we'll
never use them.

> I ask, because we've thought long and hard about what to do for
> multiplexing inum space in overlayfs, and found no other sane options.
> Ideas welcome, of course.

Why do you need to "multiplex" the inum space? Perhaps you'd do
better to start with a description of why you want to play games
with inode numbers, rather than just posting a patch to steal bits
from other filesystem inode number spaces....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Question about XFS_MAXINUMBER
  2018-03-17  8:04     ` Dave Chinner
@ 2018-03-17  8:24       ` Amir Goldstein
  0 siblings, 0 replies; 20+ messages in thread
From: Amir Goldstein @ 2018-03-17  8:24 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Miklos Szeredi, Darrick J. Wong, linux-xfs, overlayfs

On Sat, Mar 17, 2018 at 10:04 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Sat, Mar 17, 2018 at 06:40:23AM +0100, Miklos Szeredi wrote:
[...]
>> I ask, because we've thought long and hard about what to do for
>> multiplexing inum space in overlayfs, and found no other sane options.
>> Ideas welcome, of course.
>
> Why do you need to "multiplex" the inum space? perhaps you'd do
> better to start with a description of why you want to play games
> with inode numbers, rather than just posting a patch to steal bits
> from other filesytem inode number spaces....
>

I think this patch perhaps explains best what we want to do:
https://marc.info/?l=linux-unionfs&m=151007386219743&w=2

I had already given a simple example in an earlier response.

As to the "why" question, we have several requirements for
overlay inode numbers:
1. st_ino is persistent
2. st_ino/st_dev pair is unique in the system
3. st_ino is consistent with d_ino
4. st_ino doesn't change on copy up
5. st_dev is uniform across all overlay inodes

With upstream overlayfs we meet all requirements above for
the case of all underlying layers on the same fs, by using a real
underlying inode st_ino and the overlay st_dev.

With the 'xino' patch set [1], we can meet all requirements above
also for the case of underlying layers on different fs, by multiplexing
the inum space, as long as we know about unused high ino bits.
The ovl-xino branch already has the xfs patch (not yet posted) to publish
max_ino_bits.

Cheers!
Amir.

[1] https://github.com/amir73il/linux/commits/ovl-xino


* Re: Question about XFS_MAXINUMBER
  2018-03-17  7:56     ` Amir Goldstein
@ 2018-03-17 21:28       ` Dave Chinner
  2018-03-18  6:21         ` Amir Goldstein
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2018-03-17 21:28 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Miklos Szeredi, Darrick J. Wong, linux-xfs, overlayfs

On Sat, Mar 17, 2018 at 09:56:19AM +0200, Amir Goldstein wrote:
> On Sat, Mar 17, 2018 at 7:40 AM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> > On Fri, Mar 16, 2018 at 11:24 PM, Dave Chinner <david@fromorbit.com> wrote:
> >> On Fri, Mar 16, 2018 at 04:05:22PM +0200, Amir Goldstein wrote:
> >>> Hi guys,
> >>>
> >>> I am trying to get a lower bound for unused inode number MSB on
> >>> a mounted xfs super block, so I can publish it on struct super_block.
> >>
> >> Sorry, what?
> >>
> >> The inode number is owned by the filesystem - nobody should be
> >> touching it or making assumptions they can screw with it in any way.
> >>
> 
> Let me clarify with the simplest example:
> 
> With overlay of 2 layers, lower and upper on 2 different xfs fs
> assuming that stat(2) from xfs will not be using the 63 MSB:
> 
> On stat(2) of an overlay upper inode we want to return:
>   st_dev = <overlay anon bdev>
>   st_ino = <real upper st_ino>
> 
> On stat(2) of an overlay lower inode we want to return:
>   st_dev = <overlay anon bdev>
>   st_ino = <real lower st_ino> | 1 << 63
> 
> Now for ext4 this is always safe to do and we find that automatically
> due to the fact that ext4 uses the default encode_fh generic 32bit
> inode encoding.
> 
> For xfs this should also be safe, but we don't want to whitelist xfs
> by name/magic, so we want xfs to publish the max amount of bits
> exposed to user with stat(2)/getdents(3).
> 
> Recently, I became aware of an nfsd use case that also looks
> at inode->i_ino, so we may want to also be able to assume
> max_ino_bits also applies to inode->i_ino, but if you tell us to
> stay clear of inode->i_ino, then we can always use stat.st_ino.
> 
> Thanks,
> Amir.
> 

On Sat, Mar 17, 2018 at 10:24:39AM +0200, Amir Goldstein wrote:
> On Sat, Mar 17, 2018 at 10:04 AM, Dave Chinner <david@fromorbit.com> wrote:
> > On Sat, Mar 17, 2018 at 06:40:23AM +0100, Miklos Szeredi wrote:
> [...]
> >> I ask, because we've thought long and hard about what to do for
> >> multiplexing inum space in overlayfs, and found no other sane options.
> >> Ideas welcome, of course.
> >
> > Why do you need to "multiplex" the inum space? perhaps you'd do
> > better to start with a description of why you want to play games
> > with inode numbers, rather than just posting a patch to steal bits
> > from other filesytem inode number spaces....
> >
> 
> I think this patch perhaps explains best what we want to do:
> https://marc.info/?l=linux-unionfs&m=151007386219743&w=2
> 
> I had already given a simple example in an earlier response.

So, I'll quote that here:

> > > On stat(2) of an overlay upper inode we want to return:
> > >   st_dev = <overlay anon bdev>
> > >   st_ino = <real upper st_ino>
> > > 
> > > On stat(2) of an overlay lower inode we want to return:
> > >   st_dev = <overlay anon bdev>
> > >   st_ino = <real lower st_ino> | 1 << 63

This makes no sense to me - this implies the inode number changes on
copy-up, and ....

> As the the "why" question, we have several requirements for
> overlay inode numbers:
> 1. st_ino is persistent
> 2. st_ino/st_dev pair is unique in the system
> 3. st_ino is consistent with d_ino
> 4. st_ino doesn't change on copy up
> 5. st_dev is uniform across all overlay inodes

.... this means requirement #4 isn't met, even on the same
filesystem.

IOWs, if overlay has already met #4 on the same filesystem, then
there is a persistent mapping between lower and upper inodes (Req.
#1) that maps the upper inode # to the lower inode #. That has to be
overlay information, because the underlying filesystem doesn't store
it. And because the lower inode/dev is unique, then req. 2 is met,
too.

FWIW, req 5 is badly worded - st_dev is uniform across all inodes in
a single overlay filesystem, not all overlay inodes.

> With upstream overlayfs we meet all requirements above for
> the case of all underlying layers on the same fs, by using a real
> underlying inode st_ino and the overlay st_dev.

Yeah, that's what I thought. So why can't you do exactly the same
thing for different underlying filesystems? You've already got a
mapping between upper and lower inode numbers, why can't that map
across different superblocks? Why do you need special "inode number
bits" exposed to userspace to identify upper->lower inode
mappings that overlay should already have a persistent mapping
mechanism for?

> With the 'xino' patch set [1], we can meet all requirements above
> also for the case of underlying layers on different fs, by multiplpexing
> the inum space, as long as we know about unused high ino bits.

Your example makes no sense to me - I don't see how adding extra
bits to the lower inode number allows you to meet requirement #4,
nor why presenting "st_ino = <real upper st_ino>" for inodes that
have been copied up is being done, because that violates requirement
#4....

> The ovl-xino branch already has the xfs patch (not yet posted) to publish
> max_ino_bits.

That has no explanation of why you need to screw with inode number
bits, either. It's all mechanism, and there's zero explanation of
what problem it solves.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Question about XFS_MAXINUMBER
  2018-03-17 21:28       ` Dave Chinner
@ 2018-03-18  6:21         ` Amir Goldstein
  2018-03-18 23:02           ` Dave Chinner
  0 siblings, 1 reply; 20+ messages in thread
From: Amir Goldstein @ 2018-03-18  6:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Miklos Szeredi, Darrick J. Wong, linux-xfs, overlayfs

On Sat, Mar 17, 2018 at 11:28 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Sat, Mar 17, 2018 at 09:56:19AM +0200, Amir Goldstein wrote:
>> On Sat, Mar 17, 2018 at 7:40 AM, Miklos Szeredi <miklos@szeredi.hu> wrote:
>> > On Fri, Mar 16, 2018 at 11:24 PM, Dave Chinner <david@fromorbit.com> wrote:
>> >> On Fri, Mar 16, 2018 at 04:05:22PM +0200, Amir Goldstein wrote:
>> >>> Hi guys,
>> >>>
>> >>> I am trying to get a lower bound for unused inode number MSB on
>> >>> a mounted xfs super block, so I can publish it on struct super_block.
>> >>
>> >> Sorry, what?
>> >>
>> >> The inode number is owned by the filesystem - nobody should be
>> >> touching it or making assumptions they can screw with it in any way.
>> >>
>>
>> Let me clarify with the simplest example:
>>
>> With overlay of 2 layers, lower and upper on 2 different xfs fs
>> assuming that stat(2) from xfs will not be using the 63 MSB:
>>
>> On stat(2) of an overlay upper inode we want to return:
>>   st_dev = <overlay anon bdev>
>>   st_ino = <real upper st_ino>
>>
>> On stat(2) of an overlay lower inode we want to return:
>>   st_dev = <overlay anon bdev>
>>   st_ino = <real lower st_ino> | 1 << 63

>>
>> Now for ext4 this is always safe to do and we find that automatically
>> due to the fact that ext4 uses the default encode_fh generic 32bit
>> inode encoding.
>>
>> For xfs this should also be safe, but we don't want to whitelist xfs
>> by name/magic, so we want xfs to publish the max amount of bits
>> exposed to user with stat(2)/getdents(3).
>>
>> Recently, I became aware of an nfsd use case that also looks
>> at inode->i_ino, so we may want to also be able to assume
>> max_ino_bits also applies to inode->i_ino, but if you tell us to
>> stay clear of inode->i_ino, then we can always use stat.st_ino.
>>
>> Thanks,
>> Amir.
>>
>
> On Sat, Mar 17, 2018 at 10:24:39AM +0200, Amir Goldstein wrote:
>> On Sat, Mar 17, 2018 at 10:04 AM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Sat, Mar 17, 2018 at 06:40:23AM +0100, Miklos Szeredi wrote:
>> [...]
>> >> I ask, because we've thought long and hard about what to do for
>> >> multiplexing inum space in overlayfs, and found no other sane options.
>> >> Ideas welcome, of course.
>> >
>> > Why do you need to "multiplex" the inum space? perhaps you'd do
>> > better to start with a description of why you want to play games
>> > with inode numbers, rather than just posting a patch to steal bits
>> > from other filesytem inode number spaces....
>> >
>>
>> I think this patch perhaps explains best what we want to do:
>> https://marc.info/?l=linux-unionfs&m=151007386219743&w=2
>>
>> I had already given a simple example in an earlier response.
>
> So, I'll quote that here:
>
>> > > On stat(2) of an overlay upper inode we want to return:
>> > >   st_dev = <overlay anon bdev>
>> > >   st_ino = <real upper st_ino>
>> > >
>> > > On stat(2) of an overlay lower inode we want to return:
>> > >   st_dev = <overlay anon bdev>
>> > >   st_ino = <real lower st_ino> | 1 << 63
>
> This makes no sense to me - this implies the inode number changes on
> copy-up, and ....
>

I tried to keep the example simple, but failed to mention that lower and
upper refer to different files, say foo and bar.

I should have mentioned that "foo" is a pure upper - a file that was created
as upper and let's suppose the real ino of "foo" in upper fs is 10.
And let's suppose that the real ino of "bar" on lower fs is also 10, which is
possible when lower fs is a different fs than upper fs.

>> As the the "why" question, we have several requirements for
>> overlay inode numbers:
>> 1. st_ino is persistent
>> 2. st_ino/st_dev pair is unique in the system
>> 3. st_ino is consistent with d_ino
>> 4. st_ino doesn't change on copy up
>> 5. st_dev is uniform across all overlay inodes
>
> .... this means requierment #4 isn't met, even on the same
> filesystem.
>
> IOWs, if overlay has already met #4 on the same filesystem, then
> there is a persistent mapping between lower and upper inodes (Req.
> #1) that maps the upper inode # to the lower inode #. That has to be
> overlay information, because the underlying filesystem doesn't store

Correct. #4 is met because we keep track of "copy up origin" by storing
the lower inode file handle in the "origin" xattr of the copied up file.
Therefore, for an upper file that originated in a lower file we will use
the real lower multiplexed ino across copy up and across mount cycle.

> it. And because the lower inode/dev is unique, then req. 2 is met,
> too.
>

Correct. But notice that overlay does not use the real st_dev. If it did,
that would break the requirement that the real fs st_ino/st_dev pair
is unique in the system.
So for non-samefs, overlay uses a different anon bdev for each layer
to satisfy #2, but breaks #5.

> FWIW, req 5 is badly worded - st_dev is uniform across all inodes in
> a single overlay filesystem, not all overlay inodes.
>

Correct. FYI, #5 has never been met for non-samefs.
What overlayfs does now is meet #5 for directory inodes
(to make find -xdev happy) at the cost of trading off #1.

>> With upstream overlayfs we meet all requirements above for
>> the case of all underlying layers on the same fs, by using a real
>> underlying inode st_ino and the overlay st_dev.
>
> Yeah, that's what I thought. So why can't you do exactly the same
> thing for different underlying filesystems? You've already got a
> mapping between upper and lower inode numbers, why can't that map
> across different superblocks? Why do you need special "inode number
> bits" exposed to userspace to identify upper->lower inode
> mappings that overlay should already have a persistent mapping
> mechanism for?

Because real pure upper inode and lower inode can have the same
inode number and we want to multiplex our way out of this collision.

Note that we do NOT maintain a data structure for looking up used
lower/upper inode numbers, nor do we want to maintain a persistent
data structure for persistent overlay inode numbers that map to
real underlying inodes. AFAIK, aufs can use a small db for its 'xino'
feature. This is something that we wish to avoid.

>
>> With the 'xino' patch set [1], we can meet all requirements above
>> also for the case of underlying layers on different fs, by multiplpexing
>> the inum space, as long as we know about unused high ino bits.
>
> Your example makes no sense to me - I don't see how adding extra
> bits to the lower inode number allows you to meet requirement #4,
> not why presenting "st_ino = <real upper st_ino>" for inodes that
> have been copied up iis being done because that violates requirement
> #4....

The example was miscommunicated. I hope I was able to make the
problem clear now.

>
>> The ovl-xino branch already has the xfs patch (not yet posted) to publish
>> max_ino_bits.
>
> That has no explanation of why you need to screw with inode number
> bits, either. It's all mechanism, and there's zero explanation of
> what problem it solves.
>

It's true. The explanation is now scattered in previous patches, that
incrementally fixed samefs case and improved non-samefs case.
I think currently, the most documented version could be found in this
new helper:
https://github.com/amir73il/linux/blob/overlayfs-devel/fs/overlayfs/inode.c#L62
but I will make sure to add proper full documentation including the requirements
and how they are met in the next version I post.

Please let me know if I missed something and if motivation is still not clear.

Thanks!
Amir.


* Re: Question about XFS_MAXINUMBER
  2018-03-18  6:21         ` Amir Goldstein
@ 2018-03-18 23:02           ` Dave Chinner
  2018-03-19  4:03             ` Amir Goldstein
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2018-03-18 23:02 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Miklos Szeredi, Darrick J. Wong, linux-xfs, overlayfs

On Sun, Mar 18, 2018 at 08:21:16AM +0200, Amir Goldstein wrote:
> On Sat, Mar 17, 2018 at 11:28 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Sat, Mar 17, 2018 at 09:56:19AM +0200, Amir Goldstein wrote:
> >> On Sat, Mar 17, 2018 at 7:40 AM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> >> > On Fri, Mar 16, 2018 at 11:24 PM, Dave Chinner <david@fromorbit.com> wrote:
> >> >> On Fri, Mar 16, 2018 at 04:05:22PM +0200, Amir Goldstein wrote:
> >> >>> Hi guys,
> >> >>>
> >> >>> I am trying to get a lower bound for unused inode number MSB on
> >> >>> a mounted xfs super block, so I can publish it on struct super_block.
> >> >>
> >> >> Sorry, what?
> >> >>
> >> >> The inode number is owned by the filesystem - nobody should be
> >> >> touching it or making assumptions they can screw with it in any way.
> >> >>
> >>
> >> Let me clarify with the simplest example:
> >>
> >> With overlay of 2 layers, lower and upper on 2 different xfs fs
> >> assuming that stat(2) from xfs will not be using the 63 MSB:
> >>
> >> On stat(2) of an overlay upper inode we want to return:
> >>   st_dev = <overlay anon bdev>
> >>   st_ino = <real upper st_ino>
> >>
> >> On stat(2) of an overlay lower inode we want to return:
> >>   st_dev = <overlay anon bdev>
> >>   st_ino = <real lower st_ino> | 1 << 63

[....]

> I should have mentioned that "foo" is a pure upper - a file that was created
> as upper and let's suppose the real ino of "foo" in upper fs is 10.
> And let's suppose that the real ino of "bar" on lower fs is also 10, which is
> possible when lower fs is a different fs than upper fs.

Ok, so to close the loop. The problem is that overlay has no inode
number space of its own, nor does it have any persistent inode
number mapping scheme. Hence overlay has no way of providing users
with a consistent, unique {dev,ino #} tuple to userspace when its
different directories lie on different filesystems.

[....]

> > across different superblocks? Why do you need special "inode number
> > bits" exposed to userspace to identify upper->lower inode
> > mappings that overlay should already have a persistent mapping
> > mechanism for?
> 
> Because real pure upper inode and lower inode can have the same
> inode number and we want to multiplex our way out of this collision.
> 
> Note that we do NOT maintain a data structure for looking up used
> lower/upper inode numbers, nor do we want to maintain a persistent
> data structure for persistent overlay inode numbers that map to
> real underlying inodes. AFAIK, aufs can use a small db for its 'xino'
> feature. This is something that we wish to avoid.

So instead of maintaining your own data structure to provide the
necessary guarantees, the solution is to steal bits from the
underlying filesystem inode numbers on the assumption that they will
never use them?

What happens when a user upgrades their kernel, the underlying fs
changes all its inode numbers because it's done some virtual
mapping thing for, say, having different inode number ranges for
separate mount namespaces? And so instead of having N bits of free
inode number space before upgrade, it now has zero? How will overlay
react to this sort of change, given it could expose duplicate inode
numbers....

Quite frankly, I think this "steal bits from the underlying
filesystems" mechanism is a recipe for trouble. If you want to play
these games, you get to keep all the broken bits when filesystems
change the number of available bits.

Given that overlay has a persistent inode numbering problem, why
doesn't overlay just allocate and store its own inode numbers and
other required persistent state in an xattr?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Question about XFS_MAXINUMBER
  2018-03-18 23:02           ` Dave Chinner
@ 2018-03-19  4:03             ` Amir Goldstein
  2018-03-19  8:42               ` Miklos Szeredi
  2018-03-20  1:47               ` Dave Chinner
  0 siblings, 2 replies; 20+ messages in thread
From: Amir Goldstein @ 2018-03-19  4:03 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Miklos Szeredi, Darrick J. Wong, linux-xfs, overlayfs, linux-fsdevel

This thread has come to a point where I should have included fsdevel a
while ago, so CCing fsdevel. For those interested in previous episodes:
https://marc.info/?l=linux-xfs&m=152120912822207&w=2

On Mon, Mar 19, 2018 at 1:02 AM, Dave Chinner <david@fromorbit.com> wrote:
> [....]
>
>> I should have mentioned that "foo" is a pure upper - a file that was created
>> as upper and let's suppose the real ino of "foo" in upper fs is 10.
>> And let's suppose that the real ino of "bar" on lower fs is also 10, which is
>> possible when lower fs is a different fs than upper fs.
>
> Ok, so to close the loop. The problem is that overlay has no inode
> number space of its own, nor does it have any persistent inode
> number mapping scheme. Hence overlay has no way of providing users
> with a consistent, unique {dev,ino #} tuple to userspace when its
> different directories lie on different filesystems.
>

Yes.

[...]
>> Because real pure upper inode and lower inode can have the same
>> inode number and we want to multiplex our way out of this collision.
>>
>> Note that we do NOT maintain a data structure for looking up used
>> lower/upper inode numbers, nor do we want to maintain a persistent
>> data structure for persistent overlay inode numbers that map to
>> real underlying inodes. AFAIK, aufs can use a small db for its 'xino'
>> feature. This is something that we wish to avoid.
>
> So instead of maintaining your own data structure to provide the
> necessary guarantees, the solution is to steal bits from the
> underlying filesystem inode numbers on the assumption that they will
> never use them?
>

Well, it is not an assumption if filesystem is inclined to publish
s_max_ino_bits, which is not that different in concept from publishing
s_maxbytes and s_max_links, which are also limitations in current
kernel/sb that could be lifted in the future.

> What happens when a user upgrades their kernel, the underlying fs
> changes all its inode numbers because it's done some virtual
> mapping thing for, say, having different inode number ranges for
> separate mount namespaces? And so instead of having N bits of free
> inode number space before upgrade, it now has zero? How will overlay
> react to this sort of change, given it could expose duplicate inode
> numbers....

After kernel upgrade, filesystem would set s_max_ino_bits to 64 or
not set it at all and then overlayfs will not use high bits and fall back
to what it does today.
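
The fallback described above could be sketched roughly as follows. This is
only an illustration and is not taken from the posted patches: the helper
name ovl_spare_ino_bits() is hypothetical, and treating 0 as "not published"
is an assumption made for the sketch.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not taken from the posted patches.
 * max_ino_bits is the value a filesystem would publish in the proposed
 * s_max_ino_bits field; 0 stands for "not published" (an assumption).
 * Returns how many high inode number bits overlayfs could borrow.
 */
static unsigned int ovl_spare_ino_bits(unsigned int max_ino_bits)
{
	/* Unpublished, or the filesystem claims all 64 bits: borrow nothing. */
	if (max_ino_bits == 0 || max_ino_bits >= 64)
		return 0;
	return 64 - max_ino_bits;
}
```

With a published value of 56 this yields 8 spare bits; with 64, or with
nothing published, it yields 0 and overlayfs keeps its current behavior.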

But if we want to bring practical arguments from the containers
world into the picture, IMO it is far more likely that existing container
solutions would benefit from overlayfs inode number multiplexing
than they would from inode number mapping by the filesystem for
different mount namespaces.

>
> Quite frankly, I think this "steal bits from the underlying
> filesystems" mechanism is a recipe for trouble. If you want to play
> these games, you get to keep all the broken bits when filesystems
> change the number of available bits.
>

I don't see that as a problem. I would say there are a fair number of
users out there using containers with overlayfs.

Do you realize that the majority of those users are settling for things
like no directory rename and broken hardlinks on copy up?
Those are "features" of overlayfs that have been fixed in recent kernels,
but the fixes are only now on their way to distro kernels and not yet enabled
by container runtimes.

Container admins already make the choice of underlying filesystem
consciously to get the best from overlayfs and I would expect that
they will soon be opting in for xfs+reflink because of that conscious
choice. If ever xfs decides to change inode numbers address space
on kernel upgrade without users opting in for it, I would be surprised,
but I should also hope that xfs would at least leave a choice for users
to opt-out of this behavior and that is what container admins would do.

Heck, for all I care, users could also opt-in for unused inode bits
explicitly (e.g. -o inode56) if you are concerned about letting go
of those upper bits implicitly.
My patch set already provides the capability for users to declare
with overlay -o xino that enough upper bits are available (i.e. because
user knows the underlying fs and its true practical limits). But the
feature will be much more useful if users didn't have to do that.

> Given that overlay has a persistent inode numbering problem, why
> doesn't overlay just allocate and store its own inode numbers and
> other required persistent state in an xattr?
>

First, this is not as simple as it sounds.
If you have a huge number of readonly files in multiple lower layers,
it makes no sense to scan them all on overlay mount to discover which
inode numbers are free to use and it makes no sense either to create
a persistent mapping for every lower file accessed in that case.
And there are other problematic factors with this sort of scheme.

Second, and this may be a revolutionary argument, I would like to
believe that we are all working together for a "greater good".
Sure, xfs developers strive to perfect and enhance xfs and overlayfs
developers strive to perfect and enhance overlayfs.
But when there is an opportunity for synergy between subsystems,
one should consider the best solution as a whole and IMHO,
the solution of filesystem declaring already unused ino bits
is the best solution as a whole. xfs is not required to declare
s_max_ino_bits for all eternity, only for this specific super block
instance, in this specific kernel.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Question about XFS_MAXINUMBER
  2018-03-19  4:03             ` Amir Goldstein
@ 2018-03-19  8:42               ` Miklos Szeredi
  2018-03-20  1:47               ` Dave Chinner
  1 sibling, 0 replies; 20+ messages in thread
From: Miklos Szeredi @ 2018-03-19  8:42 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Dave Chinner, Darrick J. Wong, linux-xfs, overlayfs, linux-fsdevel

On Mon, Mar 19, 2018 at 5:03 AM, Amir Goldstein <amir73il@gmail.com> wrote:
> On Mon, Mar 19, 2018 at 1:02 AM, Dave Chinner <david@fromorbit.com> wrote:

>> Given that overlay has a persistent inode numbering problem, why
>> doesn't overlay just allocate and store its own inode numbers and
>> other required persistent state in an xattr?
>
> First, this is not as simple as it sounds.
> If you have a huge number of readonly files in multiple lower layers,
> it makes no sense to scan them all on overlay mount to discover which
> inode numbers are free to use and it makes no sense either to create
> a persistent mapping for every lower file accessed in that case.
> And there are other problematic factors with this sort of scheme.

Such as when all layers are read-only.   Where do we store the
persistent inode numbers in that case?

>
> Second, and this may be a revolutionary argument, I would like to
> believe that we are all working together for a "greater good".
> Sure, xfs developers strive to perfect and enhance xfs and overlayfs
> developers strive to perfect and enhance overlayfs.
> But when there is an opportunity for synergy between subsystems,
> one should consider the best solution as a whole and IMHO,
> the solution of filesystem declaring already unused ino bits
> is the best solution as a whole. xfs is not required to declare
> s_max_ino_bits for all eternity, only for this specific super block
> instance, in this specific kernel.

The "specific kernel" part requires clarification.  We do promise
backward compatibility when upgrading the kernel, and silently
increasing s_max_ino_bits on a kernel upgrade would break that
promise.  Could be backed by a feature flag.  And unlimited use could
be the default, people have learned to live with needing special
features for overlayfs.

And I do agree with Amir, that the "mine all mine" philosophy isn't
necessarily the right one.  In normal cases overlayfs would just use
one or two bits of the inumber space.  While Amir's current patch
keeps the layer index in the spare bits, it is sufficient to hold an
"fs index" that is incremented when a new superblock is encountered
during enumeration of layers.  The number of different fs instances
used for creating an overlay is unlikely to be large, so for all
practical purposes a few (4-6) bits should be enough.
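
As a rough illustration of the "fs index" idea above, a minimal sketch
follows. It is not the code from the patch set: the name ovl_xino_compose()
and the XINO_BITS constant are hypothetical, and the choice of 4 reserved
bits is just one value from the 4-6 range mentioned.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: number of high inode number bits reserved for the fs index. */
#define XINO_BITS 4

/*
 * Compose an overlay inode number from a real inode number and a small
 * fs index. Returns 0 on success, or -1 if the real inode number already
 * uses the reserved high bits, in which case the caller would have to
 * fall back to a non-multiplexed scheme.
 */
static int ovl_xino_compose(uint64_t real_ino, unsigned int fsid,
			    uint64_t *out)
{
	const unsigned int shift = 64 - XINO_BITS;

	/* The fs index must fit in the reserved bits. */
	assert(fsid < (1u << XINO_BITS));

	if (real_ino >> shift)
		return -1;

	*out = real_ino | ((uint64_t)fsid << shift);
	return 0;
}
```

Under these assumptions, real inode 10 on fs index 0 stays 10, while real
inode 10 on fs index 1 becomes 10 with bit 60 set, so the two no longer
collide within the overlay.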

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Question about XFS_MAXINUMBER
  2018-03-19  4:03             ` Amir Goldstein
  2018-03-19  8:42               ` Miklos Szeredi
@ 2018-03-20  1:47               ` Dave Chinner
  2018-03-20  6:29                 ` Amir Goldstein
  2018-03-20  9:32                 ` Miklos Szeredi
  1 sibling, 2 replies; 20+ messages in thread
From: Dave Chinner @ 2018-03-20  1:47 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Miklos Szeredi, Darrick J. Wong, linux-xfs, overlayfs, linux-fsdevel

On Mon, Mar 19, 2018 at 06:03:30AM +0200, Amir Goldstein wrote:
> This thread has come to a point where I should have included fsdevel a
> while ago,
> so CCing fsdevel. For those interested in previous episodes:
> https://marc.info/?l=linux-xfs&m=152120912822207&w=2
> 
> On Mon, Mar 19, 2018 at 1:02 AM, Dave Chinner <david@fromorbit.com> wrote:
> > [....]
> >
> >> I should have mentioned that "foo" is a pure upper - a file that was created
> >> as upper and let's suppose the real ino of "foo" in upper fs is 10.
> >> And let's suppose that the real ino of "bar" on lower fs is also 10, which is
> >> possible when lower fs is a different fs than upper fs.
> >
> > Ok, so to close the loop. The problem is that overlay has no inode
> > number space of its own, nor does it have any persistent inode
> > number mapping scheme. Hence overlay has no way of providing users
> > with a consistent, unique {dev,ino #} tuple to userspace when its
> > different directories lie on different filesystems.
> >
> 
> Yes.
> 
> [...]
> >> Because real pure upper inode and lower inode can have the same
> >> inode number and we want to multiplex our way out of this collision.
> >>
> >> Note that we do NOT maintain a data structure for looking up used
> >> lower/upper inode numbers, nor do we want to maintain a persistent
> >> data structure for persistent overlay inode numbers that map to
> >> real underlying inodes. AFAIK, aufs can use a small db for its 'xino'
> >> feature. This is something that we wish to avoid.
> >
> > So instead of maintaining your own data structure to provide the
> > necessary guarantees, the solution is to steal bits from the
> > underlying filesystem inode numbers on the assumption that they will
> > never use them?
> >
> 
> Well, it is not an assumption if filesystem is inclined to publish
> s_max_ino_bits, which is not that different in concept from publishing
> s_maxbytes and s_max_links, which are also limitations in current
> kernel/sb that could be lifted in the future.

It is different, because you're expecting to be able to publish
persistent user visible information based on it.

If we change s_max_ino_bits in the underlying filesystem, then
overlay inode numbers change and that can cause all sorts of problems
with things like filehandles, backups that use dev/inode number
tuples to detect identical files, etc.  i.e. there's a heap of
downstream impacts of changing inode numbers. If we have to
publish s_max_ino_bits to the VFS, we essentially fix the ABI of the
user visible inode number the filesystem publishes. IOWs, we
effectively can't change it without breaking external users.

I suspect you don't realise we already expose the full 64 bit
inode number space completely to userspace through other ABIs. e.g.
the bulkstat ioctls. We've already got applications that use the XFS
inode number as a 64 bit value both to and from the kernel (e.g.
xfs_dump, file handle encoding, etc), so the idea that we can now
take bits back from what we've already agreed to expose to userspace
is fraught with problems.

That's the problem I see here - it's not that we /can't/ implement
s_max_ino_bits, the problem is that once we publish it we can't
change it because it will cause random breakage of applications
using it. And because we've already effectively published it to
userspace applications as s_max_ino_bits = 64, there's no scope for
movement at all.

> Do you realize that the majority of those users are settling for things
> like: no directory rename, breaking hardlinks on copy up.
> Those are "features" of overlayfs that have been fixed in recent kernels,
> but only now on their way to distro kernels and not yet enabled
> by container runtimes.
> 
> Container admins already make the choice of underlying filesystem
> consciously to get the best from overlayfs and I would expect that
> they will soon be opting in for xfs+reflink because of that conscious
> choice. If ever xfs decides to change inode numbers address space
> on kernel upgrade without users opting in for it,

We've done this many times in the past. e.g. we changed the default
inode allocation policy from inode32 to inode64 back in 2012. That
means users, on kernel upgrade, silently went from 32 bit inodes to
64 bit inodes. We've done this because of the fact that the
*filesystem owns the entire inode number space* and as long as we
don't change individual inode numbers that users see for a specific
inode, we can do whatever we want inside that inode number space.

> > Given that overlay has a persistent inode numbering problem, why
> > doesn't overlay just allocate and store its own inode numbers and
> > other required persistent state in an xattr?
> >
> 
> First, this is not as simple as it sounds.

Sure, just like s_max_ino_bits is not as simple as it sounds.

If we want to explicitly reserve part of the inode number space for
other layers to use for their own purposes, then we need to
explicitly and persistently support that in the underlying
filesystem. That means mkfs, repair, db, growfs, etc all need to
understand that inode numbers have a size limit and do the right
thing...

That makes it an opt-in configuration that we can test and support
without having to care about overlay implementations or backwards
compatibility across applications on existing filesystems.

> Second, and this may be a revolutionary argument, I would like to
> believe that we are all working together for a "greater good".

I don't say no for the fun of saying no. I say no because I think
something is a bad idea. Just because I say no doesn't mean I don't
want to solve the problem. It just means that I think the
solution being presented is a bad idea and we need to explore the
problem space for a more robust solution.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Question about XFS_MAXINUMBER
  2018-03-20  1:47               ` Dave Chinner
@ 2018-03-20  6:29                 ` Amir Goldstein
  2018-03-20  8:04                   ` Ian Kent
  2018-03-20 13:08                   ` Dave Chinner
  2018-03-20  9:32                 ` Miklos Szeredi
  1 sibling, 2 replies; 20+ messages in thread
From: Amir Goldstein @ 2018-03-20  6:29 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Miklos Szeredi, Darrick J. Wong, linux-xfs, overlayfs, linux-fsdevel

On Tue, Mar 20, 2018 at 3:47 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Mar 19, 2018 at 06:03:30AM +0200, Amir Goldstein wrote:
[...]
>> Well, it is not an assumption if filesystem is inclined to publish
>> s_max_ino_bits, which is not that different in concept from publishing
>> s_maxbytes and s_max_links, which are also limitations in current
>> kernel/sb that could be lifted in the future.
>
> It is different, because you're expecting to be able to publish
> persistent user visible information based on it.
>
> If we change s_max_ino_bits in the underlying filesystem, then
> overlay inode numbers change and that can cause all sorts of problems
> with things like filehandles, backups that use dev/inode number
> tuples to detect identical files, etc.  i.e. there's a heap of
> downstream impacts of changing inode numbers. If we have to
> publish s_max_ino_bits to the VFS, we essentially fix the ABI of the
> user visible inode number the filesystem publishes. IOWs, we
> effectively can't change it without breaking external users.
>

You are right.

> I suspect you don't realise we already expose the full 64 bit
> inode number space completely to userspace through other ABIs. e.g.
> the bulkstat ioctls. We've already got applications that use the XFS
> inode number as a 64 bit value both to and from the kernel (e.g.
> xfs_dump, file handle encoding, etc), so the idea that we can now
> take bits back from what we've already agreed to expose to userspace
> is fraught with problems.

I'm sorry. There must be something I am missing.
Are users exposed to high ino bits via xfs tools other than NULLFSINO and
NULLAGINO? If they are then I did not find where.
And w.r.t. NULLINO (-1), that ino is not exposed via getattr() and readdir(),
so not a problem for overlayfs.

>
> That's the problem I see here - it's not that we /can't/ implement
> s_max_ino_bits, the problem is that once we publish it we can't
> change it because it will cause random breakage of applications
> using it. And because we've already effectively published it to
> userspace applications as s_max_ino_bits = 64, there's no scope for
> movement at all.
>

Agreed. So we can add an explicit compat feature bit to declare that user
would like to limit future use of high ino bits on his fs.
Makes me wonder, how come there is no feature to block "inode64"
mount option, so user can declare he wishes to keep the fs fully
compatible for mounting on 32bit systems?

[...]

> We've done this many times in the past. e.g. we changed the default
> inode allocation policy from inode32 to inode64 back in 2012. That
> means users, on kernel upgrade, silently went from 32 bit inodes to
> 64 bit inodes. We've done this because of the fact that the
> *filesystem owns the entire inode number space* and as long as we
> don't change individual inode numbers that users see for a specific
> inode, we can do whatever we want inside that inode number space.
>

Right. My main point is that, unless I am missing something, never in
xfs history, was a non NULL inode number exposed to user with high
8 bits used, so at least forward/backward compat for "inode56" feature
is not going to be a big challenge.
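
For illustration only, the "high 8 bits unused" property amounts to a check
against the XFS_MAXINUMBER value quoted at the start of this thread. The
names MAXINUMBER_56 and ino_fits_56_bits() below are hypothetical and exist
only for this sketch; they are not part of xfs or of any posted patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same value as the XFS_MAXINUMBER definition quoted earlier in the thread. */
#define MAXINUMBER_56 ((uint64_t)((1ULL << 56) - 1ULL))

/*
 * Hypothetical check: true if the inode number leaves the top 8 bits
 * clear, i.e. it would be valid under an "inode56"-style limit.
 */
static bool ino_fits_56_bits(uint64_t ino)
{
	return ino <= MAXINUMBER_56;
}
```

Note that an all-ones NULL inode value fails this check, which is why it
is treated separately in the discussion above.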

>> > Given that overlay has a persistent inode numbering problem, why
>> > doesn't overlay just allocate and store its own inode numbers and
>> > other required persistent state in an xattr?
>> >
>>
>> First, this is not as simple as it sounds.
>
> Sure, just like s_max_ino_bits is not as simple as it sounds.

It never is ;-)

>
> If we want to explicitly reserve part of the inode number space for
> other layers to use for their own purposes, then we need to
> explicitly and persistently support that in the underlying
> filesystem. That means mkfs, repair, db, growfs, etc all need to
> understand that inode numbers have a size limit and do the right
> thing...
>
> That makes it an opt-in configuration that we can test and support
> without having to care about overlay implementations or backwards
> compatibility across applications on existing filesystems.
>

OK. I'll work on a proposal.

>> Second, and this may be a revolutionary argument, I would like to
>> believe that we are all working together for a "greater good".
>
> I don't say no for the fun of saying no. I say no because I think
> something is a bad idea. Just because I say no doesn't mean I don't
> want to solve the problem. It just means that I think the
> solution being presented is a bad idea and we need to explore the
> problem space for a more robust solution.
>

And I do appreciate the time you've put into understanding the overlayfs
problem and explaining the problems with my current proposal.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Question about XFS_MAXINUMBER
  2018-03-20  6:29                 ` Amir Goldstein
@ 2018-03-20  8:04                   ` Ian Kent
  2018-03-20  8:57                     ` Amir Goldstein
  2018-03-20  9:20                     ` Miklos Szeredi
  2018-03-20 13:08                   ` Dave Chinner
  1 sibling, 2 replies; 20+ messages in thread
From: Ian Kent @ 2018-03-20  8:04 UTC (permalink / raw)
  To: Amir Goldstein, Dave Chinner
  Cc: Miklos Szeredi, Darrick J. Wong, linux-xfs, overlayfs, linux-fsdevel

Hi Amir, Miklos,

On 20/03/18 14:29, Amir Goldstein wrote:
> 
> And I do appreciate the time you've put into understanding the overlayfs
> problem and explaining the problems with my current proposal.
> 

For a while now I've been wondering why overlayfs is keen to avoid using
a local, persistent, inode number mapping cache?

Sure there can be subtle problems with them but there are problems with
other alternatives too.

Ian

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Question about XFS_MAXINUMBER
  2018-03-20  8:04                   ` Ian Kent
@ 2018-03-20  8:57                     ` Amir Goldstein
  2018-03-20 10:18                       ` Ian Kent
  2018-03-20  9:20                     ` Miklos Szeredi
  1 sibling, 1 reply; 20+ messages in thread
From: Amir Goldstein @ 2018-03-20  8:57 UTC (permalink / raw)
  To: Ian Kent
  Cc: Dave Chinner, Miklos Szeredi, Darrick J. Wong, linux-xfs,
	overlayfs, linux-fsdevel

On Tue, Mar 20, 2018 at 10:04 AM, Ian Kent <raven@themaw.net> wrote:
> Hi Amir, Miklos,
>
> On 20/03/18 14:29, Amir Goldstein wrote:
>>
>> And I do appreciate the time you've put into understanding the overlayfs
>> problem and explaining the problems with my current proposal.
>>
>
> For a while now I've been wondering why overlayfs is keen to avoid using
> a local, persistent, inode number mapping cache?
>

A local persistent inode map is a more complex solution.
If you remove re-factoring, my patch set adds less than 100 lines of code
and it solves the problem for many real world setups.
A more complex solution needs a use case in the real world to justify
it over a less complex solution.
I am not saying we can avoid the complex solution forever, but so far,
I did not yet see the requests from users to justify it.

> Sure there can be subtle problems with them but there are problems with
> other alternatives too.
>

There is a difference between "not applicable" and "problematic".
The -o xino solution is not applicable to all setups, but I am not aware of
any problems with this solution.
Even without the underlying filesystem declaring the number of used ino bits,
the user can declare this with an overlayfs mount option, so practically, the
problem for overlayfs over xfs is already solved.

The discussion about a VFS API for max_ino_bits is to make users'
lives easier, but the API is not required to fix the overlayfs inode number
problem.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Question about XFS_MAXINUMBER
  2018-03-20  8:04                   ` Ian Kent
  2018-03-20  8:57                     ` Amir Goldstein
@ 2018-03-20  9:20                     ` Miklos Szeredi
  1 sibling, 0 replies; 20+ messages in thread
From: Miklos Szeredi @ 2018-03-20  9:20 UTC (permalink / raw)
  To: Ian Kent
  Cc: Amir Goldstein, Dave Chinner, Darrick J. Wong, linux-xfs,
	overlayfs, linux-fsdevel

On Tue, Mar 20, 2018 at 9:04 AM, Ian Kent <raven@themaw.net> wrote:
> Hi Amir, Miklos,
>
> On 20/03/18 14:29, Amir Goldstein wrote:
>>
>> And I do appreciate the time you've put into understanding the overlayfs
>> problem and explaining the problems with my current proposal.
>>
>
> For a while now I've been wondering why overlayfs is keen to avoid using
> a local, persistent, inode number mapping cache?

Think of overlayfs as a normal filesystem, except it's not backed by a
block device, but instead one or more read-only directory trees and
optionally one writable directory tree. There's a twist, however: when
not mounted, you are allowed to change the backing directories.  This
is a really important feature of overlayfs.

So where does the initial mapping come from (overlay is never started
from scratch, like a newly formatted filesystem)?  And what happens
when layers are modified and we encounter unmapped inode numbers?

In both cases we must either create/update the mapping before mount,
or update the mapping on lookup.

Creating/updating the mapping up-front means a really high startup
cost, which can be amortized if the layers are guaranteed not to
change outside of the overlay.

Updating a persistent mapping on lookup means having to do sync writes
on lookup, which can be very detrimental to performance.  If all
layers are read-only, this scheme falls apart, since we've nowhere to
write the persistent mapping.

Or we can just say, screw the persistency and store the mapping on
e.g. tmpfs.  Performance-wise that's much better, but then we fail to
provide the guarantees about inode numbers (e.g. NFS export won't work
properly).

In my opinion it's much less about simplicity of implementation than
about quality of implementation.

Ideas for fixing the above issues are welcome.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Question about XFS_MAXINUMBER
  2018-03-20  1:47               ` Dave Chinner
  2018-03-20  6:29                 ` Amir Goldstein
@ 2018-03-20  9:32                 ` Miklos Szeredi
  1 sibling, 0 replies; 20+ messages in thread
From: Miklos Szeredi @ 2018-03-20  9:32 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Amir Goldstein, Darrick J. Wong, linux-xfs, overlayfs, linux-fsdevel

On Tue, Mar 20, 2018 at 2:47 AM, Dave Chinner <david@fromorbit.com> wrote:

>> Second, and this may be a revolutionary argument, I would like to
>> believe that we are all working together for a "greater good".
>
> I don't say no for the fun of saying no. I say no because I think
> something is a bad idea. Just because I say no doesn't mean I don't
> want to solve the problem. It just means that I think the
> solution being presented is a bad idea and we need to explore the
> problem space for a more robust solution.

Totally agreed, let's do that.  I've presented the issues I see with
creating a generic (i.e. non-multiplexing) inode number mapping for
overlayfs in answer to Ian's mail.

Do you see a way this problem can be solved without those issues?

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Question about XFS_MAXINUMBER
  2018-03-20  8:57                     ` Amir Goldstein
@ 2018-03-20 10:18                       ` Ian Kent
  0 siblings, 0 replies; 20+ messages in thread
From: Ian Kent @ 2018-03-20 10:18 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Dave Chinner, Miklos Szeredi, Darrick J. Wong, linux-xfs,
	overlayfs, linux-fsdevel

On 20/03/18 16:57, Amir Goldstein wrote:
> On Tue, Mar 20, 2018 at 10:04 AM, Ian Kent <raven@themaw.net> wrote:
>> Hi Amir, Miklos,
>>
>> On 20/03/18 14:29, Amir Goldstein wrote:
>>>
>>> And I do appreciate the time you've put into understanding the overlayfs
>>> problem and explaining the problems with my current proposal.
>>>
>>
>> For a while now I've been wondering why overlayfs is keen to avoid using
>> a local, persistent, inode number mapping cache?
>>
> 
> A local persistent inode map is a more complex solution.
> If you remove re-factoring, my patch set adds less than 100 lines of code
> and it solves the problem for many real world setups.
> A more complex solution needs a use case in the real world to justify
> it over a less complex solution.

Indeed, it is significantly more complex.

Ian

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Question about XFS_MAXINUMBER
  2018-03-20  6:29                 ` Amir Goldstein
  2018-03-20  8:04                   ` Ian Kent
@ 2018-03-20 13:08                   ` Dave Chinner
  1 sibling, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2018-03-20 13:08 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Miklos Szeredi, Darrick J. Wong, linux-xfs, overlayfs, linux-fsdevel

On Tue, Mar 20, 2018 at 08:29:35AM +0200, Amir Goldstein wrote:
> On Tue, Mar 20, 2018 at 3:47 AM, Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, Mar 19, 2018 at 06:03:30AM +0200, Amir Goldstein wrote:
> [...]
> >> Well, it is not an assumption if filesystem is inclined to publish
> >> s_max_ino_bits, which is not that different in concept from publishing
> >> s_maxbytes and s_max_links, which are also limitations in current
> >> kernel/sb that could be lifted in the future.
> >
> > It is different, because you're expecting to be able to publish
> > persistent user visible information based on it.
> >
> > If we change s_max_ino_bits in the underlying filesystem, then
> > overlay inode numbers change and that can cause all sorts of problems
> > with things like filehandles, backups that use dev/inode number
> > tuples to detect identical files, etc.  i.e. there's a heap of
> > downstream impacts of changing inode numbers. If we have to
> > publish s_max_ino_bits to the VFS, we essentially fix the ABI of the
> > user visible inode number the filesystem publishes. IOWs, we
> > effectively can't change it without breaking external users.
> >
> 
> You are right.
> 
> > I suspect you don't realise we already expose the full 64 bit
> > inode number space completely to userspace through other ABIs. e.g.
> > the bulkstat ioctls. We've already got applications that use the XFS
> > inode number as a 64 bit value both to and from the kernel (e.g.
> > xfs_dump, file handle encoding, etc), so the idea that we can now
> > take bits back from what we've already agreed to expose to userspace
> > is fraught with problems.
> 
> I'm sorry. There must be something I am missing.
> Are users exposed to high ino bits via xfs tools other than NULLFSINO
> NULLAGINO? If they are then I did not find where.
> And w.r.t. NULLINO (-1), that ino is not exposed via getattr() and readdir(),
> so not a problem for overlayfs.

Bulkstat exposes the on-disk inode number directly to userspace, and
other ioctls take those inode numbers back in as ioctl parameters
(e.g. as bulkstat iteration cookies) and as part of userspace
constructed filehandles (i.e. in libhandle, xfs_fsr, xfsdump, etc).
The filehandles are explicitly encoded with 64 bit inode numbers....

> > That's the problem I see here - it's not that we /can't/ implement
> > s_max_ino_bits, the problem is that once we publish it we can't
> > change it because it will cause random breakage of applications
> > using it. And because we've already effectively published it to
> > userspace applications as s_max_ino_bits = 64, there's no scope for
> > movement at all.
> >
> 
> Agreed. So we can add an explicit compat feature bit to declare that the
> user would like to limit future use of high ino bits on their fs.
> Makes me wonder, how come there is no feature to block "inode64"
> mount option, so user can declare he wishes to keep the fs fully
> compatible for mounting on 32bit systems?

Because inode64 was the original mechanism for allocating inodes.
inode32 was introduced years after XFS was first shipped. You need
to go ask the old Irix engineers why they implemented inode32 as a
mount option and not an on-disk feature flag and created the mess
that is the inode32 mount option.

These days, inode32 reads 64 bit inodes just fine - it just can't
create new 64 bit inode numbers.  And if you *really* still need
only 32 bit inodes in this day and age, there's that old xfs_reno
tool:

http://xfs.org/index.php/Unfinished_work#The_xfs_reno_tool

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


end of thread, other threads:[~2018-03-20 13:08 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-16 14:05 Question about XFS_MAXINUMBER Amir Goldstein
2018-03-16 17:59 ` Amir Goldstein
2018-03-16 22:24 ` Dave Chinner
2018-03-17  5:40   ` Miklos Szeredi
2018-03-17  7:56     ` Amir Goldstein
2018-03-17 21:28       ` Dave Chinner
2018-03-18  6:21         ` Amir Goldstein
2018-03-18 23:02           ` Dave Chinner
2018-03-19  4:03             ` Amir Goldstein
2018-03-19  8:42               ` Miklos Szeredi
2018-03-20  1:47               ` Dave Chinner
2018-03-20  6:29                 ` Amir Goldstein
2018-03-20  8:04                   ` Ian Kent
2018-03-20  8:57                     ` Amir Goldstein
2018-03-20 10:18                       ` Ian Kent
2018-03-20  9:20                     ` Miklos Szeredi
2018-03-20 13:08                   ` Dave Chinner
2018-03-20  9:32                 ` Miklos Szeredi
2018-03-17  8:04     ` Dave Chinner
2018-03-17  8:24       ` Amir Goldstein
