All of lore.kernel.org
 help / color / mirror / Atom feed
* Exporting a unique ino/dev pair to user space
@ 2018-06-06 21:38 Mark Fasheh
  2018-06-07  0:47 ` Ian Kent
  2018-06-07  7:38 ` Amir Goldstein
  0 siblings, 2 replies; 7+ messages in thread
From: Mark Fasheh @ 2018-06-06 21:38 UTC (permalink / raw)
  To: Al Viro, linux-fsdevel
  Cc: linux-btrfs, linux-unionfs, Dave Chinner, Amir Goldstein, Jeff Mahoney

Hi,

We have an inconsistency in how the kernel is exporting inode number /
device pairs for user space. There's of course stat(2) and statx(2),
but aside from those we simply dump inode->i_ino and super->s_dev. In
some cases, the dumped values differ from what is returned via stat(2)
or statx(2). Some filesystems might even show duplicate (but
internally different!) pairs when the raw i_ino/s_dev is used.

Some examples where we dump raw ino/dev:

- /proc/<pid>/maps. I've written about how this confuses lsof(8):
  https://marc.info/?l=linux-btrfs&m=130074451403261&w=2

- Unsurprisingly, many VFS tracepoints dump ino and/or dev. See
  trace/events/lock.h or trace/events/writeback.h for examples.

- eventpoll also dumps the raw ino/dev pair via ep_show_fdinfo()

- Audit records the raw ino/dev and passes them around. We do seem to
  have paths printed from audit as well, but if it's printed with the
  wrong ino/dev pair I believe my point still stands.


This breaks software which expects these pairs to be unique, and can
put the user in a situation where they might not be able to find an
inode referenced from the kernel. What's even worse - depending on how
ino is exported, they might even find the *wrong* inode.

I also should point out that we're likely at this point because
stat(2) has been using an unsigned long for ino. On a 32 bit system,
it would have been impossible for the user to get the real inode
number in the first place. So there probably wasn't much we could do.

These days though, we have statx(2) which will do the right thing on
all platforms so we no longer have that excuse. The user ought to be
able to take an ino/dev pair and ultimately find the actual file on
their system, partially with the help of statx(2).


Some examples of how ino is manipulated by filesystems:

- On 64 bit i_ino and kstat->ino tend to be filled in correctly (from
  what I can tell). stat->dev is not always consistent with super->s_dev.

- On 32 bits, many filesystems employ a transformation to squeeze a 64
  bit identifier into 32 bits. The exact methods are fs specific,
  what's important is that we're losing information, introducing the
  possibility of duplicate inode numbers.

- On all platforms, Btrfs and Overlayfs can have duplicate inode
  numbers. Of course, device can be different across the fs as well
  with the kernel reporting s_dev and these filesystems returning
  a different device via stat() or statx().


Getting the inode number portion of this pair fixed would immediately
solve the situation for all filesystems except Btrfs and
Overlayfs - they report a different device from stat.

Regarding the device portion of the pair, I'm honestly not sure
whether Overlayfs cares, and my attempts to fix the s_dev situation
for Btrfs have all come to the same dead ends that I've hit briefly
looking into this inode number issue - the problems are intrinsically
linked.

So my questions are:

1) Do we care about this? On one hand, we've had this inconsistency
   for a long time, for various reasons. On the other hand, I can point
   to bugzilla's where these inconsistencies have become a problem.

   In the case that we don't care, any fs-internal solutions are
   likely to be extremely disruptive to the end user.


2) If we do care, how do we fix this?

 2a) Do we use 64 bit i_ino on 32 bit systems? This has some obvious
     downsides from a memory usage standpoint. Also it doesn't fully
     address the issue - we still have a device field that Btrfs and
     Overlayfs override.

     We could combine this with an intermediate structure between the
     inode and super block so s_dev could be abstracted out. See my
     fs_view patches for an example of how this could be done:
     https://www.spinics.net/lists/linux-fsdevel/msg125492.html

 2b) Do we call ->getattr for this information? Plumbing a vfsmount to
     various regions of the kernel like audit, or some of the deeper
     tracepoints looks ugly and prone to life-timing issues (which
     vfsmount do we use?). On the upside, we could probably make it
     really light by only sending down the STATX_INO flag and letting
     the filesystem optimize accordingly.

 2c) I don't think we can really use a dedicated callback without
     passing the vfsmount through since Overlayfs ->getattr might call
     the lower fs ->getattr. At that point we might as well use getattr.

I'd appreciate any comments or suggestions.

Thanks,
  --Mark

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Exporting a unique ino/dev pair to user space
  2018-06-06 21:38 Exporting a unique ino/dev pair to user space Mark Fasheh
@ 2018-06-07  0:47 ` Ian Kent
  2018-06-07  2:06   ` Mark Fasheh
  2018-06-07  7:38 ` Amir Goldstein
  1 sibling, 1 reply; 7+ messages in thread
From: Ian Kent @ 2018-06-07  0:47 UTC (permalink / raw)
  To: Mark Fasheh, Al Viro, linux-fsdevel
  Cc: linux-btrfs, linux-unionfs, Dave Chinner, Amir Goldstein, Jeff Mahoney

On Wed, 2018-06-06 at 23:38 +0200, Mark Fasheh wrote:
> Hi,

I'm not sure I understand what the problem is.

> 
> We have an inconsistency in how the kernel is exporting inode number /
> device pairs for user space. There's of course stat(2) and statx(2),
> but aside from those we simply dump inode->i_ino and super->s_dev. In
> some cases, the dumped values differ from what is returned via stat(2)
> or statx(2). Some filesystems might even show duplicate (but
> internally different!) pairs when the raw i_ino/s_dev is used.

How is it that you can dump the raw ino and s_dev if your not in
kernel code?

For stat family system calls, if the file system defines the inode
operation getattr it will be used to fill the stat structure otherwise
the VFS will fill the stat  structure and it will use inode->i_ino and
sb->s_dev as you say.

> 
> Some examples where we dump raw ino/dev:
> 
> - /proc/<pid>/maps. I've written about how this confuses lsof(8):
>   https://marc.info/?l=linux-btrfs&m=130074451403261&w=2
> 
> - Unsurprisingly, many VFS tracepoints dump ino and/or dev. See
>   trace/events/lock.h or trace/events/writeback.h for examples.
> 
> - eventpoll also dumps the raw ino/dev pair via ep_show_fdinfo()
> 
> - Audit records the raw ino/dev and passes them around. We do seem to
>   have paths printed from audit as well, but if it's printed with the
>   wrong ino/dev pair I believe my point still stands.
> 
> 
> This breaks software which expects these pairs to be unique, and can
> put the user in a situation where they might not be able to find an
> inode referenced from the kernel. What's even worse - depending on how
> ino is exported, they might even find the *wrong* inode.
> 
> I also should point out that we're likely at this point because
> stat(2) has been using an unsigned long for ino. On a 32 bit system,
> it would have been impossible for the user to get the real inode
> number in the first place. So there probably wasn't much we could do.
> 
> These days though, we have statx(2) which will do the right thing on
> all platforms so we no longer have that excuse. The user ought to be
> able to take an ino/dev pair and ultimately find the actual file on
> their system, partially with the help of statx(2).
> 
> 
> Some examples of how ino is manipulated by filesystems:
> 
> - On 64 bit i_ino and kstat->ino tend to be filled in correctly (from
>   what I can tell). stat->dev is not always consistent with super->s_dev.
> 
> - On 32 bits, many filesystems employ a transformation to squeeze a 64
>   bit identifier into 32 bits. The exact methods are fs specific,
>   what's important is that we're losing information, introducing the
>   possibility of duplicate inode numbers.
> 
> - On all platforms, Btrfs and Overlayfs can have duplicate inode
>   numbers. Of course, device can be different across the fs as well
>   with the kernel reporting s_dev and these filesystems returning
>   a different device via stat() or statx().
> 
> 
> Getting the inode number portion of this pair fixed would immediately
> solve the situation for all filesystems except Btrfs and
> Overlayfs - they report a different device from stat.
> 
> Regarding the device portion of the pair, I'm honestly not sure
> whether Overlayfs cares, and my attempts to fix the s_dev situation
> for Btrfs have all come to the same dead ends that I've hit briefly
> looking into this inode number issue - the problems are intrinsically
> linked.
> 
> So my questions are:
> 
> 1) Do we care about this? On one hand, we've had this inconsistency
>    for a long time, for various reasons. On the other hand, I can point
>    to bugzilla's where these inconsistencies have become a problem.
> 
>    In the case that we don't care, any fs-internal solutions are
>    likely to be extremely disruptive to the end user.
> 
> 
> 2) If we do care, how do we fix this?
> 
>  2a) Do we use 64 bit i_ino on 32 bit systems? This has some obvious
>      downsides from a memory usage standpoint. Also it doesn't fully
>      address the issue - we still have a device field that Btrfs and
>      Overlayfs override.
> 
>      We could combine this with an intermediate structure between the
>      inode and super block so s_dev could be abstracted out. See my
>      fs_view patches for an example of how this could be done:
>      https://www.spinics.net/lists/linux-fsdevel/msg125492.html
> 
>  2b) Do we call ->getattr for this information? Plumbing a vfsmount to
>      various regions of the kernel like audit, or some of the deeper
>      tracepoints looks ugly and prone to life-timing issues (which
>      vfsmount do we use?). On the upside, we could probably make it
>      really light by only sending down the STATX_INO flag and letting
>      the filesystem optimize accordingly.
> 
>  2c) I don't think we can really use a dedicated callback without
>      passing the vfsmount through since Overlayfs ->getattr might call
>      the lower fs ->getattr. At that point we might as well use getattr.
> 
> I'd appreciate any comments or suggestions.
> 
> Thanks,
>   --Mark
> --
> To unsubscribe from this list: send the line "unsubscribe linux-unionfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Exporting a unique ino/dev pair to user space
  2018-06-07  0:47 ` Ian Kent
@ 2018-06-07  2:06   ` Mark Fasheh
  2018-06-07 11:31     ` Ian Kent
  0 siblings, 1 reply; 7+ messages in thread
From: Mark Fasheh @ 2018-06-07  2:06 UTC (permalink / raw)
  To: Ian Kent
  Cc: Al Viro, linux-fsdevel, linux-btrfs, linux-unionfs, Dave Chinner,
	Amir Goldstein, Jeff Mahoney

Hi Ian,

On Thu, Jun 07, 2018 at 08:47:28AM +0800, Ian Kent wrote:
> On Wed, 2018-06-06 at 23:38 +0200, Mark Fasheh wrote:
> > Hi,
> 
> I'm not sure I understand what the problem is.

I'll try to elaborate below.


> > We have an inconsistency in how the kernel is exporting inode number /
> > device pairs for user space. There's of course stat(2) and statx(2),
> > but aside from those we simply dump inode->i_ino and super->s_dev. In
> > some cases, the dumped values differ from what is returned via stat(2)
> > or statx(2). Some filesystems might even show duplicate (but
> > internally different!) pairs when the raw i_ino/s_dev is used.
> 
> How is it that you can dump the raw ino and s_dev if your not in
> kernel code?

If you look below my first paragraph, you'll see a list of places where the
kernel publishes (maybe that's a better word?) ino/dev pairs by printing or
otherwise copying raw ino/s_dev values into user accesible buffers.


> For stat family system calls, if the file system defines the inode
> operation getattr it will be used to fill the stat structure otherwise
> the VFS will fill the stat  structure and it will use inode->i_ino and
> sb->s_dev as you say.

My concern is that those pairs are sometimes not unique and do not line up
with what statx(2) returns. We actually need the true inode / device as is
returned by statx in those places. I go into much more detail in my original
mail.


> > Some examples where we dump raw ino/dev:
> > 
> > - /proc/<pid>/maps. I've written about how this confuses lsof(8):
> >   https://marc.info/?l=linux-btrfs&m=130074451403261&w=2
> > 
> > - Unsurprisingly, many VFS tracepoints dump ino and/or dev. See
> >   trace/events/lock.h or trace/events/writeback.h for examples.
> > 
> > - eventpoll also dumps the raw ino/dev pair via ep_show_fdinfo()
> > 
> > - Audit records the raw ino/dev and passes them around. We do seem to
> >   have paths printed from audit as well, but if it's printed with the
> >   wrong ino/dev pair I believe my point still stands.
> > 
> > 
> > This breaks software which expects these pairs to be unique, and can
> > put the user in a situation where they might not be able to find an
> > inode referenced from the kernel. What's even worse - depending on how
> > ino is exported, they might even find the *wrong* inode.

Thanks,
	--Mark

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Exporting a unique ino/dev pair to user space
  2018-06-06 21:38 Exporting a unique ino/dev pair to user space Mark Fasheh
  2018-06-07  0:47 ` Ian Kent
@ 2018-06-07  7:38 ` Amir Goldstein
  2018-06-07 21:06   ` Mark Fasheh
  1 sibling, 1 reply; 7+ messages in thread
From: Amir Goldstein @ 2018-06-07  7:38 UTC (permalink / raw)
  To: Mark Fasheh
  Cc: Al Viro, linux-fsdevel, Linux Btrfs, overlayfs, Dave Chinner,
	Jeff Mahoney, J. Bruce Fields, Jeff Layton, Miklos Szeredi,
	David Howells

On Thu, Jun 7, 2018 at 12:38 AM, Mark Fasheh <mfasheh@suse.de> wrote:
> Hi,
>
> We have an inconsistency in how the kernel is exporting inode number /
> device pairs for user space. There's of course stat(2) and statx(2),
> but aside from those we simply dump inode->i_ino and super->s_dev. In
> some cases, the dumped values differ from what is returned via stat(2)
> or statx(2). Some filesystems might even show duplicate (but
> internally different!) pairs when the raw i_ino/s_dev is used.
>
> Some examples where we dump raw ino/dev:
>
> - /proc/<pid>/maps. I've written about how this confuses lsof(8):
>   https://marc.info/?l=linux-btrfs&m=130074451403261&w=2
>
> - Unsurprisingly, many VFS tracepoints dump ino and/or dev. See
>   trace/events/lock.h or trace/events/writeback.h for examples.
>
> - eventpoll also dumps the raw ino/dev pair via ep_show_fdinfo()
>
> - Audit records the raw ino/dev and passes them around. We do seem to
>   have paths printed from audit as well, but if it's printed with the
>   wrong ino/dev pair I believe my point still stands.
>

knfsd also looks at i_ino. I converted one or two of these places
to getattr callbacks recently, but there are still some more.

Anyway, w.r.t. Overlayfs, I believe Miklos' work to make file_inode()
an overlay inode should resolve several of the issues listed above -
probably not all, but didn't check every tracepoint..

>
> This breaks software which expects these pairs to be unique, and can
> put the user in a situation where they might not be able to find an
> inode referenced from the kernel. What's even worse - depending on how
> ino is exported, they might even find the *wrong* inode.
>
> I also should point out that we're likely at this point because
> stat(2) has been using an unsigned long for ino. On a 32 bit system,
> it would have been impossible for the user to get the real inode
> number in the first place. So there probably wasn't much we could do.
>
> These days though, we have statx(2) which will do the right thing on
> all platforms so we no longer have that excuse. The user ought to be
> able to take an ino/dev pair and ultimately find the actual file on
> their system, partially with the help of statx(2).
>
>
> Some examples of how ino is manipulated by filesystems:
>
> - On 64 bit i_ino and kstat->ino tend to be filled in correctly (from
>   what I can tell). stat->dev is not always consistent with super->s_dev.
>
> - On 32 bits, many filesystems employ a transformation to squeeze a 64
>   bit identifier into 32 bits. The exact methods are fs specific,
>   what's important is that we're losing information, introducing the
>   possibility of duplicate inode numbers.
>
> - On all platforms, Btrfs and Overlayfs can have duplicate inode
>   numbers. Of course, device can be different across the fs as well
>   with the kernel reporting s_dev and these filesystems returning
>   a different device via stat() or statx().
>
>
> Getting the inode number portion of this pair fixed would immediately
> solve the situation for all filesystems except Btrfs and
> Overlayfs - they report a different device from stat.
>
> Regarding the device portion of the pair, I'm honestly not sure
> whether Overlayfs cares, and my attempts to fix the s_dev situation
> for Btrfs have all come to the same dead ends that I've hit briefly
> looking into this inode number issue - the problems are intrinsically
> linked.
>
> So my questions are:
>
> 1) Do we care about this? On one hand, we've had this inconsistency
>    for a long time, for various reasons. On the other hand, I can point
>    to bugzilla's where these inconsistencies have become a problem.
>
>    In the case that we don't care, any fs-internal solutions are
>    likely to be extremely disruptive to the end user.
>
>
> 2) If we do care, how do we fix this?
>
>  2a) Do we use 64 bit i_ino on 32 bit systems? This has some obvious
>      downsides from a memory usage standpoint. Also it doesn't fully
>      address the issue - we still have a device field that Btrfs and
>      Overlayfs override.
>
>      We could combine this with an intermediate structure between the
>      inode and super block so s_dev could be abstracted out. See my
>      fs_view patches for an example of how this could be done:
>      https://www.spinics.net/lists/linux-fsdevel/msg125492.html
>
>  2b) Do we call ->getattr for this information? Plumbing a vfsmount to
>      various regions of the kernel like audit, or some of the deeper
>      tracepoints looks ugly and prone to life-timing issues (which
>      vfsmount do we use?). On the upside, we could probably make it
>      really light by only sending down the STATX_INO flag and letting
>      the filesystem optimize accordingly.
>
>  2c) I don't think we can really use a dedicated callback without
>      passing the vfsmount through since Overlayfs ->getattr might call
>      the lower fs ->getattr. At that point we might as well use getattr.
>

Didn't get the Overlayfs lower fs getattr argument.
Overlayfs doesn't use the vfsmount passed into getattr
and it could very well pass a dentry to lower fs getattr.

As a matter of fact, out of 35 getattr implementations in the kernel:
(git grep "\s\.getattr\s" fs| awk '{print $4}'| sort -u|grep -v
"nfs.*_proc_getattr"|wc -l)
there is only one using the vfsmount - nfs_getattr() for MNT_NOATIME
check and most of them only ever use d_inode(path->dentry).

This API seems quite odd.
Maybe it should be fixed so more in kernel call sites could call getattr
without a vfsmount.
Not sure what would be the best way to handle nfs_getattr().

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Exporting a unique ino/dev pair to user space
  2018-06-07  2:06   ` Mark Fasheh
@ 2018-06-07 11:31     ` Ian Kent
  0 siblings, 0 replies; 7+ messages in thread
From: Ian Kent @ 2018-06-07 11:31 UTC (permalink / raw)
  To: Mark Fasheh
  Cc: Al Viro, linux-fsdevel, linux-btrfs, linux-unionfs, Dave Chinner,
	Amir Goldstein, Jeff Mahoney

On Thu, 2018-06-07 at 04:06 +0200, Mark Fasheh wrote:
> Hi Ian,
> 
> On Thu, Jun 07, 2018 at 08:47:28AM +0800, Ian Kent wrote:
> > On Wed, 2018-06-06 at 23:38 +0200, Mark Fasheh wrote:
> > > Hi,
> > 
> > I'm not sure I understand what the problem is.
> 
> I'll try to elaborate below.
> 
> 
> > > We have an inconsistency in how the kernel is exporting inode number /
> > > device pairs for user space. There's of course stat(2) and statx(2),
> > > but aside from those we simply dump inode->i_ino and super->s_dev. In
> > > some cases, the dumped values differ from what is returned via stat(2)
> > > or statx(2). Some filesystems might even show duplicate (but
> > > internally different!) pairs when the raw i_ino/s_dev is used.
> > 
> > How is it that you can dump the raw ino and s_dev if your not in
> > kernel code?
> 
> If you look below my first paragraph, you'll see a list of places where the
> kernel publishes (maybe that's a better word?) ino/dev pairs by printing or
> otherwise copying raw ino/s_dev values into user accesible buffers.
> 
> 
> > For stat family system calls, if the file system defines the inode
> > operation getattr it will be used to fill the stat structure otherwise
> > the VFS will fill the stat  structure and it will use inode->i_ino and
> > sb->s_dev as you say.
> 
> My concern is that those pairs are sometimes not unique and do not line up
> with what statx(2) returns. We actually need the true inode / device as is
> returned by statx in those places. I go into much more detail in my original
> mail.

IMHO I think you are right in that the values seen by user space
should be consistent.

I also think that vfs_statx() (which is pretty much what's called
by all the stat family system calls, and indirectly vfs_getattr)
should be the defining way to get those values if only because it's
core VFS and maintained by the VFS maintainer who should be the one
to set the rules.

But, as you say there are a bunch of places, not necessarily easy
to find, that would need review.

And there's the question of 32 bit .....

> 
> 
> > > Some examples where we dump raw ino/dev:
> > > 
> > > - /proc/<pid>/maps. I've written about how this confuses lsof(8):
> > >   https://marc.info/?l=linux-btrfs&m=130074451403261&w=2
> > > 
> > > - Unsurprisingly, many VFS tracepoints dump ino and/or dev. See
> > >   trace/events/lock.h or trace/events/writeback.h for examples.
> > > 
> > > - eventpoll also dumps the raw ino/dev pair via ep_show_fdinfo()
> > > 
> > > - Audit records the raw ino/dev and passes them around. We do seem to
> > >   have paths printed from audit as well, but if it's printed with the
> > >   wrong ino/dev pair I believe my point still stands.
> > > 
> > > 
> > > This breaks software which expects these pairs to be unique, and can
> > > put the user in a situation where they might not be able to find an
> > > inode referenced from the kernel. What's even worse - depending on how
> > > ino is exported, they might even find the *wrong* inode.
> 
> Thanks,
> 	--Mark

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Exporting a unique ino/dev pair to user space
  2018-06-07  7:38 ` Amir Goldstein
@ 2018-06-07 21:06   ` Mark Fasheh
  2018-06-08  6:45     ` Amir Goldstein
  0 siblings, 1 reply; 7+ messages in thread
From: Mark Fasheh @ 2018-06-07 21:06 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Al Viro, linux-fsdevel, Linux Btrfs, overlayfs, Dave Chinner,
	Jeff Mahoney, J. Bruce Fields, Jeff Layton, Miklos Szeredi,
	David Howells

On Thu, Jun 07, 2018 at 10:38:51AM +0300, Amir Goldstein wrote:
> On Thu, Jun 7, 2018 at 12:38 AM, Mark Fasheh <mfasheh@suse.de> wrote:
> > Hi,
> >
> > We have an inconsistency in how the kernel is exporting inode number /
> > device pairs for user space. There's of course stat(2) and statx(2),
> > but aside from those we simply dump inode->i_ino and super->s_dev. In
> > some cases, the dumped values differ from what is returned via stat(2)
> > or statx(2). Some filesystems might even show duplicate (but
> > internally different!) pairs when the raw i_ino/s_dev is used.
> >
> > Some examples where we dump raw ino/dev:
> >
> > - /proc/<pid>/maps. I've written about how this confuses lsof(8):
> >   https://marc.info/?l=linux-btrfs&m=130074451403261&w=2
> >
> > - Unsurprisingly, many VFS tracepoints dump ino and/or dev. See
> >   trace/events/lock.h or trace/events/writeback.h for examples.
> >
> > - eventpoll also dumps the raw ino/dev pair via ep_show_fdinfo()
> >
> > - Audit records the raw ino/dev and passes them around. We do seem to
> >   have paths printed from audit as well, but if it's printed with the
> >   wrong ino/dev pair I believe my point still stands.
> >
> 
> knfsd also looks at i_ino. I converted one or two of these places
> to getattr callbacks recently, but there are still some more.
> 
> Anyway, w.r.t. Overlayfs, I believe Miklos' work to make file_inode()
> an overlay inode should resolve several of the issues listed above -
> probably not all, but didn't check every tracepoint..

Ok, thanks for letting me know about knfsd and the overlayfs work. I haven't
had a chance to check out the overlayfs work yet but I will shortly.


> > This breaks software which expects these pairs to be unique, and can
> > put the user in a situation where they might not be able to find an
> > inode referenced from the kernel. What's even worse - depending on how
> > ino is exported, they might even find the *wrong* inode.
> >
> > I also should point out that we're likely at this point because
> > stat(2) has been using an unsigned long for ino. On a 32 bit system,
> > it would have been impossible for the user to get the real inode
> > number in the first place. So there probably wasn't much we could do.
> >
> > These days though, we have statx(2) which will do the right thing on
> > all platforms so we no longer have that excuse. The user ought to be
> > able to take an ino/dev pair and ultimately find the actual file on
> > their system, partially with the help of statx(2).
> >
> >
> > Some examples of how ino is manipulated by filesystems:
> >
> > - On 64 bit i_ino and kstat->ino tend to be filled in correctly (from
> >   what I can tell). stat->dev is not always consistent with super->s_dev.
> >
> > - On 32 bits, many filesystems employ a transformation to squeeze a 64
> >   bit identifier into 32 bits. The exact methods are fs specific,
> >   what's important is that we're losing information, introducing the
> >   possibility of duplicate inode numbers.
> >
> > - On all platforms, Btrfs and Overlayfs can have duplicate inode
> >   numbers. Of course, device can be different across the fs as well
> >   with the kernel reporting s_dev and these filesystems returning
> >   a different device via stat() or statx().
> >
> >
> > Getting the inode number portion of this pair fixed would immediately
> > solve the situation for all filesystems except Btrfs and
> > Overlayfs - they report a different device from stat.
> >
> > Regarding the device portion of the pair, I'm honestly not sure
> > whether Overlayfs cares, and my attempts to fix the s_dev situation
> > for Btrfs have all come to the same dead ends that I've hit briefly
> > looking into this inode number issue - the problems are intrinsically
> > linked.
> >
> > So my questions are:
> >
> > 1) Do we care about this? On one hand, we've had this inconsistency
> >    for a long time, for various reasons. On the other hand, I can point
> >    to bugzilla's where these inconsistencies have become a problem.
> >
> >    In the case that we don't care, any fs-internal solutions are
> >    likely to be extremely disruptive to the end user.
> >
> >
> > 2) If we do care, how do we fix this?
> >
> >  2a) Do we use 64 bit i_ino on 32 bit systems? This has some obvious
> >      downsides from a memory usage standpoint. Also it doesn't fully
> >      address the issue - we still have a device field that Btrfs and
> >      Overlayfs override.
> >
> >      We could combine this with an intermediate structure between the
> >      inode and super block so s_dev could be abstracted out. See my
> >      fs_view patches for an example of how this could be done:
> >      https://www.spinics.net/lists/linux-fsdevel/msg125492.html
> >
> >  2b) Do we call ->getattr for this information? Plumbing a vfsmount to
> >      various regions of the kernel like audit, or some of the deeper
> >      tracepoints looks ugly and prone to life-timing issues (which
> >      vfsmount do we use?). On the upside, we could probably make it
> >      really light by only sending down the STATX_INO flag and letting
> >      the filesystem optimize accordingly.
> >
> >  2c) I don't think we can really use a dedicated callback without
> >      passing the vfsmount through since Overlayfs ->getattr might call
> >      the lower fs ->getattr. At that point we might as well use getattr.
> >
> 
> Didn't get the Overlayfs lower fs getattr argument.
> Overlayfs doesn't use the vfsmount passed into getattr
> and it could very well pass a dentry to lower fs getattr.

My main point in 2c) is that, from my understanding, Overlayfs may need to
call down to one of the filesystem ->getattr() calls. Since those take a
vfsmount we don't gain anything by having a unique callback from this - the
plumbing work would be the same.


> As a matter of fact, out of 35 getattr implementations in the kernel:
> (git grep "\s\.getattr\s" fs| awk '{print $4}'| sort -u|grep -v
> "nfs.*_proc_getattr"|wc -l)
> there is only one using the vfsmount - nfs_getattr() for MNT_NOATIME
> check and most of them only ever use d_inode(path->dentry).
> 
> This API seems quite odd.
> Maybe it should be fixed so more in kernel call sites could call getattr
> without a vfsmount.
> Not sure what would be the best way to handle nfs_getattr().

Yeah I saw that nfs_getattr() is the only user of the vfsmount. I totally
agree that this would be tons easier if we didn't have to pass the vfsmount
(and like you said, there's only the ONE user).

This is a bit hacky, but I wonder if we could blow the function signature
back out to a dentry + vfsmount and make the vfsmount optional when getattr
is called only for ino/dev. It's a bit ugly to have optional arguments like
that but nfs would work with just a line or two change and the other fs
would never even care.
	--Mark

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Exporting a unique ino/dev pair to user space
  2018-06-07 21:06   ` Mark Fasheh
@ 2018-06-08  6:45     ` Amir Goldstein
  0 siblings, 0 replies; 7+ messages in thread
From: Amir Goldstein @ 2018-06-08  6:45 UTC (permalink / raw)
  To: Mark Fasheh
  Cc: Al Viro, linux-fsdevel, Linux Btrfs, overlayfs, Dave Chinner,
	Jeff Mahoney, J. Bruce Fields, Jeff Layton, Miklos Szeredi,
	David Howells

On Fri, Jun 8, 2018 at 12:06 AM, Mark Fasheh <mfasheh@suse.de> wrote:
[...]

>> >  2c) I don't think we can really use a dedicated callback without
>> >      passing the vfsmount through since Overlayfs ->getattr might call
>> >      the lower fs ->getattr. At that point we might as well use getattr.
>> >
>>
>> Didn't get the Overlayfs lower fs getattr argument.
>> Overlayfs doesn't use the vfsmount passed into getattr
>> and it could very well pass a dentry to lower fs getattr.
>
> My main point in 2c) is that, from my understanding, Overlayfs may need to
> call down to one of the filesystem ->getattr() calls. Since those take a
> vfsmount we don't gain anything by having a unique callback from this - the
> plumbing work would be the same.
>

I guess I don't understand what you mean by "dedicated callback", but I
think we both in agreement that changing fs getattr() to take dentry is
preferred.

>
>> As a matter of fact, out of 35 getattr implementations in the kernel:
>> (git grep "\s\.getattr\s" fs| awk '{print $4}'| sort -u|grep -v
>> "nfs.*_proc_getattr"|wc -l)
>> there is only one using the vfsmount - nfs_getattr() for MNT_NOATIME
>> check and most of them only ever use d_inode(path->dentry).
>>
>> This API seems quite odd.
>> Maybe it should be fixed so more in kernel call sites could call getattr
>> without a vfsmount.
>> Not sure what would be the best way to handle nfs_getattr().
>
> Yeah I saw that nfs_getattr() is the only user of the vfsmount. I totally
> agree that this would be tons easier if we didn't have to pass the vfsmount
> (and like you said, there's only the ONE user).
>
> This is a bit hacky, but I wonder if we could blow the function signature
> back out to a dentry + vfsmount and make the vfsmount optional when getattr
> is called only for ino/dev. It's a bit ugly to have optional arguments like
> that but nfs would work with just a line or two change and the other fs
> would never even care.

OR.. change the signature of fs getattr() and vfs_getattr_nosec() to dentry
and set a kernel query_flag AT_STATX_NO_REMOTE_ATIME from
vfs_getattr(). No need to pass vfsmount to nfs.

Then users that access i_ino/s_dev can call vfs_getattr_nosec() with a
bonus of not having to go through security modules (which now they don't).

The only current user of vfs_getattr_nosec() expfs.c queries STATX_INO,
so change is safe.

Overlayfs doesn't need to get vfsmount in ovl_getattr() - it knows the
lower fs vfsmount for calling vfs_getattr() regardless and in other
places overlayfs just needs to query  STATX_INO, so it may also
use the light vfs_getattr_nosec() callback.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2018-06-08  6:45 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-06 21:38 Exporting a unique ino/dev pair to user space Mark Fasheh
2018-06-07  0:47 ` Ian Kent
2018-06-07  2:06   ` Mark Fasheh
2018-06-07 11:31     ` Ian Kent
2018-06-07  7:38 ` Amir Goldstein
2018-06-07 21:06   ` Mark Fasheh
2018-06-08  6:45     ` Amir Goldstein

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.