All of lore.kernel.org
 help / color / mirror / Atom feed
* [Virtio-fs] Ways to uniquely and persistently identify nodes
@ 2020-01-13 17:46 Max Reitz
  2020-01-14 16:13 ` Miklos Szeredi
                   ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Max Reitz @ 2020-01-13 17:46 UTC (permalink / raw)
  To: virtio-fs


[-- Attachment #1.1: Type: text/plain, Size: 8443 bytes --]

Hi,

As discussed in today’s meeting, there is a problem with uniquely and
persistently identifying nodes in the guest.

Actually, there are multiple problems:

(1) stat’s st_ino in the guest is not necessarily unique.  Currently, it
just the st_ino from the host file, so if you have mounted multiple
different filesystems in the exported directory tree, you may get
collisions.

(2) The FUSE 64-bit fuse_ino_t (which identifies an open file,
basically) is not persistent.  It is just an index into a vector that
contains all open inodes, and whenever virtiofsd is restarted, the
vector is renewed.  That means that whenever this happens, all
fuse_ino_t values the guest holds will become invalid.  (And when
virtiofsd starts handing out new fuse_ino_t values, those will probably
not point to the same inodes as before.)

(3) name_to_handle_at()/open_by_handle_at() are implemented by FUSE just
by handing out the fuse_ino_t value as the handle.  This is not only a
problem as long as fuse_ino_t is not persistent (see (2)), but also in
general, because the fuse_ino_t value is only valid (per FUSE protocol)
as long as the inode is referenced by the guest.


The first question that I think needs to be asked is whether we care
about each of this points at all.

(1) Maybe it just doesn’t matter whether the st_ino values are unique.

(2) Maybe we don’t care about virtiofsd being restarted while the guest
is running or only paused.  (“Restarting” includes migration of the
daemon to a different host.)

(3) I suppose we do care about this.


Assuming we do care about the points, here are some ways I have
considered of addressing them:

(1)

(a)

If we could make the 64-bit fuse_ino_t unique and persistent (see (2)),
we could use that for st_ino (also a 64-bit field).

(This is the case if we keep the current schema for fuse_ino_t, be it
because we don’t care about (2) or because we want (2a).)

(b)

Otherwise, we probably want to continue passing through st_ino and then
ensure that stat’s st_dev is unique for each submount in the exported
tree.  We can achieve that by extending the FUSE protocol for virtiofsd
to announce submounts and then the FUSE kernel driver to automount them.
 (This means that these submounts in the guest are then technically
unrelated filesystems.  It also means that the FUSE driver would need to
automount them with the “virtiofs” fs type, which is kind of weird, and
restricts this solution to virtiofs.)


(2)

(a)

We can keep the current way if we just store the in-memory mapping while
virtiofsd is suspended (and migrate it it if we want to migrate the
virtiofsd instance).  The table may grow to be very large, though, and
it contains for example file descriptors that we would need to migrate,
too (perhaps as file handles?).

(b)

We could extend the fuse_ino_t type to an arbitrary size basically, to
be negotiated between FUSE server and client.  This would require
extensive modification of the FUSE protocol and kernel driver (and would
ask for respective modification of libfuse, too), though.  Such a larger
value could then capture both a submount ID and a unique identifier for
inodes on the respective host filesystems, such as st_ino.  This would
ensure that every virtiofsd instance would generate the same fuse_ino_t
values for the same nodes on the same exported tree.

However, note that this doesn’t auto-populate the fuse_ino_t mappings:
When after restarting virtiofsd the server wants to access an existing
inode, it can’t, because there is no good way to translate even larger
fuse_ino_t values to a file descriptor.  (We could do that if the
fuse_ino_t value encapsulated a handle.  (As in open_by_handle_at().)
The problem is that we can’t trust the guest to keep a handle, so we
must ensure that the handle returned points to a file the guest is
allowed to access.  Doing that cryptographically (e.g. with a MAC) is
probably out of the question, because that would make fuse_ino_t really
big.  Another idea would be to set a flag on the host FS for files that
the guest has a handle to.  But this flag would need to be
guest-specific...  So we’d probably again end up with a large database
just as in (2a).  (It doesn’t need to be a flag on the FS, it could also
be a database, I suppose.))

We could also re-enumerate the exported tree after reopening (perhaps
lazily for each exported filesystem) and thus recreate the mapping.  But
this would take as much time as a “find” over the whole exported tree.

(c)

We could complement the fuse_ino_t value by a context ID, that in our
case would probably be derived from the submount ID (e.g. the relative
mount point).  This would only require minor modification of the FUSE
protocol: Detecting mount points in lookups; a new command to retrieve a
mount point’s context ID; and some way to set this context ID.

We could set the context ID either explicitly with a new command; or as
part of every FUSE request (a field in the request header); or just as
part of virtio-fs (be it with one virtqueue per context (which may or
may not be feasible), or just by prefixing every FUSE request on the
line with a context ID).

One of the questions here is: If we just choose the context ID to be 32
or 64 bit in size, will we ever run into the same problem of “96/128
bits aren’t enough”?

The other problem is the same as in (2b): We cannot get an FD from a
context ID + fuse_ino_t alone, so if virtiofsd is restarted, the guest
cannot keep using existing inodes without reopening them.

The only way I see here to get around this problem is to re-enumerate
the whole exported tree (or at least lazily by context ID a.k.a.
filesystem) and thus reconstruct the mapping from ID to inode after
resuming virtiofsd.


(3)

(a)

If the fuse_ino_t keeps to be 64 bit and persistent (we don’t reuse IDs
and we keep existing mappings around even when their refcount drops to 0
(but why would we do that?)), we don’t have to change anything.

(b)

We probably just want new FUSE commands to query handles and open
handles.  We could then decide whether we want them to use persistent
IDs that we get from solving (2), or just pass through the handles from
the host.

If we do the latter, we have the same problem I mentioned in (2b): We
can’t trust the guest to keep the handle unmodified, and if it does
modify it, we have to keep the guest from accessing files it must not see.

The two ways that have been proposed so far are

(I) Enrich the host’s file handle by cryptographic information to prove
its integrity, e.g. a MAC based on a randomly generated
virtiofsd-internal symmetric key.  The two problems I see are that the
file handle gets rather big, and that the guest might be able to guess
the MAC (very improbable, though, especially if we were to terminate the
connection when the guest tries to use a file handle with an invalid MAC).

(II) Keep notes somewhere of what file handles the guest may use.  This
could be implemented by storing a virtiofsd instance ID as metadata in
the filesystem (attached to the file, so virtiofsd can read it when
opening the file by its handle); or as a database for each virtiofsd
instance (where it puts all handles handed out to a guest).



You can see that all of these problems are kind of intertwined, but not
really.  Many solutions look similar, and some solutions can solve
multiple problems; but it doesn’t mean we have to think about everything
at once.  I think we should first think about how to handle the
identification problem (1/2).  Maybe there isn’t much to do there anyway
because we don’t care about it and can just use the existing fuse_ino_t
as st_ino for the guest.

Then we can think about (3).  If we decide to add new FUSE commands for
getting/opening file handles, then this works with pretty much every way
we go about (1) and (2).



Side note:

As for migrating a virtiofsd instance: Note that everything above that
depends on host file handles or host ino_t values will make it
impossible to migrate to a different filesystem.  But maybe doing that
would lead to all kind of other problems anyway.


Another note:

It took rather long to write this, so I probably forgot a whole bunch of
stuff...

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Virtio-fs] Ways to uniquely and persistently identify nodes
  2020-01-13 17:46 [Virtio-fs] Ways to uniquely and persistently identify nodes Max Reitz
@ 2020-01-14 16:13 ` Miklos Szeredi
  2020-01-14 16:49   ` Max Reitz
  2020-01-15 10:12 ` Dr. David Alan Gilbert
  2020-01-15 12:50 ` Stefan Hajnoczi
  2 siblings, 1 reply; 25+ messages in thread
From: Miklos Szeredi @ 2020-01-14 16:13 UTC (permalink / raw)
  To: Max Reitz; +Cc: virtio-fs-list

On Mon, Jan 13, 2020 at 6:47 PM Max Reitz <mreitz@redhat.com> wrote:
>
> Hi,
>
> As discussed in today’s meeting, there is a problem with uniquely and
> persistently identifying nodes in the guest.
>
> Actually, there are multiple problems:
>
> (1) stat’s st_ino in the guest is not necessarily unique.  Currently, it
> just the st_ino from the host file, so if you have mounted multiple
> different filesystems in the exported directory tree, you may get
> collisions.
>
> (2) The FUSE 64-bit fuse_ino_t (which identifies an open file,
> basically) is not persistent.  It is just an index into a vector that
> contains all open inodes, and whenever virtiofsd is restarted, the
> vector is renewed.  That means that whenever this happens, all
> fuse_ino_t values the guest holds will become invalid.  (And when
> virtiofsd starts handing out new fuse_ino_t values, those will probably
> not point to the same inodes as before.)
>
> (3) name_to_handle_at()/open_by_handle_at() are implemented by FUSE just
> by handing out the fuse_ino_t value as the handle.  This is not only a
> problem as long as fuse_ino_t is not persistent (see (2)), but also in
> general, because the fuse_ino_t value is only valid (per FUSE protocol)
> as long as the inode is referenced by the guest.
>
> The first question that I think needs to be asked is whether we care
> about each of this points at all.
>
> (1) Maybe it just doesn’t matter whether the st_ino values are unique.

It does matter, otherwise we can't claim it's a POSIX filesystem, and
applications can't rely on the guarantees made by the standard.
st_ino is used to find hard links by backup utilities and st_ino +
d_ino values are used by tree traversal (such as find(1)).

>
> (2) Maybe we don’t care about virtiofsd being restarted while the guest
> is running or only paused.  (“Restarting” includes migration of the
> daemon to a different host.)
>
> (3) I suppose we do care about this.
>
>
> Assuming we do care about the points, here are some ways I have
> considered of addressing them:
>
> (1)
>
> (a)
>
> If we could make the 64-bit fuse_ino_t unique and persistent (see (2)),
> we could use that for st_ino (also a 64-bit field).

My gut feeling is that we'd want to avoid boundless maps stored in
memory or any kind of storage.

In fact we want to do the opposite: avoid having to keep map of
objects currently in guest cache.  We want to flush the server maps
independently of the guest cache when the maps grow too large or there
are too many open file descriptors (a big issue with running virtiofsd
unprivileged)...

>
> (b)
>
> Otherwise, we probably want to continue passing through st_ino and then
> ensure that stat’s st_dev is unique for each submount in the exported
> tree.  We can achieve that by extending the FUSE protocol for virtiofsd
> to announce submounts and then the FUSE kernel driver to automount them.
>  (This means that these submounts in the guest are then technically
> unrelated filesystems.  It also means that the FUSE driver would need to
> automount them with the “virtiofs” fs type, which is kind of weird, and
> restricts this solution to virtiofs.)
>
>
> (2)
>
> (a)
>
> We can keep the current way if we just store the in-memory mapping while
> virtiofsd is suspended (and migrate it it if we want to migrate the
> virtiofsd instance).  The table may grow to be very large, though, and
> it contains for example file descriptors that we would need to migrate,
> too (perhaps as file handles?).
>
> (b)
>
> We could extend the fuse_ino_t type to an arbitrary size basically, to
> be negotiated between FUSE server and client.  This would require
> extensive modification of the FUSE protocol and kernel driver (and would
> ask for respective modification of libfuse, too), though.  Such a larger
> value could then capture both a submount ID and a unique identifier for
> inodes on the respective host filesystems, such as st_ino.  This would
> ensure that every virtiofsd instance would generate the same fuse_ino_t
> values for the same nodes on the same exported tree.
>
> However, note that this doesn’t auto-populate the fuse_ino_t mappings:
> When after restarting virtiofsd the server wants to access an existing
> inode, it can’t, because there is no good way to translate even larger
> fuse_ino_t values to a file descriptor.  (We could do that if the
> fuse_ino_t value encapsulated a handle.  (As in open_by_handle_at().)

...and this is the proposal that would solve that one as well.  If we
could encapsulate the file handle into all messages, than we wouldn't
have to worry about refcounting objects and keeping files open.  The
server could just flush it's caches independently of the guest.

> The problem is that we can’t trust the guest to keep a handle, so we
> must ensure that the handle returned points to a file the guest is
> allowed to access.  Doing that cryptographically (e.g. with a MAC) is
> probably out of the question, because that would make fuse_ino_t really
> big.

I'm not a crypto expert, but why would that need to be big?  AES has
16 byte block size, so the handle would need to be padded to be a
multiple of 16 bytes, but that doesn't sound excessive at all.

What worries me more is the variable nature of the field size, but I
suppose there's no good way to get around that.

>   Another idea would be to set a flag on the host FS for files that
> the guest has a handle to.  But this flag would need to be
> guest-specific...  So we’d probably again end up with a large database
> just as in (2a).  (It doesn’t need to be a flag on the FS, it could also
> be a database, I suppose.))
>
> We could also re-enumerate the exported tree after reopening (perhaps
> lazily for each exported filesystem) and thus recreate the mapping.  But
> this would take as much time as a “find” over the whole exported tree.
>
> (c)
>
> We could complement the fuse_ino_t value by a context ID, that in our
> case would probably be derived from the submount ID (e.g. the relative
> mount point).  This would only require minor modification of the FUSE
> protocol: Detecting mount points in lookups; a new command to retrieve a
> mount point’s context ID; and some way to set this context ID.
>
> We could set the context ID either explicitly with a new command; or as
> part of every FUSE request (a field in the request header); or just as
> part of virtio-fs (be it with one virtqueue per context (which may or
> may not be feasible), or just by prefixing every FUSE request on the
> line with a context ID).
>
> One of the questions here is: If we just choose the context ID to be 32
> or 64 bit in size, will we ever run into the same problem of “96/128
> bits aren’t enough”?

I wouldn't worry about that.   32bits for the number of submounts is
definitely enough for the foreseeable future.

>
> The other problem is the same as in (2b): We cannot get an FD from a
> context ID + fuse_ino_t alone, so if virtiofsd is restarted, the guest
> cannot keep using existing inodes without reopening them.
>
> The only way I see here to get around this problem is to re-enumerate
> the whole exported tree (or at least lazily by context ID a.k.a.
> filesystem) and thus reconstruct the mapping from ID to inode after
> resuming virtiofsd.
>
>
> (3)
>
> (a)
>
> If the fuse_ino_t keeps to be 64 bit and persistent (we don’t reuse IDs
> and we keep existing mappings around even when their refcount drops to 0
> (but why would we do that?)), we don’t have to change anything.
>
> (b)
>
> We probably just want new FUSE commands to query handles and open
> handles.  We could then decide whether we want them to use persistent
> IDs that we get from solving (2), or just pass through the handles from
> the host.
>
> If we do the latter, we have the same problem I mentioned in (2b): We
> can’t trust the guest to keep the handle unmodified, and if it does
> modify it, we have to keep the guest from accessing files it must not see.
>
> The two ways that have been proposed so far are
>
> (I) Enrich the host’s file handle by cryptographic information to prove
> its integrity, e.g. a MAC based on a randomly generated
> virtiofsd-internal symmetric key.  The two problems I see are that the
> file handle gets rather big, and that the guest might be able to guess
> the MAC (very improbable, though, especially if we were to terminate the
> connection when the guest tries to use a file handle with an invalid MAC).
>
> (II) Keep notes somewhere of what file handles the guest may use.  This
> could be implemented by storing a virtiofsd instance ID as metadata in
> the filesystem (attached to the file, so virtiofsd can read it when
> opening the file by its handle); or as a database for each virtiofsd
> instance (where it puts all handles handed out to a guest).
>
>
>
> You can see that all of these problems are kind of intertwined, but not
> really.  Many solutions look similar, and some solutions can solve
> multiple problems; but it doesn’t mean we have to think about everything
> at once.  I think we should first think about how to handle the
> identification problem (1/2).  Maybe there isn’t much to do there anyway
> because we don’t care about it and can just use the existing fuse_ino_t
> as st_ino for the guest.
>
> Then we can think about (3).  If we decide to add new FUSE commands for
> getting/opening file handles, then this works with pretty much every way
> we go about (1) and (2).
>
>
>
> Side note:
>
> As for migrating a virtiofsd instance: Note that everything above that
> depends on host file handles or host ino_t values will make it
> impossible to migrate to a different filesystem.  But maybe doing that
> would lead to all kind of other problems anyway.

If filesystem exported to guests is backed by a common server (e.g.
NFS) than that should solve the ino_t and file handle persistency
across migration.

Thanks,
Miklos



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Virtio-fs] Ways to uniquely and persistently identify nodes
  2020-01-14 16:13 ` Miklos Szeredi
@ 2020-01-14 16:49   ` Max Reitz
  2020-01-14 19:24     ` Miklos Szeredi
  0 siblings, 1 reply; 25+ messages in thread
From: Max Reitz @ 2020-01-14 16:49 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: virtio-fs-list


[-- Attachment #1.1: Type: text/plain, Size: 8118 bytes --]

On 14.01.20 17:13, Miklos Szeredi wrote:
> On Mon, Jan 13, 2020 at 6:47 PM Max Reitz <mreitz@redhat.com> wrote:
>>
>> Hi,
>>
>> As discussed in today’s meeting, there is a problem with uniquely and
>> persistently identifying nodes in the guest.
>>
>> Actually, there are multiple problems:
>>
>> (1) stat’s st_ino in the guest is not necessarily unique.  Currently, it
>> just the st_ino from the host file, so if you have mounted multiple
>> different filesystems in the exported directory tree, you may get
>> collisions.
>>
>> (2) The FUSE 64-bit fuse_ino_t (which identifies an open file,
>> basically) is not persistent.  It is just an index into a vector that
>> contains all open inodes, and whenever virtiofsd is restarted, the
>> vector is renewed.  That means that whenever this happens, all
>> fuse_ino_t values the guest holds will become invalid.  (And when
>> virtiofsd starts handing out new fuse_ino_t values, those will probably
>> not point to the same inodes as before.)
>>
>> (3) name_to_handle_at()/open_by_handle_at() are implemented by FUSE just
>> by handing out the fuse_ino_t value as the handle.  This is not only a
>> problem as long as fuse_ino_t is not persistent (see (2)), but also in
>> general, because the fuse_ino_t value is only valid (per FUSE protocol)
>> as long as the inode is referenced by the guest.
>>
>> The first question that I think needs to be asked is whether we care
>> about each of this points at all.
>>
>> (1) Maybe it just doesn’t matter whether the st_ino values are unique.
> 
> It does matter, otherwise we can't claim it's a POSIX filesystem, and
> applications can't rely on the guarantees made by the standard.
> st_ino is used to find hard links by backup utilities and st_ino +
> d_ino values are used by tree traversal (such as find(1)).
> 
>>
>> (2) Maybe we don’t care about virtiofsd being restarted while the guest
>> is running or only paused.  (“Restarting” includes migration of the
>> daemon to a different host.)
>>
>> (3) I suppose we do care about this.
>>
>>
>> Assuming we do care about the points, here are some ways I have
>> considered of addressing them:
>>
>> (1)
>>
>> (a)
>>
>> If we could make the 64-bit fuse_ino_t unique and persistent (see (2)),
>> we could use that for st_ino (also a 64-bit field).
> 
> My gut feeling is that we'd want to avoid boundless maps stored in
> memory or any kind of storage.

Well, if we don’t need persistent fuse_ino_t values (i.e., we don’t care
about (2)), then we could just use what passthrough_ll currently does.
It does sound a bit like you don’t like even the current mapping,
though. (?)

> In fact we want to do the opposite: avoid having to keep map of
> objects currently in guest cache.  We want to flush the server maps
> independently of the guest cache when the maps grow too large or there
> are too many open file descriptors (a big issue with running virtiofsd
> unprivileged)...
> 
>>
>> (b)
>>
>> Otherwise, we probably want to continue passing through st_ino and then
>> ensure that stat’s st_dev is unique for each submount in the exported
>> tree.  We can achieve that by extending the FUSE protocol for virtiofsd
>> to announce submounts and then the FUSE kernel driver to automount them.
>>  (This means that these submounts in the guest are then technically
>> unrelated filesystems.  It also means that the FUSE driver would need to
>> automount them with the “virtiofs” fs type, which is kind of weird, and
>> restricts this solution to virtiofs.)
>>
>>
>> (2)
>>
>> (a)
>>
>> We can keep the current way if we just store the in-memory mapping while
>> virtiofsd is suspended (and migrate it it if we want to migrate the
>> virtiofsd instance).  The table may grow to be very large, though, and
>> it contains for example file descriptors that we would need to migrate,
>> too (perhaps as file handles?).
>>
>> (b)
>>
>> We could extend the fuse_ino_t type to an arbitrary size basically, to
>> be negotiated between FUSE server and client.  This would require
>> extensive modification of the FUSE protocol and kernel driver (and would
>> ask for respective modification of libfuse, too), though.  Such a larger
>> value could then capture both a submount ID and a unique identifier for
>> inodes on the respective host filesystems, such as st_ino.  This would
>> ensure that every virtiofsd instance would generate the same fuse_ino_t
>> values for the same nodes on the same exported tree.
>>
>> However, note that this doesn’t auto-populate the fuse_ino_t mappings:
>> When after restarting virtiofsd the server wants to access an existing
>> inode, it can’t, because there is no good way to translate even larger
>> fuse_ino_t values to a file descriptor.  (We could do that if the
>> fuse_ino_t value encapsulated a handle.  (As in open_by_handle_at().)
> 
> ...and this is the proposal that would solve that one as well.  If we
> could encapsulate the file handle into all messages, than we wouldn't
> have to worry about refcounting objects and keeping files open.  The
> server could just flush it's caches independently of the guest.
> 
>> The problem is that we can’t trust the guest to keep a handle, so we
>> must ensure that the handle returned points to a file the guest is
>> allowed to access.  Doing that cryptographically (e.g. with a MAC) is
>> probably out of the question, because that would make fuse_ino_t really
>> big.
> 
> I'm not a crypto expert, but why would that need to be big?  AES has
> 16 byte block size, so the handle would need to be padded to be a
> multiple of 16 bytes, but that doesn't sound excessive at all.

A MAC is an encrypted hash, so it needs to be as long as the hash is.
That is, 32 bytes for SHA-256 or 64 for SHA-512.  I naively expected
that increasing the file handle size from 8 to 32 + handle (which is
probably 12 to 16 bytes in length, so ~48 B) or even 64 + handle (~80 B)
would not be so nice.

> What worries me more is the variable nature of the field size, but I
> suppose there's no good way to get around that.

What worries me most is how to pass that object around to all FUSE
functions, and that they all need a new interface.

I just had a very fuzzy (and maybe stupid) idea: Maybe we could keep an
internal vector of currently active handles and then when variable-size
handles are enabled, fuse_ino_t would just act as an index into that vector?

I suppose we could then use the full-size handles in all messages and
just hand out temporary indices to existing functions (just so we don’t
have to change their interface).  Server and client have their own
vectors, because when they communicate, only the full handles have meaning.

Or we could implement the table on top of the current system by sharing
it between the client and server.  Whenever the server creates a
fuse_ino_t value, it then also creates a full-size handle, and returns
both the handle and its corresponding fuse_ino_t value to the server.
The server can use the fuse_ino_t normally most of the time, but with a
catch: The client would be able to invalidate it.  Then the server needs
to obtain a new fuse_ino_t value for the existing handle.
(Invalidating and reacquiring a fuse_ino_t value would be new FUSE
operations.)

I just got this idea, so maybe it’s silly.  But I suppose it could solve
our problems without being too invasive?

[...]

>> Side note:
>>
>> As for migrating a virtiofsd instance: Note that everything above that
>> depends on host file handles or host ino_t values will make it
>> impossible to migrate to a different filesystem.  But maybe doing that
>> would lead to all kind of other problems anyway.
> 
> If filesystem exported to guests is backed by a common server (e.g.
> NFS) than that should solve the ino_t and file handle persistency
> across migration.

Yes, but that wouldn’t be a different filesystem then.

Thanks a lot for taking the time to read and respond!

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Virtio-fs] Ways to uniquely and persistently identify nodes
  2020-01-14 16:49   ` Max Reitz
@ 2020-01-14 19:24     ` Miklos Szeredi
  2020-01-15  9:37       ` Max Reitz
  2020-01-15 12:01       ` Stefan Hajnoczi
  0 siblings, 2 replies; 25+ messages in thread
From: Miklos Szeredi @ 2020-01-14 19:24 UTC (permalink / raw)
  To: Max Reitz; +Cc: virtio-fs-list

On Tue, Jan 14, 2020 at 6:13 PM Max Reitz <mreitz@redhat.com> wrote:
> What worries me most is how to pass that object around to all FUSE
> functions, and that they all need a new interface.

You mean libfuse API?

> I just had a very fuzzy (and maybe stupid) idea: Maybe we could keep an
> internal vector of currently active handles and then when variable-size
> handles are enabled, fuse_ino_t would just act as an index into that vector?
>
> I suppose we could then use the full-size handles in all messages and
> just hand out temporary indices to existing functions (just so we don’t
> have to change their interface).  Server and client have their own
> vectors, because when they communicate, only the full handles have meaning.
>
> Or we could implement the table on top of the current system by sharing
> it between the client and server.  Whenever the server creates a
> fuse_ino_t value, it then also creates a full-size handle, and returns
> both the handle and its corresponding fuse_ino_t value to the server.
> The server can use the fuse_ino_t normally most of the time, but with a
> catch: The client would be able to invalidate it.  Then the server needs
> to obtain a new fuse_ino_t value for the existing handle.
> (Invalidating and reacquiring a fuse_ino_t value would be new FUSE
> operations.)

I think you may have accidentally switched "server" and "client" in
the above description a couple of times.

But if I'm reading this correctly, your idea is to keep the 64bit
value on the interface (this could be just libfuse or both libfuse and
the kernel API) and add new interfaces to establish and reestablish
the mapping from (non-persistent) fuse_ino_t to (persistent) handle.

Thanks,
Miklos



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Virtio-fs] Ways to uniquely and persistently identify nodes
  2020-01-14 19:24     ` Miklos Szeredi
@ 2020-01-15  9:37       ` Max Reitz
  2020-01-15  9:39         ` Max Reitz
  2020-01-15 10:10         ` Miklos Szeredi
  2020-01-15 12:01       ` Stefan Hajnoczi
  1 sibling, 2 replies; 25+ messages in thread
From: Max Reitz @ 2020-01-15  9:37 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: virtio-fs-list


[-- Attachment #1.1: Type: text/plain, Size: 1964 bytes --]

On 14.01.20 20:24, Miklos Szeredi wrote:
> On Tue, Jan 14, 2020 at 6:13 PM Max Reitz <mreitz@redhat.com> wrote:
>> What worries me most is how to pass that object around to all FUSE
>> functions, and that they all need a new interface.
> 
> You mean libfuse API?

Not only that, but also the FUSE kernel driver.

>> I just had a very fuzzy (and maybe stupid) idea: Maybe we could keep an
>> internal vector of currently active handles and then when variable-size
>> handles are enabled, fuse_ino_t would just act as an index into that vector?
>>
>> I suppose we could then use the full-size handles in all messages and
>> just hand out temporary indices to existing functions (just so we don’t
>> have to change their interface).  Server and client have their own
>> vectors, because when they communicate, only the full handles have meaning.
>>
>> Or we could implement the table on top of the current system by sharing
>> it between the client and server.  Whenever the server creates a
>> fuse_ino_t value, it then also creates a full-size handle, and returns
>> both the handle and its corresponding fuse_ino_t value to the server.
>> The server can use the fuse_ino_t normally most of the time, but with a
>> catch: The client would be able to invalidate it.  Then the server needs
>> to obtain a new fuse_ino_t value for the existing handle.
>> (Invalidating and reacquiring a fuse_ino_t value would be new FUSE
>> operations.)
> 
> I think you may have accidentally switched "server" and "client" in
> the above description a couple of times.

Isn’t the kernel (i.e., the guest) the client and virtiofsd the server?

> But if I'm reading this correctly, your idea is to keep the 64bit
> value on the interface (this could be just libfuse or both libfuse and
> the kernel API) and add new interfaces to establish and reestablish
> the mapping from (non-persistent) fuse_ino_t to (persistent) handle.

Yes.

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Virtio-fs] Ways to uniquely and persistently identify nodes
  2020-01-15  9:37       ` Max Reitz
@ 2020-01-15  9:39         ` Max Reitz
  2020-01-15 10:10         ` Miklos Szeredi
  1 sibling, 0 replies; 25+ messages in thread
From: Max Reitz @ 2020-01-15  9:39 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: virtio-fs-list


[-- Attachment #1.1: Type: text/plain, Size: 1518 bytes --]

On 15.01.20 10:37, Max Reitz wrote:

[...]

>>> I just had a very fuzzy (and maybe stupid) idea: Maybe we could keep an
>>> internal vector of currently active handles and then when variable-size
>>> handles are enabled, fuse_ino_t would just act as an index into that vector?
>>>
>>> I suppose we could then use the full-size handles in all messages and
>>> just hand out temporary indices to existing functions (just so we don’t
>>> have to change their interface).  Server and client have their own
>>> vectors, because when they communicate, only the full handles have meaning.
>>>
>>> Or we could implement the table on top of the current system by sharing
>>> it between the client and server.  Whenever the server creates a
>>> fuse_ino_t value, it then also creates a full-size handle, and returns
>>> both the handle and its corresponding fuse_ino_t value to the server.
>>> The server can use the fuse_ino_t normally most of the time, but with a
>>> catch: The client would be able to invalidate it.  Then the server needs
>>> to obtain a new fuse_ino_t value for the existing handle.
>>> (Invalidating and reacquiring a fuse_ino_t value would be new FUSE
>>> operations.)
>>
>> I think you may have accidentally switched "server" and "client" in
>> the above description a couple of times.
> 
> Isn’t the kernel (i.e., the guest) the client and virtiofsd the server?

Oh, er, yes.  So I started to get it wrong after “returns both the
handle...”.  Sorry. O:-)

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Virtio-fs] Ways to uniquely and persistently identify nodes
  2020-01-15  9:37       ` Max Reitz
  2020-01-15  9:39         ` Max Reitz
@ 2020-01-15 10:10         ` Miklos Szeredi
  2020-01-15 11:51           ` Miklos Szeredi
  1 sibling, 1 reply; 25+ messages in thread
From: Miklos Szeredi @ 2020-01-15 10:10 UTC (permalink / raw)
  To: Max Reitz; +Cc: virtio-fs-list

On Wed, Jan 15, 2020 at 10:37 AM Max Reitz <mreitz@redhat.com> wrote:

> >> Or we could implement the table on top of the current system by sharing
> >> it between the client and server.  Whenever the server creates a
> >> fuse_ino_t value, it then also creates a full-size handle, and returns
> >> both the handle and its corresponding fuse_ino_t value to the server.
> >> The server can use the fuse_ino_t normally most of the time, but with a
> >> catch: The client would be able to invalidate it.  Then the server needs
> >> to obtain a new fuse_ino_t value for the existing handle.
> >> (Invalidating and reacquiring a fuse_ino_t value would be new FUSE
> >> operations.)
> >
> > I think you may have accidentally switched "server" and "client" in
> > the above description a couple of times.
>
> Isn’t the kernel (i.e., the guest) the client and virtiofsd the server?

It is.

So it is the server that creates a temporary fuse_ino_t and sends the
persistent handle and the temporary fuse_ino_t to the client.

The client stores both as opaque objects, and it cannot invalidate
fuse_ino_t (although it could theoretically drop it, then re-request
it based on the stored handle).   The server on the other hand could
invalidate fuse_ino_t and force the client to request a new temporary
fuse_ino_t based on the persistent handle.  This could be achieved, by
returning a special error value in case of an invalid fuse_ino_t.

New operations needed:

LOOKUP_WITH_HANDLE: same as LOOKUP, but returns handle in addition to
fuse_ino_t and attributes.
LOOKUP_BY_HANDLE: same as LOOKUP, but gets a handle as input.

Thanks,
Miklos



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Virtio-fs] Ways to uniquely and persistently identify nodes
  2020-01-13 17:46 [Virtio-fs] Ways to uniquely and persistently identify nodes Max Reitz
  2020-01-14 16:13 ` Miklos Szeredi
@ 2020-01-15 10:12 ` Dr. David Alan Gilbert
  2020-01-15 12:58   ` Max Reitz
  2020-01-15 12:50 ` Stefan Hajnoczi
  2 siblings, 1 reply; 25+ messages in thread
From: Dr. David Alan Gilbert @ 2020-01-15 10:12 UTC (permalink / raw)
  To: Max Reitz; +Cc: virtio-fs

* Max Reitz (mreitz@redhat.com) wrote:
> Hi,
> 
> As discussed in today’s meeting, there is a problem with uniquely and
> persistently identifying nodes in the guest.
> 
> Actually, there are multiple problems:
> 
> (1) stat’s st_ino in the guest is not necessarily unique.  Currently, it
> just the st_ino from the host file, so if you have mounted multiple
> different filesystems in the exported directory tree, you may get
> collisions.
> 
> (2) The FUSE 64-bit fuse_ino_t (which identifies an open file,
> basically) is not persistent.  It is just an index into a vector that
> contains all open inodes, and whenever virtiofsd is restarted, the
> vector is renewed.  That means that whenever this happens, all
> fuse_ino_t values the guest holds will become invalid.  (And when
> virtiofsd starts handing out new fuse_ino_t values, those will probably
> not point to the same inodes as before.)
> 
> (3) name_to_handle_at()/open_by_handle_at() are implemented by FUSE just
> by handing out the fuse_ino_t value as the handle.  This is not only a
> problem as long as fuse_ino_t is not persistent (see (2)), but also in
> general, because the fuse_ino_t value is only valid (per FUSE protocol)
> as long as the inode is referenced by the guest.
> 
> 
> The first question that I think needs to be asked is whether we care
> about each of this points at all.
> 
> (1) Maybe it just doesn’t matter whether the st_ino values are unique.
> 
> (2) Maybe we don’t care about virtiofsd being restarted while the guest
> is running or only paused.  (“Restarting” includes migration of the
> daemon to a different host.)

I prefer to split that answer because I care about migration but not
about restarting the daemon in place.

> (3) I suppose we do care about this.
> 
> 
> Assuming we do care about the points, here are some ways I have
> considered of addressing them:
> 
> (1)
> 
> (a)
> 
> If we could make the 64-bit fuse_ino_t unique and persistent (see (2)),
> we could use that for st_ino (also a 64-bit field).
> 
> (This is the case if we keep the current schema for fuse_ino_t, be it
> because we don’t care about (2) or because we want (2a).)
> 
> (b)
> 
> Otherwise, we probably want to continue passing through st_ino and then
> ensure that stat’s st_dev is unique for each submount in the exported
> tree.  We can achieve that by extending the FUSE protocol for virtiofsd
> to announce submounts and then the FUSE kernel driver to automount them.

This feels nice to me, since the st_dev map should be small.

>  (This means that these submounts in the guest are then technically
> unrelated filesystems.  It also means that the FUSE driver would need to
> automount them with the “virtiofs” fs type, which is kind of weird, and
> restricts this solution to virtiofs.)

Well it has to automount them witht he same type as it's mounted with;
so it's not virtiofs specific.

> (2)
> 
> (a)
> 
> We can keep the current way if we just store the in-memory mapping while
> virtiofsd is suspended (and migrate it it if we want to migrate the
> virtiofsd instance).  The table may grow to be very large, though, and
> it contains for example file descriptors that we would need to migrate,
> too (perhaps as file handles?).

The 'when suspended' worries me - if it's important data to persist then
we probably need to be more careful with it; i.e. keep it sync'd to
disk.  If it's not important long term then do we need to keep it that
long? 
Be wary of migating a large, rapidly changing table.

> (b)
> 
> We could extend the fuse_ino_t type to an arbitrary size basically, to
> be negotiated between FUSE server and client.  This would require
> extensive modification of the FUSE protocol and kernel driver (and would
> ask for respective modification of libfuse, too), though.  Such a larger
> value could then capture both a submount ID and a unique identifier for
> inodes on the respective host filesystems, such as st_ino.  This would
> ensure that every virtiofsd instance would generate the same fuse_ino_t
> values for the same nodes on the same exported tree.
> 
> However, note that this doesn’t auto-populate the fuse_ino_t mappings:
> When after restarting virtiofsd the server wants to access an existing
> inode, it can’t, because there is no good way to translate even larger
> fuse_ino_t values to a file descriptor.  (We could do that if the
> fuse_ino_t value encapsulated a handle.  (As in open_by_handle_at().)
> The problem is that we can’t trust the guest to keep a handle, so we
> must ensure that the handle returned points to a file the guest is
> allowed to access.  Doing that cryptographically (e.g. with a MAC) is
> probably out of the question, because that would make fuse_ino_t really
> big.  Another idea would be to set a flag on the host FS for files that
> the guest has a handle to.  But this flag would need to be
> guest-specific...  So we’d probably again end up with a large database
> just as in (2a).  (It doesn’t need to be a flag on the FS, it could also
> be a database, I suppose.))

I'm no crypto person, but I don't know how to show that's safe.
Some inode numbers are well-known (e.g. on xfs / always seems to be 128
for me).  I'm just worrying that makes it easier for the guest to figure
out the crypto.


> We could also re-enumerate the exported tree after reopening (perhaps
> lazily for each exported filesystem) and thus recreate the mapping.  But
> this would take as much time as a “find” over the whole exported tree.
> 
> (c)
> 
> We could complement the fuse_ino_t value by a context ID, that in our
> case would probably be derived from the submount ID (e.g. the relative
> mount point).  This would only require minor modification of the FUSE
> protocol: Detecting mount points in lookups; a new command to retrieve a
> mount point’s context ID; and some way to set this context ID.
> 
> We could set the context ID either explicitly with a new command; or as
> part of every FUSE request (a field in the request header); or just as
> part of virtio-fs (be it with one virtqueue per context (which may or
> may not be feasible), or just by prefixing every FUSE request on the
> line with a context ID).

One-virtqueue per context doesn't seem feasible to me; we don't know
how many submounts there will be and there could be lots of them.

Dave

> One of the questions here is: If we just choose the context ID to be 32
> or 64 bit in size, will we ever run into the same problem of “96/128
> bits aren’t enough”?
> 
> The other problem is the same as in (2b): We cannot get an FD from a
> context ID + fuse_ino_t alone, so if virtiofsd is restarted, the guest
> cannot keep using existing inodes without reopening them.
> 
> The only way I see here to get around this problem is to re-enumerate
> the whole exported tree (or at least lazily by context ID a.k.a.
> filesystem) and thus reconstruct the mapping from ID to inode after
> resuming virtiofsd.
> 
> 
> (3)
> 
> (a)
> 
> If the fuse_ino_t keeps to be 64 bit and persistent (we don’t reuse IDs
> and we keep existing mappings around even when their refcount drops to 0
> (but why would we do that?)), we don’t have to change anything.
> 
> (b)
> 
> We probably just want new FUSE commands to query handles and open
> handles.  We could then decide whether we want them to use persistent
> IDs that we get from solving (2), or just pass through the handles from
> the host.
> 
> If we do the latter, we have the same problem I mentioned in (2b): We
> can’t trust the guest to keep the handle unmodified, and if it does
> modify it, we have to keep the guest from accessing files it must not see.
> 
> The two ways that have been proposed so far are
> 
> (I) Enrich the host’s file handle by cryptographic information to prove
> its integrity, e.g. a MAC based on a randomly generated
> virtiofsd-internal symmetric key.  The two problems I see are that the
> file handle gets rather big, and that the guest might be able to guess
> the MAC (very improbable, though, especially if we were to terminate the
> connection when the guest tries to use a file handle with an invalid MAC).
> 
> (II) Keep notes somewhere of what file handles the guest may use.  This
> could be implemented by storing a virtiofsd instance ID as metadata in
> the filesystem (attached to the file, so virtiofsd can read it when
> opening the file by its handle); or as a database for each virtiofsd
> instance (where it puts all handles handed out to a guest).
> 
> 
> 
> You can see that all of these problems are kind of intertwined, but not
> really.  Many solutions look similar, and some solutions can solve
> multiple problems; but it doesn’t mean we have to think about everything
> at once.  I think we should first think about how to handle the
> identification problem (1/2).  Maybe there isn’t much to do there anyway
> because we don’t care about it and can just use the existing fuse_ino_t
> as st_ino for the guest.
> 
> Then we can think about (3).  If we decide to add new FUSE commands for
> getting/opening file handles, then this works with pretty much every way
> we go about (1) and (2).
> 
> 
> 
> Side note:
> 
> As for migrating a virtiofsd instance: Note that everything above that
> depends on host file handles or host ino_t values will make it
> impossible to migrate to a different filesystem.  But maybe doing that
> would lead to all kind of other problems anyway.
> 
> 
> Another note:
> 
> It took rather long to write this, so I probably forgot a whole bunch of
> stuff...
> 
> Max
> 




> _______________________________________________
> Virtio-fs mailing list
> Virtio-fs@redhat.com
> https://www.redhat.com/mailman/listinfo/virtio-fs

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Virtio-fs] Ways to uniquely and persistently identify nodes
  2020-01-15 10:10         ` Miklos Szeredi
@ 2020-01-15 11:51           ` Miklos Szeredi
  2020-01-15 12:51             ` Max Reitz
  0 siblings, 1 reply; 25+ messages in thread
From: Miklos Szeredi @ 2020-01-15 11:51 UTC (permalink / raw)
  To: Max Reitz; +Cc: virtio-fs-list

On Wed, Jan 15, 2020 at 11:10 AM Miklos Szeredi <mszeredi@redhat.com> wrote:

> New operations needed:
>
> LOOKUP_WITH_HANDLE: same as LOOKUP, but returns handle in addition to
> fuse_ino_t and attributes.
> LOOKUP_BY_HANDLE: same as LOOKUP, but gets a handle as input.

And that can be reduced further to a single extended LOOKUP_HANDLE,
which takes a handle as input and returns a handle in addition to the
usual stuff.  The lookup-by-handle case can be done with a special
name.  Currently "." is used for this purpose in the fuse protocol,
which is a bit confusing since it has nothing to do the "." directory
entry, but at least we don't have to introduce a new concept for this.

Thanks,
Miklos



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Virtio-fs] Ways to uniquely and persistently identify nodes
  2020-01-14 19:24     ` Miklos Szeredi
  2020-01-15  9:37       ` Max Reitz
@ 2020-01-15 12:01       ` Stefan Hajnoczi
  2020-01-15 13:00         ` Max Reitz
  1 sibling, 1 reply; 25+ messages in thread
From: Stefan Hajnoczi @ 2020-01-15 12:01 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: virtio-fs-list

[-- Attachment #1: Type: text/plain, Size: 2560 bytes --]

On Tue, Jan 14, 2020 at 08:24:51PM +0100, Miklos Szeredi wrote:
> On Tue, Jan 14, 2020 at 6:13 PM Max Reitz <mreitz@redhat.com> wrote:
> > What worries me most is how to pass that object around to all FUSE
> > functions, and that they all need a new interface.
> 
> You mean libfuse API?
> 
> > I just had a very fuzzy (and maybe stupid) idea: Maybe we could keep an
> > internal vector of currently active handles and then when variable-size
> > handles are enabled, fuse_ino_t would just act as an index into that vector?
> >
> > I suppose we could then use the full-size handles in all messages and
> > just hand out temporary indices to existing functions (just so we don’t
> > have to change their interface).  Server and client have their own
> > vectors, because when they communicate, only the full handles have meaning.
> >
> > Or we could implement the table on top of the current system by sharing
> > it between the client and server.  Whenever the server creates a
> > fuse_ino_t value, it then also creates a full-size handle, and returns
> > both the handle and its corresponding fuse_ino_t value to the server.
> > The server can use the fuse_ino_t normally most of the time, but with a
> > catch: The client would be able to invalidate it.  Then the server needs
> > to obtain a new fuse_ino_t value for the existing handle.
> > (Invalidating and reacquiring a fuse_ino_t value would be new FUSE
> > operations.)
> 
> I think you may have accidentally switched "server" and "client" in
> the above description a couple of times.
> 
> But if I'm reading this correctly, your idea is to keep the 64bit
> value on the interface (this could be just libfuse or both libfuse and
> the kernel API) and add new interfaces to establish and reestablish
> the mapping from (non-persistent) fuse_ino_t to (persistent) handle.

I'm not sure I understand Max's idea.

Miklos, I think you're saying keep fuse_ino_t semantics unchanged (not
persistent, can be reused after the last reference is dropped) and add a
separate struct export_operations-style FUSE API to map fuse_ino_t <->
struct file_handle.

We'd need to use submounts to avoid st_ino collisions in that case.

I like this approach because it requires no persistent state or extra
in-memory data structures (if struct file_handle is cryptographically
signed).

My biggest concern is that submounts may be complex to implement or
introduce problems (e.g. users see the submounts and get EBUSY when
unmounting the original directory).

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Virtio-fs] Ways to uniquely and persistently identify nodes
  2020-01-13 17:46 [Virtio-fs] Ways to uniquely and persistently identify nodes Max Reitz
  2020-01-14 16:13 ` Miklos Szeredi
  2020-01-15 10:12 ` Dr. David Alan Gilbert
@ 2020-01-15 12:50 ` Stefan Hajnoczi
  2020-01-15 14:21   ` Max Reitz
  2 siblings, 1 reply; 25+ messages in thread
From: Stefan Hajnoczi @ 2020-01-15 12:50 UTC (permalink / raw)
  To: Max Reitz; +Cc: virtio-fs

[-- Attachment #1: Type: text/plain, Size: 2105 bytes --]

On Mon, Jan 13, 2020 at 06:46:49PM +0100, Max Reitz wrote:
> (1) stat’s st_ino in the guest is not necessarily unique.  Currently, it
> just the st_ino from the host file, so if you have mounted multiple
> different filesystems in the exported directory tree, you may get
> collisions.

This must be fixed because POSIX file systems have unique and persistent
st_ino so that stat(2) can distinguish inodes from each other.

> (2) The FUSE 64-bit fuse_ino_t (which identifies an open file,
> basically) is not persistent.  It is just an index into a vector that
> contains all open inodes, and whenever virtiofsd is restarted, the
> vector is renewed.  That means that whenever this happens, all
> fuse_ino_t values the guest holds will become invalid.  (And when
> virtiofsd starts handing out new fuse_ino_t values, those will probably
> not point to the same inodes as before.)

fuse_ino_t only needs to be persistent when the FUSE_EXPORT_SUPPORT
protocol feature is enabled.  It means that nodeids persist even when
the last reference is dropped.  (If you implement an alternative
open_by_handle_at(2) approach then it would extend the semantics.)

However, it's more challenging to support power management freeze/thaw
(not a high priority since no one seems to use it) when fuse_ino_t is
not persistent.  The driver would have to do something during freeze
and/or thaw to be able to restore the FUSE session state (fuse_ino_t,
fh, etc).  If we end up not being able to implement freeze/that, I think
that would be acceptable.

> (3) name_to_handle_at()/open_by_handle_at() are implemented by FUSE just
> by handing out the fuse_ino_t value as the handle.  This is not only a
> problem as long as fuse_ino_t is not persistent (see (2)), but also in
> general, because the fuse_ino_t value is only valid (per FUSE protocol)
> as long as the inode is referenced by the guest.

open_by_handle_at() will be needed sooner or later.  For example, users
might wish to export a virtio-fs file system via an NFS server running
inside the guest.  We need this.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Virtio-fs] Ways to uniquely and persistently identify nodes
  2020-01-15 11:51           ` Miklos Szeredi
@ 2020-01-15 12:51             ` Max Reitz
  2020-01-15 14:58               ` Miklos Szeredi
  0 siblings, 1 reply; 25+ messages in thread
From: Max Reitz @ 2020-01-15 12:51 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: virtio-fs-list


[-- Attachment #1.1: Type: text/plain, Size: 925 bytes --]

On 15.01.20 12:51, Miklos Szeredi wrote:
> On Wed, Jan 15, 2020 at 11:10 AM Miklos Szeredi <mszeredi@redhat.com> wrote:
> 
>> New operations needed:
>>
>> LOOKUP_WITH_HANDLE: same as LOOKUP, but returns handle in addition to
>> fuse_ino_t and attributes.
>> LOOKUP_BY_HANDLE: same as LOOKUP, but gets a handle as input.
> 
> And that can be reduced further to a single extended LOOKUP_HANDLE,
> which takes a handle as input and returns a handle in addition to the
> usual stuff.  The lookup-by-handle case can be done with a special
> name.  Currently "." is used for this purpose in the fuse protocol,
> which is a bit confusing since it has nothing to do the "." directory
> entry, but at least we don't have to introduce a new concept for this.

I’m afraid I don’t quite understand.  What handle would LOOKUP_HANDLE
take as input when I want to open a new file (as in, LOOKUP_WITH_HANDLE)?

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Virtio-fs] Ways to uniquely and persistently identify nodes
  2020-01-15 10:12 ` Dr. David Alan Gilbert
@ 2020-01-15 12:58   ` Max Reitz
  0 siblings, 0 replies; 25+ messages in thread
From: Max Reitz @ 2020-01-15 12:58 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: virtio-fs


[-- Attachment #1.1: Type: text/plain, Size: 7645 bytes --]

On 15.01.20 11:12, Dr. David Alan Gilbert wrote:
> * Max Reitz (mreitz@redhat.com) wrote:
>> Hi,
>>
>> As discussed in today’s meeting, there is a problem with uniquely and
>> persistently identifying nodes in the guest.
>>
>> Actually, there are multiple problems:
>>
>> (1) stat’s st_ino in the guest is not necessarily unique.  Currently, it
>> just the st_ino from the host file, so if you have mounted multiple
>> different filesystems in the exported directory tree, you may get
>> collisions.
>>
>> (2) The FUSE 64-bit fuse_ino_t (which identifies an open file,
>> basically) is not persistent.  It is just an index into a vector that
>> contains all open inodes, and whenever virtiofsd is restarted, the
>> vector is renewed.  That means that whenever this happens, all
>> fuse_ino_t values the guest holds will become invalid.  (And when
>> virtiofsd starts handing out new fuse_ino_t values, those will probably
>> not point to the same inodes as before.)
>>
>> (3) name_to_handle_at()/open_by_handle_at() are implemented by FUSE just
>> by handing out the fuse_ino_t value as the handle.  This is not only a
>> problem as long as fuse_ino_t is not persistent (see (2)), but also in
>> general, because the fuse_ino_t value is only valid (per FUSE protocol)
>> as long as the inode is referenced by the guest.
>>
>>
>> The first question that I think needs to be asked is whether we care
>> about each of this points at all.
>>
>> (1) Maybe it just doesn’t matter whether the st_ino values are unique.
>>
>> (2) Maybe we don’t care about virtiofsd being restarted while the guest
>> is running or only paused.  (“Restarting” includes migration of the
>> daemon to a different host.)
> 
> I prefer to split that answer because I care about migration but not
> about restarting the daemon in place.
> 
>> (3) I suppose we do care about this.
>>
>>
>> Assuming we do care about the points, here are some ways I have
>> considered of addressing them:
>>
>> (1)
>>
>> (a)
>>
>> If we could make the 64-bit fuse_ino_t unique and persistent (see (2)),
>> we could use that for st_ino (also a 64-bit field).
>>
>> (This is the case if we keep the current schema for fuse_ino_t, be it
>> because we don’t care about (2) or because we want (2a).)
>>
>> (b)
>>
>> Otherwise, we probably want to continue passing through st_ino and then
>> ensure that stat’s st_dev is unique for each submount in the exported
>> tree.  We can achieve that by extending the FUSE protocol for virtiofsd
>> to announce submounts and then the FUSE kernel driver to automount them.
> 
> This feels nice to me, since the st_dev map should be small.
> 
>>  (This means that these submounts in the guest are then technically
>> unrelated filesystems.  It also means that the FUSE driver would need to
>> automount them with the “virtiofs” fs type, which is kind of weird, and
>> restricts this solution to virtiofs.)
> 
> Well it has to automount them witht he same type as it's mounted with;
> so it's not virtiofs specific.

Right.  But we need some way to tell the filesystem the context for the
submount.  I’m not sure whether that’s possible in an fs-agnostic way,
because the only information we can reasonably pass goes to
vfs_submount()’s @name parameter.  I think?  Or maybe we can make @data
point to a FUSE-common structure that every FUSE fs with submounts would
then need to be able to parse?  But I’m not sure how that would even
translate to userspace.

>> (2)
>>
>> (a)
>>
>> We can keep the current way if we just store the in-memory mapping while
>> virtiofsd is suspended (and migrate it it if we want to migrate the
>> virtiofsd instance).  The table may grow to be very large, though, and
>> it contains for example file descriptors that we would need to migrate,
>> too (perhaps as file handles?).
> 
> The 'when suspended' worries me - if it's important data to persist then
> we probably need to be more careful with it; i.e. keep it sync'd to
> disk.  If it's not important long term then do we need to keep it that
> long? 
> Be wary of migating a large, rapidly changing table.
> 
>> (b)
>>
>> We could extend the fuse_ino_t type to an arbitrary size basically, to
>> be negotiated between FUSE server and client.  This would require
>> extensive modification of the FUSE protocol and kernel driver (and would
>> ask for respective modification of libfuse, too), though.  Such a larger
>> value could then capture both a submount ID and a unique identifier for
>> inodes on the respective host filesystems, such as st_ino.  This would
>> ensure that every virtiofsd instance would generate the same fuse_ino_t
>> values for the same nodes on the same exported tree.
>>
>> However, note that this doesn’t auto-populate the fuse_ino_t mappings:
>> When after restarting virtiofsd the server wants to access an existing
>> inode, it can’t, because there is no good way to translate even larger
>> fuse_ino_t values to a file descriptor.  (We could do that if the
>> fuse_ino_t value encapsulated a handle.  (As in open_by_handle_at().)
>> The problem is that we can’t trust the guest to keep a handle, so we
>> must ensure that the handle returned points to a file the guest is
>> allowed to access.  Doing that cryptographically (e.g. with a MAC) is
>> probably out of the question, because that would make fuse_ino_t really
>> big.  Another idea would be to set a flag on the host FS for files that
>> the guest has a handle to.  But this flag would need to be
>> guest-specific...  So we’d probably again end up with a large database
>> just as in (2a).  (It doesn’t need to be a flag on the FS, it could also
>> be a database, I suppose.))
> 
> I'm no crypto person, but I don't know how to show that's safe.
> Some inode numbers are well-known (e.g. on xfs / always seems to be 128
> for me).  I'm just worrying that makes it easier for the guest to figure
> out the crypto.

Well, with MACs the adversary sees the data and the MAC together anyway,
so they’re designed in such a way that you can’t infer the key from that
information.

And of course they’re designed such that without a key, you can’t
generate a valid MAC for given data.

IOW, that’s the precise reason we’d use a MAC: So the guest can’t just
return a well-known handle to get access to files it shouldn’t be able
to access.

>> We could also re-enumerate the exported tree after reopening (perhaps
>> lazily for each exported filesystem) and thus recreate the mapping.  But
>> this would take as much time as a “find” over the whole exported tree.
>>
>> (c)
>>
>> We could complement the fuse_ino_t value by a context ID, that in our
>> case would probably be derived from the submount ID (e.g. the relative
>> mount point).  This would only require minor modification of the FUSE
>> protocol: Detecting mount points in lookups; a new command to retrieve a
>> mount point’s context ID; and some way to set this context ID.
>>
>> We could set the context ID either explicitly with a new command; or as
>> part of every FUSE request (a field in the request header); or just as
>> part of virtio-fs (be it with one virtqueue per context (which may or
>> may not be feasible), or just by prefixing every FUSE request on the
>> line with a context ID).
> 
> One-virtqueue per context doesn't seem feasible to me; we don't know
> how many submounts there will be and there could be lots of them.

OK, good, so I don’t have to worry about it. :-)

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Virtio-fs] Ways to uniquely and persistently identify nodes
  2020-01-15 12:01       ` Stefan Hajnoczi
@ 2020-01-15 13:00         ` Max Reitz
  2020-01-16 20:32           ` Vivek Goyal
  0 siblings, 1 reply; 25+ messages in thread
From: Max Reitz @ 2020-01-15 13:00 UTC (permalink / raw)
  To: Stefan Hajnoczi, Miklos Szeredi; +Cc: virtio-fs-list


[-- Attachment #1.1: Type: text/plain, Size: 3098 bytes --]

On 15.01.20 13:01, Stefan Hajnoczi wrote:
> On Tue, Jan 14, 2020 at 08:24:51PM +0100, Miklos Szeredi wrote:
>> On Tue, Jan 14, 2020 at 6:13 PM Max Reitz <mreitz@redhat.com> wrote:
>>> What worries me most is how to pass that object around to all FUSE
>>> functions, and that they all need a new interface.
>>
>> You mean libfuse API?
>>
>>> I just had a very fuzzy (and maybe stupid) idea: Maybe we could keep an
>>> internal vector of currently active handles and then when variable-size
>>> handles are enabled, fuse_ino_t would just act as an index into that vector?
>>>
>>> I suppose we could then use the full-size handles in all messages and
>>> just hand out temporary indices to existing functions (just so we don’t
>>> have to change their interface).  Server and client have their own
>>> vectors, because when they communicate, only the full handles have meaning.
>>>
>>> Or we could implement the table on top of the current system by sharing
>>> it between the client and server.  Whenever the server creates a
>>> fuse_ino_t value, it then also creates a full-size handle, and returns
>>> both the handle and its corresponding fuse_ino_t value to the server.
>>> The server can use the fuse_ino_t normally most of the time, but with a
>>> catch: The client would be able to invalidate it.  Then the server needs
>>> to obtain a new fuse_ino_t value for the existing handle.
>>> (Invalidating and reacquiring a fuse_ino_t value would be new FUSE
>>> operations.)
>>
>> I think you may have accidentally switched "server" and "client" in
>> the above description a couple of times.
>>
>> But if I'm reading this correctly, your idea is to keep the 64bit
>> value on the interface (this could be just libfuse or both libfuse and
>> the kernel API) and add new interfaces to establish and reestablish
>> the mapping from (non-persistent) fuse_ino_t to (persistent) handle.
> 
> I'm not sure I understand Max's idea.
> 
> Miklos, I think you're saying keep fuse_ino_t semantics unchanged (not
> persistent, can be reused after the last reference is dropped) and add a
> separate struct export_operations-style FUSE API to map fuse_ino_t <->
> struct file_handle.

Yes, that’d be the idea.

> We'd need to use submounts to avoid st_ino collisions in that case.

Yes, because that idea wouldn’t solve the st_ino problem.  (Only
persistent fuse_ino_t values would solve it.  This proposal is exactly
the opposite, we want them to be absolutely not persistent.)

So we’d still want submounts with separate st_devs and pass through
st_ino from the host.

> I like this approach because it requires no persistent state or extra
> in-memory data structures (if struct file_handle is cryptographically
> signed).
> 
> My biggest concern is that submounts may be complex to implement or
> introduce problems (e.g. users see the submounts and get EBUSY when
> unmounting the original directory).

I’m not quite sure what you mean, but as far as I saw so far you can
auto-unmount submounts when the root is unmounted.

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Virtio-fs] Ways to uniquely and persistently identify nodes
  2020-01-15 12:50 ` Stefan Hajnoczi
@ 2020-01-15 14:21   ` Max Reitz
  0 siblings, 0 replies; 25+ messages in thread
From: Max Reitz @ 2020-01-15 14:21 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-fs


[-- Attachment #1.1: Type: text/plain, Size: 3301 bytes --]

On 15.01.20 13:50, Stefan Hajnoczi wrote:
> On Mon, Jan 13, 2020 at 06:46:49PM +0100, Max Reitz wrote:
>> (1) stat’s st_ino in the guest is not necessarily unique.  Currently, it
>> just the st_ino from the host file, so if you have mounted multiple
>> different filesystems in the exported directory tree, you may get
>> collisions.
> 
> This must be fixed because POSIX file systems have unique and persistent
> st_ino so that stat(2) can distinguish inodes from each other.

OK.

>> (2) The FUSE 64-bit fuse_ino_t (which identifies an open file,
>> basically) is not persistent.  It is just an index into a vector that
>> contains all open inodes, and whenever virtiofsd is restarted, the
>> vector is renewed.  That means that whenever this happens, all
>> fuse_ino_t values the guest holds will become invalid.  (And when
>> virtiofsd starts handing out new fuse_ino_t values, those will probably
>> not point to the same inodes as before.)
> 
> fuse_ino_t only needs to be persistent when the FUSE_EXPORT_SUPPORT
> protocol feature is enabled.  It means that nodeids persist even when
> the last reference is dropped.  (If you implement an alternative
> open_by_handle_at(2) approach then it would extend the semantics.)

It currently looks like we want them even less persistent, i.e. allow
the server to make the client refresh fuse_ino_t values, right?

Also, when I say “persistence“, I mostly mean “persistence through
multiple virtiofsd sessions”.  The current state is that a new session
doesn’t know what to do with the old values at all, so it’s all just broken.

(It wouldn’t be broken if we could tell the client that this is so and
make the client refresh all its fuse_ino_t values.)

> However, it's more challenging to support power management freeze/thaw
> (not a high priority since no one seems to use it) when fuse_ino_t is
> not persistent.  The driver would have to do something during freeze
> and/or thaw to be able to restore the FUSE session state (fuse_ino_t,
> fh, etc).  If we end up not being able to implement freeze/that, I think
> that would be acceptable.

I suppose we don’t need fuse_ino_t specifically to be persistent; it
would suffice if the client could get a persistent value (the handle)
from which to get new fuse_ino_t values when they are invalidated.

>> (3) name_to_handle_at()/open_by_handle_at() are implemented by FUSE just
>> by handing out the fuse_ino_t value as the handle.  This is not only a
>> problem as long as fuse_ino_t is not persistent (see (2)), but also in
>> general, because the fuse_ino_t value is only valid (per FUSE protocol)
>> as long as the inode is referenced by the guest.
> 
> open_by_handle_at() will be needed sooner or later.  For example, users
> might wish to export a virtio-fs file system via an NFS server running
> inside the guest.  We need this.

Sure.  But I think we should solve the first two points first because
however we do it will probably affect how we want to address this problem.

(For example, if we do decide to have handles always shared between
server and client and make fuse_ino_t a transient value that the server
can force-invalidate, then the *_handle_at() functions will be trivial
to implement.)

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Virtio-fs] Ways to uniquely and persistently identify nodes
  2020-01-15 12:51             ` Max Reitz
@ 2020-01-15 14:58               ` Miklos Szeredi
  2020-01-15 15:05                 ` Miklos Szeredi
  2020-01-15 15:12                 ` Max Reitz
  0 siblings, 2 replies; 25+ messages in thread
From: Miklos Szeredi @ 2020-01-15 14:58 UTC (permalink / raw)
  To: Max Reitz; +Cc: virtio-fs-list

On Wed, Jan 15, 2020 at 1:51 PM Max Reitz <mreitz@redhat.com> wrote:
>
> On 15.01.20 12:51, Miklos Szeredi wrote:
> > On Wed, Jan 15, 2020 at 11:10 AM Miklos Szeredi <mszeredi@redhat.com> wrote:
> >
> >> New operations needed:
> >>
> >> LOOKUP_WITH_HANDLE: same as LOOKUP, but returns handle in addition to
> >> fuse_ino_t and attributes.
> >> LOOKUP_BY_HANDLE: same as LOOKUP, but gets a handle as input.
> >
> > And that can be reduced further to a single extended LOOKUP_HANDLE,
> > which takes a handle as input and returns a handle in addition to the
> > usual stuff.  The lookup-by-handle case can be done with a special
> > name.  Currently "." is used for this purpose in the fuse protocol,
> > which is a bit confusing since it has nothing to do the "." directory
> > entry, but at least we don't have to introduce a new concept for this.
>
> I’m afraid I don’t quite understand.  What handle would LOOKUP_HANDLE
> take as input when I want to open a new file (as in, LOOKUP_WITH_HANDLE)?

For example open("/foo/bar", ...) would do:

lookup_handle($ROOT_HANDLE, "foo") -> { FOO_HANDLE, FOO_NODEID }
lookup_handle($FOO_HANDLE, "bar") -> { BAR_HANDLE, BAR_NODEID }
open($BAR_NODEID, ...)

Thanks,
Miklos



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Virtio-fs] Ways to uniquely and persistently identify nodes
  2020-01-15 14:58               ` Miklos Szeredi
@ 2020-01-15 15:05                 ` Miklos Szeredi
  2020-01-15 15:19                   ` Max Reitz
  2020-01-15 15:12                 ` Max Reitz
  1 sibling, 1 reply; 25+ messages in thread
From: Miklos Szeredi @ 2020-01-15 15:05 UTC (permalink / raw)
  To: Max Reitz; +Cc: virtio-fs-list

On Wed, Jan 15, 2020 at 3:58 PM Miklos Szeredi <mszeredi@redhat.com> wrote:
>
> On Wed, Jan 15, 2020 at 1:51 PM Max Reitz <mreitz@redhat.com> wrote:
> >
> > On 15.01.20 12:51, Miklos Szeredi wrote:
> > > On Wed, Jan 15, 2020 at 11:10 AM Miklos Szeredi <mszeredi@redhat.com> wrote:
> > >
> > >> New operations needed:
> > >>
> > >> LOOKUP_WITH_HANDLE: same as LOOKUP, but returns handle in addition to
> > >> fuse_ino_t and attributes.
> > >> LOOKUP_BY_HANDLE: same as LOOKUP, but gets a handle as input.
> > >
> > > And that can be reduced further to a single extended LOOKUP_HANDLE,
> > > which takes a handle as input and returns a handle in addition to the
> > > usual stuff.  The lookup-by-handle case can be done with a special
> > > name.  Currently "." is used for this purpose in the fuse protocol,
> > > which is a bit confusing since it has nothing to do the "." directory
> > > entry, but at least we don't have to introduce a new concept for this.
> >
> > I’m afraid I don’t quite understand.  What handle would LOOKUP_HANDLE
> > take as input when I want to open a new file (as in, LOOKUP_WITH_HANDLE)?
>
> For example open("/foo/bar", ...) would do:
>
> lookup_handle($ROOT_HANDLE, "foo") -> { FOO_HANDLE, FOO_NODEID }
> lookup_handle($FOO_HANDLE, "bar") -> { BAR_HANDLE, BAR_NODEID }
> open($BAR_NODEID, ...)

Note also that this scheme makes fuse_ino_t values non-reusable,
otherwise there could be messy conflicts with client using an old
value that the server interprets as a new value.

But with the 64bit fuse_ino_t space wraparound shouldn't be an issue.

Thanks,
Miklos



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Virtio-fs] Ways to uniquely and persistently identify nodes
  2020-01-15 14:58               ` Miklos Szeredi
  2020-01-15 15:05                 ` Miklos Szeredi
@ 2020-01-15 15:12                 ` Max Reitz
  2020-01-15 15:46                   ` Miklos Szeredi
  1 sibling, 1 reply; 25+ messages in thread
From: Max Reitz @ 2020-01-15 15:12 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: virtio-fs-list


[-- Attachment #1.1: Type: text/plain, Size: 2370 bytes --]

On 15.01.20 15:58, Miklos Szeredi wrote:
> On Wed, Jan 15, 2020 at 1:51 PM Max Reitz <mreitz@redhat.com> wrote:
>>
>> On 15.01.20 12:51, Miklos Szeredi wrote:
>>> On Wed, Jan 15, 2020 at 11:10 AM Miklos Szeredi <mszeredi@redhat.com> wrote:
>>>
>>>> New operations needed:
>>>>
>>>> LOOKUP_WITH_HANDLE: same as LOOKUP, but returns handle in addition to
>>>> fuse_ino_t and attributes.
>>>> LOOKUP_BY_HANDLE: same as LOOKUP, but gets a handle as input.
>>>
>>> And that can be reduced further to a single extended LOOKUP_HANDLE,
>>> which takes a handle as input and returns a handle in addition to the
>>> usual stuff.  The lookup-by-handle case can be done with a special
>>> name.  Currently "." is used for this purpose in the fuse protocol,
>>> which is a bit confusing since it has nothing to do the "." directory
>>> entry, but at least we don't have to introduce a new concept for this.
>>
>> I’m afraid I don’t quite understand.  What handle would LOOKUP_HANDLE
>> take as input when I want to open a new file (as in, LOOKUP_WITH_HANDLE)?
> 
> For example open("/foo/bar", ...) would do:
> 
> lookup_handle($ROOT_HANDLE, "foo") -> { FOO_HANDLE, FOO_NODEID }
> lookup_handle($FOO_HANDLE, "bar") -> { BAR_HANDLE, BAR_NODEID }
> open($BAR_NODEID, ...)

OK.  It seems a bit weird to me because if all other commands used a
fuse_ino_t to identify nodes, I think lookup should be no different.

Sure, we’d only need one new command instead of two, but naively just
adding two commands doesn’t seem fundamentally more difficult to me.

Specifically, what seems a bit problematic to me is that for virtiofsd
we’d probably want fuse_ino_t to stay as it is: An index into a vector.
 We probably don’t want to store the handle in virtiofsd at all.

So whenever a handle comes in, we’d first need to open it with
open_by_handle_at().  That’s OK if handles only come in through
lookup_by_handle(), i.e. when there is no valid fuse_ino_t value anyway.
 But it seems a bit cumbersome if all lookups are based on handles.

OTOH, we could keep the handles around, but that would probably require
a hash map with the handle as a key (so we can let lookup() find the
existing fd without having to open a new one).  Doable, sure, but I
wonder whether it wouldn’t be just easier to have two separate commands.

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Virtio-fs] Ways to uniquely and persistently identify nodes
  2020-01-15 15:05                 ` Miklos Szeredi
@ 2020-01-15 15:19                   ` Max Reitz
  2020-01-15 16:01                     ` Miklos Szeredi
  0 siblings, 1 reply; 25+ messages in thread
From: Max Reitz @ 2020-01-15 15:19 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: virtio-fs-list


[-- Attachment #1.1: Type: text/plain, Size: 2414 bytes --]

On 15.01.20 16:05, Miklos Szeredi wrote:
> On Wed, Jan 15, 2020 at 3:58 PM Miklos Szeredi <mszeredi@redhat.com> wrote:
>>
>> On Wed, Jan 15, 2020 at 1:51 PM Max Reitz <mreitz@redhat.com> wrote:
>>>
>>> On 15.01.20 12:51, Miklos Szeredi wrote:
>>>> On Wed, Jan 15, 2020 at 11:10 AM Miklos Szeredi <mszeredi@redhat.com> wrote:
>>>>
>>>>> New operations needed:
>>>>>
>>>>> LOOKUP_WITH_HANDLE: same as LOOKUP, but returns handle in addition to
>>>>> fuse_ino_t and attributes.
>>>>> LOOKUP_BY_HANDLE: same as LOOKUP, but gets a handle as input.
>>>>
>>>> And that can be reduced further to a single extended LOOKUP_HANDLE,
>>>> which takes a handle as input and returns a handle in addition to the
>>>> usual stuff.  The lookup-by-handle case can be done with a special
>>>> name.  Currently "." is used for this purpose in the fuse protocol,
>>>> which is a bit confusing since it has nothing to do the "." directory
>>>> entry, but at least we don't have to introduce a new concept for this.
>>>
>>> I’m afraid I don’t quite understand.  What handle would LOOKUP_HANDLE
>>> take as input when I want to open a new file (as in, LOOKUP_WITH_HANDLE)?
>>
>> For example open("/foo/bar", ...) would do:
>>
>> lookup_handle($ROOT_HANDLE, "foo") -> { FOO_HANDLE, FOO_NODEID }
>> lookup_handle($FOO_HANDLE, "bar") -> { BAR_HANDLE, BAR_NODEID }
>> open($BAR_NODEID, ...)
> 
> Note also that this scheme makes fuse_ino_t values non-reusable,
> otherwise there could be messy conflicts with client using an old
> value that the server interprets as a new value.

But wouldn’t it just be against the protocol to use a dropped or
invalidated fuse_ino_t value?

> But with the 64bit fuse_ino_t space wraparound shouldn't be an issue.

I don’t really see the problem with the current solution of having
fuse_ino_t be an index into a vector, and old values being reused.  As
far as I understand, there is a protocol for refcount fuse_ino_t values
and the clients should not reuse values whose refcount has gone to zero.

(And we’d add to that protocol by adding a way for the server to make
the client invalidate an existing fuse_ino_t value and request a new
one.  I suppose we’d do that just by responding e.g. EINVAL if a
fuse_ino_t value is used that the server doesn’t recognize, and then the
client has to invoke lookup_(by_)handle to get a new value.)

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Virtio-fs] Ways to uniquely and persistently identify nodes
  2020-01-15 15:12                 ` Max Reitz
@ 2020-01-15 15:46                   ` Miklos Szeredi
  2020-01-15 15:52                     ` Max Reitz
  0 siblings, 1 reply; 25+ messages in thread
From: Miklos Szeredi @ 2020-01-15 15:46 UTC (permalink / raw)
  To: Max Reitz; +Cc: virtio-fs-list

On Wed, Jan 15, 2020 at 4:23 PM Max Reitz <mreitz@redhat.com> wrote:
>
> On 15.01.20 15:58, Miklos Szeredi wrote:
> > On Wed, Jan 15, 2020 at 1:51 PM Max Reitz <mreitz@redhat.com> wrote:
> >>
> >> On 15.01.20 12:51, Miklos Szeredi wrote:
> >>> On Wed, Jan 15, 2020 at 11:10 AM Miklos Szeredi <mszeredi@redhat.com> wrote:
> >>>
> >>>> New operations needed:
> >>>>
> >>>> LOOKUP_WITH_HANDLE: same as LOOKUP, but returns handle in addition to
> >>>> fuse_ino_t and attributes.
> >>>> LOOKUP_BY_HANDLE: same as LOOKUP, but gets a handle as input.
> >>>
> >>> And that can be reduced further to a single extended LOOKUP_HANDLE,
> >>> which takes a handle as input and returns a handle in addition to the
> >>> usual stuff.  The lookup-by-handle case can be done with a special
> >>> name.  Currently "." is used for this purpose in the fuse protocol,
> >>> which is a bit confusing since it has nothing to do the "." directory
> >>> entry, but at least we don't have to introduce a new concept for this.
> >>
> >> I’m afraid I don’t quite understand.  What handle would LOOKUP_HANDLE
> >> take as input when I want to open a new file (as in, LOOKUP_WITH_HANDLE)?
> >
> > For example open("/foo/bar", ...) would do:
> >
> > lookup_handle($ROOT_HANDLE, "foo") -> { FOO_HANDLE, FOO_NODEID }
> > lookup_handle($FOO_HANDLE, "bar") -> { BAR_HANDLE, BAR_NODEID }
> > open($BAR_NODEID, ...)
>
> OK.  It seems a bit weird to me because if all other commands used a
> fuse_ino_t to identify nodes, I think lookup should be no different.
>
> Sure, we’d only need one new command instead of two, but naively just
> adding two commands doesn’t seem fundamentally more difficult to me.
>
> Specifically, what seems a bit problematic to me is that for virtiofsd
> we’d probably want fuse_ino_t to stay as it is: An index into a vector.
>  We probably don’t want to store the handle in virtiofsd at all.
>
> So whenever a handle comes in, we’d first need to open it with
> open_by_handle_at().  That’s OK if handles only come in through
> lookup_by_handle(), i.e. when there is no valid fuse_ino_t value anyway.
>  But it seems a bit cumbersome if all lookups are based on handles.
>
> OTOH, we could keep the handles around, but that would probably require
> a hash map with the handle as a key (so we can let lookup() find the
> existing fd without having to open a new one).  Doable, sure, but I
> wonder whether it wouldn’t be just easier to have two separate commands.

So, let's modify that to

> > lookup_handle($ROOT_HANDLE, $ROOT_NODEID, "foo") -> { FOO_HANDLE, FOO_NODEID }
> > lookup_handle($FOO_HANDLE, $FOO_NODEID, "bar") -> { BAR_HANDLE, BAR_NODEID }
> > open($BAR_NODEID, ...)

I think that fixes your concern.  Implementation is free to use either
the nodeid or the handle.

Thanks,
Miklos



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Virtio-fs] Ways to uniquely and persistently identify nodes
  2020-01-15 15:46                   ` Miklos Szeredi
@ 2020-01-15 15:52                     ` Max Reitz
  0 siblings, 0 replies; 25+ messages in thread
From: Max Reitz @ 2020-01-15 15:52 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: virtio-fs-list


[-- Attachment #1.1: Type: text/plain, Size: 2954 bytes --]

On 15.01.20 16:46, Miklos Szeredi wrote:
> On Wed, Jan 15, 2020 at 4:23 PM Max Reitz <mreitz@redhat.com> wrote:
>>
>> On 15.01.20 15:58, Miklos Szeredi wrote:
>>> On Wed, Jan 15, 2020 at 1:51 PM Max Reitz <mreitz@redhat.com> wrote:
>>>>
>>>> On 15.01.20 12:51, Miklos Szeredi wrote:
>>>>> On Wed, Jan 15, 2020 at 11:10 AM Miklos Szeredi <mszeredi@redhat.com> wrote:
>>>>>
>>>>>> New operations needed:
>>>>>>
>>>>>> LOOKUP_WITH_HANDLE: same as LOOKUP, but returns handle in addition to
>>>>>> fuse_ino_t and attributes.
>>>>>> LOOKUP_BY_HANDLE: same as LOOKUP, but gets a handle as input.
>>>>>
>>>>> And that can be reduced further to a single extended LOOKUP_HANDLE,
>>>>> which takes a handle as input and returns a handle in addition to the
>>>>> usual stuff.  The lookup-by-handle case can be done with a special
>>>>> name.  Currently "." is used for this purpose in the fuse protocol,
>>>>> which is a bit confusing since it has nothing to do the "." directory
>>>>> entry, but at least we don't have to introduce a new concept for this.
>>>>
>>>> I’m afraid I don’t quite understand.  What handle would LOOKUP_HANDLE
>>>> take as input when I want to open a new file (as in, LOOKUP_WITH_HANDLE)?
>>>
>>> For example open("/foo/bar", ...) would do:
>>>
>>> lookup_handle($ROOT_HANDLE, "foo") -> { FOO_HANDLE, FOO_NODEID }
>>> lookup_handle($FOO_HANDLE, "bar") -> { BAR_HANDLE, BAR_NODEID }
>>> open($BAR_NODEID, ...)
>>
>> OK.  It seems a bit weird to me because if all other commands used a
>> fuse_ino_t to identify nodes, I think lookup should be no different.
>>
>> Sure, we’d only need one new command instead of two, but naively just
>> adding two commands doesn’t seem fundamentally more difficult to me.
>>
>> Specifically, what seems a bit problematic to me is that for virtiofsd
>> we’d probably want fuse_ino_t to stay as it is: An index into a vector.
>>  We probably don’t want to store the handle in virtiofsd at all.
>>
>> So whenever a handle comes in, we’d first need to open it with
>> open_by_handle_at().  That’s OK if handles only come in through
>> lookup_by_handle(), i.e. when there is no valid fuse_ino_t value anyway.
>>  But it seems a bit cumbersome if all lookups are based on handles.
>>
>> OTOH, we could keep the handles around, but that would probably require
>> a hash map with the handle as a key (so we can let lookup() find the
>> existing fd without having to open a new one).  Doable, sure, but I
>> wonder whether it wouldn’t be just easier to have two separate commands.
> 
> So, let's modify that to
> 
>>> lookup_handle($ROOT_HANDLE, $ROOT_NODEID, "foo") -> { FOO_HANDLE, FOO_NODEID }
>>> lookup_handle($FOO_HANDLE, $FOO_NODEID, "bar") -> { BAR_HANDLE, BAR_NODEID }
>>> open($BAR_NODEID, ...)
> 
> I think that fixes your concern.  Implementation is free to use either
> the nodeid or the handle.

Works for me. :-)

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Virtio-fs] Ways to uniquely and persistently identify nodes
  2020-01-15 15:19                   ` Max Reitz
@ 2020-01-15 16:01                     ` Miklos Szeredi
  2020-01-15 16:12                       ` Max Reitz
  0 siblings, 1 reply; 25+ messages in thread
From: Miklos Szeredi @ 2020-01-15 16:01 UTC (permalink / raw)
  To: Max Reitz; +Cc: virtio-fs-list

On Wed, Jan 15, 2020 at 4:30 PM Max Reitz <mreitz@redhat.com> wrote:
>
> On 15.01.20 16:05, Miklos Szeredi wrote:
> > On Wed, Jan 15, 2020 at 3:58 PM Miklos Szeredi <mszeredi@redhat.com> wrote:
> >>
> >> On Wed, Jan 15, 2020 at 1:51 PM Max Reitz <mreitz@redhat.com> wrote:
> >>>
> >>> On 15.01.20 12:51, Miklos Szeredi wrote:
> >>>> On Wed, Jan 15, 2020 at 11:10 AM Miklos Szeredi <mszeredi@redhat.com> wrote:
> >>>>
> >>>>> New operations needed:
> >>>>>
> >>>>> LOOKUP_WITH_HANDLE: same as LOOKUP, but returns handle in addition to
> >>>>> fuse_ino_t and attributes.
> >>>>> LOOKUP_BY_HANDLE: same as LOOKUP, but gets a handle as input.
> >>>>
> >>>> And that can be reduced further to a single extended LOOKUP_HANDLE,
> >>>> which takes a handle as input and returns a handle in addition to the
> >>>> usual stuff.  The lookup-by-handle case can be done with a special
> >>>> name.  Currently "." is used for this purpose in the fuse protocol,
> >>>> which is a bit confusing since it has nothing to do the "." directory
> >>>> entry, but at least we don't have to introduce a new concept for this.
> >>>
> >>> I’m afraid I don’t quite understand.  What handle would LOOKUP_HANDLE
> >>> take as input when I want to open a new file (as in, LOOKUP_WITH_HANDLE)?
> >>
> >> For example open("/foo/bar", ...) would do:
> >>
> >> lookup_handle($ROOT_HANDLE, "foo") -> { FOO_HANDLE, FOO_NODEID }
> >> lookup_handle($FOO_HANDLE, "bar") -> { BAR_HANDLE, BAR_NODEID }
> >> open($BAR_NODEID, ...)
> >
> > Note also that this scheme makes fuse_ino_t values non-reusable,
> > otherwise there could be messy conflicts with client using an old
> > value that the server interprets as a new value.
>
> But wouldn’t it just be against the protocol to use a dropped or
> invalidated fuse_ino_t value?
>
> > But with the 64bit fuse_ino_t space wraparound shouldn't be an issue.
>
> I don’t really see the problem with the current solution of having
> fuse_ino_t be an index into a vector, and old values being reused.  As
> far as I understand, there is a protocol for refcount fuse_ino_t values
> and the clients should not reuse values whose refcount has gone to zero.

I'm saying that one big advantage of having the client keep hold of
the persistent handle is that the server is free to drop its cached
objects even before the refcount drops to zero.  At that point the
client is holding onto an invalid fuse_ino_t value and if that value
is reused by the server, that will result in a mess.

>
> (And we’d add to that protocol by adding a way for the server to make
> the client invalidate an existing fuse_ino_t value and request a new
> one.  I suppose we’d do that just by responding e.g. EINVAL if a
> fuse_ino_t value is used that the server doesn’t recognize, and then the
> client has to invoke lookup_(by_)handle to get a new value.)

Yes, except we'd better use something more specific (EFUSEINOINVALID)
instead of EINVAL to make sure it doesn't conflict with real errors
returned from the operation.

Thanks,
Miklos



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Virtio-fs] Ways to uniquely and persistently identify nodes
  2020-01-15 16:01                     ` Miklos Szeredi
@ 2020-01-15 16:12                       ` Max Reitz
  0 siblings, 0 replies; 25+ messages in thread
From: Max Reitz @ 2020-01-15 16:12 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: virtio-fs-list


[-- Attachment #1.1: Type: text/plain, Size: 3226 bytes --]

On 15.01.20 17:01, Miklos Szeredi wrote:
> On Wed, Jan 15, 2020 at 4:30 PM Max Reitz <mreitz@redhat.com> wrote:
>>
>> On 15.01.20 16:05, Miklos Szeredi wrote:
>>> On Wed, Jan 15, 2020 at 3:58 PM Miklos Szeredi <mszeredi@redhat.com> wrote:
>>>>
>>>> On Wed, Jan 15, 2020 at 1:51 PM Max Reitz <mreitz@redhat.com> wrote:
>>>>>
>>>>> On 15.01.20 12:51, Miklos Szeredi wrote:
>>>>>> On Wed, Jan 15, 2020 at 11:10 AM Miklos Szeredi <mszeredi@redhat.com> wrote:
>>>>>>
>>>>>>> New operations needed:
>>>>>>>
>>>>>>> LOOKUP_WITH_HANDLE: same as LOOKUP, but returns handle in addition to
>>>>>>> fuse_ino_t and attributes.
>>>>>>> LOOKUP_BY_HANDLE: same as LOOKUP, but gets a handle as input.
>>>>>>
>>>>>> And that can be reduced further to a single extended LOOKUP_HANDLE,
>>>>>> which takes a handle as input and returns a handle in addition to the
>>>>>> usual stuff.  The lookup-by-handle case can be done with a special
>>>>>> name.  Currently "." is used for this purpose in the fuse protocol,
>>>>>> which is a bit confusing since it has nothing to do the "." directory
>>>>>> entry, but at least we don't have to introduce a new concept for this.
>>>>>
>>>>> I’m afraid I don’t quite understand.  What handle would LOOKUP_HANDLE
>>>>> take as input when I want to open a new file (as in, LOOKUP_WITH_HANDLE)?
>>>>
>>>> For example open("/foo/bar", ...) would do:
>>>>
>>>> lookup_handle($ROOT_HANDLE, "foo") -> { FOO_HANDLE, FOO_NODEID }
>>>> lookup_handle($FOO_HANDLE, "bar") -> { BAR_HANDLE, BAR_NODEID }
>>>> open($BAR_NODEID, ...)
>>>
>>> Note also that this scheme makes fuse_ino_t values non-reusable,
>>> otherwise there could be messy conflicts with client using an old
>>> value that the server interprets as a new value.
>>
>> But wouldn’t it just be against the protocol to use a dropped or
>> invalidated fuse_ino_t value?
>>
>>> But with the 64bit fuse_ino_t space wraparound shouldn't be an issue.
>>
>> I don’t really see the problem with the current solution of having
>> fuse_ino_t be an index into a vector, and old values being reused.  As
>> far as I understand, there is a protocol for refcount fuse_ino_t values
>> and the clients should not reuse values whose refcount has gone to zero.
> 
> I'm saying that one big advantage of having the client keep hold of
> the persistent handle is that the server is free to drop its cached
> objects even before the refcount drops to zero.  At that point the
> client is holding onto an invalid fuse_ino_t value and if that value
> is reused by the server, that will result in a mess.

Oh, yes, sure.

>> (And we’d add to that protocol by adding a way for the server to make
>> the client invalidate an existing fuse_ino_t value and request a new
>> one.  I suppose we’d do that just by responding e.g. EINVAL if a
>> fuse_ino_t value is used that the server doesn’t recognize, and then the
>> client has to invoke lookup_(by_)handle to get a new value.)
> 
> Yes, except we'd better use something more specific (EFUSEINOINVALID)
> instead of EINVAL to make sure it doesn't conflict with real errors
> returned from the operation.

If we can do that, sure, that’s better.

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Virtio-fs] Ways to uniquely and persistently identify nodes
  2020-01-15 13:00         ` Max Reitz
@ 2020-01-16 20:32           ` Vivek Goyal
  2020-01-17 18:13             ` Max Reitz
  0 siblings, 1 reply; 25+ messages in thread
From: Vivek Goyal @ 2020-01-16 20:32 UTC (permalink / raw)
  To: Max Reitz; +Cc: virtio-fs-list

On Wed, Jan 15, 2020 at 02:00:27PM +0100, Max Reitz wrote:
> On 15.01.20 13:01, Stefan Hajnoczi wrote:
> > On Tue, Jan 14, 2020 at 08:24:51PM +0100, Miklos Szeredi wrote:
> >> On Tue, Jan 14, 2020 at 6:13 PM Max Reitz <mreitz@redhat.com> wrote:
> >>> What worries me most is how to pass that object around to all FUSE
> >>> functions, and that they all need a new interface.
> >>
> >> You mean libfuse API?
> >>
> >>> I just had a very fuzzy (and maybe stupid) idea: Maybe we could keep an
> >>> internal vector of currently active handles and then when variable-size
> >>> handles are enabled, fuse_ino_t would just act as an index into that vector?
> >>>
> >>> I suppose we could then use the full-size handles in all messages and
> >>> just hand out temporary indices to existing functions (just so we don’t
> >>> have to change their interface).  Server and client have their own
> >>> vectors, because when they communicate, only the full handles have meaning.
> >>>
> >>> Or we could implement the table on top of the current system by sharing
> >>> it between the client and server.  Whenever the server creates a
> >>> fuse_ino_t value, it then also creates a full-size handle, and returns
> >>> both the handle and its corresponding fuse_ino_t value to the server.
> >>> The server can use the fuse_ino_t normally most of the time, but with a
> >>> catch: The client would be able to invalidate it.  Then the server needs
> >>> to obtain a new fuse_ino_t value for the existing handle.
> >>> (Invalidating and reacquiring a fuse_ino_t value would be new FUSE
> >>> operations.)
> >>
> >> I think you may have accidentally switched "server" and "client" in
> >> the above description a couple of times.
> >>
> >> But if I'm reading this correctly, your idea is to keep the 64bit
> >> value on the interface (this could be just libfuse or both libfuse and
> >> the kernel API) and add new interfaces to establish and reestablish
> >> the mapping from (non-persistent) fuse_ino_t to (persistent) handle.
> > 
> > I'm not sure I understand Max's idea.
> > 
> > Miklos, I think you're saying keep fuse_ino_t semantics unchanged (not
> > persistent, can be reused after the last reference is dropped) and add a
> > separate struct export_operations-style FUSE API to map fuse_ino_t <->
> > struct file_handle.
> 
> Yes, that’d be the idea.
> 
> > We'd need to use submounts to avoid st_ino collisions in that case.
> 
> Yes, because that idea wouldn’t solve the st_ino problem.  (Only
> persistent fuse_ino_t values would solve it.  This proposal is exactly
> the opposite, we want them to be absolutely not persistent.)
> 
> So we’d still want submounts with separate st_devs and pass through
> st_ino from the host.

I am not sure I understand whole of the discussion. Here is my
understanding. Please correct me if I got it wrong.

So we seem to have two main problems.

- st_inode number collision as seen by user space in case of multiple
  submounts at host in directory being exported. And solution for this
  seems to be supporting submounts where st_dev is reported from
  submount in guest and st_ino is passthrough from host.

- Being able have a persistent handle for inode on host. This will allow
  us to support export API on virtiofs and also allow server to trim
  cache despite the fact that client still has reference to that inode.

IIUC, first one is more pressing for us and second one can wait. Have
I got it right?

Having said that, I don't know how complex it is to support submounts
and what are other implications.

Thanks
Vivek


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Virtio-fs] Ways to uniquely and persistently identify nodes
  2020-01-16 20:32           ` Vivek Goyal
@ 2020-01-17 18:13             ` Max Reitz
  0 siblings, 0 replies; 25+ messages in thread
From: Max Reitz @ 2020-01-17 18:13 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs-list


[-- Attachment #1.1: Type: text/plain, Size: 4479 bytes --]

On 16.01.20 21:32, Vivek Goyal wrote:
> On Wed, Jan 15, 2020 at 02:00:27PM +0100, Max Reitz wrote:
>> On 15.01.20 13:01, Stefan Hajnoczi wrote:
>>> On Tue, Jan 14, 2020 at 08:24:51PM +0100, Miklos Szeredi wrote:
>>>> On Tue, Jan 14, 2020 at 6:13 PM Max Reitz <mreitz@redhat.com> wrote:
>>>>> What worries me most is how to pass that object around to all FUSE
>>>>> functions, and that they all need a new interface.
>>>>
>>>> You mean libfuse API?
>>>>
>>>>> I just had a very fuzzy (and maybe stupid) idea: Maybe we could keep an
>>>>> internal vector of currently active handles and then when variable-size
>>>>> handles are enabled, fuse_ino_t would just act as an index into that vector?
>>>>>
>>>>> I suppose we could then use the full-size handles in all messages and
>>>>> just hand out temporary indices to existing functions (just so we don’t
>>>>> have to change their interface).  Server and client have their own
>>>>> vectors, because when they communicate, only the full handles have meaning.
>>>>>
>>>>> Or we could implement the table on top of the current system by sharing
>>>>> it between the client and server.  Whenever the server creates a
>>>>> fuse_ino_t value, it then also creates a full-size handle, and returns
>>>>> both the handle and its corresponding fuse_ino_t value to the server.
>>>>> The server can use the fuse_ino_t normally most of the time, but with a
>>>>> catch: The client would be able to invalidate it.  Then the server needs
>>>>> to obtain a new fuse_ino_t value for the existing handle.
>>>>> (Invalidating and reacquiring a fuse_ino_t value would be new FUSE
>>>>> operations.)
>>>>
>>>> I think you may have accidentally switched "server" and "client" in
>>>> the above description a couple of times.
>>>>
>>>> But if I'm reading this correctly, your idea is to keep the 64bit
>>>> value on the interface (this could be just libfuse or both libfuse and
>>>> the kernel API) and add new interfaces to establish and reestablish
>>>> the mapping from (non-persistent) fuse_ino_t to (persistent) handle.
>>>
>>> I'm not sure I understand Max's idea.
>>>
>>> Miklos, I think you're saying keep fuse_ino_t semantics unchanged (not
>>> persistent, can be reused after the last reference is dropped) and add a
>>> separate struct export_operations-style FUSE API to map fuse_ino_t <->
>>> struct file_handle.
>>
>> Yes, that’d be the idea.
>>
>>> We'd need to use submounts to avoid st_ino collisions in that case.
>>
>> Yes, because that idea wouldn’t solve the st_ino problem.  (Only
>> persistent fuse_ino_t values would solve it.  This proposal is exactly
>> the opposite, we want them to be absolutely not persistent.)
>>
>> So we’d still want submounts with separate st_devs and pass through
>> st_ino from the host.
> 
> I am not sure I understand whole of the discussion. Here is my
> understanding. Please correct me if I got it wrong.
> 
> So we seem to have two main problems.
> 
> - st_inode number collision as seen by user space in case of multiple
>   submounts at host in directory being exported. And solution for this
>   seems to be supporting submounts where st_dev is reported from
>   submount in guest and st_ino is passthrough from host.
> 
> - Being able have a persistent handle for inode on host. This will allow
>   us to support export API on virtiofs and also allow server to trim
>   cache despite the fact that client still has reference to that inode.

It would also allow us to migrate virtiofsd, as long as the destination
can make sense of the existing handles.

Trimming the cache wasn’t really a problem AFAIU, but with the right
solution to the problem we can address that, too, yes.

As I wrote, I see three problems: (1) Inode collision, (2)
migrateability, (3) passing persistent file handles to the client.

> IIUC, first one is more pressing for us and second one can wait. Have
> I got it right?

It was my impression that file handles were more pressing in practice,
but I don’t know.  I’m currently trying to solve them all in one package.

> Having said that, I don't know how complex it is to support submounts
> and what are other implications.

As far as I’ve seen so far, it’s conceptually rather simple, the devil’s
just in the details.  (It’s easy to do a submount, the question is how
we can make that submount then use a relative base directory as its root.)

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2020-01-17 18:13 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-13 17:46 [Virtio-fs] Ways to uniquely and persistently identify nodes Max Reitz
2020-01-14 16:13 ` Miklos Szeredi
2020-01-14 16:49   ` Max Reitz
2020-01-14 19:24     ` Miklos Szeredi
2020-01-15  9:37       ` Max Reitz
2020-01-15  9:39         ` Max Reitz
2020-01-15 10:10         ` Miklos Szeredi
2020-01-15 11:51           ` Miklos Szeredi
2020-01-15 12:51             ` Max Reitz
2020-01-15 14:58               ` Miklos Szeredi
2020-01-15 15:05                 ` Miklos Szeredi
2020-01-15 15:19                   ` Max Reitz
2020-01-15 16:01                     ` Miklos Szeredi
2020-01-15 16:12                       ` Max Reitz
2020-01-15 15:12                 ` Max Reitz
2020-01-15 15:46                   ` Miklos Szeredi
2020-01-15 15:52                     ` Max Reitz
2020-01-15 12:01       ` Stefan Hajnoczi
2020-01-15 13:00         ` Max Reitz
2020-01-16 20:32           ` Vivek Goyal
2020-01-17 18:13             ` Max Reitz
2020-01-15 10:12 ` Dr. David Alan Gilbert
2020-01-15 12:58   ` Max Reitz
2020-01-15 12:50 ` Stefan Hajnoczi
2020-01-15 14:21   ` Max Reitz

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.