ceph-devel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* fscrypt and file truncation on cephfs
@ 2021-03-11 16:14 Jeff Layton
  2021-03-12  4:17 ` Patrick Donnelly
  0 siblings, 1 reply; 10+ messages in thread
From: Jeff Layton @ 2021-03-11 16:14 UTC (permalink / raw)
  To: dev, open list:CEPH DISTRIBUTED...

tl;dr version: in cephfs, the MDS handles truncating object data when
inodes are truncated. This is problematic with fscrypt.

Longer version:

I've been working on a patchset to add fscrypt support to kcephfs, and
have hit a problem with the way that truncation is handled. The main
issue is that fscrypt uses block-based ciphers, so we must ensure that
we read and write complete crypto blocks on the OSDs.

I'm currently using 4k crypto blocks, but we may want to allow this to
be tunable eventually (though it will need to be smaller than and align
with the OSD object size). For simplicity's sake, I'm planning to
disallow custom layouts on encrypted inodes. We could consider adding
that later (but it doesn't sound likely to be worthwhile).

Normally, when a file is truncated (usually via a SETATTR MDS call), the
MDS handles truncating or deleting objects on the OSDs. This is done
somewhat lazily in that the MDS replies to the client before this
process is complete (AFAICT).

Once we add fscrypt support, the MDS handling truncation becomes a
problem, in that we need to be able to deal with complete crypto blocks.
Letting the MDS truncate away part of a block will leave us with a block
that can't be decrypted.

There are a number of possible approaches to fixing this, but ultimately
the client will have to zero-pad, encrypt and write the blocks at the
edges since the MDS doesn't have access to the keys.

There are several possible approaches that I've identified:

1/ We could teach the MDS the crypto blocksize, and ensure that it
doesn't truncate away partial blocks. The client could tell the MDS what
blocksize it's using on the inode and the MDS could ensure that
truncates align to the blocks. The client will still need to write
partial blocks at the edges of holes or at the EOF, and it probably
shouldn't do that until it gets the unstable reply from the MDS. We
could handle this by adding a new truncate op or extending the existing
one.

2/ We could cede the object truncate/delete to the client altogether.
The MDS is aware when an inode is encrypted so it could just not do it
for those inodes. We also already handle hole punching completely on the
client (though the size doesn't change there). Truncate could be a
special case of that. Probably, the client would issue the truncate and
then be responsible for deleting/rewriting blocks after that reply comes
in. We'd have to consider how to handle delinquent clients that don't
clean up correctly.

3/ We could maintain a separate field in the inode for the real
inode->i_size that crypto-enabled clients would use. The client would
always communicate a size to the MDS that is rounded up to the end of
the last crypto block, such that the "true" size of the inode on disk
would always be represented in the rstats. Only crypto-enabled clients
would care about the "realsize" field. In fact, this value could
_itself_ be encrypted too, so that the i_size of the file is masked from
clients that don't have keys.

Ceph's truncation machinery is pretty complex in general, so I could
have missed other approaches or something that makes these ideas
impossible. I'm leaning toward #3 here since I think it has the most
benefit and keeps the MDS out of the whole business.

What should we do here?
-- 
Jeff Layton <jlayton@redhat.com>


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: fscrypt and file truncation on cephfs
  2021-03-11 16:14 fscrypt and file truncation on cephfs Jeff Layton
@ 2021-03-12  4:17 ` Patrick Donnelly
  2021-03-12  8:43   ` Gregory Farnum
  2021-03-12 12:38   ` Jeff Layton
  0 siblings, 2 replies; 10+ messages in thread
From: Patrick Donnelly @ 2021-03-12  4:17 UTC (permalink / raw)
  To: Jeff Layton; +Cc: dev, open list:CEPH DISTRIBUTED...

On Thu, Mar 11, 2021 at 8:15 AM Jeff Layton <jlayton@redhat.com> wrote:
>
> tl;dr version: in cephfs, the MDS handles truncating object data when
> inodes are truncated. This is problematic with fscrypt.
>
> Longer version:
>
> I've been working on a patchset to add fscrypt support to kcephfs, and
> have hit a problem with the way that truncation is handled. The main
> issue is that fscrypt uses block-based ciphers, so we must ensure that
> we read and write complete crypto blocks on the OSDs.
>
> I'm currently using 4k crypto blocks, but we may want to allow this to
> be tunable eventually (though it will need to be smaller than and align
> with the OSD object size). For simplicity's sake, I'm planning to
> disallow custom layouts on encrypted inodes. We could consider adding
> that later (but it doesn't sound likely to be worthwhile).
>
> Normally, when a file is truncated (usually via a SETATTR MDS call), the
> MDS handles truncating or deleting objects on the OSDs. This is done
> somewhat lazily in that the MDS replies to the client before this
> process is complete (AFAICT).

So I've done some more research on this and it's not that simplistic.
Broadly, a truncate causes the following to happen:

- Revoke all write caps (but not Fcb) from clients.

- Journal the truncate operation.

- Respond with unsafe reply.

- After setattr is journalled, regrant Fs with new file size,
truncate_seq, truncate_size

- issue trunc cap update with new file size, truncate_seq,
truncate_size (looks redundant with prior step)

- actually start truncating objects above file size; concurrently
grant all wanted Fwb... caps wanted by client

- reply safe

From what I can tell, the clients use the truncate_seq/truncate_size
to avoid writing to data what the MDS plans to truncate. I haven't
really dug into how that works. Maybe someone more familiar with that
code can chime in.

So the MDS seems to truncate/delete objects lazily in the background
but it does so safely and consistently.

> Once we add fscrypt support, the MDS handling truncation becomes a
> problem, in that we need to be able to deal with complete crypto blocks.
> Letting the MDS truncate away part of a block will leave us with a block
> that can't be decrypted.
>
> There are a number of possible approaches to fixing this, but ultimately
> the client will have to zero-pad, encrypt and write the blocks at the
> edges since the MDS doesn't have access to the keys.
>
> There are several possible approaches that I've identified:
>
> 1/ We could teach the MDS the crypto blocksize, and ensure that it
> doesn't truncate away partial blocks. The client could tell the MDS what
> blocksize it's using on the inode and the MDS could ensure that
> truncates align to the blocks. The client will still need to write
> partial blocks at the edges of holes or at the EOF, and it probably
> shouldn't do that until it gets the unstable reply from the MDS. We
> could handle this by adding a new truncate op or extending the existing
> one.
>
> 2/ We could cede the object truncate/delete to the client altogether.
> The MDS is aware when an inode is encrypted so it could just not do it
> for those inodes. We also already handle hole punching completely on the
> client (though the size doesn't change there). Truncate could be a
> special case of that. Probably, the client would issue the truncate and
> then be responsible for deleting/rewriting blocks after that reply comes
> in. We'd have to consider how to handle delinquent clients that don't
> clean up correctly.

We can't really do this I think. The MDS necessarily mediates between
clients when files are truncated.

> 3/ We could maintain a separate field in the inode for the real
> inode->i_size that crypto-enabled clients would use. The client would
> always communicate a size to the MDS that is rounded up to the end of
> the last crypto block, such that the "true" size of the inode on disk
> would always be represented in the rstats. Only crypto-enabled clients
> would care about the "realsize" field. In fact, this value could
> _itself_ be encrypted too, so that the i_size of the file is masked from
> clients that don't have keys.
>
> Ceph's truncation machinery is pretty complex in general, so I could
> have missed other approaches or something that makes these ideas
> impossible. I'm leaning toward #3 here since I think it has the most
> benefit and keeps the MDS out of the whole business.

"realsize" could be mediated by the same locks as the inode size, so
it should not be a complicated addition. Informing the MDS about a
blocksize may be worse in the long run as it complicates all the
truncate code paths, I think. From our past conversations, I think we
posed (1) to generalize the (3) option? I don't have a strong opinion
now on which is better in the long run (either for encryption or the
maintainability of CephFS).

If you're going to encrypt the realsize I wonder what other metadata
you might encrypt?

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: fscrypt and file truncation on cephfs
  2021-03-12  4:17 ` Patrick Donnelly
@ 2021-03-12  8:43   ` Gregory Farnum
  2021-03-12 12:48     ` Jeff Layton
  2021-03-12 12:38   ` Jeff Layton
  1 sibling, 1 reply; 10+ messages in thread
From: Gregory Farnum @ 2021-03-12  8:43 UTC (permalink / raw)
  To: Patrick Donnelly; +Cc: Jeff Layton, dev, open list:CEPH DISTRIBUTED...

On Thu, Mar 11, 2021 at 8:18 PM Patrick Donnelly <pdonnell@redhat.com> wrote:
>
> On Thu, Mar 11, 2021 at 8:15 AM Jeff Layton <jlayton@redhat.com> wrote:
> >
> > tl;dr version: in cephfs, the MDS handles truncating object data when
> > inodes are truncated. This is problematic with fscrypt.
> >
> > Longer version:
> >
> > I've been working on a patchset to add fscrypt support to kcephfs, and
> > have hit a problem with the way that truncation is handled. The main
> > issue is that fscrypt uses block-based ciphers, so we must ensure that
> > we read and write complete crypto blocks on the OSDs.
> >
> > I'm currently using 4k crypto blocks, but we may want to allow this to
> > be tunable eventually (though it will need to be smaller than and align
> > with the OSD object size). For simplicity's sake, I'm planning to
> > disallow custom layouts on encrypted inodes. We could consider adding
> > that later (but it doesn't sound likely to be worthwhile).
> >
> > Normally, when a file is truncated (usually via a SETATTR MDS call), the
> > MDS handles truncating or deleting objects on the OSDs. This is done
> > somewhat lazily in that the MDS replies to the client before this
> > process is complete (AFAICT).
>
> So I've done some more research on this and it's not that simplistic.
> Broadly, a truncate causes the following to happen:
>
> - Revoke all write caps (but not Fcb) from clients.
>
> - Journal the truncate operation.
>
> - Respond with unsafe reply.
>
> - After setattr is journalled, regrant Fs with new file size,
> truncate_seq, truncate_size
>
> - issue trunc cap update with new file size, truncate_seq,
> truncate_size (looks redundant with prior step)
>
> - actually start truncating objects above file size; concurrently
> grant all wanted Fwb... caps wanted by client
>
> - reply safe
>
> From what I can tell, the clients use the truncate_seq/truncate_size
> to avoid writing to data what the MDS plans to truncate. I haven't
> really dug into how that works. Maybe someone more familiar with that
> code can chime in.
>
> So the MDS seems to truncate/delete objects lazily in the background
> but it does so safely and consistently.

Right; ti's lazy in that it's not done immediately in a blocking
manner, but it's absolutely safe. Truncate seq and size are also
fields you can send to the OSD on read or write operations, and the
client includes them on every op. It just has to do a (reasonably)
simple conversion from the total truncate size the MDS gives it to
what that means for the object being accessed (based on the striping
pattern and object number).

I'll try and think a bit more on how to handle the special extra size
for encryption.

...although in my current sleep-addled state, I'm actually not sure we
need to add any permanent storage to the MDS to handle this case! We
can probably just extend the front-end truncate op so that it can take
a separate "real-truncate-size" and the logical file size, can't we?
-Greg

>
> > Once we add fscrypt support, the MDS handling truncation becomes a
> > problem, in that we need to be able to deal with complete crypto blocks.
> > Letting the MDS truncate away part of a block will leave us with a block
> > that can't be decrypted.
> >
> > There are a number of possible approaches to fixing this, but ultimately
> > the client will have to zero-pad, encrypt and write the blocks at the
> > edges since the MDS doesn't have access to the keys.
> >
> > There are several possible approaches that I've identified:
> >
> > 1/ We could teach the MDS the crypto blocksize, and ensure that it
> > doesn't truncate away partial blocks. The client could tell the MDS what
> > blocksize it's using on the inode and the MDS could ensure that
> > truncates align to the blocks. The client will still need to write
> > partial blocks at the edges of holes or at the EOF, and it probably
> > shouldn't do that until it gets the unstable reply from the MDS. We
> > could handle this by adding a new truncate op or extending the existing
> > one.
> >
> > 2/ We could cede the object truncate/delete to the client altogether.
> > The MDS is aware when an inode is encrypted so it could just not do it
> > for those inodes. We also already handle hole punching completely on the
> > client (though the size doesn't change there). Truncate could be a
> > special case of that. Probably, the client would issue the truncate and
> > then be responsible for deleting/rewriting blocks after that reply comes
> > in. We'd have to consider how to handle delinquent clients that don't
> > clean up correctly.
>
> We can't really do this I think. The MDS necessarily mediates between
> clients when files are truncated.
>
> > 3/ We could maintain a separate field in the inode for the real
> > inode->i_size that crypto-enabled clients would use. The client would
> > always communicate a size to the MDS that is rounded up to the end of
> > the last crypto block, such that the "true" size of the inode on disk
> > would always be represented in the rstats. Only crypto-enabled clients
> > would care about the "realsize" field. In fact, this value could
> > _itself_ be encrypted too, so that the i_size of the file is masked from
> > clients that don't have keys.
> >
> > Ceph's truncation machinery is pretty complex in general, so I could
> > have missed other approaches or something that makes these ideas
> > impossible. I'm leaning toward #3 here since I think it has the most
> > benefit and keeps the MDS out of the whole business.
>
> "realsize" could be mediated by the same locks as the inode size, so
> it should not be a complicated addition. Informing the MDS about a
> blocksize may be worse in the long run as it complicates all the
> truncate code paths, I think. From our past conversations, I think we
> posed (1) to generalize the (3) option? I don't have a strong opinion
> now on which is better in the long run (either for encryption or the
> maintainability of CephFS).
>
> If you're going to encrypt the realsize I wonder what other metadata
> you might encrypt?
>
> --
> Patrick Donnelly, Ph.D.
> He / Him / His
> Principal Software Engineer
> Red Hat Sunnyvale, CA
> GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
> _______________________________________________
> Dev mailing list -- dev@ceph.io
> To unsubscribe send an email to dev-leave@ceph.io
>


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: fscrypt and file truncation on cephfs
  2021-03-12  4:17 ` Patrick Donnelly
  2021-03-12  8:43   ` Gregory Farnum
@ 2021-03-12 12:38   ` Jeff Layton
  1 sibling, 0 replies; 10+ messages in thread
From: Jeff Layton @ 2021-03-12 12:38 UTC (permalink / raw)
  To: Patrick Donnelly; +Cc: dev, open list:CEPH DISTRIBUTED...

On Thu, 2021-03-11 at 20:17 -0800, Patrick Donnelly wrote:
> On Thu, Mar 11, 2021 at 8:15 AM Jeff Layton <jlayton@redhat.com> wrote:
> > 
> > tl;dr version: in cephfs, the MDS handles truncating object data when
> > inodes are truncated. This is problematic with fscrypt.
> > 
> > Longer version:
> > 
> > I've been working on a patchset to add fscrypt support to kcephfs, and
> > have hit a problem with the way that truncation is handled. The main
> > issue is that fscrypt uses block-based ciphers, so we must ensure that
> > we read and write complete crypto blocks on the OSDs.
> > 
> > I'm currently using 4k crypto blocks, but we may want to allow this to
> > be tunable eventually (though it will need to be smaller than and align
> > with the OSD object size). For simplicity's sake, I'm planning to
> > disallow custom layouts on encrypted inodes. We could consider adding
> > that later (but it doesn't sound likely to be worthwhile).
> > 
> > Normally, when a file is truncated (usually via a SETATTR MDS call), the
> > MDS handles truncating or deleting objects on the OSDs. This is done
> > somewhat lazily in that the MDS replies to the client before this
> > process is complete (AFAICT).
> 
> So I've done some more research on this and it's not that simplistic.
> Broadly, a truncate causes the following to happen:
> 
> - Revoke all write caps (but not Fcb) from clients.
> 
> - Journal the truncate operation.
> 
> - Respond with unsafe reply.
> 
> - After setattr is journalled, regrant Fs with new file size,
> truncate_seq, truncate_size
> 
> - issue trunc cap update with new file size, truncate_seq,
> truncate_size (looks redundant with prior step)
> 
> - actually start truncating objects above file size; concurrently
> grant all wanted Fwb... caps wanted by client
> 
> - reply safe
> 
> From what I can tell, the clients use the truncate_seq/truncate_size
> to avoid writing to data what the MDS plans to truncate. I haven't
> really dug into how that works. Maybe someone more familiar with that
> code can chime in.
> 
> So the MDS seems to truncate/delete objects lazily in the background
> but it does so safely and consistently.
> 
> > Once we add fscrypt support, the MDS handling truncation becomes a
> > problem, in that we need to be able to deal with complete crypto blocks.
> > Letting the MDS truncate away part of a block will leave us with a block
> > that can't be decrypted.
> > 
> > There are a number of possible approaches to fixing this, but ultimately
> > the client will have to zero-pad, encrypt and write the blocks at the
> > edges since the MDS doesn't have access to the keys.
> > 
> > There are several possible approaches that I've identified:
> > 
> > 1/ We could teach the MDS the crypto blocksize, and ensure that it
> > doesn't truncate away partial blocks. The client could tell the MDS what
> > blocksize it's using on the inode and the MDS could ensure that
> > truncates align to the blocks. The client will still need to write
> > partial blocks at the edges of holes or at the EOF, and it probably
> > shouldn't do that until it gets the unstable reply from the MDS. We
> > could handle this by adding a new truncate op or extending the existing
> > one.
> > 
> > 2/ We could cede the object truncate/delete to the client altogether.
> > The MDS is aware when an inode is encrypted so it could just not do it
> > for those inodes. We also already handle hole punching completely on the
> > client (though the size doesn't change there). Truncate could be a
> > special case of that. Probably, the client would issue the truncate and
> > then be responsible for deleting/rewriting blocks after that reply comes
> > in. We'd have to consider how to handle delinquent clients that don't
> > clean up correctly.
> 
> We can't really do this I think. The MDS necessarily mediates between
> clients when files are truncated.
> 

Ok. I suppose we could do something to ensure that the MDS doesn't do
anything conflicting until the client is done truncating, but that's
probably more complex overall than the other options.

> > 3/ We could maintain a separate field in the inode for the real
> > inode->i_size that crypto-enabled clients would use. The client would
> > always communicate a size to the MDS that is rounded up to the end of
> > the last crypto block, such that the "true" size of the inode on disk
> > would always be represented in the rstats. Only crypto-enabled clients
> > would care about the "realsize" field. In fact, this value could
> > _itself_ be encrypted too, so that the i_size of the file is masked from
> > clients that don't have keys.
> > 
> > Ceph's truncation machinery is pretty complex in general, so I could
> > have missed other approaches or something that makes these ideas
> > impossible. I'm leaning toward #3 here since I think it has the most
> > benefit and keeps the MDS out of the whole business.
> 
> "realsize" could be mediated by the same locks as the inode size, so
> it should not be a complicated addition. Informing the MDS about a
> blocksize may be worse in the long run as it complicates all the
> truncate code paths, I think. From our past conversations, I think we
> posed (1) to generalize the (3) option? I don't have a strong opinion
> now on which is better in the long run (either for encryption or the
> maintainability of CephFS).
> 

1 and 3 are a bit different. For 1, there is only the single size field
to maintain, and the MDS just has to be aware of the crypto blocksize so
it can do the truncation along the boundaries.

With 3, we'd have 2 size fields to maintain (though the MDS would treat
the new one as opaque). The clients would just be responsible for
setting the "original" size field along block boundaries and properly
maintaining the realsize.

It's a subtle difference, but I think #3 makes for fewer changes on the
MDS, and less awareness there of crypto in general.

> If you're going to encrypt the realsize I wonder what other metadata
> you might encrypt?
> 

I may have spoken too soon here. The min crypto blocksize is 16 bytes,
so we'd need to maintain a field at least that size in the inode if we
wanted to encrypt that data.

The realsize field would be a good candidate for that, but the other
fields probably can't easily be treated as opaque (maybe st_rdev, but
the other stuff probably can't be).

Still, we could potentially put realsize into a field that's 16 bytes
long and just zero-pad it out for now. We could then add other fields
later if we had some that were useful (maybe st_rdev?).

I'll probably not do that initially, but we could consider it later.
-- 
Jeff Layton <jlayton@redhat.com>


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: fscrypt and file truncation on cephfs
  2021-03-12  8:43   ` Gregory Farnum
@ 2021-03-12 12:48     ` Jeff Layton
  2021-03-12 19:45       ` Gregory Farnum
  0 siblings, 1 reply; 10+ messages in thread
From: Jeff Layton @ 2021-03-12 12:48 UTC (permalink / raw)
  To: Gregory Farnum, Patrick Donnelly; +Cc: dev, open list:CEPH DISTRIBUTED...

On Fri, 2021-03-12 at 00:43 -0800, Gregory Farnum wrote:
> On Thu, Mar 11, 2021 at 8:18 PM Patrick Donnelly <pdonnell@redhat.com> wrote:
> > 
> > On Thu, Mar 11, 2021 at 8:15 AM Jeff Layton <jlayton@redhat.com> wrote:
> > > 
> > > tl;dr version: in cephfs, the MDS handles truncating object data when
> > > inodes are truncated. This is problematic with fscrypt.
> > > 
> > > Longer version:
> > > 
> > > I've been working on a patchset to add fscrypt support to kcephfs, and
> > > have hit a problem with the way that truncation is handled. The main
> > > issue is that fscrypt uses block-based ciphers, so we must ensure that
> > > we read and write complete crypto blocks on the OSDs.
> > > 
> > > I'm currently using 4k crypto blocks, but we may want to allow this to
> > > be tunable eventually (though it will need to be smaller than and align
> > > with the OSD object size). For simplicity's sake, I'm planning to
> > > disallow custom layouts on encrypted inodes. We could consider adding
> > > that later (but it doesn't sound likely to be worthwhile).
> > > 
> > > Normally, when a file is truncated (usually via a SETATTR MDS call), the
> > > MDS handles truncating or deleting objects on the OSDs. This is done
> > > somewhat lazily in that the MDS replies to the client before this
> > > process is complete (AFAICT).
> > 
> > So I've done some more research on this and it's not that simplistic.
> > Broadly, a truncate causes the following to happen:
> > 
> > - Revoke all write caps (but not Fcb) from clients.
> > 
> > - Journal the truncate operation.
> > 
> > - Respond with unsafe reply.
> > 
> > - After setattr is journalled, regrant Fs with new file size,
> > truncate_seq, truncate_size
> > 
> > - issue trunc cap update with new file size, truncate_seq,
> > truncate_size (looks redundant with prior step)
> > 
> > - actually start truncating objects above file size; concurrently
> > grant all wanted Fwb... caps wanted by client
> > 
> > - reply safe
> > 
> > From what I can tell, the clients use the truncate_seq/truncate_size
> > to avoid writing to data what the MDS plans to truncate. I haven't
> > really dug into how that works. Maybe someone more familiar with that
> > code can chime in.
> > 
> > So the MDS seems to truncate/delete objects lazily in the background
> > but it does so safely and consistently.
> 
> Right; ti's lazy in that it's not done immediately in a blocking
> manner, but it's absolutely safe. Truncate seq and size are also
> fields you can send to the OSD on read or write operations, and the
> client includes them on every op. It just has to do a (reasonably)
> simple conversion from the total truncate size the MDS gives it to
> what that means for the object being accessed (based on the striping
> pattern and object number).
> 
> I'll try and think a bit more on how to handle the special extra size
> for encryption.
> 
> ...although in my current sleep-addled state, I'm actually not sure we
> need to add any permanent storage to the MDS to handle this case! We
> can probably just extend the front-end truncate op so that it can take
> a separate "real-truncate-size" and the logical file size, can't we?

That would be one nice thing about the approach of #1. Truncating the
size downward is always done via an explicit SETATTR op (I think), so we
could just extend that with a new field for that tells the MDS where to
stop truncating.

Note that regardless of the approach we choose, the client will still
need to do a read/modify/write on the edge block before we can really
treat the truncation as "done". I'm not yet sure whether that has any
bearing on the consistency/safety of the truncation process.

-- 
Jeff Layton <jlayton@redhat.com>


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: fscrypt and file truncation on cephfs
  2021-03-12 12:48     ` Jeff Layton
@ 2021-03-12 19:45       ` Gregory Farnum
  2021-03-12 20:24         ` Jeff Layton
  0 siblings, 1 reply; 10+ messages in thread
From: Gregory Farnum @ 2021-03-12 19:45 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Patrick Donnelly, dev, open list:CEPH DISTRIBUTED...

On Fri, Mar 12, 2021 at 4:49 AM Jeff Layton <jlayton@redhat.com> wrote:
>
> On Fri, 2021-03-12 at 00:43 -0800, Gregory Farnum wrote:

> > ...although in my current sleep-addled state, I'm actually not sure we
> > need to add any permanent storage to the MDS to handle this case! We
> > can probably just extend the front-end truncate op so that it can take
> > a separate "real-truncate-size" and the logical file size, can't we?
>
> That would be one nice thing about the approach of #1. Truncating the
> size downward is always done via an explicit SETATTR op (I think), so we
> could just extend that with a new field for that tells the MDS where to
> stop truncating.
>
> Note that regardless of the approach we choose, the client will still
> need to do a read/modify/write on the edge block before we can really
> treat the truncation as "done". I'm not yet sure whether that has any
> bearing on the consistency/safety of the truncation process.

Wait what? Let's talk more about that: I can't emphasize enough how
much you really, *really* don't want to require clients to do a RMW to
RADOS that is coordinated with MDS state changes that may modify the
same objects.

Isn't it enough for:
1) truncate op sets file size to new logical size, and truncate_size
to end of crypto block
2) clients do rados reads until the end of the crypto block and throw
away the data past the logical end-of-file when serving IO
3) clients can remove the old data on a write when they feel like it
and then set truncate_size to the real file size

The only tricky bit there is making sure the MDS respects the
truncate_size everywhere it needs to, and doesn't inadvertently force
it down to the file size when we don't want that happening.
I do foresee one leak, which is how we handle probing for file size
when a client with write caps disappears[2], but that's going to be
the case no matter what solution we come up with unless we're willing
to tolerate a two-phase commit. Which would be *awful*[1].
-Greg

[1]: We're talking about needing to send a crypto-truncate to the MDS,
have it withdraw all client caps, return a new "start your phase"
message/state which grants read and write caps to the truncating
client (which I have no idea how we'd do while a logical MDS write is
happening), have the client do the RMW on the tail object and return
its caps, commit the new file size, and then return the final op and
re-issue caps — and handle cleaning up if the client fails at any
point in that process. I guess if we did manage to jump through all
those hoops, client failure recovery would be deterministic and
accurate based on if the final object size was larger than or equal to
the attempted new size...
[2]: And now that I think about the implications of crypto some more,
we can even fix this. Since crypto clients can never do rados append
ops (and I don't think we can in the FS *anyway*?) we can just have
all crypto-client writes include setting a "logical size" xattr on the
rados object that the MDS can use when probing. We maybe should do
that for everything, actually...


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: fscrypt and file truncation on cephfs
  2021-03-12 19:45       ` Gregory Farnum
@ 2021-03-12 20:24         ` Jeff Layton
  2021-03-12 23:36           ` Gregory Farnum
  0 siblings, 1 reply; 10+ messages in thread
From: Jeff Layton @ 2021-03-12 20:24 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Patrick Donnelly, dev, open list:CEPH DISTRIBUTED...

On Fri, 2021-03-12 at 11:45 -0800, Gregory Farnum wrote:
> On Fri, Mar 12, 2021 at 4:49 AM Jeff Layton <jlayton@redhat.com> wrote:
> > 
> > On Fri, 2021-03-12 at 00:43 -0800, Gregory Farnum wrote:
> 
> > > ...although in my current sleep-addled state, I'm actually not sure we
> > > need to add any permanent storage to the MDS to handle this case! We
> > > can probably just extend the front-end truncate op so that it can take
> > > a separate "real-truncate-size" and the logical file size, can't we?
> > 
> > That would be one nice thing about the approach of #1. Truncating the
> > size downward is always done via an explicit SETATTR op (I think), so we
> > could just extend that with a new field for that tells the MDS where to
> > stop truncating.
> > 
> > Note that regardless of the approach we choose, the client will still
> > need to do a read/modify/write on the edge block before we can really
> > treat the truncation as "done". I'm not yet sure whether that has any
> > bearing on the consistency/safety of the truncation process.
> 
> Wait what? Let's talk more about that: I can't emphasize enough how
> much you really, *really* don't want to require clients to do a RMW to
> RADOS that is coordinated with MDS state changes that may modify the
> same objects.
> 
> Isn't it enough for:
> 1) truncate op sets file size to new logical size, and truncate_size
> to end of crypto block
> 2) clients do rados reads until the end of the crypto block and throw
> away the data past the logical end-of-file when serving IO
> 3) clients can remove the old data on a write when they feel like it
> and then set truncate_size to the real file size
> 
> The only tricky bit there is making sure the MDS respects the
> truncate_size everywhere it needs to, and doesn't inadvertently force
> it down to the file size when we don't want that happening.
>

That sounds like it would work. I'll probably take something like that
approach.

In principle, the MDS really should not care about the logical size of
the file at all. All it needs to know is how big the actual backing
objects are, and that will always be rounded up to the next crypto
block.

So, I'm sort of a fan of the idea of having a "private" size field that
only the clients with valid the crypto keys use. Everything else would
just report (and deal with) the rounded-up size. So, the clients would
always set the current inode size field to the rounded-up size, and then
maintain the actual logical file size in their private field.

This has the advantage that we don't need to tinker with the MDS
mechanics so much -- only extend the inode with the new field (which
sucks, but is a reasonably mechanical change).

This also dovetails nicely with having the client "lazily" zero-pad the
EOF block after resetting the size.

> I do foresee one leak, which is how we handle probing for file size
> when a client with write caps disappears[2], but that's going to be
> the case no matter what solution we come up with unless we're willing
> to tolerate a two-phase commit. Which would be *awful*[1].
> -Greg
> 
> [1]: We're talking about needing to send a crypto-truncate to the MDS,
> have it withdraw all client caps, return a new "start your phase"
> message/state which grants read and write caps to the truncating
> client (which I have no idea how we'd do while a logical MDS write is
> happening), have the client do the RMW on the tail object and return
> its caps, commit the new file size, and then return the final op and
> re-issue caps — and handle cleaning up if the client fails at any
> point in that process. I guess if we did manage to jump through all
> those hoops, client failure recovery would be deterministic and
> accurate based on if the final object size was larger than or equal to
> the attempted new size...

How does this probing work today?

> [2]: And now that I think about the implications of crypto some more,
> we can even fix this. Since crypto clients can never do rados append
> ops (and I don't think we can in the FS *anyway*?) we can just have
> all crypto-client writes include setting a "logical size" xattr on the
> rados object that the MDS can use when probing. We maybe should do
> that for everything, actually...
> 

That does sound like a good idea, and it wouldn't be too difficult to
implement since we're already going to be moving to doing more multi-op
OSD write calls anyway.

-- 
Jeff Layton <jlayton@redhat.com>


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: fscrypt and file truncation on cephfs
  2021-03-12 20:24         ` Jeff Layton
@ 2021-03-12 23:36           ` Gregory Farnum
  2021-03-18 19:20             ` Jeff Layton
  0 siblings, 1 reply; 10+ messages in thread
From: Gregory Farnum @ 2021-03-12 23:36 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Patrick Donnelly, dev, open list:CEPH DISTRIBUTED...

On Fri, Mar 12, 2021 at 12:24 PM Jeff Layton <jlayton@redhat.com> wrote:
>
> On Fri, 2021-03-12 at 11:45 -0800, Gregory Farnum wrote:
> > On Fri, Mar 12, 2021 at 4:49 AM Jeff Layton <jlayton@redhat.com> wrote:
> > >
> > > On Fri, 2021-03-12 at 00:43 -0800, Gregory Farnum wrote:
> >
> > > > ...although in my current sleep-addled state, I'm actually not sure we
> > > > need to add any permanent storage to the MDS to handle this case! We
> > > > can probably just extend the front-end truncate op so that it can take
> > > > a separate "real-truncate-size" and the logical file size, can't we?
> > >
> > > That would be one nice thing about the approach of #1. Truncating the
> > > size downward is always done via an explicit SETATTR op (I think), so we
> > > could just extend that with a new field for that tells the MDS where to
> > > stop truncating.
> > >
> > > Note that regardless of the approach we choose, the client will still
> > > need to do a read/modify/write on the edge block before we can really
> > > treat the truncation as "done". I'm not yet sure whether that has any
> > > bearing on the consistency/safety of the truncation process.
> >
> > Wait what? Let's talk more about that: I can't emphasize enough how
> > much you really, *really* don't want to require clients to do a RMW to
> > RADOS that is coordinated with MDS state changes that may modify the
> > same objects.
> >
> > Isn't it enough for:
> > 1) truncate op sets file size to new logical size, and truncate_size
> > to end of crypto block
> > 2) clients do rados reads until the end of the crypto block and throw
> > away the data past the logical end-of-file when serving IO
> > 3) clients can remove the old data on a write when they feel like it
> > and then set truncate_size to the real file size
> >
> > The only tricky bit there is making sure the MDS respects the
> > truncate_size everywhere it needs to, and doesn't inadvertently force
> > it down to the file size when we don't want that happening.
> >
>
> That sounds like it would work. I'll probably take something like that
> approach.
>
> In principle, the MDS really should not care about the logical size of
> the file at all. All it needs to know is how big the actual backing
> objects are, and that will always be rounded up to the next crypto
> block.
>
> So, I'm sort of a fan of the idea of having a "private" size field that
> only the clients with valid the crypto keys use. Everything else would
> just report (and deal with) the rounded-up size. So, the clients would
> always set the current inode size field to the rounded-up size, and then
> maintain the actual logical file size in their private field.
>
> This has the advantage that we don't need to tinker with the MDS
> mechanics so much -- only extend the inode with the new field (which
> sucks, but is a reasonably mechanical change).

Yeah, I initially really disliked the idea of adding a new "real
logical size" member, but the more I think about it the less I like
the alternatives, too. IIRC we go to a fair bit of trouble in the
server to handle sizes changing while other stuff is going on so
you'll want to look at that and see if we're okay just relying on the
clients to set their "logical crypto size" or not.

>
> This also dovetails nicely with having the client "lazily" zero-pad the
> EOF block after resetting the size.
>
> > I do foresee one leak, which is how we handle probing for file size
> > when a client with write caps disappears[2], but that's going to be
> > the case no matter what solution we come up with unless we're willing
> > to tolerate a two-phase commit. Which would be *awful*[1].
> > -Greg
> >
> > [1]: We're talking about needing to send a crypto-truncate to the MDS,
> > have it withdraw all client caps, return a new "start your phase"
> > message/state which grants read and write caps to the truncating
> > client (which I have no idea how we'd do while a logical MDS write is
> > happening), have the client do the RMW on the tail object and return
> > its caps, commit the new file size, and then return the final op and
> > re-issue caps — and handle cleaning up if the client fails at any
> > point in that process. I guess if we did manage to jump through all
> > those hoops, client failure recovery would be deterministic and
> > accurate based on if the final object size was larger than or equal to
> > the attempted new size...
>
> How does this probing work today?

The MDS checks for objects in the interval from the latest size it has
seen up to the max allowed size for the client, and sets the size to
the last byte it finds. There's nothing tricky about it. (It may also
probe *every* object in the file to try and resolve the mtime, but I'm
less sure about that.)

>
> > [2]: And now that I think about the implications of crypto some more,
> > we can even fix this. Since crypto clients can never do rados append
> > ops (and I don't think we can in the FS *anyway*?) we can just have
> > all crypto-client writes include setting a "logical size" xattr on the
> > rados object that the MDS can use when probing. We maybe should do
> > that for everything, actually...
> >
>
> That does sound like a good idea, and it wouldn't be too difficult to
> implement since we're already going to be moving to doing more multi-op
> OSD write calls anyway.
>
> --
> Jeff Layton <jlayton@redhat.com>
>


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: fscrypt and file truncation on cephfs
  2021-03-12 23:36           ` Gregory Farnum
@ 2021-03-18 19:20             ` Jeff Layton
  2021-03-19 18:53               ` Gregory Farnum
  0 siblings, 1 reply; 10+ messages in thread
From: Jeff Layton @ 2021-03-18 19:20 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Patrick Donnelly, dev, open list:CEPH DISTRIBUTED...

On Fri, 2021-03-12 at 15:36 -0800, Gregory Farnum wrote:
> On Fri, Mar 12, 2021 at 12:24 PM Jeff Layton <jlayton@redhat.com> wrote:
> > 
> > On Fri, 2021-03-12 at 11:45 -0800, Gregory Farnum wrote:
> > > On Fri, Mar 12, 2021 at 4:49 AM Jeff Layton <jlayton@redhat.com> wrote:
> > > > 
> > > > On Fri, 2021-03-12 at 00:43 -0800, Gregory Farnum wrote:
> > > 
> > > > > ...although in my current sleep-addled state, I'm actually not sure we
> > > > > need to add any permanent storage to the MDS to handle this case! We
> > > > > can probably just extend the front-end truncate op so that it can take
> > > > > a separate "real-truncate-size" and the logical file size, can't we?
> > > > 
> > > > That would be one nice thing about the approach of #1. Truncating the
> > > > size downward is always done via an explicit SETATTR op (I think), so we
> > > > could just extend that with a new field for that tells the MDS where to
> > > > stop truncating.
> > > > 
> > > > Note that regardless of the approach we choose, the client will still
> > > > need to do a read/modify/write on the edge block before we can really
> > > > treat the truncation as "done". I'm not yet sure whether that has any
> > > > bearing on the consistency/safety of the truncation process.
> > > 
> > > Wait what? Let's talk more about that: I can't emphasize enough how
> > > much you really, *really* don't want to require clients to do a RMW to
> > > RADOS that is coordinated with MDS state changes that may modify the
> > > same objects.
> > > 
> > > Isn't it enough for:
> > > 1) truncate op sets file size to new logical size, and truncate_size
> > > to end of crypto block
> > > 2) clients do rados reads until the end of the crypto block and throw
> > > away the data past the logical end-of-file when serving IO
> > > 3) clients can remove the old data on a write when they feel like it
> > > and then set truncate_size to the real file size
> > > 
> > > The only tricky bit there is making sure the MDS respects the
> > > truncate_size everywhere it needs to, and doesn't inadvertently force
> > > it down to the file size when we don't want that happening.
> > > 
> > 
> > That sounds like it would work. I'll probably take something like that
> > approach.
> > 
> > In principle, the MDS really should not care about the logical size of
> > the file at all. All it needs to know is how big the actual backing
> > objects are, and that will always be rounded up to the next crypto
> > block.
> > 
> > So, I'm sort of a fan of the idea of having a "private" size field that
> > only the clients with valid the crypto keys use. Everything else would
> > just report (and deal with) the rounded-up size. So, the clients would
> > always set the current inode size field to the rounded-up size, and then
> > maintain the actual logical file size in their private field.
> > 
> > This has the advantage that we don't need to tinker with the MDS
> > mechanics so much -- only extend the inode with the new field (which
> > sucks, but is a reasonably mechanical change).
> 
> Yeah, I initially really disliked the idea of adding a new "real
> logical size" member, but the more I think about it the less I like
> the alternatives, too. IIRC we go to a fair bit of trouble in the
> server to handle sizes changing while other stuff is going on so
> you'll want to look at that and see if we're okay just relying on the
> clients to set their "logical crypto size" or not.
> 

Yeah, this is tricky business, and the MDS side of things is definitely
not my area of expertise. What kind of "other stuff" do you mean in this
case?

In any case, after thinking about this a bit, what may be best is to add
new "crypto_info" field to the inode. This field would basically be
opaque to the MDS, but it'd need to record and track changes to it.

It might be good to make that field some multiple of 16 bytes (which is
the fundamental fscrypt blocksize). I like the idea of being able to
encrypt this info, and there may be other private info we'd wish to
stash in there later.

We could consider tacking this field onto the encryption context xattr,
but I'm not sure that's the best idea.

> > 
> > This also dovetails nicely with having the client "lazily" zero-pad the
> > EOF block after resetting the size.
> > 
> > > I do foresee one leak, which is how we handle probing for file size
> > > when a client with write caps disappears[2], but that's going to be
> > > the case no matter what solution we come up with unless we're willing
> > > to tolerate a two-phase commit. Which would be *awful*[1].
> > > -Greg
> > > 
> > > [1]: We're talking about needing to send a crypto-truncate to the MDS,
> > > have it withdraw all client caps, return a new "start your phase"
> > > message/state which grants read and write caps to the truncating
> > > client (which I have no idea how we'd do while a logical MDS write is
> > > happening), have the client do the RMW on the tail object and return
> > > its caps, commit the new file size, and then return the final op and
> > > re-issue caps — and handle cleaning up if the client fails at any
> > > point in that process. I guess if we did manage to jump through all
> > > those hoops, client failure recovery would be deterministic and
> > > accurate based on if the final object size was larger than or equal to
> > > the attempted new size...
> > 
> > How does this probing work today?
> 
> The MDS checks for objects in the interval from the latest size it has
> seen up to the max allowed size for the client, and sets the size to
> the last byte it finds. There's nothing tricky about it. (It may also
> probe *every* object in the file to try and resolve the mtime, but I'm
> less sure about that.)
> 

Oh right, I sort of remember seeing this at some point.

I think this can still work -- the MDS would just be determining the
rounded-up size of the file. The client would need to determine the
length of the last object though. It can't use the size of the object at
that point, so we'd need to come up with some other way to track the
size.

One idea might be to just set an xattr on the object with the latest
size of the decrypted object when we do a write? We'd have to do it in
the same WRITE compound op, but that could be done. We could even
encrypt that data too since prying eyes don't need to know.

Then the client could just vet the size in the xattr of the last object
vs. the (private) one in the inode.

Either way, I'm going to need to enlist some help for the MDS side of
this bit, I think, and whoever picks up that part should probably be
involved in the design.

> > 
> > > [2]: And now that I think about the implications of crypto some more,
> > > we can even fix this. Since crypto clients can never do rados append
> > > ops (and I don't think we can in the FS *anyway*?) we can just have
> > > all crypto-client writes include setting a "logical size" xattr on the
> > > rados object that the MDS can use when probing. We maybe should do
> > > that for everything, actually...
> > > 
> > 
> > That does sound like a good idea, and it wouldn't be too difficult to
> > implement since we're already going to be moving to doing more multi-op
> > OSD write calls anyway.
> > 
> > --
> > Jeff Layton <jlayton@redhat.com>
> > 
> 

-- 
Jeff Layton <jlayton@redhat.com>


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: fscrypt and file truncation on cephfs
  2021-03-18 19:20             ` Jeff Layton
@ 2021-03-19 18:53               ` Gregory Farnum
  0 siblings, 0 replies; 10+ messages in thread
From: Gregory Farnum @ 2021-03-19 18:53 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Patrick Donnelly, dev, open list:CEPH DISTRIBUTED...

On Thu, Mar 18, 2021 at 12:20 PM Jeff Layton <jlayton@redhat.com> wrote:
>
> On Fri, 2021-03-12 at 15:36 -0800, Gregory Farnum wrote:
> > On Fri, Mar 12, 2021 at 12:24 PM Jeff Layton <jlayton@redhat.com> wrote:
> > >
> > > On Fri, 2021-03-12 at 11:45 -0800, Gregory Farnum wrote:
> > > > On Fri, Mar 12, 2021 at 4:49 AM Jeff Layton <jlayton@redhat.com> wrote:
> > > > >
> > > > > On Fri, 2021-03-12 at 00:43 -0800, Gregory Farnum wrote:
> > > >
> > > > > > ...although in my current sleep-addled state, I'm actually not sure we
> > > > > > need to add any permanent storage to the MDS to handle this case! We
> > > > > > can probably just extend the front-end truncate op so that it can take
> > > > > > a separate "real-truncate-size" and the logical file size, can't we?
> > > > >
> > > > > That would be one nice thing about the approach of #1. Truncating the
> > > > > size downward is always done via an explicit SETATTR op (I think), so we
> > > > > could just extend that with a new field for that tells the MDS where to
> > > > > stop truncating.
> > > > >
> > > > > Note that regardless of the approach we choose, the client will still
> > > > > need to do a read/modify/write on the edge block before we can really
> > > > > treat the truncation as "done". I'm not yet sure whether that has any
> > > > > bearing on the consistency/safety of the truncation process.
> > > >
> > > > Wait what? Let's talk more about that: I can't emphasize enough how
> > > > much you really, *really* don't want to require clients to do a RMW to
> > > > RADOS that is coordinated with MDS state changes that may modify the
> > > > same objects.
> > > >
> > > > Isn't it enough for:
> > > > 1) truncate op sets file size to new logical size, and truncate_size
> > > > to end of crypto block
> > > > 2) clients do rados reads until the end of the crypto block and throw
> > > > away the data past the logical end-of-file when serving IO
> > > > 3) clients can remove the old data on a write when they feel like it
> > > > and then set truncate_size to the real file size
> > > >
> > > > The only tricky bit there is making sure the MDS respects the
> > > > truncate_size everywhere it needs to, and doesn't inadvertently force
> > > > it down to the file size when we don't want that happening.
> > > >
> > >
> > > That sounds like it would work. I'll probably take something like that
> > > approach.
> > >
> > > In principle, the MDS really should not care about the logical size of
> > > the file at all. All it needs to know is how big the actual backing
> > > objects are, and that will always be rounded up to the next crypto
> > > block.
> > >
> > > So, I'm sort of a fan of the idea of having a "private" size field that
> > > only the clients with valid the crypto keys use. Everything else would
> > > just report (and deal with) the rounded-up size. So, the clients would
> > > always set the current inode size field to the rounded-up size, and then
> > > maintain the actual logical file size in their private field.
> > >
> > > This has the advantage that we don't need to tinker with the MDS
> > > mechanics so much -- only extend the inode with the new field (which
> > > sucks, but is a reasonably mechanical change).
> >
> > Yeah, I initially really disliked the idea of adding a new "real
> > logical size" member, but the more I think about it the less I like
> > the alternatives, too. IIRC we go to a fair bit of trouble in the
> > server to handle sizes changing while other stuff is going on so
> > you'll want to look at that and see if we're okay just relying on the
> > clients to set their "logical crypto size" or not.
> >
>
> Yeah, this is tricky business, and the MDS side of things is definitely
> not my area of expertise. What kind of "other stuff" do you mean in this
> case?

It's vague in my head, but what I'm mostly concerned about is multiple
clients with write caps updating the size simultaneously, which I
think they can do? So we need to not shrink incorrectly, if one of
them extends the file to 6MB after another has already extended it to
7MB.

Or maybe I'm just remembering how we try to not go backwards on
timestamps and all the size handling is really straightforward. We'll
need to look.

> In any case, after thinking about this a bit, what may be best is to add
> new "crypto_info" field to the inode. This field would basically be
> opaque to the MDS, but it'd need to record and track changes to it.
>
> It might be good to make that field some multiple of 16 bytes (which is
> the fundamental fscrypt blocksize). I like the idea of being able to
> encrypt this info, and there may be other private info we'd wish to
> stash in there later.
>
> We could consider tacking this field onto the encryption context xattr,
> but I'm not sure that's the best idea.

xattrs are convenient because they don't require updating the core
inode wire protocol fields, some of which are unions that are already
full and not incorporated into our newer struct_v versioning (unless
that got fixed up at some point).
I'm not sure how that works out when we need to do a truncate though;
xattrs might require something unpleasant like setting the xattr
logical size and then issuing the mds setattr.

>
> > >
> > > This also dovetails nicely with having the client "lazily" zero-pad the
> > > EOF block after resetting the size.
> > >
> > > > I do foresee one leak, which is how we handle probing for file size
> > > > when a client with write caps disappears[2], but that's going to be
> > > > the case no matter what solution we come up with unless we're willing
> > > > to tolerate a two-phase commit. Which would be *awful*[1].
> > > > -Greg
> > > >
> > > > [1]: We're talking about needing to send a crypto-truncate to the MDS,
> > > > have it withdraw all client caps, return a new "start your phase"
> > > > message/state which grants read and write caps to the truncating
> > > > client (which I have no idea how we'd do while a logical MDS write is
> > > > happening), have the client do the RMW on the tail object and return
> > > > its caps, commit the new file size, and then return the final op and
> > > > re-issue caps — and handle cleaning up if the client fails at any
> > > > point in that process. I guess if we did manage to jump through all
> > > > those hoops, client failure recovery would be deterministic and
> > > > accurate based on if the final object size was larger than or equal to
> > > > the attempted new size...
> > >
> > > How does this probing work today?
> >
> > The MDS checks for objects in the interval from the latest size it has
> > seen up to the max allowed size for the client, and sets the size to
> > the last byte it finds. There's nothing tricky about it. (It may also
> > probe *every* object in the file to try and resolve the mtime, but I'm
> > less sure about that.)
> >
>
> Oh right, I sort of remember seeing this at some point.
>
> I think this can still work -- the MDS would just be determining the
> rounded-up size of the file. The client would need to determine the
> length of the last object though. It can't use the size of the object at
> that point, so we'd need to come up with some other way to track the
> size.
>
> One idea might be to just set an xattr on the object with the latest
> size of the decrypted object when we do a write? We'd have to do it in
> the same WRITE compound op, but that could be done. We could even
> encrypt that data too since prying eyes don't need to know.
>
> Then the client could just vet the size in the xattr of the last object
> vs. the (private) one in the inode.

Yeah, if we give the server the encrypted size to use and maintain the
logical size on clients that's what it'll have to be.

>
> Either way, I'm going to need to enlist some help for the MDS side of
> this bit, I think, and whoever picks up that part should probably be
> involved in the design.
>
> > >
> > > > [2]: And now that I think about the implications of crypto some more,
> > > > we can even fix this. Since crypto clients can never do rados append
> > > > ops (and I don't think we can in the FS *anyway*?) we can just have
> > > > all crypto-client writes include setting a "logical size" xattr on the
> > > > rados object that the MDS can use when probing. We maybe should do
> > > > that for everything, actually...
> > > >
> > >
> > > That does sound like a good idea, and it wouldn't be too difficult to
> > > implement since we're already going to be moving to doing more multi-op
> > > OSD write calls anyway.
> > >
> > > --
> > > Jeff Layton <jlayton@redhat.com>
> > >
> >
>
> --
> Jeff Layton <jlayton@redhat.com>
>


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2021-03-19 18:55 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-11 16:14 fscrypt and file truncation on cephfs Jeff Layton
2021-03-12  4:17 ` Patrick Donnelly
2021-03-12  8:43   ` Gregory Farnum
2021-03-12 12:48     ` Jeff Layton
2021-03-12 19:45       ` Gregory Farnum
2021-03-12 20:24         ` Jeff Layton
2021-03-12 23:36           ` Gregory Farnum
2021-03-18 19:20             ` Jeff Layton
2021-03-19 18:53               ` Gregory Farnum
2021-03-12 12:38   ` Jeff Layton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).