ceph-devel.vger.kernel.org archive mirror
* fscrypt and file truncation on cephfs
@ 2021-03-11 16:14 Jeff Layton
From: Jeff Layton @ 2021-03-11 16:14 UTC
  To: dev, open list:CEPH DISTRIBUTED...

tl;dr version: in cephfs, the MDS handles truncating object data when
inodes are truncated. This is problematic with fscrypt.

Longer version:

I've been working on a patchset to add fscrypt support to kcephfs, and
have hit a problem with the way that truncation is handled. The main
issue is that fscrypt uses block-based ciphers, so we must ensure that
we read and write complete crypto blocks on the OSDs.

I'm currently using 4k crypto blocks, but we may want to allow this to
be tunable eventually (though it will need to be smaller than and align
with the OSD object size). For simplicity's sake, I'm planning to
disallow custom layouts on encrypted inodes. We could consider adding
that later (but it doesn't sound likely to be worthwhile).
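As a rough illustration of the size constraint and of what "complete
crypto blocks" means for an I/O range, here is a minimal Python sketch.
The 4k block size comes from the text above; the 4 MiB object size and
both helper names are assumptions made up for this example, not
anything from the patchset.

```python
CRYPTO_BLOCK_SIZE = 4096          # 4k crypto blocks, as described above
OBJECT_SIZE = 4 * 1024 * 1024     # assumed object size, for illustration only


def valid_crypto_block_size(block_size: int, object_size: int) -> bool:
    """True if block_size is smaller than the object and tiles it exactly."""
    return 0 < block_size < object_size and object_size % block_size == 0


def block_span(offset: int, length: int, block_size: int = CRYPTO_BLOCK_SIZE):
    """Return (start, end) of the whole-block range covering [offset, offset + length)."""
    start = (offset // block_size) * block_size
    end = -(-(offset + length) // block_size) * block_size  # round up
    return start, end
```

For example, a 100-byte write at offset 5000 touches only the second
crypto block, so the client would have to read, modify and rewrite the
whole range 4096..8192.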

Normally, when a file is truncated (usually via a SETATTR MDS call), the
MDS handles truncating or deleting objects on the OSDs. This is done
somewhat lazily in that the MDS replies to the client before this
process is complete (AFAICT).

Once we add fscrypt support, the MDS handling truncation becomes a
problem, in that we need to be able to deal with complete crypto blocks.
Letting the MDS truncate away part of a block will leave us with a block
that can't be decrypted.
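To make the failure case concrete, the following Python sketch works
out which crypto block a truncate to an arbitrary size would cut
through. The function name is hypothetical and the 4k block size is the
one mentioned earlier; this only does the arithmetic and involves no
actual encryption.

```python
CRYPTO_BLOCK_SIZE = 4096  # 4k crypto blocks, as described above


def stranded_block(new_size: int, block_size: int = CRYPTO_BLOCK_SIZE):
    """Describe the block cut by truncating the object data to new_size.

    Returns (block_start, ciphertext_bytes_kept), or None when new_size
    falls on a block boundary and no block is left partial.
    """
    kept = new_size % block_size
    if kept == 0:
        return None
    return new_size - kept, kept
```

A truncate to 10000 bytes, for instance, would leave only the first
1808 bytes of ciphertext in the block starting at offset 8192, which is
not enough to decrypt that block.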

Whatever approach we take, the client will ultimately have to zero-pad,
encrypt and write the blocks at the edges, since the MDS doesn't have
access to the keys.
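The client-side step could look roughly like the Python sketch below.
It assumes the client has already read and decrypted the old edge
block. `encrypt_block` is a hypothetical stand-in for whatever the
client's fscrypt layer provides, and `rewrite_edge_block` is likewise a
made-up name, so treat this as a sketch rather than the actual
implementation.

```python
from typing import Callable, Optional, Tuple

CRYPTO_BLOCK_SIZE = 4096  # 4k crypto blocks, as described above


def rewrite_edge_block(
    old_block_plaintext: bytes,
    new_size: int,
    encrypt_block: Callable[[bytes], bytes],  # hypothetical; holds the key
    block_size: int = CRYPTO_BLOCK_SIZE,
) -> Optional[Tuple[int, bytes]]:
    """Return (offset, ciphertext) to write back for the block at the new EOF.

    Returns None when new_size is block-aligned, since no block needs
    rewriting in that case.
    """
    kept = new_size % block_size
    if kept == 0:
        return None
    # Keep the bytes below the new EOF and zero-fill the rest of the block.
    plaintext = old_block_plaintext[:kept] + bytes(block_size - kept)
    return new_size - kept, encrypt_block(plaintext)
```

The point of passing the cipher in as a callable is only to show that
this step needs the key, which is why it cannot be done on the MDS.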

I've identified several possible approaches:

1/ We could teach the MDS the crypto blocksize, and ensure that it
doesn't truncate away partial blocks. The client could tell the MDS what
blocksize it's using on the inode and the MDS could ensure that
truncates align to the blocks. The client will still need to write
partial blocks at the edges of holes or at the EOF, and it probably
shouldn't do that until it gets the unsafe reply from the MDS. We
could handle this by adding a new truncate op or extending the existing
one.

2/ We could cede the object truncate/delete to the client altogether.
The MDS is aware when an inode is encrypted so it could just not do it
for those inodes. We also already handle hole punching completely on the
client (though the size doesn't change there). Truncate could be a
special case of that. The client would probably issue the truncate and
then be responsible for deleting/rewriting blocks once the reply comes
in. We'd have to consider how to handle delinquent clients that don't
clean up correctly.

3/ We could maintain a separate field in the inode for the real
inode->i_size that crypto-enabled clients would use. The client would
always communicate a size to the MDS that is rounded up to the end of
the last crypto block, such that the "true" size of the inode on disk
would always be represented in the rstats. Only crypto-enabled clients
would care about the "realsize" field. In fact, this value could
_itself_ be encrypted too, so that the i_size of the file is masked from
clients that don't have keys.
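The size bookkeeping in option 3 can be sketched in a few lines of
Python. The class and field names here are hypothetical (the text above
only calls the new field "realsize"), and the sketch ignores the
possibility of encrypting that field.

```python
from dataclasses import dataclass

CRYPTO_BLOCK_SIZE = 4096  # 4k crypto blocks, as described above


@dataclass
class EncryptedInodeSizes:
    reported_size: int  # what the MDS sees: rounded up to a block boundary
    real_size: int      # hypothetical "realsize" field, used by crypto-enabled clients


def sizes_for(real_size: int, block_size: int = CRYPTO_BLOCK_SIZE) -> EncryptedInodeSizes:
    """Compute the size pair a crypto-enabled client would send for real_size."""
    reported = -(-real_size // block_size) * block_size  # round up
    return EncryptedInodeSizes(reported_size=reported, real_size=real_size)
```

With this scheme a 10000-byte file would be reported to the MDS as
12288 bytes, so any truncation the MDS performs always lands on a
crypto block boundary.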

Ceph's truncation machinery is pretty complex in general, so I could
have missed other approaches or something that makes these ideas
impossible. I'm leaning toward #3 here since I think it has the most
benefit and keeps the MDS out of the whole business.

What should we do here?
-- 
Jeff Layton <jlayton@redhat.com>



Thread overview: 10+ messages
2021-03-11 16:14 fscrypt and file truncation on cephfs Jeff Layton
2021-03-12  4:17 ` Patrick Donnelly
2021-03-12  8:43   ` Gregory Farnum
2021-03-12 12:48     ` Jeff Layton
2021-03-12 19:45       ` Gregory Farnum
2021-03-12 20:24         ` Jeff Layton
2021-03-12 23:36           ` Gregory Farnum
2021-03-18 19:20             ` Jeff Layton
2021-03-19 18:53               ` Gregory Farnum
2021-03-12 12:38   ` Jeff Layton
