Re: fscrypt and file truncation on cephfs

From: Jeff Layton <jlayton@redhat.com>
To: Gregory Farnum <gfarnum@redhat.com>,
	Patrick Donnelly <pdonnell@redhat.com>
Cc: dev <dev@ceph.io>,
	"open list:CEPH DISTRIBUTED..." <ceph-devel@vger.kernel.org>
Subject: Re: fscrypt and file truncation on cephfs
Date: Fri, 12 Mar 2021 07:48:47 -0500
Message-ID: <c2dbc391fead002c84c07860812b689a01d2b667.camel@redhat.com>
In-Reply-To: <CAJ4mKGaaSn67XRkBemJC0XAWBTyWN_VjgEJfT8EB1cLaokAYvQ@mail.gmail.com>

On Fri, 2021-03-12 at 00:43 -0800, Gregory Farnum wrote:
> On Thu, Mar 11, 2021 at 8:18 PM Patrick Donnelly <pdonnell@redhat.com> wrote:
> > 
> > On Thu, Mar 11, 2021 at 8:15 AM Jeff Layton <jlayton@redhat.com> wrote:
> > > 
> > > tl;dr version: in cephfs, the MDS handles truncating object data when
> > > inodes are truncated. This is problematic with fscrypt.
> > > 
> > > Longer version:
> > > 
> > > I've been working on a patchset to add fscrypt support to kcephfs, and
> > > have hit a problem with the way that truncation is handled. The main
> > > issue is that fscrypt uses block-based ciphers, so we must ensure that
> > > we read and write complete crypto blocks on the OSDs.
> > > 
> > > I'm currently using 4k crypto blocks, but we may want to allow this to
> > > be tunable eventually (though it will need to be smaller than and align
> > > with the OSD object size). For simplicity's sake, I'm planning to
> > > disallow custom layouts on encrypted inodes. We could consider adding
> > > that later (but it doesn't sound likely to be worthwhile).
> > > 
> > > Normally, when a file is truncated (usually via a SETATTR MDS call), the
> > > MDS handles truncating or deleting objects on the OSDs. This is done
> > > somewhat lazily in that the MDS replies to the client before this
> > > process is complete (AFAICT).
> > 
> > So I've done some more research on this and it's not that simplistic.
> > Broadly, a truncate causes the following to happen:
> > 
> > - Revoke all write caps (but not Fcb) from clients.
> > 
> > - Journal the truncate operation.
> > 
> > - Respond with unsafe reply.
> > 
> > - After setattr is journalled, regrant Fs with new file size,
> > truncate_seq, truncate_size
> > 
> > - issue trunc cap update with new file size, truncate_seq,
> > truncate_size (looks redundant with prior step)
> > 
> > - actually start truncating objects above file size; concurrently
> > grant all Fwb... caps wanted by client
> > 
> > - reply safe
> > 
> > From what I can tell, the clients use the truncate_seq/truncate_size
> > to avoid writing to data that the MDS plans to truncate. I haven't
> > really dug into how that works. Maybe someone more familiar with that
> > code can chime in.
> > 
> > So the MDS seems to truncate/delete objects lazily in the background
> > but it does so safely and consistently.
> 
> Right; it's lazy in that it's not done immediately in a blocking
> manner, but it's absolutely safe. Truncate seq and size are also
> fields you can send to the OSD on read or write operations, and the
> client includes them on every op. It just has to do a (reasonably)
> simple conversion from the total truncate size the MDS gives it to
> what that means for the object being accessed (based on the striping
> pattern and object number).
> 
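
For illustration, here is a minimal sketch of the conversion described
above, from the file-wide truncate size to the truncate size of one
object. It assumes the simple layout only (stripe_count of 1 and
stripe_unit equal to the object size), which matches the plan to disallow
custom layouts on encrypted inodes. The function name and the 4 MiB
default are assumptions for the example, not taken from the client code.

```python
def object_truncate_size(trunc_size: int, objno: int,
                         object_size: int = 4 * 1024 * 1024) -> int:
    """Return how many bytes of object `objno` survive a truncate.

    Assumes stripe_count == 1 and stripe_unit == object_size, so object
    `objno` holds file bytes [objno * object_size, (objno + 1) * object_size).
    """
    start = objno * object_size
    # Clamp to [0, object_size]: objects entirely below the truncate point
    # keep everything, objects entirely above it keep nothing.
    return max(0, min(object_size, trunc_size - start))
```

With a striped layout the arithmetic also has to account for the stripe
unit and stripe count, which this sketch deliberately leaves out.
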
> I'll try and think a bit more on how to handle the special extra size
> for encryption.
> 
> ...although in my current sleep-addled state, I'm actually not sure we
> need to add any permanent storage to the MDS to handle this case! We
> can probably just extend the front-end truncate op so that it can take
> a separate "real-truncate-size" and the logical file size, can't we?

That would be one nice thing about the approach of #1. Truncating the
size downward is always done via an explicit SETATTR op (I think), so we
could just extend that with a new field that tells the MDS where to
stop truncating.
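
As a rough sketch of that idea: the request would carry the logical file
size plus a second value, rounded up to the next crypto block boundary,
that marks where object truncation should stop. The names
`TruncateRequest`, `truncate_stop` and `make_truncate_request` are
hypothetical and do not correspond to any existing MDS message field; the
4k block size is the one currently in use as described above.

```python
from dataclasses import dataclass

CRYPTO_BLOCK_SIZE = 4096  # 4k crypto blocks, per the discussion above


@dataclass(frozen=True)
class TruncateRequest:
    size: int           # logical file size seen by applications
    truncate_stop: int  # hypothetical field: objects are trimmed only above this


def make_truncate_request(new_size: int,
                          block_size: int = CRYPTO_BLOCK_SIZE) -> TruncateRequest:
    # Round up so the crypto block containing the new EOF stays whole.
    stop = -(-new_size // block_size) * block_size
    return TruncateRequest(size=new_size, truncate_stop=stop)
```

For example, truncating to 5000 bytes would give a logical size of 5000
and a stop point of 8192, so the second 4k block remains intact on the
OSD.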

Note that regardless of the approach we choose, the client will still
need to do a read/modify/write on the edge block before we can really
treat the truncation as "done". I'm not yet sure whether that has any
bearing on the consistency/safety of the truncation process.
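
To make the edge-block step concrete, here is a small self-contained
sketch. It assumes the fix-up is to zero the plaintext past the new size
and re-encrypt the block, and it uses a toy reversible transform in place
of a real cipher purely so the example runs; `toy_encrypt`, `toy_decrypt`
and `rmw_edge_block` are illustrative names, and the bytearray stands in
for object data on the OSDs.

```python
BLOCK = 4096  # crypto block size


def toy_encrypt(block: bytes, key: int) -> bytes:
    # Stand-in for a block cipher (NOT secure): needs the whole block.
    assert len(block) == BLOCK
    return bytes(b ^ key for b in reversed(block))


def toy_decrypt(block: bytes, key: int) -> bytes:
    # The toy transform is its own inverse.
    assert len(block) == BLOCK
    return bytes(b ^ key for b in reversed(block))


def rmw_edge_block(store: bytearray, new_size: int, key: int) -> None:
    """Read, modify and write back the crypto block that contains new_size."""
    tail = new_size % BLOCK
    if tail == 0:
        return  # new size is block-aligned, no partial block to fix up
    start = new_size - tail
    # Read and decrypt the complete edge block.
    plain = bytearray(toy_decrypt(bytes(store[start:start + BLOCK]), key))
    # Zero everything past the new end of file.
    plain[tail:] = bytes(BLOCK - tail)
    # Re-encrypt and write the complete block back.
    store[start:start + BLOCK] = toy_encrypt(bytes(plain), key)
```

Note that the write-back covers the full block even though the logical
file ends partway through it, which is why the object data cannot simply
be cut at the logical size.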

-- 
Jeff Layton <jlayton@redhat.com>