ceph-devel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jeff Layton <jlayton@redhat.com>
To: Gregory Farnum <gfarnum@redhat.com>,
	Patrick Donnelly <pdonnell@redhat.com>
Cc: dev <dev@ceph.io>,
	"open list:CEPH DISTRIBUTED..." <ceph-devel@vger.kernel.org>
Subject: Re: fscrypt and file truncation on cephfs
Date: Fri, 12 Mar 2021 07:48:47 -0500	[thread overview]
Message-ID: <c2dbc391fead002c84c07860812b689a01d2b667.camel@redhat.com> (raw)
In-Reply-To: <CAJ4mKGaaSn67XRkBemJC0XAWBTyWN_VjgEJfT8EB1cLaokAYvQ@mail.gmail.com>

On Fri, 2021-03-12 at 00:43 -0800, Gregory Farnum wrote:
> On Thu, Mar 11, 2021 at 8:18 PM Patrick Donnelly <pdonnell@redhat.com> wrote:
> > 
> > On Thu, Mar 11, 2021 at 8:15 AM Jeff Layton <jlayton@redhat.com> wrote:
> > > 
> > > tl;dr version: in cephfs, the MDS handles truncating object data when
> > > inodes are truncated. This is problematic with fscrypt.
> > > 
> > > Longer version:
> > > 
> > > I've been working on a patchset to add fscrypt support to kcephfs, and
> > > have hit a problem with the way that truncation is handled. The main
> > > issue is that fscrypt uses block-based ciphers, so we must ensure that
> > > we read and write complete crypto blocks on the OSDs.
> > > 
> > > I'm currently using 4k crypto blocks, but we may want to allow this to
> > > be tunable eventually (though it will need to be smaller than and align
> > > with the OSD object size). For simplicity's sake, I'm planning to
> > > disallow custom layouts on encrypted inodes. We could consider adding
> > > that later (but it doesn't sound likely to be worthwhile).
> > > 
> > > Normally, when a file is truncated (usually via a SETATTR MDS call), the
> > > MDS handles truncating or deleting objects on the OSDs. This is done
> > > somewhat lazily in that the MDS replies to the client before this
> > > process is complete (AFAICT).
> > 
> > So I've done some more research on this and it's not that simplistic.
> > Broadly, a truncate causes the following to happen:
> > 
> > - Revoke all write caps (but not Fcb) from clients.
> > 
> > - Journal the truncate operation.
> > 
> > - Respond with unsafe reply.
> > 
> > - After setattr is journalled, regrant Fs with new file size,
> > truncate_seq, truncate_size
> > 
> > - issue trunc cap update with new file size, truncate_seq,
> > truncate_size (looks redundant with prior step)
> > 
> > - actually start truncating objects above file size; concurrently
> > grant all wanted Fwb... caps wanted by client
> > 
> > - reply safe
> > 
> > From what I can tell, the clients use the truncate_seq/truncate_size
> > to avoid writing to data what the MDS plans to truncate. I haven't
> > really dug into how that works. Maybe someone more familiar with that
> > code can chime in.
> > 
> > So the MDS seems to truncate/delete objects lazily in the background
> > but it does so safely and consistently.
> 
> Right; ti's lazy in that it's not done immediately in a blocking
> manner, but it's absolutely safe. Truncate seq and size are also
> fields you can send to the OSD on read or write operations, and the
> client includes them on every op. It just has to do a (reasonably)
> simple conversion from the total truncate size the MDS gives it to
> what that means for the object being accessed (based on the striping
> pattern and object number).
> 
> I'll try and think a bit more on how to handle the special extra size
> for encryption.
> 
> ...although in my current sleep-addled state, I'm actually not sure we
> need to add any permanent storage to the MDS to handle this case! We
> can probably just extend the front-end truncate op so that it can take
> a separate "real-truncate-size" and the logical file size, can't we?

That would be one nice thing about the approach of #1. Truncating the
size downward is always done via an explicit SETATTR op (I think), so we
could just extend that with a new field for that tells the MDS where to
stop truncating.

Note that regardless of the approach we choose, the client will still
need to do a read/modify/write on the edge block before we can really
treat the truncation as "done". I'm not yet sure whether that has any
bearing on the consistency/safety of the truncation process.

-- 
Jeff Layton <jlayton@redhat.com>


  reply	other threads:[~2021-03-12 12:49 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-03-11 16:14 fscrypt and file truncation on cephfs Jeff Layton
2021-03-12  4:17 ` Patrick Donnelly
2021-03-12  8:43   ` Gregory Farnum
2021-03-12 12:48     ` Jeff Layton [this message]
2021-03-12 19:45       ` Gregory Farnum
2021-03-12 20:24         ` Jeff Layton
2021-03-12 23:36           ` Gregory Farnum
2021-03-18 19:20             ` Jeff Layton
2021-03-19 18:53               ` Gregory Farnum
2021-03-12 12:38   ` Jeff Layton

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=c2dbc391fead002c84c07860812b689a01d2b667.camel@redhat.com \
    --to=jlayton@redhat.com \
    --cc=ceph-devel@vger.kernel.org \
    --cc=dev@ceph.io \
    --cc=gfarnum@redhat.com \
    --cc=pdonnell@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).