From: Xiubo Li <xiubli@redhat.com>
To: Jeff Layton <jlayton@kernel.org>
Cc: idryomov@gmail.com, ukernel@gmail.com, pdonnell@redhat.com,
	ceph-devel@vger.kernel.org
Subject: Re: [PATCH RFC 2/2] ceph: truncate the file contents when needed when file scrypted
Date: Thu, 16 Sep 2021 18:02:46 +0800
Message-ID: <3b2878f8-9655-b4a0-c6bd-3cf61eaffb13@redhat.com>
In-Reply-To: <27b119711a065a2441337298fada3d941c30932a.camel@kernel.org>


On 9/14/21 10:24 PM, Jeff Layton wrote:
> On Tue, 2021-09-14 at 13:40 +0800, Xiubo Li wrote:
>> On 9/14/21 3:34 AM, Jeff Layton wrote:
>>
>> [...]
>>
>>>>>>> I'll have to think about whether that's still racy. Part of the problem
>>>>>>> is that once the client doesn't have caps, it doesn't have a way to
>>>>>>> ensure that fscrypt_file (whatever it holds) doesn't change while it's
>>>>>>> doing that zeroing.
>>>>>> My understanding is that if clientA wants to read a file, it will open
>>>>>> it in RD mode, get the fscrypt_file meta and the "Fr" caps, and then it
>>>>>> can safely read the file contents and do the zeroing for that block.
>>>>> Ok, so in this case, the client is just zeroing out the appropriate
>>>>> region in the resulting read data, and not on the OSD. That should be
>>>>> ok.
>>>>>
>>>>>> The reading client then shouldn't need to worry about the fscrypt_file
>>>>>> being changed by others while it still holds the Fr caps: if clientB
>>>>>> wants to update the fscrypt_file it must acquire the Fw caps first, and
>>>>>> before that happens the Fr caps must be revoked from clientA, at which
>>>>>> point clientA's read pagecache will also be invalidated.
>>>>>>
>>>>> Are you certain that holding Fw caps is enough to ensure that no other
>>>>> client holds Fr caps?
>>>> I spent hours going through the mds Locker-related code over the weekend.
>>>>
>>>> From the mds/Locker.cc code: for the mds filelock, in the LOCK_MIX
>>>> state and some of the interim transition states to LOCK_MIX, different
>>>> clients are allowed to hold either Fw or Fr caps, but Fb/Fc will be
>>>> disabled. Looking further, the mds filelock can switch to the LOCK_MIX
>>>> state in some cases when one client wants Fw and another client wants
>>>> either Fr or Fw.
>>>>
>>>> In this case I think Linux advisory or mandatory locks are necessary
>>>> to keep the file contents consistent. In concurrent read/write or
>>>> write/write cases across multiple processes, the consistency of the
>>>> file contents isn't guaranteed without advisory/mandatory locks
>>>> anyway, so isn't the logic the same here?
>>>>
>>>> If so, couldn't we just assume that Fw vs Fw and Fr vs Fw caps are
>>>> exclusive in correct use cases? For example, just after the mds
>>>> filelock state switches to LOCK_MIX: if clientA gets the advisory
>>>> file lock and the Fw caps, then even if another clientB is
>>>> successfully issued the Fr caps, clientB won't do any reads, because
>>>> it will still be stuck waiting for the advisory file lock.
>>>>
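
To illustrate what I mean by relying on the advisory locks above, here
is a minimal userspace sketch -- plain POSIX locking, nothing
cephfs-specific, untested and with error handling trimmed:

    #include <sys/types.h>
    #include <fcntl.h>
    #include <unistd.h>

    /*
     * Serialize writers of the same range with an advisory lock,
     * regardless of which F caps the MDS has issued to each client.
     */
    static ssize_t write_locked(int fd, const char *buf, size_t len,
                                off_t off)
    {
            struct flock fl = {
                    .l_type   = F_WRLCK,    /* exclusive for writing */
                    .l_whence = SEEK_SET,
                    .l_start  = off,
                    .l_len    = len,
            };
            ssize_t ret;

            /*
             * Blocks until any conflicting advisory lock is dropped,
             * which is what would keep clientB's read stuck even
             * though it has been issued the Fr caps.
             */
            if (fcntl(fd, F_SETLKW, &fl) < 0)
                    return -1;

            ret = pwrite(fd, buf, len, off);

            fl.l_type = F_UNLCK;
            fcntl(fd, F_SETLK, &fl);
            return ret;
    }

A reader would take the same lock with F_RDLCK before reading, so the
ordering is enforced by the lock rather than by the caps.
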
>>> I'm not sure I like that idea. Basically, that would change the meaning
>>> of what the Frw caps represent, in a way that is not really consistent
>>> with how they have been used before.
>>>
>>> We could gate that new behavior on the new feature flags, but it sounds
>>> pretty tough.
>>>
>>> I think we have a couple of options:
>>>
>>> 1) we could just make the clients request and wait on Fx caps when they
>>> do a truncate. They might stall for a bit if there is contention, but it
>>> would ensure consistency and the client could be completely in charge of
>>> the truncate. [a]
>> Yeah, for my deferred RMW approach we need to hold the Fx caps every
>> time we write or truncate a file, and the Fs caps every time we read.
>>
>> Currently almost all of the read/write code ignores these caps, since
>> reads and writes don't need them in most cases.
>>
> Note that we already cache truncate operations when we have Fx.

Yeah, only when we have Fx and attr.ia_size > isize will it cache the
truncate operation.

I'm a little curious, though: why couldn't we cache truncate operations
when attr.ia_size >= isize?
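
For reference, the relevant logic in __ceph_setattr() currently looks
roughly like this (paraphrased from my reading of fs/ceph/inode.c, so
it may not match the tree line for line):

    if (ia_valid & ATTR_SIZE) {
            if ((issued & CEPH_CAP_FILE_EXCL) &&
                attr->ia_size > inode->i_size) {
                    /*
                     * Fx is held and the file is growing: update the
                     * size locally and dirty the caps, no MDS round
                     * trip needed.
                     */
                    i_size_write(inode, attr->ia_size);
                    inode->i_blocks = calc_inode_blocks(attr->ia_size);
                    ci->i_reported_size = attr->ia_size;
                    dirtied |= CEPH_CAP_FILE_EXCL;
            } else if ((issued & CEPH_CAP_FILE_SHARED) == 0 ||
                       attr->ia_size != inode->i_size) {
                    /*
                     * Otherwise fall back to a synchronous SETATTR to
                     * the MDS, releasing the caps that cover the file
                     * size.
                     */
                    mask |= CEPH_SETATTR_SIZE;
                    release |= CEPH_CAP_FILE_SHARED |
                               CEPH_CAP_FILE_RD | CEPH_CAP_FILE_WR;
            }
    }

If I am reading it right, the attr.ia_size == isize case already avoids
the MDS round trip whenever Fs is held, since neither branch fires, so
only true extensions would gain anything from an Fx fast path.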


>   See
> __ceph_setattr. Do we even need to change the read path here, or is the
> existing code just wrong?
>
> This is info I've been trying to get a handle on since I started working
> on cephfs. The rules for FILE caps are still extremely unclear.

I am still checking this logic on the MDS side; it's still not very
clear to me.


[...]
>>
>>> 2) we could rev the protocol, and have the client send along the last
>>> block to be written along with the SETATTR request. Maybe we even
>>> consider just adding a new TRUNCATE call independent of SETATTR. The MDS
>>> would remain in complete control of it at that point.
>> This approach seems much better, since the last block will always be
>> less than or equal to 4K (CEPH_FSCRYPT_BLOCK_SIZE) in size, and
>> truncates should be rare in normal use cases (?). Carrying an extra
>> ~4K of data in the SETATTR should be okay when truncating the file.
>>
>> So when truncating a file, the kclient should first read that last
>> block (the read half of the RMW), then truncate and re-encrypt it
>> locally, and send it to the MDS together with the SETATTR request.
>> The MDS will then update that block just before truncating the file.
>>
>> This approach would also keep the fscrypt logic opaque to the MDS.
>>
>>
> Note that you'll need to guard against races on the RMW. For instance,
> another client could issue a write to that last block just after we do
> the read for the RMW.
>
> If you do this, then you'll probably need to send along the object
> version that you got from the read and have the MDS validate that before
> it does the truncate and writes out the updated crypto block.
>
> If something changed in the interim, the MDS can just return -EAGAIN or
> whatever to the client and it can re-do the whole thing. It's a mess,
> but it should be consistent.
>
> I think we probably want a new call for this too instead of overloading
> SETATTR (which is already extremely messy).

Okay, I will check this more carefully.
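
Roughly what I have in mind is the following (just a sketch; the
helpers and the new request type are all made up for illustration,
none of this exists yet):

    /*
     * Truncate an encrypted file with an RMW of the last crypto
     * block, guarded by the object version as you suggested.
     */
    static int ceph_fscrypt_truncate(struct inode *inode, loff_t size)
    {
            u8 block[CEPH_FSCRYPT_BLOCK_SIZE];
            u64 objver;
            int ret;

            do {
                    /*
                     * Read the last crypto block and remember the
                     * object version we saw (hypothetical helper).
                     */
                    ret = read_last_block(inode, size, block, &objver);
                    if (ret)
                            return ret;

                    /*
                     * Zero the tail of the block past 'size' and
                     * re-encrypt it locally (hypothetical helper).
                     */
                    zero_and_reencrypt_tail(inode, block, size);

                    /*
                     * New truncate request carrying block + objver;
                     * the MDS validates objver before writing the
                     * block out and doing the truncate (hypothetical
                     * new call, not overloading SETATTR).
                     */
                    ret = send_truncate_req(inode, size, block, objver);
            } while (ret == -EAGAIN);   /* raced with a writer, redo */

            return ret;
    }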


>>> The other ideas I've considered seem more complex and don't offer any
>>> significant advantages that I can see.
>>>
>>> [a]: Side question: why does buffering a truncate require Fx and not Fb?
>>> How do Fx and Fb interact?
>> For my deferred RMW approach we need the Fx caps every time we write
>> the file. Fw is the 'need' cap for a write, while Fb is the 'want'
>> cap. If the Fb caps are not allowed or issued by the MDS, the client
>> will write the data through to the OSDs, after which the Fxw caps can
>> be safely released. If the client does get the Fb caps, it must also
>> hold the Fx caps until the buffer has been written back.
>>
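
To make the 'need'/'want' distinction concrete: today the buffered
write path requests caps roughly like this (from my reading of
ceph_write_iter() in fs/ceph/file.c; the exact signature may differ
between kernel versions):

    /* Fw is the 'need' mask, Fb is only the 'want' mask. */
    err = ceph_get_caps(file, CEPH_CAP_FILE_WR, CEPH_CAP_FILE_BUFFER,
                        pos + count, &got);

Under the deferred RMW approach the 'need' mask would grow to
CEPH_CAP_FILE_WR | CEPH_CAP_FILE_EXCL, which is what makes it
expensive.
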
> The problem there is that it will effectively serialize all writers of
> a file -- even ones that are writing to completely different backend
> objects.
>
> That will almost certainly regress performance. I think we want to try
> to avoid that. I'd rather that truncate be extremely slow and expensive
> than slow down writes.

Agree.

Thanks.


