Subject: Re: virtiofs uuid and file handles
From: Max Reitz
To: Amir Goldstein, Vivek Goyal
Cc: Miklos Szeredi, overlayfs, linux-fsdevel
Date: Tue, 1 Jun 2021 18:08:12 +0200
Message-ID: <4a85fc2f-8ee0-9772-0347-76221a13ef95@redhat.com>

On 01.06.21 17:42, Amir Goldstein wrote:
> On Tue, Jun 1, 2021 at 5:49 PM Vivek Goyal wrote:
>> On Mon, May 31, 2021 at 09:12:59PM +0300, Amir Goldstein wrote:
>>> On Mon, May 31, 2021 at 5:11 PM Miklos Szeredi wrote:
>>>> On Sat, 29 May 2021 at 18:05, Amir Goldstein wrote:
>>>>> On Wed, Sep 23, 2020 at 2:12 PM Miklos Szeredi wrote:
>>>>>> On Wed, Sep 23, 2020 at 11:57 AM Amir Goldstein wrote:
>>>>>>> On Wed, Sep 23, 2020 at 10:44 AM Miklos Szeredi wrote:
>>>>>>>> On Wed, Sep 23, 2020 at 4:49 AM Amir Goldstein wrote:
>>>>>>>>
>>>>>>>>> I think that the proper way to implement reliable persistent file
>>>>>>>>> handles in fuse/virtiofs would be to add ENCODE/DECODE to the
>>>>>>>>> FUSE protocol and allow the server to handle this.
>>>>>>>>
>>>>>>>> Max Reitz (Cc-d) is currently looking into this.
>>>>>>>>
>>>>>>>> One proposal was to add a LOOKUP_HANDLE operation that is similar
>>>>>>>> to LOOKUP except that it takes a {variable-length handle, name} as
>>>>>>>> input and returns a variable-length handle *and* a u64 node_id
>>>>>>>> that can be used normally for all other operations.
>>>>>
>>>>> Miklos, Max,
>>>>>
>>>>> Any updates on the LOOKUP_HANDLE work?

Unfortunately not :(

>>>>>>>> The advantage of such a scheme for virtio-fs (and possibly other
>>>>>>>> fuse-based fs) would be that userspace need not keep a refcounted
>>>>>>>> object around until the kernel sends a FORGET, but can prune its
>>>>>>>> node-ID-based cache at any time.  If that happens and a request
>>>>>>>> from the client (kernel) comes in with a stale node ID, the server
>>>>>>>> will return -ESTALE and the client can ask for a new node ID with
>>>>>>>> a special lookup_handle(fh, NULL).
>>>>>>>>
>>>>>>>> Disadvantages being:
>>>>>>>>
>>>>>>>> - cost of generating a file handle on all lookups
>>>>>>>
>>>>>>> I never ran into a local fs implementation where this was expensive.
>>>>>>>
>>>>>>>> - cost of storing the file handle in the kernel icache
>>>>>>>>
>>>>>>>> I don't think either of those is problematic in the virtiofs case.
>>>>>>>> The cost of having to keep fds open while the client has them in
>>>>>>>> its cache is much higher.
>>>>>>>
>>>>>>> Sounds good.
>>>>>>> I suppose flock() does need to keep the open fd on the server.
>>>>>>
>>>>>> Open files are a separate issue and do need an active object in the
>>>>>> server.
>>>>>>
>>>>>> The issue this solves is synchronizing the "released" and "evicted"
>>>>>> states of objects between server and client.  I.e. when a file is
>>>>>> closed (and no more open files exist referencing the same object),
>>>>>> the dentry refcount goes to zero but it remains in the cache.  In
>>>>>> this state the server could really evict its own cached object, but
>>>>>> can't, because the client can gain an active reference at any time
>>>>>> via a cached path lookup.
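
For illustration, the wire format I would picture for this would be
something along the following lines.  To be clear, this is only a
sketch of my reading of the proposal; all struct and field names below
are hypothetical, none of this is part of the FUSE protocol
(include/uapi/linux/fuse.h) today:

#include <stdint.h>

/* client -> server: {variable-length handle, name} */
struct fuse_lookup_handle_in {
	uint32_t handle_len;	/* length of the handle that follows */
	uint32_t padding;
	/* followed by handle_len bytes of file handle, then a
	 * NUL-terminated name; the special lookup_handle(fh, NULL)
	 * used to refresh a stale node ID would send an empty name */
};

/* server -> client: variable-length handle *and* a u64 node ID */
struct fuse_lookup_handle_out {
	uint64_t nodeid;	/* usable in all other operations */
	uint64_t generation;	/* as in fuse_entry_out */
	uint32_t handle_len;	/* length of the (possibly new) handle */
	uint32_t padding;
	/* followed by handle_len bytes of file handle, then the
	 * attributes as in fuse_entry_out */
};

With something like this, the server can prune its node-ID-based cache
at will: a request carrying a stale node ID just fails with -ESTALE,
and the client re-resolves the handle to obtain a fresh node ID.
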
>>>>>> One other solution would be for the server to send a notification
>>>>>> (NOTIFY_EVICT) that would try to clean out the object from the
>>>>>> server cache and respond with a FORGET if successful.  But I sort
>>>>>> of like the file handle one better, since it solves multiple
>>>>>> problems.
>>>>>
>>>>> Even with LOOKUP_HANDLE, I am struggling to understand how we
>>>>> intend to invalidate all fuse dentries referring to ino X in case
>>>>> the server replies with a reused ino X with a different generation
>>>>> than the one stored in the fuse inode cache.
>>>>>
>>>>> This is an issue that I encountered when running the passthrough_hp
>>>>> test on my filesystem.  In tst_readdir_big(), for example,
>>>>> underlying files are being unlinked and new files created reusing
>>>>> the old inode numbers.
>>>>>
>>>>> This creates a situation where the server gets a lookup request for
>>>>> file B that uses the reused inode number X, while old file A is
>>>>> still in the fuse dentry cache using the older generation of real
>>>>> inode number X, which is still in the fuse inode cache.
>>>>>
>>>>> Now the server knows that the real inode has been reused, because
>>>>> the server caches the old generation value, but it cannot reply to
>>>>> the lookup request before the old fuse inode has been invalidated.
>>>>> IIUC, fuse_lowlevel_notify_inval_inode() is not enough(?).
>>>>> We would also need to change fuse_dentry_revalidate() to detect the
>>>>> case of a reused/invalidated inode.
>>>>>
>>>>> The straightforward way I can think of is to store the inode
>>>>> generation in fuse_dentry.  It won't even grow the size of the
>>>>> struct.
>>>>>
>>>>> Am I overcomplicating this?
>>>>
>>>> In this scheme the generation number is already embedded in the file
>>>> handle.  If LOOKUP_HANDLE returns a nodeid that can be found in the
>>>> icache, but which doesn't match the new file handle, then the old
>>>> inode will be marked bad and a new one allocated.
>>>>
>>>> Does that answer your worries?  Or am I missing something?
>>>
>>> It affirms my understanding of the future implementation, but does
>>> not help my implementation without protocol changes.
>>> I thought I could get away without LOOKUP_HANDLE for an underlying fs
>>> that is able to resolve by ino, but it seems that I still have an
>>> unhandled corner case, so I will need to add some kernel patch.
>>> Unless there is already a way for the server to synchronously make
>>> the inode bad (I did not find any) before replying to LOOKUP with a
>>> new generation of the same ino.
>>>
>>> Any idea about the timeline for LOOKUP_HANDLE?
>>> I may be able to pick this up myself if no one is actively working
>>> on it and there are no plans for anyone to make this happen.
>>
>> AFAIK, right now Max is not actively looking into LOOKUP_HANDLE.
>>
>> To solve the issue of the virtiofs server having too many fds open,
>> he is now planning to store the corresponding file handle in the
>> server and use that to open an fd later.

Yes, that's right.  Initially, I had hoped these things could tie into
each other, but it turns out they're largely separate issues, so for
now I'm only working on replacing our O_PATH fds with file handles.

>> But this does not help with the persistent file handle issue for the
>> fuse client.
>>
> I see.  Yes, that should work, but he'd still need to cope with reused
> inode numbers in case you allow unlinks from the host (do you?),
> because LOOKUP can find a host fs inode that does not match the file
> handle of a previously found inode with the same ino.

That's indeed an issue.

My current approach is to use the file handle (if available) as the key
for lookups, so that the generation ID is included.  Right now, we use
st_ino+st_dev+mnt_id as the key.  st_dev is just a fallback for the
mount ID, basically, so what we'd really need is inode ID + generation
ID + mount ID, and that's basically the file handle + mount ID.  So
different generation IDs will lead to lookup finding/creating a
different inode object (lo_inode in C virtiofsd, InodeData in
virtiofsd-rs), and thus returning different fuse_ino IDs to the guest.
(See also:
https://gitlab.com/mreitz/virtiofsd-rs/-/blob/handles-for-inodes-v4/src/passthrough/mod.rs#L594)
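
In code, that key change boils down to roughly the following (a
simplified sketch with made-up names, not the actual virtiofsd code;
the fallback path for filesystems without file handle support is
omitted):

#define _GNU_SOURCE
#include <fcntl.h>	/* name_to_handle_at(), MAX_HANDLE_SZ */
#include <string.h>

/* Lookup key: the file handle embeds inode number *and* generation,
 * so a reused inode number with a new generation yields a different
 * key and therefore a different inode object */
struct inode_key {
	unsigned int handle_len;
	int handle_type;
	unsigned char handle[MAX_HANDLE_SZ];
	int mnt_id;		/* handles are only unique per mount */
};

static int get_inode_key(int dirfd, const char *name,
			 struct inode_key *key)
{
	union {
		struct file_handle fh;
		char buf[sizeof(struct file_handle) + MAX_HANDLE_SZ];
	} u;

	u.fh.handle_bytes = MAX_HANDLE_SZ;
	if (name_to_handle_at(dirfd, name, &u.fh, &key->mnt_id, 0) < 0)
		return -1;	/* fall back to st_ino+st_dev+mnt_id */

	/* zero everything so keys can be compared with memcmp() */
	memset(key->handle, 0, sizeof(key->handle));
	memcpy(key->handle, u.fh.f_handle, u.fh.handle_bytes);
	key->handle_len = u.fh.handle_bytes;
	key->handle_type = u.fh.handle_type;
	return 0;
}

Two generations of the same st_ino then simply compare unequal as
keys, so lookup can never return the stale inode object for the new
file.
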

> Quoting Miklos' response above:
>>>> If LOOKUP_HANDLE returns a nodeid that can be found in the
>>>> icache, but which doesn't match the new file handle, then the old
>>>> inode will be marked bad and a new one allocated.
>
> This statement, with minor adjustments, is also true for LOOKUP:
>
> "If LOOKUP returns a nodeid that can be found in the icache, but whose
> i_generation doesn't match the generation returned in outarg, then the
> old inode should be marked bad and a new one allocated."
>
>> BTW, one concern with file handles coming from the guest kernel was
>> how to trust those handles.  The guest can create anything and use
>> the file server to open files on the same filesystem (even ones not
>> shared with the guest).
>>
>> I am assuming the same concern exists for non-virtiofs use cases: a
>> regular fuse client may send a file handle while the file server is
>> running with CAP_DAC_READ_SEARCH.  How will the server make sure that
>> the client cannot access files that are not exported through the
>> shared directory but are present on the same filesystem?
>>
> That is a concern.
> It's the same concern as with NFS clients that can guess file handles.
>
> The way to address this concern with NFS is the export option
> subtree_check, but that uses non-unique file handles for an inode,
> which include a parent handle, so that's probably not a good fit for
> LOOKUP_HANDLE.

There was a mail thread on the topic of securing file handles in March:
https://listman.redhat.com/archives/virtio-fs/2021-March/msg00022.html

The problem I see with the subtree_check option is that file handles
are invalidated when files are moved to a different parent.

So far the consensus was to append a MAC to file handles, generated
with some secret key, so that only file handles that have been handed
out before can later be opened.  Ideally, that MAC would be managed by
the kernel, so that we could allow virtiofsd to use such MAC-ed file
handles even when it doesn't have CAP_DAC_READ_SEARCH.
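
In userspace terms, the scheme would look something like the sketch
below (using OpenSSL's HMAC() purely as an example; where the key lives
and who applies the MAC is exactly the open question, so take the
virtiofsd-held secret_key here as a placeholder assumption):

#include <openssl/crypto.h>	/* CRYPTO_memcmp() */
#include <openssl/evp.h>
#include <openssl/hmac.h>

#define FH_MAC_LEN 32	/* SHA-256 output size */

/* placeholder: in the ideal design, the kernel would hold this key */
static unsigned char secret_key[32];

/* compute the MAC that gets appended to a handle before it is handed
 * out (to the guest, or stored in place of an O_PATH fd) */
static void sign_handle(const unsigned char *fh, size_t fh_len,
			unsigned char mac[FH_MAC_LEN])
{
	unsigned int mac_len = FH_MAC_LEN;

	HMAC(EVP_sha256(), secret_key, sizeof(secret_key),
	     fh, fh_len, mac, &mac_len);
}

/* only handles carrying a valid MAC are ever passed on to
 * open_by_handle_at(); anything the guest made up is rejected */
static int verify_handle(const unsigned char *fh, size_t fh_len,
			 const unsigned char mac[FH_MAC_LEN])
{
	unsigned char expected[FH_MAC_LEN];

	sign_handle(fh, fh_len, expected);
	/* constant-time comparison, so the MAC cannot be guessed
	 * byte by byte via timing */
	return CRYPTO_memcmp(expected, mac, FH_MAC_LEN) == 0;
}

Max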