From: Amir Goldstein
Date: Wed, 14 Sep 2022 10:27:48 +0300
Subject: Re: thoughts about fanotify and HSM
To: Jan Kara, Miklos Szeredi
Cc: "Plaster, Robert", David Howells, linux-fsdevel
X-Mailing-List: linux-fsdevel@vger.kernel.org

On Mon, Sep 12, 2022 at 7:38 PM Amir Goldstein wrote:
>
> On Mon, Sep 12, 2022 at 3:57 PM Jan Kara wrote:
> >
> > Hi Amir!
> >
> > On Sun 11-09-22 21:12:06, Amir Goldstein wrote:
> > > I wanted to consult with you about preliminary design thoughts
> > > for implementing a hierarchical storage manager (HSM)
> > > with fanotify.

I feel that the discussion is losing focus, so let me try to refocus
and list pros and cons for different options for HSM API...

> > > I have been in contact with some developers in the past
> > > who were interested in using fanotify to implement HSM
> > > (to replace old DMAPI implementation).
> >
> > Ah, DMAPI. Shiver. Bad memories of carrying that hacky code in SUSE kernels
> > ;)

For the record, DMAPI is still partly supported on some proprietary
filesystems, but even if a full implementation existed, this old API,
which was used mostly for tape devices, is not a good fit for
modern-day cloud storage use cases.

> > So how serious are these guys about HSM and investing into it?
>
> Let's put it this way.
> They had to find a replacement for DMAPI so that they could stop
> carrying DMAPI patches, so pretty serious.
> They had to do it one way or the other.

As mentioned earlier, this is an open source HSM project [1], with a
release coming soon, that is using FAN_OPEN_PERM to migrate data from
the slower tier.
As you can imagine, FAN_OPEN_PERM can only get you as far as DMAPI,
but not beyond, and it leaves the problem of setting the marks on all
punched files on bringup.

> But I do know for a fact that there are several companies out there
> implementing HSM to tier local storage to cloud and CTERA is one of
> those companies.
>
> We use FUSE to implement HSM and I have reason to believe that
> other companies do that as well.

FUSE is the most flexible API to implement HSM, but it suffers from
performance overhead in the "fast" path due to context switches and
cache line bounces.

FUSE_PASSTHROUGH patches [2] address this overhead for large file IO.
I plan to upstream those patches.
FUSE-BPF [3] and the former extFUSE [4] project aim to address this
overhead for readdir and other operations.

This is an alluring option for companies that already use FUSE for
HSM, because they will not need to change their implementation much,
but my gut feeling is that there are interesting corner cases
lurking...

> > kernel is going to be only a small part of what's needed for it to be
> > useful and we've dropped DMAPI from SUSE kernels because the code was
> > painful to carry (and forwardport + it was not of great quality) and the
> > demand for it was not really big...

Note that the demand was not big for the crappy DMAPI ;)
That does not say anything about the demand for HSM solutions,
which exists and is growing IMO.

> > So I'd prefer to avoid the major API
> > extension unless there are serious users out there - perhaps we will even
> > need to develop the kernel API in cooperation with the userspace part to
> > verify the result is actually usable and useful.

Yap. It should be trivial to implement a "mirror" HSM backend.
For example, the libprojfs [5] project implements a MirrorProvider
backend for the Microsoft ProjFS [6] HSM API.

> > > Basically, FAN_OPEN_PERM + FAN_MARK_FILESYSTEM
> > > should be enough to implement a basic HSM, but it is not
> > > sufficient for implementing more advanced HSM features.
> > > [...]
> >
> > My main worry here would be that with FAN_FILESYSTEM marks, there will be
> > far too many events (especially for the lookup & access cases) to reasonably
> > process. And since the events will be blocking, the impact on performance
> > will be large.
>
> Right. That problem needs to be addressed.

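To keep the discussion concrete, this is roughly the basic
FAN_OPEN_PERM HSM loop we are talking about. It is a bare-bones sketch,
not the code of [1]; the mount path and the migrate_from_slow_tier()
helper are made up for illustration:

#define _GNU_SOURCE	/* for O_LARGEFILE */
#include <err.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/fanotify.h>

/* Made-up helper: bring the file's data back from the slow tier */
static void migrate_from_slow_tier(int fd)
{
	(void)fd;
}

int main(void)
{
	struct fanotify_event_metadata buf[200], *ev;
	ssize_t len;
	int fan;

	/* Needs CAP_SYS_ADMIN */
	fan = fanotify_init(FAN_CLASS_PRE_CONTENT, O_RDONLY | O_LARGEFILE);
	if (fan < 0)
		err(1, "fanotify_init");

	/* One mark covers the whole filesystem backing the fast tier */
	if (fanotify_mark(fan, FAN_MARK_ADD | FAN_MARK_FILESYSTEM,
			  FAN_OPEN_PERM, AT_FDCWD, "/mnt/fast-tier"))
		err(1, "fanotify_mark");

	for (;;) {
		len = read(fan, buf, sizeof(buf));
		if (len <= 0)
			continue;

		for (ev = buf; FAN_EVENT_OK(ev, len);
		     ev = FAN_EVENT_NEXT(ev, len)) {
			if (ev->fd < 0)
				continue;
			if (ev->mask & FAN_OPEN_PERM) {
				struct fanotify_response res = {
					.fd = ev->fd,
					.response = FAN_ALLOW,
				};

				/* The open blocks here until we respond */
				migrate_from_slow_tier(ev->fd);
				write(fan, &res, sizeof(res));
			}
			close(ev->fd);
		}
	}
}

With a filesystem mark, every open on the fast tier takes this blocking
round trip to the daemon, even when there is nothing to migrate, which
is exactly the overhead problem above.
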
> > I think that a reasonably efficient HSM will have to stay in the kernel
> > (without generating work for userspace) for the "nothing to do" case. And
> > only in case something needs to be migrated, event is generated and
> > userspace gets involved. But it isn't obvious to me how to do this with
> > fanotify (I could imagine it with say overlayfs which is kind of HSM
> > solution already ;)).

It's true, overlayfs is kind of an HSM, but:
- Without swap out to the slower tier
- Without user control over the method of swap in from the slower tier

On another thread regarding FUSE-BPF, Miklos also mentioned the option
of adding those features to overlayfs [7] to make it useful as an HSM
kernel driver.

So we have at least three options for an HSM kernel driver
(FUSE, fanotify, overlayfs), but none of them is yet fully equipped
to drive a modern HSM implementation.

What is clear is that:
1. The fast path must not context switch to userspace
2. The slow path needs an API for calling into user to migrate files/dirs

What is not clear is:
1. The method to persistently mark files/dirs for fast/slow path
2. The API to call into userspace

Overlayfs provides a method to mark files for the slow path
(the 'trusted.overlay.metacopy' xattr), meaning a file that has metadata
but not the data, but overlayfs does not provide the API to perform
"user controlled migration" of the data.

Instead of inventing a new API for that, I'd rather extend the known
fanotify protocol and allow the new FAN_XXX_PRE events only on
filesystems that have the concept of a file without its content
(a.k.a. metacopy).

We could say that filesystems that support fscache can also support
FAN_XXX_PRE events, and perhaps cachefilesd could make use of hooks
to implement user modules that populate the fscache objects out of band.

There is the naive approach of interpreting a "punched hole" in a file
as "no content", as DMAPI did, to support FAN_XXX_PRE events on a
standard local filesystem (fscache does that internally); see the
sketch in the P.S. below for how that convention maps to existing APIs.
That would be an opt-in via a fanotify_init() flag and could be useful
for old DMAPI HSM implementations that are converted to use the new API.

Practically, the filesystems that allow FAN_XXX_PRE events on punched
files would need to advertise this support and maintain an inode flag
(i.e. I_NODATA) to avoid a performance penalty on every file access.

If we take that route, though, it might be better to let the HSM
daemon set this flag explicitly (e.g. chattr +X) when punching holes
in files and remove the flag explicitly when filling the holes.

And there is the most flexible option of attaching a BPF filter to
a filesystem mark, but I am afraid that this program would be limited
to using information already in the path/dentry/inode structs.
At least HSM could use an existing arbitrary inode flag (e.g. chattr +i)
as "persistent marks".

So many options! I don't know which to choose :)

If this plan sounds reasonable, I can start with a POC of
"user controlled copy up/down" for overlayfs, using fanotify as the
user notification protocol, and see where it goes from there.

Thanks for reading my brain dump ;)

Amir.

[1] https://deepspacestorage.com/
[2] https://lore.kernel.org/linux-fsdevel/20210125153057.3623715-1-balsini@android.com/
[3] https://lpc.events/event/16/contributions/1339/attachments/945/1861/LPC2022%20Fuse-bpf.pdf
[4] https://github.com/extfuse/extfuse
[5] https://github.com/github/libprojfs
[6] https://docs.microsoft.com/en-us/windows/win32/api/_projfs/
[7] https://lore.kernel.org/linux-fsdevel/CAJfpegt4N2nmCQGmLSBB--NzuSSsO6Z0sue27biQd4aiSwvNFw@mail.gmail.com/
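
P.S. To illustrate the "punched hole means no content" convention
mentioned above using only existing APIs (this is an illustration, not
the proposed kernel change; the helper names are made up): the HSM
daemon evicts the local copy by punching a hole over the whole file,
and "no local data" can then be tested with SEEK_DATA, assuming the
filesystem reports holes:

#define _GNU_SOURCE	/* for FALLOC_FL_* and SEEK_DATA */
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

/*
 * Evict the local copy of a file whose content is already safe on the
 * slow tier: keep i_size, drop all the blocks.
 */
static int evict_local_data(int fd)
{
	struct stat st;

	if (fstat(fd, &st))
		return -1;
	if (st.st_size == 0)
		return 0;
	return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 0, st.st_size);
}

/*
 * "No content": the file has a size, but not a single data extent.
 * Only meaningful on filesystems that report holes via SEEK_DATA.
 */
static int has_no_local_data(int fd)
{
	return lseek(fd, 0, SEEK_DATA) < 0 && errno == ENXIO;
}

With the explicit flag variant (chattr +X style), the daemon would set
and clear that flag around these two operations, instead of the
filesystem having to infer "no data" on every access.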