From mboxrd@z Thu Jan 1 00:00:00 1970
From: Amir Goldstein
Date: Wed, 25 Nov 2020 14:34:16 +0200
Subject: Re: [RFC][PATCH] fanotify: introduce filesystem view mark
To: Jan Kara
Cc: linux-fsdevel
In-Reply-To: <20201125110156.GB16944@quack2.suse.cz>
References: <20201109180016.80059-1-amir73il@gmail.com>
 <20201124134916.GC19336@quack2.suse.cz>
 <20201125110156.GB16944@quack2.suse.cz>
Content-Type: text/plain; charset="UTF-8"
X-Mailing-List: linux-fsdevel@vger.kernel.org

On Wed, Nov 25, 2020 at 1:01 PM Jan Kara wrote:
>
> On Tue 24-11-20 16:47:41, Amir Goldstein wrote:
> > On Tue, Nov 24, 2020 at 3:49 PM Jan Kara wrote:
> > > On Mon 09-11-20 20:00:16, Amir Goldstein wrote:
> > > > A filesystem view is a subtree of a filesystem accessible from a
> > > > specific mount point. When marking an FS view, the user expects to
> > > > get events on all inodes that are accessible from the marked mount,
> > > > even if the events were generated from another mount.
> > > >
> > > > In particular, events such as FAN_CREATE, FAN_MOVE and FAN_DELETE,
> > > > which are not delivered to a mount mark, can be delivered to an FS
> > > > view mark.
> > > >
> > > > One example of a filesystem view is a btrfs subvolume, which cannot
> > > > be marked with a regular filesystem mark.
> > > >
> > > > Another example of a filesystem view is a bind mount not on the
> > > > root of the filesystem, such as the bind mounts used for containers.
> > > >
> > > > A filesystem view mark is composed of a head sb mark and an sb_view
> > > > mark. The filesystem view mark is connected to the head sb mark and
> > > > the head sb mark is connected to the sb object. The mask of the
> > > > head sb mark is a cumulative mask of all the associated sb_view
> > > > mark masks.
> > > >
> > > > Filesystem view marks cannot co-exist with a regular filesystem
> > > > mark on the same filesystem.
> > > >
> > > > When an event is generated on the head sb mark, fsnotify iterates
> > > > the list of associated sb_view marks and filters out events that
> > > > happen outside of the sb_view mount's root.
> > > >
> > > > Signed-off-by: Amir Goldstein
> > >
> > > I gave this just a high-level look (no detailed review) and here are
> > > my thoughts:
> > >
> > > 1) I like the functionality. IMO this is what a lot of people really
> > > want when looking for "filesystem wide fs monitoring".
> > >
> > > 2) I don't quite like the API you propose though. IMO it exposes
> > > details of the implementation in the API. I'd rather have the API be
> > > the same as for mount marks but with a dedicated mark type flag in
> > > the API - like FAN_MARK_FILESYSTEM_SUBTREE (or we can keep VIEW if
> > > you like it, but I think the fewer terms the better ;).
> >
> > Sure, FAN_MARK_FS_VIEW is a dedicated mark type.
> > The fact that it is a bitwise OR of MOUNT and FILESYSTEM is just a fun
> > fact. Sorry if that wasn't clear.
> > FAN_MARK_FILESYSTEM_SUBTREE sounds better for the uapi.
> >
> > But I suppose you also meant that we should not limit the subtree root
> > to bind mount points?
> >
> > The reason I used a reference to mnt for an sb_view and not a dentry
> > is because we have the fsnotify_clear_marks_by_mount() callback to
> > handle cleanup of the sb_view marks (which I left as TODO).
> >
> > Alternatively, we can play cache pinning games with the subtree root
> > dentry as in the inode mark case, but I didn't want to get into that,
> > nor did I know if we should - if a subtree mark requires CAP_SYS_ADMIN
> > anyway, why not require a bind mount as its target, which is something
> > much more visible to admins.
>
> Yeah, I don't have problems with bind mounts in particular. I was just
> thinking that conceivably we could make these marks less privileged
> (just with CAP_DAC_SEARCH or so) and then mountpoints may be
> unnecessarily restricting. I don't think pinning the subtree root
> dentry would be problematic as such - inode marks pin the inode anyway,
> so this is not substantially different - if we can make it work
> reliably...
>

I'm going to need your help for that ;-)
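On the uapi side, just to illustrate what FAN_MARK_FILESYSTEM_SUBTREE
could look like from a listener's point of view, here is a minimal
sketch. The define below only mirrors the RFC's encoding of the flag as
MOUNT | FILESYSTEM and is not a mainline constant, and the watched path
is just an example:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/fanotify.h>

/* Proposed in this thread; not a mainline constant. */
#ifndef FAN_MARK_FILESYSTEM_SUBTREE
#define FAN_MARK_FILESYSTEM_SUBTREE (FAN_MARK_MOUNT | FAN_MARK_FILESYSTEM)
#endif

int main(int argc, char *argv[])
{
	const char *path = argc > 1 ? argv[1] : "/mnt/container-rootfs";
	int fd;

	/* Directory entry events require an fid-reporting group. */
	fd = fanotify_init(FAN_CLASS_NOTIF | FAN_REPORT_DFID_NAME, 0);
	if (fd < 0) {
		perror("fanotify_init");
		return 1;
	}

	/*
	 * Watch the subtree visible from the (bind) mount at 'path';
	 * the point of the RFC is that events generated via other
	 * mounts of the same sb are still delivered as long as the
	 * object is reachable from this subtree.
	 */
	if (fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_FILESYSTEM_SUBTREE,
			  FAN_CREATE | FAN_DELETE | FAN_MOVE | FAN_ONDIR,
			  AT_FDCWD, path) < 0) {
		perror("fanotify_mark");
		return 1;
	}

	/* ... read(fd, ...) and parse struct fanotify_event_metadata ... */
	return 0;
}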
> In fact I was considering for a while that we could even make subtree
> watches completely unprivileged - when we walk the dir tree anyway, we
> could also check permissions along the way. Due to locking this would
> be difficult to do when generating the event but it might actually be
> doable if we perform the permission check when reporting the event to
> userspace. Just food for thought...

Maybe... but there are some other advantages to restricting to a mount.
One is that with btrfs subvolumes, conn->fsid can actually cache the
subvolume's fsid and we could remove the -EXDEV restriction of
FAN_MARK_FILESYSTEM on a subvolume.
Another has to do with performance optimization (see below).

> > > Also I think this is going to get expensive (e.g. imagine each
> > > write to the page cache having to traverse a potentially deep tree
> > > hierarchy, potentially multiple times - once for each subtree). My
> > > suspicion should be verified with actual performance measurements,
> > > but if I'm right and the overhead is indeed noticeable, I think
> > > we'll need to employ some caching or so to avoid repeated lookups...
> >
> > It's true, but here is a different angle to analyse the overhead -
> > claim: "If users don't have a kernel subtree mark, they will use a
> > filesystem mark and filter in userspace". If that claim is true, then
> > one can argue that this is fair - let the listener process pay the
> > CPU overhead, which can be contained inside a cgroup, and not
> > everyone else. But what is the cost that everyone else will be paying
> > in that case? Everyone will still pay the cost of the fanotify
> > backend callback, including allocating, packing and queueing the
> > event. The question then becomes: what is cheaper, the d_ancestor()
> > traversal or all the fanotify backend handler code?
> > Note that the former can be lockless and the latter does take locks.
>
> I understand and it's a fair point that queueing the event is not
> cheap either, so I'd be interested in the numbers. Something like: how
> deep does a subtree walk have to be to cost as much as queueing an
> event? Or similarly, how many subtree watch checks are as costly as
> queueing one event?
>

The cost shouldn't actually be sensitive to the number of subtree
watches. We should never do more than a single ancestor traversal per
event, up to its sb root, and for each ancestor we should be able to
check in O(1) whether a subtree watch exists on that inode/dentry.
If we stay with the bind mount restriction then we can check
d_mountpoint() on every ancestor, which practically reduces the cost
per ancestor to zero in most cases.
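To make that argument a bit more concrete, here is a rough sketch (not
the RFC code) of the single upward walk I have in mind.
subtree_mark_on() is a made-up stand-in for whatever O(1) mark lookup
we end up with, and a real implementation would also have to cope with
concurrent renames (rename_lock retry or similar), which the sketch
glosses over:

#include <linux/dcache.h>
#include <linux/rcupdate.h>

/*
 * Stand-in: O(1) lookup of an sb_view mark attached to this dentry.
 * A real implementation would consult the mark connector here.
 */
static bool subtree_mark_on(struct dentry *dentry)
{
	return false;	/* placeholder */
}

static bool event_in_watched_subtree(struct dentry *dentry)
{
	struct dentry *p = dentry;
	bool found = false;

	rcu_read_lock();
	for (;;) {
		/*
		 * d_mountpoint() is the cheap per-dentry flag test
		 * suggested above as a pre-filter when subtree roots
		 * are restricted to bind mounts.
		 */
		if (d_mountpoint(p) && subtree_mark_on(p)) {
			found = true;
			break;
		}
		if (IS_ROOT(p))
			break;
		p = READ_ONCE(p->d_parent);
	}
	rcu_read_unlock();
	return found;
}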
> > I have a pretty good bet on the answer, but as you say, only actual
> > performance benchmarks can tell.
> >
> > From my experience, real life fsevent listeners do not listen on
> > FAN_MODIFY but on FAN_CLOSE_WRITE, because the former is too noisy.
>
> Agreed.
>
> > The best case scenario is that users step forward to say that they
> > want to use fanotify but need the subtree filtering, and can provide
> > us with real life performance measurements of the userspace vs.
> > kernel alternatives (in terms of penalty on the rest of the system).
>
> With the cost of having to go to userspace and there do essentially
> the same subtree walk as you do in the kernel, I have no doubt what's
> going to be faster (by orders of magnitude). What I'm somewhat
> uncertain about is whether the subtree check is OK at the time of
> event generation or whether it should rather be moved to the moment
> when we are about to report the event to userspace (when the cost of
> the subtree check goes to the process interested in the event, which
> is fine - but as you rightly note, we already had to pay the cost of
> queueing the event, so it isn't clear this is a win even for the
> processes generating events).
>

I think the only semantics that make sense are:

* If all of the object's present and past ancestors were always under
  the subtree - guaranteed to get an event
* If all of the object's present and past ancestors were never under
  the subtree - guaranteed not to get an event
* Otherwise - undefined

So I think it is OK to check for the subtree at event generation time.

Thanks,
Amir.
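P.S. A tiny sketch of the generation-time check implied by the
semantics above - not the RFC code, just the shape of the decision.
is_subdir() already answers "is the object currently a descendant of
(or equal to) the subtree root" and deals with concurrent renames
internally:

#include <linux/dcache.h>

/*
 * Deliver the event only if the object is currently reachable under
 * the subtree root (the root dentry of the marked bind mount).
 * Objects moved into or out of the subtree around the time of the
 * event fall into the "undefined" bucket above.
 */
static bool subtree_event_applies(struct dentry *object,
				  struct dentry *subtree_root)
{
	return is_subdir(object, subtree_root);
}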