linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jan Kara <jack@suse.cz>
To: Amir Goldstein <amir73il@gmail.com>
Cc: Jan Kara <jack@suse.cz>, linux-fsdevel <linux-fsdevel@vger.kernel.org>
Subject: Re: fanotify_merge improvements
Date: Wed, 27 Jan 2021 12:24:16 +0100	[thread overview]
Message-ID: <20210127112416.GB3108@quack2.suse.cz> (raw)
In-Reply-To: <CAOQ4uxiSSYr4bejwZBBPDjs1Vg_BUSSjY4YiUAgri=adHdOLuQ@mail.gmail.com>

On Tue 26-01-21 18:21:26, Amir Goldstein wrote:
> On Mon, Jan 25, 2021 at 3:01 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Sat 23-01-21 15:30:59, Amir Goldstein wrote:
> > > On Fri, Jan 22, 2021 at 3:59 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > >
> > > > > > > Hum, now thinking about this, maybe we could clean this up even a bit more.
> > > > > > > event->inode is currently used only by inotify and fanotify for merging
> > > > > > > purposes. Now inotify could use its 'wd' instead of inode with exactly the
> > > > > > > same results, fanotify path or fid check is at least as strong as the inode
> > > > > > > check. So only for the case of pure "inode" events, we need to store inode
> > > > > > > identifier in struct fanotify_event - and we can do that in the union with
> > > > > > > struct path and completely remove the 'inode' member from fsnotify_event.
> > > > > > > Am I missing something?
> > > > > >
> > > > > > That generally sounds good and I did notice it is strange that wd is not
> > > > > > being compared.  However, I think I was worried that comparing fid+name
> > > > > > (in following patches) would be more expensive than comparing dentry (or
> > > > > > object inode) as a "rule out first" in merge, so I preferred to keep the
> > > > > > tag/dentry/id comparison for fanotify_fid case.
> > > > >
> > > > > Yes, that could be a concern.
> > > > >
> > > > > > Given this analysis (and assuming it is correct), would you like me to
> > > > > > just go a head with the change suggested above? or anything beyond that?
> > > > >
> > > > > Let's go just with the change suggested above for now. We can work on this
> > > > > later (probably with optimizing of the fanotify merging code).
> > > > >
> > > >
> > > > Hi Jan,
> > > >
> > > > Recap:
> > > > - fanotify_merge is very inefficient and uses extensive CPU if queue contains
> > > >   many events, so it is rather easy for a poorly written listener to
> > > > cripple the system
> > > > - You had an idea to store in event->objectid a hash of all the compared
> > > >   fields (e.g. fid+name)
> > > > - I think you had an idea to keep a hash table of events in the queue
> > > > to find the
> > > >   merge candidates faster
> > > > - For internal uses, I carry a patch that limits the linear search for
> > > > last 128 events
> > > >   which is enough to relieve the CPU overuse in case of unattended long queues
> > > >
> > > > I tried looking into implementing the hash table idea, assuming I understood you
> > > > correctly and I struggled to choose appropriate table sizes. It seemed to make
> > > > sense to use a global hash table, such as inode/dentry cache for all the groups
> > > > but that would add complexity to locking rules of queue/dequeue and
> > > > group cleanup.
> > > >
> > > > A simpler solution I considered, similar to my 128 events limit patch,
> > > > is to limit
> > > > the linear search to events queued in the last X seconds.
> > > > The rationale is that event merging is not supposed to be long term at all.
> > > > If a listener fails to perform read from the queue, it is not fsnotify's job to
> > > > try and keep the queue compact. I think merging events mechanism was
> > > > mainly meant to merge short bursts of events on objects, which are quite
> > > > common and surely can happen concurrently on several objects.
> > > >
> > > > My intuition is that making event->objectid into event->hash in addition
> > > > to limiting the age of events to merge would address the real life workloads.
> > > > One question if we do choose this approach is what should the age limit be?
> > > > Should it be configurable? Default to infinity and let distro cap the age or
> > > > provide a sane default by kernel while slightly changing behavior (yes please).
> > > >
> > > > What are your thoughts about this?
> > >
> > > Aha! found it:
> > > https://lore.kernel.org/linux-fsdevel/20200227112755.GZ10728@quack2.suse.cz/
> > > You suggested a small hash table per group (128 slots).
> > >
> > > My intuition is that this will not be good enough for the worst case, which is
> > > not that hard to hit is real life:
> > > 1. Listener sets FAN_UNLIMITED_QUEUE
> > > 2. Listener adds a FAN_MARK_FILESYSTEM watch
> > > 3. Many thousands of events are queued
> > > 4. Listener lingers (due to bad implementation?) in reading events
> > > 5. Every single event now incurs a huge fanotify_merge() cost
> > >
> > > Reducing the cost of merge from O(N) to O(N/128) doesn't really fix the
> > > problem.
> >
> > So my thought was that indeed reducing the overhead of merging by a factor
> > of 128 should be enough for any practical case as much as I agree that in
> > principle the computational complexity remains the same. And I've picked
> > per-group hash table to avoid interferences among notification groups and
> > to keep locking simple. That being said I'm not opposed to combining this
> > with a limit on the number of elements traversed in a hash chain (e.g.
> > those 128 you use yourself) - it will be naturally ordered by queue order
> > if we are a bit careful. This will provide efficient and effective merging
> > for ~8k queued events which seems enough to me. I find time based limits
> > not really worth it. Yes, they provide more predictable behavior but less
> > predictable runtime and overall I don't find the complexity worth the
> > benefit.
> >
> 
> Sounds reasonable.
> If you have time, please take a look at this WIP branch:
> https://github.com/amir73il/linux/commits/fanotify_merge
> and let me know if you like the direction it is taking.
> 
> This branch is only compile tested, but I am asking w.r.t to the chosen
> data structures.  So far it is just an array of queues selected by (yet
> unmodified) objectid.  Reading is just from any available queue.
> 
> My goal was to avoid having to hang the event on multiple list/hlist and
> the idea is to implement read by order of events as follows:

As a side note, since we use notification_list as a strict queue, we could
actually use a singly linked list for linking all the events (implemented
in include/linux/llist.h). That way we can save one pointer in
fsnotify_event if we wish without too much complication AFAICT. But I'm not
sure we really care.

> - With multi queue, high bit of obejctid will be masked for merge compare.
> - Instead, they will be used to store the next_qid to read from
> 
> For example:
> - event #1 is added to queue 6
> - set group->last_qid = 6
> - set group->next_qid = 6 (because group->num_events == 1)
> - event #2 is added to queue 13
> - the next_qid bits of the last event in last_qid (6) queue are set to 13
> - set group->last_qid = 13
>
> - read() checks value of group->next_qid and reads the first event
> from queue 6 (event #1)
> - event #1 has 13 stored in next_qid bits so set group->next_qid = 13
> - read() reads first event from queue 13 (event #2)

That's an interesting idea. I like it and I think it would work. Just
instead of masking, I'd use bitfields. Or we could just restrict objectid
to 32-bits and use remaining 32-bits for the next_qid pointer. I know it
will waste some bits but 32-bits of objectid should provide us with enough
space to avoid doing full event comparison in most cases - BTW WRT naming I
find 'qid' somewhat confusing. Can we call it say 'next_bucket' or
something like that?

> Permission events require special care, but that is the idea of a simple
> singly linked list using qid's for reading events by insert order and
> merging by hashed queue.

Why are permission events special in this regard?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

  reply	other threads:[~2021-01-27 11:27 UTC|newest]

Thread overview: 65+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-02-17 13:14 [PATCH v2 00/16] Fanotify event with name info Amir Goldstein
2020-02-17 13:14 ` [PATCH v2 01/16] fsnotify: tidy up FS_ and FAN_ constants Amir Goldstein
2020-02-17 13:14 ` [PATCH v2 02/16] fsnotify: factor helpers fsnotify_dentry() and fsnotify_file() Amir Goldstein
2020-02-25 13:46   ` Jan Kara
2020-02-25 14:27     ` Amir Goldstein
2020-02-26 13:59       ` Jan Kara
2020-02-17 13:14 ` [PATCH v2 03/16] fsnotify: funnel all dirent events through fsnotify_name() Amir Goldstein
2020-02-17 13:14 ` [PATCH v2 04/16] fsnotify: use helpers to access data by data_type Amir Goldstein
2020-02-17 13:14 ` [PATCH v2 05/16] fsnotify: simplify arguments passing to fsnotify_parent() Amir Goldstein
2020-02-19 10:50   ` kbuild test robot
2020-02-19 11:11   ` Amir Goldstein
2020-02-17 13:14 ` [PATCH v2 06/16] fsnotify: pass dentry instead of inode for events possible on child Amir Goldstein
2020-02-17 13:14 ` [PATCH v2 07/16] fsnotify: replace inode pointer with tag Amir Goldstein
2020-02-26  8:20   ` Jan Kara
2020-02-26  9:34     ` Amir Goldstein
2020-02-26  8:52   ` Jan Kara
2020-02-17 13:14 ` [PATCH v2 08/16] fanotify: merge duplicate events on parent and child Amir Goldstein
2020-02-26  9:18   ` Jan Kara
2020-02-26 12:14     ` Amir Goldstein
2020-02-26 14:38       ` Jan Kara
2021-01-22 13:59         ` fanotify_merge improvements Amir Goldstein
2021-01-23 13:30           ` Amir Goldstein
2021-01-25 13:01             ` Jan Kara
2021-01-26 16:21               ` Amir Goldstein
2021-01-27 11:24                 ` Jan Kara [this message]
2021-01-27 12:57                   ` Amir Goldstein
2021-01-27 15:15                     ` Jan Kara
2021-01-27 18:03                       ` Amir Goldstein
2021-01-28 10:27                         ` Jan Kara
2021-01-28 18:50                           ` Amir Goldstein
2020-02-17 13:14 ` [PATCH v2 09/16] fanotify: fix merging marks masks with FAN_ONDIR Amir Goldstein
2020-02-17 13:14 ` [PATCH v2 10/16] fanotify: send FAN_DIR_MODIFY event flavor with dir inode and name Amir Goldstein
2020-02-17 13:14 ` [PATCH v2 11/16] fanotify: prepare to encode both parent and child fid's Amir Goldstein
2020-02-26 10:23   ` Jan Kara
2020-02-26 11:53     ` Amir Goldstein
2020-02-26 17:07       ` Jan Kara
2020-02-26 17:50         ` Amir Goldstein
2020-02-27  9:06           ` Amir Goldstein
2020-02-27 11:27             ` Jan Kara
2020-02-27 12:12               ` Amir Goldstein
2020-02-27 13:30                 ` Jan Kara
2020-02-27 14:06                   ` Amir Goldstein
2020-03-01 16:26                     ` Amir Goldstein
2020-03-05 15:49                       ` Jan Kara
2020-03-06 11:19                         ` Amir Goldstein
2020-03-08  7:29                           ` Amir Goldstein
2020-03-18 17:51                             ` Jan Kara
2020-03-18 18:50                               ` Amir Goldstein
2020-03-19  9:30                                 ` Jan Kara
2020-03-19 10:07                                   ` Amir Goldstein
2020-03-30 19:29                                 ` Amir Goldstein
2020-02-27 11:01           ` Jan Kara
2020-02-17 13:14 ` [PATCH v2 12/16] fanotify: record name info for FAN_DIR_MODIFY event Amir Goldstein
2020-02-17 13:14 ` [PATCH v2 13/16] fanotify: report " Amir Goldstein
2020-02-19  9:43   ` kbuild test robot
2020-02-19 10:17   ` kbuild test robot
2020-02-19 11:22   ` Amir Goldstein
2020-04-16 12:16   ` Michael Kerrisk (man-pages)
2020-04-20 15:53     ` Jan Kara
2020-04-20 18:45     ` Amir Goldstein
2020-04-20 18:47       ` Michael Kerrisk (man-pages)
2020-02-17 13:14 ` [PATCH v2 14/16] fanotify: report parent fid + name with FAN_REPORT_NAME Amir Goldstein
2020-02-17 13:14 ` [PATCH v2 15/16] fanotify: refine rules for when name is reported Amir Goldstein
2020-02-17 13:14 ` [BONUS][PATCH v2 16/16] fanotify: support limited functionality for unprivileged users Amir Goldstein
2020-02-20 22:10 ` [PATCH v2 00/16] Fanotify event with name info Matthew Bobrowski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210127112416.GB3108@quack2.suse.cz \
    --to=jack@suse.cz \
    --cc=amir73il@gmail.com \
    --cc=linux-fsdevel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).