Re: [PATCH 0/2] shmem: Notify user space when file system is full

From: Khazhy Kumykov <khazhy@google.com>
To: Gabriel Krisman Bertazi <krisman@collabora.com>
Cc: Amir Goldstein <amir73il@gmail.com>,
	Hugh Dickins <hughd@google.com>,
	 Andrew Morton <akpm@linux-foundation.org>,
	Linux MM <linux-mm@kvack.org>, Jan Kara <jack@suse.com>,
	 Matthew Bobrowski <repnop@google.com>,
	kernel@collabora.com
Subject: Re: [PATCH 0/2] shmem: Notify user space when file system is full
Date: Fri, 14 Jan 2022 14:16:14 -0800	[thread overview]
Message-ID: <CACGdZYLLCqzS4VLUHvzYG=rX3SEJaG7Vbs8_Wb_iUVSvXsqkxA@mail.gmail.com> (raw)
In-Reply-To: <871r1aysv2.fsf@collabora.com>

On Fri, Jan 14, 2022 at 12:17 PM Gabriel Krisman Bertazi
<krisman@collabora.com> wrote:
>
> Amir Goldstein <amir73il@gmail.com> writes:
>
> >> > But the question remains, what is so special about shmem that
> >> > your use case requires fsnotify events to handle ENOSPC?
> >> >
> >> > Many systems are deployed on thin provisioned storage these days
> >> > and monitoring the state of the storage to alert administrator before
> >> > storage gets full (be it filesystem inodes or blocks or thinp space)
> >> > is crucial to many systems.
> >> >
> >> > Since the ENOSPC event that you are proposing is asynchronous
> >> > anyway, what is the problem with polling statfs() and meminfo?
> >>
> >> Amir,
> >>
> >> I spoke a bit with Khazhy (in CC) about the problems with polling the
> >> existing APIs, like statfs.  He has been using a previous version of
> >> this code in production to monitor machines for a while now.  Khazhy,
> >> feel free to pitch in with more details.
> >>
> >> Firstly, I don't want to treat shmem as a special case.  The original
> >> patch implemented support only for tmpfs, because it was a fs specific
> >> solution, but I think this would be useful for any other (non-pseudo)
> >> file system in the kernel.
> >>
> >> The use case is similar to the use case I brought up for FAN_FS_ERROR.
> >> A sysadmin monitoring a fleet of machines wants to be notified when a
> >> service failed because of lack of space, without having to trust the
> >> failed application to properly report the error.
> >>
> >> Polling statfs is prone to missing the ENOSPC occurrence if the error is
> >> ephemeral from a monitoring tool point of view. Say the application is
> >> writing a large file, hits ENOSPC and, as a recovery mechanism, removes
> >> the partial file.  If that happens, a daemon might miss the chance to
> >> observe the lack of space in statfs.  Doing it through fsnotify, on the
> >> other hand, always catches the condition and allows a monitoring
> >> tool/sysadmin to take corrective action.
> >>
> >> > I guess one difference is that it is harder to predict page allocation failure
> >> > that causes ENOSPC in shmem, but IIUC, your patch does not report
> >> > an fsevent in that case only in inode/block accounting error.
> >> > Or maybe I did not understand it correctly?
> >>
> >> Correct.  But we cannot predict the enospc, unless we know the
> >> application.  I'm looking for a way for a sysadmin to not have to rely
> >> on the application caring about the file system size.
> >>
> >
> > In the real world, ENOSPC can often be anticipated way ahead of time
> > and sysadmins are practically required to take action when storage space is low.
> > Getting near 90% full filesystem is not healthy on many traditional disk
> > filesystems and causes suboptimal performance and in many cases
> > (especially cow filesystems) may lead to filesystem corruption.
> >
> > All that said, yes, *sometimes* ENOSPC cannot be anticipated,
> > but EIO can never be anticipated, so why are we talking about ENOSPC?
> > Focusing on ENOSPC seems too specific for the purpose of adding fsnotify
> > monitoring for filesystem ephemeral errors.
> >
> > The problem with fsnotify events for ephemeral filesystem errors
> > and that there can be a *lot* of them compared to filesystem corruption
> > errors that would usually put the filesystem in an "emergency" state
> > and stop the events from flooding.
> > For that reason I still think that a polling API for fs ephemeral errors
> > is a better idea.
> >
> > w.r.t to ephemeral errors on writeback we already have syncfs() as
> > a means to provide publish/subscribe API for monitoring applications,
> > to check if there was any error since last check, but we do not have an
> > API that provides this information without the added costs of performing
> > syncfs().
> >
> > IMO, a proper solution would look something like this:
> >
> >          /* per-sb errseq_t for reporting writeback errors via syncfs */
> >          errseq_t s_wb_err;
> > +        /* per-sb errseq_t for reporting vfs errors via fstatfs */
> > +        errseq_t s_vfs_err;
> >
>
> I think making it a polling API wouldn't be a problem for our use case,
> as long as it is kept as an always increasing counter, we should be able
> to detect changes and not miss events.
>
> The problem with the proposal, in my opinion, is the lack of
> differentiation of the errors.  We want to be able to tell apart an EIO
> from a ENOSPC, and it might not be clear from the other fields in
> fstatfs what has happened.

These tmpfs are associated with a containerized task (and often aren't
very large), so it isn't possible to predict a full filesystem.

For our use case (and what makes this "shmem specific") is we want to
differentiate between a user getting ENOSPC due to insufficiently
provisioned fs size, vs. due to running out of memory in a container -
both of which return ENOSPC to the process. Mixing EIO into the same
signal ("there were errors, ever") hides this information. So what we
were looking for was some sort of way of signaling that user saw
enospc, and a way for tmpfs to signal "why". Our first approach was
just to keep a counter of how many times this situation (ENOSPC due to
max size) occurred, but this seemed too niche an interface. Using the
fanotify interface seems like a good candidate, and has the additional
(potential) benefit that a similar notification can be used for any
error on any type of fs, though as you mentioned the volume of
potential errors is much higher, so perhaps sticking to polling is the
way to go.

To my understanding, errseq_t stores a counter for 1 type of error,
and if we see multiple types of errors, we'll overwrite the errno (so,
EIO followed by ENOSPC, or vice versa, would result in missing info)

>
> Also, I suspect Google might care about what inode triggered the error.
> If I understand correctly their use case, that would allow them to trace
> back the origin of the issue.  Either way, wouldn't it be useful for
> applications in general to be able to know what specific writeback failed?

For our usage, just knowing that an error occurred should be good
enough, and if that simplifies things let's omit the extra
functionality.

>
> > fstatfs() is just an example that may be a good fit for fs monitoring
> > applications we can add the error state in f_spare space, but we can
> > also create a dedicated API for polling for errors.
>
> That would be an option.  But f_spare is only 4 words long.  That isn't
> enough if we want to report the errors that occurred.  I think a new api
> should report the specific inodes that had a failed writeback.
>
> --
> Gabriel Krisman Bertazi