Linux-Fsdevel Archive on
From: Andres Freund <>
To: Linus Torvalds <>
Cc: David Howells <>,
	Greg Kroah-Hartman <>,
	Casey Schaufler <>,
	Stephen Smalley <>,
	Nicolas Dichtel <>,
	Ian Kent <>,
	Christian Brauner <>,,,
	linux-block <>,
	LSM List <>,
	linux-fsdevel <>,
	Linux API <>,
	Linux Kernel Mailing List <>
Subject: Re: [RFC PATCH 00/14] pipe: Keyrings, Block and USB notifications [ver #3]
Date: Mon, 10 Feb 2020 16:56:26 -0800
Message-ID: <>
In-Reply-To: <>


I only just now noticed this work after Dave Chinner pointed towards the
feature in the email leading to

On 2020-01-15 12:10:32 -0800, Linus Torvalds wrote:
> So I no longer hate the implementation, but I do want to see the
> actual user space users come out of the woodwork and try this out for
> their use cases.

Postgres has been looking for something roughly like this, fwiw (or
well, been forced to).

While it's better than it used to be (due to b4678df184b3), we still
have problems reliably detecting buffered IO errors, especially when
the IO is done across multiple processes.  We can't easily keep an fd
open that predates all writes to a file until the file is fsynced, and
ensure that fsyncs will happen only on that fd. The primary reasons for
that are
1) every connection (& some internal jobs) is a process, and we neither
want to be fsyncing each touched file in short-lived connections, nor is
it desirable to add the complication of transferring fds between
processes just to reliably get an error in fsync().
2) we have to cope with having more files open than allowed, so we have
a layer that limits the number of OS level FDs open at the same time. We
don't want to fsync whenever we have to juggle open fds though, as
that'd be too costly.
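
To make the pattern concrete, here's a minimal sketch of what effectively
happens today. This is not actual Postgres code; the function names
backend_write_and_close() and checkpointer_fsync() are made up for
illustration. One process writes and drops its fd, and later another fd,
opened only after those writes, is used for the fsync. Whether a
writeback error that happened in between is reliably reported on that
second fd is the crux of the problem.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for a backend: write some data, then drop the fd
 * (e.g. because the connection ends or the fd cache is full).
 * Returns 0 on success, -1 on error.
 */
int backend_write_and_close(const char *path, const char *data)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);
    if (fd < 0)
        return -1;

    size_t len = strlen(data);
    ssize_t written = write(fd, data, len);
    /* For simplicity a short write is treated as an error. */
    int rc = (written == (ssize_t) len) ? 0 : -1;

    /* No fsync here: the data may still only be in the page cache. */
    if (close(fd) != 0)
        rc = -1;
    return rc;
}

/*
 * Hypothetical stand-in for the checkpointer: open a *new* fd, which
 * postdates the writes above, and fsync through it.
 * Returns 0 on success, -1 on error.
 */
int checkpointer_fsync(const char *path)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    int rc = (fsync(fd) == 0) ? 0 : -1;
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

On a healthy filesystem both calls just return 0; the interesting case is
when writeback fails after the close() in the first function and before
the open() in the second.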

So it'd be good to have a way to *reliably* know when writeback io failed,
so we can abort a checkpoint if necessary, and instead perform journal
replay.

For our purposes we'd probably want errors on the fs/superblock level,
rather than block devices. It's not always easy to map between blockdevs
and relevant filesystems, there are errors above the block layer, and we
definitely don't want to crash & restart a database just because
somebody pulled a USB storage device that didn't have any of the
database's data on it.

An earlier version of this patchset had some support for that, albeit
perhaps not fully implemented (no errors raised, afaict?):

Is the plan to pick this up again once the basic feature is in?

A few notes from the email referenced above (that actually seem to belong
in this thread more than the other):

1) From the angle of reliably needing to detect writeback errors, I find it
somewhat concerning that an LSM may end up entirely filtering away error
notifications, without a consumer being able to detect that:

+void __post_watch_notification(struct watch_list *wlist,
+			       struct watch_notification *n,
+			       const struct cred *cred,
+			       u64 id)
+		if (security_post_notification(watch->cred, cred, n) < 0)
+			continue;

It's an unpleasant thought that an overly restrictive [-ly configured]
LSM could lead to silently swallowing data integrity errors.

2) It'd be good if there were documentation, aimed at userland consumers
of this, explaining what the delivery guarantees are. To be useful for
us, it needs to be guaranteed that consuming all notifications ensures
that there are no pending notifications queued up somewhere (so we can
do fsync(data); fsync(journal); check_for_errors();
durable_rename(checkpoint_state.tmp, checkpoint_state);).

3) What will the permission model for accessing the notifications be?
It seems currently anyone, even within a container/namespace or
something, will see blockdev errors from everywhere?  The earlier
superblock support (I'm not sure I like that name btw, hard to
understand for us userspace folks), seems to have required exec
permission, but nothing else.


Andres Freund
