linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask
@ 2021-03-28 15:56 Amir Goldstein
  2021-03-30  7:31 ` Christian Brauner
                   ` (2 more replies)
  0 siblings, 3 replies; 61+ messages in thread
From: Amir Goldstein @ 2021-03-28 15:56 UTC (permalink / raw)
  To: Jan Kara; +Cc: Christian Brauner, linux-fsdevel, linux-api

Add a high level hook fsnotify_path_create() which is called from
syscall context where mount context is available, so that FAN_CREATE
event can be added to a mount mark mask.

This high level hook is called in addition to fsnotify_create(),
fsnotify_mkdir() and fsnotify_link() hooks in vfs helpers where the mount
context is not available.

In the context where fsnotify_path_create() will be called, a dentry flag
flag is set on the new dentry the suppress the FS_CREATE event in the vfs
level hooks.

This functionality was requested by Christian Brauner to replace
recursive inotify watches for detecting when some path was created under
an idmapped mount without having to monitor FAN_CREATE events in the
entire filesystem.

In combination with more changes to allow unprivileged fanotify listener
to watch an idmapped mount, this functionality would be usable also by
nested container managers.

Link: https://lore.kernel.org/linux-fsdevel/20210318143140.jxycfn3fpqntq34z@wittgenstein/
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---

Jan,

After trying several different approaches, I finally realized that
making FAN_CREATE available for mount marks is not that hard and it could
be very useful IMO.

Adding support for other "inode events" with mount mark, such as
FAN_ATTRIB, FAN_DELETE, FAN_MOVE may also be possible, but adding support
for FAN_CREATE was really easy due to the fact that all call sites are
already surrounded by filename_creat()/done_path_create() calls.

Also, there is an inherent a-symetry between FAN_CREATE and other
events. All the rest of the events may be set when watching a postive
path, for example, to know when a path of a bind mount that was
"injected" to a container was moved or deleted, it is possible to start
watching that directory before injecting the bind mount.

It is not possible to do the same with a "negative" path to know when
a positive dentry was instantiated at that path.

This patch provides functionality that is independant of other changes,
but I also tested it along with other changes that demonstrate how it
would be utilized in userns setups [1][2].

As can be seen in dcache.h patch, this patch comes on top a revert patch
to reclaim an unused dentry flag. If you accept this proposal, I will
post the full series.

Thanks,
Amir.

[1] https://github.com/amir73il/linux/commits/fanotify_userns
[2] https://github.com/amir73il/inotify-tools/commits/fanotify_userns

 fs/namei.c               | 21 ++++++++++++++++++++-
 include/linux/dcache.h   |  2 +-
 include/linux/fanotify.h |  8 ++++----
 include/linux/fsnotify.h | 36 ++++++++++++++++++++++++++++++++++++
 4 files changed, 61 insertions(+), 6 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 216f16e74351..cf979e956938 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3288,7 +3288,7 @@ static const char *open_last_lookups(struct nameidata *nd,
 		inode_lock_shared(dir->d_inode);
 	dentry = lookup_open(nd, file, op, got_write);
 	if (!IS_ERR(dentry) && (file->f_mode & FMODE_CREATED))
-		fsnotify_create(dir->d_inode, dentry);
+		fsnotify_path_create(&nd->path, dentry);
 	if (open_flag & O_CREAT)
 		inode_unlock(dir->d_inode);
 	else
@@ -3560,6 +3560,20 @@ struct file *do_file_open_root(struct dentry *dentry, struct vfsmount *mnt,
 	return file;
 }
 
+static void d_set_path_create(struct dentry *dentry)
+{
+	spin_lock(&dentry->d_lock);
+	dentry->d_flags |= DCACHE_PATH_CREATE;
+	spin_unlock(&dentry->d_lock);
+}
+
+static void d_clear_path_create(struct dentry *dentry)
+{
+	spin_lock(&dentry->d_lock);
+	dentry->d_flags &= ~DCACHE_PATH_CREATE;
+	spin_unlock(&dentry->d_lock);
+}
+
 static struct dentry *filename_create(int dfd, struct filename *name,
 				struct path *path, unsigned int lookup_flags)
 {
@@ -3617,6 +3631,8 @@ static struct dentry *filename_create(int dfd, struct filename *name,
 		goto fail;
 	}
 	putname(name);
+	/* Start "path create" context that ends in done_path_create() */
+	d_set_path_create(dentry);
 	return dentry;
 fail:
 	dput(dentry);
@@ -3641,6 +3657,9 @@ EXPORT_SYMBOL(kern_path_create);
 
 void done_path_create(struct path *path, struct dentry *dentry)
 {
+	if (d_inode(dentry))
+		fsnotify_path_create(path, dentry);
+	d_clear_path_create(dentry);
 	dput(dentry);
 	inode_unlock(path->dentry->d_inode);
 	mnt_drop_write(path->mnt);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 4225caa8cf02..d153793d5b95 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -213,7 +213,7 @@ struct dentry_operations {
 #define DCACHE_SYMLINK_TYPE		0x00600000 /* Symlink (or fallthru to such) */
 
 #define DCACHE_MAY_FREE			0x00800000
-/* Was #define DCACHE_FALLTHRU			0x01000000 */
+#define DCACHE_PATH_CREATE		0x01000000 /* "path_create" context */
 #define DCACHE_NOKEY_NAME		0x02000000 /* Encrypted name encoded without key */
 #define DCACHE_OP_REAL			0x04000000
 
diff --git a/include/linux/fanotify.h b/include/linux/fanotify.h
index bad41bcb25df..f0c5a4a82b6e 100644
--- a/include/linux/fanotify.h
+++ b/include/linux/fanotify.h
@@ -65,10 +65,10 @@ extern struct ctl_table fanotify_table[]; /* for sysctl */
 
 /*
  * Events that can be reported with data type FSNOTIFY_EVENT_PATH.
- * Note that FAN_MODIFY can also be reported with data type
+ * Note that FAN_MODIFY and FAN_CREATE can also be reported with data type
  * FSNOTIFY_EVENT_INODE.
  */
-#define FANOTIFY_PATH_EVENTS	(FAN_ACCESS | FAN_MODIFY | \
+#define FANOTIFY_PATH_EVENTS	(FAN_ACCESS | FAN_MODIFY | FAN_CREATE | \
 				 FAN_CLOSE | FAN_OPEN | FAN_OPEN_EXEC)
 
 /*
@@ -78,8 +78,8 @@ extern struct ctl_table fanotify_table[]; /* for sysctl */
 #define FANOTIFY_DIRENT_EVENTS	(FAN_MOVE | FAN_CREATE | FAN_DELETE)
 
 /* Events that can only be reported with data type FSNOTIFY_EVENT_INODE */
-#define FANOTIFY_INODE_EVENTS	(FANOTIFY_DIRENT_EVENTS | \
-				 FAN_ATTRIB | FAN_MOVE_SELF | FAN_DELETE_SELF)
+#define FANOTIFY_INODE_EVENTS	(FAN_MOVE | FAN_DELETE | FAN_ATTRIB | \
+				 FAN_MOVE_SELF | FAN_DELETE_SELF)
 
 /* Events that user can request to be notified on */
 #define FANOTIFY_EVENTS		(FANOTIFY_PATH_EVENTS | \
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index f8acddcf54fb..9a3d9f7beeb2 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -179,6 +179,30 @@ static inline void fsnotify_inoderemove(struct inode *inode)
 	__fsnotify_inode_delete(inode);
 }
 
+/*
+ * fsnotify_path_create - an inode was linked to namespace
+ *
+ * This higher level hook is called in addition to fsnotify_create(),
+ * fsnotify_mkdir() and fsnotify_link() vfs hooks when the mount context is
+ * available, so that FS_CREATE event can be added to a mount mark mask.
+ *
+ * In that case the, DCACHE_PATH_CREATE flag is set to suppress the FS_CREATE
+ * event in the lower level vfs hooks.
+ */
+static inline void fsnotify_path_create(struct path *path,
+					struct dentry *child)
+{
+	struct inode *dir = path->dentry->d_inode;
+	__u32 mask = FS_CREATE;
+
+	WARN_ON_ONCE(!inode_is_locked(dir));
+
+	if (S_ISDIR(d_inode(child)->i_mode))
+		mask |= FS_ISDIR;
+
+	fsnotify(mask, path, FSNOTIFY_EVENT_PATH, dir, &child->d_name, NULL, 0);
+}
+
 /*
  * fsnotify_create - 'name' was linked in
  */
@@ -186,6 +210,10 @@ static inline void fsnotify_create(struct inode *inode, struct dentry *dentry)
 {
 	audit_inode_child(inode, dentry, AUDIT_TYPE_CHILD_CREATE);
 
+	/* fsnotify_path_create() will be called */
+	if (dentry->d_flags & DCACHE_PATH_CREATE)
+		return;
+
 	fsnotify_dirent(inode, dentry, FS_CREATE);
 }
 
@@ -200,6 +228,10 @@ static inline void fsnotify_link(struct inode *dir, struct inode *inode,
 	fsnotify_link_count(inode);
 	audit_inode_child(dir, new_dentry, AUDIT_TYPE_CHILD_CREATE);
 
+	/* fsnotify_path_create() will be called */
+	if (new_dentry->d_flags & DCACHE_PATH_CREATE)
+		return;
+
 	fsnotify_name(dir, FS_CREATE, inode, &new_dentry->d_name, 0);
 }
 
@@ -223,6 +255,10 @@ static inline void fsnotify_mkdir(struct inode *inode, struct dentry *dentry)
 {
 	audit_inode_child(inode, dentry, AUDIT_TYPE_CHILD_CREATE);
 
+	/* fsnotify_path_create() will be called */
+	if (dentry->d_flags & DCACHE_PATH_CREATE)
+		return;
+
 	fsnotify_dirent(inode, dentry, FS_CREATE | FS_ISDIR);
 }
 
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask
  2021-03-28 15:56 [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask Amir Goldstein
@ 2021-03-30  7:31 ` Christian Brauner
  2021-03-30  9:31   ` Amir Goldstein
  2021-03-30 12:12 ` [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask Christian Brauner
  2021-03-30 12:20 ` Amir Goldstein
  2 siblings, 1 reply; 61+ messages in thread
From: Christian Brauner @ 2021-03-30  7:31 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel, linux-api

On Sun, Mar 28, 2021 at 06:56:24PM +0300, Amir Goldstein wrote:
> Add a high level hook fsnotify_path_create() which is called from
> syscall context where mount context is available, so that FAN_CREATE
> event can be added to a mount mark mask.
> 
> This high level hook is called in addition to fsnotify_create(),
> fsnotify_mkdir() and fsnotify_link() hooks in vfs helpers where the mount
> context is not available.
> 
> In the context where fsnotify_path_create() will be called, a dentry flag
> flag is set on the new dentry the suppress the FS_CREATE event in the vfs
> level hooks.
> 
> This functionality was requested by Christian Brauner to replace
> recursive inotify watches for detecting when some path was created under
> an idmapped mount without having to monitor FAN_CREATE events in the
> entire filesystem.
> 
> In combination with more changes to allow unprivileged fanotify listener
> to watch an idmapped mount, this functionality would be usable also by
> nested container managers.
> 
> Link: https://lore.kernel.org/linux-fsdevel/20210318143140.jxycfn3fpqntq34z@wittgenstein/
> Cc: Christian Brauner <christian.brauner@ubuntu.com>
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> ---

Was about to look at this. Does this require preliminary patches since
it doesn't apply to current master. If so, can you just give me a link
to a branch so I can pull from that? :)

Christian

> 
> Jan,
> 
> After trying several different approaches, I finally realized that
> making FAN_CREATE available for mount marks is not that hard and it could
> be very useful IMO.
> 
> Adding support for other "inode events" with mount mark, such as
> FAN_ATTRIB, FAN_DELETE, FAN_MOVE may also be possible, but adding support
> for FAN_CREATE was really easy due to the fact that all call sites are
> already surrounded by filename_creat()/done_path_create() calls.
> 
> Also, there is an inherent a-symetry between FAN_CREATE and other
> events. All the rest of the events may be set when watching a postive
> path, for example, to know when a path of a bind mount that was
> "injected" to a container was moved or deleted, it is possible to start
> watching that directory before injecting the bind mount.
> 
> It is not possible to do the same with a "negative" path to know when
> a positive dentry was instantiated at that path.
> 
> This patch provides functionality that is independant of other changes,
> but I also tested it along with other changes that demonstrate how it
> would be utilized in userns setups [1][2].
> 
> As can be seen in dcache.h patch, this patch comes on top a revert patch
> to reclaim an unused dentry flag. If you accept this proposal, I will
> post the full series.
> 
> Thanks,
> Amir.
> 
> [1] https://github.com/amir73il/linux/commits/fanotify_userns
> [2] https://github.com/amir73il/inotify-tools/commits/fanotify_userns
> 
>  fs/namei.c               | 21 ++++++++++++++++++++-
>  include/linux/dcache.h   |  2 +-
>  include/linux/fanotify.h |  8 ++++----
>  include/linux/fsnotify.h | 36 ++++++++++++++++++++++++++++++++++++
>  4 files changed, 61 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/namei.c b/fs/namei.c
> index 216f16e74351..cf979e956938 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -3288,7 +3288,7 @@ static const char *open_last_lookups(struct nameidata *nd,
>  		inode_lock_shared(dir->d_inode);
>  	dentry = lookup_open(nd, file, op, got_write);
>  	if (!IS_ERR(dentry) && (file->f_mode & FMODE_CREATED))
> -		fsnotify_create(dir->d_inode, dentry);
> +		fsnotify_path_create(&nd->path, dentry);
>  	if (open_flag & O_CREAT)
>  		inode_unlock(dir->d_inode);
>  	else
> @@ -3560,6 +3560,20 @@ struct file *do_file_open_root(struct dentry *dentry, struct vfsmount *mnt,
>  	return file;
>  }
>  
> +static void d_set_path_create(struct dentry *dentry)
> +{
> +	spin_lock(&dentry->d_lock);
> +	dentry->d_flags |= DCACHE_PATH_CREATE;
> +	spin_unlock(&dentry->d_lock);
> +}
> +
> +static void d_clear_path_create(struct dentry *dentry)
> +{
> +	spin_lock(&dentry->d_lock);
> +	dentry->d_flags &= ~DCACHE_PATH_CREATE;
> +	spin_unlock(&dentry->d_lock);
> +}
> +
>  static struct dentry *filename_create(int dfd, struct filename *name,
>  				struct path *path, unsigned int lookup_flags)
>  {
> @@ -3617,6 +3631,8 @@ static struct dentry *filename_create(int dfd, struct filename *name,
>  		goto fail;
>  	}
>  	putname(name);
> +	/* Start "path create" context that ends in done_path_create() */
> +	d_set_path_create(dentry);
>  	return dentry;
>  fail:
>  	dput(dentry);
> @@ -3641,6 +3657,9 @@ EXPORT_SYMBOL(kern_path_create);
>  
>  void done_path_create(struct path *path, struct dentry *dentry)
>  {
> +	if (d_inode(dentry))
> +		fsnotify_path_create(path, dentry);
> +	d_clear_path_create(dentry);
>  	dput(dentry);
>  	inode_unlock(path->dentry->d_inode);
>  	mnt_drop_write(path->mnt);
> diff --git a/include/linux/dcache.h b/include/linux/dcache.h
> index 4225caa8cf02..d153793d5b95 100644
> --- a/include/linux/dcache.h
> +++ b/include/linux/dcache.h
> @@ -213,7 +213,7 @@ struct dentry_operations {
>  #define DCACHE_SYMLINK_TYPE		0x00600000 /* Symlink (or fallthru to such) */
>  
>  #define DCACHE_MAY_FREE			0x00800000
> -/* Was #define DCACHE_FALLTHRU			0x01000000 */
> +#define DCACHE_PATH_CREATE		0x01000000 /* "path_create" context */
>  #define DCACHE_NOKEY_NAME		0x02000000 /* Encrypted name encoded without key */
>  #define DCACHE_OP_REAL			0x04000000
>  
> diff --git a/include/linux/fanotify.h b/include/linux/fanotify.h
> index bad41bcb25df..f0c5a4a82b6e 100644
> --- a/include/linux/fanotify.h
> +++ b/include/linux/fanotify.h
> @@ -65,10 +65,10 @@ extern struct ctl_table fanotify_table[]; /* for sysctl */
>  
>  /*
>   * Events that can be reported with data type FSNOTIFY_EVENT_PATH.
> - * Note that FAN_MODIFY can also be reported with data type
> + * Note that FAN_MODIFY and FAN_CREATE can also be reported with data type
>   * FSNOTIFY_EVENT_INODE.
>   */
> -#define FANOTIFY_PATH_EVENTS	(FAN_ACCESS | FAN_MODIFY | \
> +#define FANOTIFY_PATH_EVENTS	(FAN_ACCESS | FAN_MODIFY | FAN_CREATE | \
>  				 FAN_CLOSE | FAN_OPEN | FAN_OPEN_EXEC)
>  
>  /*
> @@ -78,8 +78,8 @@ extern struct ctl_table fanotify_table[]; /* for sysctl */
>  #define FANOTIFY_DIRENT_EVENTS	(FAN_MOVE | FAN_CREATE | FAN_DELETE)
>  
>  /* Events that can only be reported with data type FSNOTIFY_EVENT_INODE */
> -#define FANOTIFY_INODE_EVENTS	(FANOTIFY_DIRENT_EVENTS | \
> -				 FAN_ATTRIB | FAN_MOVE_SELF | FAN_DELETE_SELF)
> +#define FANOTIFY_INODE_EVENTS	(FAN_MOVE | FAN_DELETE | FAN_ATTRIB | \
> +				 FAN_MOVE_SELF | FAN_DELETE_SELF)
>  
>  /* Events that user can request to be notified on */
>  #define FANOTIFY_EVENTS		(FANOTIFY_PATH_EVENTS | \
> diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
> index f8acddcf54fb..9a3d9f7beeb2 100644
> --- a/include/linux/fsnotify.h
> +++ b/include/linux/fsnotify.h
> @@ -179,6 +179,30 @@ static inline void fsnotify_inoderemove(struct inode *inode)
>  	__fsnotify_inode_delete(inode);
>  }
>  
> +/*
> + * fsnotify_path_create - an inode was linked to namespace
> + *
> + * This higher level hook is called in addition to fsnotify_create(),
> + * fsnotify_mkdir() and fsnotify_link() vfs hooks when the mount context is
> + * available, so that FS_CREATE event can be added to a mount mark mask.
> + *
> + * In that case the, DCACHE_PATH_CREATE flag is set to suppress the FS_CREATE
> + * event in the lower level vfs hooks.
> + */
> +static inline void fsnotify_path_create(struct path *path,
> +					struct dentry *child)
> +{
> +	struct inode *dir = path->dentry->d_inode;
> +	__u32 mask = FS_CREATE;
> +
> +	WARN_ON_ONCE(!inode_is_locked(dir));
> +
> +	if (S_ISDIR(d_inode(child)->i_mode))
> +		mask |= FS_ISDIR;
> +
> +	fsnotify(mask, path, FSNOTIFY_EVENT_PATH, dir, &child->d_name, NULL, 0);
> +}
> +
>  /*
>   * fsnotify_create - 'name' was linked in
>   */
> @@ -186,6 +210,10 @@ static inline void fsnotify_create(struct inode *inode, struct dentry *dentry)
>  {
>  	audit_inode_child(inode, dentry, AUDIT_TYPE_CHILD_CREATE);
>  
> +	/* fsnotify_path_create() will be called */
> +	if (dentry->d_flags & DCACHE_PATH_CREATE)
> +		return;
> +
>  	fsnotify_dirent(inode, dentry, FS_CREATE);
>  }
>  
> @@ -200,6 +228,10 @@ static inline void fsnotify_link(struct inode *dir, struct inode *inode,
>  	fsnotify_link_count(inode);
>  	audit_inode_child(dir, new_dentry, AUDIT_TYPE_CHILD_CREATE);
>  
> +	/* fsnotify_path_create() will be called */
> +	if (new_dentry->d_flags & DCACHE_PATH_CREATE)
> +		return;
> +
>  	fsnotify_name(dir, FS_CREATE, inode, &new_dentry->d_name, 0);
>  }
>  
> @@ -223,6 +255,10 @@ static inline void fsnotify_mkdir(struct inode *inode, struct dentry *dentry)
>  {
>  	audit_inode_child(inode, dentry, AUDIT_TYPE_CHILD_CREATE);
>  
> +	/* fsnotify_path_create() will be called */
> +	if (dentry->d_flags & DCACHE_PATH_CREATE)
> +		return;
> +
>  	fsnotify_dirent(inode, dentry, FS_CREATE | FS_ISDIR);
>  }
>  
> -- 
> 2.30.0
> 

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask
  2021-03-30  7:31 ` Christian Brauner
@ 2021-03-30  9:31   ` Amir Goldstein
  2021-03-30 16:24     ` Amir Goldstein
  0 siblings, 1 reply; 61+ messages in thread
From: Amir Goldstein @ 2021-03-30  9:31 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Jan Kara, linux-fsdevel, Linux API

On Tue, Mar 30, 2021 at 10:31 AM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
>
> On Sun, Mar 28, 2021 at 06:56:24PM +0300, Amir Goldstein wrote:
> > Add a high level hook fsnotify_path_create() which is called from
> > syscall context where mount context is available, so that FAN_CREATE
> > event can be added to a mount mark mask.
> >
> > This high level hook is called in addition to fsnotify_create(),
> > fsnotify_mkdir() and fsnotify_link() hooks in vfs helpers where the mount
> > context is not available.
> >
> > In the context where fsnotify_path_create() will be called, a dentry flag
> > flag is set on the new dentry the suppress the FS_CREATE event in the vfs
> > level hooks.
> >
> > This functionality was requested by Christian Brauner to replace
> > recursive inotify watches for detecting when some path was created under
> > an idmapped mount without having to monitor FAN_CREATE events in the
> > entire filesystem.
> >
> > In combination with more changes to allow unprivileged fanotify listener
> > to watch an idmapped mount, this functionality would be usable also by
> > nested container managers.
> >
> > Link: https://lore.kernel.org/linux-fsdevel/20210318143140.jxycfn3fpqntq34z@wittgenstein/
> > Cc: Christian Brauner <christian.brauner@ubuntu.com>
> > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > ---
>
> Was about to look at this. Does this require preliminary patches since
> it doesn't apply to current master. If so, can you just give me a link
> to a branch so I can pull from that? :)
>

The patch is less useful on its own.
Better take the entire work for the demo which includes this patch:

[1] https://github.com/amir73il/linux/commits/fanotify_userns
[2] https://github.com/amir73il/inotify-tools/commits/fanotify_userns

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask
  2021-03-28 15:56 [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask Amir Goldstein
  2021-03-30  7:31 ` Christian Brauner
@ 2021-03-30 12:12 ` Christian Brauner
  2021-03-30 12:33   ` Amir Goldstein
  2021-03-30 12:20 ` Amir Goldstein
  2 siblings, 1 reply; 61+ messages in thread
From: Christian Brauner @ 2021-03-30 12:12 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel, linux-api

On Sun, Mar 28, 2021 at 06:56:24PM +0300, Amir Goldstein wrote:
> Add a high level hook fsnotify_path_create() which is called from
> syscall context where mount context is available, so that FAN_CREATE
> event can be added to a mount mark mask.
> 
> This high level hook is called in addition to fsnotify_create(),
> fsnotify_mkdir() and fsnotify_link() hooks in vfs helpers where the mount
> context is not available.
> 
> In the context where fsnotify_path_create() will be called, a dentry flag
> flag is set on the new dentry the suppress the FS_CREATE event in the vfs
> level hooks.

Ok, just to make sure this scheme would also work for overlay-style
filesystems like ecryptfs where you possible generate two notify events:
- in the ecryptfs layer
- in the lower fs layer
at least when you set a regular inode watch.

If you set a mount watch you ideally would generate two events in both
layers too, right? But afaict that wouldn't work.

Say, someone creates a new link in ecryptfs the DENTRY_PATH_CREATE
flag will be set on the new ecryptfs dentry and so no notify event will
be generated for the ecryptfs layer again. Then ecryptfs calls
vfs_link() to create a new dentry in the lower layer. The new dentry in
the lower layer won't have DCACHE_PATH_CREATE set. Ok, that makes sense.

But since vfs_link() doesn't have access to the mnt context itself you
can't generate a notify event for the mount associated with the lower
fs. This would cause people who a FAN_MARK_MOUNT watch on that lower fs
mount to not get notified about creation events going through the
ecryptfs layer. Is that right?  Seems like this could be a problem.

Christian

> 
> This functionality was requested by Christian Brauner to replace
> recursive inotify watches for detecting when some path was created under
> an idmapped mount without having to monitor FAN_CREATE events in the
> entire filesystem.
> 
> In combination with more changes to allow unprivileged fanotify listener
> to watch an idmapped mount, this functionality would be usable also by
> nested container managers.
> 
> Link: https://lore.kernel.org/linux-fsdevel/20210318143140.jxycfn3fpqntq34z@wittgenstein/
> Cc: Christian Brauner <christian.brauner@ubuntu.com>
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> ---
> 
> Jan,
> 
> After trying several different approaches, I finally realized that
> making FAN_CREATE available for mount marks is not that hard and it could
> be very useful IMO.
> 
> Adding support for other "inode events" with mount mark, such as
> FAN_ATTRIB, FAN_DELETE, FAN_MOVE may also be possible, but adding support
> for FAN_CREATE was really easy due to the fact that all call sites are
> already surrounded by filename_creat()/done_path_create() calls.
> 
> Also, there is an inherent a-symetry between FAN_CREATE and other
> events. All the rest of the events may be set when watching a postive
> path, for example, to know when a path of a bind mount that was
> "injected" to a container was moved or deleted, it is possible to start
> watching that directory before injecting the bind mount.
> 
> It is not possible to do the same with a "negative" path to know when
> a positive dentry was instantiated at that path.
> 
> This patch provides functionality that is independant of other changes,
> but I also tested it along with other changes that demonstrate how it
> would be utilized in userns setups [1][2].
> 
> As can be seen in dcache.h patch, this patch comes on top a revert patch
> to reclaim an unused dentry flag. If you accept this proposal, I will
> post the full series.
> 
> Thanks,
> Amir.
> 
> [1] https://github.com/amir73il/linux/commits/fanotify_userns
> [2] https://github.com/amir73il/inotify-tools/commits/fanotify_userns
> 
>  fs/namei.c               | 21 ++++++++++++++++++++-
>  include/linux/dcache.h   |  2 +-
>  include/linux/fanotify.h |  8 ++++----
>  include/linux/fsnotify.h | 36 ++++++++++++++++++++++++++++++++++++
>  4 files changed, 61 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/namei.c b/fs/namei.c
> index 216f16e74351..cf979e956938 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -3288,7 +3288,7 @@ static const char *open_last_lookups(struct nameidata *nd,
>  		inode_lock_shared(dir->d_inode);
>  	dentry = lookup_open(nd, file, op, got_write);
>  	if (!IS_ERR(dentry) && (file->f_mode & FMODE_CREATED))
> -		fsnotify_create(dir->d_inode, dentry);
> +		fsnotify_path_create(&nd->path, dentry);
>  	if (open_flag & O_CREAT)
>  		inode_unlock(dir->d_inode);
>  	else
> @@ -3560,6 +3560,20 @@ struct file *do_file_open_root(struct dentry *dentry, struct vfsmount *mnt,
>  	return file;
>  }
>  
> +static void d_set_path_create(struct dentry *dentry)
> +{
> +	spin_lock(&dentry->d_lock);
> +	dentry->d_flags |= DCACHE_PATH_CREATE;
> +	spin_unlock(&dentry->d_lock);
> +}
> +
> +static void d_clear_path_create(struct dentry *dentry)
> +{
> +	spin_lock(&dentry->d_lock);
> +	dentry->d_flags &= ~DCACHE_PATH_CREATE;
> +	spin_unlock(&dentry->d_lock);
> +}
> +
>  static struct dentry *filename_create(int dfd, struct filename *name,
>  				struct path *path, unsigned int lookup_flags)
>  {
> @@ -3617,6 +3631,8 @@ static struct dentry *filename_create(int dfd, struct filename *name,
>  		goto fail;
>  	}
>  	putname(name);
> +	/* Start "path create" context that ends in done_path_create() */
> +	d_set_path_create(dentry);
>  	return dentry;
>  fail:
>  	dput(dentry);
> @@ -3641,6 +3657,9 @@ EXPORT_SYMBOL(kern_path_create);
>  
>  void done_path_create(struct path *path, struct dentry *dentry)
>  {
> +	if (d_inode(dentry))
> +		fsnotify_path_create(path, dentry);
> +	d_clear_path_create(dentry);
>  	dput(dentry);
>  	inode_unlock(path->dentry->d_inode);
>  	mnt_drop_write(path->mnt);
> diff --git a/include/linux/dcache.h b/include/linux/dcache.h
> index 4225caa8cf02..d153793d5b95 100644
> --- a/include/linux/dcache.h
> +++ b/include/linux/dcache.h
> @@ -213,7 +213,7 @@ struct dentry_operations {
>  #define DCACHE_SYMLINK_TYPE		0x00600000 /* Symlink (or fallthru to such) */
>  
>  #define DCACHE_MAY_FREE			0x00800000
> -/* Was #define DCACHE_FALLTHRU			0x01000000 */

Indeed, that seems completely unused.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask
  2021-03-28 15:56 [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask Amir Goldstein
  2021-03-30  7:31 ` Christian Brauner
  2021-03-30 12:12 ` [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask Christian Brauner
@ 2021-03-30 12:20 ` Amir Goldstein
  2 siblings, 0 replies; 61+ messages in thread
From: Amir Goldstein @ 2021-03-30 12:20 UTC (permalink / raw)
  To: Jan Kara; +Cc: Christian Brauner, linux-fsdevel, Linux API

On Sun, Mar 28, 2021 at 6:56 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> Add a high level hook fsnotify_path_create() which is called from
> syscall context where mount context is available, so that FAN_CREATE
> event can be added to a mount mark mask.
>
> This high level hook is called in addition to fsnotify_create(),
> fsnotify_mkdir() and fsnotify_link() hooks in vfs helpers where the mount
> context is not available.
>
> In the context where fsnotify_path_create() will be called, a dentry flag
> flag is set on the new dentry the suppress the FS_CREATE event in the vfs
> level hooks.
>
> This functionality was requested by Christian Brauner to replace
> recursive inotify watches for detecting when some path was created under
> an idmapped mount without having to monitor FAN_CREATE events in the
> entire filesystem.
>
> In combination with more changes to allow unprivileged fanotify listener
> to watch an idmapped mount, this functionality would be usable also by
> nested container managers.
>
> Link: https://lore.kernel.org/linux-fsdevel/20210318143140.jxycfn3fpqntq34z@wittgenstein/
> Cc: Christian Brauner <christian.brauner@ubuntu.com>
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> ---
>
> Jan,
>
> After trying several different approaches, I finally realized that
> making FAN_CREATE available for mount marks is not that hard and it could
> be very useful IMO.
>
> Adding support for other "inode events" with mount mark, such as
> FAN_ATTRIB, FAN_DELETE, FAN_MOVE may also be possible, but adding support
> for FAN_CREATE was really easy due to the fact that all call sites are
> already surrounded by filename_creat()/done_path_create() calls.
>

FWIW, adding support for FAN_DELETE and FAN_MOVE_SELF was not
so hard. The move event at least will also be needed for the use case
where watching when a negative path is instantiated.

https://github.com/amir73il/linux/commits/fsnotify_path_hooks

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask
  2021-03-30 12:12 ` [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask Christian Brauner
@ 2021-03-30 12:33   ` Amir Goldstein
  2021-03-30 12:53     ` Christian Brauner
  0 siblings, 1 reply; 61+ messages in thread
From: Amir Goldstein @ 2021-03-30 12:33 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Jan Kara, linux-fsdevel, Linux API

On Tue, Mar 30, 2021 at 3:12 PM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
>
> On Sun, Mar 28, 2021 at 06:56:24PM +0300, Amir Goldstein wrote:
> > Add a high level hook fsnotify_path_create() which is called from
> > syscall context where mount context is available, so that FAN_CREATE
> > event can be added to a mount mark mask.
> >
> > This high level hook is called in addition to fsnotify_create(),
> > fsnotify_mkdir() and fsnotify_link() hooks in vfs helpers where the mount
> > context is not available.
> >
> > In the context where fsnotify_path_create() will be called, a dentry flag
> > flag is set on the new dentry the suppress the FS_CREATE event in the vfs
> > level hooks.
>
> Ok, just to make sure this scheme would also work for overlay-style
> filesystems like ecryptfs where you possible generate two notify events:
> - in the ecryptfs layer
> - in the lower fs layer
> at least when you set a regular inode watch.
>
> If you set a mount watch you ideally would generate two events in both
> layers too, right? But afaict that wouldn't work.
>
> Say, someone creates a new link in ecryptfs the DENTRY_PATH_CREATE
> flag will be set on the new ecryptfs dentry and so no notify event will
> be generated for the ecryptfs layer again. Then ecryptfs calls
> vfs_link() to create a new dentry in the lower layer. The new dentry in
> the lower layer won't have DCACHE_PATH_CREATE set. Ok, that makes sense.
>
> But since vfs_link() doesn't have access to the mnt context itself you
> can't generate a notify event for the mount associated with the lower
> fs. This would cause people who a FAN_MARK_MOUNT watch on that lower fs
> mount to not get notified about creation events going through the
> ecryptfs layer. Is that right?  Seems like this could be a problem.
>

Not sure I follow what the problem might be.

FAN_MARK_MOUNT subscribes to get only events that were
generated via that vfsmount - that has been that way forever.

A listener may subscribe to (say) FAN_CREATE on a certain
mount AND also also on a specific parent directory.

If the listener is watching the entire ecryptfs mount and the
specific lower directory where said vfs_link() happens, both
events will be reported. One from fsnotify_create_path() and
the lower from fsnotify_create().

If one listener is watching the ecryptfs mount and another
listener is watching the specific ecryptfs directory, both
listeners will get a single event each. They will both get
the event that is emitted from fsnotify_path_create().

Besides I am not sure about ecryptfs, but overlayfs uses
private mount clone for accessing lower layer, so by definition
users cannot watch the underlying overlayfs operations using
a mount mark. Also, overlayfs suppresses fsnotify events on
underlying files intentionally with FMODE_NONOTIFY.

Hope that answers your question.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask
  2021-03-30 12:33   ` Amir Goldstein
@ 2021-03-30 12:53     ` Christian Brauner
  2021-03-30 12:55       ` Christian Brauner
  2021-03-30 13:54       ` Amir Goldstein
  0 siblings, 2 replies; 61+ messages in thread
From: Christian Brauner @ 2021-03-30 12:53 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel, Linux API

On Tue, Mar 30, 2021 at 03:33:23PM +0300, Amir Goldstein wrote:
> On Tue, Mar 30, 2021 at 3:12 PM Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
> >
> > On Sun, Mar 28, 2021 at 06:56:24PM +0300, Amir Goldstein wrote:
> > > Add a high level hook fsnotify_path_create() which is called from
> > > syscall context where mount context is available, so that FAN_CREATE
> > > event can be added to a mount mark mask.
> > >
> > > This high level hook is called in addition to fsnotify_create(),
> > > fsnotify_mkdir() and fsnotify_link() hooks in vfs helpers where the mount
> > > context is not available.
> > >
> > > In the context where fsnotify_path_create() will be called, a dentry flag
> > > flag is set on the new dentry the suppress the FS_CREATE event in the vfs
> > > level hooks.
> >
> > Ok, just to make sure this scheme would also work for overlay-style
> > filesystems like ecryptfs where you possible generate two notify events:
> > - in the ecryptfs layer
> > - in the lower fs layer
> > at least when you set a regular inode watch.
> >
> > If you set a mount watch you ideally would generate two events in both
> > layers too, right? But afaict that wouldn't work.
> >
> > Say, someone creates a new link in ecryptfs the DENTRY_PATH_CREATE
> > flag will be set on the new ecryptfs dentry and so no notify event will
> > be generated for the ecryptfs layer again. Then ecryptfs calls
> > vfs_link() to create a new dentry in the lower layer. The new dentry in
> > the lower layer won't have DCACHE_PATH_CREATE set. Ok, that makes sense.
> >
> > But since vfs_link() doesn't have access to the mnt context itself you
> > can't generate a notify event for the mount associated with the lower
> > fs. This would cause people who a FAN_MARK_MOUNT watch on that lower fs
> > mount to not get notified about creation events going through the
> > ecryptfs layer. Is that right?  Seems like this could be a problem.
> >
> 
> Not sure I follow what the problem might be.
> 
> FAN_MARK_MOUNT subscribes to get only events that were
> generated via that vfsmount - that has been that way forever.
> 
> A listener may subscribe to (say) FAN_CREATE on a certain
> mount AND also also on a specific parent directory.
> 
> If the listener is watching the entire ecryptfs mount and the
> specific lower directory where said vfs_link() happens, both
> events will be reported. One from fsnotify_create_path() and
> the lower from fsnotify_create().
> 
> If one listener is watching the ecryptfs mount and another
> listener is watching the specific ecryptfs directory, both
> listeners will get a single event each. They will both get
> the event that is emitted from fsnotify_path_create().
> 
> Besides I am not sure about ecryptfs, but overlayfs uses
> private mount clone for accessing lower layer, so by definition

I know. That's why I was using ecryptfs as an example which doesn't do
that (And I think it should be switched tbh.). It simply uses
kern_path() and then stashes that path.

My example probably would be something like:

mount -t ext4 /dev/sdb /A

1. FAN_MARK_MOUNT(/A)

mount --bind /A /B

2. FAN_MARK_MOUNT(/B)

mount -t ecryptfs /B /C

3. FAN_MARK_MOUNT(/C)

let's say I now do

touch /unencrypted/bla

I may be way off here but intuitively it seems both 1. and 2. should get
a creation event but not 3., right?

But with your proposal would both 1. and 2. still get a creation event?

> users cannot watch the underlying overlayfs operations using
> a mount mark. Also, overlayfs suppresses fsnotify events on
> underlying files intentionally with FMODE_NONOTIFY.

Probably ecryptfs should too?

Christian

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask
  2021-03-30 12:53     ` Christian Brauner
@ 2021-03-30 12:55       ` Christian Brauner
  2021-03-30 13:54       ` Amir Goldstein
  1 sibling, 0 replies; 61+ messages in thread
From: Christian Brauner @ 2021-03-30 12:55 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel, Linux API

On Tue, Mar 30, 2021 at 02:53:36PM +0200, Christian Brauner wrote:
> On Tue, Mar 30, 2021 at 03:33:23PM +0300, Amir Goldstein wrote:
> > On Tue, Mar 30, 2021 at 3:12 PM Christian Brauner
> > <christian.brauner@ubuntu.com> wrote:
> > >
> > > On Sun, Mar 28, 2021 at 06:56:24PM +0300, Amir Goldstein wrote:
> > > > Add a high level hook fsnotify_path_create() which is called from
> > > > syscall context where mount context is available, so that FAN_CREATE
> > > > event can be added to a mount mark mask.
> > > >
> > > > This high level hook is called in addition to fsnotify_create(),
> > > > fsnotify_mkdir() and fsnotify_link() hooks in vfs helpers where the mount
> > > > context is not available.
> > > >
> > > > In the context where fsnotify_path_create() will be called, a dentry flag
> > > > flag is set on the new dentry the suppress the FS_CREATE event in the vfs
> > > > level hooks.
> > >
> > > Ok, just to make sure this scheme would also work for overlay-style
> > > filesystems like ecryptfs where you possible generate two notify events:
> > > - in the ecryptfs layer
> > > - in the lower fs layer
> > > at least when you set a regular inode watch.
> > >
> > > If you set a mount watch you ideally would generate two events in both
> > > layers too, right? But afaict that wouldn't work.
> > >
> > > Say, someone creates a new link in ecryptfs the DENTRY_PATH_CREATE
> > > flag will be set on the new ecryptfs dentry and so no notify event will
> > > be generated for the ecryptfs layer again. Then ecryptfs calls
> > > vfs_link() to create a new dentry in the lower layer. The new dentry in
> > > the lower layer won't have DCACHE_PATH_CREATE set. Ok, that makes sense.
> > >
> > > But since vfs_link() doesn't have access to the mnt context itself you
> > > can't generate a notify event for the mount associated with the lower
> > > fs. This would cause people who a FAN_MARK_MOUNT watch on that lower fs
> > > mount to not get notified about creation events going through the
> > > ecryptfs layer. Is that right?  Seems like this could be a problem.
> > >
> > 
> > Not sure I follow what the problem might be.
> > 
> > FAN_MARK_MOUNT subscribes to get only events that were
> > generated via that vfsmount - that has been that way forever.
> > 
> > A listener may subscribe to (say) FAN_CREATE on a certain
> > mount AND also also on a specific parent directory.
> > 
> > If the listener is watching the entire ecryptfs mount and the
> > specific lower directory where said vfs_link() happens, both
> > events will be reported. One from fsnotify_create_path() and
> > the lower from fsnotify_create().
> > 
> > If one listener is watching the ecryptfs mount and another
> > listener is watching the specific ecryptfs directory, both
> > listeners will get a single event each. They will both get
> > the event that is emitted from fsnotify_path_create().
> > 
> > Besides I am not sure about ecryptfs, but overlayfs uses
> > private mount clone for accessing lower layer, so by definition
> 
> I know. That's why I was using ecryptfs as an example which doesn't do
> that (And I think it should be switched tbh.). It simply uses
> kern_path() and then stashes that path.
> 
> My example probably would be something like:
> 
> mount -t ext4 /dev/sdb /A
> 
> 1. FAN_MARK_MOUNT(/A)
> 
> mount --bind /A /B
> 
> 2. FAN_MARK_MOUNT(/B)
> 
> mount -t ecryptfs /B /C
> 
> 3. FAN_MARK_MOUNT(/C)
> 
> let's say I now do
> 
> touch /unencrypted/bla

touch /C/bla

> 
> I may be way off here but intuitively it seems both 1. and 2. should get
> a creation event but not 3., right?
> 
> But with your proposal would both 1. and 2. still get a creation event?
> 
> > users cannot watch the underlying overlayfs operations using
> > a mount mark. Also, overlayfs suppresses fsnotify events on
> > underlying files intentionally with FMODE_NONOTIFY.
> 
> Probably ecryptfs should too?
> 
> Christian

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask
  2021-03-30 12:53     ` Christian Brauner
  2021-03-30 12:55       ` Christian Brauner
@ 2021-03-30 13:54       ` Amir Goldstein
  2021-03-30 14:17         ` Christian Brauner
  1 sibling, 1 reply; 61+ messages in thread
From: Amir Goldstein @ 2021-03-30 13:54 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Jan Kara, linux-fsdevel, Linux API, Miklos Szeredi

On Tue, Mar 30, 2021 at 3:53 PM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
>
> On Tue, Mar 30, 2021 at 03:33:23PM +0300, Amir Goldstein wrote:
> > On Tue, Mar 30, 2021 at 3:12 PM Christian Brauner
> > <christian.brauner@ubuntu.com> wrote:
> > >
> > > On Sun, Mar 28, 2021 at 06:56:24PM +0300, Amir Goldstein wrote:
> > > > Add a high level hook fsnotify_path_create() which is called from
> > > > syscall context where mount context is available, so that FAN_CREATE
> > > > event can be added to a mount mark mask.
> > > >
> > > > This high level hook is called in addition to fsnotify_create(),
> > > > fsnotify_mkdir() and fsnotify_link() hooks in vfs helpers where the mount
> > > > context is not available.
> > > >
> > > > In the context where fsnotify_path_create() will be called, a dentry flag
> > > > flag is set on the new dentry the suppress the FS_CREATE event in the vfs
> > > > level hooks.
> > >
> > > Ok, just to make sure this scheme would also work for overlay-style
> > > filesystems like ecryptfs where you possible generate two notify events:
> > > - in the ecryptfs layer
> > > - in the lower fs layer
> > > at least when you set a regular inode watch.
> > >
> > > If you set a mount watch you ideally would generate two events in both
> > > layers too, right? But afaict that wouldn't work.
> > >
> > > Say, someone creates a new link in ecryptfs the DENTRY_PATH_CREATE
> > > flag will be set on the new ecryptfs dentry and so no notify event will
> > > be generated for the ecryptfs layer again. Then ecryptfs calls
> > > vfs_link() to create a new dentry in the lower layer. The new dentry in
> > > the lower layer won't have DCACHE_PATH_CREATE set. Ok, that makes sense.
> > >
> > > But since vfs_link() doesn't have access to the mnt context itself you
> > > can't generate a notify event for the mount associated with the lower
> > > fs. This would cause people who a FAN_MARK_MOUNT watch on that lower fs
> > > mount to not get notified about creation events going through the
> > > ecryptfs layer. Is that right?  Seems like this could be a problem.
> > >
> >
> > Not sure I follow what the problem might be.
> >
> > FAN_MARK_MOUNT subscribes to get only events that were
> > generated via that vfsmount - that has been that way forever.
> >
> > A listener may subscribe to (say) FAN_CREATE on a certain
> > mount AND also also on a specific parent directory.
> >
> > If the listener is watching the entire ecryptfs mount and the
> > specific lower directory where said vfs_link() happens, both
> > events will be reported. One from fsnotify_create_path() and
> > the lower from fsnotify_create().
> >
> > If one listener is watching the ecryptfs mount and another
> > listener is watching the specific ecryptfs directory, both
> > listeners will get a single event each. They will both get
> > the event that is emitted from fsnotify_path_create().
> >
> > Besides I am not sure about ecryptfs, but overlayfs uses
> > private mount clone for accessing lower layer, so by definition
>
> I know. That's why I was using ecryptfs as an example which doesn't do
> that (And I think it should be switched tbh.). It simply uses
> kern_path() and then stashes that path.
>
> My example probably would be something like:
>
> mount -t ext4 /dev/sdb /A
>
> 1. FAN_MARK_MOUNT(/A)
>
> mount --bind /A /B
>
> 2. FAN_MARK_MOUNT(/B)
>
> mount -t ecryptfs /B /C
>
> 3. FAN_MARK_MOUNT(/C)
>
> let's say I now do
>
> touch /C/bla
>
> I may be way off here but intuitively it seems both 1. and 2. should get
> a creation event but not 3., right?
>

Why not 3?
You explicitly set a mark on /C requesting to be notified when
objects are created via /C.

> But with your proposal would both 1. and 2. still get a creation event?
>

They would not get an event, because fsnotify() looks for CREATE event
subscribers on inode->i_fsnotify_marks and inode->i_sb_s_fsnotify_marks
and does not find any.

The vfs_create() -> fsnotify_create() hook passes data_type inode to
fsnotify() so there is no fsnotify_data_path() to extract mnt event
subscribers from.

The same fate would be to files created by overlayfs, nfsd and cachefiles.

Only the create event on /C/bla from the syscall context would
call fsnoity_path_create() and result with path data in fsnotify(), so
the mnt event subscriber would be found.

> > users cannot watch the underlying overlayfs operations using
> > a mount mark. Also, overlayfs suppresses fsnotify events on
> > underlying files intentionally with FMODE_NONOTIFY.
>
> Probably ecryptfs should too?
>

<shrug> :)

FMODE_NONOTIFY is there not because there was a requirement
not to send events, but because the path of the internal file is
"fake", so it has a weird looking path. After all there are many
other events that would be sent (not on open files).

At least I think that's the reason...

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask
  2021-03-30 13:54       ` Amir Goldstein
@ 2021-03-30 14:17         ` Christian Brauner
  2021-03-30 14:56           ` Amir Goldstein
  0 siblings, 1 reply; 61+ messages in thread
From: Christian Brauner @ 2021-03-30 14:17 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel, Linux API, Miklos Szeredi

On Tue, Mar 30, 2021 at 04:54:02PM +0300, Amir Goldstein wrote:
> On Tue, Mar 30, 2021 at 3:53 PM Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
> >
> > On Tue, Mar 30, 2021 at 03:33:23PM +0300, Amir Goldstein wrote:
> > > On Tue, Mar 30, 2021 at 3:12 PM Christian Brauner
> > > <christian.brauner@ubuntu.com> wrote:
> > > >
> > > > On Sun, Mar 28, 2021 at 06:56:24PM +0300, Amir Goldstein wrote:
> > > > > Add a high level hook fsnotify_path_create() which is called from
> > > > > syscall context where mount context is available, so that FAN_CREATE
> > > > > event can be added to a mount mark mask.
> > > > >
> > > > > This high level hook is called in addition to fsnotify_create(),
> > > > > fsnotify_mkdir() and fsnotify_link() hooks in vfs helpers where the mount
> > > > > context is not available.
> > > > >
> > > > > In the context where fsnotify_path_create() will be called, a dentry flag
> > > > > flag is set on the new dentry the suppress the FS_CREATE event in the vfs
> > > > > level hooks.
> > > >
> > > > Ok, just to make sure this scheme would also work for overlay-style
> > > > filesystems like ecryptfs where you possible generate two notify events:
> > > > - in the ecryptfs layer
> > > > - in the lower fs layer
> > > > at least when you set a regular inode watch.
> > > >
> > > > If you set a mount watch you ideally would generate two events in both
> > > > layers too, right? But afaict that wouldn't work.
> > > >
> > > > Say, someone creates a new link in ecryptfs the DENTRY_PATH_CREATE
> > > > flag will be set on the new ecryptfs dentry and so no notify event will
> > > > be generated for the ecryptfs layer again. Then ecryptfs calls
> > > > vfs_link() to create a new dentry in the lower layer. The new dentry in
> > > > the lower layer won't have DCACHE_PATH_CREATE set. Ok, that makes sense.
> > > >
> > > > But since vfs_link() doesn't have access to the mnt context itself you
> > > > can't generate a notify event for the mount associated with the lower
> > > > fs. This would cause people who a FAN_MARK_MOUNT watch on that lower fs
> > > > mount to not get notified about creation events going through the
> > > > ecryptfs layer. Is that right?  Seems like this could be a problem.
> > > >
> > >
> > > Not sure I follow what the problem might be.
> > >
> > > FAN_MARK_MOUNT subscribes to get only events that were
> > > generated via that vfsmount - that has been that way forever.
> > >
> > > A listener may subscribe to (say) FAN_CREATE on a certain
> > > mount AND also also on a specific parent directory.
> > >
> > > If the listener is watching the entire ecryptfs mount and the
> > > specific lower directory where said vfs_link() happens, both
> > > events will be reported. One from fsnotify_create_path() and
> > > the lower from fsnotify_create().
> > >
> > > If one listener is watching the ecryptfs mount and another
> > > listener is watching the specific ecryptfs directory, both
> > > listeners will get a single event each. They will both get
> > > the event that is emitted from fsnotify_path_create().
> > >
> > > Besides I am not sure about ecryptfs, but overlayfs uses
> > > private mount clone for accessing lower layer, so by definition
> >
> > I know. That's why I was using ecryptfs as an example which doesn't do
> > that (And I think it should be switched tbh.). It simply uses
> > kern_path() and then stashes that path.
> >
> > My example probably would be something like:
> >
> > mount -t ext4 /dev/sdb /A
> >
> > 1. FAN_MARK_MOUNT(/A)
> >
> > mount --bind /A /B
> >
> > 2. FAN_MARK_MOUNT(/B)
> >
> > mount -t ecryptfs /B /C
> >
> > 3. FAN_MARK_MOUNT(/C)
> >
> > let's say I now do
> >
> > touch /C/bla
> >
> > I may be way off here but intuitively it seems both 1. and 2. should get
> > a creation event but not 3., right?
> >
> 
> Why not 3?
> You explicitly set a mark on /C requesting to be notified when
> objects are created via /C.

Sorry, that was a typo. I meant to write, both 2. and 3. should get a
creation event but not 1.

> 
> > But with your proposal would both 1. and 2. still get a creation event?
> >

Same obvious typo. The correct question would be: with your proposal do
2. and 3. both get an event?

Because it feels like they both should since /C is mounted on top of /B
and ecryptfs acts as a shim. Both FAN_MARK_MOUNT(/B) and
FAN_MARK_MOUNT(/C) should get a creation event after all both will have
mnt->mnt_fsnotify_marks set.

> 
> They would not get an event, because fsnotify() looks for CREATE event
> subscribers on inode->i_fsnotify_marks and inode->i_sb_s_fsnotify_marks
> and does not find any.

Well yes, but my example has FAN_MARK_MOUNT(/B) set. So fanotify
_should_ look at
	    (!mnt || !mnt->mnt_fsnotify_marks) &&
and see that there are subscribers and should notify the subscribers in
/B even if the file is created through /C.

My point is with your solution this can't be handled and I want to make
sure that this is ok. Because right now you'd not be notified about a
new file having been created in /B even though mnt->mnt_fsnotify_marks
is set and the creation went through /B via /C.

_Unless_ we switch to an argument like overlayfs and say "This is a
private mount which is opaque and so we don't need to generate events.".
Overlayfs handles this cleanly due to clone_private_mount() which will
shed all mnt->mnt_fsnotify_marks and ecryptfs should too if that is the
argument we follow, no?

> 
> The vfs_create() -> fsnotify_create() hook passes data_type inode to
> fsnotify() so there is no fsnotify_data_path() to extract mnt event
> subscribers from.

Right, that was my point. You don't have the mnt context for the
underlying fs at a time when e.g. call vfs_link() which ultimately calls
fsnotify_create/link() which I'm saying might be a problem.

> 
> The same fate would be to files created by overlayfs, nfsd and cachefiles.
> 
> Only the create event on /C/bla from the syscall context would
> call fsnoity_path_create() and result with path data in fsnotify(), so
> the mnt event subscriber would be found.
> 
> > > users cannot watch the underlying overlayfs operations using
> > > a mount mark. Also, overlayfs suppresses fsnotify events on
> > > underlying files intentionally with FMODE_NONOTIFY.
> >
> > Probably ecryptfs should too?

It really feels like ecryptfs should do clone_private_mnt() and probably
cachefiles too. I mentioned this to David just a few weeks ago actually.

Christian

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask
  2021-03-30 14:17         ` Christian Brauner
@ 2021-03-30 14:56           ` Amir Goldstein
  2021-03-31  9:46             ` Christian Brauner
  0 siblings, 1 reply; 61+ messages in thread
From: Amir Goldstein @ 2021-03-30 14:56 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Jan Kara, linux-fsdevel, Linux API, Miklos Szeredi

> > > My example probably would be something like:
> > >
> > > mount -t ext4 /dev/sdb /A
> > >
> > > 1. FAN_MARK_MOUNT(/A)
> > >
> > > mount --bind /A /B
> > >
> > > 2. FAN_MARK_MOUNT(/B)
> > >
> > > mount -t ecryptfs /B /C
> > >
> > > 3. FAN_MARK_MOUNT(/C)
> > >
> > > let's say I now do
> > >
> > > touch /C/bla
> > >
> > > I may be way off here but intuitively it seems both 1. and 2. should get
> > > a creation event but not 3., right?
> > >
> >
> > Why not 3?
> > You explicitly set a mark on /C requesting to be notified when
> > objects are created via /C.
>
> Sorry, that was a typo. I meant to write, both 2. and 3. should get a
> creation event but not 1.
>
> >
> > > But with your proposal would both 1. and 2. still get a creation event?
> > >
>
> Same obvious typo. The correct question would be: with your proposal do
> 2. and 3. both get an event?
>
> Because it feels like they both should since /C is mounted on top of /B
> and ecryptfs acts as a shim. Both FAN_MARK_MOUNT(/B) and
> FAN_MARK_MOUNT(/C) should get a creation event after all both will have
> mnt->mnt_fsnotify_marks set.
>

Right.

There are two ways to address this inconsistency:
1. Change internal callers of vfs_ helpers to use a private mount,
    as you yourself suggested for ecryptfs and cachefiles
2. Add fsnotify_path_ hooks at caller site - that would be the
    correct thing to do for nfsd IMO

> >
> > They would not get an event, because fsnotify() looks for CREATE event
> > subscribers on inode->i_fsnotify_marks and inode->i_sb_s_fsnotify_marks
> > and does not find any.
>
> Well yes, but my example has FAN_MARK_MOUNT(/B) set. So fanotify
> _should_ look at
>             (!mnt || !mnt->mnt_fsnotify_marks) &&
> and see that there are subscribers and should notify the subscribers in
> /B even if the file is created through /C.
>
> My point is with your solution this can't be handled and I want to make
> sure that this is ok. Because right now you'd not be notified about a
> new file having been created in /B even though mnt->mnt_fsnotify_marks
> is set and the creation went through /B via /C.
>

If you are referring to the ecryptfs use case specifically, then I think it is
ok. After all, whether ecryptfs uses a private mount clone or not is not
something the user can know.

> _Unless_ we switch to an argument like overlayfs and say "This is a
> private mount which is opaque and so we don't need to generate events.".
> Overlayfs handles this cleanly due to clone_private_mount() which will
> shed all mnt->mnt_fsnotify_marks and ecryptfs should too if that is the
> argument we follow, no?
>

There is simply no way that the user can infer from the documentation
of FAN_MARK_MOUNT that the event on /B is expected when /B is
underlying layer of ecryptfs or overlayfs.
It requires deep internal knowledge of the stacked fs implementation.
In best case, the user can infer that she MAY get an event on /B.
Some users MAY also expect to get an event on /A because they do not
understand the concept of bind mounts...
Clone a mount ns and you will get more lost users...

> >
> > The vfs_create() -> fsnotify_create() hook passes data_type inode to
> > fsnotify() so there is no fsnotify_data_path() to extract mnt event
> > subscribers from.
>
> Right, that was my point. You don't have the mnt context for the
> underlying fs at a time when e.g. call vfs_link() which ultimately calls
> fsnotify_create/link() which I'm saying might be a problem.
>

It's a problem. If it wasn't a problem I wouldn't need to work around it ;-)

It would be a problem if people think that the FAN_MOUNT_MARK
is a subtree mark, which it certainly is not. And I have no doubt that
as Jan said, people really do want a subtree mark.

My question to you with this RFC is: Does the ability to subscribe to
CREATE/DELETE/MOVE events on a mount help any of your use
cases? With or without the property that mount marks are allowed
inside userns for idmapped mounts.

Note that if we think the semantics of this are useful for container
managers, but too complex for most mortals, we may decide to
restrict the ability to subscribe to those events to idmapped mounts(?).

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask
  2021-03-30  9:31   ` Amir Goldstein
@ 2021-03-30 16:24     ` Amir Goldstein
  2021-03-31 10:08       ` Christian Brauner
  0 siblings, 1 reply; 61+ messages in thread
From: Amir Goldstein @ 2021-03-30 16:24 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Jan Kara, linux-fsdevel, Linux API

On Tue, Mar 30, 2021 at 12:31 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Tue, Mar 30, 2021 at 10:31 AM Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
> >
> > On Sun, Mar 28, 2021 at 06:56:24PM +0300, Amir Goldstein wrote:
> > > Add a high level hook fsnotify_path_create() which is called from
> > > syscall context where mount context is available, so that FAN_CREATE
> > > event can be added to a mount mark mask.
> > >
> > > This high level hook is called in addition to fsnotify_create(),
> > > fsnotify_mkdir() and fsnotify_link() hooks in vfs helpers where the mount
> > > context is not available.
> > >
> > > In the context where fsnotify_path_create() will be called, a dentry flag
> > > flag is set on the new dentry the suppress the FS_CREATE event in the vfs
> > > level hooks.
> > >
> > > This functionality was requested by Christian Brauner to replace
> > > recursive inotify watches for detecting when some path was created under
> > > an idmapped mount without having to monitor FAN_CREATE events in the
> > > entire filesystem.
> > >
> > > In combination with more changes to allow unprivileged fanotify listener
> > > to watch an idmapped mount, this functionality would be usable also by
> > > nested container managers.
> > >
> > > Link: https://lore.kernel.org/linux-fsdevel/20210318143140.jxycfn3fpqntq34z@wittgenstein/
> > > Cc: Christian Brauner <christian.brauner@ubuntu.com>
> > > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > > ---
> >
> > Was about to look at this. Does this require preliminary patches since
> > it doesn't apply to current master. If so, can you just give me a link
> > to a branch so I can pull from that? :)
> >
>
> The patch is less useful on its own.
> Better take the entire work for the demo which includes this patch:
>
> [1] https://github.com/amir73il/linux/commits/fanotify_userns
> [2] https://github.com/amir73il/inotify-tools/commits/fanotify_userns
>

Christian,

Apologies for the fast moving target.
I just force force the kernel+demo branches to include support for
the two extra events (delete and move) on mount mark.

3dd3d7f02717...6cfe8f7a9148 fanotify_userns -> fanotify_userns (forced update)

The same demo I described before with a mix of:
- CREATE event on MOUNT mark
- Rest of events on recursive inode marks
Still work when running demo script with --fanotify --recursive
on idmapped mount

But now the demo also works with --global on idmapped mount
by setting up only a mount mark to watch most events
excluding the unsupported events (see below).

Thanks,
Amir.

# ./test_demo.sh /mnt 0
+ WD=/mnt
+ ID=0
...
+ inotifywatch --global -w --timeout -2 /mnt
Establishing filesystem global watch...
...
total  modify  close_write  move_self  create  delete  filename
3      0       1            0          1       1       /mnt/a/1 (deleted)
2      0       1            0          1       0       /mnt/a/0 (deleted)
2      0       1            0          1       0       /mnt/a/2
2      0       1            0          1       0       /mnt/a/3
2      0       0            0          1       1       /mnt/a/dir1 (deleted)
2      0       1            0          1       0       /mnt/a/dir2/A/B/C/file2
1      0       0            0          1       0       /mnt/a/dir0 (deleted)
1      0       0            0          1       0       /mnt/a/dir2
1      0       0            0          1       0       /mnt/a/dir2/A
1      0       0            0          1       0       /mnt/a/dir2/A/B
1      0       0            0          1       0       /mnt/a/dir2/A/B/C

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask
  2021-03-30 14:56           ` Amir Goldstein
@ 2021-03-31  9:46             ` Christian Brauner
  2021-03-31 11:29               ` Amir Goldstein
  0 siblings, 1 reply; 61+ messages in thread
From: Christian Brauner @ 2021-03-31  9:46 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel, Linux API, Miklos Szeredi

On Tue, Mar 30, 2021 at 05:56:25PM +0300, Amir Goldstein wrote:
> > > > My example probably would be something like:
> > > >
> > > > mount -t ext4 /dev/sdb /A
> > > >
> > > > 1. FAN_MARK_MOUNT(/A)
> > > >
> > > > mount --bind /A /B
> > > >
> > > > 2. FAN_MARK_MOUNT(/B)
> > > >
> > > > mount -t ecryptfs /B /C
> > > >
> > > > 3. FAN_MARK_MOUNT(/C)
> > > >
> > > > let's say I now do
> > > >
> > > > touch /C/bla
> > > >
> > > > I may be way off here but intuitively it seems both 1. and 2. should get
> > > > a creation event but not 3., right?
> > > >
> > >
> > > Why not 3?
> > > You explicitly set a mark on /C requesting to be notified when
> > > objects are created via /C.
> >
> > Sorry, that was a typo. I meant to write, both 2. and 3. should get a
> > creation event but not 1.
> >
> > >
> > > > But with your proposal would both 1. and 2. still get a creation event?
> > > >
> >
> > Same obvious typo. The correct question would be: with your proposal do
> > 2. and 3. both get an event?
> >
> > Because it feels like they both should since /C is mounted on top of /B
> > and ecryptfs acts as a shim. Both FAN_MARK_MOUNT(/B) and
> > FAN_MARK_MOUNT(/C) should get a creation event after all both will have
> > mnt->mnt_fsnotify_marks set.
> >
> 
> Right.
> 
> There are two ways to address this inconsistency:
> 1. Change internal callers of vfs_ helpers to use a private mount,
>     as you yourself suggested for ecryptfs and cachefiles

I feel like this is he correct thing to do independently of the fanotify
considerations. I think I'll send an RFC for this today or later this
week.

> 2. Add fsnotify_path_ hooks at caller site - that would be the
>     correct thing to do for nfsd IMO

I do not have an informed opinion yet on nfsd so I simply need to trust
you here. :)

> 
> > >
> > > They would not get an event, because fsnotify() looks for CREATE event
> > > subscribers on inode->i_fsnotify_marks and inode->i_sb_s_fsnotify_marks
> > > and does not find any.
> >
> > Well yes, but my example has FAN_MARK_MOUNT(/B) set. So fanotify
> > _should_ look at
> >             (!mnt || !mnt->mnt_fsnotify_marks) &&
> > and see that there are subscribers and should notify the subscribers in
> > /B even if the file is created through /C.
> >
> > My point is with your solution this can't be handled and I want to make
> > sure that this is ok. Because right now you'd not be notified about a
> > new file having been created in /B even though mnt->mnt_fsnotify_marks
> > is set and the creation went through /B via /C.
> >
> 
> If you are referring to the ecryptfs use case specifically, then I think it is
> ok. After all, whether ecryptfs uses a private mount clone or not is not
> something the user can know.
> 
> > _Unless_ we switch to an argument like overlayfs and say "This is a
> > private mount which is opaque and so we don't need to generate events.".
> > Overlayfs handles this cleanly due to clone_private_mount() which will
> > shed all mnt->mnt_fsnotify_marks and ecryptfs should too if that is the
> > argument we follow, no?
> >
> 
> There is simply no way that the user can infer from the documentation
> of FAN_MARK_MOUNT that the event on /B is expected when /B is
> underlying layer of ecryptfs or overlayfs.
> It requires deep internal knowledge of the stacked fs implementation.
> In best case, the user can infer that she MAY get an event on /B.
> Some users MAY also expect to get an event on /A because they do not
> understand the concept of bind mounts...
> Clone a mount ns and you will get more lost users...

I disagree to some extent. For example, a user might remount /B
read-only at which point /C is effectively read-only too which makes it
plain obvious to the user that /C piggy-backs on /B.
But leaving that aside my questioning is more concerned with whether the
implementation we're aiming for is consistent and intuitive and that
stacking example came to my mind pretty quickly.

> 
> > >
> > > The vfs_create() -> fsnotify_create() hook passes data_type inode to
> > > fsnotify() so there is no fsnotify_data_path() to extract mnt event
> > > subscribers from.
> >
> > Right, that was my point. You don't have the mnt context for the
> > underlying fs at a time when e.g. call vfs_link() which ultimately calls
> > fsnotify_create/link() which I'm saying might be a problem.
> >
> 
> It's a problem. If it wasn't a problem I wouldn't need to work around it ;-)
> 
> It would be a problem if people think that the FAN_MOUNT_MARK
> is a subtree mark, which it certainly is not. And I have no doubt that

I don't think subtree marks figure into the example above. But we
digress.

> as Jan said, people really do want a subtree mark.
> 
> My question to you with this RFC is: Does the ability to subscribe to
> CREATE/DELETE/MOVE events on a mount help any of your use
> cases? With or without the property that mount marks are allowed

Since I explicitly pointed on in a prior mail that it would be great to
have the same events as for a regular fanotify watch I think I already
answered that question. :)

> inside userns for idmapped mounts.

But if it helps then I'll do it once: yes, both would indeed be very
useful.

> 
> Note that if we think the semantics of this are useful for container
> managers, but too complex for most mortals, we may decide to
> restrict the ability to subscribe to those events to idmapped mounts(?).

I don't think it's too complex for most users. But supervisors and
container managers are prime users of a feature like this.

Christian

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask
  2021-03-30 16:24     ` Amir Goldstein
@ 2021-03-31 10:08       ` Christian Brauner
  2021-03-31 10:57         ` Amir Goldstein
  2021-04-08 11:44         ` open_by_handle_at() in userns Amir Goldstein
  0 siblings, 2 replies; 61+ messages in thread
From: Christian Brauner @ 2021-03-31 10:08 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel, Linux API

On Tue, Mar 30, 2021 at 07:24:02PM +0300, Amir Goldstein wrote:
> On Tue, Mar 30, 2021 at 12:31 PM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > On Tue, Mar 30, 2021 at 10:31 AM Christian Brauner
> > <christian.brauner@ubuntu.com> wrote:
> > >
> > > On Sun, Mar 28, 2021 at 06:56:24PM +0300, Amir Goldstein wrote:
> > > > Add a high level hook fsnotify_path_create() which is called from
> > > > syscall context where mount context is available, so that FAN_CREATE
> > > > event can be added to a mount mark mask.
> > > >
> > > > This high level hook is called in addition to fsnotify_create(),
> > > > fsnotify_mkdir() and fsnotify_link() hooks in vfs helpers where the mount
> > > > context is not available.
> > > >
> > > > In the context where fsnotify_path_create() will be called, a dentry flag
> > > > flag is set on the new dentry the suppress the FS_CREATE event in the vfs
> > > > level hooks.
> > > >
> > > > This functionality was requested by Christian Brauner to replace
> > > > recursive inotify watches for detecting when some path was created under
> > > > an idmapped mount without having to monitor FAN_CREATE events in the
> > > > entire filesystem.
> > > >
> > > > In combination with more changes to allow unprivileged fanotify listener
> > > > to watch an idmapped mount, this functionality would be usable also by
> > > > nested container managers.
> > > >
> > > > Link: https://lore.kernel.org/linux-fsdevel/20210318143140.jxycfn3fpqntq34z@wittgenstein/
> > > > Cc: Christian Brauner <christian.brauner@ubuntu.com>
> > > > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > > > ---
> > >
> > > Was about to look at this. Does this require preliminary patches since
> > > it doesn't apply to current master. If so, can you just give me a link
> > > to a branch so I can pull from that? :)
> > >
> >
> > The patch is less useful on its own.
> > Better take the entire work for the demo which includes this patch:
> >
> > [1] https://github.com/amir73il/linux/commits/fanotify_userns
> > [2] https://github.com/amir73il/inotify-tools/commits/fanotify_userns
> >
> 
> Christian,
> 
> Apologies for the fast moving target.

No problem.

> I just force force the kernel+demo branches to include support for
> the two extra events (delete and move) on mount mark.

Sounds good.

One thing your patch

commit ea31e84fda83c17b88851de399f76f5d9fc1abf4
Author: Amir Goldstein <amir73il@gmail.com>
Date:   Sat Mar 20 12:58:12 2021 +0200

    fs: allow open by file handle inside userns

    open_by_handle_at(2) requires CAP_DAC_READ_SEARCH in init userns,
    where most filesystems are mounted.

    Relax the requirement to allow a user with CAP_DAC_READ_SEARCH
    inside userns to open by file handle in filesystems that were
    mounted inside that userns.

    In addition, also allow open by handle in an idmapped mount, which is
    mapped to the userns while verifying that the returned open file path
    is under the root of the idmapped mount.

    This is going to be needed for setting an fanotify mark on a filesystem
    and watching events inside userns.

    Signed-off-by: Amir Goldstein <amir73il@gmail.com>

Requires fs/exportfs/expfs.c to be made idmapped mounts aware.
open_by_handle_at() uses exportfs_decode_fh() which e.g. has the
following and other callchains:

exportfs_decode_fh()
-> exportfs_decode_fh_raw()
   -> lookup_one_len()
      -> inode_permission(mnt_userns, ...)

That's not a huge problem though I did all these changes for the
overlayfs support for idmapped mounts I have in a branch from an earlier
version of the idmapped mounts patchset. Basically lookup_one_len(),
lookup_one_len_unlocked(), and lookup_positive_unlocked() need to take
the mnt_userns into account. I can rebase my change and send it for
consideration next cycle. If you can live without the
open_by_handle_at() support for now in this patchset (Which I think you
said you could.) then it's not a blocker either. Sorry for the
inconvenience.

Christian

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask
  2021-03-31 10:08       ` Christian Brauner
@ 2021-03-31 10:57         ` Amir Goldstein
  2021-04-08 11:44         ` open_by_handle_at() in userns Amir Goldstein
  1 sibling, 0 replies; 61+ messages in thread
From: Amir Goldstein @ 2021-03-31 10:57 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Jan Kara, linux-fsdevel, Linux API

On Wed, Mar 31, 2021 at 1:08 PM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
>
> On Tue, Mar 30, 2021 at 07:24:02PM +0300, Amir Goldstein wrote:
> > On Tue, Mar 30, 2021 at 12:31 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > >
> > > On Tue, Mar 30, 2021 at 10:31 AM Christian Brauner
> > > <christian.brauner@ubuntu.com> wrote:
> > > >
> > > > On Sun, Mar 28, 2021 at 06:56:24PM +0300, Amir Goldstein wrote:
> > > > > Add a high level hook fsnotify_path_create() which is called from
> > > > > syscall context where mount context is available, so that FAN_CREATE
> > > > > event can be added to a mount mark mask.
> > > > >
> > > > > This high level hook is called in addition to fsnotify_create(),
> > > > > fsnotify_mkdir() and fsnotify_link() hooks in vfs helpers where the mount
> > > > > context is not available.
> > > > >
> > > > > In the context where fsnotify_path_create() will be called, a dentry flag
> > > > > flag is set on the new dentry the suppress the FS_CREATE event in the vfs
> > > > > level hooks.
> > > > >
> > > > > This functionality was requested by Christian Brauner to replace
> > > > > recursive inotify watches for detecting when some path was created under
> > > > > an idmapped mount without having to monitor FAN_CREATE events in the
> > > > > entire filesystem.
> > > > >
> > > > > In combination with more changes to allow unprivileged fanotify listener
> > > > > to watch an idmapped mount, this functionality would be usable also by
> > > > > nested container managers.
> > > > >
> > > > > Link: https://lore.kernel.org/linux-fsdevel/20210318143140.jxycfn3fpqntq34z@wittgenstein/
> > > > > Cc: Christian Brauner <christian.brauner@ubuntu.com>
> > > > > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > > > > ---
> > > >
> > > > Was about to look at this. Does this require preliminary patches since
> > > > it doesn't apply to current master. If so, can you just give me a link
> > > > to a branch so I can pull from that? :)
> > > >
> > >
> > > The patch is less useful on its own.
> > > Better take the entire work for the demo which includes this patch:
> > >
> > > [1] https://github.com/amir73il/linux/commits/fanotify_userns
> > > [2] https://github.com/amir73il/inotify-tools/commits/fanotify_userns
> > >
> >
> > Christian,
> >
> > Apologies for the fast moving target.
>
> No problem.
>
> > I just force force the kernel+demo branches to include support for
> > the two extra events (delete and move) on mount mark.
>
> Sounds good.
>
> One thing your patch
>
> commit ea31e84fda83c17b88851de399f76f5d9fc1abf4
> Author: Amir Goldstein <amir73il@gmail.com>
> Date:   Sat Mar 20 12:58:12 2021 +0200
>
>     fs: allow open by file handle inside userns
>
>     open_by_handle_at(2) requires CAP_DAC_READ_SEARCH in init userns,
>     where most filesystems are mounted.
>
>     Relax the requirement to allow a user with CAP_DAC_READ_SEARCH
>     inside userns to open by file handle in filesystems that were
>     mounted inside that userns.
>
>     In addition, also allow open by handle in an idmapped mount, which is
>     mapped to the userns while verifying that the returned open file path
>     is under the root of the idmapped mount.
>
>     This is going to be needed for setting an fanotify mark on a filesystem
>     and watching events inside userns.
>
>     Signed-off-by: Amir Goldstein <amir73il@gmail.com>
>
> Requires fs/exportfs/expfs.c to be made idmapped mounts aware.
> open_by_handle_at() uses exportfs_decode_fh() which e.g. has the
> following and other callchains:
>
> exportfs_decode_fh()
> -> exportfs_decode_fh_raw()
>    -> lookup_one_len()
>       -> inode_permission(mnt_userns, ...)
>
> That's not a huge problem though I did all these changes for the
> overlayfs support for idmapped mounts I have in a branch from an earlier
> version of the idmapped mounts patchset. Basically lookup_one_len(),
> lookup_one_len_unlocked(), and lookup_positive_unlocked() need to take
> the mnt_userns into account. I can rebase my change and send it for
> consideration next cycle. If you can live without the
> open_by_handle_at() support for now in this patchset (Which I think you
> said you could.) then it's not a blocker either. Sorry for the
> inconvenience.
>

My preference is that you take this patch through your branch.
I think you will do a much better job than me selling it and arguing for
its correctness.

The ability to setup fanotify marks on idmapped mounts does not
depend on it in any way.
I've only added it to my branch to facilitate the demo, which also
independently resolves the reported fids to paths.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask
  2021-03-31  9:46             ` Christian Brauner
@ 2021-03-31 11:29               ` Amir Goldstein
  2021-03-31 12:17                 ` Christian Brauner
                                   ` (2 more replies)
  0 siblings, 3 replies; 61+ messages in thread
From: Amir Goldstein @ 2021-03-31 11:29 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jan Kara, linux-fsdevel, Linux API, Miklos Szeredi, J. Bruce Fields

On Wed, Mar 31, 2021 at 12:46 PM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
>
> On Tue, Mar 30, 2021 at 05:56:25PM +0300, Amir Goldstein wrote:
> > > > > My example probably would be something like:
> > > > >
> > > > > mount -t ext4 /dev/sdb /A
> > > > >
> > > > > 1. FAN_MARK_MOUNT(/A)
> > > > >
> > > > > mount --bind /A /B
> > > > >
> > > > > 2. FAN_MARK_MOUNT(/B)
> > > > >
> > > > > mount -t ecryptfs /B /C
> > > > >
> > > > > 3. FAN_MARK_MOUNT(/C)
> > > > >
> > > > > let's say I now do
> > > > >
> > > > > touch /C/bla
> > > > >
> > > > > I may be way off here but intuitively it seems both 1. and 2. should get
> > > > > a creation event but not 3., right?
> > > > >
> > > >
> > > > Why not 3?
> > > > You explicitly set a mark on /C requesting to be notified when
> > > > objects are created via /C.
> > >
> > > Sorry, that was a typo. I meant to write, both 2. and 3. should get a
> > > creation event but not 1.
> > >
> > > >
> > > > > But with your proposal would both 1. and 2. still get a creation event?
> > > > >
> > >
> > > Same obvious typo. The correct question would be: with your proposal do
> > > 2. and 3. both get an event?
> > >
> > > Because it feels like they both should since /C is mounted on top of /B
> > > and ecryptfs acts as a shim. Both FAN_MARK_MOUNT(/B) and
> > > FAN_MARK_MOUNT(/C) should get a creation event after all both will have
> > > mnt->mnt_fsnotify_marks set.
> > >
> >
> > Right.
> >
> > There are two ways to address this inconsistency:
> > 1. Change internal callers of vfs_ helpers to use a private mount,
> >     as you yourself suggested for ecryptfs and cachefiles
>
> I feel like this is he correct thing to do independently of the fanotify
> considerations. I think I'll send an RFC for this today or later this
> week.
>
> > 2. Add fsnotify_path_ hooks at caller site - that would be the
> >     correct thing to do for nfsd IMO
>
> I do not have an informed opinion yet on nfsd so I simply need to trust
> you here. :)
>

As long as "exp_export: export of idmapped mounts not yet supported.\n"
I don't think it matters much.
It feels like adding idmapped mounts to nfsd is on your roadmap.
When you get to that we can discuss adding fsnotify path hooks to nfsd
if Jan agrees to the fsnotify path hooks concept.

> >
> > > >
> > > > They would not get an event, because fsnotify() looks for CREATE event
> > > > subscribers on inode->i_fsnotify_marks and inode->i_sb_s_fsnotify_marks
> > > > and does not find any.
> > >
> > > Well yes, but my example has FAN_MARK_MOUNT(/B) set. So fanotify
> > > _should_ look at
> > >             (!mnt || !mnt->mnt_fsnotify_marks) &&
> > > and see that there are subscribers and should notify the subscribers in
> > > /B even if the file is created through /C.
> > >
> > > My point is with your solution this can't be handled and I want to make
> > > sure that this is ok. Because right now you'd not be notified about a
> > > new file having been created in /B even though mnt->mnt_fsnotify_marks
> > > is set and the creation went through /B via /C.
> > >
> >
> > If you are referring to the ecryptfs use case specifically, then I think it is
> > ok. After all, whether ecryptfs uses a private mount clone or not is not
> > something the user can know.
> >
> > > _Unless_ we switch to an argument like overlayfs and say "This is a
> > > private mount which is opaque and so we don't need to generate events.".
> > > Overlayfs handles this cleanly due to clone_private_mount() which will
> > > shed all mnt->mnt_fsnotify_marks and ecryptfs should too if that is the
> > > argument we follow, no?
> > >
> >
> > There is simply no way that the user can infer from the documentation
> > of FAN_MARK_MOUNT that the event on /B is expected when /B is
> > underlying layer of ecryptfs or overlayfs.
> > It requires deep internal knowledge of the stacked fs implementation.
> > In best case, the user can infer that she MAY get an event on /B.
> > Some users MAY also expect to get an event on /A because they do not
> > understand the concept of bind mounts...
> > Clone a mount ns and you will get more lost users...
>
> I disagree to some extent. For example, a user might remount /B
> read-only at which point /C is effectively read-only too which makes it
> plain obvious to the user that /C piggy-backs on /B.

Yes, but that is a bug. /C should not become read-only. It should use
a private clone of /B, so I don't see where this is going.

> But leaving that aside my questioning is more concerned with whether the
> implementation we're aiming for is consistent and intuitive and that
> stacking example came to my mind pretty quickly.
>

This implementation is a compromise for not having clear user mount
context in all places that call for an event.
For every person you find that thinks it is intuitive to get an event on /B
for touch C/bla, you will find another person that thinks it is not intuitive
to get an event. I think we are way beyond the stage with mount
namespaces where intuition alone suffice.

w.r.t consistent, you gave a few examples and I suggested how IMO
they should be fixed to behave consistently.
If you have other examples of alleged inconsistencies, please list them.

> >
> > > >
> > > > The vfs_create() -> fsnotify_create() hook passes data_type inode to
> > > > fsnotify() so there is no fsnotify_data_path() to extract mnt event
> > > > subscribers from.
> > >
> > > Right, that was my point. You don't have the mnt context for the
> > > underlying fs at a time when e.g. call vfs_link() which ultimately calls
> > > fsnotify_create/link() which I'm saying might be a problem.
> > >
> >
> > It's a problem. If it wasn't a problem I wouldn't need to work around it ;-)
> >
> > It would be a problem if people think that the FAN_MOUNT_MARK
> > is a subtree mark, which it certainly is not. And I have no doubt that
>
> I don't think subtree marks figure into the example above. But we
> digress.
>
> > as Jan said, people really do want a subtree mark.
> >
> > My question to you with this RFC is: Does the ability to subscribe to
> > CREATE/DELETE/MOVE events on a mount help any of your use
> > cases? With or without the property that mount marks are allowed
>
> Since I explicitly pointed on in a prior mail that it would be great to
> have the same events as for a regular fanotify watch I think I already
> answered that question. :)
>
> > inside userns for idmapped mounts.
>
> But if it helps then I'll do it once: yes, both would indeed be very
> useful.
>

OK. I understand that the "promise" of those abilities is very useful.
Please also confirm once you tested the demo code that the new
events on an idmapped mount will "actually" be useful to container
managers. If you can work my demo code into a demo branch for
the bind mount injection or something that would be best.

The reason I am asking this is because while I was working on
enabling sb/mount marks inside userns I found several other issues
(e.g. open_by_handle_at()) without fixing them the demo would have
been much less impressive and much less useful in practice.

So I would like to know that we really have all the pieces needed for
a useful solution, before proposing the fanotify patches.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask
  2021-03-31 11:29               ` Amir Goldstein
@ 2021-03-31 12:17                 ` Christian Brauner
  2021-03-31 12:59                   ` Amir Goldstein
  2021-03-31 12:54                 ` Jan Kara
  2021-03-31 13:06                 ` [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask J. Bruce Fields
  2 siblings, 1 reply; 61+ messages in thread
From: Christian Brauner @ 2021-03-31 12:17 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, linux-fsdevel, Linux API, Miklos Szeredi, J. Bruce Fields

On Wed, Mar 31, 2021 at 02:29:04PM +0300, Amir Goldstein wrote:
> On Wed, Mar 31, 2021 at 12:46 PM Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
> >
> > On Tue, Mar 30, 2021 at 05:56:25PM +0300, Amir Goldstein wrote:
> > > > > > My example probably would be something like:
> > > > > >
> > > > > > mount -t ext4 /dev/sdb /A
> > > > > >
> > > > > > 1. FAN_MARK_MOUNT(/A)
> > > > > >
> > > > > > mount --bind /A /B
> > > > > >
> > > > > > 2. FAN_MARK_MOUNT(/B)
> > > > > >
> > > > > > mount -t ecryptfs /B /C
> > > > > >
> > > > > > 3. FAN_MARK_MOUNT(/C)
> > > > > >
> > > > > > let's say I now do
> > > > > >
> > > > > > touch /C/bla
> > > > > >
> > > > > > I may be way off here but intuitively it seems both 1. and 2. should get
> > > > > > a creation event but not 3., right?
> > > > > >
> > > > >
> > > > > Why not 3?
> > > > > You explicitly set a mark on /C requesting to be notified when
> > > > > objects are created via /C.
> > > >
> > > > Sorry, that was a typo. I meant to write, both 2. and 3. should get a
> > > > creation event but not 1.
> > > >
> > > > >
> > > > > > But with your proposal would both 1. and 2. still get a creation event?
> > > > > >
> > > >
> > > > Same obvious typo. The correct question would be: with your proposal do
> > > > 2. and 3. both get an event?
> > > >
> > > > Because it feels like they both should since /C is mounted on top of /B
> > > > and ecryptfs acts as a shim. Both FAN_MARK_MOUNT(/B) and
> > > > FAN_MARK_MOUNT(/C) should get a creation event after all both will have
> > > > mnt->mnt_fsnotify_marks set.
> > > >
> > >
> > > Right.
> > >
> > > There are two ways to address this inconsistency:
> > > 1. Change internal callers of vfs_ helpers to use a private mount,
> > >     as you yourself suggested for ecryptfs and cachefiles
> >

[1]:

> > I feel like this is he correct thing to do independently of the fanotify
> > considerations. I think I'll send an RFC for this today or later this
> > week.
> >
> > > 2. Add fsnotify_path_ hooks at caller site - that would be the
> > >     correct thing to do for nfsd IMO
> >
> > I do not have an informed opinion yet on nfsd so I simply need to trust
> > you here. :)
> >
> 
> As long as "exp_export: export of idmapped mounts not yet supported.\n"
> I don't think it matters much.
> It feels like adding idmapped mounts to nfsd is on your roadmap.
> When you get to that we can discuss adding fsnotify path hooks to nfsd
> if Jan agrees to the fsnotify path hooks concept.
> 
> > >
> > > > >
> > > > > They would not get an event, because fsnotify() looks for CREATE event
> > > > > subscribers on inode->i_fsnotify_marks and inode->i_sb_s_fsnotify_marks
> > > > > and does not find any.
> > > >
> > > > Well yes, but my example has FAN_MARK_MOUNT(/B) set. So fanotify
> > > > _should_ look at
> > > >             (!mnt || !mnt->mnt_fsnotify_marks) &&
> > > > and see that there are subscribers and should notify the subscribers in
> > > > /B even if the file is created through /C.
> > > >
> > > > My point is with your solution this can't be handled and I want to make
> > > > sure that this is ok. Because right now you'd not be notified about a
> > > > new file having been created in /B even though mnt->mnt_fsnotify_marks
> > > > is set and the creation went through /B via /C.
> > > >
> > >
> > > If you are referring to the ecryptfs use case specifically, then I think it is
> > > ok. After all, whether ecryptfs uses a private mount clone or not is not
> > > something the user can know.
> > >
> > > > _Unless_ we switch to an argument like overlayfs and say "This is a
> > > > private mount which is opaque and so we don't need to generate events.".
> > > > Overlayfs handles this cleanly due to clone_private_mount() which will
> > > > shed all mnt->mnt_fsnotify_marks and ecryptfs should too if that is the
> > > > argument we follow, no?
> > > >
> > >
> > > There is simply no way that the user can infer from the documentation
> > > of FAN_MARK_MOUNT that the event on /B is expected when /B is
> > > underlying layer of ecryptfs or overlayfs.
> > > It requires deep internal knowledge of the stacked fs implementation.
> > > In best case, the user can infer that she MAY get an event on /B.
> > > Some users MAY also expect to get an event on /A because they do not
> > > understand the concept of bind mounts...
> > > Clone a mount ns and you will get more lost users...
> >
> > I disagree to some extent. For example, a user might remount /B
> > read-only at which point /C is effectively read-only too which makes it
> > plain obvious to the user that /C piggy-backs on /B.
> 
> Yes, but that is a bug. /C should not become read-only. It should use

I agree. That's why I said they should use one. :)
I just pointed out how it is today for two filesystems.

> a private clone of /B, so I don't see where this is going.

It's going to [1] above. :)

> 
> > But leaving that aside my questioning is more concerned with whether the
> > implementation we're aiming for is consistent and intuitive and that
> > stacking example came to my mind pretty quickly.
> >
> 
> This implementation is a compromise for not having clear user mount
> context in all places that call for an event.
> For every person you find that thinks it is intuitive to get an event on /B
> for touch C/bla, you will find another person that thinks it is not intuitive

And I think here we disagree. The technical implementation currently
requires this since the two mounts are both clearly marked and the first
mount creates objects by going through the other mount and they don't
have a private mount. All I was saying is that the current patchset
can't handle this case and asked whether we are ok with that and if not
what we do to fix it.
My proposal two or three mails ago and then picked up by you is: make
them both use a private clone mount which is - as I said in earlier
mails - the correct solution anyway and falls in line with overlayfs
too.

> to get an event. I think we are way beyond the stage with mount
> namespaces where intuition alone suffice.
> 
> w.r.t consistent, you gave a few examples and I suggested how IMO
> they should be fixed to behave consistently.
> If you have other examples of alleged inconsistencies, please list them.

It feels like I somehow upset you with this. Again, I agree with the
push of the patchset in general and I'm grateful you're doing that work
and we agree on the fix for the two filesystems. As I said, I'll try to
get an RFC fix out for both.

> 
> > >
> > > > >
> > > > > The vfs_create() -> fsnotify_create() hook passes data_type inode to
> > > > > fsnotify() so there is no fsnotify_data_path() to extract mnt event
> > > > > subscribers from.
> > > >
> > > > Right, that was my point. You don't have the mnt context for the
> > > > underlying fs at a time when e.g. call vfs_link() which ultimately calls
> > > > fsnotify_create/link() which I'm saying might be a problem.
> > > >
> > >
> > > It's a problem. If it wasn't a problem I wouldn't need to work around it ;-)
> > >
> > > It would be a problem if people think that the FAN_MOUNT_MARK
> > > is a subtree mark, which it certainly is not. And I have no doubt that
> >
> > I don't think subtree marks figure into the example above. But we
> > digress.
> >
> > > as Jan said, people really do want a subtree mark.
> > >
> > > My question to you with this RFC is: Does the ability to subscribe to
> > > CREATE/DELETE/MOVE events on a mount help any of your use
> > > cases? With or without the property that mount marks are allowed
> >
> > Since I explicitly pointed on in a prior mail that it would be great to
> > have the same events as for a regular fanotify watch I think I already
> > answered that question. :)
> >
> > > inside userns for idmapped mounts.
> >
> > But if it helps then I'll do it once: yes, both would indeed be very
> > useful.
> >
> 
> OK. I understand that the "promise" of those abilities is very useful.
> Please also confirm once you tested the demo code that the new
> events on an idmapped mount will "actually" be useful to container
> managers. If you can work my demo code into a demo branch for
> the bind mount injection or something that would be best.
> 
> The reason I am asking this is because while I was working on
> enabling sb/mount marks inside userns I found several other issues
> (e.g. open_by_handle_at()) without fixing them the demo would have
> been much less impressive and much less useful in practice.
> 
> So I would like to know that we really have all the pieces needed for
> a useful solution, before proposing the fanotify patches.

Sure, if you think that you have your branch in the shape that you want
to. So far it has been evolving quite rapidly as you said yourself. :)
I can probably test this soon early next week seems most likely since I
need to find a timeslot to actually do the work you're asking. Hope that
works.

Thanks!
Christian

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask
  2021-03-31 11:29               ` Amir Goldstein
  2021-03-31 12:17                 ` Christian Brauner
@ 2021-03-31 12:54                 ` Jan Kara
  2021-03-31 14:06                   ` Amir Goldstein
  2021-03-31 13:06                 ` [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask J. Bruce Fields
  2 siblings, 1 reply; 61+ messages in thread
From: Jan Kara @ 2021-03-31 12:54 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Christian Brauner, Jan Kara, linux-fsdevel, Linux API,
	Miklos Szeredi, J. Bruce Fields

On Wed 31-03-21 14:29:04, Amir Goldstein wrote:
> On Wed, Mar 31, 2021 at 12:46 PM Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
> >
> > On Tue, Mar 30, 2021 at 05:56:25PM +0300, Amir Goldstein wrote:
> > > > > > My example probably would be something like:
> > > > > >
> > > > > > mount -t ext4 /dev/sdb /A
> > > > > >
> > > > > > 1. FAN_MARK_MOUNT(/A)
> > > > > >
> > > > > > mount --bind /A /B
> > > > > >
> > > > > > 2. FAN_MARK_MOUNT(/B)
> > > > > >
> > > > > > mount -t ecryptfs /B /C
> > > > > >
> > > > > > 3. FAN_MARK_MOUNT(/C)
> > > > > >
> > > > > > let's say I now do
> > > > > >
> > > > > > touch /C/bla
> > > > > >
> > > > > > I may be way off here but intuitively it seems both 1. and 2. should get
> > > > > > a creation event but not 3., right?
> > > > > >
> > > > >
> > > > > Why not 3?
> > > > > You explicitly set a mark on /C requesting to be notified when
> > > > > objects are created via /C.
> > > >
> > > > Sorry, that was a typo. I meant to write, both 2. and 3. should get a
> > > > creation event but not 1.
> > > >
> > > > >
> > > > > > But with your proposal would both 1. and 2. still get a creation event?
> > > > > >
> > > >
> > > > Same obvious typo. The correct question would be: with your proposal do
> > > > 2. and 3. both get an event?
> > > >
> > > > Because it feels like they both should since /C is mounted on top of /B
> > > > and ecryptfs acts as a shim. Both FAN_MARK_MOUNT(/B) and
> > > > FAN_MARK_MOUNT(/C) should get a creation event after all both will have
> > > > mnt->mnt_fsnotify_marks set.
> > > >
> > >
> > > Right.
> > >
> > > There are two ways to address this inconsistency:
> > > 1. Change internal callers of vfs_ helpers to use a private mount,
> > >     as you yourself suggested for ecryptfs and cachefiles
> >
> > I feel like this is he correct thing to do independently of the fanotify
> > considerations. I think I'll send an RFC for this today or later this
> > week.
> >
> > > 2. Add fsnotify_path_ hooks at caller site - that would be the
> > >     correct thing to do for nfsd IMO
> >
> > I do not have an informed opinion yet on nfsd so I simply need to trust
> > you here. :)
> >
> 
> As long as "exp_export: export of idmapped mounts not yet supported.\n"
> I don't think it matters much.
> It feels like adding idmapped mounts to nfsd is on your roadmap.
> When you get to that we can discuss adding fsnotify path hooks to nfsd
> if Jan agrees to the fsnotify path hooks concept.

I was looking at the patch and thinking about it for a few days already. I
think that generating fsnotify event later (higher up the stack where we
have mount information) is fine and a neat idea. I just dislike the hackery
with dentry flags. Also I'm somewhat uneasy that it is random (from
userspace POV) when path event is generated and when not (at least that's
my impression from the patch - maybe I'm wrong). How difficult would it be
to get rid of it? I mean what if we just moved say fsnotify_create() call
wholly up the stack? It would mean more explicit calls to fsnotify_create()
from filesystems - as far as I'm looking nfsd, overlayfs, cachefiles,
ecryptfs. But that would seem to be manageable.  Also, to maintain sanity,
we would probably have to lift generation of all directory events like
that. That would be already notable churn but maybe doable... I know you've
been looking at similar things in the past so if you are aware why this
won't fly, please tell me.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask
  2021-03-31 12:17                 ` Christian Brauner
@ 2021-03-31 12:59                   ` Amir Goldstein
  0 siblings, 0 replies; 61+ messages in thread
From: Amir Goldstein @ 2021-03-31 12:59 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jan Kara, linux-fsdevel, Linux API, Miklos Szeredi, J. Bruce Fields

> > This implementation is a compromise for not having clear user mount
> > context in all places that call for an event.
> > For every person you find that thinks it is intuitive to get an event on /B
> > for touch C/bla, you will find another person that thinks it is not intuitive
>
> And I think here we disagree. The technical implementation currently
> requires this since the two mounts are both clearly marked and the first
> mount creates objects by going through the other mount and they don't
> have a private mount. All I was saying is that the current patchset
> can't handle this case and asked whether we are ok with that and if not
> what we do to fix it.
> My proposal two or three mails ago and then picked up by you is: make
> them both use a private clone mount which is - as I said in earlier
> mails - the correct solution anyway and falls in line with overlayfs
> too.
>

As long as we agree on the solution ;-)

> > to get an event. I think we are way beyond the stage with mount
> > namespaces where intuition alone suffice.
> >
> > w.r.t consistent, you gave a few examples and I suggested how IMO
> > they should be fixed to behave consistently.
> > If you have other examples of alleged inconsistencies, please list them.
>
> It feels like I somehow upset you with this.

You do not upset me.

I just didn't find a better way to address "consistent and intuitive" concern
without asking for more concrete examples, after we eliminated the
ecryptfs example, which we already agreed(?), is a non issue.

My claim about "intuitive" is that there is a limit to how intuitive
this could be.
I do not see myself explaining in the man page why FAN_DELETE_SELF
cannot be requested for a mount mark. It's just too low level.

So the best I can do is document the events that are available to inode/sb
mark and not available to mount mark.

Currently, the fanotify_mark.2 man page reads:
"...The events which require that filesystem objects are identified by
file handles,
 such as FAN_CREATE, FAN_ATTRIB,  FAN_MOVE,  and FAN_DELETE_SELF,
 cannot  be provided as a mask when flags contains FAN_MARK_MOUNT..."

I will change that to:
"...The events FAN_ATTRIB,  FAN_MOVE,  and FAN_DELETE_SELF,
 cannot  be provided as a mask when flags contain FAN_MARK_MOUNT..."

Without providing a rationale to the list of forbidden events.

BTW, there is an undocumented fact about FAN_MODIFY -
This event is allowed in a mount mark mask, but it only reports the events
generated by fsnotify_modify() on file writes. It does not report to a mount
mark, the FAN_MODIFY event generated by fsnotify_change() from
truncate() and utimensat(), because of the missing mount context.

So yeh, I do understand where the "inconsistent" feeling is coming from... ;-)

[...]
> > So I would like to know that we really have all the pieces needed for
> > a useful solution, before proposing the fanotify patches.
>
> Sure, if you think that you have your branch in the shape that you want
> to. So far it has been evolving quite rapidly as you said yourself. :)
> I can probably test this soon early next week seems most likely since I
> need to find a timeslot to actually do the work you're asking. Hope that
> works.
>

No plans to make any more changes to those branches and no rush
as to when to post tham. This is not v4.13-rc1 material anyway.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask
  2021-03-31 11:29               ` Amir Goldstein
  2021-03-31 12:17                 ` Christian Brauner
  2021-03-31 12:54                 ` Jan Kara
@ 2021-03-31 13:06                 ` J. Bruce Fields
  2 siblings, 0 replies; 61+ messages in thread
From: J. Bruce Fields @ 2021-03-31 13:06 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Christian Brauner, Jan Kara, linux-fsdevel, Linux API, Miklos Szeredi

On Wed, Mar 31, 2021 at 02:29:04PM +0300, Amir Goldstein wrote:
> On Wed, Mar 31, 2021 at 12:46 PM Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
> > I feel like this is he correct thing to do independently of the fanotify
> > considerations. I think I'll send an RFC for this today or later this
> > week.
> >
> > > 2. Add fsnotify_path_ hooks at caller site - that would be the
> > >     correct thing to do for nfsd IMO
> >
> > I do not have an informed opinion yet on nfsd so I simply need to trust
> > you here. :)
> >
> 
> As long as "exp_export: export of idmapped mounts not yet supported.\n"
> I don't think it matters much.

Last I looked, nfsd support for idmapped mounts looked like it would be
a pretty straightforward job, for what it's worth.

--b.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask
  2021-03-31 12:54                 ` Jan Kara
@ 2021-03-31 14:06                   ` Amir Goldstein
  2021-03-31 20:59                     ` fsnotify path hooks Amir Goldstein
  0 siblings, 1 reply; 61+ messages in thread
From: Amir Goldstein @ 2021-03-31 14:06 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christian Brauner, linux-fsdevel, Linux API, Miklos Szeredi,
	J. Bruce Fields

> > As long as "exp_export: export of idmapped mounts not yet supported.\n"
> > I don't think it matters much.
> > It feels like adding idmapped mounts to nfsd is on your roadmap.
> > When you get to that we can discuss adding fsnotify path hooks to nfsd
> > if Jan agrees to the fsnotify path hooks concept.
>
> I was looking at the patch and thinking about it for a few days already. I
> think that generating fsnotify event later (higher up the stack where we
> have mount information) is fine and a neat idea. I just dislike the hackery
> with dentry flags.

Me as well. I used this hack for fast POC.

If we stick with the dual hooks approach, we will have to either pass a new
argument to vfs helpers or use another trick:

Convert all the many calls sites that were converted by Christian to:
   vfs_XXX(&init_user_ns, ...
because they do not have mount context, to:
   vfs_XXX(NULL, ...

Inside the vfs helpers, use init_user_ns when mnt_userns is NULL,
but pass the original mnt_userns argument to fsnotify_ns_XXX hooks.
A non-NULL mnt_userns arg means "path_notify" context.
I have already POC code for passing mnt_userns to fsnotify hooks [1].

I did not check if this assumption always works, but there seems to
be a large overlap between idmapped aware callers and use cases
that will require sending events to a mount mark.

> Also I'm somewhat uneasy that it is random (from
> userspace POV) when path event is generated and when not (at least that's
> my impression from the patch - maybe I'm wrong). How difficult would it be
> to get rid of it? I mean what if we just moved say fsnotify_create() call
> wholly up the stack? It would mean more explicit calls to fsnotify_create()
> from filesystems - as far as I'm looking nfsd, overlayfs, cachefiles,
> ecryptfs. But that would seem to be manageable.  Also, to maintain sanity,

1. I don't think we can do that for all the fsnotify_create() hooks, such as
    debugfs for example
2. It is useless to pass the mount from overlayfs to fsnotify, its a private
    mount that users cannot set a mark on anyway and Christian has
    promised to propose the same change for cachefiles and ecryptfs,
    so I think it's not worth the churn in those call sites
3. I am uneasy with removing the fsnotify hooks from vfs helpers and
    trusting that new callers of vfs_create() will remember to add the high
    level hooks, so I prefer the existing behavior remains for such callers

> we would probably have to lift generation of all directory events like
> that. That would be already notable churn but maybe doable... I know you've
> been looking at similar things in the past so if you are aware why this
> won't fly, please tell me.

I agree with that and since I posted this RFC patch, I have already added
support for FAN_DELETE and FAN_MOVE_SELF [2].
This was easy - not much churn at all.

FAN_MOVED_FROM I dropped because of the old name snapshot.
FAN_MOVED_TO I dropped because it needs the cookie to be in sync with
that of the FAN_MOVED_FROM event.

Besides, this event pair is "inotify legacy" as far as I am concerned.
FAN_MOVE_SELF can provide most of the needed functionality.
The rest of the functionality should be provided by a new event pair
IMO, FAN_LINK/FAN_UNLINK, as described in this proposal [3].

Which leaves us with two events: FAN_DELETE_SELF and FAN_ATTRIB.
FAN_DELETE_SELF is not appropriate for a mount mark IMO.
FAN_ATTRIB would be useful on mount mark IMO.
It would incur a bit more churn to add it, but I think it's certainly doable.

Just need to decide if we stay with the "dual hooks" approach and if so
on the technique to pass the "notify_path" state into vfs helpers and
existing fsnotify hooks.

Thanks,
Amir.

[1] https://github.com/amir73il/linux/commits/fanotify_in_userns
[2] https://github.com/amir73il/linux/commits/fsnotify_path_hooks
[3] https://lore.kernel.org/linux-fsdevel/CAOQ4uxhEsbfA5+sW4XPnUKgCkXtwoDA-BR3iRO34Nx5c4y7Nug@mail.gmail.com/

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: fsnotify path hooks
  2021-03-31 14:06                   ` Amir Goldstein
@ 2021-03-31 20:59                     ` Amir Goldstein
  2021-04-01 10:29                       ` Jan Kara
  0 siblings, 1 reply; 61+ messages in thread
From: Amir Goldstein @ 2021-03-31 20:59 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christian Brauner, linux-fsdevel, Linux API, Miklos Szeredi,
	J. Bruce Fields

On Wed, Mar 31, 2021 at 5:06 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> > > As long as "exp_export: export of idmapped mounts not yet supported.\n"
> > > I don't think it matters much.
> > > It feels like adding idmapped mounts to nfsd is on your roadmap.
> > > When you get to that we can discuss adding fsnotify path hooks to nfsd
> > > if Jan agrees to the fsnotify path hooks concept.
> >
> > I was looking at the patch and thinking about it for a few days already. I
> > think that generating fsnotify event later (higher up the stack where we
> > have mount information) is fine and a neat idea. I just dislike the hackery
> > with dentry flags.
>
> Me as well. I used this hack for fast POC.
>
> If we stick with the dual hooks approach, we will have to either pass a new
> argument to vfs helpers or use another trick:
>
> Convert all the many calls sites that were converted by Christian to:
>    vfs_XXX(&init_user_ns, ...
> because they do not have mount context, to:
>    vfs_XXX(NULL, ...
>
> Inside the vfs helpers, use init_user_ns when mnt_userns is NULL,
> but pass the original mnt_userns argument to fsnotify_ns_XXX hooks.
> A non-NULL mnt_userns arg means "path_notify" context.
> I have already POC code for passing mnt_userns to fsnotify hooks [1].
>
> I did not check if this assumption always works, but there seems to
> be a large overlap between idmapped aware callers and use cases
> that will require sending events to a mount mark.
>

The above "trick" is pretty silly as I believe Christian intends
to fix all those call sites that pass init_user_ns.

> > Also I'm somewhat uneasy that it is random (from
> > userspace POV) when path event is generated and when not (at least that's
> > my impression from the patch - maybe I'm wrong). How difficult would it be
> > to get rid of it? I mean what if we just moved say fsnotify_create() call
> > wholly up the stack? It would mean more explicit calls to fsnotify_create()
> > from filesystems - as far as I'm looking nfsd, overlayfs, cachefiles,
> > ecryptfs. But that would seem to be manageable.  Also, to maintain sanity,
>
> 1. I don't think we can do that for all the fsnotify_create() hooks, such as
>     debugfs for example
> 2. It is useless to pass the mount from overlayfs to fsnotify, its a private
>     mount that users cannot set a mark on anyway and Christian has
>     promised to propose the same change for cachefiles and ecryptfs,
>     so I think it's not worth the churn in those call sites
> 3. I am uneasy with removing the fsnotify hooks from vfs helpers and
>     trusting that new callers of vfs_create() will remember to add the high
>     level hooks, so I prefer the existing behavior remains for such callers
>

So I read your proposal the wrong way.
You meant move fsnotify_create() up *without* passing mount context
from overlayfs and friends.

So yeh, I do think it is manageable. I think the best solution would be
something along the lines of wrappers like the following:

static inline int vfs_mkdir(...)
{
        int error = __vfs_mkdir_nonotify(...);
        if (!error)
                fsnotify_mkdir(dir, dentry);
        return error;
}

And then the few call sites that call the fsnotify_path_ hooks
(i.e. in syscalls and perhaps later in nfsd) will call the
__vfs_xxx_nonotify() variant.

I suppose that with this approach I could make all the relevant events
available for mount mark with relatively little churn.
I will try it out.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: fsnotify path hooks
  2021-03-31 20:59                     ` fsnotify path hooks Amir Goldstein
@ 2021-04-01 10:29                       ` Jan Kara
  2021-04-01 14:18                         ` Amir Goldstein
  0 siblings, 1 reply; 61+ messages in thread
From: Jan Kara @ 2021-04-01 10:29 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, Christian Brauner, linux-fsdevel, Linux API,
	Miklos Szeredi, J. Bruce Fields

On Wed 31-03-21 23:59:27, Amir Goldstein wrote:
> On Wed, Mar 31, 2021 at 5:06 PM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > > > As long as "exp_export: export of idmapped mounts not yet supported.\n"
> > > > I don't think it matters much.
> > > > It feels like adding idmapped mounts to nfsd is on your roadmap.
> > > > When you get to that we can discuss adding fsnotify path hooks to nfsd
> > > > if Jan agrees to the fsnotify path hooks concept.
> > >
> > > I was looking at the patch and thinking about it for a few days already. I
> > > think that generating fsnotify event later (higher up the stack where we
> > > have mount information) is fine and a neat idea. I just dislike the hackery
> > > with dentry flags.
> >
> > Me as well. I used this hack for fast POC.
> >
> > If we stick with the dual hooks approach, we will have to either pass a new
> > argument to vfs helpers or use another trick:
> >
> > Convert all the many calls sites that were converted by Christian to:
> >    vfs_XXX(&init_user_ns, ...
> > because they do not have mount context, to:
> >    vfs_XXX(NULL, ...
> >
> > Inside the vfs helpers, use init_user_ns when mnt_userns is NULL,
> > but pass the original mnt_userns argument to fsnotify_ns_XXX hooks.
> > A non-NULL mnt_userns arg means "path_notify" context.
> > I have already POC code for passing mnt_userns to fsnotify hooks [1].
> >
> > I did not check if this assumption always works, but there seems to
> > be a large overlap between idmapped aware callers and use cases
> > that will require sending events to a mount mark.
> >
> 
> The above "trick" is pretty silly as I believe Christian intends
> to fix all those call sites that pass init_user_ns.

If he does that we also should have the mountpoint there to use for
fsnotify, shouldn't we? :)

> > > Also I'm somewhat uneasy that it is random (from
> > > userspace POV) when path event is generated and when not (at least that's
> > > my impression from the patch - maybe I'm wrong). How difficult would it be
> > > to get rid of it? I mean what if we just moved say fsnotify_create() call
> > > wholly up the stack? It would mean more explicit calls to fsnotify_create()
> > > from filesystems - as far as I'm looking nfsd, overlayfs, cachefiles,
> > > ecryptfs. But that would seem to be manageable.  Also, to maintain sanity,
> >
> > 1. I don't think we can do that for all the fsnotify_create() hooks, such as
> >     debugfs for example
> > 2. It is useless to pass the mount from overlayfs to fsnotify, its a private
> >     mount that users cannot set a mark on anyway and Christian has
> >     promised to propose the same change for cachefiles and ecryptfs,
> >     so I think it's not worth the churn in those call sites
> > 3. I am uneasy with removing the fsnotify hooks from vfs helpers and
> >     trusting that new callers of vfs_create() will remember to add the high
> >     level hooks, so I prefer the existing behavior remains for such callers
> >
> 
> So I read your proposal the wrong way.
> You meant move fsnotify_create() up *without* passing mount context
> from overlayfs and friends.

Well, I was thinking that we could find appropriate mount context for
overlayfs or ecryptfs (which just shows how little I know about these
filesystems ;) I didn't think of e.g. debugfs. Anyway, if we can make
mountpoint marks work for directory events at least for most filesystems, I
think that is OK as well. However it would be then needed to detect whether
a given filesystem actually supports mount marks for dir events and if not,
report error from fanotify_mark() instead of silently not generating
events.

> So yeh, I do think it is manageable. I think the best solution would be
> something along the lines of wrappers like the following:
> 
> static inline int vfs_mkdir(...)
> {
>         int error = __vfs_mkdir_nonotify(...);
>         if (!error)
>                 fsnotify_mkdir(dir, dentry);
>         return error;
> }
> 
> And then the few call sites that call the fsnotify_path_ hooks
> (i.e. in syscalls and perhaps later in nfsd) will call the
> __vfs_xxx_nonotify() variant.

Yes, that is OK with me. Or we could have something like:

static inline void fsnotify_dirent(struct vfsmount *mnt, struct inode *dir,
				   struct dentry *dentry, __u32 mask)
{
	if (!mnt) {
		fsnotify(mask, d_inode(dentry), FSNOTIFY_EVENT_INODE, dir,
			 &dentry->d_name, NULL, 0);
	} else {
		struct path path = {
			.mnt = mnt,
			.dentry = d_find_any_alias(dir)
		};
		fsnotify(mask, d_inode(dentry), FSNOTIFY_EVENT_PATH, &path,
			 &dentry->d_name, NULL, 0);
	}
}

static inline void fsnotify_mkdir(struct vfsmount *mnt, struct inode *inode,
				  struct dentry *dentry)
{
        audit_inode_child(inode, dentry, AUDIT_TYPE_CHILD_CREATE);

        fsnotify_dirent(mnt, inode, dentry, FS_CREATE | FS_ISDIR);
}

static inline int vfs_mkdir(mnt, ...)
{
	int error = __vfs_mkdir_nonotify(...);
	if (!error)
		fsnotify_mkdir(mnt, dir, dentry);
}

And pass mnt to vfs_mkdir() for filesystems where we have it...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: fsnotify path hooks
  2021-04-01 10:29                       ` Jan Kara
@ 2021-04-01 14:18                         ` Amir Goldstein
  2021-04-02  8:20                           ` Amir Goldstein
                                             ` (2 more replies)
  0 siblings, 3 replies; 61+ messages in thread
From: Amir Goldstein @ 2021-04-01 14:18 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christian Brauner, linux-fsdevel, Linux API, Miklos Szeredi,
	J. Bruce Fields

On Thu, Apr 1, 2021 at 1:29 PM Jan Kara <jack@suse.cz> wrote:
>
> On Wed 31-03-21 23:59:27, Amir Goldstein wrote:
> > On Wed, Mar 31, 2021 at 5:06 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > >
> > > > > As long as "exp_export: export of idmapped mounts not yet supported.\n"
> > > > > I don't think it matters much.
> > > > > It feels like adding idmapped mounts to nfsd is on your roadmap.
> > > > > When you get to that we can discuss adding fsnotify path hooks to nfsd
> > > > > if Jan agrees to the fsnotify path hooks concept.
> > > >
> > > > I was looking at the patch and thinking about it for a few days already. I
> > > > think that generating fsnotify event later (higher up the stack where we
> > > > have mount information) is fine and a neat idea. I just dislike the hackery
> > > > with dentry flags.
> > >
> > > Me as well. I used this hack for fast POC.
> > >
> > > If we stick with the dual hooks approach, we will have to either pass a new
> > > argument to vfs helpers or use another trick:
> > >
> > > Convert all the many calls sites that were converted by Christian to:
> > >    vfs_XXX(&init_user_ns, ...
> > > because they do not have mount context, to:
> > >    vfs_XXX(NULL, ...
> > >
> > > Inside the vfs helpers, use init_user_ns when mnt_userns is NULL,
> > > but pass the original mnt_userns argument to fsnotify_ns_XXX hooks.
> > > A non-NULL mnt_userns arg means "path_notify" context.
> > > I have already POC code for passing mnt_userns to fsnotify hooks [1].
> > >
> > > I did not check if this assumption always works, but there seems to
> > > be a large overlap between idmapped aware callers and use cases
> > > that will require sending events to a mount mark.
> > >
> >
> > The above "trick" is pretty silly as I believe Christian intends
> > to fix all those call sites that pass init_user_ns.
>
> If he does that we also should have the mountpoint there to use for
> fsnotify, shouldn't we? :)
>

Yes, but that's not going to be hard for us anyway.
nfsd has mount context available via fhp for any access
and for overlayfs/ecryptfs we don't want the mount mark event.
I will explain why...

> > > > Also I'm somewhat uneasy that it is random (from
> > > > userspace POV) when path event is generated and when not (at least that's
> > > > my impression from the patch - maybe I'm wrong). How difficult would it be
> > > > to get rid of it? I mean what if we just moved say fsnotify_create() call
> > > > wholly up the stack? It would mean more explicit calls to fsnotify_create()
> > > > from filesystems - as far as I'm looking nfsd, overlayfs, cachefiles,
> > > > ecryptfs. But that would seem to be manageable.  Also, to maintain sanity,
> > >
> > > 1. I don't think we can do that for all the fsnotify_create() hooks, such as
> > >     debugfs for example
> > > 2. It is useless to pass the mount from overlayfs to fsnotify, its a private
> > >     mount that users cannot set a mark on anyway and Christian has
> > >     promised to propose the same change for cachefiles and ecryptfs,
> > >     so I think it's not worth the churn in those call sites
> > > 3. I am uneasy with removing the fsnotify hooks from vfs helpers and
> > >     trusting that new callers of vfs_create() will remember to add the high
> > >     level hooks, so I prefer the existing behavior remains for such callers
> > >
> >
> > So I read your proposal the wrong way.
> > You meant move fsnotify_create() up *without* passing mount context
> > from overlayfs and friends.
>
> Well, I was thinking that we could find appropriate mount context for
> overlayfs or ecryptfs (which just shows how little I know about these
> filesystems ;) I didn't think of e.g. debugfs. Anyway, if we can make
> mountpoint marks work for directory events at least for most filesystems, I
> think that is OK as well. However it would be then needed to detect whether
> a given filesystem actually supports mount marks for dir events and if not,
> report error from fanotify_mark() instead of silently not generating
> events.
>

It's not about "filesystems that support mount marks".
mount marks will work perfectly well on overlayfs.

The thing is if you place a mount mark on the underlying store of
overlayfs (say xfs) and then files are created/deleted by the
overlayfs driver (in xfs) you wont get any events, because
overlayfs uses a private mount clone to perform underlying operations.

So while we CAN get the overlayfs underlying layer mount context
it is irrelevant because no user can setup a mount mark on that
private mount, so no need to bother calling the path hooks.

This is not the case with nfsd IMO.
With nfsd, when "exporting" a path to clients, nfsd is really exporting
a specific mount (and keeping that mount busy too).
It can even export whole mount topologies.

But then again, getting the mount context in every nfsd operation
is easy, there is an export context to client requests and the export
context has the exported path.

Therefore, nfsd is my only user using the vfs helpers that is expected
to call the fsnotify path hooks (other than syscalls).

> > So yeh, I do think it is manageable. I think the best solution would be
> > something along the lines of wrappers like the following:
> >
> > static inline int vfs_mkdir(...)
> > {
> >         int error = __vfs_mkdir_nonotify(...);
> >         if (!error)
> >                 fsnotify_mkdir(dir, dentry);
> >         return error;
> > }
> >
> > And then the few call sites that call the fsnotify_path_ hooks
> > (i.e. in syscalls and perhaps later in nfsd) will call the
> > __vfs_xxx_nonotify() variant.
>
> Yes, that is OK with me. Or we could have something like:
>
> static inline void fsnotify_dirent(struct vfsmount *mnt, struct inode *dir,
>                                    struct dentry *dentry, __u32 mask)
> {
>         if (!mnt) {
>                 fsnotify(mask, d_inode(dentry), FSNOTIFY_EVENT_INODE, dir,
>                          &dentry->d_name, NULL, 0);
>         } else {
>                 struct path path = {
>                         .mnt = mnt,
>                         .dentry = d_find_any_alias(dir)
>                 };
>                 fsnotify(mask, d_inode(dentry), FSNOTIFY_EVENT_PATH, &path,
>                          &dentry->d_name, NULL, 0);
>         }
> }
>
> static inline void fsnotify_mkdir(struct vfsmount *mnt, struct inode *inode,
>                                   struct dentry *dentry)
> {
>         audit_inode_child(inode, dentry, AUDIT_TYPE_CHILD_CREATE);
>
>         fsnotify_dirent(mnt, inode, dentry, FS_CREATE | FS_ISDIR);
> }
>
> static inline int vfs_mkdir(mnt, ...)
> {
>         int error = __vfs_mkdir_nonotify(...);
>         if (!error)
>                 fsnotify_mkdir(mnt, dir, dentry);
> }
>

I've done something similar to that. I think it's a bit cleaner,
but we can debate on the details later.
Pushed POC to branch fsnotify_path_hooks.

At the moment, create, delete, move and move_self are supported
for syscalls and helpers are ready for nfsd.

The method I used for rename hook is a bit different than
for other hooks, because other hooks are very easy to open code
while rename is complex so I create a helper for nfsd to call.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: fsnotify path hooks
  2021-04-01 14:18                         ` Amir Goldstein
@ 2021-04-02  8:20                           ` Amir Goldstein
  2021-04-04 10:27                             ` LSM and setxattr helpers Amir Goldstein
  2021-04-06  8:35                           ` fsnotify path hooks Jan Kara
  2021-04-06 18:49                           ` Amir Goldstein
  2 siblings, 1 reply; 61+ messages in thread
From: Amir Goldstein @ 2021-04-02  8:20 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christian Brauner, linux-fsdevel, Linux API, Miklos Szeredi,
	J. Bruce Fields, Tyler Hicks, Mimi Zohar, James Morris,
	Serge E. Hallyn

> This is not the case with nfsd IMO.
> With nfsd, when "exporting" a path to clients, nfsd is really exporting
> a specific mount (and keeping that mount busy too).
> It can even export whole mount topologies.
>
> But then again, getting the mount context in every nfsd operation
> is easy, there is an export context to client requests and the export
> context has the exported path.
>
> Therefore, nfsd is my only user using the vfs helpers that is expected
> to call the fsnotify path hooks (other than syscalls).
>
[...]
>
> I've done something similar to that. I think it's a bit cleaner,
> but we can debate on the details later.
> Pushed POC to branch fsnotify_path_hooks.
>
> At the moment, create, delete, move and move_self are supported
> for syscalls and helpers are ready for nfsd.
>
> The method I used for rename hook is a bit different than
> for other hooks, because other hooks are very easy to open code
> while rename is complex so I create a helper for nfsd to call.
>

I pushed the nfsd example code as well (only compile tested):

https://github.com/amir73il/linux/commits/fsnotify_path_hooks

Now all that is left is dealing with notify_change() and with
vfs_{set,remove}xattr().

Nice thing about vfs_{set,remove}xattr() is that they already have
several levels of __vfs_ helpers and nfsd already calls those, so
we can hoist fsnotify_xattr() hooks hooks up from the __vfs_xxx
helpers to the common vfs_xxx helpers and add fsnotify hooks to
the very few callers of __vfs_ helpers.

nfsd is consistently calling __vfs_{set,remove}xattr_locked() which
do generate events, but ecryptfs mixes __vfs_setxattr_locked() with
__vfs_removexattr(), which does not generate event and does not
check permissions - it looks like an oversight.

The thing is, right now __vfs_setxattr_noperm() generates events,
but looking at all the security/* callers, it feels to me like those are
very internal operations and that "noperm" should also imply "nonotify".

To prove my point, all those callers call __vfs_removexattr() which
does NOT generate an event.

Also, I *think* the EVM setxattr is something that usually follows
another file data/metadata change, so some event would have been
generated by the original change anyway.

Mimi,

Do you have an opinion on that?

The question is if you think it is important for an inotify/fanotify watcher
that subscribed to IN_ATTRIB/FAN_ATTRIB events on a file to get an
event when the IMA security blob changes.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: LSM and setxattr helpers
  2021-04-02  8:20                           ` Amir Goldstein
@ 2021-04-04 10:27                             ` Amir Goldstein
  2021-04-05 12:23                               ` Christian Brauner
                                                 ` (2 more replies)
  0 siblings, 3 replies; 61+ messages in thread
From: Amir Goldstein @ 2021-04-04 10:27 UTC (permalink / raw)
  To: Mimi Zohar, Casey Schaufler, Paul Moore
  Cc: Jan Kara, Christian Brauner, linux-fsdevel, Miklos Szeredi,
	J. Bruce Fields, Tyler Hicks, James Morris, Serge E. Hallyn,
	LSM List

[-- Attachment #1: Type: text/plain, Size: 3903 bytes --]

[forking question about security modules]

>
> Nice thing about vfs_{set,remove}xattr() is that they already have
> several levels of __vfs_ helpers and nfsd already calls those, so
> we can hoist fsnotify_xattr() hooks hooks up from the __vfs_xxx
> helpers to the common vfs_xxx helpers and add fsnotify hooks to
> the very few callers of __vfs_ helpers.
>
> nfsd is consistently calling __vfs_{set,remove}xattr_locked() which
> do generate events, but ecryptfs mixes __vfs_setxattr_locked() with
> __vfs_removexattr(), which does not generate event and does not
> check permissions - it looks like an oversight.
>
> The thing is, right now __vfs_setxattr_noperm() generates events,
> but looking at all the security/* callers, it feels to me like those are
> very internal operations and that "noperm" should also imply "nonotify".
>
> To prove my point, all those callers call __vfs_removexattr() which
> does NOT generate an event.
>
> Also, I *think* the EVM setxattr is something that usually follows
> another file data/metadata change, so some event would have been
> generated by the original change anyway.
>
> Mimi,
>
> Do you have an opinion on that?
>
> The question is if you think it is important for an inotify/fanotify watcher
> that subscribed to IN_ATTRIB/FAN_ATTRIB events on a file to get an
> event when the IMA security blob changes.
>

Guys,

I was doing some re-factoring of the __vfs_setxattr helpers
and noticed some strange things.

The wider context is fsnotify_xattr() hooks inside internal
setxattr,removexattr calls. I would like to move those hooks
to the common vfs_{set,remove}xattr() helpers.

SMACK & SELINUX:
For the callers of __vfs_setxattr_noperm(),
smack_inode_setsecctx() and selinux_inode_setsecctx()
It seems that the only user is nfsd4_set_nfs4_label(), so it
makes sense for me to add the fsnotify_xattr() in nfsd context,
same as I did with other fsnotify_ hooks.

Are there any other expected callers of security_inode_setsecctx()
except nfsd in the future? If so they would need to also add the
fsnotify_xattr() hook, if at all the user visible FS_ATTRIB event is
considered desirable.

SMACK:
Just to be sure, is the call to __vfs_setxattr() from smack_d_instantiate()
guaranteed to be called for an inode whose S_NOSEC flag is already
cleared? Because the flag is being cleared by __vfs_setxattr_noperm().

EVM:
I couldn't find what's stopping this recursion:
evm_update_evmxattr() => __vfs_setxattr_noperm() =>
security_inode_post_setxattr() => evm_inode_post_removexattr() =>
evm_update_evmxattr()

It looks like the S_NOSEC should already be clear when
evm_update_evmxattr() is called(?), so it seems more logical to me to
call __vfs_setxattr() as there is no ->inode_setsecurity() hook for EVM.
Am I missing something?

It seems to me that updating the EVM hmac should not generate
a visible FS_ATTRIB event to listeners, because it is an internal
implementation detail and because update EVM hmac happens
following another change to the inode which anyway reports a
visible event to listeners.
Also, please note that evm_update_evmxattr() may also call
__vfs_removexattr() which does not call the fsnotify_xattr() hook.

IMA:
Similarly, ima_fix_xattr() should be called on an inode without
S_NOSEC flag and no other LSM should be interested in the
IMA hash update, right? So wouldn't it be better to use
__vfs_setxattr() in this case as well?

ima_fix_xattr() can be called after file data update, which again
will have other visible events, but it can also be called in "fix mode"
I suppose also when reading files? Still, it seems to me like an
internal implementation detail that should not generate a user
visible event.

If you agree with my proposed changes, please ACK the
respective bits of your subsystem from the attached patch.
Note that my patch does not contain the proposed change to
use __vfs_setxattr() in IMA/EVM.

Thanks,
Amir.

[-- Attachment #2: move-fsnotify_xattr-hooks-up-the-call-stack.patch.txt --]
[-- Type: text/plain, Size: 4986 bytes --]

From 76db3bf84ac1383c9c0b7e6f8268ab2527873d80 Mon Sep 17 00:00:00 2001
From: Amir Goldstein <amir73il@gmail.com>
Date: Fri, 2 Apr 2021 11:28:13 +0300
Subject: [PATCH] fsnotify: move fsnotify_xattr() hooks up the call stack

Move the fsnotify hooks from internal __vfs_xxx helpers into the
more common vfs_xxx helpers.

Add the fsnotify hooks to the callers of the __vfs_xxx_locked helpers
nfsd and ecryptfs (ecryptfs gains an event on removexattr).

Add the fsnotify hooks to nfsd4_set_nfs4_label() to compensate for the
lost fsnotify hooks from the LSM inode_setsecctx() hooks of smack and
selinux.

Leave evm_update_evmxattr() and ima_fix_xattr() callers without fsnotify
hooks, because the update of IMA/EVM hash seems like information that
may not need to be exposed to fsnotify watchers and in most cases, the
hash update will follow update of inode data or metadata that will have
anyway generated an fsnotify event.

Cc: Tyler Hicks <code@tyhicks.com>
Cc: Mimi Zohar <zohar@linux.ibm.com>
Cc: James Morris <jmorris@namei.org>
Cc: Serge E. Hallyn <serge@hallyn.com>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Cc: Paul Moore <paul@paul-moore.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/ecryptfs/inode.c |  5 +++++
 fs/nfsd/vfs.c       |  6 ++++++
 fs/xattr.c          | 14 ++++++--------
 3 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
index 18e9285fbb4c..7f24830e8c5c 100644
--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -16,6 +16,7 @@
 #include <linux/namei.h>
 #include <linux/mount.h>
 #include <linux/fs_stack.h>
+#include <linux/fsnotify.h>
 #include <linux/slab.h>
 #include <linux/xattr.h>
 #include <asm/unaligned.h>
@@ -1047,6 +1048,8 @@ ecryptfs_setxattr(struct dentry *dentry, struct inode *inode,
 	}
 	inode_lock(lower_inode);
 	rc = __vfs_setxattr_locked(&init_user_ns, lower_dentry, name, value, size, flags, NULL);
+	if (!rc)
+		fsnotify_xattr(lower_dentry);
 	inode_unlock(lower_inode);
 	if (!rc && inode)
 		fsstack_copy_attr_all(inode, lower_inode);
@@ -1113,6 +1116,8 @@ static int ecryptfs_removexattr(struct dentry *dentry, struct inode *inode,
 	}
 	inode_lock(lower_inode);
 	rc = __vfs_removexattr(&init_user_ns, lower_dentry, name);
+	if (!rc)
+		fsnotify_xattr(lower_dentry);
 	inode_unlock(lower_inode);
 out:
 	return rc;
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 611c4b8f3c74..75e22ab17cca 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -520,6 +520,8 @@ __be32 nfsd4_set_nfs4_label(struct svc_rqst *rqstp, struct svc_fh *fhp,
 
 	inode_lock(d_inode(dentry));
 	host_error = security_inode_setsecctx(dentry, label->data, label->len);
+	if (!host_error)
+		fsnotify_xattr(dentry);
 	inode_unlock(d_inode(dentry));
 	return nfserrno(host_error);
 }
@@ -2315,6 +2317,8 @@ nfsd_removexattr(struct svc_rqst *rqstp, struct svc_fh *fhp, char *name)
 
 	ret = __vfs_removexattr_locked(&init_user_ns, fhp->fh_dentry,
 				       name, NULL);
+	if (!ret)
+		fsnotify_xattr(fhp->fh_dentry);
 
 	fh_unlock(fhp);
 	fh_drop_write(fhp);
@@ -2340,6 +2344,8 @@ nfsd_setxattr(struct svc_rqst *rqstp, struct svc_fh *fhp, char *name,
 
 	ret = __vfs_setxattr_locked(&init_user_ns, fhp->fh_dentry, name, buf,
 				    len, flags, NULL);
+	if (!ret)
+		fsnotify_xattr(fhp->fh_dentry);
 
 	fh_unlock(fhp);
 	fh_drop_write(fhp);
diff --git a/fs/xattr.c b/fs/xattr.c
index b3444e06cded..6c0ac300b519 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -213,11 +213,9 @@ int __vfs_setxattr_noperm(struct user_namespace *mnt_userns,
 	if (inode->i_opflags & IOP_XATTR) {
 		error = __vfs_setxattr(mnt_userns, dentry, inode, name, value,
 				       size, flags);
-		if (!error) {
-			fsnotify_xattr(dentry);
+		if (!error)
 			security_inode_post_setxattr(dentry, name, value,
 						     size, flags);
-		}
 	} else {
 		if (unlikely(is_bad_inode(inode)))
 			return -EIO;
@@ -230,8 +228,6 @@ int __vfs_setxattr_noperm(struct user_namespace *mnt_userns,
 
 			error = security_inode_setsecurity(inode, suffix, value,
 							   size, flags);
-			if (!error)
-				fsnotify_xattr(dentry);
 		}
 	}
 
@@ -299,6 +295,8 @@ vfs_setxattr(struct user_namespace *mnt_userns, struct dentry *dentry,
 	inode_lock(inode);
 	error = __vfs_setxattr_locked(mnt_userns, dentry, name, value, size,
 				      flags, &delegated_inode);
+	if (!error)
+		fsnotify_xattr(dentry);
 	inode_unlock(inode);
 
 	if (delegated_inode) {
@@ -500,10 +498,8 @@ __vfs_removexattr_locked(struct user_namespace *mnt_userns,
 
 	error = __vfs_removexattr(mnt_userns, dentry, name);
 
-	if (!error) {
-		fsnotify_xattr(dentry);
+	if (!error)
 		evm_inode_post_removexattr(dentry, name);
-	}
 
 out:
 	return error;
@@ -522,6 +518,8 @@ vfs_removexattr(struct user_namespace *mnt_userns, struct dentry *dentry,
 	inode_lock(inode);
 	error = __vfs_removexattr_locked(mnt_userns, dentry,
 					 name, &delegated_inode);
+	if (!error)
+		fsnotify_xattr(dentry);
 	inode_unlock(inode);
 
 	if (delegated_inode) {
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: LSM and setxattr helpers
  2021-04-04 10:27                             ` LSM and setxattr helpers Amir Goldstein
@ 2021-04-05 12:23                               ` Christian Brauner
  2021-04-05 14:47                               ` Mimi Zohar
  2021-04-05 16:18                               ` Casey Schaufler
  2 siblings, 0 replies; 61+ messages in thread
From: Christian Brauner @ 2021-04-05 12:23 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Mimi Zohar, Casey Schaufler, Paul Moore, Jan Kara, linux-fsdevel,
	Miklos Szeredi, J. Bruce Fields, Tyler Hicks, James Morris,
	Serge E. Hallyn, LSM List

On Sun, Apr 04, 2021 at 01:27:21PM +0300, Amir Goldstein wrote:
> [forking question about security modules]
> 
> >
> > Nice thing about vfs_{set,remove}xattr() is that they already have
> > several levels of __vfs_ helpers and nfsd already calls those, so
> > we can hoist fsnotify_xattr() hooks hooks up from the __vfs_xxx
> > helpers to the common vfs_xxx helpers and add fsnotify hooks to
> > the very few callers of __vfs_ helpers.
> >
> > nfsd is consistently calling __vfs_{set,remove}xattr_locked() which
> > do generate events, but ecryptfs mixes __vfs_setxattr_locked() with
> > __vfs_removexattr(), which does not generate event and does not
> > check permissions - it looks like an oversight.
> >
> > The thing is, right now __vfs_setxattr_noperm() generates events,
> > but looking at all the security/* callers, it feels to me like those are
> > very internal operations and that "noperm" should also imply "nonotify".
> >
> > To prove my point, all those callers call __vfs_removexattr() which
> > does NOT generate an event.
> >
> > Also, I *think* the EVM setxattr is something that usually follows
> > another file data/metadata change, so some event would have been
> > generated by the original change anyway.
> >
> > Mimi,
> >
> > Do you have an opinion on that?
> >
> > The question is if you think it is important for an inotify/fanotify watcher
> > that subscribed to IN_ATTRIB/FAN_ATTRIB events on a file to get an
> > event when the IMA security blob changes.
> >
> 
> Guys,
> 
> I was doing some re-factoring of the __vfs_setxattr helpers
> and noticed some strange things.
> 
> The wider context is fsnotify_xattr() hooks inside internal
> setxattr,removexattr calls. I would like to move those hooks
> to the common vfs_{set,remove}xattr() helpers.
> 
> SMACK & SELINUX:
> For the callers of __vfs_setxattr_noperm(),
> smack_inode_setsecctx() and selinux_inode_setsecctx()
> It seems that the only user is nfsd4_set_nfs4_label(), so it
> makes sense for me to add the fsnotify_xattr() in nfsd context,
> same as I did with other fsnotify_ hooks.
> 
> Are there any other expected callers of security_inode_setsecctx()
> except nfsd in the future? If so they would need to also add the
> fsnotify_xattr() hook, if at all the user visible FS_ATTRIB event is
> considered desirable.
> 
> SMACK:
> Just to be sure, is the call to __vfs_setxattr() from smack_d_instantiate()
> guaranteed to be called for an inode whose S_NOSEC flag is already
> cleared? Because the flag is being cleared by __vfs_setxattr_noperm().
> 
> EVM:
> I couldn't find what's stopping this recursion:
> evm_update_evmxattr() => __vfs_setxattr_noperm() =>
> security_inode_post_setxattr() => evm_inode_post_removexattr() =>
> evm_update_evmxattr()
> 
> It looks like the S_NOSEC should already be clear when
> evm_update_evmxattr() is called(?), so it seems more logical to me to
> call __vfs_setxattr() as there is no ->inode_setsecurity() hook for EVM.
> Am I missing something?
> 
> It seems to me that updating the EVM hmac should not generate
> a visible FS_ATTRIB event to listeners, because it is an internal
> implementation detail and because update EVM hmac happens
> following another change to the inode which anyway reports a
> visible event to listeners.
> Also, please note that evm_update_evmxattr() may also call
> __vfs_removexattr() which does not call the fsnotify_xattr() hook.
> 
> IMA:
> Similarly, ima_fix_xattr() should be called on an inode without
> S_NOSEC flag and no other LSM should be interested in the
> IMA hash update, right? So wouldn't it be better to use
> __vfs_setxattr() in this case as well?

It feels like xattr changes that are essentially side-effects of another
operation should probably not generate fsnotify() events in general; at
least not without a good reason why userspace needs to know about the
event.

Christian

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: LSM and setxattr helpers
  2021-04-04 10:27                             ` LSM and setxattr helpers Amir Goldstein
  2021-04-05 12:23                               ` Christian Brauner
@ 2021-04-05 14:47                               ` Mimi Zohar
  2021-04-06 15:43                                 ` Amir Goldstein
  2021-04-05 16:18                               ` Casey Schaufler
  2 siblings, 1 reply; 61+ messages in thread
From: Mimi Zohar @ 2021-04-05 14:47 UTC (permalink / raw)
  To: Amir Goldstein, Casey Schaufler, Paul Moore
  Cc: Jan Kara, Christian Brauner, linux-fsdevel, Miklos Szeredi,
	J. Bruce Fields, Tyler Hicks, James Morris, Serge E. Hallyn,
	LSM List

Hi Amir,

On Sun, 2021-04-04 at 13:27 +0300, Amir Goldstein wrote:
> [forking question about security modules]
> 
> >
> > Nice thing about vfs_{set,remove}xattr() is that they already have
> > several levels of __vfs_ helpers and nfsd already calls those, so
> > we can hoist fsnotify_xattr() hooks hooks up from the __vfs_xxx
> > helpers to the common vfs_xxx helpers and add fsnotify hooks to
> > the very few callers of __vfs_ helpers.
> >
> > nfsd is consistently calling __vfs_{set,remove}xattr_locked() which
> > do generate events, but ecryptfs mixes __vfs_setxattr_locked() with
> > __vfs_removexattr(), which does not generate event and does not
> > check permissions - it looks like an oversight.
> >
> > The thing is, right now __vfs_setxattr_noperm() generates events,
> > but looking at all the security/* callers, it feels to me like those are
> > very internal operations and that "noperm" should also imply "nonotify".
> >
> > To prove my point, all those callers call __vfs_removexattr() which
> > does NOT generate an event.
> >
> > Also, I *think* the EVM setxattr is something that usually follows
> > another file data/metadata change, so some event would have been
> > generated by the original change anyway.
> >
> > Mimi,
> >
> > Do you have an opinion on that?

Right, EVM is re-calculating the EVM HMAC, which is based on other LSM
xattrs and includes some misc file metadata (e.g. ino, generation, uid,
gid, mode).

> >
> > The question is if you think it is important for an inotify/fanotify watcher
> > that subscribed to IN_ATTRIB/FAN_ATTRIB events on a file to get an
> > event when the IMA security blob changes.

Probably not.  Programs could open files R/W, but never modify the
file.  Perhaps to detect mutable file changes, but I'm not aware of
anyone doing so.

> 
> Guys,
> 
> I was doing some re-factoring of the __vfs_setxattr helpers
> and noticed some strange things.
> 
> The wider context is fsnotify_xattr() hooks inside internal
> setxattr,removexattr calls. I would like to move those hooks
> to the common vfs_{set,remove}xattr() helpers.
> 
> SMACK & SELINUX:
> For the callers of __vfs_setxattr_noperm(),
> smack_inode_setsecctx() and selinux_inode_setsecctx()
> It seems that the only user is nfsd4_set_nfs4_label(), so it
> makes sense for me to add the fsnotify_xattr() in nfsd context,
> same as I did with other fsnotify_ hooks.
> 
> Are there any other expected callers of security_inode_setsecctx()
> except nfsd in the future? If so they would need to also add the
> fsnotify_xattr() hook, if at all the user visible FS_ATTRIB event is
> considered desirable.
> 
> SMACK:
> Just to be sure, is the call to __vfs_setxattr() from smack_d_instantiate()
> guaranteed to be called for an inode whose S_NOSEC flag is already
> cleared? Because the flag is being cleared by __vfs_setxattr_noperm().
> 
> EVM:
> I couldn't find what's stopping this recursion:
> evm_update_evmxattr() => __vfs_setxattr_noperm() =>
> security_inode_post_setxattr() => evm_inode_post_removexattr() =>
> evm_update_evmxattr()

EVM is triggered when file metadata changes, causing the EVM HMAC to be
re-calculated. Before updating security.evm, EVM first verifies, on the
evm_inode_setattr/setxattr/removexattr() hooks, that the existing
security.evm value is correct.

On the _post hooks, security.evm is updated or removed, if no LSM xattr
exists.

> It looks like the S_NOSEC should already be clear when
> evm_update_evmxattr() is called(?), so it seems more logical to me to
> call __vfs_setxattr() as there is no ->inode_setsecurity() hook for EVM.
> Am I missing something?

EVM is triggered when an LSM updates/removes its xattr.   The LSM is
responsible for taking the inode lock.   Thus it is calling
__vfs_setxattr_noperm.

> 
> It seems to me that updating the EVM hmac should not generate
> a visible FS_ATTRIB event to listeners, because it is an internal
> implementation detail and because update EVM hmac happens
> following another change to the inode which anyway reports a
> visible event to listeners.

Ok

> Also, please note that evm_update_evmxattr() may also call
> __vfs_removexattr() which does not call the fsnotify_xattr() hook.
> 
> IMA:
> Similarly, ima_fix_xattr() should be called on an inode without
> S_NOSEC flag and no other LSM should be interested in the
> IMA hash update, right? So wouldn't it be better to use
> __vfs_setxattr() in this case as well?
> 
> ima_fix_xattr() can be called after file data update, which again
> will have other visible events, but it can also be called in "fix mode"
> I suppose also when reading files? Still, it seems to me like an
> internal implementation detail that should not generate a user
> visible event.

Originally, IMA took the inode lock really early, way before calling
setxattr.  Taking the inode lock can probably be deferred to setxattr. 
I have a vague recollection that SELinux also prevented IMA from
writing its own xattr label.  I don't know if that is still true.

thanks,

Mimi

> 
> If you agree with my proposed changes, please ACK the
> respective bits of your subsystem from the attached patch.
> Note that my patch does not contain the proposed change to
> use __vfs_setxattr() in IMA/EVM.
> 
> Thanks,
> Amir.


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: LSM and setxattr helpers
  2021-04-04 10:27                             ` LSM and setxattr helpers Amir Goldstein
  2021-04-05 12:23                               ` Christian Brauner
  2021-04-05 14:47                               ` Mimi Zohar
@ 2021-04-05 16:18                               ` Casey Schaufler
  2 siblings, 0 replies; 61+ messages in thread
From: Casey Schaufler @ 2021-04-05 16:18 UTC (permalink / raw)
  To: Amir Goldstein, Mimi Zohar, Paul Moore
  Cc: Jan Kara, Christian Brauner, linux-fsdevel, Miklos Szeredi,
	J. Bruce Fields, Tyler Hicks, James Morris, Serge E. Hallyn,
	LSM List, Casey Schaufler

On 4/4/2021 3:27 AM, Amir Goldstein wrote:
> [forking question about security modules]
>
>> Nice thing about vfs_{set,remove}xattr() is that they already have
>> several levels of __vfs_ helpers and nfsd already calls those, so
>> we can hoist fsnotify_xattr() hooks hooks up from the __vfs_xxx
>> helpers to the common vfs_xxx helpers and add fsnotify hooks to
>> the very few callers of __vfs_ helpers.
>>
>> nfsd is consistently calling __vfs_{set,remove}xattr_locked() which
>> do generate events, but ecryptfs mixes __vfs_setxattr_locked() with
>> __vfs_removexattr(), which does not generate event and does not
>> check permissions - it looks like an oversight.
>>
>> The thing is, right now __vfs_setxattr_noperm() generates events,
>> but looking at all the security/* callers, it feels to me like those are
>> very internal operations and that "noperm" should also imply "nonotify".
>>
>> To prove my point, all those callers call __vfs_removexattr() which
>> does NOT generate an event.
>>
>> Also, I *think* the EVM setxattr is something that usually follows
>> another file data/metadata change, so some event would have been
>> generated by the original change anyway.
>>
>> Mimi,
>>
>> Do you have an opinion on that?
>>
>> The question is if you think it is important for an inotify/fanotify watcher
>> that subscribed to IN_ATTRIB/FAN_ATTRIB events on a file to get an
>> event when the IMA security blob changes.
>>
> Guys,
>
> I was doing some re-factoring of the __vfs_setxattr helpers
> and noticed some strange things.
>
> The wider context is fsnotify_xattr() hooks inside internal
> setxattr,removexattr calls. I would like to move those hooks
> to the common vfs_{set,remove}xattr() helpers.
>
> SMACK & SELINUX:
> For the callers of __vfs_setxattr_noperm(),
> smack_inode_setsecctx() and selinux_inode_setsecctx()
> It seems that the only user is nfsd4_set_nfs4_label(), so it
> makes sense for me to add the fsnotify_xattr() in nfsd context,
> same as I did with other fsnotify_ hooks.

That seems reasonable to me, but the SELinux team would
have more experience with NFS deployemnts than Smack does.

> Are there any other expected callers of security_inode_setsecctx()
> except nfsd in the future? If so they would need to also add the
> fsnotify_xattr() hook, if at all the user visible FS_ATTRIB event is
> considered desirable.

Not that I know of.

> SMACK:
> Just to be sure, is the call to __vfs_setxattr() from smack_d_instantiate()
> guaranteed to be called for an inode whose S_NOSEC flag is already
> cleared? Because the flag is being cleared by __vfs_setxattr_noperm().

My understanding is that the inode is always in the process of being
initialized, and that S_NOSEC should never have been set.

>
> EVM:
> I couldn't find what's stopping this recursion:
> evm_update_evmxattr() => __vfs_setxattr_noperm() =>
> security_inode_post_setxattr() => evm_inode_post_removexattr() =>
> evm_update_evmxattr()
>
> It looks like the S_NOSEC should already be clear when
> evm_update_evmxattr() is called(?), so it seems more logical to me to
> call __vfs_setxattr() as there is no ->inode_setsecurity() hook for EVM.
> Am I missing something?
>
> It seems to me that updating the EVM hmac should not generate
> a visible FS_ATTRIB event to listeners, because it is an internal
> implementation detail and because update EVM hmac happens
> following another change to the inode which anyway reports a
> visible event to listeners.
> Also, please note that evm_update_evmxattr() may also call
> __vfs_removexattr() which does not call the fsnotify_xattr() hook.
>
> IMA:
> Similarly, ima_fix_xattr() should be called on an inode without
> S_NOSEC flag and no other LSM should be interested in the
> IMA hash update, right? So wouldn't it be better to use
> __vfs_setxattr() in this case as well?
>
> ima_fix_xattr() can be called after file data update, which again
> will have other visible events, but it can also be called in "fix mode"
> I suppose also when reading files? Still, it seems to me like an
> internal implementation detail that should not generate a user
> visible event.
>
> If you agree with my proposed changes, please ACK the
> respective bits of your subsystem from the attached patch.
> Note that my patch does not contain the proposed change to
> use __vfs_setxattr() in IMA/EVM.
>
> Thanks,
> Amir.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: fsnotify path hooks
  2021-04-01 14:18                         ` Amir Goldstein
  2021-04-02  8:20                           ` Amir Goldstein
@ 2021-04-06  8:35                           ` Jan Kara
  2021-04-06 18:49                           ` Amir Goldstein
  2 siblings, 0 replies; 61+ messages in thread
From: Jan Kara @ 2021-04-06  8:35 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, Christian Brauner, linux-fsdevel, Linux API,
	Miklos Szeredi, J. Bruce Fields

On Thu 01-04-21 17:18:05, Amir Goldstein wrote:
> > > > > Also I'm somewhat uneasy that it is random (from
> > > > > userspace POV) when path event is generated and when not (at least that's
> > > > > my impression from the patch - maybe I'm wrong). How difficult would it be
> > > > > to get rid of it? I mean what if we just moved say fsnotify_create() call
> > > > > wholly up the stack? It would mean more explicit calls to fsnotify_create()
> > > > > from filesystems - as far as I'm looking nfsd, overlayfs, cachefiles,
> > > > > ecryptfs. But that would seem to be manageable.  Also, to maintain sanity,
> > > >
> > > > 1. I don't think we can do that for all the fsnotify_create() hooks, such as
> > > >     debugfs for example
> > > > 2. It is useless to pass the mount from overlayfs to fsnotify, its a private
> > > >     mount that users cannot set a mark on anyway and Christian has
> > > >     promised to propose the same change for cachefiles and ecryptfs,
> > > >     so I think it's not worth the churn in those call sites
> > > > 3. I am uneasy with removing the fsnotify hooks from vfs helpers and
> > > >     trusting that new callers of vfs_create() will remember to add the high
> > > >     level hooks, so I prefer the existing behavior remains for such callers
> > > >
> > >
> > > So I read your proposal the wrong way.
> > > You meant move fsnotify_create() up *without* passing mount context
> > > from overlayfs and friends.
> >
> > Well, I was thinking that we could find appropriate mount context for
> > overlayfs or ecryptfs (which just shows how little I know about these
> > filesystems ;) I didn't think of e.g. debugfs. Anyway, if we can make
> > mountpoint marks work for directory events at least for most filesystems, I
> > think that is OK as well. However it would be then needed to detect whether
> > a given filesystem actually supports mount marks for dir events and if not,
> > report error from fanotify_mark() instead of silently not generating
> > events.
> >
> 
> It's not about "filesystems that support mount marks".
> mount marks will work perfectly well on overlayfs.
> 
> The thing is if you place a mount mark on the underlying store of
> overlayfs (say xfs) and then files are created/deleted by the
> overlayfs driver (in xfs) you wont get any events, because
> overlayfs uses a private mount clone to perform underlying operations.

OK, understood.

> So while we CAN get the overlayfs underlying layer mount context
> it is irrelevant because no user can setup a mount mark on that
> private mount, so no need to bother calling the path hooks.
> 
> This is not the case with nfsd IMO.
> With nfsd, when "exporting" a path to clients, nfsd is really exporting
> a specific mount (and keeping that mount busy too).
> It can even export whole mount topologies.
> 
> But then again, getting the mount context in every nfsd operation
> is easy, there is an export context to client requests and the export
> context has the exported path.
> 
> Therefore, nfsd is my only user using the vfs helpers that is expected
> to call the fsnotify path hooks (other than syscalls).

I agree.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: LSM and setxattr helpers
  2021-04-05 14:47                               ` Mimi Zohar
@ 2021-04-06 15:43                                 ` Amir Goldstein
  0 siblings, 0 replies; 61+ messages in thread
From: Amir Goldstein @ 2021-04-06 15:43 UTC (permalink / raw)
  To: Mimi Zohar
  Cc: Casey Schaufler, Paul Moore, Jan Kara, Christian Brauner,
	linux-fsdevel, Miklos Szeredi, J. Bruce Fields, Tyler Hicks,
	James Morris, Serge E. Hallyn, LSM List

security_inode_post_setxattr

On Mon, Apr 5, 2021 at 5:47 PM Mimi Zohar <zohar@linux.ibm.com> wrote:
>
> Hi Amir,
>
> On Sun, 2021-04-04 at 13:27 +0300, Amir Goldstein wrote:
> > [forking question about security modules]
> >
> > >
> > > Nice thing about vfs_{set,remove}xattr() is that they already have
> > > several levels of __vfs_ helpers and nfsd already calls those, so
> > > we can hoist fsnotify_xattr() hooks hooks up from the __vfs_xxx
> > > helpers to the common vfs_xxx helpers and add fsnotify hooks to
> > > the very few callers of __vfs_ helpers.
> > >
> > > nfsd is consistently calling __vfs_{set,remove}xattr_locked() which
> > > do generate events, but ecryptfs mixes __vfs_setxattr_locked() with
> > > __vfs_removexattr(), which does not generate event and does not
> > > check permissions - it looks like an oversight.
> > >
> > > The thing is, right now __vfs_setxattr_noperm() generates events,
> > > but looking at all the security/* callers, it feels to me like those are
> > > very internal operations and that "noperm" should also imply "nonotify".
> > >
> > > To prove my point, all those callers call __vfs_removexattr() which
> > > does NOT generate an event.
> > >
> > > Also, I *think* the EVM setxattr is something that usually follows
> > > another file data/metadata change, so some event would have been
> > > generated by the original change anyway.
> > >
> > > Mimi,
> > >
> > > Do you have an opinion on that?
>
> Right, EVM is re-calculating the EVM HMAC, which is based on other LSM
> xattrs and includes some misc file metadata (e.g. ino, generation, uid,
> gid, mode).
>

That explains why EVM registers to security_inode_post_setxattr() hook in
__vfs_setxattr_noperm() and which is the helper that selinux and smack call.

> > >
> > > The question is if you think it is important for an inotify/fanotify watcher
> > > that subscribed to IN_ATTRIB/FAN_ATTRIB events on a file to get an
> > > event when the IMA security blob changes.
>
> Probably not.  Programs could open files R/W, but never modify the
> file.  Perhaps to detect mutable file changes, but I'm not aware of
> anyone doing so.
>
> >
> > Guys,
> >
> > I was doing some re-factoring of the __vfs_setxattr helpers
> > and noticed some strange things.
> >
> > The wider context is fsnotify_xattr() hooks inside internal
> > setxattr,removexattr calls. I would like to move those hooks
> > to the common vfs_{set,remove}xattr() helpers.
> >
> > SMACK & SELINUX:
> > For the callers of __vfs_setxattr_noperm(),
> > smack_inode_setsecctx() and selinux_inode_setsecctx()
> > It seems that the only user is nfsd4_set_nfs4_label(), so it
> > makes sense for me to add the fsnotify_xattr() in nfsd context,
> > same as I did with other fsnotify_ hooks.
> >
> > Are there any other expected callers of security_inode_setsecctx()
> > except nfsd in the future? If so they would need to also add the
> > fsnotify_xattr() hook, if at all the user visible FS_ATTRIB event is
> > considered desirable.
> >
> > SMACK:
> > Just to be sure, is the call to __vfs_setxattr() from smack_d_instantiate()
> > guaranteed to be called for an inode whose S_NOSEC flag is already
> > cleared? Because the flag is being cleared by __vfs_setxattr_noperm().
> >
> > EVM:
> > I couldn't find what's stopping this recursion:
> > evm_update_evmxattr() => __vfs_setxattr_noperm() =>
> > security_inode_post_setxattr() => evm_inode_post_removexattr() =>
> > evm_update_evmxattr()
>
> EVM is triggered when file metadata changes, causing the EVM HMAC to be
> re-calculated. Before updating security.evm, EVM first verifies, on the
> evm_inode_setattr/setxattr/removexattr() hooks, that the existing
> security.evm value is correct.
>
> On the _post hooks, security.evm is updated or removed, if no LSM xattr
> exists.
>

I'm not sure I understand why evm_update_evmxattr() calls
__vfs_setxattr_noperm() and not __vfs_setxattr(), but it's not really important
for my needs to understand this. Neither helper will generate an fsnotify event.

> > It looks like the S_NOSEC should already be clear when
> > evm_update_evmxattr() is called(?), so it seems more logical to me to
> > call __vfs_setxattr() as there is no ->inode_setsecurity() hook for EVM.
> > Am I missing something?
>
> EVM is triggered when an LSM updates/removes its xattr.   The LSM is
> responsible for taking the inode lock.   Thus it is calling
> __vfs_setxattr_noperm.
>

Surely you need to call a variant that is __vfs_setxattr_locked() or
below it. I just did not understand why that variant is not  __vfs_setxattr().

> >
> > It seems to me that updating the EVM hmac should not generate
> > a visible FS_ATTRIB event to listeners, because it is an internal
> > implementation detail and because update EVM hmac happens
> > following another change to the inode which anyway reports a
> > visible event to listeners.
>
> Ok
>


OK. It looks like there is a consensus about losing those events.
That's what I thought, but wanted to check with you security guys.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: fsnotify path hooks
  2021-04-01 14:18                         ` Amir Goldstein
  2021-04-02  8:20                           ` Amir Goldstein
  2021-04-06  8:35                           ` fsnotify path hooks Jan Kara
@ 2021-04-06 18:49                           ` Amir Goldstein
  2021-04-08 12:52                             ` Jan Kara
  2 siblings, 1 reply; 61+ messages in thread
From: Amir Goldstein @ 2021-04-06 18:49 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christian Brauner, linux-fsdevel, Miklos Szeredi, J. Bruce Fields

[...]
> > > So yeh, I do think it is manageable. I think the best solution would be
> > > something along the lines of wrappers like the following:
> > >
> > > static inline int vfs_mkdir(...)
> > > {
> > >         int error = __vfs_mkdir_nonotify(...);
> > >         if (!error)
> > >                 fsnotify_mkdir(dir, dentry);
> > >         return error;
> > > }
> > >
> > > And then the few call sites that call the fsnotify_path_ hooks
> > > (i.e. in syscalls and perhaps later in nfsd) will call the
> > > __vfs_xxx_nonotify() variant.
> >
> > Yes, that is OK with me. Or we could have something like:
> >
> > static inline void fsnotify_dirent(struct vfsmount *mnt, struct inode *dir,
> >                                    struct dentry *dentry, __u32 mask)
> > {
> >         if (!mnt) {
> >                 fsnotify(mask, d_inode(dentry), FSNOTIFY_EVENT_INODE, dir,
> >                          &dentry->d_name, NULL, 0);
> >         } else {
> >                 struct path path = {
> >                         .mnt = mnt,
> >                         .dentry = d_find_any_alias(dir)
> >                 };
> >                 fsnotify(mask, d_inode(dentry), FSNOTIFY_EVENT_PATH, &path,
> >                          &dentry->d_name, NULL, 0);
> >         }
> > }
> >
> > static inline void fsnotify_mkdir(struct vfsmount *mnt, struct inode *inode,
> >                                   struct dentry *dentry)
> > {
> >         audit_inode_child(inode, dentry, AUDIT_TYPE_CHILD_CREATE);
> >
> >         fsnotify_dirent(mnt, inode, dentry, FS_CREATE | FS_ISDIR);
> > }
> >
> > static inline int vfs_mkdir(mnt, ...)
> > {
> >         int error = __vfs_mkdir_nonotify(...);
> >         if (!error)
> >                 fsnotify_mkdir(mnt, dir, dentry);
> > }
> >
>
> I've done something similar to that. I think it's a bit cleaner,
> but we can debate on the details later.
> Pushed POC to branch fsnotify_path_hooks.

FYI, I tried your suggested approach above for fsnotify_xattr(),
but I think I prefer to use an explicit flavor fsnotify_xattr_mnt()
and a wrapper fsnotify_xattr().
Pushed WIP to fsnotify_path_hooks branch. It also contains
some unstashed "fix" patches to tidy up the previous hooks.

I ran into another hurdle with fsnotify_xattr() -
vfs_setxattr() is too large to duplicate a _nonotify() variant IMO.
OTOH, I cannot lift fsnotify_xattr() up to callers without moving
the fsnotify hook outside the inode lock.

This was not a problem with the directory entry path hooks.
This is also not going to be a problem with fsnotify_change(),
because notify_change() is called with inode locked.

Do you think that calling fsnotify_xattr() under inode lock is important?
Should I refactor a helper vfs_setxattr_notify() that takes a boolean
arg for optionally calling fsnotify_xattr()?
Do you have another idea how to deal with that hook?

With notify_change() I have a different silly problem with using the
refactoring method - the name notify_change_nonotify() is unacceptable.
We may consider __ATTR_NONOTIFY ia_valid flag as the method to
use instead of refactoring in this case, just because we can and
because it creates less clutter.

What do you think?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: open_by_handle_at() in userns
  2021-03-31 10:08       ` Christian Brauner
  2021-03-31 10:57         ` Amir Goldstein
@ 2021-04-08 11:44         ` Amir Goldstein
  2021-04-08 12:55           ` Christian Brauner
  1 sibling, 1 reply; 61+ messages in thread
From: Amir Goldstein @ 2021-04-08 11:44 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Jan Kara, linux-fsdevel, Linux API

> One thing your patch
>
> commit ea31e84fda83c17b88851de399f76f5d9fc1abf4
> Author: Amir Goldstein <amir73il@gmail.com>
> Date:   Sat Mar 20 12:58:12 2021 +0200
>
>     fs: allow open by file handle inside userns
>
>     open_by_handle_at(2) requires CAP_DAC_READ_SEARCH in init userns,
>     where most filesystems are mounted.
>
>     Relax the requirement to allow a user with CAP_DAC_READ_SEARCH
>     inside userns to open by file handle in filesystems that were
>     mounted inside that userns.
>
>     In addition, also allow open by handle in an idmapped mount, which is
>     mapped to the userns while verifying that the returned open file path
>     is under the root of the idmapped mount.
>
>     This is going to be needed for setting an fanotify mark on a filesystem
>     and watching events inside userns.
>
>     Signed-off-by: Amir Goldstein <amir73il@gmail.com>
>
> Requires fs/exportfs/expfs.c to be made idmapped mounts aware.
> open_by_handle_at() uses exportfs_decode_fh() which e.g. has the
> following and other callchains:
>
> exportfs_decode_fh()
> -> exportfs_decode_fh_raw()
>    -> lookup_one_len()
>       -> inode_permission(mnt_userns, ...)
>
> That's not a huge problem though I did all these changes for the
> overlayfs support for idmapped mounts I have in a branch from an earlier
> version of the idmapped mounts patchset. Basically lookup_one_len(),
> lookup_one_len_unlocked(), and lookup_positive_unlocked() need to take
> the mnt_userns into account. I can rebase my change and send it for
> consideration next cycle. If you can live without the
> open_by_handle_at() support for now in this patchset (Which I think you
> said you could.) then it's not a blocker either. Sorry for the
> inconvenience.
>

Christian,

I think making exportfs_decode_fh() idmapped mount aware is not
enough, because when a dentry alias is found in dcache, none of
those lookup functions are called.

I think we will also need something like this:
https://github.com/amir73il/linux/commits/fhandle_userns

I factored-out a helper from nfsd_apcceptable() which implements
the "subtree_check" nfsd logic and uses it for open_by_handle_at().

I've also added a small patch to name_to_handle_at() with a UAPI
change that could make these changes usable by userspace nfs
server inside userns, but I have no demo nor tests for that and frankly,
I have little incentive to try and promote this UAPI change without
anybody asking for it...

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: fsnotify path hooks
  2021-04-06 18:49                           ` Amir Goldstein
@ 2021-04-08 12:52                             ` Jan Kara
  2021-04-08 15:11                               ` Amir Goldstein
  0 siblings, 1 reply; 61+ messages in thread
From: Jan Kara @ 2021-04-08 12:52 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, Christian Brauner, linux-fsdevel, Miklos Szeredi,
	J. Bruce Fields

On Tue 06-04-21 21:49:13, Amir Goldstein wrote:
> [...]
> > > > So yeh, I do think it is manageable. I think the best solution would be
> > > > something along the lines of wrappers like the following:
> > > >
> > > > static inline int vfs_mkdir(...)
> > > > {
> > > >         int error = __vfs_mkdir_nonotify(...);
> > > >         if (!error)
> > > >                 fsnotify_mkdir(dir, dentry);
> > > >         return error;
> > > > }
> > > >
> > > > And then the few call sites that call the fsnotify_path_ hooks
> > > > (i.e. in syscalls and perhaps later in nfsd) will call the
> > > > __vfs_xxx_nonotify() variant.
> > >
> > > Yes, that is OK with me. Or we could have something like:
> > >
> > > static inline void fsnotify_dirent(struct vfsmount *mnt, struct inode *dir,
> > >                                    struct dentry *dentry, __u32 mask)
> > > {
> > >         if (!mnt) {
> > >                 fsnotify(mask, d_inode(dentry), FSNOTIFY_EVENT_INODE, dir,
> > >                          &dentry->d_name, NULL, 0);
> > >         } else {
> > >                 struct path path = {
> > >                         .mnt = mnt,
> > >                         .dentry = d_find_any_alias(dir)
> > >                 };
> > >                 fsnotify(mask, d_inode(dentry), FSNOTIFY_EVENT_PATH, &path,
> > >                          &dentry->d_name, NULL, 0);
> > >         }
> > > }
> > >
> > > static inline void fsnotify_mkdir(struct vfsmount *mnt, struct inode *inode,
> > >                                   struct dentry *dentry)
> > > {
> > >         audit_inode_child(inode, dentry, AUDIT_TYPE_CHILD_CREATE);
> > >
> > >         fsnotify_dirent(mnt, inode, dentry, FS_CREATE | FS_ISDIR);
> > > }
> > >
> > > static inline int vfs_mkdir(mnt, ...)
> > > {
> > >         int error = __vfs_mkdir_nonotify(...);
> > >         if (!error)
> > >                 fsnotify_mkdir(mnt, dir, dentry);
> > > }
> > >
> >
> > I've done something similar to that. I think it's a bit cleaner,
> > but we can debate on the details later.
> > Pushed POC to branch fsnotify_path_hooks.
> 
> FYI, I tried your suggested approach above for fsnotify_xattr(),
> but I think I prefer to use an explicit flavor fsnotify_xattr_mnt()
> and a wrapper fsnotify_xattr().
> Pushed WIP to fsnotify_path_hooks branch. It also contains
> some unstashed "fix" patches to tidy up the previous hooks.

What's in fsnotify_path_hooks branch looks good to me wrt xattr hooks. What
I somewhat dislike about e.g. the fsnotify_create() approach you took is
that there are separate hooks fsnotify_create() and fsnotify_create_path()
which expose what is IMO an internal fsnotify detail of what are different
event types. I'd say it is more natural (from VFS POV) to have just a
single hook and fill in as much information as available... Also from
outside view, it is unclear that e.g. vfs_create() will generate some types
of fsnotify events but not all while e.g. do_mknodat() will generate all
fsnotify events. That's why I'm not sure whether a helper like vfs_create()
in your tree is the right abstraction since generating one type of fsnotify
event while not generating another type should be a very conscious decision
of the implementor - basically if you have no other option.

That all being said, this is just an internal API so we are free to tweak
it in the future if we get things wrong. So I'm not pushing hard for my
proposal but I wanted to raise my concerns. Also I think Al Viro might have
his opinion on this so you should probably CC him when posting the series...

> I ran into another hurdle with fsnotify_xattr() -
> vfs_setxattr() is too large to duplicate a _nonotify() variant IMO.
> OTOH, I cannot lift fsnotify_xattr() up to callers without moving
> the fsnotify hook outside the inode lock.
> 
> This was not a problem with the directory entry path hooks.
> This is also not going to be a problem with fsnotify_change(),
> because notify_change() is called with inode locked.
> 
> Do you think that calling fsnotify_xattr() under inode lock is important?
> Should I refactor a helper vfs_setxattr_notify() that takes a boolean
> arg for optionally calling fsnotify_xattr()?
> Do you have another idea how to deal with that hook?

I think having the event generated outside of i_rwsem is fine. The only
reason why I think it could possibly matter is due to reordering of events
on the same inode but that order is uncertain anyway.

> With notify_change() I have a different silly problem with using the
> refactoring method - the name notify_change_nonotify() is unacceptable.
> We may consider __ATTR_NONOTIFY ia_valid flag as the method to
> use instead of refactoring in this case, just because we can and
> because it creates less clutter.
> 
> What do you think?

Hmm, notify_change() is an inconsistent name anyway (for historical
reasons). Consistent name would be vfs_setattr(). And
vfs_setattr_nonotify() would be a fine name as well. What do you think?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: open_by_handle_at() in userns
  2021-04-08 11:44         ` open_by_handle_at() in userns Amir Goldstein
@ 2021-04-08 12:55           ` Christian Brauner
  2021-04-08 14:15             ` J. Bruce Fields
  2021-04-08 15:34             ` Amir Goldstein
  0 siblings, 2 replies; 61+ messages in thread
From: Christian Brauner @ 2021-04-08 12:55 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel, Linux API, bfields

On Thu, Apr 08, 2021 at 02:44:47PM +0300, Amir Goldstein wrote:
> > One thing your patch
> >
> > commit ea31e84fda83c17b88851de399f76f5d9fc1abf4
> > Author: Amir Goldstein <amir73il@gmail.com>
> > Date:   Sat Mar 20 12:58:12 2021 +0200
> >
> >     fs: allow open by file handle inside userns
> >
> >     open_by_handle_at(2) requires CAP_DAC_READ_SEARCH in init userns,
> >     where most filesystems are mounted.
> >
> >     Relax the requirement to allow a user with CAP_DAC_READ_SEARCH
> >     inside userns to open by file handle in filesystems that were
> >     mounted inside that userns.
> >
> >     In addition, also allow open by handle in an idmapped mount, which is
> >     mapped to the userns while verifying that the returned open file path
> >     is under the root of the idmapped mount.
> >
> >     This is going to be needed for setting an fanotify mark on a filesystem
> >     and watching events inside userns.
> >
> >     Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> >
> > Requires fs/exportfs/expfs.c to be made idmapped mounts aware.
> > open_by_handle_at() uses exportfs_decode_fh() which e.g. has the
> > following and other callchains:
> >
> > exportfs_decode_fh()
> > -> exportfs_decode_fh_raw()
> >    -> lookup_one_len()
> >       -> inode_permission(mnt_userns, ...)
> >
> > That's not a huge problem though I did all these changes for the
> > overlayfs support for idmapped mounts I have in a branch from an earlier
> > version of the idmapped mounts patchset. Basically lookup_one_len(),
> > lookup_one_len_unlocked(), and lookup_positive_unlocked() need to take
> > the mnt_userns into account. I can rebase my change and send it for
> > consideration next cycle. If you can live without the
> > open_by_handle_at() support for now in this patchset (Which I think you
> > said you could.) then it's not a blocker either. Sorry for the
> > inconvenience.
> >
> 
> Christian,
> 
> I think making exportfs_decode_fh() idmapped mount aware is not
> enough, because when a dentry alias is found in dcache, none of
> those lookup functions are called.
> 
> I think we will also need something like this:
> https://github.com/amir73il/linux/commits/fhandle_userns
> 
> I factored-out a helper from nfsd_apcceptable() which implements
> the "subtree_check" nfsd logic and uses it for open_by_handle_at().
> 
> I've also added a small patch to name_to_handle_at() with a UAPI
> change that could make these changes usable by userspace nfs
> server inside userns, but I have no demo nor tests for that and frankly,
> I have little incentive to try and promote this UAPI change without
> anybody asking for it...

Ah, at first I was confused about why this would matter but it matters
because nfsd already implements a check of that sort directly in nfsd
independent of idmapped mounts:
https://github.com/amir73il/linux/commit/4bef9ff1718935b7b42afbae71cfaab7770e8436

Afaict, an nfs server can't be mounted inside of userns right now. That
is something that folks from Netflix and from Kinvolk have been
interested in enabling. They also want the ability to use idmapped
mounts + nfs. Understandable that you don't want to drive this of
course. I'll sync with them about this.

Independent of that, I thought our last understanding was that you
wouldn't need to handle open_by_handle_at() for now.

Christian

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: open_by_handle_at() in userns
  2021-04-08 12:55           ` Christian Brauner
@ 2021-04-08 14:15             ` J. Bruce Fields
  2021-04-08 15:54               ` Amir Goldstein
  2021-04-08 15:34             ` Amir Goldstein
  1 sibling, 1 reply; 61+ messages in thread
From: J. Bruce Fields @ 2021-04-08 14:15 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Amir Goldstein, Jan Kara, linux-fsdevel, Linux API

On Thu, Apr 08, 2021 at 02:55:30PM +0200, Christian Brauner wrote:
> On Thu, Apr 08, 2021 at 02:44:47PM +0300, Amir Goldstein wrote:
> > > One thing your patch
> > >
> > > commit ea31e84fda83c17b88851de399f76f5d9fc1abf4
> > > Author: Amir Goldstein <amir73il@gmail.com>
> > > Date:   Sat Mar 20 12:58:12 2021 +0200
> > >
> > >     fs: allow open by file handle inside userns
> > >
> > >     open_by_handle_at(2) requires CAP_DAC_READ_SEARCH in init userns,
> > >     where most filesystems are mounted.
> > >
> > >     Relax the requirement to allow a user with CAP_DAC_READ_SEARCH
> > >     inside userns to open by file handle in filesystems that were
> > >     mounted inside that userns.
> > >
> > >     In addition, also allow open by handle in an idmapped mount, which is
> > >     mapped to the userns while verifying that the returned open file path
> > >     is under the root of the idmapped mount.
> > >
> > >     This is going to be needed for setting an fanotify mark on a filesystem
> > >     and watching events inside userns.
> > >
> > >     Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > >
> > > Requires fs/exportfs/expfs.c to be made idmapped mounts aware.
> > > open_by_handle_at() uses exportfs_decode_fh() which e.g. has the
> > > following and other callchains:
> > >
> > > exportfs_decode_fh()
> > > -> exportfs_decode_fh_raw()
> > >    -> lookup_one_len()
> > >       -> inode_permission(mnt_userns, ...)
> > >
> > > That's not a huge problem though I did all these changes for the
> > > overlayfs support for idmapped mounts I have in a branch from an earlier
> > > version of the idmapped mounts patchset. Basically lookup_one_len(),
> > > lookup_one_len_unlocked(), and lookup_positive_unlocked() need to take
> > > the mnt_userns into account. I can rebase my change and send it for
> > > consideration next cycle. If you can live without the
> > > open_by_handle_at() support for now in this patchset (Which I think you
> > > said you could.) then it's not a blocker either. Sorry for the
> > > inconvenience.
> > >
> > 
> > Christian,
> > 
> > I think making exportfs_decode_fh() idmapped mount aware is not
> > enough, because when a dentry alias is found in dcache, none of
> > those lookup functions are called.
> > 
> > I think we will also need something like this:
> > https://github.com/amir73il/linux/commits/fhandle_userns
> > 
> > I factored-out a helper from nfsd_apcceptable() which implements
> > the "subtree_check" nfsd logic and uses it for open_by_handle_at().
> > 
> > I've also added a small patch to name_to_handle_at() with a UAPI
> > change that could make these changes usable by userspace nfs
> > server inside userns, but I have no demo nor tests for that and frankly,
> > I have little incentive to try and promote this UAPI change without
> > anybody asking for it...
> 
> Ah, at first I was confused about why this would matter but it matters
> because nfsd already implements a check of that sort directly in nfsd
> independent of idmapped mounts:
> https://github.com/amir73il/linux/commit/4bef9ff1718935b7b42afbae71cfaab7770e8436

Only in the NFSEXP_NOSUBTREECHECK case.  Taking a quick look, I think
Amir's not proposing a check like that by default, so, fine.  (I assume
problems with e.g. subtreechecking and cross-directory renames are
understood....)

> Afaict, an nfs server can't be mounted inside of userns right now. That
> is something that folks from Netflix and from Kinvolk have been
> interested in enabling. They also want the ability to use idmapped
> mounts + nfs. Understandable that you don't want to drive this of
> course. I'll sync with them about this.

I think those would both be reasonable things to do.

--b.

> Independent of that, I thought our last understanding was that you
> wouldn't need to handle open_by_handle_at() for now.
> 
> Christian

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: fsnotify path hooks
  2021-04-08 12:52                             ` Jan Kara
@ 2021-04-08 15:11                               ` Amir Goldstein
  2021-04-09 10:08                                 ` Jan Kara
  2021-04-19 16:41                                 ` Amir Goldstein
  0 siblings, 2 replies; 61+ messages in thread
From: Amir Goldstein @ 2021-04-08 15:11 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christian Brauner, linux-fsdevel, Miklos Szeredi, J. Bruce Fields

> > FYI, I tried your suggested approach above for fsnotify_xattr(),
> > but I think I prefer to use an explicit flavor fsnotify_xattr_mnt()
> > and a wrapper fsnotify_xattr().
> > Pushed WIP to fsnotify_path_hooks branch. It also contains
> > some unstashed "fix" patches to tidy up the previous hooks.
>
> What's in fsnotify_path_hooks branch looks good to me wrt xattr hooks.
> I somewhat dislike about e.g. the fsnotify_create() approach you took is
> that there are separate hooks fsnotify_create() and fsnotify_create_path()
> which expose what is IMO an internal fsnotify detail of what are different
> event types. I'd say it is more natural (from VFS POV) to have just a
> single hook and fill in as much information as available... Also from

So to be clear, you do NOT want additional wrappers like this and
you prefer to have the NULL mnt argument explicit in all callers?

static inline void fsnotify_xattr(struct dentry *dentry)
{
        fsnotify_xattr_mnt(NULL, dentry);
}

For fsnotify_xattr() it does not matter so much, but fsnotify_create/mkdir()
have quite a few callers in special filesystems.

> outside view, it is unclear that e.g. vfs_create() will generate some types
> of fsnotify events but not all while e.g. do_mknodat() will generate all
> fsnotify events. That's why I'm not sure whether a helper like vfs_create()
> in your tree is the right abstraction since generating one type of fsnotify
> event while not generating another type should be a very conscious decision
> of the implementor - basically if you have no other option.
>

I lost you here.
Are you ok with vfs_create() vs. vfs_create_nonotify()?
How do you propose to change fsnotify hooks in vfs_create()?
Did you mean to call the same fsnotify_create() hook in both vfs_create()
and do_mknodat() where the former is called with NULL mnt argument?

> That all being said, this is just an internal API so we are free to tweak
> it in the future if we get things wrong. So I'm not pushing hard for my
> proposal but I wanted to raise my concerns. Also I think Al Viro might have
> his opinion on this so you should probably CC him when posting the series...
>

Sure, just wanted to get your initial feedback before kicking the patch set
into shape, because it could have gone in many different ways, which it did ;)

> > I ran into another hurdle with fsnotify_xattr() -
> > vfs_setxattr() is too large to duplicate a _nonotify() variant IMO.
> > OTOH, I cannot lift fsnotify_xattr() up to callers without moving
> > the fsnotify hook outside the inode lock.
> >
> > This was not a problem with the directory entry path hooks.
> > This is also not going to be a problem with fsnotify_change(),
> > because notify_change() is called with inode locked.
> >
> > Do you think that calling fsnotify_xattr() under inode lock is important?
> > Should I refactor a helper vfs_setxattr_notify() that takes a boolean
> > arg for optionally calling fsnotify_xattr()?
> > Do you have another idea how to deal with that hook?
>
> I think having the event generated outside of i_rwsem is fine. The only
> reason why I think it could possibly matter is due to reordering of events
> on the same inode but that order is uncertain anyway.
>

That's what I thought. It is not much different than ACCESS/MODIFY
events.

> > With notify_change() I have a different silly problem with using the
> > refactoring method - the name notify_change_nonotify() is unacceptable.
> > We may consider __ATTR_NONOTIFY ia_valid flag as the method to
> > use instead of refactoring in this case, just because we can and
> > because it creates less clutter.
> >
> > What do you think?
>
> Hmm, notify_change() is an inconsistent name anyway (for historical
> reasons). Consistent name would be vfs_setattr(). And
> vfs_setattr_nonotify() would be a fine name as well. What do you think?
>

I won't propose to convert all notify_change() callers before getting an
explicit ACK from Al, but I could still use notify_change() as a wrapper
around vfs_setattr_nonotify() and maybe also create an alias vfs_setattr()
and use it in some places.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: open_by_handle_at() in userns
  2021-04-08 12:55           ` Christian Brauner
  2021-04-08 14:15             ` J. Bruce Fields
@ 2021-04-08 15:34             ` Amir Goldstein
  2021-04-08 15:41               ` Christian Brauner
  1 sibling, 1 reply; 61+ messages in thread
From: Amir Goldstein @ 2021-04-08 15:34 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Jan Kara, linux-fsdevel, Linux API, J. Bruce Fields

On Thu, Apr 8, 2021 at 3:55 PM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
>
> On Thu, Apr 08, 2021 at 02:44:47PM +0300, Amir Goldstein wrote:
> > > One thing your patch
> > >
> > > commit ea31e84fda83c17b88851de399f76f5d9fc1abf4
> > > Author: Amir Goldstein <amir73il@gmail.com>
> > > Date:   Sat Mar 20 12:58:12 2021 +0200
> > >
> > >     fs: allow open by file handle inside userns
> > >
> > >     open_by_handle_at(2) requires CAP_DAC_READ_SEARCH in init userns,
> > >     where most filesystems are mounted.
> > >
> > >     Relax the requirement to allow a user with CAP_DAC_READ_SEARCH
> > >     inside userns to open by file handle in filesystems that were
> > >     mounted inside that userns.
> > >
> > >     In addition, also allow open by handle in an idmapped mount, which is
> > >     mapped to the userns while verifying that the returned open file path
> > >     is under the root of the idmapped mount.
> > >
> > >     This is going to be needed for setting an fanotify mark on a filesystem
> > >     and watching events inside userns.
> > >
> > >     Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > >
> > > Requires fs/exportfs/expfs.c to be made idmapped mounts aware.
> > > open_by_handle_at() uses exportfs_decode_fh() which e.g. has the
> > > following and other callchains:
> > >
> > > exportfs_decode_fh()
> > > -> exportfs_decode_fh_raw()
> > >    -> lookup_one_len()
> > >       -> inode_permission(mnt_userns, ...)
> > >
> > > That's not a huge problem though I did all these changes for the
> > > overlayfs support for idmapped mounts I have in a branch from an earlier
> > > version of the idmapped mounts patchset. Basically lookup_one_len(),
> > > lookup_one_len_unlocked(), and lookup_positive_unlocked() need to take
> > > the mnt_userns into account. I can rebase my change and send it for
> > > consideration next cycle. If you can live without the
> > > open_by_handle_at() support for now in this patchset (Which I think you
> > > said you could.) then it's not a blocker either. Sorry for the
> > > inconvenience.
> > >
> >
> > Christian,
> >
> > I think making exportfs_decode_fh() idmapped mount aware is not
> > enough, because when a dentry alias is found in dcache, none of
> > those lookup functions are called.
> >
> > I think we will also need something like this:
> > https://github.com/amir73il/linux/commits/fhandle_userns
> >
> > I factored-out a helper from nfsd_apcceptable() which implements
> > the "subtree_check" nfsd logic and uses it for open_by_handle_at().
> >
> > I've also added a small patch to name_to_handle_at() with a UAPI
> > change that could make these changes usable by userspace nfs
> > server inside userns, but I have no demo nor tests for that and frankly,
> > I have little incentive to try and promote this UAPI change without
> > anybody asking for it...
>
> Ah, at first I was confused about why this would matter but it matters
> because nfsd already implements a check of that sort directly in nfsd
> independent of idmapped mounts:
> https://github.com/amir73il/linux/commit/4bef9ff1718935b7b42afbae71cfaab7770e8436
>

The check is needed for slightly different reasons.
nfsd "subtree_check" feature explicitly meant to forbid access in case
file was moved "out of reach", for example, out of the export path.
Note the nfsd "subtree_check" affects both file handle encoding
(i.e. "connectable") and file handle decoding (i.e. nfsd_acceptable()).

open_by_handle_at() in idmapped mount needs to verify that the ancestry
inode owners can be mapped to the userns, because we already checked
that user has CAP_DAC_READ_SEARCH in userns, but it's nicer to do
the full inode_permission() check IMO.

> Afaict, an nfs server can't be mounted inside of userns right now. That
> is something that folks from Netflix and from Kinvolk have been
> interested in enabling. They also want the ability to use idmapped
> mounts + nfs. Understandable that you don't want to drive this of
> course. I'll sync with them about this.
>
> Independent of that, I thought our last understanding was that you
> wouldn't need to handle open_by_handle_at() for now.
>

I don't need it. But I realized that the fanotify_userns demo branch
I provided you is buggy in terms of security, so I wanted to give you
(or whoever wants to pursue this) a better reference.
It was one of those things that are easier to code than to explain ;-)

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: open_by_handle_at() in userns
  2021-04-08 15:34             ` Amir Goldstein
@ 2021-04-08 15:41               ` Christian Brauner
  0 siblings, 0 replies; 61+ messages in thread
From: Christian Brauner @ 2021-04-08 15:41 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel, Linux API, J. Bruce Fields

On Thu, Apr 08, 2021 at 06:34:01PM +0300, Amir Goldstein wrote:
> On Thu, Apr 8, 2021 at 3:55 PM Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
> >
> > On Thu, Apr 08, 2021 at 02:44:47PM +0300, Amir Goldstein wrote:
> > > > One thing your patch
> > > >
> > > > commit ea31e84fda83c17b88851de399f76f5d9fc1abf4
> > > > Author: Amir Goldstein <amir73il@gmail.com>
> > > > Date:   Sat Mar 20 12:58:12 2021 +0200
> > > >
> > > >     fs: allow open by file handle inside userns
> > > >
> > > >     open_by_handle_at(2) requires CAP_DAC_READ_SEARCH in init userns,
> > > >     where most filesystems are mounted.
> > > >
> > > >     Relax the requirement to allow a user with CAP_DAC_READ_SEARCH
> > > >     inside userns to open by file handle in filesystems that were
> > > >     mounted inside that userns.
> > > >
> > > >     In addition, also allow open by handle in an idmapped mount, which is
> > > >     mapped to the userns while verifying that the returned open file path
> > > >     is under the root of the idmapped mount.
> > > >
> > > >     This is going to be needed for setting an fanotify mark on a filesystem
> > > >     and watching events inside userns.
> > > >
> > > >     Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > > >
> > > > Requires fs/exportfs/expfs.c to be made idmapped mounts aware.
> > > > open_by_handle_at() uses exportfs_decode_fh() which e.g. has the
> > > > following and other callchains:
> > > >
> > > > exportfs_decode_fh()
> > > > -> exportfs_decode_fh_raw()
> > > >    -> lookup_one_len()
> > > >       -> inode_permission(mnt_userns, ...)
> > > >
> > > > That's not a huge problem though I did all these changes for the
> > > > overlayfs support for idmapped mounts I have in a branch from an earlier
> > > > version of the idmapped mounts patchset. Basically lookup_one_len(),
> > > > lookup_one_len_unlocked(), and lookup_positive_unlocked() need to take
> > > > the mnt_userns into account. I can rebase my change and send it for
> > > > consideration next cycle. If you can live without the
> > > > open_by_handle_at() support for now in this patchset (Which I think you
> > > > said you could.) then it's not a blocker either. Sorry for the
> > > > inconvenience.
> > > >
> > >
> > > Christian,
> > >
> > > I think making exportfs_decode_fh() idmapped mount aware is not
> > > enough, because when a dentry alias is found in dcache, none of
> > > those lookup functions are called.
> > >
> > > I think we will also need something like this:
> > > https://github.com/amir73il/linux/commits/fhandle_userns
> > >
> > > I factored-out a helper from nfsd_apcceptable() which implements
> > > the "subtree_check" nfsd logic and uses it for open_by_handle_at().
> > >
> > > I've also added a small patch to name_to_handle_at() with a UAPI
> > > change that could make these changes usable by userspace nfs
> > > server inside userns, but I have no demo nor tests for that and frankly,
> > > I have little incentive to try and promote this UAPI change without
> > > anybody asking for it...
> >
> > Ah, at first I was confused about why this would matter but it matters
> > because nfsd already implements a check of that sort directly in nfsd
> > independent of idmapped mounts:
> > https://github.com/amir73il/linux/commit/4bef9ff1718935b7b42afbae71cfaab7770e8436
> >
> 
> The check is needed for slightly different reasons.
> nfsd "subtree_check" feature explicitly meant to forbid access in case
> file was moved "out of reach", for example, out of the export path.
> Note the nfsd "subtree_check" affects both file handle encoding
> (i.e. "connectable") and file handle decoding (i.e. nfsd_acceptable()).
> 
> open_by_handle_at() in idmapped mount needs to verify that the ancestry
> inode owners can be mapped to the userns, because we already checked
> that user has CAP_DAC_READ_SEARCH in userns, but it's nicer to do
> the full inode_permission() check IMO.
> 
> > Afaict, an nfs server can't be mounted inside of userns right now. That
> > is something that folks from Netflix and from Kinvolk have been
> > interested in enabling. They also want the ability to use idmapped
> > mounts + nfs. Understandable that you don't want to drive this of
> > course. I'll sync with them about this.
> >
> > Independent of that, I thought our last understanding was that you
> > wouldn't need to handle open_by_handle_at() for now.
> >
> 
> I don't need it. But I realized that the fanotify_userns demo branch
> I provided you is buggy in terms of security, so I wanted to give you
> (or whoever wants to pursue this) a better reference.
> It was one of those things that are easier to code than to explain ;-)

And I highly appreciate it! :)
Please keep those branches for reference somewhere. I've noted them down
to make sure to look at them before I do anything!

Christian

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: open_by_handle_at() in userns
  2021-04-08 14:15             ` J. Bruce Fields
@ 2021-04-08 15:54               ` Amir Goldstein
  2021-04-08 16:08                 ` J. Bruce Fields
  0 siblings, 1 reply; 61+ messages in thread
From: Amir Goldstein @ 2021-04-08 15:54 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Christian Brauner, Jan Kara, linux-fsdevel, Linux API, Miklos Szeredi

On Thu, Apr 8, 2021 at 5:15 PM J. Bruce Fields <bfields@fieldses.org> wrote:
>
> On Thu, Apr 08, 2021 at 02:55:30PM +0200, Christian Brauner wrote:
> > On Thu, Apr 08, 2021 at 02:44:47PM +0300, Amir Goldstein wrote:
> > > > One thing your patch
> > > >
> > > > commit ea31e84fda83c17b88851de399f76f5d9fc1abf4
> > > > Author: Amir Goldstein <amir73il@gmail.com>
> > > > Date:   Sat Mar 20 12:58:12 2021 +0200
> > > >
> > > >     fs: allow open by file handle inside userns
> > > >
> > > >     open_by_handle_at(2) requires CAP_DAC_READ_SEARCH in init userns,
> > > >     where most filesystems are mounted.
> > > >
> > > >     Relax the requirement to allow a user with CAP_DAC_READ_SEARCH
> > > >     inside userns to open by file handle in filesystems that were
> > > >     mounted inside that userns.
> > > >
> > > >     In addition, also allow open by handle in an idmapped mount, which is
> > > >     mapped to the userns while verifying that the returned open file path
> > > >     is under the root of the idmapped mount.
> > > >
> > > >     This is going to be needed for setting an fanotify mark on a filesystem
> > > >     and watching events inside userns.
> > > >
> > > >     Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > > >
> > > > Requires fs/exportfs/expfs.c to be made idmapped mounts aware.
> > > > open_by_handle_at() uses exportfs_decode_fh() which e.g. has the
> > > > following and other callchains:
> > > >
> > > > exportfs_decode_fh()
> > > > -> exportfs_decode_fh_raw()
> > > >    -> lookup_one_len()
> > > >       -> inode_permission(mnt_userns, ...)
> > > >
> > > > That's not a huge problem though I did all these changes for the
> > > > overlayfs support for idmapped mounts I have in a branch from an earlier
> > > > version of the idmapped mounts patchset. Basically lookup_one_len(),
> > > > lookup_one_len_unlocked(), and lookup_positive_unlocked() need to take
> > > > the mnt_userns into account. I can rebase my change and send it for
> > > > consideration next cycle. If you can live without the
> > > > open_by_handle_at() support for now in this patchset (Which I think you
> > > > said you could.) then it's not a blocker either. Sorry for the
> > > > inconvenience.
> > > >
> > >
> > > Christian,
> > >
> > > I think making exportfs_decode_fh() idmapped mount aware is not
> > > enough, because when a dentry alias is found in dcache, none of
> > > those lookup functions are called.
> > >
> > > I think we will also need something like this:
> > > https://github.com/amir73il/linux/commits/fhandle_userns
> > >
> > > I factored-out a helper from nfsd_apcceptable() which implements
> > > the "subtree_check" nfsd logic and uses it for open_by_handle_at().
> > >
> > > I've also added a small patch to name_to_handle_at() with a UAPI
> > > change that could make these changes usable by userspace nfs
> > > server inside userns, but I have no demo nor tests for that and frankly,
> > > I have little incentive to try and promote this UAPI change without
> > > anybody asking for it...
> >
> > Ah, at first I was confused about why this would matter but it matters
> > because nfsd already implements a check of that sort directly in nfsd
> > independent of idmapped mounts:
> > https://github.com/amir73il/linux/commit/4bef9ff1718935b7b42afbae71cfaab7770e8436
>
> Only in the NFSEXP_NOSUBTREECHECK case.  Taking a quick look, I think
> Amir's not proposing a check like that by default, so, fine.  (I assume
> problems with e.g. subtreechecking and cross-directory renames are
> understood....)
>

They are understood to me :) but I didn't want to get into it, because it is
complicated to explain and I wasn't sure if anyone cared...

I started working on open_by_handle_at() in userns for fanotify and fanotify
mostly reports directory fhandle, so no issues with cross-directory renames.
In any case, fanotify never reports "connectable" non-dir file handles.

Because my proposed change ALSO makes it possible to start talking about
userspace nfs server inside userns (in case anyone cares), I wanted to lay
out the path towards a userspace "subtree_check" like solution.

Another thing I am contemplating is, if and when idmapped mount support
is added to overlayfs, we can store an additional "connectable" file handle
in the overlayfs index (whose key is the non-connectable fhandle) and
fix ovl_acceptable() similar to nfsd_acceptable() and then we will be able
to mount an overlayfs inside userns with nfs_export support.

I've included a two liner patch on the fhandle_userns branch to allow
overlayfs inside userns with nfs_export support in the case that
underlying filesystem was mounted inside userns, but that is not such
an interesting use case IMO.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: open_by_handle_at() in userns
  2021-04-08 15:54               ` Amir Goldstein
@ 2021-04-08 16:08                 ` J. Bruce Fields
  2021-04-08 16:48                   ` Frank Filz
  0 siblings, 1 reply; 61+ messages in thread
From: J. Bruce Fields @ 2021-04-08 16:08 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Christian Brauner, Jan Kara, linux-fsdevel, Linux API, Miklos Szeredi

On Thu, Apr 08, 2021 at 06:54:52PM +0300, Amir Goldstein wrote:
> They are understood to me :) but I didn't want to get into it, because it is
> complicated to explain and I wasn't sure if anyone cared...
> 
> I started working on open_by_handle_at() in userns for fanotify and fanotify
> mostly reports directory fhandle, so no issues with cross-directory renames.
> In any case, fanotify never reports "connectable" non-dir file handles.
> 
> Because my proposed change ALSO makes it possible to start talking about
> userspace nfs server inside userns (in case anyone cares), I wanted to lay
> out the path towards a userspace "subtree_check" like solution.

We have to support subdirectory exports and subtree checking because we
already have, but, FWIW, if I were writing a new NFS server from
scratch, I don't think I would.  It's poorly understood, and the effort
would be better spent on more flexible storage management.

--b.

> Another thing I am contemplating is, if and when idmapped mount support
> is added to overlayfs, we can store an additional "connectable" file handle
> in the overlayfs index (whose key is the non-connectable fhandle) and
> fix ovl_acceptable() similar to nfsd_acceptable() and then we will be able
> to mount an overlayfs inside userns with nfs_export support.
> 
> I've included a two liner patch on the fhandle_userns branch to allow
> overlayfs inside userns with nfs_export support in the case that
> underlying filesystem was mounted inside userns, but that is not such
> an interesting use case IMO.
> 
> Thanks,
> Amir.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: open_by_handle_at() in userns
  2021-04-08 16:08                 ` J. Bruce Fields
@ 2021-04-08 16:48                   ` Frank Filz
  0 siblings, 0 replies; 61+ messages in thread
From: Frank Filz @ 2021-04-08 16:48 UTC (permalink / raw)
  To: 'J. Bruce Fields', 'Amir Goldstein'
  Cc: 'Christian Brauner', 'Jan Kara',
	'linux-fsdevel', 'Linux API',
	'Miklos Szeredi'

> On Thu, Apr 08, 2021 at 06:54:52PM +0300, Amir Goldstein wrote:
> > They are understood to me :) but I didn't want to get into it, because
> > it is complicated to explain and I wasn't sure if anyone cared...
> >
> > I started working on open_by_handle_at() in userns for fanotify and
> > fanotify mostly reports directory fhandle, so no issues with
cross-directory
> renames.
> > In any case, fanotify never reports "connectable" non-dir file handles.
> >
> > Because my proposed change ALSO makes it possible to start talking
> > about userspace nfs server inside userns (in case anyone cares), I
> > wanted to lay out the path towards a userspace "subtree_check" like
solution.
> 
> We have to support subdirectory exports and subtree checking because we
> already have, but, FWIW, if I were writing a new NFS server from scratch,
I don't
> think I would.  It's poorly understood, and the effort would be better
spent on
> more flexible storage management.

Yea, nfs-ganesha does not attempt to support subtree checking. It will allow
subtree exports, but it makes no assurance that they are secure. One option
though that turns out to work well for them is btrfs subvols since each
subvol has its own st_dev device ID, it's really as if it's a separate
filesystem (and nfs-ganesha treats it as such).

I'm curious about the userns solution. I'm not familiar with it, but we have
issues with running nfs-ganesha inside containers due to the privileges it
requires to properly run.

> > Another thing I am contemplating is, if and when idmapped mount
> > support is added to overlayfs, we can store an additional
> > "connectable" file handle in the overlayfs index (whose key is the
> > non-connectable fhandle) and fix ovl_acceptable() similar to
> > nfsd_acceptable() and then we will be able to mount an overlayfs inside
userns
> with nfs_export support.
> >
> > I've included a two liner patch on the fhandle_userns branch to allow
> > overlayfs inside userns with nfs_export support in the case that
> > underlying filesystem was mounted inside userns, but that is not such
> > an interesting use case IMO.
> >
> > Thanks,
> > Amir.


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: fsnotify path hooks
  2021-04-08 15:11                               ` Amir Goldstein
@ 2021-04-09 10:08                                 ` Jan Kara
  2021-04-09 10:45                                   ` Christian Brauner
                                                     ` (2 more replies)
  2021-04-19 16:41                                 ` Amir Goldstein
  1 sibling, 3 replies; 61+ messages in thread
From: Jan Kara @ 2021-04-09 10:08 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, Christian Brauner, linux-fsdevel, Miklos Szeredi,
	J. Bruce Fields

On Thu 08-04-21 18:11:31, Amir Goldstein wrote:
> > > FYI, I tried your suggested approach above for fsnotify_xattr(),
> > > but I think I prefer to use an explicit flavor fsnotify_xattr_mnt()
> > > and a wrapper fsnotify_xattr().
> > > Pushed WIP to fsnotify_path_hooks branch. It also contains
> > > some unstashed "fix" patches to tidy up the previous hooks.
> >
> > What's in fsnotify_path_hooks branch looks good to me wrt xattr hooks.
> > I somewhat dislike about e.g. the fsnotify_create() approach you took is
> > that there are separate hooks fsnotify_create() and fsnotify_create_path()
> > which expose what is IMO an internal fsnotify detail of what are different
> > event types. I'd say it is more natural (from VFS POV) to have just a
> > single hook and fill in as much information as available... Also from
> 
> So to be clear, you do NOT want additional wrappers like this and
> you prefer to have the NULL mnt argument explicit in all callers?
> 
> static inline void fsnotify_xattr(struct dentry *dentry)
> {
>         fsnotify_xattr_mnt(NULL, dentry);
> }
> 
> For fsnotify_xattr() it does not matter so much, but fsnotify_create/mkdir()
> have quite a few callers in special filesystems.

Yes, I prefer explicit NULL mnt argument to make it obvious we are going to
miss something in this case. I agree it's going to be somewhat bigger churn
but it isn't that bad (10 + 6 callers).

> > outside view, it is unclear that e.g. vfs_create() will generate some types
> > of fsnotify events but not all while e.g. do_mknodat() will generate all
> > fsnotify events. That's why I'm not sure whether a helper like vfs_create()
> > in your tree is the right abstraction since generating one type of fsnotify
> > event while not generating another type should be a very conscious decision
> > of the implementor - basically if you have no other option.
> 
> I lost you here.

Sorry, I was probably too philosophical here ;)

> Are you ok with vfs_create() vs. vfs_create_nonotify()?

I'm OK with vfs_create_nonotify(). I have a problem with vfs_create()
because it generates inode + fs events but does not generate mount events
which is just strange (although I appreciate the technical reason behind
it :).

> How do you propose to change fsnotify hooks in vfs_create()?

So either pass 'mnt' to vfs_create() - as we discussed, this may be
actually acceptable these days due to idmapped mounts work - and generate
all events there, or make vfs_create() not generate any fsnotify events and
create new vfs_create_notify() which will take the 'mnt' and generate
events. Either is fine with me and more consistent than what you currently
propose. Thoughts?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: fsnotify path hooks
  2021-04-09 10:08                                 ` Jan Kara
@ 2021-04-09 10:45                                   ` Christian Brauner
  2021-04-20  6:01                                     ` Amir Goldstein
  2021-04-09 13:22                                   ` Amir Goldstein
  2021-04-18 18:51                                   ` Amir Goldstein
  2 siblings, 1 reply; 61+ messages in thread
From: Christian Brauner @ 2021-04-09 10:45 UTC (permalink / raw)
  To: Jan Kara; +Cc: Amir Goldstein, linux-fsdevel, Miklos Szeredi, J. Bruce Fields

On Fri, Apr 09, 2021 at 12:08:11PM +0200, Jan Kara wrote:
> On Thu 08-04-21 18:11:31, Amir Goldstein wrote:
> > > > FYI, I tried your suggested approach above for fsnotify_xattr(),
> > > > but I think I prefer to use an explicit flavor fsnotify_xattr_mnt()
> > > > and a wrapper fsnotify_xattr().
> > > > Pushed WIP to fsnotify_path_hooks branch. It also contains
> > > > some unstashed "fix" patches to tidy up the previous hooks.
> > >
> > > What's in fsnotify_path_hooks branch looks good to me wrt xattr hooks.
> > > I somewhat dislike about e.g. the fsnotify_create() approach you took is
> > > that there are separate hooks fsnotify_create() and fsnotify_create_path()
> > > which expose what is IMO an internal fsnotify detail of what are different
> > > event types. I'd say it is more natural (from VFS POV) to have just a
> > > single hook and fill in as much information as available... Also from
> > 
> > So to be clear, you do NOT want additional wrappers like this and
> > you prefer to have the NULL mnt argument explicit in all callers?
> > 
> > static inline void fsnotify_xattr(struct dentry *dentry)
> > {
> >         fsnotify_xattr_mnt(NULL, dentry);
> > }
> > 
> > For fsnotify_xattr() it does not matter so much, but fsnotify_create/mkdir()
> > have quite a few callers in special filesystems.
> 
> Yes, I prefer explicit NULL mnt argument to make it obvious we are going to

I'm personally not a fan of that passing explicit NULL and one of the
first comments Al made to me about idmapped mounts was sm along the
lines of "don't pass NULL to indicate non-idmapped it's an invitation
for bugs". And I think that's actually a good point.
Maybe we should do something similar to anonymous mount namespaces. For
example, we could introduce an anonymous vfsmount that gets passed by
default. Basically similar to init_user_ns or init_net etc.

> miss something in this case. I agree it's going to be somewhat bigger churn
> but it isn't that bad (10 + 6 callers).
> 
> > > outside view, it is unclear that e.g. vfs_create() will generate some types
> > > of fsnotify events but not all while e.g. do_mknodat() will generate all
> > > fsnotify events. That's why I'm not sure whether a helper like vfs_create()
> > > in your tree is the right abstraction since generating one type of fsnotify
> > > event while not generating another type should be a very conscious decision
> > > of the implementor - basically if you have no other option.
> > 
> > I lost you here.
> 
> Sorry, I was probably too philosophical here ;)
> 
> > Are you ok with vfs_create() vs. vfs_create_nonotify()?
> 
> I'm OK with vfs_create_nonotify(). I have a problem with vfs_create()
> because it generates inode + fs events but does not generate mount events
> which is just strange (although I appreciate the technical reason behind
> it :).
> 
> > How do you propose to change fsnotify hooks in vfs_create()?
> 
> So either pass 'mnt' to vfs_create() - as we discussed, this may be
> actually acceptable these days due to idmapped mounts work - and generate

I would think passing struct vfsmount or even struct path to vfs_*
helpers is acceptable (although I know about the long-standing
resistance) as long as neither are passed down to inode methods
themselves. And that should work since the consensus seems to be to
never generate mnt fanotify events for an underlying mnt where one fs
stacks on top of another (cachefiles, ecryptfs, overlayfs, etc.).
Another argument for passing the vfsmount is that all of those stacking
filesystems already do have access to the relevant struct vfsmount
anyway (As I know from my idmapped mount port of overlayfs for example).

> all events there, or make vfs_create() not generate any fsnotify events and
> create new vfs_create_notify() which will take the 'mnt' and generate
> events. Either is fine with me and more consistent than what you currently
> propose. Thoughts?

One thing, whatever you end up passing to vfs_create() please make sure
to retrieve mnt_userns once so permission checking and object creation
line-up:

int vfs_create(struct vfsmount *mnt, struct inode *dir,
	       struct dentry *dentry, umode_t mode, bool want_excl)
{
	struct user_namespace *mnt_userns;

	mnt_userns = mnt_user_ns(mnt);

	int error = may_create(mnt_userns, dir, dentry);
	if (error)
		return error;

	if (!dir->i_op->create)
		return -EACCES;	/* shouldn't it be ENOSYS? */
	mode &= S_IALLUGO;
	mode |= S_IFREG;
	error = security_inode_create(dir, dentry, mode);
	if (error)
		return error;
	error = dir->i_op->create(mnt_userns, dir, dentry, mode, want_excl);
	if (!error)
		fsnotify_create(mnt, dir, dentry);
	return error;
}

Christian

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: fsnotify path hooks
  2021-04-09 10:08                                 ` Jan Kara
  2021-04-09 10:45                                   ` Christian Brauner
@ 2021-04-09 13:22                                   ` Amir Goldstein
  2021-04-09 14:30                                     ` Al Viro
  2021-04-18 18:51                                   ` Amir Goldstein
  2 siblings, 1 reply; 61+ messages in thread
From: Amir Goldstein @ 2021-04-09 13:22 UTC (permalink / raw)
  To: Jan Kara, Al Viro
  Cc: Christian Brauner, linux-fsdevel, Miklos Szeredi, J. Bruce Fields

On Fri, Apr 9, 2021 at 1:08 PM Jan Kara <jack@suse.cz> wrote:
>
> On Thu 08-04-21 18:11:31, Amir Goldstein wrote:
> > > > FYI, I tried your suggested approach above for fsnotify_xattr(),
> > > > but I think I prefer to use an explicit flavor fsnotify_xattr_mnt()
> > > > and a wrapper fsnotify_xattr().
> > > > Pushed WIP to fsnotify_path_hooks branch. It also contains
> > > > some unstashed "fix" patches to tidy up the previous hooks.
> > >
> > > What's in fsnotify_path_hooks branch looks good to me wrt xattr hooks.
> > > I somewhat dislike about e.g. the fsnotify_create() approach you took is
> > > that there are separate hooks fsnotify_create() and fsnotify_create_path()
> > > which expose what is IMO an internal fsnotify detail of what are different
> > > event types. I'd say it is more natural (from VFS POV) to have just a
> > > single hook and fill in as much information as available... Also from
> >
> > So to be clear, you do NOT want additional wrappers like this and
> > you prefer to have the NULL mnt argument explicit in all callers?
> >
> > static inline void fsnotify_xattr(struct dentry *dentry)
> > {
> >         fsnotify_xattr_mnt(NULL, dentry);
> > }
> >
> > For fsnotify_xattr() it does not matter so much, but fsnotify_create/mkdir()
> > have quite a few callers in special filesystems.
>
> Yes, I prefer explicit NULL mnt argument to make it obvious we are going to
> miss something in this case. I agree it's going to be somewhat bigger churn
> but it isn't that bad (10 + 6 callers).
>

I don't mind the churn so much, but for clarity of what we are missing, I'd
prefer to use fsnotify_inode_create() vs. fsnotify_path_create(), which is
exactly the difference between the two flavors - the type or args passed.
BTW, there is a precedence to that convention with security_{inode,path}
hooks, but in that case, both hooks are called.

> > > outside view, it is ctunclear that e.g. vfs_create() will generate some types
> > > of fsnotify events but not all while e.g. do_mknodat() will generate all
> > > fsnotify events. That's why I'm not sure whether a helper like vfs_create()
> > > in your tree is the right abstraction since generating one type of fsnotify
> > > event while not generating another type should be a very conscious decision
> > > of the implementor - basically if you have no other option.
> >
> > I lost you here.
>
> Sorry, I was probably too philosophical here ;)
>
> > Are you ok with vfs_create() vs. vfs_create_nonotify()?
>
> I'm OK with vfs_create_nonotify(). I have a problem with vfs_create()
> because it generates inode + fs events but does not generate mount events
> which is just strange (although I appreciate the technical reason behind
> it :).
>
> > How do you propose to change fsnotify hooks in vfs_create()?
>
> So either pass 'mnt' to vfs_create() - as we discussed, this may be
> actually acceptable these days due to idmapped mounts work - and generate
> all events there, or make vfs_create() not generate any fsnotify events and
> create new vfs_create_notify() which will take the 'mnt' and generate
> events. Either is fine with me and more consistent than what you currently
> propose. Thoughts?
>

I'm good with vfs_create_notify(). This definitely forces me to submit the
s/notify_change/vfs_setattr_notify conversion ;-)
I will take a swing at it.

But we are actually going in cycles around the solution that we all want,
but fear of rejection. It's time to try and solicit direct feedback from Al.

Al,

would you be ok with passing mnt arg to vfs_create() and friends,
so that we can pass that to fsnotify_create() (and friends) in order to
be able to report FAN_CREATE events to FAN_MARK_MOUNT listeners?

The watched mount context is relevant to syscalls/io_uring and nfsd
and less relevant to callers using private mount clones like overlayfs,
but it doesn't hurt to pass the private mount.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: fsnotify path hooks
  2021-04-09 13:22                                   ` Amir Goldstein
@ 2021-04-09 14:30                                     ` Al Viro
  2021-04-09 14:39                                       ` Christian Brauner
  2021-04-09 16:06                                       ` Amir Goldstein
  0 siblings, 2 replies; 61+ messages in thread
From: Al Viro @ 2021-04-09 14:30 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, Christian Brauner, linux-fsdevel, Miklos Szeredi,
	J. Bruce Fields

On Fri, Apr 09, 2021 at 04:22:58PM +0300, Amir Goldstein wrote:

> But we are actually going in cycles around the solution that we all want,
> but fear of rejection. It's time to try and solicit direct feedback from Al.
> 
> Al,
> 
> would you be ok with passing mnt arg to vfs_create() and friends,
> so that we can pass that to fsnotify_create() (and friends) in order to
> be able to report FAN_CREATE events to FAN_MARK_MOUNT listeners?

I would very much prefer to avoid going that way.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: fsnotify path hooks
  2021-04-09 14:30                                     ` Al Viro
@ 2021-04-09 14:39                                       ` Christian Brauner
  2021-04-09 14:46                                         ` Al Viro
  2021-04-09 16:06                                       ` Amir Goldstein
  1 sibling, 1 reply; 61+ messages in thread
From: Christian Brauner @ 2021-04-09 14:39 UTC (permalink / raw)
  To: Al Viro
  Cc: Amir Goldstein, Jan Kara, linux-fsdevel, Miklos Szeredi, J. Bruce Fields

On Fri, Apr 09, 2021 at 02:30:13PM +0000, Al Viro wrote:
> On Fri, Apr 09, 2021 at 04:22:58PM +0300, Amir Goldstein wrote:
> 
> > But we are actually going in cycles around the solution that we all want,
> > but fear of rejection. It's time to try and solicit direct feedback from Al.
> > 
> > Al,
> > 
> > would you be ok with passing mnt arg to vfs_create() and friends,
> > so that we can pass that to fsnotify_create() (and friends) in order to
> > be able to report FAN_CREATE events to FAN_MARK_MOUNT listeners?
> 
> I would very much prefer to avoid going that way.

Fwif and if I understand this correctly then this would mostly matter
for stacking filesystems or filesystems that already have access to the
relevant struct vfsmount they need to talk about anyway. It's not about
passing struct vfsmount into inode methods themselves which would be a
bad idea.  Meaning Amir's change would not require or cause vfsmounts
to be made accessible where they arent't already.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: fsnotify path hooks
  2021-04-09 14:39                                       ` Christian Brauner
@ 2021-04-09 14:46                                         ` Al Viro
  2021-04-09 15:20                                           ` Christian Brauner
  0 siblings, 1 reply; 61+ messages in thread
From: Al Viro @ 2021-04-09 14:46 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Amir Goldstein, Jan Kara, linux-fsdevel, Miklos Szeredi, J. Bruce Fields

On Fri, Apr 09, 2021 at 04:39:01PM +0200, Christian Brauner wrote:
> On Fri, Apr 09, 2021 at 02:30:13PM +0000, Al Viro wrote:
> > On Fri, Apr 09, 2021 at 04:22:58PM +0300, Amir Goldstein wrote:
> > 
> > > But we are actually going in cycles around the solution that we all want,
> > > but fear of rejection. It's time to try and solicit direct feedback from Al.
> > > 
> > > Al,
> > > 
> > > would you be ok with passing mnt arg to vfs_create() and friends,
> > > so that we can pass that to fsnotify_create() (and friends) in order to
> > > be able to report FAN_CREATE events to FAN_MARK_MOUNT listeners?
> > 
> > I would very much prefer to avoid going that way.
> 
> Fwif and if I understand this correctly then this would mostly matter
> for stacking filesystems or filesystems that already have access to the
> relevant struct vfsmount they need to talk about anyway. It's not about
> passing struct vfsmount into inode methods themselves which would be a
> bad idea.  Meaning Amir's change would not require or cause vfsmounts
> to be made accessible where they arent't already.

AFAICS, you are getting caught on mount marks semantics (or lack thereof).
It's _not_ "event happened to something visible here"; it's "event had
been requested by way that explicitly refers to that mount".

And that, IMO, pushes that crap from the places where the work is done
to the places where it had been requested.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: fsnotify path hooks
  2021-04-09 14:46                                         ` Al Viro
@ 2021-04-09 15:20                                           ` Christian Brauner
  0 siblings, 0 replies; 61+ messages in thread
From: Christian Brauner @ 2021-04-09 15:20 UTC (permalink / raw)
  To: Al Viro
  Cc: Amir Goldstein, Jan Kara, linux-fsdevel, Miklos Szeredi, J. Bruce Fields

On Fri, Apr 09, 2021 at 02:46:08PM +0000, Al Viro wrote:
> On Fri, Apr 09, 2021 at 04:39:01PM +0200, Christian Brauner wrote:
> > On Fri, Apr 09, 2021 at 02:30:13PM +0000, Al Viro wrote:
> > > On Fri, Apr 09, 2021 at 04:22:58PM +0300, Amir Goldstein wrote:
> > > 
> > > > But we are actually going in cycles around the solution that we all want,
> > > > but fear of rejection. It's time to try and solicit direct feedback from Al.
> > > > 
> > > > Al,
> > > > 
> > > > would you be ok with passing mnt arg to vfs_create() and friends,
> > > > so that we can pass that to fsnotify_create() (and friends) in order to
> > > > be able to report FAN_CREATE events to FAN_MARK_MOUNT listeners?
> > > 
> > > I would very much prefer to avoid going that way.
> > 
> > Fwif and if I understand this correctly then this would mostly matter
> > for stacking filesystems or filesystems that already have access to the
> > relevant struct vfsmount they need to talk about anyway. It's not about
> > passing struct vfsmount into inode methods themselves which would be a
> > bad idea.  Meaning Amir's change would not require or cause vfsmounts
> > to be made accessible where they arent't already.
> 
> AFAICS, you are getting caught on mount marks semantics (or lack thereof).
> It's _not_ "event happened to something visible here"; it's "event had
> been requested by way that explicitly refers to that mount".

Well yes, that's what mount marks are supposed to be afaict. I don't
think that's necessarily wrong though, at the least it can be quite
helpful.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: fsnotify path hooks
  2021-04-09 14:30                                     ` Al Viro
  2021-04-09 14:39                                       ` Christian Brauner
@ 2021-04-09 16:06                                       ` Amir Goldstein
  2021-04-09 16:09                                         ` Amir Goldstein
  1 sibling, 1 reply; 61+ messages in thread
From: Amir Goldstein @ 2021-04-09 16:06 UTC (permalink / raw)
  To: Al Viro
  Cc: Jan Kara, Christian Brauner, linux-fsdevel, Miklos Szeredi,
	J. Bruce Fields

On Fri, Apr 9, 2021 at 5:30 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Fri, Apr 09, 2021 at 04:22:58PM +0300, Amir Goldstein wrote:
>
> > But we are actually going in cycles around the solution that we all want,
> > but fear of rejection. It's time to try and solicit direct feedback from Al.
> >
> > Al,
> >
> > would you be ok with passing mnt arg to vfs_create() and friends,
> > so that we can pass that to fsnotify_create() (and friends) in order to
> > be able to report FAN_CREATE events to FAN_MARK_MOUNT listeners?
>
> I would very much prefer to avoid going that way.

OK, so I will go with the more "expressive" implementation that Jan suggested.

Callers that do NOT care about mount mark semantics will use the
wrapper:

vfs_create_notify(mnt, ...)
{
    vfs_create(...
    fsnotify_create(NULL,...
}

The two callers that do care about mount mark semantics (syscalls
and nfsd) will open code:

vfs_create(...
fsnotify_create(mnt,...

Hope that way works better for you.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: fsnotify path hooks
  2021-04-09 16:06                                       ` Amir Goldstein
@ 2021-04-09 16:09                                         ` Amir Goldstein
  0 siblings, 0 replies; 61+ messages in thread
From: Amir Goldstein @ 2021-04-09 16:09 UTC (permalink / raw)
  To: Al Viro
  Cc: Jan Kara, Christian Brauner, linux-fsdevel, Miklos Szeredi,
	J. Bruce Fields

On Fri, Apr 9, 2021 at 7:06 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Fri, Apr 9, 2021 at 5:30 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > On Fri, Apr 09, 2021 at 04:22:58PM +0300, Amir Goldstein wrote:
> >
> > > But we are actually going in cycles around the solution that we all want,
> > > but fear of rejection. It's time to try and solicit direct feedback from Al.
> > >
> > > Al,
> > >
> > > would you be ok with passing mnt arg to vfs_create() and friends,
> > > so that we can pass that to fsnotify_create() (and friends) in order to
> > > be able to report FAN_CREATE events to FAN_MARK_MOUNT listeners?
> >
> > I would very much prefer to avoid going that way.
>
> OK, so I will go with the more "expressive" implementation that Jan suggested.
>
> Callers that do NOT care about mount mark semantics will use the
> wrapper:
>
> vfs_create_notify(mnt, ...)

Oops, there was not supposed to be mnt arg in this wrapper -
that's the point.
Maybe I should name it xxx_inode_notify() to be more exact.

Thanks,
Amir.

> {
>     vfs_create(...
>     fsnotify_create(NULL,...
> }
>
> The two callers that do care about mount mark semantics (syscalls
> and nfsd) will open code:
>
> vfs_create(...
> fsnotify_create(mnt,...
>
> Hope that way works better for you.
>
> Thanks,
> Amir.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: fsnotify path hooks
  2021-04-09 10:08                                 ` Jan Kara
  2021-04-09 10:45                                   ` Christian Brauner
  2021-04-09 13:22                                   ` Amir Goldstein
@ 2021-04-18 18:51                                   ` Amir Goldstein
  2021-04-19  8:08                                     ` Amir Goldstein
  2 siblings, 1 reply; 61+ messages in thread
From: Amir Goldstein @ 2021-04-18 18:51 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christian Brauner, linux-fsdevel, Miklos Szeredi,
	J. Bruce Fields, Al Viro

On Fri, Apr 9, 2021 at 1:08 PM Jan Kara <jack@suse.cz> wrote:
>
> On Thu 08-04-21 18:11:31, Amir Goldstein wrote:
> > > > FYI, I tried your suggested approach above for fsnotify_xattr(),
> > > > but I think I prefer to use an explicit flavor fsnotify_xattr_mnt()
> > > > and a wrapper fsnotify_xattr().
> > > > Pushed WIP to fsnotify_path_hooks branch. It also contains
> > > > some unstashed "fix" patches to tidy up the previous hooks.
> > >
> > > What's in fsnotify_path_hooks branch looks good to me wrt xattr hooks.
> > > I somewhat dislike about e.g. the fsnotify_create() approach you took is
> > > that there are separate hooks fsnotify_create() and fsnotify_create_path()
> > > which expose what is IMO an internal fsnotify detail of what are different
> > > event types. I'd say it is more natural (from VFS POV) to have just a
> > > single hook and fill in as much information as available... Also from
> >
> > So to be clear, you do NOT want additional wrappers like this and
> > you prefer to have the NULL mnt argument explicit in all callers?
> >
> > static inline void fsnotify_xattr(struct dentry *dentry)
> > {
> >         fsnotify_xattr_mnt(NULL, dentry);
> > }
> >
> > For fsnotify_xattr() it does not matter so much, but fsnotify_create/mkdir()
> > have quite a few callers in special filesystems.
>
> Yes, I prefer explicit NULL mnt argument to make it obvious we are going to
> miss something in this case. I agree it's going to be somewhat bigger churn
> but it isn't that bad (10 + 6 callers).
>
> > > outside view, it is unclear that e.g. vfs_create() will generate some types
> > > of fsnotify events but not all while e.g. do_mknodat() will generate all
> > > fsnotify events. That's why I'm not sure whether a helper like vfs_create()
> > > in your tree is the right abstraction since generating one type of fsnotify
> > > event while not generating another type should be a very conscious decision
> > > of the implementor - basically if you have no other option.
> >
> > I lost you here.
>
> Sorry, I was probably too philosophical here ;)
>
> > Are you ok with vfs_create() vs. vfs_create_nonotify()?
>
> I'm OK with vfs_create_nonotify(). I have a problem with vfs_create()
> because it generates inode + fs events but does not generate mount events
> which is just strange (although I appreciate the technical reason behind
> it :).
>
> > How do you propose to change fsnotify hooks in vfs_create()?
>
> So either pass 'mnt' to vfs_create() - as we discussed, this may be
> actually acceptable these days due to idmapped mounts work - and generate
> all events there, or make vfs_create() not generate any fsnotify events and
> create new vfs_create_notify() which will take the 'mnt' and generate
> events. Either is fine with me and more consistent than what you currently
> propose. Thoughts?
>

Jan,

I started to go down the vfs_create_notify() path and I guess it's looking
not too bad (?). Pushed WIP to branch fsnotify_path_hooks-wip.

I hit another bump though.  By getting fsnotify_{unlink,rmdir}() outside of the
vfs helpers, we break the rule:

        /* Expected to be called before d_delete() */
        WARN_ON_ONCE(d_is_negative(dentry));

I'm not sure how to solve this without passing mnt into the vfs helpers.

One solution is not adding support for delete events to mount mark.
I was trying to aim for maximum flexibility, but for the use case that
Christian mentioned (injecting bind mounts into container) it is only
really necessary to support FAN_CREATE and FAN_MOVED_TO
(or FAN_MOVE_SELF) on a mount mark for observing when a path
becomes positive.

For observing when a path becomes negative, it is sufficient to watch
all the ancestor directories.

Thoughts?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: fsnotify path hooks
  2021-04-18 18:51                                   ` Amir Goldstein
@ 2021-04-19  8:08                                     ` Amir Goldstein
  0 siblings, 0 replies; 61+ messages in thread
From: Amir Goldstein @ 2021-04-19  8:08 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christian Brauner, linux-fsdevel, Miklos Szeredi,
	J. Bruce Fields, Al Viro

> > > How do you propose to change fsnotify hooks in vfs_create()?
> >
> > So either pass 'mnt' to vfs_create() - as we discussed, this may be
> > actually acceptable these days due to idmapped mounts work - and generate
> > all events there, or make vfs_create() not generate any fsnotify events and
> > create new vfs_create_notify() which will take the 'mnt' and generate
> > events. Either is fine with me and more consistent than what you currently
> > propose. Thoughts?
> >
>
> Jan,
>
> I started to go down the vfs_create_notify() path and I guess it's looking
> not too bad (?). Pushed WIP to branch fsnotify_path_hooks-wip.
>
> I hit another bump though.  By getting fsnotify_{unlink,rmdir}() outside of the
> vfs helpers, we break the rule:
>
>         /* Expected to be called before d_delete() */
>         WARN_ON_ONCE(d_is_negative(dentry));
>
> I'm not sure how to solve this without passing mnt into the vfs helpers.
>

Nevermind, I figured it out.
Created a helper fsnotify_delete() that allows a negative dentry
and takes a separate victim inode arg.
fsnotify_{unlink,rmdir}_notify() take care of holding the reference to
the victim inode.

I also realized that passing mnt to fsnotify() as path data arg
is limiting and over complicating, so passing it as a separate arg.
Packed all args to fsnotify() in struct fsnotify_event_info.

Pushed what I have so far to fsnotify_path_hooks-wip.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: fsnotify path hooks
  2021-04-08 15:11                               ` Amir Goldstein
  2021-04-09 10:08                                 ` Jan Kara
@ 2021-04-19 16:41                                 ` Amir Goldstein
  2021-04-19 17:02                                   ` Al Viro
  1 sibling, 1 reply; 61+ messages in thread
From: Amir Goldstein @ 2021-04-19 16:41 UTC (permalink / raw)
  To: Jan Kara, Al Viro
  Cc: Christian Brauner, linux-fsdevel, Miklos Szeredi, J. Bruce Fields

> > Hmm, notify_change() is an inconsistent name anyway (for historical
> > reasons). Consistent name would be vfs_setattr(). And
> > vfs_setattr_nonotify() would be a fine name as well. What do you think?
> >
>
> I won't propose to convert all notify_change() callers before getting an
> explicit ACK from Al, but I could still use notify_change() as a wrapper
> around vfs_setattr_nonotify() and maybe also create an alias vfs_setattr()
> and use it in some places.
>

Tried that. It didn't go so well.
Renamed s/notify_change/vfs_setattr and then created a wrapper
vfs_setattr_notify(mnt, ...

First issue is that there are no callers left to vfs_setattr(), so the two
phase conversion becomes: s/notify_change/vfs_setattr_change
which is mostly noise.

The second issue is that fsnotify_change() takes @ia_valid arg, which
may be modified inside notify_change().

Which brings me back to the last resort.

Al,

Would you be willing to make an exception for notify_change()
and pass mnt arg to the helper? and if so, which of the following
is the lesser evil in your opinion:

1. Optional mnt arg
--------------------------
int notify_change(struct vfsmount *mnt,
                 struct user_namespace *mnt_userns,
                 struct dentry *dentry, struct iattr *attr,
                 struct inode **delegated_inode)

@mnt is non-NULL from syscalls and nfsd and NULL from other callers.

2. path instead of dentry
--------------------------------
int notify_change(struct user_namespace *mnt_userns,
                 struct path *path, struct iattr *attr,
                 struct inode **delegated_inode)

This is symmetric with vfs_getattr().
syscalls and nfsd use the actual path.
overlayfs, ecryptfs, cachefiles compose a path from the private mount
(Christian posted patches to make ecryptfs, cachefiles mount private).

3. Mandatory mnt arg
-----------------------------
Like #1, but use some private mount instead of NULL, similar to the
mnt_userns arg.

Any of the above acceptable?

Pushed option #1 (along with rest of the work) to:
https://github.com/amir73il/linux/commits/fsnotify_path_hooks

It's only sanity tested.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: fsnotify path hooks
  2021-04-19 16:41                                 ` Amir Goldstein
@ 2021-04-19 17:02                                   ` Al Viro
  2021-04-19 22:04                                     ` Amir Goldstein
  0 siblings, 1 reply; 61+ messages in thread
From: Al Viro @ 2021-04-19 17:02 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, Christian Brauner, linux-fsdevel, Miklos Szeredi,
	J. Bruce Fields

On Mon, Apr 19, 2021 at 07:41:51PM +0300, Amir Goldstein wrote:

> Would you be willing to make an exception for notify_change()
> and pass mnt arg to the helper? and if so, which of the following
> is the lesser evil in your opinion:
> 
> 1. Optional mnt arg
> --------------------------
> int notify_change(struct vfsmount *mnt,
>                  struct user_namespace *mnt_userns,
>                  struct dentry *dentry, struct iattr *attr,
>                  struct inode **delegated_inode)
> 
> @mnt is non-NULL from syscalls and nfsd and NULL from other callers.
> 
> 2. path instead of dentry
> --------------------------------
> int notify_change(struct user_namespace *mnt_userns,
>                  struct path *path, struct iattr *attr,
>                  struct inode **delegated_inode)
> 
> This is symmetric with vfs_getattr().
> syscalls and nfsd use the actual path.
> overlayfs, ecryptfs, cachefiles compose a path from the private mount
> (Christian posted patches to make ecryptfs, cachefiles mount private).
> 
> 3. Mandatory mnt arg
> -----------------------------
> Like #1, but use some private mount instead of NULL, similar to the
> mnt_userns arg.
> 
> Any of the above acceptable?
> 
> Pushed option #1 (along with rest of the work) to:
> https://github.com/amir73il/linux/commits/fsnotify_path_hooks
> 
> It's only sanity tested.

	Out of that bunch only #2 is more or less tolerable.
HOWEVER, if we go that way, mnt_user_ns crap must go, and
I really want to see details on all callers - which mount are
you going to use in each case.

	The thing that is not going to be acceptable is
a combination of mount from one filesystem and dentry from
another.  In particular, file_remove_privs() is going to be
interesting.

	Note, BTW, that ftruncate() and file_remove_privs()
are different in that respect - the latter hits d_real()
(by way of file_dentry()), the former does not.  Which one
is correct and if both are, why are their needs different?

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: fsnotify path hooks
  2021-04-19 17:02                                   ` Al Viro
@ 2021-04-19 22:04                                     ` Amir Goldstein
  2021-04-20  7:53                                       ` Amir Goldstein
  0 siblings, 1 reply; 61+ messages in thread
From: Amir Goldstein @ 2021-04-19 22:04 UTC (permalink / raw)
  To: Al Viro, Christian Brauner, Miklos Szeredi
  Cc: Jan Kara, linux-fsdevel, J. Bruce Fields

On Mon, Apr 19, 2021 at 8:02 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Mon, Apr 19, 2021 at 07:41:51PM +0300, Amir Goldstein wrote:
>
> > Would you be willing to make an exception for notify_change()
> > and pass mnt arg to the helper? and if so, which of the following
> > is the lesser evil in your opinion:
> >
> > 1. Optional mnt arg
> > --------------------------
> > int notify_change(struct vfsmount *mnt,
> >                  struct user_namespace *mnt_userns,
> >                  struct dentry *dentry, struct iattr *attr,
> >                  struct inode **delegated_inode)
> >
> > @mnt is non-NULL from syscalls and nfsd and NULL from other callers.
> >
> > 2. path instead of dentry
> > --------------------------------
> > int notify_change(struct user_namespace *mnt_userns,
> >                  struct path *path, struct iattr *attr,
> >                  struct inode **delegated_inode)
> >
> > This is symmetric with vfs_getattr().
> > syscalls and nfsd use the actual path.
> > overlayfs, ecryptfs, cachefiles compose a path from the private mount
> > (Christian posted patches to make ecryptfs, cachefiles mount private).
> >
> > 3. Mandatory mnt arg
> > -----------------------------
> > Like #1, but use some private mount instead of NULL, similar to the
> > mnt_userns arg.
> >
> > Any of the above acceptable?
> >
> > Pushed option #1 (along with rest of the work) to:
> > https://github.com/amir73il/linux/commits/fsnotify_path_hooks
> >
> > It's only sanity tested.
>
>         Out of that bunch only #2 is more or less tolerable.

Tolerable works for me.

> HOWEVER, if we go that way, mnt_user_ns crap must go, and

Christian requested that I refrain from re-acquiring mnt_user_ns
from mnt after it had already been used for security checks,
for example:
 do_open()
    may_create_in_sticky(mnt_userns,...)
    may_open(mnt_userns,...)
    handle_truncate(mnt_userns,...
        do_truncate(mnt_userns,...
              notify_change(mnt_userns,...

Although, I am not sure exactly why.
Isn't mnt_userns supposed to be stable after the mount is
connected to the namespace?
What is the concern from re-quiring mnt_userns from path->mnt
inside notify_change()?

> I really want to see details on all callers - which mount are
> you going to use in each case.

The callers are:
cachefiles, ecryptfs, nfsd, devtmpfs,
do_truncate(), vfs_utimes() and file_remove_privs()

* cachefiles, ecryptfs, nfsd compose paths from stashed
mount like this all the time (e.g. for vfs_truncate(), vf_getattr()).

* devtmpfs has the parent path from and also uses it to
compose child path for vfs_getattr().

* vfs_utimes() and all callers of do_truncate() already have the
path, just need to pass it through to notify_change()

>
>         The thing that is not going to be acceptable is
> a combination of mount from one filesystem and dentry from
> another.  In particular, file_remove_privs() is going to be
> interesting.
>
>         Note, BTW, that ftruncate() and file_remove_privs()
> are different in that respect - the latter hits d_real()
> (by way of file_dentry()), the former does not.  Which one
> is correct and if both are, why are their needs different?

Nowadays (>= v4.19) I think the only files whose file_inode() and
f_path do not agree are the overlayfs "real.file" that can find their
way to f_mapping and to some vfs helpers and from there to
filesystem ops and to file_modified() or generic_file_write_iter()
and to file_remove_privs().

Contrary to that, overlayfs does not call any vfs truncate()
helper, it calls notify_change() directly (with a composed path).

So what should we do about file_remove_privs()?
Since I don't think we really need to care about generating an
event on file_remove_privs(), perhaps it could call __notify_change()
that does not generate an event and the rest of the callers call this wrapper:

int notify_change(struct path *path, struct iattr *attr,
                            struct inode **delegated_inode)
{
        unsigned int ia_valid;
        int error = __notify_change(mnt_user_ns(path->mnt), path->dentry,
                                                    attr, &ia_valid,
delegated_inode);

        if (!error)
                fsnotify_change(path, ia_valid);
        return error;
}

Does this make sense?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: fsnotify path hooks
  2021-04-09 10:45                                   ` Christian Brauner
@ 2021-04-20  6:01                                     ` Amir Goldstein
  2021-04-20 11:41                                       ` Christian Brauner
  0 siblings, 1 reply; 61+ messages in thread
From: Amir Goldstein @ 2021-04-20  6:01 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jan Kara, linux-fsdevel, Miklos Szeredi, J. Bruce Fields, Al Viro

> One thing, whatever you end up passing to vfs_create() please make sure
> to retrieve mnt_userns once so permission checking and object creation
> line-up:
>
> int vfs_create(struct vfsmount *mnt, struct inode *dir,
>                struct dentry *dentry, umode_t mode, bool want_excl)
> {
>         struct user_namespace *mnt_userns;
>
>         mnt_userns = mnt_user_ns(mnt);
>
>         int error = may_create(mnt_userns, dir, dentry);
>         if (error)
>                 return error;
>
>         if (!dir->i_op->create)
>                 return -EACCES; /* shouldn't it be ENOSYS? */
>         mode &= S_IALLUGO;
>         mode |= S_IFREG;
>         error = security_inode_create(dir, dentry, mode);
>         if (error)
>                 return error;
>         error = dir->i_op->create(mnt_userns, dir, dentry, mode, want_excl);
>         if (!error)
>                 fsnotify_create(mnt, dir, dentry);
>         return error;
> }
>

Christian,

What is the concern here?
Can mnt_user_ns() change under us?
I am asking because Al doesn't like both mnt_userns AND path to
be passed to do_tuncate() => notify_change()
So I will need to retrieve mnt_userns again inside notify_change()
after it had been used for security checks in do_open().
Would that be acceptable to you?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: fsnotify path hooks
  2021-04-19 22:04                                     ` Amir Goldstein
@ 2021-04-20  7:53                                       ` Amir Goldstein
  0 siblings, 0 replies; 61+ messages in thread
From: Amir Goldstein @ 2021-04-20  7:53 UTC (permalink / raw)
  To: Al Viro, Christian Brauner, Miklos Szeredi
  Cc: Jan Kara, linux-fsdevel, J. Bruce Fields

> > I really want to see details on all callers - which mount are
> > you going to use in each case.
>
> The callers are:
> cachefiles, ecryptfs, nfsd, devtmpfs,
> do_truncate(), vfs_utimes() and file_remove_privs()
>
> * cachefiles, ecryptfs, nfsd compose paths from stashed
> mount like this all the time (e.g. for vfs_truncate(), vf_getattr()).
>
> * devtmpfs has the parent path from and also uses it to
> compose child path for vfs_getattr().
>
> * vfs_utimes() and all callers of do_truncate() already have the
> path, just need to pass it through to notify_change()
>

Not sure how I forgot to mention chmod and chown, but obviously
there is no problem with path from those callers.

> >
> >         The thing that is not going to be acceptable is
> > a combination of mount from one filesystem and dentry from
> > another.  In particular, file_remove_privs() is going to be
> > interesting.
> >
> >         Note, BTW, that ftruncate() and file_remove_privs()
> > are different in that respect - the latter hits d_real()
> > (by way of file_dentry()), the former does not.  Which one
> > is correct and if both are, why are their needs different?
>
> Nowadays (>= v4.19) I think the only files whose file_inode() and
> f_path do not agree are the overlayfs "real.file" that can find their
> way to f_mapping and to some vfs helpers and from there to
> filesystem ops and to file_modified() or generic_file_write_iter()
> and to file_remove_privs().
>
> Contrary to that, overlayfs does not call any vfs truncate()
> helper, it calls notify_change() directly (with a composed path).
>
> So what should we do about file_remove_privs()?
> Since I don't think we really need to care about generating an
> event on file_remove_privs(), perhaps it could call __notify_change()
> that does not generate an event

I found more instances of notify_change() in overlayfs copy_up code
that IMO should also use __notify_change() and not report fsnotify
events for restoring attributes on files post copy up.

Like with the case of file_remove_privs(), it is enough IMO that the
listener is able to get an event on the modification event that caused
copy up or remove_privs, no need for an event on the subsystem
internal implementation details.

> and the rest of the callers call this wrapper:
>
> int notify_change(struct path *path, struct iattr *attr,
>                             struct inode **delegated_inode)
> {
>         unsigned int ia_valid;
>         int error = __notify_change(mnt_user_ns(path->mnt), path->dentry,
>                                                     attr, &ia_valid,

Braino here. There is no need to pass ia_valid to helper.
I failed to notice that notify_change updates attr->ia_valid.

Which brings me to the fun question - naming.

Would you like me to follow up on Jan's suggestion to rename:
s/__notify_change/vfs_setattr
s/notify_change/vfs_setattr_notify

I pushed this version (a.k.a. "tollerabe") to:
https://github.com/amir73il/linux/commits/fsnotify_path_hooks

It's only sanity tested.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: fsnotify path hooks
  2021-04-20  6:01                                     ` Amir Goldstein
@ 2021-04-20 11:41                                       ` Christian Brauner
  2021-04-20 11:58                                         ` Amir Goldstein
  2021-04-20 13:38                                         ` Christian Brauner
  0 siblings, 2 replies; 61+ messages in thread
From: Christian Brauner @ 2021-04-20 11:41 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, linux-fsdevel, Miklos Szeredi, J. Bruce Fields, Al Viro

On Tue, Apr 20, 2021 at 09:01:09AM +0300, Amir Goldstein wrote:
> > One thing, whatever you end up passing to vfs_create() please make sure
> > to retrieve mnt_userns once so permission checking and object creation
> > line-up:
> >
> > int vfs_create(struct vfsmount *mnt, struct inode *dir,
> >                struct dentry *dentry, umode_t mode, bool want_excl)
> > {
> >         struct user_namespace *mnt_userns;
> >
> >         mnt_userns = mnt_user_ns(mnt);
> >
> >         int error = may_create(mnt_userns, dir, dentry);
> >         if (error)
> >                 return error;
> >
> >         if (!dir->i_op->create)
> >                 return -EACCES; /* shouldn't it be ENOSYS? */
> >         mode &= S_IALLUGO;
> >         mode |= S_IFREG;
> >         error = security_inode_create(dir, dentry, mode);
> >         if (error)
> >                 return error;
> >         error = dir->i_op->create(mnt_userns, dir, dentry, mode, want_excl);
> >         if (!error)
> >                 fsnotify_create(mnt, dir, dentry);
> >         return error;
> > }
> >
> 
> Christian,
> 
> What is the concern here?
> Can mnt_user_ns() change under us?
> I am asking because Al doesn't like both mnt_userns AND path to
> be passed to do_tuncate() => notify_change()
> So I will need to retrieve mnt_userns again inside notify_change()
> after it had been used for security checks in do_open().
> Would that be acceptable to you?

The mnt_userns can't change once a mnt has been idmapped and it can
never change if the mount is visible in the filesystem already. The only
case we've been worried about and why we did it this way is when you
have a caller do fd = open_tree(OPEN_TREE_CLONE) and then share that
unattached fd with multiple processes
T1: mkdirat(fd, "dir1", 0755);
T2: mount_setattr(fd, "",); /* changes idmapping */
That case isn't a problem if the mnt_userns is only retrieved once for
permission checking and operating on the inode. I think with your
changes that still shouldn't be an issue though since the vfs_*()
helpers encompass the permission checking anyway and for notify_change,
we could simply add a mnt_userns field to struct iattr and pass it down.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: fsnotify path hooks
  2021-04-20 11:41                                       ` Christian Brauner
@ 2021-04-20 11:58                                         ` Amir Goldstein
  2021-04-20 13:38                                         ` Christian Brauner
  1 sibling, 0 replies; 61+ messages in thread
From: Amir Goldstein @ 2021-04-20 11:58 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jan Kara, linux-fsdevel, Miklos Szeredi, J. Bruce Fields, Al Viro

On Tue, Apr 20, 2021 at 2:41 PM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
>
> On Tue, Apr 20, 2021 at 09:01:09AM +0300, Amir Goldstein wrote:
> > > One thing, whatever you end up passing to vfs_create() please make sure
> > > to retrieve mnt_userns once so permission checking and object creation
> > > line-up:
> > >
> > > int vfs_create(struct vfsmount *mnt, struct inode *dir,
> > >                struct dentry *dentry, umode_t mode, bool want_excl)
> > > {
> > >         struct user_namespace *mnt_userns;
> > >
> > >         mnt_userns = mnt_user_ns(mnt);
> > >
> > >         int error = may_create(mnt_userns, dir, dentry);
> > >         if (error)
> > >                 return error;
> > >
> > >         if (!dir->i_op->create)
> > >                 return -EACCES; /* shouldn't it be ENOSYS? */
> > >         mode &= S_IALLUGO;
> > >         mode |= S_IFREG;
> > >         error = security_inode_create(dir, dentry, mode);
> > >         if (error)
> > >                 return error;
> > >         error = dir->i_op->create(mnt_userns, dir, dentry, mode, want_excl);
> > >         if (!error)
> > >                 fsnotify_create(mnt, dir, dentry);
> > >         return error;
> > > }
> > >
> >
> > Christian,
> >
> > What is the concern here?
> > Can mnt_user_ns() change under us?
> > I am asking because Al doesn't like both mnt_userns AND path to
> > be passed to do_tuncate() => notify_change()
> > So I will need to retrieve mnt_userns again inside notify_change()
> > after it had been used for security checks in do_open().
> > Would that be acceptable to you?
>
> The mnt_userns can't change once a mnt has been idmapped and it can
> never change if the mount is visible in the filesystem already. The only
> case we've been worried about and why we did it this way is when you
> have a caller do fd = open_tree(OPEN_TREE_CLONE) and then share that
> unattached fd with multiple processes
> T1: mkdirat(fd, "dir1", 0755);
> T2: mount_setattr(fd, "",); /* changes idmapping */
> That case isn't a problem if the mnt_userns is only retrieved once for
> permission checking and operating on the inode. I think with your
> changes that still shouldn't be an issue though since the vfs_*()
> helpers encompass the permission checking anyway and for notify_change,
> we could simply add a mnt_userns field to struct iattr and pass it down.

I suppose that could work for notify_change().

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: fsnotify path hooks
  2021-04-20 11:41                                       ` Christian Brauner
  2021-04-20 11:58                                         ` Amir Goldstein
@ 2021-04-20 13:38                                         ` Christian Brauner
  1 sibling, 0 replies; 61+ messages in thread
From: Christian Brauner @ 2021-04-20 13:38 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, linux-fsdevel, Miklos Szeredi, J. Bruce Fields, Al Viro

On Tue, Apr 20, 2021 at 01:41:59PM +0200, Christian Brauner wrote:
> On Tue, Apr 20, 2021 at 09:01:09AM +0300, Amir Goldstein wrote:
> > > One thing, whatever you end up passing to vfs_create() please make sure
> > > to retrieve mnt_userns once so permission checking and object creation
> > > line-up:
> > >
> > > int vfs_create(struct vfsmount *mnt, struct inode *dir,
> > >                struct dentry *dentry, umode_t mode, bool want_excl)
> > > {
> > >         struct user_namespace *mnt_userns;
> > >
> > >         mnt_userns = mnt_user_ns(mnt);
> > >
> > >         int error = may_create(mnt_userns, dir, dentry);
> > >         if (error)
> > >                 return error;
> > >
> > >         if (!dir->i_op->create)
> > >                 return -EACCES; /* shouldn't it be ENOSYS? */
> > >         mode &= S_IALLUGO;
> > >         mode |= S_IFREG;
> > >         error = security_inode_create(dir, dentry, mode);
> > >         if (error)
> > >                 return error;
> > >         error = dir->i_op->create(mnt_userns, dir, dentry, mode, want_excl);
> > >         if (!error)
> > >                 fsnotify_create(mnt, dir, dentry);
> > >         return error;
> > > }
> > >
> > 
> > Christian,
> > 
> > What is the concern here?
> > Can mnt_user_ns() change under us?
> > I am asking because Al doesn't like both mnt_userns AND path to
> > be passed to do_tuncate() => notify_change()
> > So I will need to retrieve mnt_userns again inside notify_change()
> > after it had been used for security checks in do_open().
> > Would that be acceptable to you?
> 
> The mnt_userns can't change once a mnt has been idmapped and it can
> never change if the mount is visible in the filesystem already. The only
> case we've been worried about and why we did it this way is when you
> have a caller do fd = open_tree(OPEN_TREE_CLONE) and then share that
> unattached fd with multiple processes
> T1: mkdirat(fd, "dir1", 0755);
> T2: mount_setattr(fd, "",); /* changes idmapping */
> That case isn't a problem if the mnt_userns is only retrieved once for
> permission checking and operating on the inode. I think with your
> changes that still shouldn't be an issue though since the vfs_*()
> helpers encompass the permission checking anyway and for notify_change,
> we could simply add a mnt_userns field to struct iattr and pass it down.

So I mean something along those lines. I converted a few callers to
illustrate this and I hope Al doesn't kill me. Please note that this
won't compile since I haven't converted all callers. I can give you a
full patch though if you think that is ok:

---
 drivers/base/devtmpfs.c   |  6 ++++--
 fs/attr.c                 | 16 ++++++++--------
 fs/cachefiles/interface.c | 10 ++++++++--
 fs/ecryptfs/inode.c       | 12 ++++++++++--
 fs/inode.c                |  7 ++++---
 include/linux/fs.h        |  3 ++-
 6 files changed, 36 insertions(+), 18 deletions(-)

diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
index 653c8c6ac7a7..323a549c62e3 100644
--- a/drivers/base/devtmpfs.c
+++ b/drivers/base/devtmpfs.c
@@ -221,8 +221,9 @@ static int handle_create(const char *nodename, umode_t mode, kuid_t uid,
 		newattrs.ia_uid = uid;
 		newattrs.ia_gid = gid;
 		newattrs.ia_valid = ATTR_MODE|ATTR_UID|ATTR_GID;
+		newattrs.mnt_userns = &init_user_ns;
 		inode_lock(d_inode(dentry));
-		notify_change(&init_user_ns, dentry, &newattrs, NULL);
+		notify_change(path, dentry, &newattrs, NULL);
 		inode_unlock(d_inode(dentry));
 
 		/* mark as kernel-created inode */
@@ -329,8 +330,9 @@ static int handle_remove(const char *nodename, struct device *dev)
 			newattrs.ia_mode = stat.mode & ~0777;
 			newattrs.ia_valid =
 				ATTR_UID|ATTR_GID|ATTR_MODE;
+			newattrs.mnt_userns = &init_user_ns;
 			inode_lock(d_inode(dentry));
-			notify_change(&init_user_ns, dentry, &newattrs, NULL);
+			notify_change(path, dentry, &newattrs, NULL);
 			inode_unlock(d_inode(dentry));
 			err = vfs_unlink(&init_user_ns, d_inode(parent.dentry),
 					 dentry, NULL);
diff --git a/fs/attr.c b/fs/attr.c
index 87ef39db1c34..59a9ed986e49 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -279,7 +279,7 @@ EXPORT_SYMBOL(setattr_copy);
  * permissions. On non-idmapped mounts or if permission checking is to be
  * performed on the raw inode simply passs init_user_ns.
  */
-int notify_change(struct user_namespace *mnt_userns, struct dentry *dentry,
+int notify_change(struct path *path, struct dentry *dentry,
 		  struct iattr *attr, struct inode **delegated_inode)
 {
 	struct inode *inode = dentry->d_inode;
@@ -303,8 +303,8 @@ int notify_change(struct user_namespace *mnt_userns, struct dentry *dentry,
 		if (IS_IMMUTABLE(inode))
 			return -EPERM;
 
-		if (!inode_owner_or_capable(mnt_userns, inode)) {
-			error = inode_permission(mnt_userns, inode, MAY_WRITE);
+		if (!inode_owner_or_capable(attr->mnt_userns, inode)) {
+			error = inode_permission(attr->mnt_userns, inode, MAY_WRITE);
 			if (error)
 				return error;
 		}
@@ -381,10 +381,10 @@ int notify_change(struct user_namespace *mnt_userns, struct dentry *dentry,
 	 * gids unless those uids & gids are being made valid.
 	 */
 	if (!(ia_valid & ATTR_UID) &&
-	    !uid_valid(i_uid_into_mnt(mnt_userns, inode)))
+	    !uid_valid(i_uid_into_mnt(attr->mnt_userns, inode)))
 		return -EOVERFLOW;
 	if (!(ia_valid & ATTR_GID) &&
-	    !gid_valid(i_gid_into_mnt(mnt_userns, inode)))
+	    !gid_valid(i_gid_into_mnt(attr->mnt_userns, inode)))
 		return -EOVERFLOW;
 
 	error = security_inode_setattr(dentry, attr);
@@ -395,13 +395,13 @@ int notify_change(struct user_namespace *mnt_userns, struct dentry *dentry,
 		return error;
 
 	if (inode->i_op->setattr)
-		error = inode->i_op->setattr(mnt_userns, dentry, attr);
+		error = inode->i_op->setattr(attr->mnt_userns, dentry, attr);
 	else
-		error = simple_setattr(mnt_userns, dentry, attr);
+		error = simple_setattr(attr->mnt_userns, dentry, attr);
 
 	if (!error) {
 		fsnotify_change(dentry, ia_valid);
-		ima_inode_post_setattr(mnt_userns, dentry);
+		ima_inode_post_setattr(attr->mnt_userns, dentry);
 		evm_inode_post_setattr(dentry, ia_valid);
 	}
 
diff --git a/fs/cachefiles/interface.c b/fs/cachefiles/interface.c
index 5efa6a3702c0..cede4b790694 100644
--- a/fs/cachefiles/interface.c
+++ b/fs/cachefiles/interface.c
@@ -429,6 +429,7 @@ static int cachefiles_check_consistency(struct fscache_operation *op)
  */
 static int cachefiles_attr_changed(struct fscache_object *_object)
 {
+	struct path path;
 	struct cachefiles_object *object;
 	struct cachefiles_cache *cache;
 	const struct cred *saved_cred;
@@ -460,6 +461,9 @@ static int cachefiles_attr_changed(struct fscache_object *_object)
 	if (oi_size == ni_size)
 		return 0;
 
+	path.dentry = object->backer;
+	path.mnt = cache->mnt;
+
 	cachefiles_begin_secure(cache, &saved_cred);
 	inode_lock(d_inode(object->backer));
 
@@ -470,14 +474,16 @@ static int cachefiles_attr_changed(struct fscache_object *_object)
 		_debug("discard tail %llx", oi_size);
 		newattrs.ia_valid = ATTR_SIZE;
 		newattrs.ia_size = oi_size & PAGE_MASK;
-		ret = notify_change(&init_user_ns, object->backer, &newattrs, NULL);
+		newattr.mnt_userns = &init_user_ns;
+		ret = notify_change(&path, object->backer, &newattrs, NULL);
 		if (ret < 0)
 			goto truncate_failed;
 	}
 
 	newattrs.ia_valid = ATTR_SIZE;
 	newattrs.ia_size = ni_size;
-	ret = notify_change(&init_user_ns, object->backer, &newattrs, NULL);
+	newattr.mnt_userns = &init_user_ns;
+	ret = notify_change(&path, object->backer, &newattrs, NULL);
 
 truncate_failed:
 	inode_unlock(d_inode(object->backer));
diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
index 18e9285fbb4c..8347742087e0 100644
--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -867,10 +867,14 @@ int ecryptfs_truncate(struct dentry *dentry, loff_t new_length)
 
 	rc = truncate_upper(dentry, &ia, &lower_ia);
 	if (!rc && lower_ia.ia_valid & ATTR_SIZE) {
+		struct path *lower_path;
 		struct dentry *lower_dentry = ecryptfs_dentry_to_lower(dentry);
 
+		lower_path = ecryptfs_dentry_to_lower_path(dentry);
+		lower_ia.mnt_userns = mnt_user_ns(lower_path);
+
 		inode_lock(d_inode(lower_dentry));
-		rc = notify_change(&init_user_ns, lower_dentry,
+		rc = notify_change(lower_path, lower_dentry,
 				   &lower_ia, NULL);
 		inode_unlock(d_inode(lower_dentry));
 	}
@@ -906,6 +910,7 @@ static int ecryptfs_setattr(struct user_namespace *mnt_userns,
 	struct inode *inode;
 	struct inode *lower_inode;
 	struct ecryptfs_crypt_stat *crypt_stat;
+	struct path *lower_path;
 
 	crypt_stat = &ecryptfs_inode_to_private(d_inode(dentry))->crypt_stat;
 	if (!(crypt_stat->flags & ECRYPTFS_STRUCT_INITIALIZED)) {
@@ -977,8 +982,11 @@ static int ecryptfs_setattr(struct user_namespace *mnt_userns,
 	if (lower_ia.ia_valid & (ATTR_KILL_SUID | ATTR_KILL_SGID))
 		lower_ia.ia_valid &= ~ATTR_MODE;
 
+	lower_path = ecryptfs_dentry_to_lower_path(dentry);
+	lower_ia.mnt_userns = mnt_user_ns(lower_path);
+
 	inode_lock(d_inode(lower_dentry));
-	rc = notify_change(&init_user_ns, lower_dentry, &lower_ia, NULL);
+	rc = notify_change(lower_path, lower_dentry, &lower_ia, NULL);
 	inode_unlock(d_inode(lower_dentry));
 out:
 	fsstack_copy_attr_all(inode, lower_inode);
diff --git a/fs/inode.c b/fs/inode.c
index a047ab306f9a..12a1531a6c52 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1906,17 +1906,18 @@ int dentry_needs_remove_privs(struct dentry *dentry)
 	return mask;
 }
 
-static int __remove_privs(struct user_namespace *mnt_userns,
+static int __remove_privs(struct file *file,
 			  struct dentry *dentry, int kill)
 {
 	struct iattr newattrs;
 
 	newattrs.ia_valid = ATTR_FORCE | kill;
+	newattrs.mnt_userns = file_mnt_user_ns(file);
 	/*
 	 * Note we call this on write, so notify_change will not
 	 * encounter any conflicting delegations:
 	 */
-	return notify_change(mnt_userns, dentry, &newattrs, NULL);
+	return notify_change(file->f_path, dentry, &newattrs, NULL);
 }
 
 /*
@@ -1943,7 +1944,7 @@ int file_remove_privs(struct file *file)
 	if (kill < 0)
 		return kill;
 	if (kill)
-		error = __remove_privs(file_mnt_user_ns(file), dentry, kill);
+		error = __remove_privs(file, dentry, kill);
 	if (!error)
 		inode_has_no_xattr(inode);
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ec8f3ddf4a6a..d23bcedf5f92 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -234,6 +234,7 @@ struct iattr {
 	 * check for (ia_valid & ATTR_FILE), and not for (ia_file != NULL).
 	 */
 	struct file	*ia_file;
+	struct user_namespace *mnt_userns;
 };
 
 /*
@@ -2862,7 +2863,7 @@ static inline int bmap(struct inode *inode,  sector_t *block)
 }
 #endif
 
-int notify_change(struct user_namespace *, struct dentry *,
+int notify_change(struct path *, struct dentry *,
 		  struct iattr *, struct inode **);
 int inode_permission(struct user_namespace *, struct inode *, int);
 int generic_permission(struct user_namespace *, struct inode *, int);
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

end of thread, other threads:[~2021-04-20 13:39 UTC | newest]

Thread overview: 61+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-28 15:56 [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask Amir Goldstein
2021-03-30  7:31 ` Christian Brauner
2021-03-30  9:31   ` Amir Goldstein
2021-03-30 16:24     ` Amir Goldstein
2021-03-31 10:08       ` Christian Brauner
2021-03-31 10:57         ` Amir Goldstein
2021-04-08 11:44         ` open_by_handle_at() in userns Amir Goldstein
2021-04-08 12:55           ` Christian Brauner
2021-04-08 14:15             ` J. Bruce Fields
2021-04-08 15:54               ` Amir Goldstein
2021-04-08 16:08                 ` J. Bruce Fields
2021-04-08 16:48                   ` Frank Filz
2021-04-08 15:34             ` Amir Goldstein
2021-04-08 15:41               ` Christian Brauner
2021-03-30 12:12 ` [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask Christian Brauner
2021-03-30 12:33   ` Amir Goldstein
2021-03-30 12:53     ` Christian Brauner
2021-03-30 12:55       ` Christian Brauner
2021-03-30 13:54       ` Amir Goldstein
2021-03-30 14:17         ` Christian Brauner
2021-03-30 14:56           ` Amir Goldstein
2021-03-31  9:46             ` Christian Brauner
2021-03-31 11:29               ` Amir Goldstein
2021-03-31 12:17                 ` Christian Brauner
2021-03-31 12:59                   ` Amir Goldstein
2021-03-31 12:54                 ` Jan Kara
2021-03-31 14:06                   ` Amir Goldstein
2021-03-31 20:59                     ` fsnotify path hooks Amir Goldstein
2021-04-01 10:29                       ` Jan Kara
2021-04-01 14:18                         ` Amir Goldstein
2021-04-02  8:20                           ` Amir Goldstein
2021-04-04 10:27                             ` LSM and setxattr helpers Amir Goldstein
2021-04-05 12:23                               ` Christian Brauner
2021-04-05 14:47                               ` Mimi Zohar
2021-04-06 15:43                                 ` Amir Goldstein
2021-04-05 16:18                               ` Casey Schaufler
2021-04-06  8:35                           ` fsnotify path hooks Jan Kara
2021-04-06 18:49                           ` Amir Goldstein
2021-04-08 12:52                             ` Jan Kara
2021-04-08 15:11                               ` Amir Goldstein
2021-04-09 10:08                                 ` Jan Kara
2021-04-09 10:45                                   ` Christian Brauner
2021-04-20  6:01                                     ` Amir Goldstein
2021-04-20 11:41                                       ` Christian Brauner
2021-04-20 11:58                                         ` Amir Goldstein
2021-04-20 13:38                                         ` Christian Brauner
2021-04-09 13:22                                   ` Amir Goldstein
2021-04-09 14:30                                     ` Al Viro
2021-04-09 14:39                                       ` Christian Brauner
2021-04-09 14:46                                         ` Al Viro
2021-04-09 15:20                                           ` Christian Brauner
2021-04-09 16:06                                       ` Amir Goldstein
2021-04-09 16:09                                         ` Amir Goldstein
2021-04-18 18:51                                   ` Amir Goldstein
2021-04-19  8:08                                     ` Amir Goldstein
2021-04-19 16:41                                 ` Amir Goldstein
2021-04-19 17:02                                   ` Al Viro
2021-04-19 22:04                                     ` Amir Goldstein
2021-04-20  7:53                                       ` Amir Goldstein
2021-03-31 13:06                 ` [RFC][PATCH] fanotify: allow setting FAN_CREATE in mount mark mask J. Bruce Fields
2021-03-30 12:20 ` Amir Goldstein

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).