All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/3] autofs - make disc device user accessible
@ 2017-05-10  4:18 ` Ian Kent
  0 siblings, 0 replies; 44+ messages in thread
From: Ian Kent @ 2017-05-10  4:18 UTC (permalink / raw)
  To: Al Viro
  Cc: Colin Walters, Ondrej Holy, autofs mailing list,
	Kernel Mailing List, David Howells, linux-fsdevel

The autofs miscellanous device ioctls that shouldn't require
CAP_SYS_ADMIN need to be accessible to user space applications in
order to be able to get information about autofs mounts.

The module checks capabilities so the miscelaneous device should
be fine with broad permissions.

Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Colin Walters <walters@redhat.com>
Cc: Ondrej Holy <oholy@redhat.com>
Cc: stable@vger.kernel.org
---
 fs/autofs4/dev-ioctl.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/autofs4/dev-ioctl.c b/fs/autofs4/dev-ioctl.c
index 734cbf8..9b58d6e 100644
--- a/fs/autofs4/dev-ioctl.c
+++ b/fs/autofs4/dev-ioctl.c
@@ -733,7 +733,8 @@ static const struct file_operations _dev_ioctl_fops = {
 static struct miscdevice _autofs_dev_ioctl_misc = {
 	.minor		= AUTOFS_MINOR,
 	.name		= AUTOFS_DEVICE_NAME,
-	.fops		= &_dev_ioctl_fops
+	.fops		= &_dev_ioctl_fops,
+	.mode           = 0666
 };
 
 MODULE_ALIAS_MISCDEV(AUTOFS_MINOR);

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 1/3] autofs - make disc device user accessible
@ 2017-05-10  4:18 ` Ian Kent
  0 siblings, 0 replies; 44+ messages in thread
From: Ian Kent @ 2017-05-10  4:18 UTC (permalink / raw)
  To: Al Viro
  Cc: Colin Walters, Ondrej Holy, autofs mailing list,
	Kernel Mailing List, David Howells, linux-fsdevel

The autofs miscellanous device ioctls that shouldn't require
CAP_SYS_ADMIN need to be accessible to user space applications in
order to be able to get information about autofs mounts.

The module checks capabilities so the miscelaneous device should
be fine with broad permissions.

Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Colin Walters <walters@redhat.com>
Cc: Ondrej Holy <oholy@redhat.com>
Cc: stable@vger.kernel.org
---
 fs/autofs4/dev-ioctl.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/autofs4/dev-ioctl.c b/fs/autofs4/dev-ioctl.c
index 734cbf8..9b58d6e 100644
--- a/fs/autofs4/dev-ioctl.c
+++ b/fs/autofs4/dev-ioctl.c
@@ -733,7 +733,8 @@ static const struct file_operations _dev_ioctl_fops = {
 static struct miscdevice _autofs_dev_ioctl_misc = {
 	.minor		= AUTOFS_MINOR,
 	.name		= AUTOFS_DEVICE_NAME,
-	.fops		= &_dev_ioctl_fops
+	.fops		= &_dev_ioctl_fops,
+	.mode           = 0666
 };
 
 MODULE_ALIAS_MISCDEV(AUTOFS_MINOR);

--
To unsubscribe from this list: send the line "unsubscribe autofs" in

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 2/3] autofs - make dev ioctl version and ismountpoint user accessible
  2017-05-10  4:18 ` Ian Kent
@ 2017-05-10  4:18   ` Ian Kent
  -1 siblings, 0 replies; 44+ messages in thread
From: Ian Kent @ 2017-05-10  4:18 UTC (permalink / raw)
  To: Al Viro
  Cc: Colin Walters, Ondrej Holy, autofs mailing list,
	Kernel Mailing List, David Howells, linux-fsdevel

Some of the autofs miscellaneous device ioctls need to be accessable to
user space applications without CAP_SYS_ADMIN to get information about
autofs mounts.

Start by making the autofs miscellaneous device ioctl header available
and allow applications to use version and ismountpoint ioctls.

Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Colin Walters <walters@redhat.com>
Cc: Ondrej Holy <oholy@redhat.com>
Cc: stable@vger.kernel.org
---
 fs/autofs4/dev-ioctl.c              |   12 ++++++++----
 include/uapi/linux/Kbuild           |    1 +
 include/uapi/linux/auto_dev-ioctl.h |    2 +-
 3 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/fs/autofs4/dev-ioctl.c b/fs/autofs4/dev-ioctl.c
index 9b58d6e..f8cb3f6 100644
--- a/fs/autofs4/dev-ioctl.c
+++ b/fs/autofs4/dev-ioctl.c
@@ -628,10 +628,6 @@ static int _autofs_dev_ioctl(unsigned int command,
 	ioctl_fn fn = NULL;
 	int err = 0;
 
-	/* only root can play with this */
-	if (!capable(CAP_SYS_ADMIN))
-		return -EPERM;
-
 	cmd_first = _IOC_NR(AUTOFS_DEV_IOCTL_IOC_FIRST);
 	cmd = _IOC_NR(command);
 
@@ -640,6 +636,14 @@ static int _autofs_dev_ioctl(unsigned int command,
 		return -ENOTTY;
 	}
 
+	/* Only root can use ioctls other than AUTOFS_DEV_IOCTL_VERSION_CMD
+	 * and AUTOFS_DEV_IOCTL_ISMOUNTPOINT_CMD
+	 */
+	if (cmd != AUTOFS_DEV_IOCTL_VERSION_CMD &&
+	    cmd != AUTOFS_DEV_IOCTL_ISMOUNTPOINT_CMD &&
+	    !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
 	/* Copy the parameters into kernel space. */
 	param = copy_dev_ioctl(user);
 	if (IS_ERR(param))
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index 662c592..1f22bbb 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -61,6 +61,7 @@ header-y += atm_zatm.h
 header-y += audit.h
 header-y += auto_fs4.h
 header-y += auto_fs.h
+header-y += auto_dev-ioctl.h
 header-y += auxvec.h
 header-y += ax25.h
 header-y += b1lli.h
diff --git a/include/uapi/linux/auto_dev-ioctl.h b/include/uapi/linux/auto_dev-ioctl.h
index 744b3d0..5558db8 100644
--- a/include/uapi/linux/auto_dev-ioctl.h
+++ b/include/uapi/linux/auto_dev-ioctl.h
@@ -16,7 +16,7 @@
 #define AUTOFS_DEVICE_NAME		"autofs"
 
 #define AUTOFS_DEV_IOCTL_VERSION_MAJOR	1
-#define AUTOFS_DEV_IOCTL_VERSION_MINOR	0
+#define AUTOFS_DEV_IOCTL_VERSION_MINOR	1
 
 #define AUTOFS_DEV_IOCTL_SIZE		sizeof(struct autofs_dev_ioctl)
 

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 2/3] autofs - make dev ioctl version and ismountpoint user accessible
@ 2017-05-10  4:18   ` Ian Kent
  0 siblings, 0 replies; 44+ messages in thread
From: Ian Kent @ 2017-05-10  4:18 UTC (permalink / raw)
  To: Al Viro
  Cc: Colin Walters, Ondrej Holy, autofs mailing list,
	Kernel Mailing List, David Howells, linux-fsdevel

Some of the autofs miscellaneous device ioctls need to be accessable to
user space applications without CAP_SYS_ADMIN to get information about
autofs mounts.

Start by making the autofs miscellaneous device ioctl header available
and allow applications to use version and ismountpoint ioctls.

Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Colin Walters <walters@redhat.com>
Cc: Ondrej Holy <oholy@redhat.com>
Cc: stable@vger.kernel.org
---
 fs/autofs4/dev-ioctl.c              |   12 ++++++++----
 include/uapi/linux/Kbuild           |    1 +
 include/uapi/linux/auto_dev-ioctl.h |    2 +-
 3 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/fs/autofs4/dev-ioctl.c b/fs/autofs4/dev-ioctl.c
index 9b58d6e..f8cb3f6 100644
--- a/fs/autofs4/dev-ioctl.c
+++ b/fs/autofs4/dev-ioctl.c
@@ -628,10 +628,6 @@ static int _autofs_dev_ioctl(unsigned int command,
 	ioctl_fn fn = NULL;
 	int err = 0;
 
-	/* only root can play with this */
-	if (!capable(CAP_SYS_ADMIN))
-		return -EPERM;
-
 	cmd_first = _IOC_NR(AUTOFS_DEV_IOCTL_IOC_FIRST);
 	cmd = _IOC_NR(command);
 
@@ -640,6 +636,14 @@ static int _autofs_dev_ioctl(unsigned int command,
 		return -ENOTTY;
 	}
 
+	/* Only root can use ioctls other than AUTOFS_DEV_IOCTL_VERSION_CMD
+	 * and AUTOFS_DEV_IOCTL_ISMOUNTPOINT_CMD
+	 */
+	if (cmd != AUTOFS_DEV_IOCTL_VERSION_CMD &&
+	    cmd != AUTOFS_DEV_IOCTL_ISMOUNTPOINT_CMD &&
+	    !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
 	/* Copy the parameters into kernel space. */
 	param = copy_dev_ioctl(user);
 	if (IS_ERR(param))
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index 662c592..1f22bbb 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -61,6 +61,7 @@ header-y += atm_zatm.h
 header-y += audit.h
 header-y += auto_fs4.h
 header-y += auto_fs.h
+header-y += auto_dev-ioctl.h
 header-y += auxvec.h
 header-y += ax25.h
 header-y += b1lli.h
diff --git a/include/uapi/linux/auto_dev-ioctl.h b/include/uapi/linux/auto_dev-ioctl.h
index 744b3d0..5558db8 100644
--- a/include/uapi/linux/auto_dev-ioctl.h
+++ b/include/uapi/linux/auto_dev-ioctl.h
@@ -16,7 +16,7 @@
 #define AUTOFS_DEVICE_NAME		"autofs"
 
 #define AUTOFS_DEV_IOCTL_VERSION_MAJOR	1
-#define AUTOFS_DEV_IOCTL_VERSION_MINOR	0
+#define AUTOFS_DEV_IOCTL_VERSION_MINOR	1
 
 #define AUTOFS_DEV_IOCTL_SIZE		sizeof(struct autofs_dev_ioctl)
 

--
To unsubscribe from this list: send the line "unsubscribe autofs" in

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 3/3] autofs - fix AT_NO_AUTOMOUNT not being honored
  2017-05-10  4:18 ` Ian Kent
  (?)
  (?)
@ 2017-05-10  4:18 ` Ian Kent
  2017-05-12 12:49     ` Colin Walters
  2017-11-21  1:53   ` NeilBrown
  -1 siblings, 2 replies; 44+ messages in thread
From: Ian Kent @ 2017-05-10  4:18 UTC (permalink / raw)
  To: Al Viro
  Cc: Colin Walters, Ondrej Holy, autofs mailing list,
	Kernel Mailing List, David Howells, linux-fsdevel

The fstatat(2) and statx() calls can pass the flag AT_NO_AUTOMOUNT
which is meant to clear the LOOKUP_AUTOMOUNT flag and prevent triggering
of an automount by the call. But this flag is unconditionally cleared
for all stat family system calls except statx().

stat family system calls have always triggered mount requests for the
negative dentry case in follow_automount() which is intended but prevents
the fstatat(2) and statx() AT_NO_AUTOMOUNT case from being handled.

In order to handle the AT_NO_AUTOMOUNT for both system calls the
negative dentry case in follow_automount() needs to be changed to
return ENOENT when the LOOKUP_AUTOMOUNT flag is clear (and the other
required flags are clear).

AFAICT this change doesn't have any noticable side effects and may,
in some use cases (although I didn't see it in testing) prevent
unnecessary callbacks to the automount daemon.

It's also possible that a stat family call has been made with a
path that is in the process of being mounted by some other process.
But stat family calls should return the automount state of the path
as it is "now" so it shouldn't wait for mount completion.

This is the same semantic as the positive dentry case already
handled.

Signed-off-by: Ian Kent <raven@themaw.net>
Cc: David Howells <dhowells@redhat.com>
Cc: Colin Walters <walters@redhat.com>
Cc: Ondrej Holy <oholy@redhat.com>
Cc: stable@vger.kernel.org
---
 fs/namei.c         |   15 ++++++++++++---
 include/linux/fs.h |    3 +--
 2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 7286f87..cd74838 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1129,9 +1129,18 @@ static int follow_automount(struct path *path, struct nameidata *nd,
 	 * of the daemon to instantiate them before they can be used.
 	 */
 	if (!(nd->flags & (LOOKUP_PARENT | LOOKUP_DIRECTORY |
-			   LOOKUP_OPEN | LOOKUP_CREATE | LOOKUP_AUTOMOUNT)) &&
-	    path->dentry->d_inode)
-		return -EISDIR;
+			   LOOKUP_OPEN | LOOKUP_CREATE |
+			   LOOKUP_AUTOMOUNT))) {
+		/* Positive dentry that isn't meant to trigger an
+		 * automount, EISDIR will allow it to be used,
+		 * otherwise there's no mount here "now" so return
+		 * ENOENT.
+		 */
+		if (path->dentry->d_inode)
+			return -EISDIR;
+		else
+			return -ENOENT;
+	}
 
 	if (path->dentry->d_sb->s_user_ns != &init_user_ns)
 		return -EACCES;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 26488b4..be09684 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2935,8 +2935,7 @@ static inline int vfs_lstat(const char __user *name, struct kstat *stat)
 static inline int vfs_fstatat(int dfd, const char __user *filename,
 			      struct kstat *stat, int flags)
 {
-	return vfs_statx(dfd, filename, flags | AT_NO_AUTOMOUNT,
-			 stat, STATX_BASIC_STATS);
+	return vfs_statx(dfd, filename, flags, stat, STATX_BASIC_STATS);
 }
 static inline int vfs_fstat(int fd, struct kstat *stat)
 {

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] autofs - fix AT_NO_AUTOMOUNT not being honored
  2017-05-10  4:18 ` [PATCH 3/3] autofs - fix AT_NO_AUTOMOUNT not being honored Ian Kent
@ 2017-05-12 12:49     ` Colin Walters
  2017-11-21  1:53   ` NeilBrown
  1 sibling, 0 replies; 44+ messages in thread
From: Colin Walters @ 2017-05-12 12:49 UTC (permalink / raw)
  To: Ian Kent, Al Viro
  Cc: Colin Walters, Ondrej Holy, autofs mailing list,
	Kernel Mailing List, David Howells, linux-fsdevel

On Wed, May 10, 2017, at 12:18 AM, Ian Kent wrote:
> The fstatat(2) and statx() calls can pass the flag AT_NO_AUTOMOUNT
> which is meant to clear the LOOKUP_AUTOMOUNT flag and prevent triggering
> of an automount by the call. But this flag is unconditionally cleared
> for all stat family system calls except statx().
> 

Here's the GLib patch to make use of this:
https://bugzilla.gnome.org/show_bug.cgi?id=782554

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] autofs - fix AT_NO_AUTOMOUNT not being honored
@ 2017-05-12 12:49     ` Colin Walters
  0 siblings, 0 replies; 44+ messages in thread
From: Colin Walters @ 2017-05-12 12:49 UTC (permalink / raw)
  To: Ian Kent, Al Viro
  Cc: Colin Walters, Ondrej Holy, autofs mailing list,
	Kernel Mailing List, David Howells, linux-fsdevel

On Wed, May 10, 2017, at 12:18 AM, Ian Kent wrote:
> The fstatat(2) and statx() calls can pass the flag AT_NO_AUTOMOUNT
> which is meant to clear the LOOKUP_AUTOMOUNT flag and prevent triggering
> of an automount by the call. But this flag is unconditionally cleared
> for all stat family system calls except statx().
> 

Here's the GLib patch to make use of this:
https://bugzilla.gnome.org/show_bug.cgi?id=782554

--
To unsubscribe from this list: send the line "unsubscribe autofs" in

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] autofs - fix AT_NO_AUTOMOUNT not being honored
  2017-05-10  4:18 ` [PATCH 3/3] autofs - fix AT_NO_AUTOMOUNT not being honored Ian Kent
  2017-05-12 12:49     ` Colin Walters
@ 2017-11-21  1:53   ` NeilBrown
  2017-11-22  4:28       ` Ian Kent
  1 sibling, 1 reply; 44+ messages in thread
From: NeilBrown @ 2017-11-21  1:53 UTC (permalink / raw)
  To: Ian Kent, Al Viro
  Cc: Colin Walters, Ondrej Holy, autofs mailing list,
	Kernel Mailing List, David Howells, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 4336 bytes --]

On Wed, May 10 2017, Ian Kent wrote:

> The fstatat(2) and statx() calls can pass the flag AT_NO_AUTOMOUNT
> which is meant to clear the LOOKUP_AUTOMOUNT flag and prevent triggering
> of an automount by the call. But this flag is unconditionally cleared
> for all stat family system calls except statx().
>
> stat family system calls have always triggered mount requests for the
> negative dentry case in follow_automount() which is intended but prevents
> the fstatat(2) and statx() AT_NO_AUTOMOUNT case from being handled.
>
> In order to handle the AT_NO_AUTOMOUNT for both system calls the
> negative dentry case in follow_automount() needs to be changed to
> return ENOENT when the LOOKUP_AUTOMOUNT flag is clear (and the other
> required flags are clear).
>
> AFAICT this change doesn't have any noticable side effects and may,
> in some use cases (although I didn't see it in testing) prevent
> unnecessary callbacks to the automount daemon.

Actually, this patch does have a noticeable side effect.

Previously, if /home were an indirect mount point so that /home/neilb
would mount my home directory, "ls -l /home/neilb" would always work.
Now it doesn't.

This happens because "ls" calls 'lstat' on the path and when that fails
with "ENOENT", it doesn't bother trying to open.

An alternate approach to the problem that this patch addresses is to
*not* change follow_automount() but instead change the automount daemon
and possibly autofs.

When a notification of access for an indirect mount point is received,
it would firstly just create the directory - not triggering a mount.
If another notification is then received (after the directory has been
created), it then triggers the mount.

I suspect this might need changes to autofs to avoid races when two
threads call lstat() on the same path at the same time.  We would need
to ensure that automount didn't see this as two requests.... though
maybe it already has enough locking.

Does that seem reasonable?  Should we revert this patch (as a
regression) and do something in automount instead?

Thanks,
NeilBrown


>
> It's also possible that a stat family call has been made with a
> path that is in the process of being mounted by some other process.
> But stat family calls should return the automount state of the path
> as it is "now" so it shouldn't wait for mount completion.
>
> This is the same semantic as the positive dentry case already
> handled.
>
> Signed-off-by: Ian Kent <raven@themaw.net>
> Cc: David Howells <dhowells@redhat.com>
> Cc: Colin Walters <walters@redhat.com>
> Cc: Ondrej Holy <oholy@redhat.com>
> Cc: stable@vger.kernel.org
> ---
>  fs/namei.c         |   15 ++++++++++++---
>  include/linux/fs.h |    3 +--
>  2 files changed, 13 insertions(+), 5 deletions(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index 7286f87..cd74838 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -1129,9 +1129,18 @@ static int follow_automount(struct path *path, struct nameidata *nd,
>  	 * of the daemon to instantiate them before they can be used.
>  	 */
>  	if (!(nd->flags & (LOOKUP_PARENT | LOOKUP_DIRECTORY |
> -			   LOOKUP_OPEN | LOOKUP_CREATE | LOOKUP_AUTOMOUNT)) &&
> -	    path->dentry->d_inode)
> -		return -EISDIR;
> +			   LOOKUP_OPEN | LOOKUP_CREATE |
> +			   LOOKUP_AUTOMOUNT))) {
> +		/* Positive dentry that isn't meant to trigger an
> +		 * automount, EISDIR will allow it to be used,
> +		 * otherwise there's no mount here "now" so return
> +		 * ENOENT.
> +		 */
> +		if (path->dentry->d_inode)
> +			return -EISDIR;
> +		else
> +			return -ENOENT;
> +	}
>  
>  	if (path->dentry->d_sb->s_user_ns != &init_user_ns)
>  		return -EACCES;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 26488b4..be09684 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2935,8 +2935,7 @@ static inline int vfs_lstat(const char __user *name, struct kstat *stat)
>  static inline int vfs_fstatat(int dfd, const char __user *filename,
>  			      struct kstat *stat, int flags)
>  {
> -	return vfs_statx(dfd, filename, flags | AT_NO_AUTOMOUNT,
> -			 stat, STATX_BASIC_STATS);
> +	return vfs_statx(dfd, filename, flags, stat, STATX_BASIC_STATS);
>  }
>  static inline int vfs_fstat(int fd, struct kstat *stat)
>  {

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] autofs - fix AT_NO_AUTOMOUNT not being honored
  2017-11-21  1:53   ` NeilBrown
@ 2017-11-22  4:28       ` Ian Kent
  0 siblings, 0 replies; 44+ messages in thread
From: Ian Kent @ 2017-11-22  4:28 UTC (permalink / raw)
  To: NeilBrown, Al Viro
  Cc: Colin Walters, Ondrej Holy, autofs mailing list,
	Kernel Mailing List, David Howells, linux-fsdevel

On 21/11/17 09:53, NeilBrown wrote:
> On Wed, May 10 2017, Ian Kent wrote:
> 
>> The fstatat(2) and statx() calls can pass the flag AT_NO_AUTOMOUNT
>> which is meant to clear the LOOKUP_AUTOMOUNT flag and prevent triggering
>> of an automount by the call. But this flag is unconditionally cleared
>> for all stat family system calls except statx().
>>
>> stat family system calls have always triggered mount requests for the
>> negative dentry case in follow_automount() which is intended but prevents
>> the fstatat(2) and statx() AT_NO_AUTOMOUNT case from being handled.
>>
>> In order to handle the AT_NO_AUTOMOUNT for both system calls the
>> negative dentry case in follow_automount() needs to be changed to
>> return ENOENT when the LOOKUP_AUTOMOUNT flag is clear (and the other
>> required flags are clear).
>>
>> AFAICT this change doesn't have any noticable side effects and may,
>> in some use cases (although I didn't see it in testing) prevent
>> unnecessary callbacks to the automount daemon.
> 
> Actually, this patch does have a noticeable side effect.

That's right.

> 
> Previously, if /home were an indirect mount point so that /home/neilb
> would mount my home directory, "ls -l /home/neilb" would always work.
> Now it doesn't.

An this is correct too.

The previous policy was that a ->lookup() would always cause a mount to
occur. It's that policy that prevents the AT_NO_AUTOMOUNT flag from being
able to work consistently.

If you set the indirect mount "browse" option that will cause the mount
point directories for each of the map keys to be pre-created. So a
directory listing will show the mount points and an attempt to access
anything within the mount point directory will cause it to be mounted.

There is still the problem of not knowing map keys when the wildcard
map entry is used.

Still, stat family systems calls have always had similar problems that
prevent consistent behavior.

> 
> This happens because "ls" calls 'lstat' on the path and when that fails
> with "ENOENT", it doesn't bother trying to open.

I know, I believe the ENOENT is appropriate because there is no mount
and no directory at the time this happens.

> 
> An alternate approach to the problem that this patch addresses is to
> *not* change follow_automount() but instead change the automount daemon
> and possibly autofs.

Perhaps, but the daemon probably doesn't have enough information to
make decisions about this so there would need to be something coming
from the kernel too.

> 
> When a notification of access for an indirect mount point is received,
> it would firstly just create the directory - not triggering a mount.
> If another notification is then received (after the directory has been
> created), it then triggers the mount.

Not sure, perhaps.

But I don't think that will work, I've had many problems with this type
behavior due to bugs and I think the the same sort of problems would be
encountered.

The indirect mount "browse" option which behaves very much like what your
suggesting is the internal program default, and has been since the initial
v5 release. Historically it is disabled on install to maintain backward
compatibility with the original Linux autofs (which is also the reason for
always triggering an automount on ->lookup()).

Perhaps now is the time for that to change. 

> 
> I suspect this might need changes to autofs to avoid races when two
> threads call lstat() on the same path at the same time.  We would need
> to ensure that automount didn't see this as two requests.... though
> maybe it already has enough locking.
> 
> Does that seem reasonable?  Should we revert this patch (as a
> regression) and do something in automount instead?

Can you check out the "browse" option behavior before we talk further
about this please.

The problem in user space now is that there is no way to control
whether an indirect mount that uses the "nobrowse" option is triggered
or not (using fstatat() or statx()) regardless of the flags used and it's
essential the AT_NO_AUTOMOUNT flag work as expected to have user space
take more care in not triggering automounts.

> 
> Thanks,
> NeilBrown
> 
> 
>>
>> It's also possible that a stat family call has been made with a
>> path that is in the process of being mounted by some other process.
>> But stat family calls should return the automount state of the path
>> as it is "now" so it shouldn't wait for mount completion.
>>
>> This is the same semantic as the positive dentry case already
>> handled.
>>
>> Signed-off-by: Ian Kent <raven@themaw.net>
>> Cc: David Howells <dhowells@redhat.com>
>> Cc: Colin Walters <walters@redhat.com>
>> Cc: Ondrej Holy <oholy@redhat.com>
>> Cc: stable@vger.kernel.org
>> ---
>>  fs/namei.c         |   15 ++++++++++++---
>>  include/linux/fs.h |    3 +--
>>  2 files changed, 13 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/namei.c b/fs/namei.c
>> index 7286f87..cd74838 100644
>> --- a/fs/namei.c
>> +++ b/fs/namei.c
>> @@ -1129,9 +1129,18 @@ static int follow_automount(struct path *path, struct nameidata *nd,
>>  	 * of the daemon to instantiate them before they can be used.
>>  	 */
>>  	if (!(nd->flags & (LOOKUP_PARENT | LOOKUP_DIRECTORY |
>> -			   LOOKUP_OPEN | LOOKUP_CREATE | LOOKUP_AUTOMOUNT)) &&
>> -	    path->dentry->d_inode)
>> -		return -EISDIR;
>> +			   LOOKUP_OPEN | LOOKUP_CREATE |
>> +			   LOOKUP_AUTOMOUNT))) {
>> +		/* Positive dentry that isn't meant to trigger an
>> +		 * automount, EISDIR will allow it to be used,
>> +		 * otherwise there's no mount here "now" so return
>> +		 * ENOENT.
>> +		 */
>> +		if (path->dentry->d_inode)
>> +			return -EISDIR;
>> +		else
>> +			return -ENOENT;
>> +	}
>>  
>>  	if (path->dentry->d_sb->s_user_ns != &init_user_ns)
>>  		return -EACCES;
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index 26488b4..be09684 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -2935,8 +2935,7 @@ static inline int vfs_lstat(const char __user *name, struct kstat *stat)
>>  static inline int vfs_fstatat(int dfd, const char __user *filename,
>>  			      struct kstat *stat, int flags)
>>  {
>> -	return vfs_statx(dfd, filename, flags | AT_NO_AUTOMOUNT,
>> -			 stat, STATX_BASIC_STATS);
>> +	return vfs_statx(dfd, filename, flags, stat, STATX_BASIC_STATS);
>>  }
>>  static inline int vfs_fstat(int fd, struct kstat *stat)
>>  {

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] autofs - fix AT_NO_AUTOMOUNT not being honored
@ 2017-11-22  4:28       ` Ian Kent
  0 siblings, 0 replies; 44+ messages in thread
From: Ian Kent @ 2017-11-22  4:28 UTC (permalink / raw)
  To: NeilBrown, Al Viro
  Cc: Colin Walters, Ondrej Holy, autofs mailing list,
	Kernel Mailing List, David Howells, linux-fsdevel

On 21/11/17 09:53, NeilBrown wrote:
> On Wed, May 10 2017, Ian Kent wrote:
> 
>> The fstatat(2) and statx() calls can pass the flag AT_NO_AUTOMOUNT
>> which is meant to clear the LOOKUP_AUTOMOUNT flag and prevent triggering
>> of an automount by the call. But this flag is unconditionally cleared
>> for all stat family system calls except statx().
>>
>> stat family system calls have always triggered mount requests for the
>> negative dentry case in follow_automount() which is intended but prevents
>> the fstatat(2) and statx() AT_NO_AUTOMOUNT case from being handled.
>>
>> In order to handle the AT_NO_AUTOMOUNT for both system calls the
>> negative dentry case in follow_automount() needs to be changed to
>> return ENOENT when the LOOKUP_AUTOMOUNT flag is clear (and the other
>> required flags are clear).
>>
>> AFAICT this change doesn't have any noticable side effects and may,
>> in some use cases (although I didn't see it in testing) prevent
>> unnecessary callbacks to the automount daemon.
> 
> Actually, this patch does have a noticeable side effect.

That's right.

> 
> Previously, if /home were an indirect mount point so that /home/neilb
> would mount my home directory, "ls -l /home/neilb" would always work.
> Now it doesn't.

An this is correct too.

The previous policy was that a ->lookup() would always cause a mount to
occur. It's that policy that prevents the AT_NO_AUTOMOUNT flag from being
able to work consistently.

If you set the indirect mount "browse" option that will cause the mount
point directories for each of the map keys to be pre-created. So a
directory listing will show the mount points and an attempt to access
anything within the mount point directory will cause it to be mounted.

There is still the problem of not knowing map keys when the wildcard
map entry is used.

Still, stat family systems calls have always had similar problems that
prevent consistent behavior.

> 
> This happens because "ls" calls 'lstat' on the path and when that fails
> with "ENOENT", it doesn't bother trying to open.

I know, I believe the ENOENT is appropriate because there is no mount
and no directory at the time this happens.

> 
> An alternate approach to the problem that this patch addresses is to
> *not* change follow_automount() but instead change the automount daemon
> and possibly autofs.

Perhaps, but the daemon probably doesn't have enough information to
make decisions about this so there would need to be something coming
from the kernel too.

> 
> When a notification of access for an indirect mount point is received,
> it would firstly just create the directory - not triggering a mount.
> If another notification is then received (after the directory has been
> created), it then triggers the mount.

Not sure, perhaps.

But I don't think that will work, I've had many problems with this type
behavior due to bugs and I think the the same sort of problems would be
encountered.

The indirect mount "browse" option which behaves very much like what your
suggesting is the internal program default, and has been since the initial
v5 release. Historically it is disabled on install to maintain backward
compatibility with the original Linux autofs (which is also the reason for
always triggering an automount on ->lookup()).

Perhaps now is the time for that to change. 

> 
> I suspect this might need changes to autofs to avoid races when two
> threads call lstat() on the same path at the same time.  We would need
> to ensure that automount didn't see this as two requests.... though
> maybe it already has enough locking.
> 
> Does that seem reasonable?  Should we revert this patch (as a
> regression) and do something in automount instead?

Can you check out the "browse" option behavior before we talk further
about this please.

The problem in user space now is that there is no way to control
whether an indirect mount that uses the "nobrowse" option is triggered
or not (using fstatat() or statx()) regardless of the flags used and it's
essential the AT_NO_AUTOMOUNT flag work as expected to have user space
take more care in not triggering automounts.

> 
> Thanks,
> NeilBrown
> 
> 
>>
>> It's also possible that a stat family call has been made with a
>> path that is in the process of being mounted by some other process.
>> But stat family calls should return the automount state of the path
>> as it is "now" so it shouldn't wait for mount completion.
>>
>> This is the same semantic as the positive dentry case already
>> handled.
>>
>> Signed-off-by: Ian Kent <raven@themaw.net>
>> Cc: David Howells <dhowells@redhat.com>
>> Cc: Colin Walters <walters@redhat.com>
>> Cc: Ondrej Holy <oholy@redhat.com>
>> Cc: stable@vger.kernel.org
>> ---
>>  fs/namei.c         |   15 ++++++++++++---
>>  include/linux/fs.h |    3 +--
>>  2 files changed, 13 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/namei.c b/fs/namei.c
>> index 7286f87..cd74838 100644
>> --- a/fs/namei.c
>> +++ b/fs/namei.c
>> @@ -1129,9 +1129,18 @@ static int follow_automount(struct path *path, struct nameidata *nd,
>>  	 * of the daemon to instantiate them before they can be used.
>>  	 */
>>  	if (!(nd->flags & (LOOKUP_PARENT | LOOKUP_DIRECTORY |
>> -			   LOOKUP_OPEN | LOOKUP_CREATE | LOOKUP_AUTOMOUNT)) &&
>> -	    path->dentry->d_inode)
>> -		return -EISDIR;
>> +			   LOOKUP_OPEN | LOOKUP_CREATE |
>> +			   LOOKUP_AUTOMOUNT))) {
>> +		/* Positive dentry that isn't meant to trigger an
>> +		 * automount, EISDIR will allow it to be used,
>> +		 * otherwise there's no mount here "now" so return
>> +		 * ENOENT.
>> +		 */
>> +		if (path->dentry->d_inode)
>> +			return -EISDIR;
>> +		else
>> +			return -ENOENT;
>> +	}
>>  
>>  	if (path->dentry->d_sb->s_user_ns != &init_user_ns)
>>  		return -EACCES;
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index 26488b4..be09684 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -2935,8 +2935,7 @@ static inline int vfs_lstat(const char __user *name, struct kstat *stat)
>>  static inline int vfs_fstatat(int dfd, const char __user *filename,
>>  			      struct kstat *stat, int flags)
>>  {
>> -	return vfs_statx(dfd, filename, flags | AT_NO_AUTOMOUNT,
>> -			 stat, STATX_BASIC_STATS);
>> +	return vfs_statx(dfd, filename, flags, stat, STATX_BASIC_STATS);
>>  }
>>  static inline int vfs_fstat(int fd, struct kstat *stat)
>>  {

--
To unsubscribe from this list: send the line "unsubscribe autofs" in

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] autofs - fix AT_NO_AUTOMOUNT not being honored
  2017-11-22  4:28       ` Ian Kent
  (?)
@ 2017-11-23  0:36       ` Ian Kent
  2017-11-23  2:21         ` NeilBrown
  2017-11-27 16:01           ` Mike Marion
  -1 siblings, 2 replies; 44+ messages in thread
From: Ian Kent @ 2017-11-23  0:36 UTC (permalink / raw)
  To: NeilBrown, Al Viro
  Cc: Colin Walters, Ondrej Holy, autofs mailing list,
	Kernel Mailing List, David Howells, linux-fsdevel

On 22/11/17 12:28, Ian Kent wrote:
> On 21/11/17 09:53, NeilBrown wrote:
>> On Wed, May 10 2017, Ian Kent wrote:
>>
>>> The fstatat(2) and statx() calls can pass the flag AT_NO_AUTOMOUNT
>>> which is meant to clear the LOOKUP_AUTOMOUNT flag and prevent triggering
>>> of an automount by the call. But this flag is unconditionally cleared
>>> for all stat family system calls except statx().
>>>
>>> stat family system calls have always triggered mount requests for the
>>> negative dentry case in follow_automount() which is intended but prevents
>>> the fstatat(2) and statx() AT_NO_AUTOMOUNT case from being handled.
>>>
>>> In order to handle the AT_NO_AUTOMOUNT for both system calls the
>>> negative dentry case in follow_automount() needs to be changed to
>>> return ENOENT when the LOOKUP_AUTOMOUNT flag is clear (and the other
>>> required flags are clear).
>>>
>>> AFAICT this change doesn't have any noticable side effects and may,
>>> in some use cases (although I didn't see it in testing) prevent
>>> unnecessary callbacks to the automount daemon.
>>
>> Actually, this patch does have a noticeable side effect.
> 
> That's right.
> 
>>
>> Previously, if /home were an indirect mount point so that /home/neilb
>> would mount my home directory, "ls -l /home/neilb" would always work.
>> Now it doesn't.
> 
> An this is correct too.
> 
> The previous policy was that a ->lookup() would always cause a mount to
> occur. It's that policy that prevents the AT_NO_AUTOMOUNT flag from being
> able to work consistently.
> 
> If you set the indirect mount "browse" option that will cause the mount
> point directories for each of the map keys to be pre-created. So a
> directory listing will show the mount points and an attempt to access
> anything within the mount point directory will cause it to be mounted.
> 
> There is still the problem of not knowing map keys when the wildcard
> map entry is used.
> 
> Still, stat family systems calls have always had similar problems that
> prevent consistent behavior.
> 
>>
>> This happens because "ls" calls 'lstat' on the path and when that fails
>> with "ENOENT", it doesn't bother trying to open.
> 
> I know, I believe the ENOENT is appropriate because there is no mount
> and no directory at the time this happens.
> 
>>
>> An alternate approach to the problem that this patch addresses is to
>> *not* change follow_automount() but instead change the automount daemon
>> and possibly autofs.
> 
> Perhaps, but the daemon probably doesn't have enough information to
> make decisions about this so there would need to be something coming
> from the kernel too.
> 
>>
>> When a notification of access for an indirect mount point is received,
>> it would firstly just create the directory - not triggering a mount.
>> If another notification is then received (after the directory has been
>> created), it then triggers the mount.
> 
> Not sure, perhaps.
> 
> But I don't think that will work, I've had many problems with this type
> behavior due to bugs and I think the the same sort of problems would be
> encountered.
> 
> The indirect mount "browse" option which behaves very much like what your
> suggesting is the internal program default, and has been since the initial
> v5 release. Historically it is disabled on install to maintain backward
> compatibility with the original Linux autofs (which is also the reason for
> always triggering an automount on ->lookup()).
> 
> Perhaps now is the time for that to change. 
> 
>>
>> I suspect this might need changes to autofs to avoid races when two
>> threads call lstat() on the same path at the same time.  We would need
>> to ensure that automount didn't see this as two requests.... though
>> maybe it already has enough locking.
>>
>> Does that seem reasonable?  Should we revert this patch (as a
>> regression) and do something in automount instead?

Hey Neil, I'm looking at this again because RH QE have complained about
a regression test failing with a kernel that has this change.

Maybe I'm just dumb but I though a "find <base directory> <options>"
would, well, just look at the contents below <base directory> but an
strace shows that it reads and calls fstatat() on "every entry in the
mount table" regardless of the path.

And with the move of userspace to use /proc based mount tables (one
example being the symlink of /etc/mtab into /proc) even modest sized
direct mount maps will be a problem with every entry getting mounted.

Systems will cope with this fine but larger systems not so much.

If find does this then the user space changes needed to accommodate
this sort of change are almost certainly far more than I expected.

I think this is an example of the larger problem I'm faced with and
this change was was meant to be a starting point for resolution.

The most obvious symptom of the problem is auto-mounts no longer able
to be expired due to being re-mounted immediately after expire. Another
symptom is unwanted (by the user) accesses causing unexpected auto-mount
attempts.

I believe this monitoring of the mount table is what leads to excessive
CPU consumption I've seen, usually around six processes, under heavy
mount activity. And following this, when the mount table is large and
there is "no mount activity" two of the six processes continue to consume
excessive CPU, until the mount table shrinks.

So now I'm coming around to the idea of reverting this change ..... and
going back to the drawing board.

> 
> Can you check out the "browse" option behavior before we talk further
> about this please.
> 
> The problem in user space now is that there is no way to control
> whether an indirect mount that uses the "nobrowse" option is triggered
> or not (using fstatat() or statx()) regardless of the flags used and it's
> essential the AT_NO_AUTOMOUNT flag work as expected to have user space
> take more care in not triggering automounts.
> 
>>
>> Thanks,
>> NeilBrown
>>
>>
>>>
>>> It's also possible that a stat family call has been made with a
>>> path that is in the process of being mounted by some other process.
>>> But stat family calls should return the automount state of the path
>>> as it is "now" so it shouldn't wait for mount completion.
>>>
>>> This is the same semantic as the positive dentry case already
>>> handled.
>>>
>>> Signed-off-by: Ian Kent <raven@themaw.net>
>>> Cc: David Howells <dhowells@redhat.com>
>>> Cc: Colin Walters <walters@redhat.com>
>>> Cc: Ondrej Holy <oholy@redhat.com>
>>> Cc: stable@vger.kernel.org
>>> ---
>>>  fs/namei.c         |   15 ++++++++++++---
>>>  include/linux/fs.h |    3 +--
>>>  2 files changed, 13 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/fs/namei.c b/fs/namei.c
>>> index 7286f87..cd74838 100644
>>> --- a/fs/namei.c
>>> +++ b/fs/namei.c
>>> @@ -1129,9 +1129,18 @@ static int follow_automount(struct path *path, struct nameidata *nd,
>>>  	 * of the daemon to instantiate them before they can be used.
>>>  	 */
>>>  	if (!(nd->flags & (LOOKUP_PARENT | LOOKUP_DIRECTORY |
>>> -			   LOOKUP_OPEN | LOOKUP_CREATE | LOOKUP_AUTOMOUNT)) &&
>>> -	    path->dentry->d_inode)
>>> -		return -EISDIR;
>>> +			   LOOKUP_OPEN | LOOKUP_CREATE |
>>> +			   LOOKUP_AUTOMOUNT))) {
>>> +		/* Positive dentry that isn't meant to trigger an
>>> +		 * automount, EISDIR will allow it to be used,
>>> +		 * otherwise there's no mount here "now" so return
>>> +		 * ENOENT.
>>> +		 */
>>> +		if (path->dentry->d_inode)
>>> +			return -EISDIR;
>>> +		else
>>> +			return -ENOENT;
>>> +	}
>>>  
>>>  	if (path->dentry->d_sb->s_user_ns != &init_user_ns)
>>>  		return -EACCES;
>>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>>> index 26488b4..be09684 100644
>>> --- a/include/linux/fs.h
>>> +++ b/include/linux/fs.h
>>> @@ -2935,8 +2935,7 @@ static inline int vfs_lstat(const char __user *name, struct kstat *stat)
>>>  static inline int vfs_fstatat(int dfd, const char __user *filename,
>>>  			      struct kstat *stat, int flags)
>>>  {
>>> -	return vfs_statx(dfd, filename, flags | AT_NO_AUTOMOUNT,
>>> -			 stat, STATX_BASIC_STATS);
>>> +	return vfs_statx(dfd, filename, flags, stat, STATX_BASIC_STATS);
>>>  }
>>>  static inline int vfs_fstat(int fd, struct kstat *stat)
>>>  {
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] autofs - fix AT_NO_AUTOMOUNT not being honored
  2017-11-22  4:28       ` Ian Kent
  (?)
  (?)
@ 2017-11-23  0:47       ` NeilBrown
  2017-11-23  1:43           ` Ian Kent
  -1 siblings, 1 reply; 44+ messages in thread
From: NeilBrown @ 2017-11-23  0:47 UTC (permalink / raw)
  To: Ian Kent, Al Viro
  Cc: Colin Walters, Ondrej Holy, autofs mailing list,
	Kernel Mailing List, David Howells, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 8220 bytes --]

On Wed, Nov 22 2017, Ian Kent wrote:

> On 21/11/17 09:53, NeilBrown wrote:
>> On Wed, May 10 2017, Ian Kent wrote:
>> 
>>> The fstatat(2) and statx() calls can pass the flag AT_NO_AUTOMOUNT
>>> which is meant to clear the LOOKUP_AUTOMOUNT flag and prevent triggering
>>> of an automount by the call. But this flag is unconditionally cleared
>>> for all stat family system calls except statx().
>>>
>>> stat family system calls have always triggered mount requests for the
>>> negative dentry case in follow_automount() which is intended but prevents
>>> the fstatat(2) and statx() AT_NO_AUTOMOUNT case from being handled.
>>>
>>> In order to handle the AT_NO_AUTOMOUNT for both system calls the
>>> negative dentry case in follow_automount() needs to be changed to
>>> return ENOENT when the LOOKUP_AUTOMOUNT flag is clear (and the other
>>> required flags are clear).
>>>
>>> AFAICT this change doesn't have any noticable side effects and may,
>>> in some use cases (although I didn't see it in testing) prevent
>>> unnecessary callbacks to the automount daemon.
>> 
>> Actually, this patch does have a noticeable side effect.
>
> That's right.
>
>> 
>> Previously, if /home were an indirect mount point so that /home/neilb
>> would mount my home directory, "ls -l /home/neilb" would always work.
>> Now it doesn't.
>
> An this is correct too.
>
> The previous policy was that a ->lookup() would always cause a mount to
> occur. It's that policy that prevents the AT_NO_AUTOMOUNT flag from being
> able to work consistently.

Just to be clear, the previous policy was:
 kernel - a lookup would cause a message to be sent to the automount daemon
 daemon - the receipt of a message would cause a mount to occur.

Either part of this policy could be changed to allow AT_NO_AUTOMOUNT to
work reliably.  You chose to change the first.  At the time I thought
this was a good idea.  I no longer think so.

Specifically, I think the second part of the policy should be revised a little.

>
> If you set the indirect mount "browse" option that will cause the mount
> point directories for each of the map keys to be pre-created. So a
> directory listing will show the mount points and an attempt to access
> anything within the mount point directory will cause it to be mounted.
>
> There is still the problem of not knowing map keys when the wildcard
> map entry is used.
>
> Still, stat family systems calls have always had similar problems that
> prevent consistent behavior.

Yes, I understand all this.  stat family sys-call have some odd
behaviours at times like "stat(); open(); fstat()" will result in
different sets of status info.  This is known.
The point is that these odd behaviours have changed in a noticeably way
(contrary to the change log comment) and it isn't clear these changes
are good.

>
>> 
>> This happens because "ls" calls 'lstat' on the path and when that fails
>> with "ENOENT", it doesn't bother trying to open.
>
> I know, I believe the ENOENT is appropriate because there is no mount
> and no directory at the time this happens.

Two distinct statements here.  "no mount" and "no directory".
If AT_NO_AUTOMOUNT is given (or implicit) the "no mount" is significant
and shouldn't be changed.   It isn't clear that "no directory" is
significant.
If you think of the list of names stored in the autofs filesystem as a
cache of recently used names from the map, then the directory *does*
exist (in the map), it is just that the (in-kernel) cache hasn't been
populated yet.
Should "AT_NO_AUTOMOUNT" prevent the cache from being populated?

>
>> 
>> An alternate approach to the problem that this patch addresses is to
>> *not* change follow_automount() but instead change the automount daemon
>> and possibly autofs.
>
> Perhaps, but the daemon probably doesn't have enough information to
> make decisions about this so there would need to be something coming
> from the kernel too.

I don't think so.
The daemon already has a promise that upcalls for a given name are
serialized, and it has the ability to test if a name is already in the
cache.  This is enough.
I applied the following patch:

diff --git a/modules/mount_nfs.c b/modules/mount_nfs.c
index c558a7381054..7ddfe9c61038 100644
--- a/modules/mount_nfs.c
+++ b/modules/mount_nfs.c
@@ -269,6 +269,11 @@ dont_probe:
 		free_host_list(&hosts);
 		return 1;
 	}
+	if (!status) {
+		debug(ap->logopt, MODPREFIX "created dir, that'll do for now");
+		free_host_list(&hosts);
+		return 0;
+	}
 
 	if (!status)
 		existed = 0;

to automount and now it behaves (for NFS mounts) how I would like (though
there is room for improvement).

>
>> 
>> When a notification of access for an indirect mount point is received,
>> it would firstly just create the directory - not triggering a mount.
>> If another notification is then received (after the directory has been
>> created), it then triggers the mount.
>
> Not sure, perhaps.
>
> But I don't think that will work, I've had many problems with this type
> behavior due to bugs and I think the the same sort of problems would be
> encountered.
>
> The indirect mount "browse" option which behaves very much like what your
> suggesting is the internal program default, and has been since the initial
> v5 release. Historically it is disabled on install to maintain backward
> compatibility with the original Linux autofs (which is also the reason for
> always triggering an automount on ->lookup()).
>
> Perhaps now is the time for that to change.

Enabling browse mode does resolve this problem when the map is
enumerable (as you say, wildcards can't be supported).
But browse mode isn't always wanted.  If you have a very large map, then
caching all 10,000 entries in the kernel may be a pointless waste of
time and space.

>
>> 
>> I suspect this might need changes to autofs to avoid races when two
>> threads call lstat() on the same path at the same time.  We would need
>> to ensure that automount didn't see this as two requests.... though
>> maybe it already has enough locking.
>> 
>> Does that seem reasonable?  Should we revert this patch (as a
>> regression) and do something in automount instead?
>
> Can you check out the "browse" option behavior before we talk further
> about this please.

Done that.

>
> The problem in user space now is that there is no way to control
> whether an indirect mount that uses the "nobrowse" option is triggered
> or not (using fstatat() or statx()) regardless of the flags used and it's
> essential the AT_NO_AUTOMOUNT flag work as expected to have user space
> take more care in not triggering automounts.

I think that user-space has all the control that it needs.
There are two cases:
1/ daemon receives "missing indirect" message and the directory
   doesn't currently exist (mkdir() by the daemon succeeds).
   This could be an AT_NO_AUTOMOUNT lookup, or it could be a full
   open() or similar.  In either case the directory gets created if
   the map suggests that the name should exist.  The daemon needs to
   be careful not to block for long if it goes off-host to check for
   validity of the name.

2/ daemon received "missing indirect" message and the directory
   *does* exist (mkdir() by daemon fails with -EEXIST).
   This cannot be an AT_NO_AUTOMOUNT lookup, it must be a lookup
   intended to trigger automounts.  In this case, we trigger the
   mount.

This does lead to an odd behaviour with wildcards.
If /misc is a map with a wildcard entry, then
  ls -d /misc/IDoNotExist
will create IDoNotExist and report its existence.
  ls -l /misc/IDoNotExist
will report that it doesn't exist, but it will still have
been created.

Removing the directory after a mount-attempt fails is probably only
a few more lines of code to my patch, and would fix the worst of this.
Have a very short timeout on directories that don't immediately get
mounted on (e.g. 2 seconds??) would further improve the situation.
That timeout would have to be completely managed in the daemon
as the kernel doesn't expire empty directories.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] autofs - fix AT_NO_AUTOMOUNT not being honored
  2017-11-23  0:47       ` NeilBrown
@ 2017-11-23  1:43           ` Ian Kent
  0 siblings, 0 replies; 44+ messages in thread
From: Ian Kent @ 2017-11-23  1:43 UTC (permalink / raw)
  To: NeilBrown, Al Viro
  Cc: Colin Walters, Ondrej Holy, autofs mailing list,
	Kernel Mailing List, David Howells, linux-fsdevel

On 23/11/17 08:47, NeilBrown wrote:
> On Wed, Nov 22 2017, Ian Kent wrote:
> 
>> On 21/11/17 09:53, NeilBrown wrote:
>>> On Wed, May 10 2017, Ian Kent wrote:
>>>
>>>> The fstatat(2) and statx() calls can pass the flag AT_NO_AUTOMOUNT
>>>> which is meant to clear the LOOKUP_AUTOMOUNT flag and prevent triggering
>>>> of an automount by the call. But this flag is unconditionally cleared
>>>> for all stat family system calls except statx().
>>>>
>>>> stat family system calls have always triggered mount requests for the
>>>> negative dentry case in follow_automount() which is intended but prevents
>>>> the fstatat(2) and statx() AT_NO_AUTOMOUNT case from being handled.
>>>>
>>>> In order to handle the AT_NO_AUTOMOUNT for both system calls the
>>>> negative dentry case in follow_automount() needs to be changed to
>>>> return ENOENT when the LOOKUP_AUTOMOUNT flag is clear (and the other
>>>> required flags are clear).
>>>>
>>>> AFAICT this change doesn't have any noticable side effects and may,
>>>> in some use cases (although I didn't see it in testing) prevent
>>>> unnecessary callbacks to the automount daemon.
>>>
>>> Actually, this patch does have a noticeable side effect.
>>
>> That's right.
>>
>>>
>>> Previously, if /home were an indirect mount point so that /home/neilb
>>> would mount my home directory, "ls -l /home/neilb" would always work.
>>> Now it doesn't.
>>
>> An this is correct too.
>>
>> The previous policy was that a ->lookup() would always cause a mount to
>> occur. It's that policy that prevents the AT_NO_AUTOMOUNT flag from being
>> able to work consistently.
> 
> Just to be clear, the previous policy was:
>  kernel - a lookup would cause a message to be sent to the automount daemon
>  daemon - the receipt of a message would cause a mount to occur.
> 
> Either part of this policy could be changed to allow AT_NO_AUTOMOUNT to
> work reliably.  You chose to change the first.  At the time I thought
> this was a good idea.  I no longer think so.
> 
> Specifically, I think the second part of the policy should be revised a little.
> 
>>
>> If you set the indirect mount "browse" option that will cause the mount
>> point directories for each of the map keys to be pre-created. So a
>> directory listing will show the mount points and an attempt to access
>> anything within the mount point directory will cause it to be mounted.
>>
>> There is still the problem of not knowing map keys when the wildcard
>> map entry is used.
>>
>> Still, stat family systems calls have always had similar problems that
>> prevent consistent behavior.
> 
> Yes, I understand all this.  stat family sys-call have some odd
> behaviours at times like "stat(); open(); fstat()" will result in
> different sets of status info.  This is known.
> The point is that these odd behaviours have changed in a noticeably way
> (contrary to the change log comment) and it isn't clear these changes
> are good.
> 
>>
>>>
>>> This happens because "ls" calls 'lstat' on the path and when that fails
>>> with "ENOENT", it doesn't bother trying to open.
>>
>> I know, I believe the ENOENT is appropriate because there is no mount
>> and no directory at the time this happens.
> 
> Two distinct statements here.  "no mount" and "no directory".
> If AT_NO_AUTOMOUNT is given (or implicit) the "no mount" is significant
> and shouldn't be changed.   It isn't clear that "no directory" is
> significant.
> If you think of the list of names stored in the autofs filesystem as a
> cache of recently used names from the map, then the directory *does*
> exist (in the map), it is just that the (in-kernel) cache hasn't been
> populated yet.
> Should "AT_NO_AUTOMOUNT" prevent the cache from being populated?
> 
>>
>>>
>>> An alternate approach to the problem that this patch addresses is to
>>> *not* change follow_automount() but instead change the automount daemon
>>> and possibly autofs.
>>
>> Perhaps, but the daemon probably doesn't have enough information to
>> make decisions about this so there would need to be something coming
>> from the kernel too.
> 
> I don't think so.
> The daemon already has a promise that upcalls for a given name are
> serialized, and it has the ability to test if a name is already in the
> cache.  This is enough.
> I applied the following patch:
> 
> diff --git a/modules/mount_nfs.c b/modules/mount_nfs.c
> index c558a7381054..7ddfe9c61038 100644
> --- a/modules/mount_nfs.c
> +++ b/modules/mount_nfs.c
> @@ -269,6 +269,11 @@ dont_probe:
>  		free_host_list(&hosts);
>  		return 1;
>  	}
> +	if (!status) {
> +		debug(ap->logopt, MODPREFIX "created dir, that'll do for now");
> +		free_host_list(&hosts);
> +		return 0;
> +	}
>  
>  	if (!status)
>  		existed = 0;
> 
> to automount and now it behaves (for NFS mounts) how I would like (though
> there is room for improvement).
> 
>>
>>>
>>> When a notification of access for an indirect mount point is received,
>>> it would firstly just create the directory - not triggering a mount.
>>> If another notification is then received (after the directory has been
>>> created), it then triggers the mount.
>>
>> Not sure, perhaps.
>>
>> But I don't think that will work, I've had many problems with this type
>> behavior due to bugs and I think the the same sort of problems would be
>> encountered.
>>
>> The indirect mount "browse" option which behaves very much like what your
>> suggesting is the internal program default, and has been since the initial
>> v5 release. Historically it is disabled on install to maintain backward
>> compatibility with the original Linux autofs (which is also the reason for
>> always triggering an automount on ->lookup()).
>>
>> Perhaps now is the time for that to change.
> 
> Enabling browse mode does resolve this problem when the map is
> enumerable (as you say, wildcards can't be supported).
> But browse mode isn't always wanted.  If you have a very large map, then
> caching all 10,000 entries in the kernel may be a pointless waste of
> time and space.

Indeed, that's the main use of nobrowse indirect maps.

In fact another recent change where I moved the last_used field so it
wouldn't be updated on every walk, to help with mounts never expiring,
also needs at different approach.

Doing that causes more aggressive expiring of automounts which increases
umount to mount churn and leads to increased server load at sites with a
very large number of clients.
 
> 
>>
>>>
>>> I suspect this might need changes to autofs to avoid races when two
>>> threads call lstat() on the same path at the same time.  We would need
>>> to ensure that automount didn't see this as two requests.... though
>>> maybe it already has enough locking.
>>>
>>> Does that seem reasonable?  Should we revert this patch (as a
>>> regression) and do something in automount instead?
>>
>> Can you check out the "browse" option behavior before we talk further
>> about this please.
> 
> Done that.
> 
>>
>> The problem in user space now is that there is no way to control
>> whether an indirect mount that uses the "nobrowse" option is triggered
>> or not (using fstatat() or statx()) regardless of the flags used and it's
>> essential the AT_NO_AUTOMOUNT flag work as expected to have user space
>> take more care in not triggering automounts.
> 
> I think that user-space has all the control that it needs.
> There are two cases:
> 1/ daemon receives "missing indirect" message and the directory
>    doesn't currently exist (mkdir() by the daemon succeeds).
>    This could be an AT_NO_AUTOMOUNT lookup, or it could be a full
>    open() or similar.  In either case the directory gets created if
>    the map suggests that the name should exist.  The daemon needs to
>    be careful not to block for long if it goes off-host to check for
>    validity of the name.

Aren't you assuming the the daemon only receives these lookups
on the last path component?

Intermediate path components that can trigger an automount must
be treated as not having AT_NO_AUTOMOUNT in order for the walk to
continue or fail at that point.

> 
> 2/ daemon received "missing indirect" message and the directory
>    *does* exist (mkdir() by daemon fails with -EEXIST).
>    This cannot be an AT_NO_AUTOMOUNT lookup, it must be a lookup
>    intended to trigger automounts.  In this case, we trigger the
>    mount.

An fstatat(<path>, AT_NO_AUTOMOUNT) on a browse indirect map entry
means the directory will exist but user space has asked not to trigger
the mount and return the stat info of the autofs dentry.

Please don't get me wrong, I think the suggestion is good, I just
don't think it's as simple to do as it appears.

> 
> This does lead to an odd behaviour with wildcards.
> If /misc is a map with a wildcard entry, then
>   ls -d /misc/IDoNotExist
> will create IDoNotExist and report its existence.
>   ls -l /misc/IDoNotExist
> will report that it doesn't exist, but it will still have
> been created.
> 
> Removing the directory after a mount-attempt fails is probably only
> a few more lines of code to my patch, and would fix the worst of this.
> Have a very short timeout on directories that don't immediately get
> mounted on (e.g. 2 seconds??) would further improve the situation.
> That timeout would have to be completely managed in the daemon
> as the kernel doesn't expire empty directories.
> 
> Thanks,
> NeilBrown
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] autofs - fix AT_NO_AUTOMOUNT not being honored
@ 2017-11-23  1:43           ` Ian Kent
  0 siblings, 0 replies; 44+ messages in thread
From: Ian Kent @ 2017-11-23  1:43 UTC (permalink / raw)
  To: NeilBrown, Al Viro
  Cc: Colin Walters, Ondrej Holy, autofs mailing list,
	Kernel Mailing List, David Howells, linux-fsdevel

On 23/11/17 08:47, NeilBrown wrote:
> On Wed, Nov 22 2017, Ian Kent wrote:
> 
>> On 21/11/17 09:53, NeilBrown wrote:
>>> On Wed, May 10 2017, Ian Kent wrote:
>>>
>>>> The fstatat(2) and statx() calls can pass the flag AT_NO_AUTOMOUNT
>>>> which is meant to clear the LOOKUP_AUTOMOUNT flag and prevent triggering
>>>> of an automount by the call. But this flag is unconditionally cleared
>>>> for all stat family system calls except statx().
>>>>
>>>> stat family system calls have always triggered mount requests for the
>>>> negative dentry case in follow_automount() which is intended but prevents
>>>> the fstatat(2) and statx() AT_NO_AUTOMOUNT case from being handled.
>>>>
>>>> In order to handle the AT_NO_AUTOMOUNT for both system calls the
>>>> negative dentry case in follow_automount() needs to be changed to
>>>> return ENOENT when the LOOKUP_AUTOMOUNT flag is clear (and the other
>>>> required flags are clear).
>>>>
>>>> AFAICT this change doesn't have any noticable side effects and may,
>>>> in some use cases (although I didn't see it in testing) prevent
>>>> unnecessary callbacks to the automount daemon.
>>>
>>> Actually, this patch does have a noticeable side effect.
>>
>> That's right.
>>
>>>
>>> Previously, if /home were an indirect mount point so that /home/neilb
>>> would mount my home directory, "ls -l /home/neilb" would always work.
>>> Now it doesn't.
>>
>> An this is correct too.
>>
>> The previous policy was that a ->lookup() would always cause a mount to
>> occur. It's that policy that prevents the AT_NO_AUTOMOUNT flag from being
>> able to work consistently.
> 
> Just to be clear, the previous policy was:
>  kernel - a lookup would cause a message to be sent to the automount daemon
>  daemon - the receipt of a message would cause a mount to occur.
> 
> Either part of this policy could be changed to allow AT_NO_AUTOMOUNT to
> work reliably.  You chose to change the first.  At the time I thought
> this was a good idea.  I no longer think so.
> 
> Specifically, I think the second part of the policy should be revised a little.
> 
>>
>> If you set the indirect mount "browse" option that will cause the mount
>> point directories for each of the map keys to be pre-created. So a
>> directory listing will show the mount points and an attempt to access
>> anything within the mount point directory will cause it to be mounted.
>>
>> There is still the problem of not knowing map keys when the wildcard
>> map entry is used.
>>
>> Still, stat family systems calls have always had similar problems that
>> prevent consistent behavior.
> 
> Yes, I understand all this.  stat family sys-call have some odd
> behaviours at times like "stat(); open(); fstat()" will result in
> different sets of status info.  This is known.
> The point is that these odd behaviours have changed in a noticeably way
> (contrary to the change log comment) and it isn't clear these changes
> are good.
> 
>>
>>>
>>> This happens because "ls" calls 'lstat' on the path and when that fails
>>> with "ENOENT", it doesn't bother trying to open.
>>
>> I know, I believe the ENOENT is appropriate because there is no mount
>> and no directory at the time this happens.
> 
> Two distinct statements here.  "no mount" and "no directory".
> If AT_NO_AUTOMOUNT is given (or implicit) the "no mount" is significant
> and shouldn't be changed.   It isn't clear that "no directory" is
> significant.
> If you think of the list of names stored in the autofs filesystem as a
> cache of recently used names from the map, then the directory *does*
> exist (in the map), it is just that the (in-kernel) cache hasn't been
> populated yet.
> Should "AT_NO_AUTOMOUNT" prevent the cache from being populated?
> 
>>
>>>
>>> An alternate approach to the problem that this patch addresses is to
>>> *not* change follow_automount() but instead change the automount daemon
>>> and possibly autofs.
>>
>> Perhaps, but the daemon probably doesn't have enough information to
>> make decisions about this so there would need to be something coming
>> from the kernel too.
> 
> I don't think so.
> The daemon already has a promise that upcalls for a given name are
> serialized, and it has the ability to test if a name is already in the
> cache.  This is enough.
> I applied the following patch:
> 
> diff --git a/modules/mount_nfs.c b/modules/mount_nfs.c
> index c558a7381054..7ddfe9c61038 100644
> --- a/modules/mount_nfs.c
> +++ b/modules/mount_nfs.c
> @@ -269,6 +269,11 @@ dont_probe:
>  		free_host_list(&hosts);
>  		return 1;
>  	}
> +	if (!status) {
> +		debug(ap->logopt, MODPREFIX "created dir, that'll do for now");
> +		free_host_list(&hosts);
> +		return 0;
> +	}
>  
>  	if (!status)
>  		existed = 0;
> 
> to automount and now it behaves (for NFS mounts) how I would like (though
> there is room for improvement).
> 
>>
>>>
>>> When a notification of access for an indirect mount point is received,
>>> it would firstly just create the directory - not triggering a mount.
>>> If another notification is then received (after the directory has been
>>> created), it then triggers the mount.
>>
>> Not sure, perhaps.
>>
>> But I don't think that will work, I've had many problems with this type
>> behavior due to bugs and I think the the same sort of problems would be
>> encountered.
>>
>> The indirect mount "browse" option which behaves very much like what your
>> suggesting is the internal program default, and has been since the initial
>> v5 release. Historically it is disabled on install to maintain backward
>> compatibility with the original Linux autofs (which is also the reason for
>> always triggering an automount on ->lookup()).
>>
>> Perhaps now is the time for that to change.
> 
> Enabling browse mode does resolve this problem when the map is
> enumerable (as you say, wildcards can't be supported).
> But browse mode isn't always wanted.  If you have a very large map, then
> caching all 10,000 entries in the kernel may be a pointless waste of
> time and space.

Indeed, that's the main use of nobrowse indirect maps.

In fact another recent change where I moved the last_used field so it
wouldn't be updated on every walk, to help with mounts never expiring,
also needs at different approach.

Doing that causes more aggressive expiring of automounts which increases
umount to mount churn and leads to increased server load at sites with a
very large number of clients.
 
> 
>>
>>>
>>> I suspect this might need changes to autofs to avoid races when two
>>> threads call lstat() on the same path at the same time.  We would need
>>> to ensure that automount didn't see this as two requests.... though
>>> maybe it already has enough locking.
>>>
>>> Does that seem reasonable?  Should we revert this patch (as a
>>> regression) and do something in automount instead?
>>
>> Can you check out the "browse" option behavior before we talk further
>> about this please.
> 
> Done that.
> 
>>
>> The problem in user space now is that there is no way to control
>> whether an indirect mount that uses the "nobrowse" option is triggered
>> or not (using fstatat() or statx()) regardless of the flags used and it's
>> essential the AT_NO_AUTOMOUNT flag work as expected to have user space
>> take more care in not triggering automounts.
> 
> I think that user-space has all the control that it needs.
> There are two cases:
> 1/ daemon receives "missing indirect" message and the directory
>    doesn't currently exist (mkdir() by the daemon succeeds).
>    This could be an AT_NO_AUTOMOUNT lookup, or it could be a full
>    open() or similar.  In either case the directory gets created if
>    the map suggests that the name should exist.  The daemon needs to
>    be careful not to block for long if it goes off-host to check for
>    validity of the name.

Aren't you assuming the the daemon only receives these lookups
on the last path component?

Intermediate path components that can trigger an automount must
be treated as not having AT_NO_AUTOMOUNT in order for the walk to
continue or fail at that point.

> 
> 2/ daemon received "missing indirect" message and the directory
>    *does* exist (mkdir() by daemon fails with -EEXIST).
>    This cannot be an AT_NO_AUTOMOUNT lookup, it must be a lookup
>    intended to trigger automounts.  In this case, we trigger the
>    mount.

An fstatat(<path>, AT_NO_AUTOMOUNT) on a browse indirect map entry
means the directory will exist but user space has asked not to trigger
the mount and return the stat info of the autofs dentry.

Please don't get me wrong, I think the suggestion is good, I just
don't think it's as simple to do as it appears.

> 
> This does lead to an odd behaviour with wildcards.
> If /misc is a map with a wildcard entry, then
>   ls -d /misc/IDoNotExist
> will create IDoNotExist and report its existence.
>   ls -l /misc/IDoNotExist
> will report that it doesn't exist, but it will still have
> been created.
> 
> Removing the directory after a mount-attempt fails is probably only
> a few more lines of code to my patch, and would fix the worst of this.
> Have a very short timeout on directories that don't immediately get
> mounted on (e.g. 2 seconds??) would further improve the situation.
> That timeout would have to be completely managed in the daemon
> as the kernel doesn't expire empty directories.
> 
> Thanks,
> NeilBrown
> 

--
To unsubscribe from this list: send the line "unsubscribe autofs" in

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] autofs - fix AT_NO_AUTOMOUNT not being honored
  2017-11-23  0:36       ` Ian Kent
@ 2017-11-23  2:21         ` NeilBrown
  2017-11-23  2:46           ` Ian Kent
  2017-11-27 16:01           ` Mike Marion
  1 sibling, 1 reply; 44+ messages in thread
From: NeilBrown @ 2017-11-23  2:21 UTC (permalink / raw)
  To: Ian Kent, Al Viro
  Cc: Colin Walters, Ondrej Holy, autofs mailing list,
	Kernel Mailing List, David Howells, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 2304 bytes --]

On Thu, Nov 23 2017, Ian Kent wrote:

>
> Hey Neil, I'm looking at this again because RH QE have complained about
> a regression test failing with a kernel that has this change.
>
> Maybe I'm just dumb but I though a "find <base directory> <options>"
> would, well, just look at the contents below <base directory> but an
> strace shows that it reads and calls fstatat() on "every entry in the
> mount table" regardless of the path.

weird ... I can only get find to look at the mount table if given the
-fstyp option, and even then it doesn't fstatat anything that isn't in
the tree it is searching.


>
> And with the move of userspace to use /proc based mount tables (one
> example being the symlink of /etc/mtab into /proc) even modest sized
> direct mount maps will be a problem with every entry getting mounted.

But the patch in question is only about indirect mount maps, isn't it?
How is it relevant to direct mount maps?

>
> Systems will cope with this fine but larger systems not so much.
>
> If find does this then the user space changes needed to accommodate
> this sort of change are almost certainly far more than I expected.
>
> I think this is an example of the larger problem I'm faced with and
> this change was was meant to be a starting point for resolution.
>
> The most obvious symptom of the problem is auto-mounts no longer able
> to be expired due to being re-mounted immediately after expire. Another
> symptom is unwanted (by the user) accesses causing unexpected auto-mount
> attempts.
>
> I believe this monitoring of the mount table is what leads to excessive
> CPU consumption I've seen, usually around six processes, under heavy
> mount activity. And following this, when the mount table is large and
> there is "no mount activity" two of the six processes continue to consume
> excessive CPU, until the mount table shrinks.
>
> So now I'm coming around to the idea of reverting this change ..... and
> going back to the drawing board.

I can well imaging that a large mount table could cause problems for
applications that are written to expect one, and I can imagine that
autofs could cause extra issues for such a program as it might change
the mount table more often.  But I haven't yet worked out how this is
related to the patch in question....

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] autofs - fix AT_NO_AUTOMOUNT not being honored
  2017-11-23  1:43           ` Ian Kent
  (?)
@ 2017-11-23  2:26           ` Ian Kent
  -1 siblings, 0 replies; 44+ messages in thread
From: Ian Kent @ 2017-11-23  2:26 UTC (permalink / raw)
  To: NeilBrown, Al Viro
  Cc: Colin Walters, Ondrej Holy, autofs mailing list,
	Kernel Mailing List, David Howells, linux-fsdevel

On 23/11/17 09:43, Ian Kent wrote:
> On 23/11/17 08:47, NeilBrown wrote:
>> On Wed, Nov 22 2017, Ian Kent wrote:
>>
>>> On 21/11/17 09:53, NeilBrown wrote:
>>>> On Wed, May 10 2017, Ian Kent wrote:
>>>>
>>>>> The fstatat(2) and statx() calls can pass the flag AT_NO_AUTOMOUNT
>>>>> which is meant to clear the LOOKUP_AUTOMOUNT flag and prevent triggering
>>>>> of an automount by the call. But this flag is unconditionally cleared
>>>>> for all stat family system calls except statx().
>>>>>
>>>>> stat family system calls have always triggered mount requests for the
>>>>> negative dentry case in follow_automount() which is intended but prevents
>>>>> the fstatat(2) and statx() AT_NO_AUTOMOUNT case from being handled.
>>>>>
>>>>> In order to handle the AT_NO_AUTOMOUNT for both system calls the
>>>>> negative dentry case in follow_automount() needs to be changed to
>>>>> return ENOENT when the LOOKUP_AUTOMOUNT flag is clear (and the other
>>>>> required flags are clear).
>>>>>
>>>>> AFAICT this change doesn't have any noticable side effects and may,
>>>>> in some use cases (although I didn't see it in testing) prevent
>>>>> unnecessary callbacks to the automount daemon.
>>>>
>>>> Actually, this patch does have a noticeable side effect.
>>>
>>> That's right.
>>>
>>>>
>>>> Previously, if /home were an indirect mount point so that /home/neilb
>>>> would mount my home directory, "ls -l /home/neilb" would always work.
>>>> Now it doesn't.
>>>
>>> An this is correct too.
>>>
>>> The previous policy was that a ->lookup() would always cause a mount to
>>> occur. It's that policy that prevents the AT_NO_AUTOMOUNT flag from being
>>> able to work consistently.
>>
>> Just to be clear, the previous policy was:
>>  kernel - a lookup would cause a message to be sent to the automount daemon
>>  daemon - the receipt of a message would cause a mount to occur.
>>
>> Either part of this policy could be changed to allow AT_NO_AUTOMOUNT to
>> work reliably.  You chose to change the first.  At the time I thought
>> this was a good idea.  I no longer think so.
>>
>> Specifically, I think the second part of the policy should be revised a little.
>>
>>>
>>> If you set the indirect mount "browse" option that will cause the mount
>>> point directories for each of the map keys to be pre-created. So a
>>> directory listing will show the mount points and an attempt to access
>>> anything within the mount point directory will cause it to be mounted.
>>>
>>> There is still the problem of not knowing map keys when the wildcard
>>> map entry is used.
>>>
>>> Still, stat family systems calls have always had similar problems that
>>> prevent consistent behavior.
>>
>> Yes, I understand all this.  stat family sys-call have some odd
>> behaviours at times like "stat(); open(); fstat()" will result in
>> different sets of status info.  This is known.
>> The point is that these odd behaviours have changed in a noticeably way
>> (contrary to the change log comment) and it isn't clear these changes
>> are good.
>>
>>>
>>>>
>>>> This happens because "ls" calls 'lstat' on the path and when that fails
>>>> with "ENOENT", it doesn't bother trying to open.
>>>
>>> I know, I believe the ENOENT is appropriate because there is no mount
>>> and no directory at the time this happens.
>>
>> Two distinct statements here.  "no mount" and "no directory".
>> If AT_NO_AUTOMOUNT is given (or implicit) the "no mount" is significant
>> and shouldn't be changed.   It isn't clear that "no directory" is
>> significant.
>> If you think of the list of names stored in the autofs filesystem as a
>> cache of recently used names from the map, then the directory *does*
>> exist (in the map), it is just that the (in-kernel) cache hasn't been
>> populated yet.
>> Should "AT_NO_AUTOMOUNT" prevent the cache from being populated?
>>
>>>
>>>>
>>>> An alternate approach to the problem that this patch addresses is to
>>>> *not* change follow_automount() but instead change the automount daemon
>>>> and possibly autofs.
>>>
>>> Perhaps, but the daemon probably doesn't have enough information to
>>> make decisions about this so there would need to be something coming
>>> from the kernel too.
>>
>> I don't think so.
>> The daemon already has a promise that upcalls for a given name are
>> serialized, and it has the ability to test if a name is already in the
>> cache.  This is enough.
>> I applied the following patch:
>>
>> diff --git a/modules/mount_nfs.c b/modules/mount_nfs.c
>> index c558a7381054..7ddfe9c61038 100644
>> --- a/modules/mount_nfs.c
>> +++ b/modules/mount_nfs.c
>> @@ -269,6 +269,11 @@ dont_probe:
>>  		free_host_list(&hosts);
>>  		return 1;
>>  	}
>> +	if (!status) {
>> +		debug(ap->logopt, MODPREFIX "created dir, that'll do for now");
>> +		free_host_list(&hosts);
>> +		return 0;
>> +	}
>>  
>>  	if (!status)
>>  		existed = 0;
>>
>> to automount and now it behaves (for NFS mounts) how I would like (though
>> there is room for improvement).
>>
>>>
>>>>
>>>> When a notification of access for an indirect mount point is received,
>>>> it would firstly just create the directory - not triggering a mount.
>>>> If another notification is then received (after the directory has been
>>>> created), it then triggers the mount.
>>>
>>> Not sure, perhaps.
>>>
>>> But I don't think that will work, I've had many problems with this type
>>> behavior due to bugs and I think the the same sort of problems would be
>>> encountered.
>>>
>>> The indirect mount "browse" option which behaves very much like what your
>>> suggesting is the internal program default, and has been since the initial
>>> v5 release. Historically it is disabled on install to maintain backward
>>> compatibility with the original Linux autofs (which is also the reason for
>>> always triggering an automount on ->lookup()).
>>>
>>> Perhaps now is the time for that to change.
>>
>> Enabling browse mode does resolve this problem when the map is
>> enumerable (as you say, wildcards can't be supported).
>> But browse mode isn't always wanted.  If you have a very large map, then
>> caching all 10,000 entries in the kernel may be a pointless waste of
>> time and space.
> 
> Indeed, that's the main use of nobrowse indirect maps.
> 
> In fact another recent change where I moved the last_used field so it
> wouldn't be updated on every walk, to help with mounts never expiring,
> also needs at different approach.
> 
> Doing that causes more aggressive expiring of automounts which increases
> umount to mount churn and leads to increased server load at sites with a
> very large number of clients.
>  
>>
>>>
>>>>
>>>> I suspect this might need changes to autofs to avoid races when two
>>>> threads call lstat() on the same path at the same time.  We would need
>>>> to ensure that automount didn't see this as two requests.... though
>>>> maybe it already has enough locking.
>>>>
>>>> Does that seem reasonable?  Should we revert this patch (as a
>>>> regression) and do something in automount instead?
>>>
>>> Can you check out the "browse" option behavior before we talk further
>>> about this please.
>>
>> Done that.
>>
>>>
>>> The problem in user space now is that there is no way to control
>>> whether an indirect mount that uses the "nobrowse" option is triggered
>>> or not (using fstatat() or statx()) regardless of the flags used and it's
>>> essential the AT_NO_AUTOMOUNT flag work as expected to have user space
>>> take more care in not triggering automounts.
>>
>> I think that user-space has all the control that it needs.
>> There are two cases:
>> 1/ daemon receives "missing indirect" message and the directory
>>    doesn't currently exist (mkdir() by the daemon succeeds).
>>    This could be an AT_NO_AUTOMOUNT lookup, or it could be a full
>>    open() or similar.  In either case the directory gets created if
>>    the map suggests that the name should exist.  The daemon needs to
>>    be careful not to block for long if it goes off-host to check for
>>    validity of the name.
> 
> Aren't you assuming the the daemon only receives these lookups
> on the last path component?
> 
> Intermediate path components that can trigger an automount must
> be treated as not having AT_NO_AUTOMOUNT in order for the walk to
> continue or fail at that point.
> 
>>
>> 2/ daemon received "missing indirect" message and the directory
>>    *does* exist (mkdir() by daemon fails with -EEXIST).
>>    This cannot be an AT_NO_AUTOMOUNT lookup, it must be a lookup
>>    intended to trigger automounts.  In this case, we trigger the
>>    mount.
> 
> An fstatat(<path>, AT_NO_AUTOMOUNT) on a browse indirect map entry
> means the directory will exist but user space has asked not to trigger
> the mount and return the stat info of the autofs dentry.

This isn't right, this case won't call back to the daemon as things
are now (both before the recent change and after), it will just return
the stat info of the existing autofs dentry so if a callback reaches
the daemon it can't be an AT_NO_AUTOMOUNT lookup.

Sorry, my mistake.

> 
> Please don't get me wrong, I think the suggestion is good, I just
> don't think it's as simple to do as it appears.
> 
>>
>> This does lead to an odd behaviour with wildcards.
>> If /misc is a map with a wildcard entry, then
>>   ls -d /misc/IDoNotExist
>> will create IDoNotExist and report its existence.
>>   ls -l /misc/IDoNotExist
>> will report that it doesn't exist, but it will still have
>> been created.
>>
>> Removing the directory after a mount-attempt fails is probably only
>> a few more lines of code to my patch, and would fix the worst of this.
>> Have a very short timeout on directories that don't immediately get
>> mounted on (e.g. 2 seconds??) would further improve the situation.
>> That timeout would have to be completely managed in the daemon
>> as the kernel doesn't expire empty directories.
>>
>> Thanks,
>> NeilBrown
>>
> 
> --
> To unsubscribe from this list: send the line "unsubscribe autofs" in
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] autofs - fix AT_NO_AUTOMOUNT not being honored
  2017-11-23  2:21         ` NeilBrown
@ 2017-11-23  2:46           ` Ian Kent
  2017-11-23  3:04               ` Ian Kent
  2017-11-23  4:49             ` NeilBrown
  0 siblings, 2 replies; 44+ messages in thread
From: Ian Kent @ 2017-11-23  2:46 UTC (permalink / raw)
  To: NeilBrown, Al Viro
  Cc: Colin Walters, Ondrej Holy, autofs mailing list,
	Kernel Mailing List, David Howells, linux-fsdevel

On 23/11/17 10:21, NeilBrown wrote:
> On Thu, Nov 23 2017, Ian Kent wrote:
> 
>>
>> Hey Neil, I'm looking at this again because RH QE have complained about
>> a regression test failing with a kernel that has this change.
>>
>> Maybe I'm just dumb but I though a "find <base directory> <options>"
>> would, well, just look at the contents below <base directory> but an
>> strace shows that it reads and calls fstatat() on "every entry in the
>> mount table" regardless of the path.
> 
> weird ... I can only get find to look at the mount table if given the
> -fstyp option, and even then it doesn't fstatat anything that isn't in
> the tree it is searching.

It's probably the -xautofs (exclude autofs fs'es) that was used in
the test that requires reading the mount table to get info about
excluding autofs mounts but the fstatat() on all the entries,
regardless of path, that was a surprise to me.

find did use AT_SYMLINK_NOFOLLOW which historically behaved like
AT_NO_AUTOMOUNT.

> 
> 
>>
>> And with the move of userspace to use /proc based mount tables (one
>> example being the symlink of /etc/mtab into /proc) even modest sized
>> direct mount maps will be a problem with every entry getting mounted.
> 
> But the patch in question is only about indirect mount maps, isn't it?
> How is it relevant to direct mount maps?

The change here will cause fstatat() to trigger direct mounts on access
if it doesn't use AT_NO_AUTOMOUNT.

It's not a problem for browse indirect mounts because they are plain
directories within the autofs mount point not individual autofs mount
triggers.

> 
>>
>> Systems will cope with this fine but larger systems not so much.
>>
>> If find does this then the user space changes needed to accommodate
>> this sort of change are almost certainly far more than I expected.
>>
>> I think this is an example of the larger problem I'm faced with and
>> this change was was meant to be a starting point for resolution.
>>
>> The most obvious symptom of the problem is auto-mounts no longer able
>> to be expired due to being re-mounted immediately after expire. Another
>> symptom is unwanted (by the user) accesses causing unexpected auto-mount
>> attempts.
>>
>> I believe this monitoring of the mount table is what leads to excessive
>> CPU consumption I've seen, usually around six processes, under heavy
>> mount activity. And following this, when the mount table is large and
>> there is "no mount activity" two of the six processes continue to consume
>> excessive CPU, until the mount table shrinks.
>>
>> So now I'm coming around to the idea of reverting this change ..... and
>> going back to the drawing board.
> 
> I can well imaging that a large mount table could cause problems for
> applications that are written to expect one, and I can imagine that
> autofs could cause extra issues for such a program as it might change
> the mount table more often.  But I haven't yet worked out how this is
> related to the patch in question....
> 
> Thanks,
> NeilBrown
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] autofs - fix AT_NO_AUTOMOUNT not being honored
  2017-11-23  2:46           ` Ian Kent
@ 2017-11-23  3:04               ` Ian Kent
  2017-11-23  4:49             ` NeilBrown
  1 sibling, 0 replies; 44+ messages in thread
From: Ian Kent @ 2017-11-23  3:04 UTC (permalink / raw)
  To: NeilBrown, Al Viro
  Cc: Colin Walters, Ondrej Holy, autofs mailing list,
	Kernel Mailing List, David Howells, linux-fsdevel

On 23/11/17 10:46, Ian Kent wrote:
> On 23/11/17 10:21, NeilBrown wrote:
>> On Thu, Nov 23 2017, Ian Kent wrote:
>>
>>>
>>> Hey Neil, I'm looking at this again because RH QE have complained about
>>> a regression test failing with a kernel that has this change.
>>>
>>> Maybe I'm just dumb but I though a "find <base directory> <options>"
>>> would, well, just look at the contents below <base directory> but an
>>> strace shows that it reads and calls fstatat() on "every entry in the
>>> mount table" regardless of the path.
>>
>> weird ... I can only get find to look at the mount table if given the
>> -fstyp option, and even then it doesn't fstatat anything that isn't in
>> the tree it is searching.
> 
> It's probably the -xautofs (exclude autofs fs'es) that was used in
> the test that requires reading the mount table to get info about
> excluding autofs mounts but the fstatat() on all the entries,
> regardless of path, that was a surprise to me.
> 
> find did use AT_SYMLINK_NOFOLLOW which historically behaved like
> AT_NO_AUTOMOUNT.

Check this simple reproducer out:

[root@intel-mahobay-02 ~]# uname -r
4.14.0-0.rc8.2.el7a.x86_64
[root@intel-mahobay-02 ~]# cat /etc/exports
/exportdir-bz1482065 *(rw,no_root_squash)
[root@intel-mahobay-02 ~]# cat /etc/auto.master
/- /etc/auto.direct
[root@intel-mahobay-02 ~]# cat /etc/auto.direct 
/mnt/autofs/test localhost:/exportdir-bz1482065
[root@intel-mahobay-02 ~]# ls /mnt/autofs/
bz1482065  test
[root@intel-mahobay-02 ~]# ls /exportdir-bz1482065
bz1482065
[root@intel-mahobay-02 ~]# find /mnt/autofs -xautofs  -name '*bz1482065*'
[root@intel-mahobay-02 ~]# find /mnt/autofs -xautofs  -name '*bz1482065*'
/mnt/autofs/bz1482065
/mnt/autofs/test/bz1482065

The first find mounts the direct mount /mnt/autofs/test/bz1482065 which causes the
second to list /mnt/autofs/bz1482065 (not sure yet where this comes from, it must
be a plain directory within /mnt/autofs) and the now mounted bind mount (ie. the
localhost: host name in the map entry) /mnt/autofs/test/bz1482065.

The strace is:
execve("/usr/bin/find", ["find", "/mnt/autofs", "-xautofs", "-name", "*bz1482065*"], [/* 45 vars */]) = 0

snip ...

newfstatat(AT_FDCWD, "/mnt/autofs", {st_mode=S_IFDIR|0755, st_size=35, ...}, AT_SYMLINK_NOFOLLOW) = 0
open("/etc/mtab", O_RDONLY|O_CLOEXEC)   = 4
futex(0x7efec9ba55b0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
fstat(4, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7efeca2ea000
read(4, "sysfs /sys sysfs rw,seclabel,nos"..., 1024) = 1024
read(4, "oup/net_cls,net_prio cgroup rw,n"..., 1024) = 1024
read(4, "ges hugetlbfs rw,seclabel,relati"..., 1024) = 594
read(4, "", 1024)                       = 0
close(4)                                = 0
munmap(0x7efeca2ea000, 4096)            = 0
newfstatat(AT_FDCWD, "/sys", {st_mode=S_IFDIR|0555, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/proc", {st_mode=S_IFDIR|0555, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/dev", {st_mode=S_IFDIR|0755, st_size=3280, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/sys/kernel/security", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/dev/shm", {st_mode=S_IFDIR|S_ISVTX|0777, st_size=40, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/dev/pts", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/run", {st_mode=S_IFDIR|0755, st_size=1020, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/sys/fs/cgroup", {st_mode=S_IFDIR|0755, st_size=340, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/sys/fs/cgroup/systemd", {st_mode=S_IFDIR|0555, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/sys/fs/pstore", {st_mode=S_IFDIR|0750, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/sys/fs/cgroup/blkio", {st_mode=S_IFDIR|0555, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/sys/fs/cgroup/cpu,cpuacct", {st_mode=S_IFDIR|0555, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/sys/fs/cgroup/pids", {st_mode=S_IFDIR|0555, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/sys/fs/cgroup/net_cls,net_prio", {st_mode=S_IFDIR|0555, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/sys/fs/cgroup/devices", {st_mode=S_IFDIR|0555, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/sys/fs/cgroup/freezer", {st_mode=S_IFDIR|0555, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/sys/fs/cgroup/perf_event", {st_mode=S_IFDIR|0555, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/sys/fs/cgroup/cpuset", {st_mode=S_IFDIR|0555, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/sys/fs/cgroup/memory", {st_mode=S_IFDIR|0555, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/sys/fs/cgroup/hugetlb", {st_mode=S_IFDIR|0555, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/sys/kernel/config", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/", {st_mode=S_IFDIR|0555, st_size=4096, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/sys/fs/selinux", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/proc/sys/fs/binfmt_misc", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/sys/kernel/debug", {st_mode=S_IFDIR|0700, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/dev/mqueue", {st_mode=S_IFDIR|S_ISVTX|0777, st_size=40, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/dev/hugepages", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/proc/fs/nfsd", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/boot", {st_mode=S_IFDIR|0555, st_size=4096, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/home", {st_mode=S_IFDIR|0755, st_size=18, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/var/lib/nfs/rpc_pipefs", {st_mode=S_IFDIR|0555, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/run/user/0", {st_mode=S_IFDIR|0700, st_size=40, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/proc/sys/fs/binfmt_misc", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/mnt/autofs/test", {st_mode=S_IFDIR|0755, st_size=23, ...}, AT_SYMLINK_NOFOLLOW) = 0
fchdir(3)                               = 0
close(3)                                = 0
close(1)                                = 0
close(2)                                = 0
exit_group(0)                           = ?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] autofs - fix AT_NO_AUTOMOUNT not being honored
@ 2017-11-23  3:04               ` Ian Kent
  0 siblings, 0 replies; 44+ messages in thread
From: Ian Kent @ 2017-11-23  3:04 UTC (permalink / raw)
  To: NeilBrown, Al Viro
  Cc: Colin Walters, Ondrej Holy, autofs mailing list,
	Kernel Mailing List, David Howells, linux-fsdevel

On 23/11/17 10:46, Ian Kent wrote:
> On 23/11/17 10:21, NeilBrown wrote:
>> On Thu, Nov 23 2017, Ian Kent wrote:
>>
>>>
>>> Hey Neil, I'm looking at this again because RH QE have complained about
>>> a regression test failing with a kernel that has this change.
>>>
>>> Maybe I'm just dumb but I though a "find <base directory> <options>"
>>> would, well, just look at the contents below <base directory> but an
>>> strace shows that it reads and calls fstatat() on "every entry in the
>>> mount table" regardless of the path.
>>
>> weird ... I can only get find to look at the mount table if given the
>> -fstyp option, and even then it doesn't fstatat anything that isn't in
>> the tree it is searching.
> 
> It's probably the -xautofs (exclude autofs fs'es) that was used in
> the test that requires reading the mount table to get info about
> excluding autofs mounts but the fstatat() on all the entries,
> regardless of path, that was a surprise to me.
> 
> find did use AT_SYMLINK_NOFOLLOW which historically behaved like
> AT_NO_AUTOMOUNT.

Check this simple reproducer out:

[root@intel-mahobay-02 ~]# uname -r
4.14.0-0.rc8.2.el7a.x86_64
[root@intel-mahobay-02 ~]# cat /etc/exports
/exportdir-bz1482065 *(rw,no_root_squash)
[root@intel-mahobay-02 ~]# cat /etc/auto.master
/- /etc/auto.direct
[root@intel-mahobay-02 ~]# cat /etc/auto.direct 
/mnt/autofs/test localhost:/exportdir-bz1482065
[root@intel-mahobay-02 ~]# ls /mnt/autofs/
bz1482065  test
[root@intel-mahobay-02 ~]# ls /exportdir-bz1482065
bz1482065
[root@intel-mahobay-02 ~]# find /mnt/autofs -xautofs  -name '*bz1482065*'
[root@intel-mahobay-02 ~]# find /mnt/autofs -xautofs  -name '*bz1482065*'
/mnt/autofs/bz1482065
/mnt/autofs/test/bz1482065

The first find mounts the direct mount /mnt/autofs/test/bz1482065 which causes the
second to list /mnt/autofs/bz1482065 (not sure yet where this comes from, it must
be a plain directory within /mnt/autofs) and the now mounted bind mount (ie. the
localhost: host name in the map entry) /mnt/autofs/test/bz1482065.

The strace is:
execve("/usr/bin/find", ["find", "/mnt/autofs", "-xautofs", "-name", "*bz1482065*"], [/* 45 vars */]) = 0

snip ...

newfstatat(AT_FDCWD, "/mnt/autofs", {st_mode=S_IFDIR|0755, st_size=35, ...}, AT_SYMLINK_NOFOLLOW) = 0
open("/etc/mtab", O_RDONLY|O_CLOEXEC)   = 4
futex(0x7efec9ba55b0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
fstat(4, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7efeca2ea000
read(4, "sysfs /sys sysfs rw,seclabel,nos"..., 1024) = 1024
read(4, "oup/net_cls,net_prio cgroup rw,n"..., 1024) = 1024
read(4, "ges hugetlbfs rw,seclabel,relati"..., 1024) = 594
read(4, "", 1024)                       = 0
close(4)                                = 0
munmap(0x7efeca2ea000, 4096)            = 0
newfstatat(AT_FDCWD, "/sys", {st_mode=S_IFDIR|0555, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/proc", {st_mode=S_IFDIR|0555, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/dev", {st_mode=S_IFDIR|0755, st_size=3280, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/sys/kernel/security", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/dev/shm", {st_mode=S_IFDIR|S_ISVTX|0777, st_size=40, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/dev/pts", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/run", {st_mode=S_IFDIR|0755, st_size=1020, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/sys/fs/cgroup", {st_mode=S_IFDIR|0755, st_size=340, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/sys/fs/cgroup/systemd", {st_mode=S_IFDIR|0555, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/sys/fs/pstore", {st_mode=S_IFDIR|0750, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/sys/fs/cgroup/blkio", {st_mode=S_IFDIR|0555, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/sys/fs/cgroup/cpu,cpuacct", {st_mode=S_IFDIR|0555, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/sys/fs/cgroup/pids", {st_mode=S_IFDIR|0555, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/sys/fs/cgroup/net_cls,net_prio", {st_mode=S_IFDIR|0555, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/sys/fs/cgroup/devices", {st_mode=S_IFDIR|0555, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/sys/fs/cgroup/freezer", {st_mode=S_IFDIR|0555, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/sys/fs/cgroup/perf_event", {st_mode=S_IFDIR|0555, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/sys/fs/cgroup/cpuset", {st_mode=S_IFDIR|0555, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/sys/fs/cgroup/memory", {st_mode=S_IFDIR|0555, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/sys/fs/cgroup/hugetlb", {st_mode=S_IFDIR|0555, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/sys/kernel/config", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/", {st_mode=S_IFDIR|0555, st_size=4096, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/sys/fs/selinux", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/proc/sys/fs/binfmt_misc", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/sys/kernel/debug", {st_mode=S_IFDIR|0700, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/dev/mqueue", {st_mode=S_IFDIR|S_ISVTX|0777, st_size=40, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/dev/hugepages", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/proc/fs/nfsd", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/boot", {st_mode=S_IFDIR|0555, st_size=4096, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/home", {st_mode=S_IFDIR|0755, st_size=18, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/var/lib/nfs/rpc_pipefs", {st_mode=S_IFDIR|0555, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/run/user/0", {st_mode=S_IFDIR|0700, st_size=40, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/proc/sys/fs/binfmt_misc", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/mnt/autofs/test", {st_mode=S_IFDIR|0755, st_size=23, ...}, AT_SYMLINK_NOFOLLOW) = 0
fchdir(3)                               = 0
close(3)                                = 0
close(1)                                = 0
close(2)                                = 0
exit_group(0)                           = ?

--
To unsubscribe from this list: send the line "unsubscribe autofs" in

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] autofs - fix AT_NO_AUTOMOUNT not being honored
  2017-11-23  1:43           ` Ian Kent
  (?)
  (?)
@ 2017-11-23  3:04           ` NeilBrown
  2017-11-23  3:41             ` Ian Kent
  -1 siblings, 1 reply; 44+ messages in thread
From: NeilBrown @ 2017-11-23  3:04 UTC (permalink / raw)
  To: Ian Kent, Al Viro
  Cc: Colin Walters, Ondrej Holy, autofs mailing list,
	Kernel Mailing List, David Howells, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 9732 bytes --]

On Thu, Nov 23 2017, Ian Kent wrote:

> On 23/11/17 08:47, NeilBrown wrote:
>> On Wed, Nov 22 2017, Ian Kent wrote:
>> 
>>> On 21/11/17 09:53, NeilBrown wrote:
>>>> On Wed, May 10 2017, Ian Kent wrote:
>>>>
>>>>> The fstatat(2) and statx() calls can pass the flag AT_NO_AUTOMOUNT
>>>>> which is meant to clear the LOOKUP_AUTOMOUNT flag and prevent triggering
>>>>> of an automount by the call. But this flag is unconditionally cleared
>>>>> for all stat family system calls except statx().
>>>>>
>>>>> stat family system calls have always triggered mount requests for the
>>>>> negative dentry case in follow_automount() which is intended but prevents
>>>>> the fstatat(2) and statx() AT_NO_AUTOMOUNT case from being handled.
>>>>>
>>>>> In order to handle the AT_NO_AUTOMOUNT for both system calls the
>>>>> negative dentry case in follow_automount() needs to be changed to
>>>>> return ENOENT when the LOOKUP_AUTOMOUNT flag is clear (and the other
>>>>> required flags are clear).
>>>>>
>>>>> AFAICT this change doesn't have any noticable side effects and may,
>>>>> in some use cases (although I didn't see it in testing) prevent
>>>>> unnecessary callbacks to the automount daemon.
>>>>
>>>> Actually, this patch does have a noticeable side effect.
>>>
>>> That's right.
>>>
>>>>
>>>> Previously, if /home were an indirect mount point so that /home/neilb
>>>> would mount my home directory, "ls -l /home/neilb" would always work.
>>>> Now it doesn't.
>>>
>>> An this is correct too.
>>>
>>> The previous policy was that a ->lookup() would always cause a mount to
>>> occur. It's that policy that prevents the AT_NO_AUTOMOUNT flag from being
>>> able to work consistently.
>> 
>> Just to be clear, the previous policy was:
>>  kernel - a lookup would cause a message to be sent to the automount daemon
>>  daemon - the receipt of a message would cause a mount to occur.
>> 
>> Either part of this policy could be changed to allow AT_NO_AUTOMOUNT to
>> work reliably.  You chose to change the first.  At the time I thought
>> this was a good idea.  I no longer think so.
>> 
>> Specifically, I think the second part of the policy should be revised a little.
>> 
>>>
>>> If you set the indirect mount "browse" option that will cause the mount
>>> point directories for each of the map keys to be pre-created. So a
>>> directory listing will show the mount points and an attempt to access
>>> anything within the mount point directory will cause it to be mounted.
>>>
>>> There is still the problem of not knowing map keys when the wildcard
>>> map entry is used.
>>>
>>> Still, stat family systems calls have always had similar problems that
>>> prevent consistent behavior.
>> 
>> Yes, I understand all this.  stat family sys-call have some odd
>> behaviours at times like "stat(); open(); fstat()" will result in
>> different sets of status info.  This is known.
>> The point is that these odd behaviours have changed in a noticeably way
>> (contrary to the change log comment) and it isn't clear these changes
>> are good.
>> 
>>>
>>>>
>>>> This happens because "ls" calls 'lstat' on the path and when that fails
>>>> with "ENOENT", it doesn't bother trying to open.
>>>
>>> I know, I believe the ENOENT is appropriate because there is no mount
>>> and no directory at the time this happens.
>> 
>> Two distinct statements here.  "no mount" and "no directory".
>> If AT_NO_AUTOMOUNT is given (or implicit) the "no mount" is significant
>> and shouldn't be changed.   It isn't clear that "no directory" is
>> significant.
>> If you think of the list of names stored in the autofs filesystem as a
>> cache of recently used names from the map, then the directory *does*
>> exist (in the map), it is just that the (in-kernel) cache hasn't been
>> populated yet.
>> Should "AT_NO_AUTOMOUNT" prevent the cache from being populated?
>> 
>>>
>>>>
>>>> An alternate approach to the problem that this patch addresses is to
>>>> *not* change follow_automount() but instead change the automount daemon
>>>> and possibly autofs.
>>>
>>> Perhaps, but the daemon probably doesn't have enough information to
>>> make decisions about this so there would need to be something coming
>>> from the kernel too.
>> 
>> I don't think so.
>> The daemon already has a promise that upcalls for a given name are
>> serialized, and it has the ability to test if a name is already in the
>> cache.  This is enough.
>> I applied the following patch:
>> 
>> diff --git a/modules/mount_nfs.c b/modules/mount_nfs.c
>> index c558a7381054..7ddfe9c61038 100644
>> --- a/modules/mount_nfs.c
>> +++ b/modules/mount_nfs.c
>> @@ -269,6 +269,11 @@ dont_probe:
>>  		free_host_list(&hosts);
>>  		return 1;
>>  	}
>> +	if (!status) {
>> +		debug(ap->logopt, MODPREFIX "created dir, that'll do for now");
>> +		free_host_list(&hosts);
>> +		return 0;
>> +	}
>>  
>>  	if (!status)
>>  		existed = 0;
>> 
>> to automount and now it behaves (for NFS mounts) how I would like (though
>> there is room for improvement).
>> 
>>>
>>>>
>>>> When a notification of access for an indirect mount point is received,
>>>> it would firstly just create the directory - not triggering a mount.
>>>> If another notification is then received (after the directory has been
>>>> created), it then triggers the mount.
>>>
>>> Not sure, perhaps.
>>>
>>> But I don't think that will work, I've had many problems with this type
>>> behavior due to bugs and I think the the same sort of problems would be
>>> encountered.
>>>
>>> The indirect mount "browse" option which behaves very much like what your
>>> suggesting is the internal program default, and has been since the initial
>>> v5 release. Historically it is disabled on install to maintain backward
>>> compatibility with the original Linux autofs (which is also the reason for
>>> always triggering an automount on ->lookup()).
>>>
>>> Perhaps now is the time for that to change.
>> 
>> Enabling browse mode does resolve this problem when the map is
>> enumerable (as you say, wildcards can't be supported).
>> But browse mode isn't always wanted.  If you have a very large map, then
>> caching all 10,000 entries in the kernel may be a pointless waste of
>> time and space.
>
> Indeed, that's the main use of nobrowse indirect maps.
>
> In fact another recent change where I moved the last_used field so it
> wouldn't be updated on every walk, to help with mounts never expiring,
> also needs at different approach.
>
> Doing that causes more aggressive expiring of automounts which increases
> umount to mount churn and leads to increased server load at sites with a
> very large number of clients.
>  
>> 
>>>
>>>>
>>>> I suspect this might need changes to autofs to avoid races when two
>>>> threads call lstat() on the same path at the same time.  We would need
>>>> to ensure that automount didn't see this as two requests.... though
>>>> maybe it already has enough locking.
>>>>
>>>> Does that seem reasonable?  Should we revert this patch (as a
>>>> regression) and do something in automount instead?
>>>
>>> Can you check out the "browse" option behavior before we talk further
>>> about this please.
>> 
>> Done that.
>> 
>>>
>>> The problem in user space now is that there is no way to control
>>> whether an indirect mount that uses the "nobrowse" option is triggered
>>> or not (using fstatat() or statx()) regardless of the flags used and it's
>>> essential the AT_NO_AUTOMOUNT flag work as expected to have user space
>>> take more care in not triggering automounts.
>> 
>> I think that user-space has all the control that it needs.
>> There are two cases:
>> 1/ daemon receives "missing indirect" message and the directory
>>    doesn't currently exist (mkdir() by the daemon succeeds).
>>    This could be an AT_NO_AUTOMOUNT lookup, or it could be a full
>>    open() or similar.  In either case the directory gets created if
>>    the map suggests that the name should exist.  The daemon needs to
>>    be careful not to block for long if it goes off-host to check for
>>    validity of the name.
>
> Aren't you assuming the the daemon only receives these lookups
> on the last path component?

No.
follow_managed() calls follow_automount() in a loop until it fails
(including -EISDIR which isn't exactly failure), or until no automount
is needed.
So where-ever the DCACHE_NEED_AUTOMOUNT is in the path, d_automount()
will be called if the dentry is negative, and unless it is the last
component of a NO_AUTOMOUNT lookup, it will be called if the dentry
isn't mounted-on.  So it would now be called twice for indirect
mounts that aren't browseable - once to mkdir and once to mount.  Might
there be a problem there?

>
> Intermediate path components that can trigger an automount must
> be treated as not having AT_NO_AUTOMOUNT in order for the walk to
> continue or fail at that point.
>
>> 
>> 2/ daemon received "missing indirect" message and the directory
>>    *does* exist (mkdir() by daemon fails with -EEXIST).
>>    This cannot be an AT_NO_AUTOMOUNT lookup, it must be a lookup
>>    intended to trigger automounts.  In this case, we trigger the
>>    mount.
>
> An fstatat(<path>, AT_NO_AUTOMOUNT) on a browse indirect map entry
> means the directory will exist but user space has asked not to trigger
> the mount and return the stat info of the autofs dentry.
>
> Please don't get me wrong, I think the suggestion is good, I just
> don't think it's as simple to do as it appears.

Maybe I should write a more complete patch for you to experiment with.

Thanks,
NeilBrown


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] autofs - fix AT_NO_AUTOMOUNT not being honored
  2017-11-23  3:04           ` NeilBrown
@ 2017-11-23  3:41             ` Ian Kent
  0 siblings, 0 replies; 44+ messages in thread
From: Ian Kent @ 2017-11-23  3:41 UTC (permalink / raw)
  To: NeilBrown, Al Viro
  Cc: Colin Walters, Ondrej Holy, autofs mailing list,
	Kernel Mailing List, David Howells, linux-fsdevel

On 23/11/17 11:04, NeilBrown wrote:
> On Thu, Nov 23 2017, Ian Kent wrote:
> 
>> On 23/11/17 08:47, NeilBrown wrote:
>>> On Wed, Nov 22 2017, Ian Kent wrote:
>>>
>>>> On 21/11/17 09:53, NeilBrown wrote:
>>>>> On Wed, May 10 2017, Ian Kent wrote:
>>>>>
>>>>>> The fstatat(2) and statx() calls can pass the flag AT_NO_AUTOMOUNT
>>>>>> which is meant to clear the LOOKUP_AUTOMOUNT flag and prevent triggering
>>>>>> of an automount by the call. But this flag is unconditionally cleared
>>>>>> for all stat family system calls except statx().
>>>>>>
>>>>>> stat family system calls have always triggered mount requests for the
>>>>>> negative dentry case in follow_automount() which is intended but prevents
>>>>>> the fstatat(2) and statx() AT_NO_AUTOMOUNT case from being handled.
>>>>>>
>>>>>> In order to handle the AT_NO_AUTOMOUNT for both system calls the
>>>>>> negative dentry case in follow_automount() needs to be changed to
>>>>>> return ENOENT when the LOOKUP_AUTOMOUNT flag is clear (and the other
>>>>>> required flags are clear).
>>>>>>
>>>>>> AFAICT this change doesn't have any noticable side effects and may,
>>>>>> in some use cases (although I didn't see it in testing) prevent
>>>>>> unnecessary callbacks to the automount daemon.
>>>>>
>>>>> Actually, this patch does have a noticeable side effect.
>>>>
>>>> That's right.
>>>>
>>>>>
>>>>> Previously, if /home were an indirect mount point so that /home/neilb
>>>>> would mount my home directory, "ls -l /home/neilb" would always work.
>>>>> Now it doesn't.
>>>>
>>>> An this is correct too.
>>>>
>>>> The previous policy was that a ->lookup() would always cause a mount to
>>>> occur. It's that policy that prevents the AT_NO_AUTOMOUNT flag from being
>>>> able to work consistently.
>>>
>>> Just to be clear, the previous policy was:
>>>  kernel - a lookup would cause a message to be sent to the automount daemon
>>>  daemon - the receipt of a message would cause a mount to occur.
>>>
>>> Either part of this policy could be changed to allow AT_NO_AUTOMOUNT to
>>> work reliably.  You chose to change the first.  At the time I thought
>>> this was a good idea.  I no longer think so.
>>>
>>> Specifically, I think the second part of the policy should be revised a little.
>>>
>>>>
>>>> If you set the indirect mount "browse" option that will cause the mount
>>>> point directories for each of the map keys to be pre-created. So a
>>>> directory listing will show the mount points and an attempt to access
>>>> anything within the mount point directory will cause it to be mounted.
>>>>
>>>> There is still the problem of not knowing map keys when the wildcard
>>>> map entry is used.
>>>>
>>>> Still, stat family systems calls have always had similar problems that
>>>> prevent consistent behavior.
>>>
>>> Yes, I understand all this.  stat family sys-call have some odd
>>> behaviours at times like "stat(); open(); fstat()" will result in
>>> different sets of status info.  This is known.
>>> The point is that these odd behaviours have changed in a noticeably way
>>> (contrary to the change log comment) and it isn't clear these changes
>>> are good.
>>>
>>>>
>>>>>
>>>>> This happens because "ls" calls 'lstat' on the path and when that fails
>>>>> with "ENOENT", it doesn't bother trying to open.
>>>>
>>>> I know, I believe the ENOENT is appropriate because there is no mount
>>>> and no directory at the time this happens.
>>>
>>> Two distinct statements here.  "no mount" and "no directory".
>>> If AT_NO_AUTOMOUNT is given (or implicit) the "no mount" is significant
>>> and shouldn't be changed.   It isn't clear that "no directory" is
>>> significant.
>>> If you think of the list of names stored in the autofs filesystem as a
>>> cache of recently used names from the map, then the directory *does*
>>> exist (in the map), it is just that the (in-kernel) cache hasn't been
>>> populated yet.
>>> Should "AT_NO_AUTOMOUNT" prevent the cache from being populated?
>>>
>>>>
>>>>>
>>>>> An alternate approach to the problem that this patch addresses is to
>>>>> *not* change follow_automount() but instead change the automount daemon
>>>>> and possibly autofs.
>>>>
>>>> Perhaps, but the daemon probably doesn't have enough information to
>>>> make decisions about this so there would need to be something coming
>>>> from the kernel too.
>>>
>>> I don't think so.
>>> The daemon already has a promise that upcalls for a given name are
>>> serialized, and it has the ability to test if a name is already in the
>>> cache.  This is enough.
>>> I applied the following patch:
>>>
>>> diff --git a/modules/mount_nfs.c b/modules/mount_nfs.c
>>> index c558a7381054..7ddfe9c61038 100644
>>> --- a/modules/mount_nfs.c
>>> +++ b/modules/mount_nfs.c
>>> @@ -269,6 +269,11 @@ dont_probe:
>>>  		free_host_list(&hosts);
>>>  		return 1;
>>>  	}
>>> +	if (!status) {
>>> +		debug(ap->logopt, MODPREFIX "created dir, that'll do for now");
>>> +		free_host_list(&hosts);
>>> +		return 0;
>>> +	}
>>>  
>>>  	if (!status)
>>>  		existed = 0;
>>>
>>> to automount and now it behaves (for NFS mounts) how I would like (though
>>> there is room for improvement).
>>>
>>>>
>>>>>
>>>>> When a notification of access for an indirect mount point is received,
>>>>> it would firstly just create the directory - not triggering a mount.
>>>>> If another notification is then received (after the directory has been
>>>>> created), it then triggers the mount.
>>>>
>>>> Not sure, perhaps.
>>>>
>>>> But I don't think that will work, I've had many problems with this type
>>>> behavior due to bugs and I think the the same sort of problems would be
>>>> encountered.
>>>>
>>>> The indirect mount "browse" option which behaves very much like what your
>>>> suggesting is the internal program default, and has been since the initial
>>>> v5 release. Historically it is disabled on install to maintain backward
>>>> compatibility with the original Linux autofs (which is also the reason for
>>>> always triggering an automount on ->lookup()).
>>>>
>>>> Perhaps now is the time for that to change.
>>>
>>> Enabling browse mode does resolve this problem when the map is
>>> enumerable (as you say, wildcards can't be supported).
>>> But browse mode isn't always wanted.  If you have a very large map, then
>>> caching all 10,000 entries in the kernel may be a pointless waste of
>>> time and space.
>>
>> Indeed, that's the main use of nobrowse indirect maps.
>>
>> In fact another recent change where I moved the last_used field so it
>> wouldn't be updated on every walk, to help with mounts never expiring,
>> also needs at different approach.
>>
>> Doing that causes more aggressive expiring of automounts which increases
>> umount to mount churn and leads to increased server load at sites with a
>> very large number of clients.
>>  
>>>
>>>>
>>>>>
>>>>> I suspect this might need changes to autofs to avoid races when two
>>>>> threads call lstat() on the same path at the same time.  We would need
>>>>> to ensure that automount didn't see this as two requests.... though
>>>>> maybe it already has enough locking.
>>>>>
>>>>> Does that seem reasonable?  Should we revert this patch (as a
>>>>> regression) and do something in automount instead?
>>>>
>>>> Can you check out the "browse" option behavior before we talk further
>>>> about this please.
>>>
>>> Done that.
>>>
>>>>
>>>> The problem in user space now is that there is no way to control
>>>> whether an indirect mount that uses the "nobrowse" option is triggered
>>>> or not (using fstatat() or statx()) regardless of the flags used and it's
>>>> essential the AT_NO_AUTOMOUNT flag work as expected to have user space
>>>> take more care in not triggering automounts.
>>>
>>> I think that user-space has all the control that it needs.
>>> There are two cases:
>>> 1/ daemon receives "missing indirect" message and the directory
>>>    doesn't currently exist (mkdir() by the daemon succeeds).
>>>    This could be an AT_NO_AUTOMOUNT lookup, or it could be a full
>>>    open() or similar.  In either case the directory gets created if
>>>    the map suggests that the name should exist.  The daemon needs to
>>>    be careful not to block for long if it goes off-host to check for
>>>    validity of the name.
>>
>> Aren't you assuming the the daemon only receives these lookups
>> on the last path component?
> 
> No.
> follow_managed() calls follow_automount() in a loop until it fails
> (including -EISDIR which isn't exactly failure), or until no automount
> is needed.
> So where-ever the DCACHE_NEED_AUTOMOUNT is in the path, d_automount()
> will be called if the dentry is negative, and unless it is the last
> component of a NO_AUTOMOUNT lookup, it will be called if the dentry
> isn't mounted-on.  So it would now be called twice for indirect
> mounts that aren't browseable - once to mkdir and once to mount.  Might
> there be a problem there?

Yes, your right, and this essentially implements the needed "does this
key exist" query to the daemon .... mmmm.

Although the directory create in the daemon could be avoided with a
callback that specifically asks "does this key exist" and a module
minor version update could be used to decide whether the new functionality
can be used or not, just a thought.

The network source key lookup might need attention, as you pointed out,
but the usual ->d_automount() suffers this potential problem already so
that might be a bonus.

> 
>>
>> Intermediate path components that can trigger an automount must
>> be treated as not having AT_NO_AUTOMOUNT in order for the walk to
>> continue or fail at that point.
>>
>>>
>>> 2/ daemon received "missing indirect" message and the directory
>>>    *does* exist (mkdir() by daemon fails with -EEXIST).
>>>    This cannot be an AT_NO_AUTOMOUNT lookup, it must be a lookup
>>>    intended to trigger automounts.  In this case, we trigger the
>>>    mount.
>>
>> An fstatat(<path>, AT_NO_AUTOMOUNT) on a browse indirect map entry
>> means the directory will exist but user space has asked not to trigger
>> the mount and return the stat info of the autofs dentry.
>>
>> Please don't get me wrong, I think the suggestion is good, I just
>> don't think it's as simple to do as it appears.
> 
> Maybe I should write a more complete patch for you to experiment with.
> 
> Thanks,
> NeilBrown
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] autofs - fix AT_NO_AUTOMOUNT not being honored
  2017-11-23  2:46           ` Ian Kent
  2017-11-23  3:04               ` Ian Kent
@ 2017-11-23  4:49             ` NeilBrown
  2017-11-23  6:34               ` Ian Kent
  1 sibling, 1 reply; 44+ messages in thread
From: NeilBrown @ 2017-11-23  4:49 UTC (permalink / raw)
  To: Ian Kent, Al Viro
  Cc: Colin Walters, Ondrej Holy, autofs mailing list,
	Kernel Mailing List, David Howells, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 5341 bytes --]

On Thu, Nov 23 2017, Ian Kent wrote:

> On 23/11/17 10:21, NeilBrown wrote:
>> On Thu, Nov 23 2017, Ian Kent wrote:
>> 
>>>
>>> Hey Neil, I'm looking at this again because RH QE have complained about
>>> a regression test failing with a kernel that has this change.
>>>
>>> Maybe I'm just dumb but I though a "find <base directory> <options>"
>>> would, well, just look at the contents below <base directory> but an
>>> strace shows that it reads and calls fstatat() on "every entry in the
>>> mount table" regardless of the path.
>> 
>> weird ... I can only get find to look at the mount table if given the
>> -fstyp option, and even then it doesn't fstatat anything that isn't in
>> the tree it is searching.
>
> It's probably the -xautofs (exclude autofs fs'es) that was used in
> the test that requires reading the mount table to get info about
> excluding autofs mounts but the fstatat() on all the entries,
> regardless of path, that was a surprise to me.
>
> find did use AT_SYMLINK_NOFOLLOW which historically behaved like
> AT_NO_AUTOMOUNT.
>
>> 
>> 
>>>
>>> And with the move of userspace to use /proc based mount tables (one
>>> example being the symlink of /etc/mtab into /proc) even modest sized
>>> direct mount maps will be a problem with every entry getting mounted.
>> 
>> But the patch in question is only about indirect mount maps, isn't it?
>> How is it relevant to direct mount maps?
>
> The change here will cause fstatat() to trigger direct mounts on access
> if it doesn't use AT_NO_AUTOMOUNT.

Ahhh... light dawns.
This is about this bit of the patch:

 static inline int vfs_fstatat(int dfd, const char __user *filename,
                              struct kstat *stat, int flags)
 {
-       return vfs_statx(dfd, filename, flags | AT_NO_AUTOMOUNT,
-                        stat, STATX_BASIC_STATS);
+       return vfs_statx(dfd, filename, flags, stat, STATX_BASIC_STATS);
 }

I hadn't paid much attention to that.

I before this patch:
  stat and lstat act as you would expect AT_NO_AUTOMOUNT to act on
      direct mount and browseable indirect mount, but not on unbrowseable
      indirect mounts
  fstatat appeared to accept the AT_NO_AUTOMOUNT flag, but actually
      assumed it was always set, but acted like stat and lstat
  xstatat actually accepted the AT_NO_AUTOMOUNT flag, but it had no
      effect on unbrowseable indirect mounts.

after the patch, the distinction between direct and indirect was gone,
and fstatat now handles AT_NO_AUTOMOUNT the same as xstatat.
So:
  stat and lstat now don't trigger automounts even on indirect, but
    this is a mixed blessing as they don't even trigger the mkdir
  fstatat without AT_NO_AUTOMOUNT now always triggers an automount
    This is a problematic regression that you have noticed and
    likely needs to be reverted.  Maybe we can assume
    AT_NO_AUTOMOUNT when AT_SYMLINK_NOFOLLOW is set, and require
    people to use xstatat if they need to set the flags separately
  xstatat now correctly honours AT_NO_AUTOMOUNT for indirect mounts
    but is otherwise unchanged.

What would you think of changing the above to

 static inline int vfs_fstatat(int dfd, const char __user *filename,
                              struct kstat *stat, int flags)
 {
-       return vfs_statx(dfd, filename, flags | AT_NO_AUTOMOUNT,
-                        stat, STATX_BASIC_STATS);
+       return vfs_statx(dfd, filename,
+                        (flags & AT_SYMLINK_NOFOLLOW) ? (flags |
+                             AT_NO_AUTOMOUNT) : flags,
+                             stat, STATX_BASIC_STATS);
 }

??

Thanks,
NeilBrown

>
> It's not a problem for browse indirect mounts because they are plain
> directories within the autofs mount point not individual autofs mount
> triggers.
>
>> 
>>>
>>> Systems will cope with this fine but larger systems not so much.
>>>
>>> If find does this then the user space changes needed to accommodate
>>> this sort of change are almost certainly far more than I expected.
>>>
>>> I think this is an example of the larger problem I'm faced with and
>>> this change was was meant to be a starting point for resolution.
>>>
>>> The most obvious symptom of the problem is auto-mounts no longer able
>>> to be expired due to being re-mounted immediately after expire. Another
>>> symptom is unwanted (by the user) accesses causing unexpected auto-mount
>>> attempts.
>>>
>>> I believe this monitoring of the mount table is what leads to excessive
>>> CPU consumption I've seen, usually around six processes, under heavy
>>> mount activity. And following this, when the mount table is large and
>>> there is "no mount activity" two of the six processes continue to consume
>>> excessive CPU, until the mount table shrinks.
>>>
>>> So now I'm coming around to the idea of reverting this change ..... and
>>> going back to the drawing board.
>> 
>> I can well imaging that a large mount table could cause problems for
>> applications that are written to expect one, and I can imagine that
>> autofs could cause extra issues for such a program as it might change
>> the mount table more often.  But I haven't yet worked out how this is
>> related to the patch in question....
>> 
>> Thanks,
>> NeilBrown
>> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] autofs - fix AT_NO_AUTOMOUNT not being honored
  2017-11-23  4:49             ` NeilBrown
@ 2017-11-23  6:34               ` Ian Kent
  0 siblings, 0 replies; 44+ messages in thread
From: Ian Kent @ 2017-11-23  6:34 UTC (permalink / raw)
  To: NeilBrown, Al Viro
  Cc: Colin Walters, Ondrej Holy, autofs mailing list,
	Kernel Mailing List, David Howells, linux-fsdevel

On 23/11/17 12:49, NeilBrown wrote:
> On Thu, Nov 23 2017, Ian Kent wrote:
> 
>> On 23/11/17 10:21, NeilBrown wrote:
>>> On Thu, Nov 23 2017, Ian Kent wrote:
>>>
>>>>
>>>> Hey Neil, I'm looking at this again because RH QE have complained about
>>>> a regression test failing with a kernel that has this change.
>>>>
>>>> Maybe I'm just dumb but I though a "find <base directory> <options>"
>>>> would, well, just look at the contents below <base directory> but an
>>>> strace shows that it reads and calls fstatat() on "every entry in the
>>>> mount table" regardless of the path.
>>>
>>> weird ... I can only get find to look at the mount table if given the
>>> -fstyp option, and even then it doesn't fstatat anything that isn't in
>>> the tree it is searching.
>>
>> It's probably the -xautofs (exclude autofs fs'es) that was used in
>> the test that requires reading the mount table to get info about
>> excluding autofs mounts but the fstatat() on all the entries,
>> regardless of path, that was a surprise to me.
>>
>> find did use AT_SYMLINK_NOFOLLOW which historically behaved like
>> AT_NO_AUTOMOUNT.
>>
>>>
>>>
>>>>
>>>> And with the move of userspace to use /proc based mount tables (one
>>>> example being the symlink of /etc/mtab into /proc) even modest sized
>>>> direct mount maps will be a problem with every entry getting mounted.
>>>
>>> But the patch in question is only about indirect mount maps, isn't it?
>>> How is it relevant to direct mount maps?
>>
>> The change here will cause fstatat() to trigger direct mounts on access
>> if it doesn't use AT_NO_AUTOMOUNT.
> 
> Ahhh... light dawns.
> This is about this bit of the patch:
> 
>  static inline int vfs_fstatat(int dfd, const char __user *filename,
>                               struct kstat *stat, int flags)
>  {
> -       return vfs_statx(dfd, filename, flags | AT_NO_AUTOMOUNT,
> -                        stat, STATX_BASIC_STATS);
> +       return vfs_statx(dfd, filename, flags, stat, STATX_BASIC_STATS);
>  }
> 
> I hadn't paid much attention to that.
> 
> I before this patch:
>   stat and lstat act as you would expect AT_NO_AUTOMOUNT to act on
>       direct mount and browseable indirect mount, but not on unbrowseable
>       indirect mounts

Yep, because of the fall thru for negative dentrys at:

 	if (!(nd->flags & (LOOKUP_PARENT | LOOKUP_DIRECTORY |
			   LOOKUP_OPEN | LOOKUP_CREATE | LOOKUP_AUTOMOUNT)) &&
	    path->dentry->d_inode)
		return -EISDIR;

which is missing a LOOKUP_FOLLOW check if we wanted to catch AT_SYMLINK_NOFOLLOW.

>   fstatat appeared to accept the AT_NO_AUTOMOUNT flag, but actually
>       assumed it was always set, but acted like stat and lstat

Yep, always set AT_NO_AUTOMOUNT so that it behaved the same as ....

>   xstatat actually accepted the AT_NO_AUTOMOUNT flag, but it had no
>       effect on unbrowseable indirect mounts.

Yep.

> 
> after the patch, the distinction between direct and indirect was gone,
> and fstatat now handles AT_NO_AUTOMOUNT the same as xstatat.
> So:
>   stat and lstat now don't trigger automounts even on indirect, but
>     this is a mixed blessing as they don't even trigger the mkdir

Yep.

>   fstatat without AT_NO_AUTOMOUNT now always triggers an automount
>     This is a problematic regression that you have noticed and
>     likely needs to be reverted.  Maybe we can assume
>     AT_NO_AUTOMOUNT when AT_SYMLINK_NOFOLLOW is set, and require
>     people to use xstatat if they need to set the flags separately

Yep.

The introduction of AT_NO_AUTOMOUNT (and the introduction of
follow_managed() and friends) was meant to do away with the
misleading use the AT_SYMLINK_NOFOLLOW (at the time the automount
mechanism abused the ->follow_link() method because it had similar
semantics to symlinks).

To catch the older usage pattern re-adding an AT_SYMLINK_NOFOLLOW
check would be helpful.

The reality is there are still the same old problems of unintended
mounting (mount storms) and AT_NO_AUTOMOUNT not being properly
handled.

Certainly the implementation we have now is much better but these
niggling problems remain and user space steps on them way too often,
and it feels like its much more so lately.  

>   xstatat now correctly honours AT_NO_AUTOMOUNT for indirect mounts
>     but is otherwise unchanged.

Yep, assuming we accept the ENOENT return as sensible for AT_NO_AUTOMOUNT
no browse indirect mount case. statx() being a new system call it would
be ideal to get the semantics of this call right now before it becomes well
used.

> 
> What would you think of changing the above to
> 
>  static inline int vfs_fstatat(int dfd, const char __user *filename,
>                               struct kstat *stat, int flags)
>  {
> -       return vfs_statx(dfd, filename, flags | AT_NO_AUTOMOUNT,
> -                        stat, STATX_BASIC_STATS);
> +       return vfs_statx(dfd, filename,
> +                        (flags & AT_SYMLINK_NOFOLLOW) ? (flags |
> +                             AT_NO_AUTOMOUNT) : flags,
> +                             stat, STATX_BASIC_STATS);
>  }

Yes, I think so, and we'd need to add LOOKUP_FOLLOW to the check in
follow_automount().

Not sure I like that nested ternary in this case though.

> 
> ??
> 
> Thanks,
> NeilBrown
> 
>>
>> It's not a problem for browse indirect mounts because they are plain
>> directories within the autofs mount point not individual autofs mount
>> triggers.
>>
>>>
>>>>
>>>> Systems will cope with this fine but larger systems not so much.
>>>>
>>>> If find does this then the user space changes needed to accommodate
>>>> this sort of change are almost certainly far more than I expected.
>>>>
>>>> I think this is an example of the larger problem I'm faced with and
>>>> this change was was meant to be a starting point for resolution.
>>>>
>>>> The most obvious symptom of the problem is auto-mounts no longer able
>>>> to be expired due to being re-mounted immediately after expire. Another
>>>> symptom is unwanted (by the user) accesses causing unexpected auto-mount
>>>> attempts.
>>>>
>>>> I believe this monitoring of the mount table is what leads to excessive
>>>> CPU consumption I've seen, usually around six processes, under heavy
>>>> mount activity. And following this, when the mount table is large and
>>>> there is "no mount activity" two of the six processes continue to consume
>>>> excessive CPU, until the mount table shrinks.
>>>>
>>>> So now I'm coming around to the idea of reverting this change ..... and
>>>> going back to the drawing board.
>>>
>>> I can well imaging that a large mount table could cause problems for
>>> applications that are written to expect one, and I can imagine that
>>> autofs could cause extra issues for such a program as it might change
>>> the mount table more often.  But I haven't yet worked out how this is
>>> related to the patch in question....
>>>
>>> Thanks,
>>> NeilBrown
>>>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] autofs - fix AT_NO_AUTOMOUNT not being honored
  2017-11-23  0:36       ` Ian Kent
@ 2017-11-27 16:01           ` Mike Marion
  2017-11-27 16:01           ` Mike Marion
  1 sibling, 0 replies; 44+ messages in thread
From: Mike Marion @ 2017-11-27 16:01 UTC (permalink / raw)
  To: Ian Kent
  Cc: NeilBrown, Al Viro, Colin Walters, Ondrej Holy,
	autofs mailing list, Kernel Mailing List, David Howells,
	linux-fsdevel

On Thu, Nov 23, 2017 at 08:36:49AM +0800, Ian Kent wrote:

> And with the move of userspace to use /proc based mount tables (one
> example being the symlink of /etc/mtab into /proc) even modest sized
> direct mount maps will be a problem with every entry getting mounted.
> 
> Systems will cope with this fine but larger systems not so much.

Yes.. we've run into some big issues due to the change of /etc/mtab from
a file to a symlink to /proc/self/mounts.  Most have been worked around
thus far (mostly due to Suse coming up with patches) but still have a
few annoying ones.

-- 
Mike Marion-Unix SysAdmin/Sr. Staff IT Engineer-http://www.qualcomm.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] autofs - fix AT_NO_AUTOMOUNT not being honored
@ 2017-11-27 16:01           ` Mike Marion
  0 siblings, 0 replies; 44+ messages in thread
From: Mike Marion @ 2017-11-27 16:01 UTC (permalink / raw)
  To: Ian Kent
  Cc: NeilBrown, Al Viro, Colin Walters, Ondrej Holy,
	autofs mailing list, Kernel Mailing List, David Howells,
	linux-fsdevel

On Thu, Nov 23, 2017 at 08:36:49AM +0800, Ian Kent wrote:

> And with the move of userspace to use /proc based mount tables (one
> example being the symlink of /etc/mtab into /proc) even modest sized
> direct mount maps will be a problem with every entry getting mounted.
> 
> Systems will cope with this fine but larger systems not so much.

Yes.. we've run into some big issues due to the change of /etc/mtab from
a file to a symlink to /proc/self/mounts.  Most have been worked around
thus far (mostly due to Suse coming up with patches) but still have a
few annoying ones.

-- 
Mike Marion-Unix SysAdmin/Sr. Staff IT Engineer-http://www.qualcomm.com--
To unsubscribe from this list: send the line "unsubscribe autofs" in

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] autofs - fix AT_NO_AUTOMOUNT not being honored
  2017-11-27 16:01           ` Mike Marion
  (?)
@ 2017-11-27 23:43           ` Ian Kent
  2017-11-28  0:29               ` Mike Marion
  -1 siblings, 1 reply; 44+ messages in thread
From: Ian Kent @ 2017-11-27 23:43 UTC (permalink / raw)
  To: Mike Marion
  Cc: NeilBrown, Al Viro, Colin Walters, Ondrej Holy,
	autofs mailing list, Kernel Mailing List, David Howells,
	linux-fsdevel

On 28/11/17 00:01, Mike Marion wrote:
> On Thu, Nov 23, 2017 at 08:36:49AM +0800, Ian Kent wrote:
> 
>> And with the move of userspace to use /proc based mount tables (one
>> example being the symlink of /etc/mtab into /proc) even modest sized
>> direct mount maps will be a problem with every entry getting mounted.
>>
>> Systems will cope with this fine but larger systems not so much.
> 
> Yes.. we've run into some big issues due to the change of /etc/mtab from
> a file to a symlink to /proc/self/mounts.  Most have been worked around
> thus far (mostly due to Suse coming up with patches) but still have a
> few annoying ones.
> 

I think the situation is going to get worse before it gets better.

On recent Fedora and kernel, with a large map and heavy mount activity
I see:

systemd, udisksd, gvfs-udisks2-volume-monitor, gvfsd-trash,
gnome-settings-daemon, packagekitd and gnome-shell

all go crazy consuming large amounts of CPU.

Once the mount activity is completed I see two processes continue to
consume a large amount of CPU. I thought one of those two was systemd
but my notes say they were gvfs-udisks2-volume-monitor and
gnome-settings-daemon.

The symlink change was probably the start, now a number of applications
now got directly to the proc file system for this information.

For large mount tables and many processes accessing the mount table
(probably reading the whole thing, either periodically or on change
notification) the current system does not scale well at all.

Ian

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] autofs - fix AT_NO_AUTOMOUNT not being honored
  2017-11-27 23:43           ` Ian Kent
@ 2017-11-28  0:29               ` Mike Marion
  0 siblings, 0 replies; 44+ messages in thread
From: Mike Marion @ 2017-11-28  0:29 UTC (permalink / raw)
  To: Ian Kent; +Cc: autofs mailing list, Kernel Mailing List, linux-fsdevel

On Tue, Nov 28, 2017 at 07:43:05AM +0800, Ian Kent wrote:

> I think the situation is going to get worse before it gets better.
> 
> On recent Fedora and kernel, with a large map and heavy mount activity
> I see:
> 
> systemd, udisksd, gvfs-udisks2-volume-monitor, gvfsd-trash,
> gnome-settings-daemon, packagekitd and gnome-shell
> 
> all go crazy consuming large amounts of CPU.

Yep.  I'm not even worried about the CPU usage as much (yet, I'm sure 
it'll be more of a problem as time goes on).  We have pretty huge
direct maps and our initial startup tests on a new host with the link vs
file took >6 hours.  That's not a typo.  We worked with Suse engineering 
to come up with a fix, which should've been pushed here some time ago.

Then, there's shutdowns (and reboots). They also took a long time (on
the order of 20+min) because it would walk the entire /proc/mounts
"unmounting" things.  Also fixed now.  That one had something to do in
SMP code as if you used a single CPU/core, it didn't take long at all.

Just got a fix for the suse grub2-mkconfig script to fix their parsing 
looking for the root dev to skip over fstype autofs
(probe_nfsroot_device function).

> The symlink change was probably the start, now a number of applications
> now got directly to the proc file system for this information.
> 
> For large mount tables and many processes accessing the mount table
> (probably reading the whole thing, either periodically or on change
> notification) the current system does not scale well at all.

We use Clearcase in some instances as well, and that's yet another thing
adding mounts, and its startup is very slow, due to the size of
/proc/mounts.  

It's definitely something that's more than just autofs and probably
going to get worse, as you say.

-- 
Mike Marion-Unix SysAdmin/Sr. Staff IT Engineer-http://www.qualcomm.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] autofs - fix AT_NO_AUTOMOUNT not being honored
@ 2017-11-28  0:29               ` Mike Marion
  0 siblings, 0 replies; 44+ messages in thread
From: Mike Marion @ 2017-11-28  0:29 UTC (permalink / raw)
  To: Ian Kent; +Cc: autofs mailing list, Kernel Mailing List, linux-fsdevel

On Tue, Nov 28, 2017 at 07:43:05AM +0800, Ian Kent wrote:

> I think the situation is going to get worse before it gets better.
> 
> On recent Fedora and kernel, with a large map and heavy mount activity
> I see:
> 
> systemd, udisksd, gvfs-udisks2-volume-monitor, gvfsd-trash,
> gnome-settings-daemon, packagekitd and gnome-shell
> 
> all go crazy consuming large amounts of CPU.

Yep.  I'm not even worried about the CPU usage as much (yet, I'm sure 
it'll be more of a problem as time goes on).  We have pretty huge
direct maps and our initial startup tests on a new host with the link vs
file took >6 hours.  That's not a typo.  We worked with Suse engineering 
to come up with a fix, which should've been pushed here some time ago.

Then, there's shutdowns (and reboots). They also took a long time (on
the order of 20+min) because it would walk the entire /proc/mounts
"unmounting" things.  Also fixed now.  That one had something to do in
SMP code as if you used a single CPU/core, it didn't take long at all.

Just got a fix for the suse grub2-mkconfig script to fix their parsing 
looking for the root dev to skip over fstype autofs
(probe_nfsroot_device function).

> The symlink change was probably the start, now a number of applications
> now got directly to the proc file system for this information.
> 
> For large mount tables and many processes accessing the mount table
> (probably reading the whole thing, either periodically or on change
> notification) the current system does not scale well at all.

We use Clearcase in some instances as well, and that's yet another thing
adding mounts, and its startup is very slow, due to the size of
/proc/mounts.  

It's definitely something that's more than just autofs and probably
going to get worse, as you say.

-- 
Mike Marion-Unix SysAdmin/Sr. Staff IT Engineer-http://www.qualcomm.com--
To unsubscribe from this list: send the line "unsubscribe autofs" in

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] autofs - fix AT_NO_AUTOMOUNT not being honored
  2017-11-28  0:29               ` Mike Marion
  (?)
@ 2017-11-29  1:17               ` NeilBrown
  2017-11-29  2:13                   ` Mike Marion
  2017-11-29  2:56                   ` Ian Kent
  -1 siblings, 2 replies; 44+ messages in thread
From: NeilBrown @ 2017-11-29  1:17 UTC (permalink / raw)
  To: Mike Marion, Ian Kent
  Cc: autofs mailing list, Kernel Mailing List, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 3412 bytes --]

On Tue, Nov 28 2017, Mike Marion wrote:

> On Tue, Nov 28, 2017 at 07:43:05AM +0800, Ian Kent wrote:
>
>> I think the situation is going to get worse before it gets better.
>> 
>> On recent Fedora and kernel, with a large map and heavy mount activity
>> I see:
>> 
>> systemd, udisksd, gvfs-udisks2-volume-monitor, gvfsd-trash,
>> gnome-settings-daemon, packagekitd and gnome-shell
>> 
>> all go crazy consuming large amounts of CPU.
>
> Yep.  I'm not even worried about the CPU usage as much (yet, I'm sure 
> it'll be more of a problem as time goes on).  We have pretty huge
> direct maps and our initial startup tests on a new host with the link vs
> file took >6 hours.  That's not a typo.  We worked with Suse engineering 
> to come up with a fix, which should've been pushed here some time ago.
>
> Then, there's shutdowns (and reboots). They also took a long time (on
> the order of 20+min) because it would walk the entire /proc/mounts
> "unmounting" things.  Also fixed now.  That one had something to do in
> SMP code as if you used a single CPU/core, it didn't take long at all.
>
> Just got a fix for the suse grub2-mkconfig script to fix their parsing 
> looking for the root dev to skip over fstype autofs
> (probe_nfsroot_device function).
>
>> The symlink change was probably the start, now a number of applications
>> now got directly to the proc file system for this information.
>> 
>> For large mount tables and many processes accessing the mount table
>> (probably reading the whole thing, either periodically or on change
>> notification) the current system does not scale well at all.
>
> We use Clearcase in some instances as well, and that's yet another thing
> adding mounts, and its startup is very slow, due to the size of
> /proc/mounts.  
>
> It's definitely something that's more than just autofs and probably
> going to get worse, as you say.

If we assume that applications are going to want to read
/proc/self/mount* a log, we probably need to make it faster.
I performed a simple experiment where I mounted 1000 tmpfs filesystems,
copied /proc/self/mountinfo to /tmp/mountinfo, then
ran 4 for loops in parallel catting one of these files to /dev/null 1000 times.
On a single CPU VM:
  For /tmp/mountinfo, each group of 1000 cats took about 3 seconds.
  For /proc/self/mountinfo, each group of 1000 cats took about 14 seconds.
On a 4 CPU VM
  /tmp/mountinfo: 1.5secs
  /proc/self/mountinfo: 3.5 secs

Using "perf record" it appears that most of the cost is repeated calls
to prepend_path, with a small contribution from the fact that each read
only returns 4K rather than the 128K that cat asks for.

If we could hang a cache off struct mnt_namespace and use it instead of
iterating the mount table - using rcu and ns->event to ensure currency -
we should be able to minimize the cost of this increased use of
/proc/self/mount*.

I suspect that the best approach would be implement a cache at the
seq_file level.

One possible problem might be if applications assume that a read will
always return a whole number of lines (it currently does).  To be
sure we remain safe, we would only be able to use the cache for
a read() syscall which reads the whole file.
How big do people see /proc/self/mount* getting?  What size reads
does 'strace' show the various programs using to read it?

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] autofs - fix AT_NO_AUTOMOUNT not being honored
  2017-11-29  1:17               ` NeilBrown
@ 2017-11-29  2:13                   ` Mike Marion
  2017-11-29  2:56                   ` Ian Kent
  1 sibling, 0 replies; 44+ messages in thread
From: Mike Marion @ 2017-11-29  2:13 UTC (permalink / raw)
  To: NeilBrown
  Cc: Ian Kent, autofs mailing list, Kernel Mailing List, linux-fsdevel

On Wed, Nov 29, 2017 at 12:17:27PM +1100, NeilBrown wrote:

> How big do people see /proc/self/mount* getting?  What size reads
> does 'strace' show the various programs using to read it?

We already have line counts into 5 figures.  This wasn't an issue until 
the change of /etc/mtab to a link.  The large count is due to our large
direct automount maps.

-- 
Mike Marion-Unix SysAdmin/Sr. Staff IT Engineer-http://www.qualcomm.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] autofs - fix AT_NO_AUTOMOUNT not being honored
@ 2017-11-29  2:13                   ` Mike Marion
  0 siblings, 0 replies; 44+ messages in thread
From: Mike Marion @ 2017-11-29  2:13 UTC (permalink / raw)
  To: NeilBrown
  Cc: Ian Kent, autofs mailing list, Kernel Mailing List, linux-fsdevel

On Wed, Nov 29, 2017 at 12:17:27PM +1100, NeilBrown wrote:

> How big do people see /proc/self/mount* getting?  What size reads
> does 'strace' show the various programs using to read it?

We already have line counts into 5 figures.  This wasn't an issue until 
the change of /etc/mtab to a link.  The large count is due to our large
direct automount maps.

-- 
Mike Marion-Unix SysAdmin/Sr. Staff IT Engineer-http://www.qualcomm.com--
To unsubscribe from this list: send the line "unsubscribe autofs" in

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] autofs - fix AT_NO_AUTOMOUNT not being honored
  2017-11-29  2:13                   ` Mike Marion
  (?)
@ 2017-11-29  2:28                   ` Ian Kent
  2017-11-29  2:48                     ` NeilBrown
  -1 siblings, 1 reply; 44+ messages in thread
From: Ian Kent @ 2017-11-29  2:28 UTC (permalink / raw)
  To: Mike Marion, NeilBrown
  Cc: autofs mailing list, Kernel Mailing List, linux-fsdevel

On 29/11/17 10:13, Mike Marion wrote:
> On Wed, Nov 29, 2017 at 12:17:27PM +1100, NeilBrown wrote:
> 
>> How big do people see /proc/self/mount* getting?  What size reads
>> does 'strace' show the various programs using to read it?
> 
> We already have line counts into 5 figures.  This wasn't an issue until 
> the change of /etc/mtab to a link.  The large count is due to our large
> direct automount maps.
> 

And, admittedly, the testing I was doing was with 15k+ size maps.

Of course it's necessary to have this number of mounts to see serious
problems which is easiest to do with large direct mount maps.

The thing that's different now is that before applications started
using /proc directly for mount table information using mount(2)
instead of mount(8) was enough to prevent the mount entries from
being added to the table seen by applications.

Ian

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] autofs - fix AT_NO_AUTOMOUNT not being honored
  2017-11-29  2:28                   ` Ian Kent
@ 2017-11-29  2:48                     ` NeilBrown
  2017-11-29  3:14                         ` Ian Kent
  0 siblings, 1 reply; 44+ messages in thread
From: NeilBrown @ 2017-11-29  2:48 UTC (permalink / raw)
  To: Ian Kent, Mike Marion
  Cc: autofs mailing list, Kernel Mailing List, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 1726 bytes --]

On Wed, Nov 29 2017, Ian Kent wrote:

> On 29/11/17 10:13, Mike Marion wrote:
>> On Wed, Nov 29, 2017 at 12:17:27PM +1100, NeilBrown wrote:
>> 
>>> How big do people see /proc/self/mount* getting?  What size reads
>>> does 'strace' show the various programs using to read it?
>> 
>> We already have line counts into 5 figures.  This wasn't an issue until 
>> the change of /etc/mtab to a link.  The large count is due to our large
>> direct automount maps.
>> 

So .... 90,000 lines with a length of may 120 chars or about 10Meg.
Presumably these machines would have so many gigabytes of RAM that
caching a 10M mountinfo file would go unnoticed?

Reading that in 128K chunks without generating bits on the fly will help
a lot I suspect.

We could probably ensure proper alignment by searching backwards for
'\n' when deciding how much to return for a read.

>
> And, admittedly, the testing I was doing was with 15k+ size maps.
>
> Of course it's necessary to have this number of mounts to see serious
> problems which is easiest to do with large direct mount maps.
>
> The thing that's different now is that before applications started
> using /proc directly for mount table information using mount(2)
> instead of mount(8) was enough to prevent the mount entries from
> being added to the table seen by applications.

I wonder who would notice if untriggered direct mounts quietly disappeared from
/proc/mounts...  I suspect systemd would, but there is a good chance it
would fail-safe: assume that the mount worked.
Alternately we could introduce /proc/self/mountinfo2 which doesn't list
direct automounts and encourage problematic programs to use that where
available.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] autofs - fix AT_NO_AUTOMOUNT not being honored
  2017-11-29  1:17               ` NeilBrown
@ 2017-11-29  2:56                   ` Ian Kent
  2017-11-29  2:56                   ` Ian Kent
  1 sibling, 0 replies; 44+ messages in thread
From: Ian Kent @ 2017-11-29  2:56 UTC (permalink / raw)
  To: NeilBrown, Mike Marion
  Cc: autofs mailing list, Kernel Mailing List, linux-fsdevel, Al Viro


Adding Al Viro to the Cc list as I believe Stephen Whitehouse and
Al have discussed something similar, please feel free to chime in
with your thoughts Al.

On 29/11/17 09:17, NeilBrown wrote:
> On Tue, Nov 28 2017, Mike Marion wrote:
> 
>> On Tue, Nov 28, 2017 at 07:43:05AM +0800, Ian Kent wrote:
>>
>>> I think the situation is going to get worse before it gets better.
>>>
>>> On recent Fedora and kernel, with a large map and heavy mount activity
>>> I see:
>>>
>>> systemd, udisksd, gvfs-udisks2-volume-monitor, gvfsd-trash,
>>> gnome-settings-daemon, packagekitd and gnome-shell
>>>
>>> all go crazy consuming large amounts of CPU.
>>
>> Yep.  I'm not even worried about the CPU usage as much (yet, I'm sure 
>> it'll be more of a problem as time goes on).  We have pretty huge
>> direct maps and our initial startup tests on a new host with the link vs
>> file took >6 hours.  That's not a typo.  We worked with Suse engineering 
>> to come up with a fix, which should've been pushed here some time ago.
>>
>> Then, there's shutdowns (and reboots). They also took a long time (on
>> the order of 20+min) because it would walk the entire /proc/mounts
>> "unmounting" things.  Also fixed now.  That one had something to do in
>> SMP code as if you used a single CPU/core, it didn't take long at all.
>>
>> Just got a fix for the suse grub2-mkconfig script to fix their parsing 
>> looking for the root dev to skip over fstype autofs
>> (probe_nfsroot_device function).
>>
>>> The symlink change was probably the start, now a number of applications
>>> now got directly to the proc file system for this information.
>>>
>>> For large mount tables and many processes accessing the mount table
>>> (probably reading the whole thing, either periodically or on change
>>> notification) the current system does not scale well at all.
>>
>> We use Clearcase in some instances as well, and that's yet another thing
>> adding mounts, and its startup is very slow, due to the size of
>> /proc/mounts.  
>>
>> It's definitely something that's more than just autofs and probably
>> going to get worse, as you say.
> 
> If we assume that applications are going to want to read
> /proc/self/mount* a log, we probably need to make it faster.
> I performed a simple experiment where I mounted 1000 tmpfs filesystems,
> copied /proc/self/mountinfo to /tmp/mountinfo, then
> ran 4 for loops in parallel catting one of these files to /dev/null 1000 times.
> On a single CPU VM:
>   For /tmp/mountinfo, each group of 1000 cats took about 3 seconds.
>   For /proc/self/mountinfo, each group of 1000 cats took about 14 seconds.
> On a 4 CPU VM
>   /tmp/mountinfo: 1.5secs
>   /proc/self/mountinfo: 3.5 secs
> 
> Using "perf record" it appears that most of the cost is repeated calls
> to prepend_path, with a small contribution from the fact that each read
> only returns 4K rather than the 128K that cat asks for.
> 
> If we could hang a cache off struct mnt_namespace and use it instead of
> iterating the mount table - using rcu and ns->event to ensure currency -
> we should be able to minimize the cost of this increased use of
> /proc/self/mount*.
> 
> I suspect that the best approach would be implement a cache at the
> seq_file level.
> 
> One possible problem might be if applications assume that a read will
> always return a whole number of lines (it currently does).  To be
> sure we remain safe, we would only be able to use the cache for
> a read() syscall which reads the whole file.
> How big do people see /proc/self/mount* getting?  What size reads
> does 'strace' show the various programs using to read it?

Buffer size almost always has a significant impact on IO so that's
likely a big factor but the other aspect of this is notification
of changes.

The risk is improving the IO efficiency might just allow a higher
rate of processing of change notifications and similar symptoms
to what we have now.

The suggestion is that a system that allows for incremental (diff
type) update notification is needed to allow mount table propagation
to scale well.

That implies some as yet undefined user <-> kernel communication
protocol.

Ian

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] autofs - fix AT_NO_AUTOMOUNT not being honored
@ 2017-11-29  2:56                   ` Ian Kent
  0 siblings, 0 replies; 44+ messages in thread
From: Ian Kent @ 2017-11-29  2:56 UTC (permalink / raw)
  To: NeilBrown, Mike Marion
  Cc: autofs mailing list, Kernel Mailing List, linux-fsdevel, Al Viro


Adding Al Viro to the Cc list as I believe Stephen Whitehouse and
Al have discussed something similar, please feel free to chime in
with your thoughts Al.

On 29/11/17 09:17, NeilBrown wrote:
> On Tue, Nov 28 2017, Mike Marion wrote:
> 
>> On Tue, Nov 28, 2017 at 07:43:05AM +0800, Ian Kent wrote:
>>
>>> I think the situation is going to get worse before it gets better.
>>>
>>> On recent Fedora and kernel, with a large map and heavy mount activity
>>> I see:
>>>
>>> systemd, udisksd, gvfs-udisks2-volume-monitor, gvfsd-trash,
>>> gnome-settings-daemon, packagekitd and gnome-shell
>>>
>>> all go crazy consuming large amounts of CPU.
>>
>> Yep.  I'm not even worried about the CPU usage as much (yet, I'm sure 
>> it'll be more of a problem as time goes on).  We have pretty huge
>> direct maps and our initial startup tests on a new host with the link vs
>> file took >6 hours.  That's not a typo.  We worked with Suse engineering 
>> to come up with a fix, which should've been pushed here some time ago.
>>
>> Then, there's shutdowns (and reboots). They also took a long time (on
>> the order of 20+min) because it would walk the entire /proc/mounts
>> "unmounting" things.  Also fixed now.  That one had something to do in
>> SMP code as if you used a single CPU/core, it didn't take long at all.
>>
>> Just got a fix for the suse grub2-mkconfig script to fix their parsing 
>> looking for the root dev to skip over fstype autofs
>> (probe_nfsroot_device function).
>>
>>> The symlink change was probably the start, now a number of applications
>>> now got directly to the proc file system for this information.
>>>
>>> For large mount tables and many processes accessing the mount table
>>> (probably reading the whole thing, either periodically or on change
>>> notification) the current system does not scale well at all.
>>
>> We use Clearcase in some instances as well, and that's yet another thing
>> adding mounts, and its startup is very slow, due to the size of
>> /proc/mounts.  
>>
>> It's definitely something that's more than just autofs and probably
>> going to get worse, as you say.
> 
> If we assume that applications are going to want to read
> /proc/self/mount* a log, we probably need to make it faster.
> I performed a simple experiment where I mounted 1000 tmpfs filesystems,
> copied /proc/self/mountinfo to /tmp/mountinfo, then
> ran 4 for loops in parallel catting one of these files to /dev/null 1000 times.
> On a single CPU VM:
>   For /tmp/mountinfo, each group of 1000 cats took about 3 seconds.
>   For /proc/self/mountinfo, each group of 1000 cats took about 14 seconds.
> On a 4 CPU VM
>   /tmp/mountinfo: 1.5secs
>   /proc/self/mountinfo: 3.5 secs
> 
> Using "perf record" it appears that most of the cost is repeated calls
> to prepend_path, with a small contribution from the fact that each read
> only returns 4K rather than the 128K that cat asks for.
> 
> If we could hang a cache off struct mnt_namespace and use it instead of
> iterating the mount table - using rcu and ns->event to ensure currency -
> we should be able to minimize the cost of this increased use of
> /proc/self/mount*.
> 
> I suspect that the best approach would be implement a cache at the
> seq_file level.
> 
> One possible problem might be if applications assume that a read will
> always return a whole number of lines (it currently does).  To be
> sure we remain safe, we would only be able to use the cache for
> a read() syscall which reads the whole file.
> How big do people see /proc/self/mount* getting?  What size reads
> does 'strace' show the various programs using to read it?

Buffer size almost always has a significant impact on IO so that's
likely a big factor but the other aspect of this is notification
of changes.

The risk is improving the IO efficiency might just allow a higher
rate of processing of change notifications and similar symptoms
to what we have now.

The suggestion is that a system that allows for incremental (diff
type) update notification is needed to allow mount table propagation
to scale well.

That implies some as yet undefined user <-> kernel communication
protocol.

Ian
--
To unsubscribe from this list: send the line "unsubscribe autofs" in

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] autofs - fix AT_NO_AUTOMOUNT not being honored
  2017-11-29  2:48                     ` NeilBrown
@ 2017-11-29  3:14                         ` Ian Kent
  0 siblings, 0 replies; 44+ messages in thread
From: Ian Kent @ 2017-11-29  3:14 UTC (permalink / raw)
  To: NeilBrown, Mike Marion
  Cc: autofs mailing list, Kernel Mailing List, linux-fsdevel

On 29/11/17 10:48, NeilBrown wrote:
> On Wed, Nov 29 2017, Ian Kent wrote:
> 
>> On 29/11/17 10:13, Mike Marion wrote:
>>> On Wed, Nov 29, 2017 at 12:17:27PM +1100, NeilBrown wrote:
>>>
>>>> How big do people see /proc/self/mount* getting?  What size reads
>>>> does 'strace' show the various programs using to read it?
>>>
>>> We already have line counts into 5 figures.  This wasn't an issue until 
>>> the change of /etc/mtab to a link.  The large count is due to our large
>>> direct automount maps.
>>>
> 
> So .... 90,000 lines with a length of may 120 chars or about 10Meg.
> Presumably these machines would have so many gigabytes of RAM that
> caching a 10M mountinfo file would go unnoticed?
> 
> Reading that in 128K chunks without generating bits on the fly will help
> a lot I suspect.
> 
> We could probably ensure proper alignment by searching backwards for
> '\n' when deciding how much to return for a read.
> 
>>
>> And, admittedly, the testing I was doing was with 15k+ size maps.
>>
>> Of course it's necessary to have this number of mounts to see serious
>> problems which is easiest to do with large direct mount maps.
>>
>> The thing that's different now is that before applications started
>> using /proc directly for mount table information using mount(2)
>> instead of mount(8) was enough to prevent the mount entries from
>> being added to the table seen by applications.
> 
> I wonder who would notice if untriggered direct mounts quietly disappeared from
> /proc/mounts...  I suspect systemd would, but there is a good chance it
> would fail-safe: assume that the mount worked.
> Alternately we could introduce /proc/self/mountinfo2 which doesn't list
> direct automounts and encourage problematic programs to use that where
> available.

Personally I think the proc filesystem tables should be a full representation
of the mounted mounts. That needs to available somewhere.

autofs still uses the proc mounts in one situation (there might be more I've
missed in eliminating this) and other applications probably do too (I have
patches to eliminate this one but they don't fit well with the current
implementation).

One approach I tried was to write a simple file system to mount over
/etc/mtab (et. al.) and filter entries returned based on simple filter
rules.

That file system is not complete by any means and I couldn't get a feel
for how effective it would be due to the direct access to the proc mount
tables done by applications.

OTOH, if applications could be persuaded to use /etc/(mtab|mounts|mountinfo)
and only consult the proc tables when absolutely necessary this pseudo file
system would be an ideal place for an entry cache ....

Ian

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] autofs - fix AT_NO_AUTOMOUNT not being honored
@ 2017-11-29  3:14                         ` Ian Kent
  0 siblings, 0 replies; 44+ messages in thread
From: Ian Kent @ 2017-11-29  3:14 UTC (permalink / raw)
  To: NeilBrown, Mike Marion
  Cc: autofs mailing list, Kernel Mailing List, linux-fsdevel

On 29/11/17 10:48, NeilBrown wrote:
> On Wed, Nov 29 2017, Ian Kent wrote:
> 
>> On 29/11/17 10:13, Mike Marion wrote:
>>> On Wed, Nov 29, 2017 at 12:17:27PM +1100, NeilBrown wrote:
>>>
>>>> How big do people see /proc/self/mount* getting?  What size reads
>>>> does 'strace' show the various programs using to read it?
>>>
>>> We already have line counts into 5 figures.  This wasn't an issue until 
>>> the change of /etc/mtab to a link.  The large count is due to our large
>>> direct automount maps.
>>>
> 
> So .... 90,000 lines with a length of may 120 chars or about 10Meg.
> Presumably these machines would have so many gigabytes of RAM that
> caching a 10M mountinfo file would go unnoticed?
> 
> Reading that in 128K chunks without generating bits on the fly will help
> a lot I suspect.
> 
> We could probably ensure proper alignment by searching backwards for
> '\n' when deciding how much to return for a read.
> 
>>
>> And, admittedly, the testing I was doing was with 15k+ size maps.
>>
>> Of course it's necessary to have this number of mounts to see serious
>> problems which is easiest to do with large direct mount maps.
>>
>> The thing that's different now is that before applications started
>> using /proc directly for mount table information using mount(2)
>> instead of mount(8) was enough to prevent the mount entries from
>> being added to the table seen by applications.
> 
> I wonder who would notice if untriggered direct mounts quietly disappeared from
> /proc/mounts...  I suspect systemd would, but there is a good chance it
> would fail-safe: assume that the mount worked.
> Alternately we could introduce /proc/self/mountinfo2 which doesn't list
> direct automounts and encourage problematic programs to use that where
> available.

Personally I think the proc filesystem tables should be a full representation
of the mounted mounts. That needs to available somewhere.

autofs still uses the proc mounts in one situation (there might be more I've
missed in eliminating this) and other applications probably do too (I have
patches to eliminate this one but they don't fit well with the current
implementation).

One approach I tried was to write a simple file system to mount over
/etc/mtab (et. al.) and filter entries returned based on simple filter
rules.

That file system is not complete by any means and I couldn't get a feel
for how effective it would be due to the direct access to the proc mount
tables done by applications.

OTOH, if applications could be persuaded to use /etc/(mtab|mounts|mountinfo)
and only consult the proc tables when absolutely necessary this pseudo file
system would be an ideal place for an entry cache ....

Ian
--
To unsubscribe from this list: send the line "unsubscribe autofs" in

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] autofs - fix AT_NO_AUTOMOUNT not being honored
  2017-11-29  2:56                   ` Ian Kent
  (?)
@ 2017-11-29  3:45                   ` NeilBrown
  2017-11-29  6:00                       ` Ian Kent
  -1 siblings, 1 reply; 44+ messages in thread
From: NeilBrown @ 2017-11-29  3:45 UTC (permalink / raw)
  To: Ian Kent, Mike Marion
  Cc: autofs mailing list, Kernel Mailing List, linux-fsdevel, Al Viro

[-- Attachment #1: Type: text/plain, Size: 5639 bytes --]

On Wed, Nov 29 2017, Ian Kent wrote:

> Adding Al Viro to the Cc list as I believe Stephen Whitehouse and
> Al have discussed something similar, please feel free to chime in
> with your thoughts Al.
>
> On 29/11/17 09:17, NeilBrown wrote:
>> On Tue, Nov 28 2017, Mike Marion wrote:
>> 
>>> On Tue, Nov 28, 2017 at 07:43:05AM +0800, Ian Kent wrote:
>>>
>>>> I think the situation is going to get worse before it gets better.
>>>>
>>>> On recent Fedora and kernel, with a large map and heavy mount activity
>>>> I see:
>>>>
>>>> systemd, udisksd, gvfs-udisks2-volume-monitor, gvfsd-trash,
>>>> gnome-settings-daemon, packagekitd and gnome-shell
>>>>
>>>> all go crazy consuming large amounts of CPU.
>>>
>>> Yep.  I'm not even worried about the CPU usage as much (yet, I'm sure 
>>> it'll be more of a problem as time goes on).  We have pretty huge
>>> direct maps and our initial startup tests on a new host with the link vs
>>> file took >6 hours.  That's not a typo.  We worked with Suse engineering 
>>> to come up with a fix, which should've been pushed here some time ago.
>>>
>>> Then, there's shutdowns (and reboots). They also took a long time (on
>>> the order of 20+min) because it would walk the entire /proc/mounts
>>> "unmounting" things.  Also fixed now.  That one had something to do in
>>> SMP code as if you used a single CPU/core, it didn't take long at all.
>>>
>>> Just got a fix for the suse grub2-mkconfig script to fix their parsing 
>>> looking for the root dev to skip over fstype autofs
>>> (probe_nfsroot_device function).
>>>
>>>> The symlink change was probably the start, now a number of applications
>>>> now got directly to the proc file system for this information.
>>>>
>>>> For large mount tables and many processes accessing the mount table
>>>> (probably reading the whole thing, either periodically or on change
>>>> notification) the current system does not scale well at all.
>>>
>>> We use Clearcase in some instances as well, and that's yet another thing
>>> adding mounts, and its startup is very slow, due to the size of
>>> /proc/mounts.  
>>>
>>> It's definitely something that's more than just autofs and probably
>>> going to get worse, as you say.
>> 
>> If we assume that applications are going to want to read
>> /proc/self/mount* a log, we probably need to make it faster.
>> I performed a simple experiment where I mounted 1000 tmpfs filesystems,
>> copied /proc/self/mountinfo to /tmp/mountinfo, then
>> ran 4 for loops in parallel catting one of these files to /dev/null 1000 times.
>> On a single CPU VM:
>>   For /tmp/mountinfo, each group of 1000 cats took about 3 seconds.
>>   For /proc/self/mountinfo, each group of 1000 cats took about 14 seconds.
>> On a 4 CPU VM
>>   /tmp/mountinfo: 1.5secs
>>   /proc/self/mountinfo: 3.5 secs
>> 
>> Using "perf record" it appears that most of the cost is repeated calls
>> to prepend_path, with a small contribution from the fact that each read
>> only returns 4K rather than the 128K that cat asks for.
>> 
>> If we could hang a cache off struct mnt_namespace and use it instead of
>> iterating the mount table - using rcu and ns->event to ensure currency -
>> we should be able to minimize the cost of this increased use of
>> /proc/self/mount*.
>> 
>> I suspect that the best approach would be implement a cache at the
>> seq_file level.
>> 
>> One possible problem might be if applications assume that a read will
>> always return a whole number of lines (it currently does).  To be
>> sure we remain safe, we would only be able to use the cache for
>> a read() syscall which reads the whole file.
>> How big do people see /proc/self/mount* getting?  What size reads
>> does 'strace' show the various programs using to read it?
>
> Buffer size almost always has a significant impact on IO so that's
> likely a big factor but the other aspect of this is notification
> of changes.
>
> The risk is improving the IO efficiency might just allow a higher
> rate of processing of change notifications and similar symptoms
> to what we have now.

That's an issue that we should be able to get empirical data on.
Are these systems that demonstrate problems actually showing a high
rate of changes to the mount table, or is the mount table being
read frequently despite not changing?
To find out you could use a program like one of the answers to:

https://stackoverflow.com/questions/5070801/monitoring-mount-point-changes-via-proc-mounts

or instrument the kernel to add a counter to 'struct mnt_namespace' and
have mounts_open_common() increment that counter and report the value as
well as ns->event.  The long term ratio of the two numbers might be
interesting.



>
> The suggestion is that a system that allows for incremental (diff
> type) update notification is needed to allow mount table propagation
> to scale well.
>
> That implies some as yet undefined user <-> kernel communication
> protocol.

I can almost conceive a mountinfo variant where new entries always
appear at the end and deletions appear as just "$mnt_id".
struct proc_mounts would contain a bitmap similar to mnt_id_ida which
records which ids have been reported to this file.  When an event is
noticed it checks for deletions and reports them before anything else.
Keeping track of location in the ns->list list might be tricky.  It
could be done with a new 64bit never-reused mnt id, though some other
approach might be possible.

An app would read to the end, then poll/select for exceptional events
and keep reading.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] autofs - fix AT_NO_AUTOMOUNT not being honored
  2017-11-29  3:45                   ` NeilBrown
@ 2017-11-29  6:00                       ` Ian Kent
  0 siblings, 0 replies; 44+ messages in thread
From: Ian Kent @ 2017-11-29  6:00 UTC (permalink / raw)
  To: NeilBrown, Mike Marion
  Cc: autofs mailing list, Kernel Mailing List, linux-fsdevel, Al Viro

On 29/11/17 11:45, NeilBrown wrote:
> On Wed, Nov 29 2017, Ian Kent wrote:
> 
>> Adding Al Viro to the Cc list as I believe Stephen Whitehouse and
>> Al have discussed something similar, please feel free to chime in
>> with your thoughts Al.
>>
>> On 29/11/17 09:17, NeilBrown wrote:
>>> On Tue, Nov 28 2017, Mike Marion wrote:
>>>
>>>> On Tue, Nov 28, 2017 at 07:43:05AM +0800, Ian Kent wrote:
>>>>
>>>>> I think the situation is going to get worse before it gets better.
>>>>>
>>>>> On recent Fedora and kernel, with a large map and heavy mount activity
>>>>> I see:
>>>>>
>>>>> systemd, udisksd, gvfs-udisks2-volume-monitor, gvfsd-trash,
>>>>> gnome-settings-daemon, packagekitd and gnome-shell
>>>>>
>>>>> all go crazy consuming large amounts of CPU.
>>>>
>>>> Yep.  I'm not even worried about the CPU usage as much (yet, I'm sure 
>>>> it'll be more of a problem as time goes on).  We have pretty huge
>>>> direct maps and our initial startup tests on a new host with the link vs
>>>> file took >6 hours.  That's not a typo.  We worked with Suse engineering 
>>>> to come up with a fix, which should've been pushed here some time ago.
>>>>
>>>> Then, there's shutdowns (and reboots). They also took a long time (on
>>>> the order of 20+min) because it would walk the entire /proc/mounts
>>>> "unmounting" things.  Also fixed now.  That one had something to do in
>>>> SMP code as if you used a single CPU/core, it didn't take long at all.
>>>>
>>>> Just got a fix for the suse grub2-mkconfig script to fix their parsing 
>>>> looking for the root dev to skip over fstype autofs
>>>> (probe_nfsroot_device function).
>>>>
>>>>> The symlink change was probably the start, now a number of applications
>>>>> now got directly to the proc file system for this information.
>>>>>
>>>>> For large mount tables and many processes accessing the mount table
>>>>> (probably reading the whole thing, either periodically or on change
>>>>> notification) the current system does not scale well at all.
>>>>
>>>> We use Clearcase in some instances as well, and that's yet another thing
>>>> adding mounts, and its startup is very slow, due to the size of
>>>> /proc/mounts.  
>>>>
>>>> It's definitely something that's more than just autofs and probably
>>>> going to get worse, as you say.
>>>
>>> If we assume that applications are going to want to read
>>> /proc/self/mount* a log, we probably need to make it faster.
>>> I performed a simple experiment where I mounted 1000 tmpfs filesystems,
>>> copied /proc/self/mountinfo to /tmp/mountinfo, then
>>> ran 4 for loops in parallel catting one of these files to /dev/null 1000 times.
>>> On a single CPU VM:
>>>   For /tmp/mountinfo, each group of 1000 cats took about 3 seconds.
>>>   For /proc/self/mountinfo, each group of 1000 cats took about 14 seconds.
>>> On a 4 CPU VM
>>>   /tmp/mountinfo: 1.5secs
>>>   /proc/self/mountinfo: 3.5 secs
>>>
>>> Using "perf record" it appears that most of the cost is repeated calls
>>> to prepend_path, with a small contribution from the fact that each read
>>> only returns 4K rather than the 128K that cat asks for.
>>>
>>> If we could hang a cache off struct mnt_namespace and use it instead of
>>> iterating the mount table - using rcu and ns->event to ensure currency -
>>> we should be able to minimize the cost of this increased use of
>>> /proc/self/mount*.
>>>
>>> I suspect that the best approach would be implement a cache at the
>>> seq_file level.
>>>
>>> One possible problem might be if applications assume that a read will
>>> always return a whole number of lines (it currently does).  To be
>>> sure we remain safe, we would only be able to use the cache for
>>> a read() syscall which reads the whole file.
>>> How big do people see /proc/self/mount* getting?  What size reads
>>> does 'strace' show the various programs using to read it?
>>
>> Buffer size almost always has a significant impact on IO so that's
>> likely a big factor but the other aspect of this is notification
>> of changes.
>>
>> The risk is improving the IO efficiency might just allow a higher
>> rate of processing of change notifications and similar symptoms
>> to what we have now.
> 
> That's an issue that we should be able to get empirical data on.
> Are these systems that demonstrate problems actually showing a high
> rate of changes to the mount table, or is the mount table being
> read frequently despite not changing?
> To find out you could use a program like one of the answers to:
> 
> https://stackoverflow.com/questions/5070801/monitoring-mount-point-changes-via-proc-mounts
> 
> or instrument the kernel to add a counter to 'struct mnt_namespace' and
> have mounts_open_common() increment that counter and report the value as
> well as ns->event.  The long term ratio of the two numbers might be
> interesting.

One scenario is, under heavy mount activity, the CPU usage of processes
systemd, udisksd2, gvfs-udisks2-volume-monitor, gvfsd-trash,
gnome-settings-daemon and packagekitd (and possibly gnome-shell, might
be a false positive though) grow to consume all available CPU.

The processes gvfs-udisks2-volume-monitor and gnome-settings-daemon 
(and possibly packagekitd, might be false positive) continue to use
excessive CPU when the mount table is large but there is no mount/umount
activity.

In this case heavy mount activity means starting autofs with a direct
mount map of 15k+ entries.

The shutdown can be a problem too but umount(2) is just too slow for it
to be as pronounced a problem as what we see at startup. The umount
slowness appears to be constant. I'm not sure if it's proportional in any
way to the number of mounts present on the system.

> 
> 
> 
>>
>> The suggestion is that a system that allows for incremental (diff
>> type) update notification is needed to allow mount table propagation
>> to scale well.
>>
>> That implies some as yet undefined user <-> kernel communication
>> protocol.
> 
> I can almost conceive a mountinfo variant where new entries always
> appear at the end and deletions appear as just "$mnt_id".
> struct proc_mounts would contain a bitmap similar to mnt_id_ida which
> records which ids have been reported to this file.  When an event is
> noticed it checks for deletions and reports them before anything else.
> Keeping track of location in the ns->list list might be tricky.  It
> could be done with a new 64bit never-reused mnt id, though some other
> approach might be possible.
> 
> An app would read to the end, then poll/select for exceptional events
> and keep reading.

Keeping track of modifications is certainly one of the problems and this
sounds like a good start to resolving it ....

I believe the current preferred delivery of this to be a shared library
interface available to user space that talks to the kernel to obtain the
needed information.

Ian

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] autofs - fix AT_NO_AUTOMOUNT not being honored
@ 2017-11-29  6:00                       ` Ian Kent
  0 siblings, 0 replies; 44+ messages in thread
From: Ian Kent @ 2017-11-29  6:00 UTC (permalink / raw)
  To: NeilBrown, Mike Marion
  Cc: autofs mailing list, Kernel Mailing List, linux-fsdevel, Al Viro

On 29/11/17 11:45, NeilBrown wrote:
> On Wed, Nov 29 2017, Ian Kent wrote:
> 
>> Adding Al Viro to the Cc list as I believe Stephen Whitehouse and
>> Al have discussed something similar, please feel free to chime in
>> with your thoughts Al.
>>
>> On 29/11/17 09:17, NeilBrown wrote:
>>> On Tue, Nov 28 2017, Mike Marion wrote:
>>>
>>>> On Tue, Nov 28, 2017 at 07:43:05AM +0800, Ian Kent wrote:
>>>>
>>>>> I think the situation is going to get worse before it gets better.
>>>>>
>>>>> On recent Fedora and kernel, with a large map and heavy mount activity
>>>>> I see:
>>>>>
>>>>> systemd, udisksd, gvfs-udisks2-volume-monitor, gvfsd-trash,
>>>>> gnome-settings-daemon, packagekitd and gnome-shell
>>>>>
>>>>> all go crazy consuming large amounts of CPU.
>>>>
>>>> Yep.  I'm not even worried about the CPU usage as much (yet, I'm sure 
>>>> it'll be more of a problem as time goes on).  We have pretty huge
>>>> direct maps and our initial startup tests on a new host with the link vs
>>>> file took >6 hours.  That's not a typo.  We worked with Suse engineering 
>>>> to come up with a fix, which should've been pushed here some time ago.
>>>>
>>>> Then, there's shutdowns (and reboots). They also took a long time (on
>>>> the order of 20+min) because it would walk the entire /proc/mounts
>>>> "unmounting" things.  Also fixed now.  That one had something to do in
>>>> SMP code as if you used a single CPU/core, it didn't take long at all.
>>>>
>>>> Just got a fix for the suse grub2-mkconfig script to fix their parsing 
>>>> looking for the root dev to skip over fstype autofs
>>>> (probe_nfsroot_device function).
>>>>
>>>>> The symlink change was probably the start, now a number of applications
>>>>> now got directly to the proc file system for this information.
>>>>>
>>>>> For large mount tables and many processes accessing the mount table
>>>>> (probably reading the whole thing, either periodically or on change
>>>>> notification) the current system does not scale well at all.
>>>>
>>>> We use Clearcase in some instances as well, and that's yet another thing
>>>> adding mounts, and its startup is very slow, due to the size of
>>>> /proc/mounts.  
>>>>
>>>> It's definitely something that's more than just autofs and probably
>>>> going to get worse, as you say.
>>>
>>> If we assume that applications are going to want to read
>>> /proc/self/mount* a log, we probably need to make it faster.
>>> I performed a simple experiment where I mounted 1000 tmpfs filesystems,
>>> copied /proc/self/mountinfo to /tmp/mountinfo, then
>>> ran 4 for loops in parallel catting one of these files to /dev/null 1000 times.
>>> On a single CPU VM:
>>>   For /tmp/mountinfo, each group of 1000 cats took about 3 seconds.
>>>   For /proc/self/mountinfo, each group of 1000 cats took about 14 seconds.
>>> On a 4 CPU VM
>>>   /tmp/mountinfo: 1.5secs
>>>   /proc/self/mountinfo: 3.5 secs
>>>
>>> Using "perf record" it appears that most of the cost is repeated calls
>>> to prepend_path, with a small contribution from the fact that each read
>>> only returns 4K rather than the 128K that cat asks for.
>>>
>>> If we could hang a cache off struct mnt_namespace and use it instead of
>>> iterating the mount table - using rcu and ns->event to ensure currency -
>>> we should be able to minimize the cost of this increased use of
>>> /proc/self/mount*.
>>>
>>> I suspect that the best approach would be implement a cache at the
>>> seq_file level.
>>>
>>> One possible problem might be if applications assume that a read will
>>> always return a whole number of lines (it currently does).  To be
>>> sure we remain safe, we would only be able to use the cache for
>>> a read() syscall which reads the whole file.
>>> How big do people see /proc/self/mount* getting?  What size reads
>>> does 'strace' show the various programs using to read it?
>>
>> Buffer size almost always has a significant impact on IO so that's
>> likely a big factor but the other aspect of this is notification
>> of changes.
>>
>> The risk is improving the IO efficiency might just allow a higher
>> rate of processing of change notifications and similar symptoms
>> to what we have now.
> 
> That's an issue that we should be able to get empirical data on.
> Are these systems that demonstrate problems actually showing a high
> rate of changes to the mount table, or is the mount table being
> read frequently despite not changing?
> To find out you could use a program like one of the answers to:
> 
> https://stackoverflow.com/questions/5070801/monitoring-mount-point-changes-via-proc-mounts
> 
> or instrument the kernel to add a counter to 'struct mnt_namespace' and
> have mounts_open_common() increment that counter and report the value as
> well as ns->event.  The long term ratio of the two numbers might be
> interesting.

One scenario is, under heavy mount activity, the CPU usage of processes
systemd, udisksd2, gvfs-udisks2-volume-monitor, gvfsd-trash,
gnome-settings-daemon and packagekitd (and possibly gnome-shell, might
be a false positive though) grow to consume all available CPU.

The processes gvfs-udisks2-volume-monitor and gnome-settings-daemon 
(and possibly packagekitd, might be false positive) continue to use
excessive CPU when the mount table is large but there is no mount/umount
activity.

In this case heavy mount activity means starting autofs with a direct
mount map of 15k+ entries.

The shutdown can be a problem too but umount(2) is just too slow for it
to be as pronounced a problem as what we see at startup. The umount
slowness appears to be constant. I'm not sure if it's proportional in any
way to the number of mounts present on the system.

> 
> 
> 
>>
>> The suggestion is that a system that allows for incremental (diff
>> type) update notification is needed to allow mount table propagation
>> to scale well.
>>
>> That implies some as yet undefined user <-> kernel communication
>> protocol.
> 
> I can almost conceive a mountinfo variant where new entries always
> appear at the end and deletions appear as just "$mnt_id".
> struct proc_mounts would contain a bitmap similar to mnt_id_ida which
> records which ids have been reported to this file.  When an event is
> noticed it checks for deletions and reports them before anything else.
> Keeping track of location in the ns->list list might be tricky.  It
> could be done with a new 64bit never-reused mnt id, though some other
> approach might be possible.
> 
> An app would read to the end, then poll/select for exceptional events
> and keep reading.

Keeping track of modifications is certainly one of the problems and this
sounds like a good start to resolving it ....

I believe the current preferred delivery of this to be a shared library
interface available to user space that talks to the kernel to obtain the
needed information.

Ian
--
To unsubscribe from this list: send the line "unsubscribe autofs" in

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] autofs - fix AT_NO_AUTOMOUNT not being honored
  2017-11-29  6:00                       ` Ian Kent
  (?)
@ 2017-11-29  7:39                       ` NeilBrown
  2017-11-30  0:00                         ` Ian Kent
  -1 siblings, 1 reply; 44+ messages in thread
From: NeilBrown @ 2017-11-29  7:39 UTC (permalink / raw)
  To: Ian Kent, Mike Marion
  Cc: autofs mailing list, Kernel Mailing List, linux-fsdevel, Al Viro

[-- Attachment #1: Type: text/plain, Size: 8956 bytes --]

On Wed, Nov 29 2017, Ian Kent wrote:

> On 29/11/17 11:45, NeilBrown wrote:
>> On Wed, Nov 29 2017, Ian Kent wrote:
>> 
>>> Adding Al Viro to the Cc list as I believe Stephen Whitehouse and
>>> Al have discussed something similar, please feel free to chime in
>>> with your thoughts Al.
>>>
>>> On 29/11/17 09:17, NeilBrown wrote:
>>>> On Tue, Nov 28 2017, Mike Marion wrote:
>>>>
>>>>> On Tue, Nov 28, 2017 at 07:43:05AM +0800, Ian Kent wrote:
>>>>>
>>>>>> I think the situation is going to get worse before it gets better.
>>>>>>
>>>>>> On recent Fedora and kernel, with a large map and heavy mount activity
>>>>>> I see:
>>>>>>
>>>>>> systemd, udisksd, gvfs-udisks2-volume-monitor, gvfsd-trash,
>>>>>> gnome-settings-daemon, packagekitd and gnome-shell
>>>>>>
>>>>>> all go crazy consuming large amounts of CPU.
>>>>>
>>>>> Yep.  I'm not even worried about the CPU usage as much (yet, I'm sure 
>>>>> it'll be more of a problem as time goes on).  We have pretty huge
>>>>> direct maps and our initial startup tests on a new host with the link vs
>>>>> file took >6 hours.  That's not a typo.  We worked with Suse engineering 
>>>>> to come up with a fix, which should've been pushed here some time ago.
>>>>>
>>>>> Then, there's shutdowns (and reboots). They also took a long time (on
>>>>> the order of 20+min) because it would walk the entire /proc/mounts
>>>>> "unmounting" things.  Also fixed now.  That one had something to do in
>>>>> SMP code as if you used a single CPU/core, it didn't take long at all.
>>>>>
>>>>> Just got a fix for the suse grub2-mkconfig script to fix their parsing 
>>>>> looking for the root dev to skip over fstype autofs
>>>>> (probe_nfsroot_device function).
>>>>>
>>>>>> The symlink change was probably the start, now a number of applications
>>>>>> now got directly to the proc file system for this information.
>>>>>>
>>>>>> For large mount tables and many processes accessing the mount table
>>>>>> (probably reading the whole thing, either periodically or on change
>>>>>> notification) the current system does not scale well at all.
>>>>>
>>>>> We use Clearcase in some instances as well, and that's yet another thing
>>>>> adding mounts, and its startup is very slow, due to the size of
>>>>> /proc/mounts.  
>>>>>
>>>>> It's definitely something that's more than just autofs and probably
>>>>> going to get worse, as you say.
>>>>
>>>> If we assume that applications are going to want to read
>>>> /proc/self/mount* a log, we probably need to make it faster.
>>>> I performed a simple experiment where I mounted 1000 tmpfs filesystems,
>>>> copied /proc/self/mountinfo to /tmp/mountinfo, then
>>>> ran 4 for loops in parallel catting one of these files to /dev/null 1000 times.
>>>> On a single CPU VM:
>>>>   For /tmp/mountinfo, each group of 1000 cats took about 3 seconds.
>>>>   For /proc/self/mountinfo, each group of 1000 cats took about 14 seconds.
>>>> On a 4 CPU VM
>>>>   /tmp/mountinfo: 1.5secs
>>>>   /proc/self/mountinfo: 3.5 secs
>>>>
>>>> Using "perf record" it appears that most of the cost is repeated calls
>>>> to prepend_path, with a small contribution from the fact that each read
>>>> only returns 4K rather than the 128K that cat asks for.
>>>>
>>>> If we could hang a cache off struct mnt_namespace and use it instead of
>>>> iterating the mount table - using rcu and ns->event to ensure currency -
>>>> we should be able to minimize the cost of this increased use of
>>>> /proc/self/mount*.
>>>>
>>>> I suspect that the best approach would be implement a cache at the
>>>> seq_file level.
>>>>
>>>> One possible problem might be if applications assume that a read will
>>>> always return a whole number of lines (it currently does).  To be
>>>> sure we remain safe, we would only be able to use the cache for
>>>> a read() syscall which reads the whole file.
>>>> How big do people see /proc/self/mount* getting?  What size reads
>>>> does 'strace' show the various programs using to read it?
>>>
>>> Buffer size almost always has a significant impact on IO so that's
>>> likely a big factor but the other aspect of this is notification
>>> of changes.
>>>
>>> The risk is improving the IO efficiency might just allow a higher
>>> rate of processing of change notifications and similar symptoms
>>> to what we have now.
>> 
>> That's an issue that we should be able to get empirical data on.
>> Are these systems that demonstrate problems actually showing a high
>> rate of changes to the mount table, or is the mount table being
>> read frequently despite not changing?
>> To find out you could use a program like one of the answers to:
>> 
>> https://stackoverflow.com/questions/5070801/monitoring-mount-point-changes-via-proc-mounts
>> 
>> or instrument the kernel to add a counter to 'struct mnt_namespace' and
>> have mounts_open_common() increment that counter and report the value as
>> well as ns->event.  The long term ratio of the two numbers might be
>> interesting.
>
> One scenario is, under heavy mount activity, the CPU usage of processes
> systemd, udisksd2, gvfs-udisks2-volume-monitor, gvfsd-trash,
> gnome-settings-daemon and packagekitd (and possibly gnome-shell, might
> be a false positive though) grow to consume all available CPU.

OK, that's pretty clearly caused by lots of changes happening in quick
succession.
The current poll support in the kernel is slightly non-optimal in that
it could trigger more events than necessary, but it probably doesn't
make a big difference.

The current code means that poll/select will report an event if there
has been an event since the last time that poll/select reported an
event.
It is only necessary to report an event if these has been one since the
last time that the first record in the file was read.
So if there are a sequence of events, then the file is read, a
subsequent poll will report another event where there isn't really one.
This would only be noticed if there was a delay between the event being
reported in the kernel, and the read happening.

You could test by removing
		m->poll_event = event;
from mounts_poll() and adding something like:
 if (*pos == 0) m->poll_event = p->ns->event;
to m_start().


>
> The processes gvfs-udisks2-volume-monitor and gnome-settings-daemon 
> (and possibly packagekitd, might be false positive) continue to use
> excessive CPU when the mount table is large but there is no mount/umount
> activity.

That is strange.  I wonder if something could be triggering phantom
events in some way.. Do they eventually settle, or does this continue
for hours.  Is there an easy way I could reproduce?

>
> In this case heavy mount activity means starting autofs with a direct
> mount map of 15k+ entries.

I'll try that tomorrow and see what happens.  Would bind mounts be
enough?

>
> The shutdown can be a problem too but umount(2) is just too slow for it
> to be as pronounced a problem as what we see at startup. The umount
> slowness appears to be constant. I'm not sure if it's proportional in any
> way to the number of mounts present on the system.

Changing synchronize_rcu() in namespace_unlock() to
synchronize_rcu_expedited() makes umount much faster.  Al has the patch
but hasn't responded.  Using "umount -c", which systemd now does, can
also help I think.

>
>> 
>> 
>> 
>>>
>>> The suggestion is that a system that allows for incremental (diff
>>> type) update notification is needed to allow mount table propagation
>>> to scale well.
>>>
>>> That implies some as yet undefined user <-> kernel communication
>>> protocol.
>> 
>> I can almost conceive a mountinfo variant where new entries always
>> appear at the end and deletions appear as just "$mnt_id".
>> struct proc_mounts would contain a bitmap similar to mnt_id_ida which
>> records which ids have been reported to this file.  When an event is
>> noticed it checks for deletions and reports them before anything else.
>> Keeping track of location in the ns->list list might be tricky.  It
>> could be done with a new 64bit never-reused mnt id, though some other
>> approach might be possible.
>> 
>> An app would read to the end, then poll/select for exceptional events
>> and keep reading.
>
> Keeping track of modifications is certainly one of the problems and this
> sounds like a good start to resolving it ....
>
> I believe the current preferred delivery of this to be a shared library
> interface available to user space that talks to the kernel to obtain the
> needed information.

Any suggests at the sort of interface?  If I provided a pollable fd, and
a function which given the fd and a 'struct mntent' (or similar) would
fill in the mntent if info was available, would that be likely to suit?
Not that I'm promising anything but having a framework helps guide
thoughts.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] autofs - fix AT_NO_AUTOMOUNT not being honored
  2017-11-29  6:00                       ` Ian Kent
@ 2017-11-29 16:51                         ` Mike Marion
  -1 siblings, 0 replies; 44+ messages in thread
From: Mike Marion @ 2017-11-29 16:51 UTC (permalink / raw)
  To: Ian Kent
  Cc: NeilBrown, autofs mailing list, Kernel Mailing List,
	linux-fsdevel, Al Viro

On Wed, Nov 29, 2017 at 02:00:31PM +0800, Ian Kent wrote:
> On 29/11/17 11:45, NeilBrown wrote:
> > On Wed, Nov 29 2017, Ian Kent wrote:
> > 
> >> Adding Al Viro to the Cc list as I believe Stephen Whitehouse and
> >> Al have discussed something similar, please feel free to chime in
> >> with your thoughts Al.
> >>
> >> On 29/11/17 09:17, NeilBrown wrote:
> >>> On Tue, Nov 28 2017, Mike Marion wrote:
> >>>
> >>>> On Tue, Nov 28, 2017 at 07:43:05AM +0800, Ian Kent wrote:
> >>>>
> >>>>> I think the situation is going to get worse before it gets better.
> >>>>>
> >>>>> On recent Fedora and kernel, with a large map and heavy mount activity
> >>>>> I see:
> >>>>>
> >>>>> systemd, udisksd, gvfs-udisks2-volume-monitor, gvfsd-trash,
> >>>>> gnome-settings-daemon, packagekitd and gnome-shell
> >>>>>
> >>>>> all go crazy consuming large amounts of CPU.
> >>>>
> >>>> Yep.  I'm not even worried about the CPU usage as much (yet, I'm sure 
> >>>> it'll be more of a problem as time goes on).  We have pretty huge
> >>>> direct maps and our initial startup tests on a new host with the link vs
> >>>> file took >6 hours.  That's not a typo.  We worked with Suse engineering 
> >>>> to come up with a fix, which should've been pushed here some time ago.
> >>>>
> >>>> Then, there's shutdowns (and reboots). They also took a long time (on
> >>>> the order of 20+min) because it would walk the entire /proc/mounts
> >>>> "unmounting" things.  Also fixed now.  That one had something to do in
> >>>> SMP code as if you used a single CPU/core, it didn't take long at all.
> >>>>
> >>>> Just got a fix for the suse grub2-mkconfig script to fix their parsing 
> >>>> looking for the root dev to skip over fstype autofs
> >>>> (probe_nfsroot_device function).
> >>>>
> >>>>> The symlink change was probably the start, now a number of applications
> >>>>> now got directly to the proc file system for this information.
> >>>>>
> >>>>> For large mount tables and many processes accessing the mount table
> >>>>> (probably reading the whole thing, either periodically or on change
> >>>>> notification) the current system does not scale well at all.
> >>>>
> >>>> We use Clearcase in some instances as well, and that's yet another thing
> >>>> adding mounts, and its startup is very slow, due to the size of
> >>>> /proc/mounts.  
> >>>>
> >>>> It's definitely something that's more than just autofs and probably
> >>>> going to get worse, as you say.
> >>>
> >>> If we assume that applications are going to want to read
> >>> /proc/self/mount* a log, we probably need to make it faster.
> >>> I performed a simple experiment where I mounted 1000 tmpfs filesystems,
> >>> copied /proc/self/mountinfo to /tmp/mountinfo, then
> >>> ran 4 for loops in parallel catting one of these files to /dev/null 1000 times.
> >>> On a single CPU VM:
> >>>   For /tmp/mountinfo, each group of 1000 cats took about 3 seconds.
> >>>   For /proc/self/mountinfo, each group of 1000 cats took about 14 seconds.
> >>> On a 4 CPU VM
> >>>   /tmp/mountinfo: 1.5secs
> >>>   /proc/self/mountinfo: 3.5 secs
> >>>
> >>> Using "perf record" it appears that most of the cost is repeated calls
> >>> to prepend_path, with a small contribution from the fact that each read
> >>> only returns 4K rather than the 128K that cat asks for.
> >>>
> >>> If we could hang a cache off struct mnt_namespace and use it instead of
> >>> iterating the mount table - using rcu and ns->event to ensure currency -
> >>> we should be able to minimize the cost of this increased use of
> >>> /proc/self/mount*.
> >>>
> >>> I suspect that the best approach would be implement a cache at the
> >>> seq_file level.
> >>>
> >>> One possible problem might be if applications assume that a read will
> >>> always return a whole number of lines (it currently does).  To be
> >>> sure we remain safe, we would only be able to use the cache for
> >>> a read() syscall which reads the whole file.
> >>> How big do people see /proc/self/mount* getting?  What size reads
> >>> does 'strace' show the various programs using to read it?
> >>
> >> Buffer size almost always has a significant impact on IO so that's
> >> likely a big factor but the other aspect of this is notification
> >> of changes.
> >>
> >> The risk is improving the IO efficiency might just allow a higher
> >> rate of processing of change notifications and similar symptoms
> >> to what we have now.
> > 
> > That's an issue that we should be able to get empirical data on.
> > Are these systems that demonstrate problems actually showing a high
> > rate of changes to the mount table, or is the mount table being
> > read frequently despite not changing?
> > To find out you could use a program like one of the answers to:
> > 
> > https://stackoverflow.com/questions/5070801/monitoring-mount-point-changes-via-proc-mounts
> > 
> > or instrument the kernel to add a counter to 'struct mnt_namespace' and
> > have mounts_open_common() increment that counter and report the value as
> > well as ns->event.  The long term ratio of the two numbers might be
> > interesting.
> 
> One scenario is, under heavy mount activity, the CPU usage of processes
> systemd, udisksd2, gvfs-udisks2-volume-monitor, gvfsd-trash,
> gnome-settings-daemon and packagekitd (and possibly gnome-shell, might
> be a false positive though) grow to consume all available CPU.
> 
> The processes gvfs-udisks2-volume-monitor and gnome-settings-daemon 
> (and possibly packagekitd, might be false positive) continue to use
> excessive CPU when the mount table is large but there is no mount/umount
> activity.
> 
> In this case heavy mount activity means starting autofs with a direct
> mount map of 15k+ entries.
> 
> The shutdown can be a problem too but umount(2) is just too slow for it
> to be as pronounced a problem as what we see at startup. The umount
> slowness appears to be constant. I'm not sure if it's proportional in any
> way to the number of mounts present on the system.

BTW, the previously mentioned fix we got from suse is (I'm pretty sure)
this one:
http://kernel.suse.com/cgit/kernel-source/commit/?id=10c43659465b18bd337f4434f41e133d09e08b13
Which simply changes a synchronize_rcu(); call to synchronize_rcu_expedited();

I do believe that they found it both proportional to the number of
mounts as well as something that only slowed down shutdown noticeably on
SMP hosts.  That SMP bit was only stumbled upon when their initial
testing was on a single cpu VM, since that's about the only time anyone
runs with only 1 CPU these days.

> >>
> >> The suggestion is that a system that allows for incremental (diff
> >> type) update notification is needed to allow mount table propagation
> >> to scale well.
> >>
> >> That implies some as yet undefined user <-> kernel communication
> >> protocol.
> > 
> > I can almost conceive a mountinfo variant where new entries always
> > appear at the end and deletions appear as just "$mnt_id".
> > struct proc_mounts would contain a bitmap similar to mnt_id_ida which
> > records which ids have been reported to this file.  When an event is
> > noticed it checks for deletions and reports them before anything else.
> > Keeping track of location in the ns->list list might be tricky.  It
> > could be done with a new 64bit never-reused mnt id, though some other
> > approach might be possible.
> > 
> > An app would read to the end, then poll/select for exceptional events
> > and keep reading.
> 
> Keeping track of modifications is certainly one of the problems and this
> sounds like a good start to resolving it ....
> 
> I believe the current preferred delivery of this to be a shared library
> interface available to user space that talks to the kernel to obtain the
> needed information.
> 
> Ian

-- 
Mike Marion-Unix SysAdmin/Sr. Staff IT Engineer-http://www.qualcomm.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] autofs - fix AT_NO_AUTOMOUNT not being honored
@ 2017-11-29 16:51                         ` Mike Marion
  0 siblings, 0 replies; 44+ messages in thread
From: Mike Marion @ 2017-11-29 16:51 UTC (permalink / raw)
  To: Ian Kent
  Cc: NeilBrown, autofs mailing list, Kernel Mailing List,
	linux-fsdevel, Al Viro

On Wed, Nov 29, 2017 at 02:00:31PM +0800, Ian Kent wrote:
> On 29/11/17 11:45, NeilBrown wrote:
> > On Wed, Nov 29 2017, Ian Kent wrote:
> > 
> >> Adding Al Viro to the Cc list as I believe Stephen Whitehouse and
> >> Al have discussed something similar, please feel free to chime in
> >> with your thoughts Al.
> >>
> >> On 29/11/17 09:17, NeilBrown wrote:
> >>> On Tue, Nov 28 2017, Mike Marion wrote:
> >>>
> >>>> On Tue, Nov 28, 2017 at 07:43:05AM +0800, Ian Kent wrote:
> >>>>
> >>>>> I think the situation is going to get worse before it gets better.
> >>>>>
> >>>>> On recent Fedora and kernel, with a large map and heavy mount activity
> >>>>> I see:
> >>>>>
> >>>>> systemd, udisksd, gvfs-udisks2-volume-monitor, gvfsd-trash,
> >>>>> gnome-settings-daemon, packagekitd and gnome-shell
> >>>>>
> >>>>> all go crazy consuming large amounts of CPU.
> >>>>
> >>>> Yep.  I'm not even worried about the CPU usage as much (yet, I'm sure 
> >>>> it'll be more of a problem as time goes on).  We have pretty huge
> >>>> direct maps and our initial startup tests on a new host with the link vs
> >>>> file took >6 hours.  That's not a typo.  We worked with Suse engineering 
> >>>> to come up with a fix, which should've been pushed here some time ago.
> >>>>
> >>>> Then, there's shutdowns (and reboots). They also took a long time (on
> >>>> the order of 20+min) because it would walk the entire /proc/mounts
> >>>> "unmounting" things.  Also fixed now.  That one had something to do in
> >>>> SMP code as if you used a single CPU/core, it didn't take long at all.
> >>>>
> >>>> Just got a fix for the suse grub2-mkconfig script to fix their parsing 
> >>>> looking for the root dev to skip over fstype autofs
> >>>> (probe_nfsroot_device function).
> >>>>
> >>>>> The symlink change was probably the start, now a number of applications
> >>>>> now got directly to the proc file system for this information.
> >>>>>
> >>>>> For large mount tables and many processes accessing the mount table
> >>>>> (probably reading the whole thing, either periodically or on change
> >>>>> notification) the current system does not scale well at all.
> >>>>
> >>>> We use Clearcase in some instances as well, and that's yet another thing
> >>>> adding mounts, and its startup is very slow, due to the size of
> >>>> /proc/mounts.  
> >>>>
> >>>> It's definitely something that's more than just autofs and probably
> >>>> going to get worse, as you say.
> >>>
> >>> If we assume that applications are going to want to read
> >>> /proc/self/mount* a log, we probably need to make it faster.
> >>> I performed a simple experiment where I mounted 1000 tmpfs filesystems,
> >>> copied /proc/self/mountinfo to /tmp/mountinfo, then
> >>> ran 4 for loops in parallel catting one of these files to /dev/null 1000 times.
> >>> On a single CPU VM:
> >>>   For /tmp/mountinfo, each group of 1000 cats took about 3 seconds.
> >>>   For /proc/self/mountinfo, each group of 1000 cats took about 14 seconds.
> >>> On a 4 CPU VM
> >>>   /tmp/mountinfo: 1.5secs
> >>>   /proc/self/mountinfo: 3.5 secs
> >>>
> >>> Using "perf record" it appears that most of the cost is repeated calls
> >>> to prepend_path, with a small contribution from the fact that each read
> >>> only returns 4K rather than the 128K that cat asks for.
> >>>
> >>> If we could hang a cache off struct mnt_namespace and use it instead of
> >>> iterating the mount table - using rcu and ns->event to ensure currency -
> >>> we should be able to minimize the cost of this increased use of
> >>> /proc/self/mount*.
> >>>
> >>> I suspect that the best approach would be implement a cache at the
> >>> seq_file level.
> >>>
> >>> One possible problem might be if applications assume that a read will
> >>> always return a whole number of lines (it currently does).  To be
> >>> sure we remain safe, we would only be able to use the cache for
> >>> a read() syscall which reads the whole file.
> >>> How big do people see /proc/self/mount* getting?  What size reads
> >>> does 'strace' show the various programs using to read it?
> >>
> >> Buffer size almost always has a significant impact on IO so that's
> >> likely a big factor but the other aspect of this is notification
> >> of changes.
> >>
> >> The risk is improving the IO efficiency might just allow a higher
> >> rate of processing of change notifications and similar symptoms
> >> to what we have now.
> > 
> > That's an issue that we should be able to get empirical data on.
> > Are these systems that demonstrate problems actually showing a high
> > rate of changes to the mount table, or is the mount table being
> > read frequently despite not changing?
> > To find out you could use a program like one of the answers to:
> > 
> > https://stackoverflow.com/questions/5070801/monitoring-mount-point-changes-via-proc-mounts
> > 
> > or instrument the kernel to add a counter to 'struct mnt_namespace' and
> > have mounts_open_common() increment that counter and report the value as
> > well as ns->event.  The long term ratio of the two numbers might be
> > interesting.
> 
> One scenario is, under heavy mount activity, the CPU usage of processes
> systemd, udisksd2, gvfs-udisks2-volume-monitor, gvfsd-trash,
> gnome-settings-daemon and packagekitd (and possibly gnome-shell, might
> be a false positive though) grow to consume all available CPU.
> 
> The processes gvfs-udisks2-volume-monitor and gnome-settings-daemon 
> (and possibly packagekitd, might be false positive) continue to use
> excessive CPU when the mount table is large but there is no mount/umount
> activity.
> 
> In this case heavy mount activity means starting autofs with a direct
> mount map of 15k+ entries.
> 
> The shutdown can be a problem too but umount(2) is just too slow for it
> to be as pronounced a problem as what we see at startup. The umount
> slowness appears to be constant. I'm not sure if it's proportional in any
> way to the number of mounts present on the system.

BTW, the previously mentioned fix we got from suse is (I'm pretty sure)
this one:
http://kernel.suse.com/cgit/kernel-source/commit/?id=10c43659465b18bd337f4434f41e133d09e08b13
Which simply changes a synchronize_rcu(); call to synchronize_rcu_expedited();

I do believe that they found it both proportional to the number of
mounts as well as something that only slowed down shutdown noticeably on
SMP hosts.  That SMP bit was only stumbled upon when their initial
testing was on a single cpu VM, since that's about the only time anyone
runs with only 1 CPU these days.

> >>
> >> The suggestion is that a system that allows for incremental (diff
> >> type) update notification is needed to allow mount table propagation
> >> to scale well.
> >>
> >> That implies some as yet undefined user <-> kernel communication
> >> protocol.
> > 
> > I can almost conceive a mountinfo variant where new entries always
> > appear at the end and deletions appear as just "$mnt_id".
> > struct proc_mounts would contain a bitmap similar to mnt_id_ida which
> > records which ids have been reported to this file.  When an event is
> > noticed it checks for deletions and reports them before anything else.
> > Keeping track of location in the ns->list list might be tricky.  It
> > could be done with a new 64bit never-reused mnt id, though some other
> > approach might be possible.
> > 
> > An app would read to the end, then poll/select for exceptional events
> > and keep reading.
> 
> Keeping track of modifications is certainly one of the problems and this
> sounds like a good start to resolving it ....
> 
> I believe the current preferred delivery of this to be a shared library
> interface available to user space that talks to the kernel to obtain the
> needed information.
> 
> Ian

-- 
Mike Marion-Unix SysAdmin/Sr. Staff IT Engineer-http://www.qualcomm.com--
To unsubscribe from this list: send the line "unsubscribe autofs" in

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] autofs - fix AT_NO_AUTOMOUNT not being honored
  2017-11-29  7:39                       ` NeilBrown
@ 2017-11-30  0:00                         ` Ian Kent
  0 siblings, 0 replies; 44+ messages in thread
From: Ian Kent @ 2017-11-30  0:00 UTC (permalink / raw)
  To: NeilBrown, Mike Marion
  Cc: autofs mailing list, Kernel Mailing List, linux-fsdevel, Al Viro,
	David Howells

On 29/11/17 15:39, NeilBrown wrote:
> On Wed, Nov 29 2017, Ian Kent wrote:
> 
>> On 29/11/17 11:45, NeilBrown wrote:
>>> On Wed, Nov 29 2017, Ian Kent wrote:
>>>
>>>> Adding Al Viro to the Cc list as I believe Stephen Whitehouse and
>>>> Al have discussed something similar, please feel free to chime in
>>>> with your thoughts Al.
>>>>
>>>> On 29/11/17 09:17, NeilBrown wrote:
>>>>> On Tue, Nov 28 2017, Mike Marion wrote:
>>>>>
>>>>>> On Tue, Nov 28, 2017 at 07:43:05AM +0800, Ian Kent wrote:
>>>>>>
>>>>>>> I think the situation is going to get worse before it gets better.
>>>>>>>
>>>>>>> On recent Fedora and kernel, with a large map and heavy mount activity
>>>>>>> I see:
>>>>>>>
>>>>>>> systemd, udisksd, gvfs-udisks2-volume-monitor, gvfsd-trash,
>>>>>>> gnome-settings-daemon, packagekitd and gnome-shell
>>>>>>>
>>>>>>> all go crazy consuming large amounts of CPU.
>>>>>>
>>>>>> Yep.  I'm not even worried about the CPU usage as much (yet, I'm sure 
>>>>>> it'll be more of a problem as time goes on).  We have pretty huge
>>>>>> direct maps and our initial startup tests on a new host with the link vs
>>>>>> file took >6 hours.  That's not a typo.  We worked with Suse engineering 
>>>>>> to come up with a fix, which should've been pushed here some time ago.
>>>>>>
>>>>>> Then, there's shutdowns (and reboots). They also took a long time (on
>>>>>> the order of 20+min) because it would walk the entire /proc/mounts
>>>>>> "unmounting" things.  Also fixed now.  That one had something to do in
>>>>>> SMP code as if you used a single CPU/core, it didn't take long at all.
>>>>>>
>>>>>> Just got a fix for the suse grub2-mkconfig script to fix their parsing 
>>>>>> looking for the root dev to skip over fstype autofs
>>>>>> (probe_nfsroot_device function).
>>>>>>
>>>>>>> The symlink change was probably the start, now a number of applications
>>>>>>> now got directly to the proc file system for this information.
>>>>>>>
>>>>>>> For large mount tables and many processes accessing the mount table
>>>>>>> (probably reading the whole thing, either periodically or on change
>>>>>>> notification) the current system does not scale well at all.
>>>>>>
>>>>>> We use Clearcase in some instances as well, and that's yet another thing
>>>>>> adding mounts, and its startup is very slow, due to the size of
>>>>>> /proc/mounts.  
>>>>>>
>>>>>> It's definitely something that's more than just autofs and probably
>>>>>> going to get worse, as you say.
>>>>>
>>>>> If we assume that applications are going to want to read
>>>>> /proc/self/mount* a log, we probably need to make it faster.
>>>>> I performed a simple experiment where I mounted 1000 tmpfs filesystems,
>>>>> copied /proc/self/mountinfo to /tmp/mountinfo, then
>>>>> ran 4 for loops in parallel catting one of these files to /dev/null 1000 times.
>>>>> On a single CPU VM:
>>>>>   For /tmp/mountinfo, each group of 1000 cats took about 3 seconds.
>>>>>   For /proc/self/mountinfo, each group of 1000 cats took about 14 seconds.
>>>>> On a 4 CPU VM
>>>>>   /tmp/mountinfo: 1.5secs
>>>>>   /proc/self/mountinfo: 3.5 secs
>>>>>
>>>>> Using "perf record" it appears that most of the cost is repeated calls
>>>>> to prepend_path, with a small contribution from the fact that each read
>>>>> only returns 4K rather than the 128K that cat asks for.
>>>>>
>>>>> If we could hang a cache off struct mnt_namespace and use it instead of
>>>>> iterating the mount table - using rcu and ns->event to ensure currency -
>>>>> we should be able to minimize the cost of this increased use of
>>>>> /proc/self/mount*.
>>>>>
>>>>> I suspect that the best approach would be implement a cache at the
>>>>> seq_file level.
>>>>>
>>>>> One possible problem might be if applications assume that a read will
>>>>> always return a whole number of lines (it currently does).  To be
>>>>> sure we remain safe, we would only be able to use the cache for
>>>>> a read() syscall which reads the whole file.
>>>>> How big do people see /proc/self/mount* getting?  What size reads
>>>>> does 'strace' show the various programs using to read it?
>>>>
>>>> Buffer size almost always has a significant impact on IO so that's
>>>> likely a big factor but the other aspect of this is notification
>>>> of changes.
>>>>
>>>> The risk is improving the IO efficiency might just allow a higher
>>>> rate of processing of change notifications and similar symptoms
>>>> to what we have now.
>>>
>>> That's an issue that we should be able to get empirical data on.
>>> Are these systems that demonstrate problems actually showing a high
>>> rate of changes to the mount table, or is the mount table being
>>> read frequently despite not changing?
>>> To find out you could use a program like one of the answers to:
>>>
>>> https://stackoverflow.com/questions/5070801/monitoring-mount-point-changes-via-proc-mounts
>>>
>>> or instrument the kernel to add a counter to 'struct mnt_namespace' and
>>> have mounts_open_common() increment that counter and report the value as
>>> well as ns->event.  The long term ratio of the two numbers might be
>>> interesting.
>>
>> One scenario is, under heavy mount activity, the CPU usage of processes
>> systemd, udisksd2, gvfs-udisks2-volume-monitor, gvfsd-trash,
>> gnome-settings-daemon and packagekitd (and possibly gnome-shell, might
>> be a false positive though) grow to consume all available CPU.
> 
> OK, that's pretty clearly caused by lots of changes happening in quick
> succession.
> The current poll support in the kernel is slightly non-optimal in that
> it could trigger more events than necessary, but it probably doesn't
> make a big difference.
> 
> The current code means that poll/select will report an event if there
> has been an event since the last time that poll/select reported an
> event.
> It is only necessary to report an event if these has been one since the
> last time that the first record in the file was read.
> So if there are a sequence of events, then the file is read, a
> subsequent poll will report another event where there isn't really one.
> This would only be noticed if there was a delay between the event being
> reported in the kernel, and the read happening.
> 
> You could test by removing
> 		m->poll_event = event;
> from mounts_poll() and adding something like:
>  if (*pos == 0) m->poll_event = p->ns->event;
> to m_start().
> 
> 
>>
>> The processes gvfs-udisks2-volume-monitor and gnome-settings-daemon 
>> (and possibly packagekitd, might be false positive) continue to use
>> excessive CPU when the mount table is large but there is no mount/umount
>> activity.
> 
> That is strange.  I wonder if something could be triggering phantom
> events in some way.. Do they eventually settle, or does this continue
> for hours.  Is there an easy way I could reproduce?

Any large direct mount map will do it, it's the mounting (and subsequent
umount) of the autofs direct mount triggers that see the problematic
behavior.

Something like (I've called it mkdmap.c) should be sufficient:

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <string.h>
#include <stdlib.h>

char mount1[] = "gandalf:/usr/local/etc";
char dir1[] = "/direct/etc/";
char mount2[] = "gandalf:/usr/local/bin";
char dir2[] = "/direct/bin/";
char mount3[] = "gandalf:/usr/local/lib";
char dir3[] = "/direct/lib/";
char mount4[] = "gandalf:/usr/local/man";
char dir4[] = "/direct/man/";
char mount5[] = "gandalf:/usr/local/share";
char dir5[] = "/direct/share/";

void pmap(char p)
{
        char tem[80];
        char dir[100];
        int i;

        sprintf(tem, "%cXXXXXX", p);

        for (i=0; i < 1000; i++) {
                char *tname;

                strcpy(dir, dir1);
                strcat(dir, tem);
                tname = mktemp(dir);
                printf("%s\t%s\n", tname, mount1);

                strcpy(dir, dir2);
                strcat(dir, tem);
                tname = mktemp(dir);
                printf("%s\t%s\n", tname, mount2);

                strcpy(dir, dir3);
                strcat(dir, tem);
                tname = mktemp(dir);
                printf("%s\t%s\n", tname, mount3);

                strcpy(dir, dir4);
                strcat(dir, tem);
                tname = mktemp(dir);
                printf("%s\t%s\n", tname, mount4);

                strcpy(dir, dir5);
                strcat(dir, tem);
                tname = mktemp(dir);
                printf("%s\t%s\n", tname, mount5);
        }
}

int main(int argc, char **argv)
{
        pmap('a');
        pmap('b');
        pmap('c');
        pmap('d');
        pmap('e');
        pmap('f');
}

This will print 30,000 direct mount map entries to stdout.

It isn't necessary for the server (gandalf in this case) to be valid but
might be worth changing to something sensible for your environment.

My testing was done using F26, I'd be interested to know what happens
with other environments.

> 
>>
>> In this case heavy mount activity means starting autofs with a direct
>> mount map of 15k+ entries.
> 
> I'll try that tomorrow and see what happens.  Would bind mounts be
> enough?

It's not necessary to actually trigger mounts, as I say above.

Starting autofs with a large direct mount map, wait for a while to observe
which processes continue to consume resources when the mount table isn't
changing any more then shutdown autofs should be enough.

> 
>>
>> The shutdown can be a problem too but umount(2) is just too slow for it
>> to be as pronounced a problem as what we see at startup. The umount
>> slowness appears to be constant. I'm not sure if it's proportional in any
>> way to the number of mounts present on the system.
> 
> Changing synchronize_rcu() in namespace_unlock() to
> synchronize_rcu_expedited() makes umount much faster.  Al has the patch
> but hasn't responded.  Using "umount -c", which systemd now does, can
> also help I think.
> 
>>
>>>
>>>
>>>
>>>>
>>>> The suggestion is that a system that allows for incremental (diff
>>>> type) update notification is needed to allow mount table propagation
>>>> to scale well.
>>>>
>>>> That implies some as yet undefined user <-> kernel communication
>>>> protocol.
>>>
>>> I can almost conceive a mountinfo variant where new entries always
>>> appear at the end and deletions appear as just "$mnt_id".
>>> struct proc_mounts would contain a bitmap similar to mnt_id_ida which
>>> records which ids have been reported to this file.  When an event is
>>> noticed it checks for deletions and reports them before anything else.
>>> Keeping track of location in the ns->list list might be tricky.  It
>>> could be done with a new 64bit never-reused mnt id, though some other
>>> approach might be possible.
>>>
>>> An app would read to the end, then poll/select for exceptional events
>>> and keep reading.
>>
>> Keeping track of modifications is certainly one of the problems and this
>> sounds like a good start to resolving it ....
>>
>> I believe the current preferred delivery of this to be a shared library
>> interface available to user space that talks to the kernel to obtain the
>> needed information.
> 
> Any suggests at the sort of interface?  If I provided a pollable fd, and
> a function which given the fd and a 'struct mntent' (or similar) would
> fill in the mntent if info was available, would that be likely to suit?
> Not that I'm promising anything but having a framework helps guide
> thoughts.

One difficulty we face with this is the lack of a framework to build
ideas upon.

It's complicated by the need for what's done to be extensible to handling
general file system event notification.

That's hind of a contradiction in itself as getting mount table information,
notifications or otherwise, probably requires a call directly to the kernel
so that current->fs can be used to select the correct mounts list for the
requesting process namespace while general notification events imply a
disjoint system.

Ian

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2017-11-30  0:00 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-05-10  4:18 [PATCH 1/3] autofs - make disc device user accessible Ian Kent
2017-05-10  4:18 ` Ian Kent
2017-05-10  4:18 ` [PATCH 2/3] autofs - make dev ioctl version and ismountpoint " Ian Kent
2017-05-10  4:18   ` Ian Kent
2017-05-10  4:18 ` [PATCH 3/3] autofs - fix AT_NO_AUTOMOUNT not being honored Ian Kent
2017-05-12 12:49   ` Colin Walters
2017-05-12 12:49     ` Colin Walters
2017-11-21  1:53   ` NeilBrown
2017-11-22  4:28     ` Ian Kent
2017-11-22  4:28       ` Ian Kent
2017-11-23  0:36       ` Ian Kent
2017-11-23  2:21         ` NeilBrown
2017-11-23  2:46           ` Ian Kent
2017-11-23  3:04             ` Ian Kent
2017-11-23  3:04               ` Ian Kent
2017-11-23  4:49             ` NeilBrown
2017-11-23  6:34               ` Ian Kent
2017-11-27 16:01         ` Mike Marion
2017-11-27 16:01           ` Mike Marion
2017-11-27 23:43           ` Ian Kent
2017-11-28  0:29             ` Mike Marion
2017-11-28  0:29               ` Mike Marion
2017-11-29  1:17               ` NeilBrown
2017-11-29  2:13                 ` Mike Marion
2017-11-29  2:13                   ` Mike Marion
2017-11-29  2:28                   ` Ian Kent
2017-11-29  2:48                     ` NeilBrown
2017-11-29  3:14                       ` Ian Kent
2017-11-29  3:14                         ` Ian Kent
2017-11-29  2:56                 ` Ian Kent
2017-11-29  2:56                   ` Ian Kent
2017-11-29  3:45                   ` NeilBrown
2017-11-29  6:00                     ` Ian Kent
2017-11-29  6:00                       ` Ian Kent
2017-11-29  7:39                       ` NeilBrown
2017-11-30  0:00                         ` Ian Kent
2017-11-29 16:51                       ` Mike Marion
2017-11-29 16:51                         ` Mike Marion
2017-11-23  0:47       ` NeilBrown
2017-11-23  1:43         ` Ian Kent
2017-11-23  1:43           ` Ian Kent
2017-11-23  2:26           ` Ian Kent
2017-11-23  3:04           ` NeilBrown
2017-11-23  3:41             ` Ian Kent

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.