* Near-simultaneous automount of multiple directories fails
@ 2016-04-08  7:55 Marcel De Boer
  2016-04-08  8:54 ` Ian Kent
  0 siblings, 1 reply; 5+ messages in thread
From: Marcel De Boer @ 2016-04-08  7:55 UTC (permalink / raw)
  To: autofs

Hi!

I've already reported this on the CentOS bug tracker a while ago, but I 
thought I'd report it here too.

https://bugs.centos.org/view.php?id=9835

Summarized (there's more information in the bug report): on one of our 
servers we initially saw that every few days one home directory became 
inaccessible. This happened to two different home directories (but only one 
at a time) out of the couple hundred we have. We traced this to 
simultaneously scheduled cron scripts running out of the affected 
home directories, which caused both directories to be mounted nearly 
simultaneously.

A test setup on a different machine (the primary description from the bug 
report, as the server was not stock CentOS) also showed that if we had 
cron simultaneously mount four directories every 10 minutes, only half of 
them would get mounted every time. On this machine an RPM rebuild of 
autofs made the issue disappear, but it was much more persistent on the 
server.

In the end there seems to be an issue in mount_mount() from 
mount_nfs.c; to my untrained eye, it looks like it can get called 
simultaneously from different threads, which then change shared 
information, probably the 'hosts' or 'tmp' lists.

I made a patch that seems to work reliably for our situation, but it's 
very crude: it just makes sure everything touching the 'hosts' list (and 
everything else during that time) does not run in parallel. It might be a 
starting point for someone who knows the code better, though. (The patch was 
made against the code used in the 5.0.5_115 CentOS 6 RPM.)

The server has received some more upgrades in the meantime, so we may no 
longer be able to reproduce it on that system.

Kind regards,
 	Marcel de Boer


--- autofs-5.0.5-orig/modules/mount_nfs.c	2016-01-05 15:26:55.993014650 +0100
+++ autofs-5.0.5/modules/mount_nfs.c	2016-01-05 15:25:51.434011526 +0100
@@ -40,6 +40,9 @@
  static struct mount_mod *mount_bind = NULL;
  static int init_ctr = 0;

+/* Multiple access to hosts workaround */
+static pthread_mutex_t host_list_mutex = PTHREAD_MUTEX_INITIALIZER;
+
  int mount_init(void **context)
  {
  	/* Make sure we have the local mount method available */
@@ -190,7 +193,9 @@
  		      nfsoptions, nobind, nosymlink, ro);
  	}

+	pthread_mutex_lock(&host_list_mutex);
  	if (!parse_location(ap->logopt, &hosts, what, flags)) {
+		pthread_mutex_unlock(&host_list_mutex);
  		info(ap->logopt, MODPREFIX "no hosts available");
  		return 1;
  	}
@@ -235,6 +240,7 @@

  dont_probe:
  	if (!hosts) {
+		pthread_mutex_unlock(&host_list_mutex);
  		info(ap->logopt, MODPREFIX "no hosts available");
  		return 1;
  	}
@@ -264,6 +270,7 @@
  		char *estr = strerror_r(errno, buf, MAX_ERR_BUF);
  		error(ap->logopt,
  		      MODPREFIX "mkdir_path %s failed: %s", fullpath, estr);
+		pthread_mutex_unlock(&host_list_mutex);
  		return 1;
  	}

@@ -300,6 +307,7 @@
  			/* Success - we're done */
  			if (!err) {
  				free_host_list(&hosts);
+				pthread_mutex_unlock(&host_list_mutex);
  				return 0;
  			}

@@ -325,6 +333,7 @@
  			if (!loc) {
  				char *estr = strerror_r(errno, buf, MAX_ERR_BUF);
  				error(ap->logopt, "malloc: %s", estr);
+				pthread_mutex_unlock(&host_list_mutex);
  				return 1;
  			}
  			if (this->addr->sa_family == AF_INET6) {
@@ -338,6 +347,7 @@
  			if (!loc) {
  				char *estr = strerror_r(errno, buf, MAX_ERR_BUF);
  				error(ap->logopt, "malloc: %s", estr);
+				pthread_mutex_unlock(&host_list_mutex);
  				return 1;
  			}
  			strcpy(loc, this->name);
@@ -365,6 +375,7 @@
  			info(ap->logopt, MODPREFIX "mounted %s on %s", loc, fullpath);
  			free(loc);
  			free_host_list(&hosts);
+			pthread_mutex_unlock(&host_list_mutex);
  			return 0;
  		}

@@ -374,6 +385,7 @@

  forced_fail:
  	free_host_list(&hosts);
+	pthread_mutex_unlock(&host_list_mutex);

  	/* If we get here we've failed to complete the mount */



-- 
Marcel de Boer
Test engineer, Service Routing R&D, IP/Optical Networks
Nokia, Antwerp, Belgium
--
To unsubscribe from this list: send the line "unsubscribe autofs" in


* Re: Near-simultaneous automount of multiple directories fails
  2016-04-08  7:55 Near-simultaneous automount of multiple directories fails Marcel De Boer
@ 2016-04-08  8:54 ` Ian Kent
  2016-04-08  9:46   ` Ian Kent
  0 siblings, 1 reply; 5+ messages in thread
From: Ian Kent @ 2016-04-08  8:54 UTC (permalink / raw)
  To: Marcel De Boer, autofs

On Fri, 2016-04-08 at 09:55 +0200, Marcel De Boer wrote:
> Hi!
> 
> I've already reported this on the CentOS bug tracker a while ago, but
> I 
> thought I'd report it here too.
> 
> https://bugs.centos.org/view.php?id=9835
> 
> Summarized (there's more information on the bug report): on one of our
> servers we initially saw that every few days one home directory became
> inaccessible. This happened to two different homedirectories (but only
> one 
> at a time) out of the couple hundred we have. We traced this to 
> simultaneously scheduled cron scripts running out of the affected 
> homedirectories, which caused both directories to be mounted nearly 
> simultaneously.
> 
> A test setup on a different machine (the primary description from the
> bug 
> report, as the server was not stock CentOS) also showed that if we had
> cron simultaneously mount four directories every 10 minutes, only half
> of 
> them would get mounted every time. On this machine an RPM rebuild of 
> autofs made the issue disappear, but it was much more persistent on
> the 
> server.
> 
> Eventually it seems that there is an issue in mount_mount() from 
> mount_nfs.c; to my untrained eye, it looks like it can get called 
> simultaneously from different threads, where they change shared 
> information, probably the 'hosts' or 'tmp' lists.

Whatever the problem is, it isn't access to either of these two variables
or the lists they may represent.

They are both local variables of the mount_mount() function and so
cannot be accessed simultaneously by any other function.
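
The distinction can be sketched in a few lines of C. This is an illustration
only, not autofs code: struct host, worker() and run_workers() are made-up
names. Each thread builds and frees a list through its own local pointer,
which lives on that thread's stack, while the file-scope counter is a single
object shared by all threads and is therefore the only thing that needs a
mutex here.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define MAX_THREADS 32	/* arbitrary limit for this sketch */

struct host {
	struct host *next;
};

static int shared_total;	/* file scope: one copy for all threads */
static pthread_mutex_t total_mutex = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
	struct host *hosts = NULL;	/* local: one copy per call */
	int want = *(int *)arg;
	int len = 0;
	int i;

	/* Build a private list; no other thread can reach it. */
	for (i = 0; i < want; i++) {
		struct host *h = malloc(sizeof(*h));

		if (!h)
			break;
		h->next = hosts;
		hosts = h;
	}
	/* Walk and free it again, counting the nodes. */
	while (hosts) {
		struct host *next = hosts->next;

		free(hosts);
		hosts = next;
		len++;
	}

	/* Only the shared counter needs serializing. */
	pthread_mutex_lock(&total_mutex);
	shared_total += len;
	pthread_mutex_unlock(&total_mutex);

	*(int *)arg = len;	/* report the length this thread saw */
	return NULL;
}

/* Returns the summed list lengths, or -1 if any thread saw a wrong length. */
int run_workers(int nthreads, int per_thread)
{
	pthread_t tid[MAX_THREADS];
	int len[MAX_THREADS];
	int i, started = 0;

	if (nthreads < 1 || nthreads > MAX_THREADS)
		return -1;
	shared_total = 0;
	for (i = 0; i < nthreads; i++) {
		len[i] = per_thread;
		if (pthread_create(&tid[i], NULL, worker, &len[i]) != 0)
			break;
		started++;
	}
	for (i = 0; i < started; i++)
		pthread_join(tid[i], NULL);
	for (i = 0; i < started; i++)
		if (len[i] != per_thread)
			return -1;
	return shared_total;
}
```

The caveat is that a local pointer only protects what it points to as long as
that memory is not also reachable from somewhere shared, for example a static
buffer inside a library function.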

Ian

* Re: Near-simultaneous automount of multiple directories fails
  2016-04-08  8:54 ` Ian Kent
@ 2016-04-08  9:46   ` Ian Kent
  2016-04-08 11:37     ` Marcel De Boer
  0 siblings, 1 reply; 5+ messages in thread
From: Ian Kent @ 2016-04-08  9:46 UTC (permalink / raw)
  To: Marcel De Boer, autofs

On Fri, 2016-04-08 at 16:54 +0800, Ian Kent wrote:
> On Fri, 2016-04-08 at 09:55 +0200, Marcel De Boer wrote:
> > Hi!
> > 
> > I've already reported this on the CentOS bug tracker a while ago,
> > but
> > I 
> > thought I'd report it here too.
> > 
> > https://bugs.centos.org/view.php?id=9835
> > 
> > Summarized (there's more information on the bug report): on one of
> > our
> > servers we initially saw that every few days one home directory
> > became
> > inaccessible. This happened to two different homedirectories (but
> > only
> > one 
> > at a time) out of the couple hundred we have. We traced this to 
> > simultaneously scheduled cron scripts running out of the affected 
> > homedirectories, which caused both directories to be mounted nearly 
> > simultaneously.
> > 
> > A test setup on a different machine (the primary description from
> > the
> > bug 
> > report, as the server was not stock CentOS) also showed that if we
> > had
> > cron simultaneously mount four directories every 10 minutes, only
> > half
> > of 
> > them would get mounted every time. On this machine an RPM rebuild of
> > autofs made the issue disappear, but it was much more persistent on
> > the 
> > server.
> > 
> > Eventually it seems that there is an issue in mount_mount() from 
> > mount_nfs.c; to my untrained eye, it looks like it can get called 
> > simultaneously from different threads, where they change shared 
> > information, probably the 'hosts' or 'tmp' lists.
> 
> Whatever the problem is it isn't access to either of these two
> variables
> or the lists they may represent.
> 
> They are both local variables of the mount_mount() function and so
> cannot be accessed simultaneously by any other function.

Btw, there has been no actual RHEL release of revision 115.

Only 113 in RHEL-6.7 and (probably) revision 122 will be RHEL-6.8.
So I wonder what else went into revision 115.

AFAICS revision 115, if it is truly from RHEL, is a mid-debug
development revision and really shouldn't be used unless provided by
Red Hat support to get development feedback from testing.

We probably shouldn't work with revision 122 yet, so maybe we should
work with revision 113; not sure about that though.

Anyway it could be function calls to some other shared library causing a
problem.

AFAICS the autofs code called in this region is re-entrant in the same
way as the hosts and tmp variables are in mount_mount(), so there's
something else going on.
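
As a generic illustration of what a non-re-entrant library call looks like
(this is not a claim about which call, if any, is at fault in autofs), the
sketch below parses a "host:path,host:path" string twice. The function names
are made up. strtok() keeps its position in hidden static state, so the
nested call for the ':' separator clobbers the outer ',' scan; strtok_r()
keeps the position in a caller-supplied pointer and does not have that
problem. The same kind of hidden state is what makes such functions unsafe
when called from several threads.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* for strtok_r() */
#endif
#include <assert.h>
#include <string.h>

/* Counts "host:path" pairs; each scan keeps its own saved position. */
int count_pairs_reentrant(char *list)
{
	char *outer_state, *inner_state;
	char *entry;
	int pairs = 0;

	for (entry = strtok_r(list, ",", &outer_state); entry;
	     entry = strtok_r(NULL, ",", &outer_state)) {
		char *host = strtok_r(entry, ":", &inner_state);
		char *path = host ? strtok_r(NULL, ":", &inner_state) : NULL;

		if (host && path)
			pairs++;
	}
	return pairs;
}

/* Same logic with strtok(): the inner scan overwrites the outer position. */
int count_pairs_static(char *list)
{
	char *entry;
	int pairs = 0;

	for (entry = strtok(list, ","); entry; entry = strtok(NULL, ",")) {
		char *host = strtok(entry, ":");
		char *path = host ? strtok(NULL, ":") : NULL;

		if (host && path)
			pairs++;
	}
	return pairs;
}
```

Both functions modify the buffer they are given, so they need a writable copy
of the string. For a list with three entries the re-entrant version sees all
three, while the strtok() version stops after the first one because the inner
scan has left the saved position at the end of that entry.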

I'm not sure I could reproduce this because I have a stress test (used
for RHEL) that uses (IIRC) 8 concurrent threads to test mount
concurrency and to test for mount-to-expire races.

The maps used are somewhat more complex than what you have here so
perhaps I missed this point with that test.

However, I've recently written another RHEL test (based on this test)
that uses a simple indirect map with the 8 concurrent threads to try and
duplicate a different problem.

I would have thought this test would expose this sort of problem, but
after (I can't actually remember the longest run) about three days of
continuous running I didn't see any problems.

Granted it was a different scenario to yours though.

So I think we need to narrow down where this is occurring.

To start with, I'd add mutexes around just the parse_location() and
prune_host_list() functions and then, if that also resolves the problem,
drill down from there.

Something like (totally untested):

debug

From: Ian Kent <raven@themaw.net>


---
 modules/mount_nfs.c |    9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/modules/mount_nfs.c b/modules/mount_nfs.c
index 84f7bda..95df40f 100644
--- a/modules/mount_nfs.c
+++ b/modules/mount_nfs.c
@@ -54,6 +54,8 @@ int mount_init(void **context)
 	return !mount_bind;
 }
 
+static pthread_mutex_t host_list_mutex = PTHREAD_MUTEX_INITIALIZER;
+
 int mount_mount(struct autofs_point *ap, const char *root, const char *name, int name_len,
 		const char *what, const char *fstype, const char *options,
 		void *context)
@@ -190,16 +192,20 @@ int mount_mount(struct autofs_point *ap, const char *root, const char *name, int
 		      nfsoptions, nobind, nosymlink, ro);
 	}
 
+	pthread_mutex_lock(&host_list_mutex);
 	if (!parse_location(ap->logopt, &hosts, what, flags)) {
 		info(ap->logopt, MODPREFIX "no hosts available");
+		pthread_mutex_unlock(&host_list_mutex);
 		return 1;
 	}
 	/*
 	 * We can't probe protocol rdma so leave it to mount.nfs(8)
 	 * and and suffer the delay if a server isn't available.
 	 */
-	if (rdma)
+	if (rdma) {
+		pthread_mutex_unlock(&host_list_mutex);
 		goto dont_probe;
+	}
 
 	/*
 	 * If this is a singleton mount, and NFSv4 only hasn't been asked
@@ -232,6 +238,7 @@ int mount_mount(struct autofs_point *ap, const char *root, const char *name, int
 	} else {
 		prune_host_list(ap->logopt, &hosts, vers, port);
 	}
+	pthread_mutex_unlock(&host_list_mutex);
 
 dont_probe:
 	if (!hosts) {

* Re: Near-simultaneous automount of multiple directories fails
  2016-04-08  9:46   ` Ian Kent
@ 2016-04-08 11:37     ` Marcel De Boer
  2016-04-10  2:34       ` Ian Kent
  0 siblings, 1 reply; 5+ messages in thread
From: Marcel De Boer @ 2016-04-08 11:37 UTC (permalink / raw)
  To: EXT Ian Kent; +Cc: autofs

Hi!

>> Whatever the problem is it isn't access to either of these two 
>> variables or the lists they may represent.
>>
>> They are both local variables of the mount_mount() function and so
>> cannot be accessed simultaneously by any other function.

Too bad... that means my changes probably just mixed up the timing enough 
to avoid the problem.

> Btw, there has been no actual RHEL release of revision 115.
>
> Only 113 in RHEL-6.7 and (probably) revision 122 will be RHEL-6.8.
> So I wonder what else went into revision 115.
<...>
> We probably shouldn't work with revision 122 yet so may be we should
> work with revision 113, not sure about that though.

Ah wait... because it was just for local testing, I also changed the 
patchlevel so yum wouldn't complain. Judging from the build machine 
history, it actually is -113. Postponing writing this mail for too long 
made me forget too much...

> I'm not sure I could reproduce this because I have a stress test (used
> for RHEL) that uses (IIRC) 8 concurrent threads to test mount
> concurrency and to test for mount to expire races.
>
> The maps used are somewhat more complex than what you have here so
> perhaps I missed this point with that test.

The configuration for the server uses indirect maps from the local 
filesystem. All other machines get a slightly different config through 
NIS.

> However, I've recently written another RHEL test (based on this test)
> that uses a simple indirect map with the 8 concurrent threads to try and
> duplicate a different problem.
>
> I would have though this test would expose this sort of problem but
> after (I can't actually remember the longest run) about three days of
> continuous running I didn't see any problems.
>
> Granted it was a different scenario to yours though.

Of course it also looks timing-related, so there's no telling in exactly 
which configuration it'll pop up. For the machine I used for testing (not 
the same hardware as the server), the issue already disappeared when I 
locally rebuilt the same RPM as the one that was already installed.

I already noticed changes in the frequency when I changed the versions of 
supporting packages (libtirpc) or ran it in the foreground or with 
debugging.

> So I think we need to narrow down where this is occurring.
>
> To start with I'd add mutexes around just the parse_location() and
> prune_host_list() functions and then if that also resolves the problem
> drill down from there.

I'll see if I can do that next week (even though the server is busy, it's 
not a disaster if it happens, but I prefer to be around to unwedge it.)

Thanks!

Kind regards,
 	Marcel de Boer

-- 
Marcel de Boer
Test engineer, Service Routing R&D, IP/Optical Networks
Nokia, Antwerp, Belgium


* Re: Near-simultaneous automount of multiple directories fails
  2016-04-08 11:37     ` Marcel De Boer
@ 2016-04-10  2:34       ` Ian Kent
  0 siblings, 0 replies; 5+ messages in thread
From: Ian Kent @ 2016-04-10  2:34 UTC (permalink / raw)
  To: Marcel De Boer; +Cc: autofs

On Fri, 2016-04-08 at 13:37 +0200, Marcel De Boer wrote:
> Hi!
> 
> > > Whatever the problem is it isn't access to either of these two 
> > > variables or the lists they may represent.
> > > 
> > > They are both local variables of the mount_mount() function and so
> > > cannot be accessed simultaneously by any other function.
> 
> Too bad... that means my changes probably just mixed up the timing
> enough 
> to avoid the problem.
> 
> > Btw, there has been no actual RHEL release of revision 115.
> > 
> > Only 113 in RHEL-6.7 and (probably) revision 122 will be RHEL-6.8.
> > So I wonder what else went into revision 115.
> <...>
> > We probably shouldn't work with revision 122 yet so may be we should
> > work with revision 113, not sure about that though.
> 
> Ah wait... because it was just for local testing, I also changed the 
> patchlevel so yum wouldn't complain. Judging from the build machine 
> history, it actually is -113. Postponing writing this mail for too
> long 
> made me forget too much...
> 
> > I'm not sure I could reproduce this because I have a stress test
> > (used
> > for RHEL) that uses (IIRC) 8 concurrent threads to test mount
> > concurrency and to test for mount to expire races.
> > 
> > The maps used are somewhat more complex than what you have here so
> > perhaps I missed this point with that test.
> 
> The configuration for the server uses indirect maps from the local 
> filesystem. All other machines get a slightly different config through
> NIS.
> 
> > However, I've recently written another RHEL test (based on this
> > test)
> > that uses a simple indirect map with the 8 concurrent threads to try
> > and
> > duplicate a different problem.
> > 
> > I would have though this test would expose this sort of problem but
> > after (I can't actually remember the longest run) about three days
> > of
> > continuous running I didn't see any problems.
> > 
> > Granted it was a different scenario to yours though.
> 
> Of course it also looks timing-related, so there's no telling in
> exactly 
> which configuration it'll pop up. For the machine I used for testing
> (not 
> the same hardware as the server), the issue already disappeared when I
> locally rebuilt the same RPM as the one that was already installed.
> 
> I already noticed changes in the frequency when I changed the versions
> of 
> supporting packages (libtirpc) or ran it in the foreground or with 
> debugging.
> 
> > So I think we need to narrow down where this is occurring.
> > 
> > To start with I'd add mutexes around just the parse_location() and
> > prune_host_list() functions and then if that also resolves the
> > problem
> > drill down from there.
> 
> I'll see if I can do that next week (even though the server is busy,
> it's 
> not a disaster if it happens, but I prefer to be around to unwedge
> it.)

I can help with that by providing patches.

To start with, the change here is quite conservative; the next one would be
much less so (and quite a bit more difficult to write). The idea is
much like a kernel bisect: narrow the search quickly.
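
The bisect idea can be sketched outside of autofs as follows. Everything in
this block is hypothetical: the stage_fn type, run_stages() and probe_stage()
are illustration names, and real code in mount_mount() is not split into
stages like this. The point is only that the serialized region is described
by a half-open range [lo, hi) of steps, so it can be halved from one test run
to the next until the problem reappears.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

typedef int (*stage_fn)(void *ctx);

static pthread_mutex_t debug_mutex = PTHREAD_MUTEX_INITIALIZER;

/*
 * Run n stages in order, holding debug_mutex while stages lo..hi-1 run.
 * Returns the index of the first failing stage, or -1 if all succeeded.
 */
int run_stages(const stage_fn *stages, int n, int lo, int hi, void *ctx)
{
	int i, held = 0, failed = -1;

	for (i = 0; i < n; i++) {
		if (i == lo && lo < hi) {
			pthread_mutex_lock(&debug_mutex);
			held = 1;
		}
		if (stages[i](ctx) != 0) {
			failed = i;
			break;
		}
		if (held && i == hi - 1) {
			pthread_mutex_unlock(&debug_mutex);
			held = 0;
		}
	}
	/* Covers early failure and a range that extends past the last stage. */
	if (held)
		pthread_mutex_unlock(&debug_mutex);
	return failed;
}

#define MAX_PROBES 8

struct probe {
	int held[MAX_PROBES];	/* 1 if the mutex was held during stage i */
	int next;
};

/* Example stage: records whether it ran inside the serialized region. */
static int probe_stage(void *ctx)
{
	struct probe *p = ctx;
	int rc = pthread_mutex_trylock(&debug_mutex);

	if (rc == 0)
		pthread_mutex_unlock(&debug_mutex);
	if (p->next < MAX_PROBES)
		p->held[p->next++] = (rc == EBUSY);
	return 0;
}
```

With four stages and the range [1, 3), only the second and third stage run
under the mutex. If the failure stays away with [0, 4) but comes back with
[0, 2), the next run would try [2, 4), and so on.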

> 
> Thanks!
> 
> Kind regards,
>  	Marcel de Boer
> 
