All of lore.kernel.org
 help / color / mirror / Atom feed
From: Martin Wilck <mwilck@suse.com>
To: Benjamin Marzinski <bmarzins@redhat.com>, mwilck+gmail@suse.de
Cc: device-mapper development <dm-devel@redhat.com>
Subject: Re: [PATCH v3 01/19] libmultipath: fix tur checker timeout
Date: Fri, 05 Oct 2018 12:11:18 +0200	[thread overview]
Message-ID: <fd4fb2bec1f7b4dfa5703b2fb395a8f7bc8974bb.camel@suse.com> (raw)
In-Reply-To: <20181004163118.GA3172@octiron.msp.redhat.com>

On Thu, 2018-10-04 at 11:31 -0500,  Benjamin Marzinski  wrote:
> On Mon, Oct 01, 2018 at 09:51:22PM +0200, Martin Wilck wrote:
> > On Fri, 2018-09-21 at 18:05 -0500, Benjamin Marzinski wrote:
> > > The code previously was timing out mode if ct->thread was 0 but
> > > ct->running wasn't. This combination never happens.  The idea was
> > > to
> > > timeout if for some reason the path checker tried to kill the
> > > thread,
> > > but it didn't die.  The correct thing to check for this is ct-
> > > > holders.
> > > 
> > > ct->holders will always be at least one when libcheck_check() is
> > > called,
> > > since libcheck_free() won't get called until the device is no
> > > longer
> > > being checked. So, if ct->holders is 2, that means that the tur
> > > thread
> > > is has not shut down yet.
> > > 
> > > Also, instead of returning PATH_TIMEOUT whenever the thread
> > > hasn't
> > > died,
> > > it should only time out if the thread didn't successfully get a
> > > value,
> > > which means the previous state was already PATH_TIMEOUT.
> > 
> > What about PATH_UNCHECKED?
> > 
> 
> I don't see how that could happen. In this state we've definitely set
> ct->state to PATH_PENDING when we started the thread. The thread will
> only change  ct->state to the return of tur_check(), which doesn't
> ever return PATH_UNCHECKED. Am I missing something here?

OK. The only thing you missed was a comment :-)

This stuff is so subtle that we should describe exactly how
the locking works, which actor is supposed/entitled to set what field
to what value in what situation, and so on.

> 
> > > 
> > > Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
> > > ---
> > >  libmultipath/checkers/tur.c | 15 +++++++++------
> > >  1 file changed, 9 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/libmultipath/checkers/tur.c
> > > b/libmultipath/checkers/tur.c
> > > index bf8486d..275541f 100644
> > > --- a/libmultipath/checkers/tur.c
> > > +++ b/libmultipath/checkers/tur.c
> > > @@ -355,12 +355,15 @@ int libcheck_check(struct checker * c)
> > >  		}
> > >  		pthread_mutex_unlock(&ct->lock);
> > >  	} else {
> > > -		if (uatomic_read(&ct->running) != 0) {
> > > -			/* pthread cancel failed. continue in sync mode
> > > */
> > > -			pthread_mutex_unlock(&ct->lock);
> > > -			condlog(3, "%s: tur thread not responding",
> > > -				tur_devt(devt, sizeof(devt), ct));
> > > -			return PATH_TIMEOUT;
> > > +		if (uatomic_read(&ct->holders) > 1) {
> > > +			/* pthread cancel failed. If it didn't get the
> > > path
> > > +			   state already, timeout */
> > > +			if (ct->state == PATH_PENDING) {
> > > +				pthread_mutex_unlock(&ct->lock);
> > > +				condlog(3, "%s: tur thread not
> > > responding",
> > > +					tur_devt(devt, sizeof(devt),
> > > ct));
> > > +				return PATH_TIMEOUT;
> > > +			}
> > >  		}
> > 
> > I'm not quite getting it yet. We're entering this code path if 
> > ct->thread is 0. As you point out, if there are >1 holders, a
> > previously cancelled TUR thread must be still "alive". But now we
> > want
> > to check *again*. The previous code refused to do that - a single
> > hanging TUR thread could block further checks from happening,
> > forever.
> > The PATH_TIMEOUT return code was actually questionable - in this
> > libcheck_check() invocation, we didn't even try to check, so
> > nothing
> > could actually time out (but that's obviously independent of your
> > patch).
> > 
> > With your patch, if ct->state != PATH_PENDING, we'd try to start
> > another TUR thread although there's still a hanging one. That's not
> > necessarily wrong, but I fail to see why it depends on ct->state.
> > Anyway, if we do this, we take the risk of starting more and more
> > TUR
> > threads which might all end up hanging; I wonder if we should stop
> > creating more threads if ct->holders grows further (e.g. > 5).
> 
> It's been a while, and I'm not exactly sure what I was thinking here.
> But IIRC the idea was that if the state isn't set yet, then the old
> thread could mess with the results of a future thread. Also, it was
> to
> avoid a corner case where the tur checker was called, started the
> thread, got the result before the checker exitted, cancelled the
> thread
> and returned the result and then was called very shortly afterwards,
> before the thread could clean itself up.  In that case, the right
> thing
> to do (I thought) would be to start a new thread, because the other
> one
> would be ending soon.  In truth, starting a new thread at all is
> probably a bad idea, since the old thread will still mess with the
> checker context on exit. 
> 
> A better idea might be to simply fail back to a syncronous call to
> tur_check() when you notice a cancelled thread that hasn't exitted.
> That can cause all the other checkers to get delayed, but at least in
> that case you are still checking paths. The other option is to do as
> before and just not check this path, which won't effect the other
> checkers. It seems very unlikely that the thread could remain
> uncancelled forever, especially if the path was usable.
>
> thoughts?

Generally speaking, we're deeply in the realm of highly unlikely
situations I would say. But we should get it right eventually.

Maybe we can add logic to the tur thread to keep its hands off the
context if it's a "zombie", like below (just a thought, untested)?

Martin


diff --git a/libmultipath/checkers/tur.c b/libmultipath/checkers/tur.c
index bf8486d3..e8493ca8 100644
--- a/libmultipath/checkers/tur.c
+++ b/libmultipath/checkers/tur.c
@@ -219,11 +219,18 @@ static void cleanup_func(void *data)
 	rcu_unregister_thread();
 }
 
+#define tur_thread_quit_unless_owner(__ct)	   \
+	if (__ct->thread != pthread_self()) {	   \
+		pthread_mutex_unlock(&__ct->lock); \
+		pthread_exit(NULL);		   \
+	}
+
 static void copy_msg_to_tcc(void *ct_p, const char *msg)
 {
 	struct tur_checker_context *ct = ct_p;
 
 	pthread_mutex_lock(&ct->lock);
+	tur_thread_quit_unless_owner(ct);
 	strlcpy(ct->message, msg, sizeof(ct->message));
 	pthread_mutex_unlock(&ct->lock);
 }
@@ -243,6 +250,7 @@ static void *tur_thread(void *ctx)
 
 	/* TUR checker start up */
 	pthread_mutex_lock(&ct->lock);
+	tur_thread_quit_unless_owner(ct);
 	ct->state = PATH_PENDING;
 	ct->message[0] = '\0';
 	pthread_mutex_unlock(&ct->lock);
@@ -252,6 +260,7 @@ static void *tur_thread(void *ctx)
 
 	/* TUR checker done */
 	pthread_mutex_lock(&ct->lock);
+	tur_thread_quit_unless_owner(ct);
 	ct->state = state;
 	pthread_cond_signal(&ct->active);
 	pthread_mutex_unlock(&ct->lock);




-- 
Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107
SUSELinux GmbH, GF: Felix Imendörffer, Jane Smithard,  Graham Norton
HRB 21284 (AG Nürnberg)


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

  reply	other threads:[~2018-10-05 10:11 UTC|newest]

Thread overview: 58+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-09-21 23:05 [PATCH v3 00/19] Misc Multipath patches Benjamin Marzinski
2018-09-21 23:05 ` [PATCH v3 01/19] libmultipath: fix tur checker timeout Benjamin Marzinski
2018-10-01 19:51   ` Martin Wilck
2018-10-04 16:31     ` Benjamin Marzinski
2018-10-05 10:11       ` Martin Wilck [this message]
2018-10-05 17:02         ` Benjamin Marzinski
2018-10-05 19:23           ` Martin Wilck
2018-09-21 23:05 ` [PATCH v3 02/19] libmultipath: fix tur checker double locking Benjamin Marzinski
2018-10-01 20:09   ` Martin Wilck
2018-10-01 20:44     ` Martin Wilck
2018-10-04 16:47       ` Benjamin Marzinski
2018-10-04 16:45     ` Benjamin Marzinski
2018-10-05 10:25       ` Martin Wilck
2018-10-05 17:10         ` Benjamin Marzinski
2018-10-05 19:07           ` Martin Wilck
2018-09-21 23:05 ` [PATCH v3 03/19] libmultipath: fix tur memory misuse Benjamin Marzinski
2018-10-01 20:59   ` Martin Wilck
2018-10-02  7:48     ` Martin Wilck
2018-09-21 23:05 ` [PATCH v3 04/19] libmultipath: cleanup tur locking Benjamin Marzinski
2018-10-01 21:08   ` Martin Wilck
2018-09-21 23:05 ` [PATCH v3 05/19] libmultipath: fix tur checker timeout issue Benjamin Marzinski
2018-10-01 21:09   ` Martin Wilck
2018-09-21 23:05 ` [PATCH v3 06/19] libmultipath: fix set_int error path Benjamin Marzinski
2018-10-01 21:17   ` Martin Wilck
2018-09-21 23:05 ` [PATCH v3 07/19] libmultipath: fix length issues in get_vpd_sgio Benjamin Marzinski
2018-10-01 21:25   ` Martin Wilck
2018-09-21 23:05 ` [PATCH v3 08/19] libmultipath: _install_keyword cleanup Benjamin Marzinski
2018-10-01 21:26   ` Martin Wilck
2018-09-21 23:05 ` [PATCH v3 09/19] libmultipath: remove unused code Benjamin Marzinski
2018-10-01 21:28   ` Martin Wilck
2018-09-21 23:05 ` [PATCH v3 10/19] libmultipath: fix memory issue in path_latency prio Benjamin Marzinski
2018-10-01 21:30   ` Martin Wilck
2018-09-21 23:05 ` [PATCH v3 11/19] libmultipath: fix null dereference int alloc_path_group Benjamin Marzinski
2018-10-01 21:33   ` Martin Wilck
2018-09-21 23:05 ` [PATCH v3 12/19] libmutipath: don't use malformed uevents Benjamin Marzinski
2018-10-01 21:31   ` Martin Wilck
2018-09-21 23:05 ` [PATCH v3 13/19] multipath: fix max array size in print_cmd_valid Benjamin Marzinski
2018-10-01 21:35   ` Martin Wilck
2018-09-21 23:05 ` [PATCH v3 14/19] multipathd: function return value tweaks Benjamin Marzinski
2018-10-01 21:37   ` Martin Wilck
2018-09-21 23:05 ` [PATCH v3 15/19] multipathd: minor fixes Benjamin Marzinski
2018-10-01 21:38   ` Martin Wilck
2018-09-21 23:05 ` [PATCH v3 16/19] multipathd: remove useless check and fix format Benjamin Marzinski
2018-10-01 21:40   ` Martin Wilck
2018-09-21 23:05 ` [PATCH v3 17/19] multipathd: fix memory leak on error in configure Benjamin Marzinski
2018-10-01 21:42   ` Martin Wilck
2018-09-21 23:05 ` [PATCH v3 18/19] libmultipath: Don't blank intialized paths Benjamin Marzinski
2018-10-01 22:00   ` Martin Wilck
2018-10-02 22:37     ` Martin Wilck
2018-10-05 19:38       ` Benjamin Marzinski
2018-10-08  9:41         ` Martin Wilck
2018-10-09 22:20           ` Benjamin Marzinski
2018-10-08  9:35   ` Martin Wilck
2018-09-21 23:05 ` [PATCH v3 19/19] libmultipath: Fixup updating paths Benjamin Marzinski
2018-10-01 22:30   ` Martin Wilck
2018-10-05 20:32     ` Benjamin Marzinski
2018-10-07  8:36 ` [PATCH v3 00/19] Misc Multipath patches Christophe Varoqui
2018-10-09 16:13   ` Benjamin Marzinski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=fd4fb2bec1f7b4dfa5703b2fb395a8f7bc8974bb.camel@suse.com \
    --to=mwilck@suse.com \
    --cc=bmarzins@redhat.com \
    --cc=dm-devel@redhat.com \
    --cc=mwilck+gmail@suse.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.