All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/1]: tools: Retry blktap2 tapdisk message on interrupt.
@ 2013-03-19  7:26 Dr. Greg Wettstein
  2013-03-25 12:40 ` Ian Jackson
  0 siblings, 1 reply; 3+ messages in thread
From: Dr. Greg Wettstein @ 2013-03-19  7:26 UTC (permalink / raw)
  To: xen-devel; +Cc: Ian.Jackson, jbeulich

Re-start blktap2 IPC select call on interrupt.

We hunted this miserable bug for a long time.

The teardown of a blktap2 tapdisk instance is being carried out
inconsistently up to and including the 4.2.1 release.  The
problem appears to be a classic 'Heisenbug' which disappears if a
single function call is added to the tapdisk shutdown path.  It
is likely this bug has been in existence for the life of the
blktap2 code.

Control messages to manipulate a tapdisk instance are sent over a
UNIX domain socket.  A select call is used on both the read and
write paths to wait on I/O and to set a timeout for the
transmission and reception of the control plane messages.

The existing code fails receipt or transmission of the control message
on any type of error return from the select call.  The xl control
process receives an interrupt while waiting in the select call which
in turn causes an error return with SIGINT as the return code.

This prematurely terminates the teardown of the tapdisk instance
leaving it in various states of shutdown.  Since multiple messages
are needed to implement a full teardown the tapdisk instance can be
left in various states ranging from fully connected to only the minor
being left allocated.

The fix is straight forward.  Check the return code from the
select call and re-try read or write of the control message if
errno is sent to EINTR.  The problem manifests itself in the read
path but there appears to be little reason to not add the fix to
the write path as well.  Both paths appear to be cut-and-paste
copies of each other.

Signed-off-by: Dr. Greg Wettstein <greg@enjellic.com>
---
 tools/blktap2/control/tap-ctl-ipc.c |   10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/tools/blktap2/control/tap-ctl-ipc.c b/tools/blktap2/control/tap-ctl-ipc.c
index cc6160e..c8aad1c 100644
--- a/tools/blktap2/control/tap-ctl-ipc.c
+++ b/tools/blktap2/control/tap-ctl-ipc.c
@@ -64,8 +64,11 @@ tap_ctl_read_message(int fd, tapdisk_message_t *message, int timeout)
 		FD_SET(fd, &readfds);
 
 		ret = select(fd + 1, &readfds, NULL, NULL, t);
-		if (ret == -1)
+		if (ret == -1) {
+			if (errno == EINTR)
+				continue;
 			break;
+		}
 		else if (FD_ISSET(fd, &readfds)) {
 			ret = read(fd, message + offset, len - offset);
 			if (ret <= 0)
@@ -114,8 +117,11 @@ tap_ctl_write_message(int fd, tapdisk_message_t *message, int timeout)
 		 * bit more time than expected. */
 
 		ret = select(fd + 1, NULL, &writefds, NULL, t);
-		if (ret == -1)
+		if (ret == -1) {
+			if (errno == EINTR)
+				continue;
 			break;
+		}
 		else if (FD_ISSET(fd, &writefds)) {
 			ret = write(fd, message + offset, len - offset);
 			if (ret <= 0)

^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH 1/1]: tools: Retry blktap2 tapdisk message on interrupt.
  2013-03-19  7:26 [PATCH 1/1]: tools: Retry blktap2 tapdisk message on interrupt Dr. Greg Wettstein
@ 2013-03-25 12:40 ` Ian Jackson
  0 siblings, 0 replies; 3+ messages in thread
From: Ian Jackson @ 2013-03-25 12:40 UTC (permalink / raw)
  To: greg; +Cc: jbeulich, xen-devel

Dr. Greg Wettstein writes ("[PATCH 1/1]: tools: Retry blktap2 tapdisk message on interrupt."):
> Re-start blktap2 IPC select call on interrupt.

Thanks.  I have applied this patch.  But I think you should worry that
the fix may be incomplete.  I see calls to read() as well, and read()
can also fail with EINTR.

Ian.

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH 1/1]: tools: Retry blktap2 tapdisk message on interrupt.
@ 2013-03-28  7:45 Dr. Greg Wettstein
  0 siblings, 0 replies; 3+ messages in thread
From: Dr. Greg Wettstein @ 2013-03-28  7:45 UTC (permalink / raw)
  To: Ian Jackson; +Cc: jbeulich, xen-devel

On Mar 25, 12:40pm, Ian Jackson wrote:
} Subject: Re: [PATCH 1/1]: tools: Retry blktap2 tapdisk message on interrup

> Dr. Greg Wettstein writes ("[PATCH 1/1]: tools: Retry blktap2 tapdisk message on interrupt."):
> > Re-start blktap2 IPC select call on interrupt.

> Thanks.  I have applied this patch.  But I think you should worry
> that the fix may be incomplete.  I see calls to read() as well, and
> read() can also fail with EINTR.

Yes the read/write calls on the socket are interruptible as well.  The
last patch was definitely 'scratch the itch', we have never seen a
failure on the read or write call.  I suspect catching the select call
interrupt is shielding the read/write calls from any interrupts.
Nonetheless a complete fix should address the read/write calls.

A patch immediately follows.  It went through our test harness without
any issues but since we are not proving it against a known regression
it bears running in other test environments to make sure i doesn't
trip any unexpected behavior.

> Ian.

Have a good weekend.

}-- End of excerpt from Ian Jackson

As always,
Dr. G.W. Wettstein, Ph.D.   Enjellic Systems Development, LLC.
4206 N. 19th Ave.           Specializing in information infra-structure
Fargo, ND  58102            development.
PH: 701-281-1686
FAX: 701-281-3949           EMAIL: greg@enjellic.com
------------------------------------------------------------------------------
"According to the philosopher Ly Tin Wheedle, chaos is found in greatest
 abundance wherever order is being sought.  It always defeats order,
 because it is better organized."
                                -- Terry Pratchett
                                   Interesting Times

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2013-03-28  7:45 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-03-19  7:26 [PATCH 1/1]: tools: Retry blktap2 tapdisk message on interrupt Dr. Greg Wettstein
2013-03-25 12:40 ` Ian Jackson
2013-03-28  7:45 Dr. Greg Wettstein

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.