* [PATCH] Minor fix to xentop to stop it dying when domains go away at the wrong time
@ 2006-07-14 19:00 Graham, Simon
  2006-07-25 15:21 ` Keir Fraser
  0 siblings, 1 reply; 6+ messages in thread
From: Graham, Simon @ 2006-07-14 19:00 UTC (permalink / raw)
  To: xen-devel

Currently, xentop exits with a failure if xenstat_get_node() returns
NULL - this can happen if a VM is deleted between the time
xenstat_get_node() gets the list of VMs and the time it attempts to
collect information about the VMs. This patch modifies xentop to simply
ignore a NULL return and loop around to try the call again.

Note that the patch supplied is against 3.0.testing although I think
unstable is basically the same.

Signed-off-by: Simon Graham <Simon.Graham@stratus.com>

---
Index: xentop.c
===================================================================
--- xentop.c	(revision 3098)
+++ xentop.c	(working copy)
@@ -762,14 +762,18 @@
 {
 	xenstat_domain **domains;
 	unsigned int i, num_domains = 0;
+	xenstat_node *new_node = xenstat_get_node(xhandle, XENSTAT_ALL);
 
-	/* Now get the node information */
+	if (new_node == NULL) {
+		/* This can happen if domains change during call - just
+		 * return and try again */
+		return;
+	}
+
 	if (prev_node != NULL)
 		xenstat_free_node(prev_node);
 	prev_node = cur_node;
-	cur_node = xenstat_get_node(xhandle, XENSTAT_ALL);
-	if (cur_node == NULL)
-		fail("Failed to retrieve statistics from libxenstat\n");
+	cur_node = new_node;
 
 	/* dump summary top information */
 	do_summary();
@@ -871,8 +875,10 @@
 		if(ch != ERR || (curtime.tv_sec - oldtime.tv_sec) >= delay) {
 			clear();
 			top();
-			oldtime = curtime;
-			refresh();
+			if (cur_node != NULL) {
+				oldtime = curtime;
+				refresh();
+			}
 		}
 		ch = getch();
 	} while (handle_key(ch));

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] Minor fix to xentop to stop it dying when domains go away at the wrong time
  2006-07-14 19:00 [PATCH] Minor fix to xentop to stop it dying when domains go away at the wrong time Graham, Simon
@ 2006-07-25 15:21 ` Keir Fraser
  0 siblings, 0 replies; 6+ messages in thread
From: Keir Fraser @ 2006-07-25 15:21 UTC (permalink / raw)
  To: Graham, Simon; +Cc: xen-devel


On 14 Jul 2006, at 20:00, Graham, Simon wrote:

> Currently, xentop exits with a failure if xenstat_get_node() returns
> NULL - this can happen if a VM is deleted between the time
> xenstat_get_node() gets the list of VMs and the time it attempts to
> collect information about the VMs. This patch modifies xentop to simply
> ignore a NULL return and loop around to try the call again.
>
> Note that the patch supplied is against 3.0.testing although I think
> unstable is basically the same.
>
> Signed-off-by: Simon Graham <Simon.Graham@stratus.com>

libxenstat should be fixed so that xenstat_get_node() does not 
spuriously fail in this way. This could be done by getting the 
'collectors' it calls to return better error info -- we can use EAGAIN 
to cause xenstat_get_node() to rerun itself from scratch rather than 
returning failure to the caller. Otherwise we're going to need to fix 
every user of libxenstat, which is stupid.
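
The internal-retry idea could be sketched roughly as below; `collect_domains`, its one-shot failure, and `get_node_with_retry` are stand-ins for illustration, not the real libxenstat collectors or API:

```c
#include <errno.h>

/* Stand-in for a libxenstat collector: fails once with -EAGAIN to
 * simulate the domain list changing mid-collection, then succeeds. */
static int attempts;

static int collect_domains(void)
{
	return (attempts++ == 0) ? -EAGAIN : 0;
}

/* Rerun the whole collection from scratch on a transient error,
 * so callers never see the spurious failure. */
static int get_node_with_retry(void)
{
	int rc;

	do {
		rc = collect_domains();
	} while (rc == -EAGAIN);

	return rc;	/* 0 on success; any other negative errno is fatal */
}
```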

  -- Keir


* Re: [PATCH] Minor fix to xentop to stop it dying when domains go away at the wrong time
  2006-07-26 20:36 Graham, Simon
@ 2006-07-27  9:26 ` Keir Fraser
  0 siblings, 0 replies; 6+ messages in thread
From: Keir Fraser @ 2006-07-27  9:26 UTC (permalink / raw)
  To: Graham, Simon; +Cc: xen-devel


On 26 Jul 2006, at 21:36, Graham, Simon wrote:

> OK - I've reworked the fix to put it in libxenstat -- still not
> completely convinced I like it, but take a look and let me know what
> you think - as you suggested, I've made the collectors return a value
> indicating if a fatal error occurred (-ve), a retryable error occurred
> (0) or they were successful (+ve) and put in code to retry from the top
> when a retryable error occurs (with a small 1/4s delay so we don't spin
> wildly whilst things stabilize).

Thinking about this some more, those retryable failures will generally 
mean that a domain is being created or being destroyed. In those two 
cases, perhaps xenstat_get_node() should simply prune the problematic 
domain from the list it returns? That would avoid unbounded delay in 
xenstat_get_node().
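
Pruning might look something like this; the flat `struct dom` array and the `prune_domain` helper are hypothetical simplifications of `node->domains`, not existing code:

```c
#include <string.h>

/* Simplified stand-in for xenstat_domain */
struct dom {
	int id;
};

/* Drop the entry at index `failed` by shifting the tail down one slot;
 * returns the new domain count for the caller to store back into the
 * node. Bounded work, so no unbounded retry delay. */
static unsigned int prune_domain(struct dom *doms, unsigned int n,
				 unsigned int failed)
{
	memmove(&doms[failed], &doms[failed + 1],
		(n - failed - 1) * sizeof(doms[0]));
	return n - 1;
}
```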

I think what you have so far is okay: fatal error in a collector causes 
error in the caller; recoverable error could cause domain to be pruned 
rather than retrying in the caller. Maybe we should have macros for the 
possible return values from a collector: -1/0/+1 return values are not 
immediately obvious.
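
The macros could be as simple as the sketch below; the `XENSTAT_COLLECT_*` spelling is invented here, not an existing libxenstat identifier:

```c
/* Hypothetical names for the -1/0/+1 collector return protocol */
#define XENSTAT_COLLECT_FATAL	(-1)	/* free node, fail the caller  */
#define XENSTAT_COLLECT_RETRY	0	/* transient race, restart     */
#define XENSTAT_COLLECT_OK	1	/* collector succeeded         */

/* Dispatch mirroring the patch's loop over collectors[i].collect() */
static const char *describe_status(int status)
{
	if (status == XENSTAT_COLLECT_FATAL)
		return "fatal";
	if (status == XENSTAT_COLLECT_RETRY)
		return "retry";
	return "ok";
}
```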

  -- Keir


* RE: [PATCH] Minor fix to xentop to stop it dying when domains go away at the wrong time
@ 2006-07-26 20:36 Graham, Simon
  2006-07-27  9:26 ` Keir Fraser
  0 siblings, 1 reply; 6+ messages in thread
From: Graham, Simon @ 2006-07-26 20:36 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel

[-- Attachment #1: Type: text/plain, Size: 1045 bytes --]

OK - I've reworked the fix to put it in libxenstat -- still not
completely convinced I like it, but take a look and let me know what you
think - as you suggested, I've made the collectors return a value
indicating if a fatal error occurred (-ve), a retryable error occurred
(0) or they were successful (+ve) and put in code to retry from the top
when a retryable error occurs (with a small 1/4s delay so we don't spin
wildly whilst things stabilize).

Simon

-----------------------------------------------------------

Currently, xenstat_get_node exits with a failure if a VM is deleted
between the time it gets the list of VMs and the time it queries
xenstore for its name and other parameters. This patch modifies
xenstat_get_node to retry the process from scratch if a recoverable
error occurs.

Note that the patch supplied is against 3.0.testing although I think
unstable is basically the same.

Signed-off-by: Simon Graham <Simon.Graham@stratus.com>

-----------------------------------------------------------


[-- Attachment #2: xenstat-bug661.patch --]
[-- Type: application/octet-stream, Size: 5614 bytes --]

Index: xenstat.c
===================================================================
--- xenstat.c	(revision 3288)
+++ xenstat.c	(working copy)
@@ -86,7 +86,9 @@
  * Data-collection types
  */
 /* Called to collect the information for the node and all the domains on
- * it. When called, the domain information has already been collected. */
+ * it. When called, the domain information has already been collected.
+ * Return status is -ve for fatal errors, 0 for retryable errors and
+ * +ve for success */
 typedef int (*xenstat_collect_func)(xenstat_node * node);
 /* Called to free the information collected by the collect function.  The free
  * function will only be called on a xenstat_node if that node includes
@@ -185,9 +187,15 @@
 	xenstat_node *node;
 	dom0_physinfo_t physinfo;
 	dom0_getdomaininfo_t domaininfo[DOMAIN_CHUNK_SIZE];
-	unsigned int num_domains, new_domains;
+	unsigned int new_domains;
 	unsigned int i;
 
+	/* Each attempt to get the list of domains may fail because the
+	 * list changes during the process - thus we loop until it
+	 * succeeds (or a memory allocation error causes the routine to
+	 * return NULL). */
+restart:
+
 	/* Create the node */
 	node = (xenstat_node *) calloc(1, sizeof(xenstat_node));
 	if (node == NULL)
@@ -219,28 +227,46 @@
 		return NULL;
 	}
 
-	num_domains = 0;
+	node->num_domains = 0;
+
 	do {
 		xenstat_domain *domain;
 
 		new_domains = xi_get_domaininfolist(handle->xihandle,
-		                                    domaininfo, num_domains,
+		                                    domaininfo, node->num_domains,
 		                                    DOMAIN_CHUNK_SIZE);
 
 		node->domains = realloc(node->domains,
-					(num_domains + new_domains)
+					(node->num_domains + new_domains)
 					* sizeof(xenstat_domain));
 		if (node->domains == NULL) {
 			free(node);
 			return NULL;
 		}
 
-		domain = node->domains + num_domains;
+		domain = node->domains + node->num_domains;
 
+		node->num_domains += new_domains;
+
+		/* zero out newly allocated memory in case error occurs below */
+		memset(domain, 0, new_domains * sizeof(xenstat_domain));
+
 		for (i = 0; i < new_domains; i++) {
 			/* Fill in domain using domaininfo[i] */
 			domain->id = domaininfo[i].domain;
 			domain->name = xenstat_get_domain_name(handle, domaininfo[i].domain);
+			if (domain->name == NULL) {
+				xenstat_free_node(node);
+
+				if (errno == ENOMEM) {
+					return NULL;
+				}
+				else {
+					/* failed to get name -- retry after short delay */
+					usleep(250000);
+					goto restart;
+				}
+			}
 			domain->state = domaininfo[i].flags;
 			domain->cpu_ns = domaininfo[i].cpu_time;
 			domain->num_vcpus = (domaininfo[i].max_vcpu_id+1);
@@ -259,18 +285,29 @@
 
 			domain++;
 		}
-		num_domains += new_domains;
 	} while (new_domains == DOMAIN_CHUNK_SIZE);
-	node->num_domains = num_domains;
 
 	/* Run all the extra data collectors requested */
 	node->flags = 0;
 	for (i = 0; i < NUM_COLLECTORS; i++) {
 		if ((flags & collectors[i].flag) == collectors[i].flag) {
+			int status;
+
 			node->flags |= collectors[i].flag;
-			if(collectors[i].collect(node) == 0) {
+			status = collectors[i].collect(node);
+
+			if (status <= 0) {
 				xenstat_free_node(node);
-				return NULL;
+
+				if (status < 0) {
+					/* Fatal error */
+					return NULL;
+				}
+				else {
+					/* Retry after short delay */
+					usleep(250000);
+					goto restart;
+				}
 			}
 		}
 	}
@@ -461,7 +498,7 @@
 		node->domains[i].vcpus = malloc(node->domains[i].num_vcpus
 						* sizeof(xenstat_vcpu));
 		if (node->domains[i].vcpus == NULL)
-			return 0;
+			return -1;
 	
 		for (vcpu = 0; vcpu < node->domains[i].num_vcpus; vcpu++) {
 			/* FIXME: need to be using a more efficient mechanism*/
@@ -469,7 +506,7 @@
 
 			if (xi_get_domain_vcpu_info(node->handle->xihandle,
 			    node->domains[i].id, vcpu, &info) != 0)
-				return 0;
+				return (errno == ENOMEM)? -1 : 0;
 
 			node->domains[i].vcpus[vcpu].online = info.online;
 			node->domains[i].vcpus[vcpu].ns = info.cpu_time;
@@ -523,20 +560,20 @@
 		node->handle->procnetdev = fopen("/proc/net/dev", "r");
 		if (node->handle->procnetdev == NULL) {
 			perror("Error opening /proc/net/dev");
-			return 0;
+			return -1;
 		}
 
 		/* Validate the format of /proc/net/dev */
 		if (fread(header, sizeof(PROCNETDEV_HEADER) - 1, 1,
 			  node->handle->procnetdev) != 1) {
 			perror("Error reading /proc/net/dev header");
-			return 0;
+			return -1;
 		}
 		header[sizeof(PROCNETDEV_HEADER) - 1] = '\0';
 		if (strcmp(header, PROCNETDEV_HEADER) != 0) {
 			fprintf(stderr,
 				"Unexpected /proc/net/dev format\n");
-			return 0;
+			return -1;
 		}
 	}
 
@@ -588,7 +625,7 @@
 				    sizeof(xenstat_network));
 		}
 		if (domain->networks == NULL)
-			return 0;
+			return -1;
 		domain->networks[domain->num_networks - 1] = net;
 	}
 
@@ -679,7 +716,7 @@
 		/* Get the Xen version number and extraversion string */
 		if (xi_get_xen_version(node->handle->xihandle,
 				       &vnum, &version) < 0)
-			return 0;
+			return -1;
 		/* Format the version information as a string and store it */
 		snprintf(node->handle->xen_version, VERSION_SIZE, "%ld.%ld%s",
 			 ((vnum >> 16) & 0xFFFF), vnum & 0xFFFF, version);
@@ -701,13 +738,8 @@
 static char *xenstat_get_domain_name(xenstat_handle *handle, unsigned int domain_id)
 {
 	char path[80];
-	char *name;
 
 	snprintf(path, sizeof(path),"/local/domain/%i/name", domain_id);
 	
-	name = xs_read(handle->xshandle, XBT_NULL, path, NULL);
-	if (name == NULL)
-		name = strdup(" ");
-
-	return name;
+	return xs_read(handle->xshandle, XBT_NULL, path, NULL);
 }	

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel


* Re: [PATCH] Minor fix to xentop to stop it dying when domains go away at the wrong time
  2006-07-25 18:40 Graham, Simon
@ 2006-07-25 18:48 ` Keir Fraser
  0 siblings, 0 replies; 6+ messages in thread
From: Keir Fraser @ 2006-07-25 18:48 UTC (permalink / raw)
  To: Graham, Simon; +Cc: xen-devel


On 25 Jul 2006, at 19:40, Graham, Simon wrote:

> 1. It's not clear that all possible users of xenstat_get_node() would
>    want to have the call take an arbitrary amount of time as it
>    struggles to get a consistent snapshot -- better to let the caller
>    decide policy on retrying the call.

At the moment the caller cannot even make that choice as the function 
does not return specific error codes. If it, for example, returned a 
negative errno then the caller could check -EAGAIN. Anyhow, clearly the 
bulk of the patch belongs in libxenstat -- the only question is whether 
we retry internally or in the caller. I think the former is simpler as 
it does not change the function's specification.

  -- Keir

> 2. xentop is currently the only user of xenstat_get_node in the tree
>    and the fix in xentop was waaay easier ;-)
>
> If you still think xenstat_get_node() should loop until it has a
> consistent snapshot then I'll redo the patch that way.


* RE: [PATCH] Minor fix to xentop to stop it dying when domains go away at the wrong time
@ 2006-07-25 18:40 Graham, Simon
  2006-07-25 18:48 ` Keir Fraser
  0 siblings, 1 reply; 6+ messages in thread
From: Graham, Simon @ 2006-07-25 18:40 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel

> 
> libxenstat should be fixed so that xenstat_get_node() does not
> spuriously fail in this way. This could be done by getting the
> 'collectors' it calls to return better error info -- we can use EAGAIN
> to cause xenstat_get_node() to rerun itself from scratch rather than
> returning failure to the caller. Otherwise we're going to need to fix
> every user of libxenstat, which is stupid.
> 
>   -- Keir

I thought about this too but decided not to for a couple of reasons:

1. It's not clear that all possible users of xenstat_get_node() would
   want to have the call take an arbitrary amount of time as it
   struggles to get a consistent snapshot -- better to let the caller
   decide policy on retrying the call.

2. xentop is currently the only user of xenstat_get_node in the tree and
   the fix in xentop was waaay easier ;-)

If you still think xenstat_get_node() should loop until it has a
consistent snapshot then I'll redo the patch that way.

/simgr


end of thread, other threads:[~2006-07-27  9:26 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-07-14 19:00 [PATCH] Minor fix to xentop to stop it dying when domains go away at the wrong time Graham, Simon
2006-07-25 15:21 ` Keir Fraser
2006-07-25 18:40 Graham, Simon
2006-07-25 18:48 ` Keir Fraser
2006-07-26 20:36 Graham, Simon
2006-07-27  9:26 ` Keir Fraser
