* [PATCH] Minor fix to xentop to stop it dying when domains go away at the wrong time
@ 2006-07-14 19:00 Graham, Simon
2006-07-25 15:21 ` Keir Fraser
0 siblings, 1 reply; 6+ messages in thread
From: Graham, Simon @ 2006-07-14 19:00 UTC (permalink / raw)
To: xen-devel
Currently, xentop exits with a failure if xenstat_get_node() returns
NULL - this can happen if a VM is deleted between the time
xenstat_get_node() gets the list of VMs and the time it attempts to
collect information about the VMs. This patch modifies xentop to simply
ignore a NULL return and loop around to try the call again.
Note that the patch supplied is against 3.0.testing although I think
unstable is basically the same.
Signed-off-by: Simon Graham Simon.Graham@stratus.com
---
Index: xentop.c
===================================================================
--- xentop.c (revision 3098)
+++ xentop.c (working copy)
@@ -762,14 +762,18 @@
{
xenstat_domain **domains;
unsigned int i, num_domains = 0;
+ xenstat_node *new_node = xenstat_get_node(xhandle,
XENSTAT_ALL);
- /* Now get the node information */
+ if (new_node == NULL) {
+ // This can happen if domains change during call - just
+ // return and try again
+ return;
+ }
+
if (prev_node != NULL)
xenstat_free_node(prev_node);
prev_node = cur_node;
- cur_node = xenstat_get_node(xhandle, XENSTAT_ALL);
- if (cur_node == NULL)
- fail("Failed to retrieve statistics from libxenstat\n");
+ cur_node = new_node;
/* dump summary top information */
do_summary();
@@ -871,8 +875,10 @@
if(ch != ERR || (curtime.tv_sec - oldtime.tv_sec) >=
delay) {
clear();
top();
- oldtime = curtime;
- refresh();
+ if (cur_node != NULL) {
+ oldtime = curtime;
+ refresh();
+ }
}
ch = getch();
} while (handle_key(ch));
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] Minor fix to xentop to stop it dying when domains go away at the wrong time
2006-07-14 19:00 [PATCH] Minor fix to xentop to stop it dying when domains go away at the wrong time Graham, Simon
@ 2006-07-25 15:21 ` Keir Fraser
0 siblings, 0 replies; 6+ messages in thread
From: Keir Fraser @ 2006-07-25 15:21 UTC (permalink / raw)
To: Graham, Simon; +Cc: xen-devel
On 14 Jul 2006, at 20:00, Graham, Simon wrote:
> Currently, xentop exits with a failure if xenstat_get_node() returns
> NULL - this can happen if a VM is deleted between the time
> xenstat_get_node() gets the list of VMs and the time it attempts to
> collect information about the VMs. This patch modifies xentop to simply
> ignore a NULL return and loop around to try the call again.
>
> Note that the patch supplied is against 3.0.testing although I think
> unstable is basically the same.
>
> Signed-off-by: Simon Graham Simon.Graham@stratus.com
libxenstat should be fixed so that xenstat_get_node() does not
spuriously fail in this way. This could be done by getting the
'collectors' it calls to return better error info -- we can use EAGAIN
to cause xenstat_get_node() to rerun itself from scratch rather than
returning failure to the caller. Otherwise we're going to need to fix
every user of libxenstat, which is stupid.
-- Keir
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] Minor fix to xentop to stop it dying when domains go away at the wrong time
2006-07-26 20:36 Graham, Simon
@ 2006-07-27 9:26 ` Keir Fraser
0 siblings, 0 replies; 6+ messages in thread
From: Keir Fraser @ 2006-07-27 9:26 UTC (permalink / raw)
To: Graham, Simon; +Cc: xen-devel
On 26 Jul 2006, at 21:36, Graham, Simon wrote:
> OK - I've reworked the fix to put it in libxenstat -- still not
> completely convinced I like it, but take a look and let me know what
> you
> think - as you suggested, I've made the collectors return a value
> indicating if a fatal error occurred (-ve), a retryable error occurred
> (0) or they were successful (+ve) and put in code to retry from the top
> when a retryable error occurs (with a small 1/4s delay so we don't spin
> wildly whilst things stabilize).
Thinking about this some more, those retryable failures will generally
mean that a domain is being created or being destroyed. In those two
cases, perhaps xenstat_get_node() should simply prune the problematic
domain from the list it returns? That would avoid unbounded delay in
xenstat_get_node().
I think what you have so far is okay: fatal error in a collector causes
error in the caller; recoverable error could cause domain to be pruned
rather than retrying in the caller. Maybe we should have macros for the
possible return values from a collector: -1/0/+1 return values are not
immediately obvious.
-- Keir
^ permalink raw reply [flat|nested] 6+ messages in thread
* RE: [PATCH] Minor fix to xentop to stop it dying when domains go away at the wrong time
@ 2006-07-26 20:36 Graham, Simon
2006-07-27 9:26 ` Keir Fraser
0 siblings, 1 reply; 6+ messages in thread
From: Graham, Simon @ 2006-07-26 20:36 UTC (permalink / raw)
To: Keir Fraser; +Cc: xen-devel
[-- Attachment #1: Type: text/plain, Size: 1045 bytes --]
OK - I've reworked the fix to put it in libxenstat -- still not
completely convinced I like it, but take a look and let me know what you
think - as you suggested, I've made the collectors return a value
indicating if a fatal error occurred (-ve), a retryable error occurred
(0) or they were successful (+ve) and put in code to retry from the top
when a retryable error occurs (with a small 1/4s delay so we don't spin
wildly whilst things stabilize).
Simon
-----------------------------------------------------------
Currently, xenstat_get_node exits with a failure if a VM is deleted
between the time
it gets the list of VMs and the time it queries xenstore for it's name
and other
parameters. This patch modifies xenstat_get_node to retry the process
from scratch
if a recoverable error occurs.
Note that the patch supplied is against 3.0.testing although I think
unstable is basically the same.
Signed-off-by: Simon Graham Simon.Graham@stratus.com
-----------------------------------------------------------
[-- Attachment #2: xenstat-bug661.patch --]
[-- Type: application/octet-stream, Size: 5614 bytes --]
Index: xenstat.c
===================================================================
--- xenstat.c (revision 3288)
+++ xenstat.c (working copy)
@@ -86,7 +86,9 @@
* Data-collection types
*/
/* Called to collect the information for the node and all the domains on
- * it. When called, the domain information has already been collected. */
+ * it. When called, the domain information has already been collected.
+ * Return status is -ve for fatal errors, 0 for retryable errors and
+ * +ve for success */
typedef int (*xenstat_collect_func)(xenstat_node * node);
/* Called to free the information collected by the collect function. The free
* function will only be called on a xenstat_node if that node includes
@@ -185,9 +187,15 @@
xenstat_node *node;
dom0_physinfo_t physinfo;
dom0_getdomaininfo_t domaininfo[DOMAIN_CHUNK_SIZE];
- unsigned int num_domains, new_domains;
+ unsigned int new_domains;
unsigned int i;
+ /* Each attempt to get the list of domains may fail because the
+ * list changes during the process - thus we loop until it
+ * succeeds (or a memory allocation error causes the routine to
+ * return NULL. */
+restart:
+
/* Create the node */
node = (xenstat_node *) calloc(1, sizeof(xenstat_node));
if (node == NULL)
@@ -219,28 +227,46 @@
return NULL;
}
- num_domains = 0;
+ node->num_domains = 0;
+
do {
xenstat_domain *domain;
new_domains = xi_get_domaininfolist(handle->xihandle,
- domaininfo, num_domains,
+ domaininfo, node->num_domains,
DOMAIN_CHUNK_SIZE);
node->domains = realloc(node->domains,
- (num_domains + new_domains)
+ (node->num_domains + new_domains)
* sizeof(xenstat_domain));
if (node->domains == NULL) {
free(node);
return NULL;
}
- domain = node->domains + num_domains;
+ domain = node->domains + node->num_domains;
+ node->num_domains += new_domains;
+
+ /* zero out newly allocated memory in case error occurs below */
+ memset(domain, 0, new_domains * sizeof(xenstat_domain));
+
for (i = 0; i < new_domains; i++) {
/* Fill in domain using domaininfo[i] */
domain->id = domaininfo[i].domain;
domain->name = xenstat_get_domain_name(handle, domaininfo[i].domain);
+ if (domain->name == NULL) {
+ xenstat_free_node(node);
+
+ if (errno == ENOMEM) {
+ return NULL;
+ }
+ else {
+ /* failed to get name -- retry after short delay */
+ usleep(250000);
+ goto restart;
+ }
+ }
domain->state = domaininfo[i].flags;
domain->cpu_ns = domaininfo[i].cpu_time;
domain->num_vcpus = (domaininfo[i].max_vcpu_id+1);
@@ -259,18 +285,29 @@
domain++;
}
- num_domains += new_domains;
} while (new_domains == DOMAIN_CHUNK_SIZE);
- node->num_domains = num_domains;
/* Run all the extra data collectors requested */
node->flags = 0;
for (i = 0; i < NUM_COLLECTORS; i++) {
if ((flags & collectors[i].flag) == collectors[i].flag) {
+ int status;
+
node->flags |= collectors[i].flag;
- if(collectors[i].collect(node) == 0) {
+ status = collectors[i].collect(node);
+
+ if (status <= 0) {
xenstat_free_node(node);
- return NULL;
+
+ if (status < 0) {
+ /* Fatal error */
+ return NULL;
+ }
+ else {
+ /* Retry after short delay */
+ usleep(250000);
+ goto restart;
+ }
}
}
}
@@ -461,7 +498,7 @@
node->domains[i].vcpus = malloc(node->domains[i].num_vcpus
* sizeof(xenstat_vcpu));
if (node->domains[i].vcpus == NULL)
- return 0;
+ return -1;
for (vcpu = 0; vcpu < node->domains[i].num_vcpus; vcpu++) {
/* FIXME: need to be using a more efficient mechanism*/
@@ -469,7 +506,7 @@
if (xi_get_domain_vcpu_info(node->handle->xihandle,
node->domains[i].id, vcpu, &info) != 0)
- return 0;
+ return (errno == ENOMEM)? -1 : 0;
node->domains[i].vcpus[vcpu].online = info.online;
node->domains[i].vcpus[vcpu].ns = info.cpu_time;
@@ -523,20 +560,20 @@
node->handle->procnetdev = fopen("/proc/net/dev", "r");
if (node->handle->procnetdev == NULL) {
perror("Error opening /proc/net/dev");
- return 0;
+ return -1;
}
/* Validate the format of /proc/net/dev */
if (fread(header, sizeof(PROCNETDEV_HEADER) - 1, 1,
node->handle->procnetdev) != 1) {
perror("Error reading /proc/net/dev header");
- return 0;
+ return -1;
}
header[sizeof(PROCNETDEV_HEADER) - 1] = '\0';
if (strcmp(header, PROCNETDEV_HEADER) != 0) {
fprintf(stderr,
"Unexpected /proc/net/dev format\n");
- return 0;
+ return -1;
}
}
@@ -588,7 +625,7 @@
sizeof(xenstat_network));
}
if (domain->networks == NULL)
- return 0;
+ return -1;
domain->networks[domain->num_networks - 1] = net;
}
@@ -679,7 +716,7 @@
/* Get the Xen version number and extraversion string */
if (xi_get_xen_version(node->handle->xihandle,
&vnum, &version) < 0)
- return 0;
+ return -1;
/* Format the version information as a string and store it */
snprintf(node->handle->xen_version, VERSION_SIZE, "%ld.%ld%s",
((vnum >> 16) & 0xFFFF), vnum & 0xFFFF, version);
@@ -701,13 +738,8 @@
static char *xenstat_get_domain_name(xenstat_handle *handle, unsigned int domain_id)
{
char path[80];
- char *name;
snprintf(path, sizeof(path),"/local/domain/%i/name", domain_id);
- name = xs_read(handle->xshandle, XBT_NULL, path, NULL);
- if (name == NULL)
- name = strdup(" ");
-
- return name;
+ return xs_read(handle->xshandle, XBT_NULL, path, NULL);
}
[-- Attachment #3: Type: text/plain, Size: 138 bytes --]
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] Minor fix to xentop to stop it dying when domains go away at the wrong time
2006-07-25 18:40 Graham, Simon
@ 2006-07-25 18:48 ` Keir Fraser
0 siblings, 0 replies; 6+ messages in thread
From: Keir Fraser @ 2006-07-25 18:48 UTC (permalink / raw)
To: Graham, Simon; +Cc: xen-devel
On 25 Jul 2006, at 19:40, Graham, Simon wrote:
> 1. It's not clear that all possible users of xenstat_get_node() would
> want
> to have the call take an arbitrary amount of time as it struggles to
> get a consistent snapshot -- better to let the caller decide policy
> on retrying the call.
At the moment the caller cannot even make that choice as the function
does not return specific error codes. If it, for example, returned a
negative errno then the caller could check -EAGAIN. Anyhow, clearly the
bulk of the patch belongs in libxenstat -- the only question is whether
we retry internally or in the caller. I think the former is simpler as
it does not change the function's specification.
-- Keir
> 2. xentop is currently the only user of xenstat_get_node in the tree
> and
> the
> fix in xentop was waaay easier ;-)
>
> If you still think xenstat_get_node() should loop until it has a
> consistent snapshot
> then I'll redo the patch that way
^ permalink raw reply [flat|nested] 6+ messages in thread
* RE: [PATCH] Minor fix to xentop to stop it dying when domains go away at the wrong time
@ 2006-07-25 18:40 Graham, Simon
2006-07-25 18:48 ` Keir Fraser
0 siblings, 1 reply; 6+ messages in thread
From: Graham, Simon @ 2006-07-25 18:40 UTC (permalink / raw)
To: Keir Fraser; +Cc: xen-devel
>
> libxenstat should be fixed so that xenstat_get_node() does not
> spuriously fail in this way. This could be done by getting the
> 'collectors' it calls to return better error info -- we can use EAGAIN
> to cause xenstat_get_node() to rerun itself from scratch rather than
> returning failure to the caller. Otherwise we're going to need to fix
> every user of libxenstat, which is stupid.
>
> -- Keir
I thought about this too but decided not to for a couple of reasons:
1. It's not clear that all possible users of xenstat_get_node() would
want
to have the call take an arbitrary amount of time as it struggles to
get a consistent snapshot -- better to let the caller decide policy
on retrying the call.
2. xentop is currently the only user of xenstat_get_node in the tree and
the
fix in xentop was waaay easier ;-)
If you still think xenstat_get_node() should loop until it has a
consistent snapshot
then I'll redo the patch that way
/simgr
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2006-07-27 9:26 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-07-14 19:00 [PATCH] Minor fix to xentop to stop it dying when domains go away at the wrong time Graham, Simon
2006-07-25 15:21 ` Keir Fraser
2006-07-25 18:40 Graham, Simon
2006-07-25 18:48 ` Keir Fraser
2006-07-26 20:36 Graham, Simon
2006-07-27 9:26 ` Keir Fraser
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.