All of lore.kernel.org
 help / color / mirror / Atom feed
* Reducing impact of save/restore/dump on Dom0
@ 2007-02-06 15:43 Graham, Simon
  2007-02-06 16:23 ` John Levon
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Graham, Simon @ 2007-02-06 15:43 UTC (permalink / raw)
  To: xen-devel

[-- Attachment #1: Type: text/plain, Size: 2087 bytes --]

Currently, save, restore and dump all used cached I/O in Dom0 to
write/read the file containing the memory image of the DomU - when the
memory assigned to the DomU is greater than free memory in Dom0, this
leads to severe memory thrashing and generally the Dom0 performance goes
into the toilet.

The 'classic' answer to avoiding this when writing very large files is,
of course, to use non-cached I/O to manipulate the files - this
introduces the restriction that reads and writes have to be at sector
aligned offsets and be an integral number of sectors in length. I've
been working on a prototype for using O_DIRECT for the dump case; this
one is actually easier to do because the on-disk format of a dump file
already has the memory pages aligned to sector boundaries so only the
dump header has to be handled specially - attached is a patch to
unstable for this code which makes the following changes to
tools/libxc/xc_core.c:

1. xc_domain_dumpcore_via_callback is modified to ensure that the buffer
containing
   page data to be dumped is page aligned (using mmap instead of malloc)
2. xc_domain_dumpcore is modified to open the file O_DIRECT and to
buffer writes issued
   by xc_domain_dumpcore_via_callback as needed.

I'd welcome comments on this prior to submitting it for inclusion. I
know that there is a lot of discussion around O_DIRECT being a bad idea
in general and that the POSIX fadvise() call should provide the hints to
tell Linux that a file's cache doesn't need to be maintained but from
looking at the 2.6.16 & 2.6.18 sources I believe this call is not fully
implemented. I also think that this is a case where you never want to
use the file cache at all.

Solving this problem for save/restore is more tricky because the
on-disk/wire save format does not force the page data to be page aligned
-- my proposal would be to page align each batch of pages written,
leaving pad between the batch header and the pages themselves but I
realize that a change in on-disk/wire format is a potential
compatibility problem.

Simon


[-- Attachment #2: direct-io-dump.patch --]
[-- Type: application/octet-stream, Size: 4653 bytes --]

Index: trunk/tools/libxc/xc_core.c
===================================================================
--- trunk/tools/libxc/xc_core.c	(revision 9948)
+++ trunk/tools/libxc/xc_core.c	(working copy)
@@ -1,7 +1,13 @@
 #include "xg_private.h"
 #include <stdlib.h>
 #include <unistd.h>
+#include <fcntl.h>
 
+#ifndef O_DIRECT
+#define O_DIRECT 040000
+#endif
+#define SECTOR_SIZE 512
+
 /* number of pages to write at a time */
 #define DUMP_INCREMENT (4 * 1024)
 #define round_pgup(_p)    (((_p)+(PAGE_SIZE-1))&PAGE_MASK)
@@ -38,7 +44,10 @@
     int dummy_len;
     int sts;
 
-    if ( (dump_mem_start = malloc(DUMP_INCREMENT*PAGE_SIZE)) == NULL )
+    dump_mem_start = mmap(0, PAGE_SIZE*DUMP_INCREMENT, 
+                          PROT_READ|PROT_WRITE, 
+                          MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
+    if (dump_mem_start == NULL)
     {
         PERROR("Could not allocate dump_mem");
         goto error_out;
@@ -114,13 +123,15 @@
         }
     }
 
-    free(dump_mem_start);
+    munmap(dump_mem_start, PAGE_SIZE*DUMP_INCREMENT);
     free(page_array);
     return 0;
 
  error_out:
-    free(dump_mem_start);
-    free(page_array);
+    if (dump_mem_start)
+        munmap(dump_mem_start, PAGE_SIZE*DUMP_INCREMENT);
+    if (page_array)
+        free(page_array);
     return -1;
 }
 
@@ -130,14 +141,13 @@
 };
 
 /* Callback routine for writing to a local dump file. */
-static int local_file_dump(void *args, char *buffer, unsigned int length)
+static int local_file_dump(int fd, char *buffer, unsigned int length)
 {
-    struct dump_args *da = args;
     int bytes, offset;
 
     for ( offset = 0; offset < length; offset += bytes )
     {
-        bytes = write(da->fd, &buffer[offset], length-offset);
+        bytes = write(fd, &buffer[offset], length-offset);
         if ( bytes <= 0 )
         {
             PERROR("Failed to write buffer");
@@ -148,6 +158,84 @@
     return 0;
 }
 
+// Callback routine that support O_DIRECT i/o to dump file by converting
+// packed non-sector aligned output to sector aligned output.
+static int
+local_dumppacked_rtn(void *args, char *buffer, size_t length)
+{
+    static char *io_base = NULL;
+    static int   io_len  = 0;
+
+    struct dump_args *da = args;
+    int sts;
+    int remaining;
+    void *p;
+    int i;
+
+    if (io_base == NULL) {
+        // Initialize page aligned buffer to use to accumulate sector sized info
+        io_base = mmap(0, PAGE_SIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
+
+        if (!io_base) {
+            fprintf(stderr, "Could not allocate sector buffer: %s\n", strerror(errno));
+            return -errno;
+        }
+    }
+
+    // Quick check to see if we can simply write from buffer passed in - this is
+    // OK if:
+    // 1. We dont have any pending writes buffered in our internal buffer
+    // 2. The buffer address is sector aligned
+    // 3. The length is an integral number of sectors
+    if (io_len == 0 &&
+        !((unsigned long)buffer & (SECTOR_SIZE-1)) &&
+        !(length & (SECTOR_SIZE-1))) {
+	    // Yipee! no need to copy - just write
+	    return local_file_dump(da->fd, buffer, length);
+    }
+    else {
+	    remaining = PAGE_SIZE-io_len;
+	    p = (char *)io_base + io_len;
+
+	    // Convert packed to sector aligned and write
+	    while (length) {
+		    // Add as much as possible to buffer
+		    if (length < remaining)
+			    i = length;
+		    else
+			    i = remaining;
+
+		    memcpy(p,buffer,i);
+
+		    buffer += + i;
+		    io_len += i;
+
+		    // write if necessary
+		    if (io_len == PAGE_SIZE) {
+			    sts = local_file_dump(da->fd, io_base, io_len);
+			    if ( sts != 0 )
+				    return sts;
+
+			    remaining = PAGE_SIZE;
+			    p = io_base;
+			    io_len = 0;
+		    }
+
+		    length -= i;
+	    }
+
+	    // Now, write any remainder if it is an integral number of sectors long
+	    if (io_len && !(io_len % SECTOR_SIZE)) {
+		    sts = local_file_dump(da->fd, io_base, io_len);
+		    if (sts != 0)
+			    return sts;
+
+		    io_len = 0;
+	    }
+    }
+    return 0;
+}
+
 int
 xc_domain_dumpcore(int xc_handle,
                    uint32_t domid,
@@ -156,14 +244,14 @@
     struct dump_args da;
     int sts;
 
-    if ( (da.fd = open(corename, O_CREAT|O_RDWR, S_IWUSR|S_IRUSR)) < 0 )
+    if ( (da.fd = open(corename, O_CREAT|O_RDWR|O_DIRECT, S_IWUSR|S_IRUSR)) < 0 )
     {
         PERROR("Could not open corefile %s", corename);
         return -errno;
     }
 
     sts = xc_domain_dumpcore_via_callback(
-        xc_handle, domid, &da, &local_file_dump);
+        xc_handle, domid, &da, &local_dumppacked_rtn);
 
     close(da.fd);
 

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 15+ messages in thread
* RE: Reducing impact of save/restore/dump on Dom0
@ 2007-02-06 17:25 Graham, Simon
  0 siblings, 0 replies; 15+ messages in thread
From: Graham, Simon @ 2007-02-06 17:25 UTC (permalink / raw)
  To: John Levon; +Cc: xen-devel

Thanks for the comments:

> 
> +#ifndef O_DIRECT
> +#define O_DIRECT 040000
> +#endif
> 
> O_DIRECT doesn't exist on Solaris. The closest equivalent is
> directio(fd, DIRECTIO_ON). Either way you shouldn't be defining it
> yourself?
> 

Well, I would agree with you except that it ends up not being defined
when I build on RHEL4/U2! It seems to me that O_DIRECT is something that
folks really don't want me to be able to use! 

Anyone know of a generic mechanism that works everywhere for enabling
this? If not, I guess I'll have to add #ifdef's... yuck! What I can do
is fix the problem on Linux and make it behave the same old way on
everything else (that I can't build/test for)... other people can then
add platform specific implementations as needed...

> And you should expect this to be able to fail, on Solaris at least.
> 
> +    dump_mem_start = mmap(0, PAGE_SIZE*DUMP_INCREMENT,
> +                          PROT_READ|PROT_WRITE,
> +                          MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
> 
> Please use the more portable MAP_ANON.
> 

Will do.

Simon

^ permalink raw reply	[flat|nested] 15+ messages in thread
* RE: Reducing impact of save/restore/dump on Dom0
@ 2007-02-06 19:21 Graham, Simon
  0 siblings, 0 replies; 15+ messages in thread
From: Graham, Simon @ 2007-02-06 19:21 UTC (permalink / raw)
  To: Iustin Pop; +Cc: xen-devel

> Yes, that's more or less expected. I've used 10% (you can't go below
5%
> = harcoded limit in the kernel) and then, for a 512MB dom0, only ~25
MB
> of cache will be used. I would hardly say that 25MB is too much.
> 

Well, I think it means only 25MB of dirty cache is allowed before writes
become synchronous - you will still use all of memory for the cache, it
will just be cleaned earlier and therefore available for reuse plus the
penalty for the write moves to the writer rather than to everyone
else... Still, I agree it's worth experimenting with (and I intend to).

> > I still feel that dump/save/restore files really don't belong in the
> > system cache at all since they just pollute the cache for no ggood
> > reason.
> 
> Then there is also posix_fadvise, which is more or less what you need
> to
> use in case you worry about your cache. I haven't used it, but I've
> heard from people that using fadvise with POSIX_FADV_DONTNEED+fsync or
> O_SYNC in batches can reduce your cache usage.
> 
> Just a few thoughts, as these don't change the way you do writes, as
> opposed to O_DIRECT.

I certainly would prefer this too; I hadn't considered using
POSIX_FADV_DONTNEED/fsync in the loop...

Thanks for the suggestions,
Simon

^ permalink raw reply	[flat|nested] 15+ messages in thread
* RE: Reducing impact of save/restore/dump on Dom0
@ 2007-02-07 12:56 Graham, Simon
  0 siblings, 0 replies; 15+ messages in thread
From: Graham, Simon @ 2007-02-07 12:56 UTC (permalink / raw)
  To: Jeremy Fitzhardinge, Ian Pratt; +Cc: xen-devel

> -----Original Message-----
> From: Jeremy Fitzhardinge [mailto:jeremy@goop.org]
> Ian Pratt wrote:
> > As for making the IO bypass the buffer cache, I'm not sure what the
> best
> > way to do this is. There are some occasions where we want the
restore
> > image to be in the buffer cache (e.g. as used by the fault injection
> > testing for fast domain restart) but I agree that its not helpful in
> the
> > normal case. My first inclination would be O_DIRECT, but there may
be
> a
> > better way.
> O_DIRECT is strongly deprecated.  fadvise(..., FADV_DONTNEED, ...) is
> the preferred interface.

I'm currently experimenting with using fsync/fadvise - will post results
shortly. If this works well, then it's not essential to change the
on-disk format although I think the performance will be better if it is
changed.

I do find it a little annoying that in Linux the routine is actually
called posix_fadvise64 rather than fadvise64 but I can obviously work
round that.

Simon

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2007-02-07 12:58 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-02-06 15:43 Reducing impact of save/restore/dump on Dom0 Graham, Simon
2007-02-06 16:23 ` John Levon
2007-02-06 17:36 ` Iustin Pop
2007-02-06 17:46   ` Graham, Simon
2007-02-06 18:35     ` Iustin Pop
2007-02-06 23:00 ` Ian Pratt
2007-02-06 23:20   ` Jeremy Fitzhardinge
2007-02-07  6:40     ` Rusty Russell
2007-02-07  7:41       ` Jeremy Fitzhardinge
2007-02-07 12:11 ` Andi Kleen
2007-02-07 12:52   ` Ian Pratt
2007-02-07 12:58     ` Andi Kleen
2007-02-06 17:25 Graham, Simon
2007-02-06 19:21 Graham, Simon
2007-02-07 12:56 Graham, Simon

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.