* Reducing impact of save/restore/dump on Dom0
@ 2007-02-06 15:43 Graham, Simon
From: Graham, Simon @ 2007-02-06 15:43 UTC (permalink / raw)
  To: xen-devel


Currently, save, restore and dump all use cached I/O in Dom0 to
write/read the file containing the memory image of the DomU - when the
memory assigned to the DomU is greater than the free memory in Dom0,
this leads to severe memory thrashing and Dom0 performance generally
goes into the toilet.

The 'classic' answer to avoiding this when writing very large files is,
of course, to use non-cached I/O to manipulate the files - this
introduces the restriction that reads and writes have to be at
sector-aligned offsets and be an integral number of sectors in length
(a minimal illustrative sketch of these rules follows the list below).
I've been working on a prototype that uses O_DIRECT for the dump case;
this one is actually easier to do because the on-disk format of a dump
file already has the memory pages aligned to sector boundaries, so only
the dump header has to be handled specially - attached is a patch
against unstable which makes the following changes to
tools/libxc/xc_core.c:

1. xc_domain_dumpcore_via_callback is modified to ensure that the
   buffer containing page data to be dumped is page aligned (using mmap
   instead of malloc).
2. xc_domain_dumpcore is modified to open the file with O_DIRECT and to
   buffer writes issued by xc_domain_dumpcore_via_callback as needed.
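
For readers unfamiliar with those O_DIRECT rules, here is a minimal
stand-alone sketch (illustrative only, not part of the patch, and
assuming a 512-byte sector size) of what an O_DIRECT writer must
guarantee - the buffer address, the file offset and the transfer length
must all be sector multiples:

    #define _GNU_SOURCE   /* for O_DIRECT */
    #include <err.h>
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define SECTOR_SIZE 512

    int main(void)
    {
        void *buf;
        int fd = open("scratch.img", O_CREAT|O_WRONLY|O_DIRECT, 0600);
        if (fd < 0)
            err(1, "open");
        /* posix_memalign returns a sector-aligned buffer */
        if (posix_memalign(&buf, SECTOR_SIZE, 8 * SECTOR_SIZE) != 0)
            err(1, "posix_memalign");
        memset(buf, 0, 8 * SECTOR_SIZE);
        /* length is a whole number of sectors; the offset starts at 0 */
        if (write(fd, buf, 8 * SECTOR_SIZE) != 8 * SECTOR_SIZE)
            err(1, "write");
        close(fd);
        free(buf);
        return 0;
    }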

I'd welcome comments on this prior to submitting it for inclusion. I
know there is a lot of discussion around O_DIRECT being a bad idea in
general, and that the POSIX fadvise() call should provide the hints
needed to tell Linux that a file's cache does not need to be
maintained, but from looking at the 2.6.16 & 2.6.18 sources I believe
this call is not fully implemented. I also think this is a case where
you never want to use the file cache at all.

Solving this problem for save/restore is trickier because the
on-disk/wire save format does not force the page data to be page
aligned -- my proposal would be to page-align each batch of pages
written, leaving padding between the batch header and the pages
themselves, but I realize that a change in the on-disk/wire format is a
potential compatibility problem. A sketch of the padding arithmetic
follows.
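
To make the padding concrete, here is a small sketch (hypothetical
helper and numbers, not an actual format change):

    #include <stddef.h>

    #define SECTOR_SIZE 512

    /* hypothetical helper: number of pad bytes needed after a batch
     * header of hdr_len bytes so that the page data following it
     * lands on a sector boundary */
    static size_t header_pad(size_t hdr_len)
    {
        return (SECTOR_SIZE - (hdr_len % SECTOR_SIZE)) % SECTOR_SIZE;
    }

    /* e.g. a 16-byte batch header would be followed by 496 pad bytes */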

Simon


[-- Attachment #2: direct-io-dump.patch --]
[-- Type: application/octet-stream, Size: 4653 bytes --]

Index: trunk/tools/libxc/xc_core.c
===================================================================
--- trunk/tools/libxc/xc_core.c	(revision 9948)
+++ trunk/tools/libxc/xc_core.c	(working copy)
@@ -1,7 +1,13 @@
 #include "xg_private.h"
 #include <stdlib.h>
 #include <unistd.h>
+#include <fcntl.h>
 
+#ifndef O_DIRECT
+#define O_DIRECT 040000
+#endif
+#define SECTOR_SIZE 512
+
 /* number of pages to write at a time */
 #define DUMP_INCREMENT (4 * 1024)
 #define round_pgup(_p)    (((_p)+(PAGE_SIZE-1))&PAGE_MASK)
@@ -38,7 +44,10 @@
     int dummy_len;
     int sts;
 
-    if ( (dump_mem_start = malloc(DUMP_INCREMENT*PAGE_SIZE)) == NULL )
+    dump_mem_start = mmap(NULL, PAGE_SIZE*DUMP_INCREMENT,
+                          PROT_READ|PROT_WRITE,
+                          MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
+    if (dump_mem_start == MAP_FAILED)
     {
         PERROR("Could not allocate dump_mem");
         goto error_out;
@@ -114,13 +123,15 @@
         }
     }
 
-    free(dump_mem_start);
+    munmap(dump_mem_start, PAGE_SIZE*DUMP_INCREMENT);
     free(page_array);
     return 0;
 
  error_out:
-    free(dump_mem_start);
-    free(page_array);
+    if (dump_mem_start != MAP_FAILED)
+        munmap(dump_mem_start, PAGE_SIZE*DUMP_INCREMENT);
+    if (page_array)
+        free(page_array);
     return -1;
 }
 
@@ -130,14 +141,13 @@
 };
 
 /* Callback routine for writing to a local dump file. */
-static int local_file_dump(void *args, char *buffer, unsigned int length)
+static int local_file_dump(int fd, char *buffer, unsigned int length)
 {
-    struct dump_args *da = args;
     int bytes, offset;
 
     for ( offset = 0; offset < length; offset += bytes )
     {
-        bytes = write(da->fd, &buffer[offset], length-offset);
+        bytes = write(fd, &buffer[offset], length-offset);
         if ( bytes <= 0 )
         {
             PERROR("Failed to write buffer");
@@ -148,6 +158,84 @@
     return 0;
 }
 
+// Callback routine that supports O_DIRECT I/O to the dump file by converting
+// packed, non-sector-aligned output to sector-aligned output.
+static int
+local_dumppacked_rtn(void *args, char *buffer, unsigned int length)
+{
+    static char *io_base = NULL;
+    static int   io_len  = 0;
+
+    struct dump_args *da = args;
+    int sts;
+    int remaining;
+    void *p;
+    int i;
+
+    if (io_base == NULL) {
+        // Initialize a page-aligned buffer used to accumulate sector-sized chunks
+        io_base = mmap(NULL, PAGE_SIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
+        if (io_base == MAP_FAILED) {
+            io_base = NULL;   // stay NULL so a later call can retry
+            fprintf(stderr, "Could not allocate sector buffer: %s\n", strerror(errno));
+            return -errno;
+        }
+    }
+
+    // Quick check to see if we can simply write from the buffer passed in.
+    // This is OK if:
+    // 1. We don't have any pending writes buffered in our internal buffer
+    // 2. The buffer address is sector aligned
+    // 3. The length is an integral number of sectors
+    if (io_len == 0 &&
+        !((unsigned long)buffer & (SECTOR_SIZE-1)) &&
+        !(length & (SECTOR_SIZE-1))) {
+        // Yippee! No need to copy - just write
+        return local_file_dump(da->fd, buffer, length);
+    }
+    else {
+        remaining = PAGE_SIZE - io_len;
+        p = (char *)io_base + io_len;
+
+        // Convert packed output to sector-aligned output and write it
+        while (length) {
+            // Add as much as possible to the buffer
+            if (length < remaining)
+                i = length;
+            else
+                i = remaining;
+
+            memcpy(p, buffer, i);
+
+            buffer += i;
+            io_len += i;
+
+            // Flush when the buffer fills (io_len can only reach PAGE_SIZE exactly)
+            if (io_len == PAGE_SIZE) {
+                sts = local_file_dump(da->fd, io_base, io_len);
+                if (sts != 0)
+                    return sts;
+
+                remaining = PAGE_SIZE;
+                p = io_base;
+                io_len = 0;
+            }
+
+            length -= i;
+        }
+
+        // Now write any remainder if it is an integral number of sectors long
+        if (io_len && !(io_len % SECTOR_SIZE)) {
+            sts = local_file_dump(da->fd, io_base, io_len);
+            if (sts != 0)
+                return sts;
+
+            io_len = 0;
+        }
+    }
+    return 0;
+}
+
 int
 xc_domain_dumpcore(int xc_handle,
                    uint32_t domid,
@@ -156,14 +244,14 @@
     struct dump_args da;
     int sts;
 
-    if ( (da.fd = open(corename, O_CREAT|O_RDWR, S_IWUSR|S_IRUSR)) < 0 )
+    if ( (da.fd = open(corename, O_CREAT|O_RDWR|O_DIRECT, S_IWUSR|S_IRUSR)) < 0 )
     {
         PERROR("Could not open corefile %s", corename);
         return -errno;
     }
 
     sts = xc_domain_dumpcore_via_callback(
-        xc_handle, domid, &da, &local_file_dump);
+        xc_handle, domid, &da, &local_dumppacked_rtn);
 
     close(da.fd);
 

* RE: Re: Reducing impact of save/restore/dump on Dom0
@ 2007-02-07 13:06 Graham, Simon
From: Graham, Simon @ 2007-02-07 13:06 UTC (permalink / raw)
  To: Andi Kleen, Ian Pratt; +Cc: xen-devel


> > It's pretty easy for us to arrange that everything is page aligned.
> > If there was a measurable performance advantage using o_direct
> > rather than madvise/fadvise it probably makes sense to use it -- I
> > can't see o_direct going away any time soon.
> 
> O_DIRECT won't do any write behind, so unless you do very aggressive
> IO (large IO requests, threads or aio so you do useful work during
> the disk write latency) it will likely be slower. Similar for read --
> it won't do any readahead, which you would need to do yourself.
> 
> It's really not a very good idea for most non-database applications.

Well, the dump/save/restore code does do large IO requests for most of
the data. Also, it's not a performance-critical path - it's MUCH more
important that other things in Dom0 happen quickly (such as performing
I/O for other domains), so I would be quite happy with the
save/restore/dump process being a little slow in return for not
destroying Dom0 performance (which is what happens today).

Having said that, I understand that O_DIRECT is deprecated (by Linus at
least) and there is also the problem of it not being available on
Solaris; hence I am trying out fsync plus fadvise(POSIX_FADV_DONTNEED)
in the loop after writing/reading a chunk of data.
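
A minimal sketch of what I mean (write_chunk is a hypothetical helper,
and POSIX_FADV_DONTNEED is the advice value in question - how
completely the 2.6.16/2.6.18 kernels honour it is exactly the open
question above):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    /* hypothetical: write one chunk, flush it to disk, then ask the
     * kernel to drop it from the page cache so repeated chunks do not
     * consume Dom0 memory */
    static int write_chunk(int fd, const void *buf, size_t len, off_t off)
    {
        if (pwrite(fd, buf, len, off) != (ssize_t)len)
            return -1;
        if (fsync(fd) < 0)
            return -1;
        /* posix_fadvise returns 0 on success, an errno value on failure */
        return posix_fadvise(fd, off, (off_t)len, POSIX_FADV_DONTNEED);
    }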

Simon
