* [Qemu-devel] [PATCH v6 0/2] block: enforce minimal 4096 alignment in qemu_blockalign
@ 2015-05-12  5:47 Denis V. Lunev
  2015-05-12  5:47 ` [Qemu-devel] [PATCH 1/2] block: minimal bounce buffer alignment Denis V. Lunev
  2015-05-12  5:47 ` [Qemu-devel] [PATCH 2/2] block: align bounce buffers to page Denis V. Lunev
  0 siblings, 2 replies; 9+ messages in thread
From: Denis V. Lunev @ 2015-05-12  5:47 UTC (permalink / raw)
  Cc: Kevin Wolf, qemu-block, qemu-devel, Stefan Hajnoczi,
	Paolo Bonzini, Denis V. Lunev

I have used the following program for testing:
#define _GNU_SOURCE

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <malloc.h>
#include <string.h>

int main(int argc, char *argv[])
{
    int fd = open(argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644);
    void *buf;
    int i = 0, align = atoi(argv[2]);

    /* Allocate (and deliberately leak) buffers until we get one with
     * the requested alignment that is NOT also 4096-aligned, so the
     * effect of sub-page alignment can actually be measured. */
    do {
        buf = memalign(align, 4096);
        if (align >= 4096)
            break;
        if ((unsigned long)buf & 4095)
            break;
        i++;
    } while (1);
    printf("%d %p\n", i, buf);

    memset(buf, 0x11, 4096);

    for (i = 0; i < 100000; i++) {
        lseek(fd, 4096, SEEK_CUR);  /* skip 4k, then write 4k */
        write(fd, buf, 4096);
    }

    close(fd);
    return 0;
}
for i in `seq 1 30` ; do ./a.out aa <align> ; done

The file was placed in an 8 GB partition on the HDD below to avoid speed
changes due to different offsets on the disk. The results are reliable:
- 189 vs 180 seconds on Linux 3.16

The following setups have been tested:
1) ext4 with a block size of 1024 over 512/512 physical/logical
   sector size SSD disk
2) ext4 with a block size of 4096 over 512/512 physical/logical
   sector size SSD disk
3) ext4 with a block size of 4096 over 512/4096 physical/logical
   sector size rotational disk (WDC WD20EZRX)
4) xfs with a block size of 4096 over 512/512 physical/logical
   sector size SSD disk

The difference is quite reliable and consistently the same 5%. The command
  qemu-io -n -c 'write -P 0xaa 0 1G' 1.img
for an image in qcow2 format is 1% faster.

qemu-img is also affected. The difference between
  qemu-img create -f qcow2 1.img 64G
  qemu-io -n -c 'write -P 0xaa 0 1G' 1.img
  time for i in `seq 1 30` ; do qemu-img convert 1.img -t none -O raw 2.img ; rm -rf 2.img ; done
is around 126 vs 119 seconds.

The justification for the performance improvement is quite interesting.
From the kernel's point of view, each request to the disk was split
in two. This can be seen with blktrace like this:
  9,0   11  1     0.000000000 11151  Q  WS 312737792 + 1023 [qemu-img]
  9,0   11  2     0.000007938 11151  Q  WS 312738815 + 8 [qemu-img]
  9,0   11  3     0.000030735 11151  Q  WS 312738823 + 1016 [qemu-img]
  9,0   11  4     0.000032482 11151  Q  WS 312739839 + 8 [qemu-img]
  9,0   11  5     0.000041379 11151  Q  WS 312739847 + 1016 [qemu-img]
  9,0   11  6     0.000042818 11151  Q  WS 312740863 + 8 [qemu-img]
  9,0   11  7     0.000051236 11151  Q  WS 312740871 + 1017 [qemu-img]
  9,0    5  1     0.169071519 11151  Q  WS 312741888 + 1023 [qemu-img]
After the patch the pattern becomes normal:
  9,0    6  1     0.000000000 12422  Q  WS 314834944 + 1024 [qemu-img]
  9,0    6  2     0.000038527 12422  Q  WS 314835968 + 1024 [qemu-img]
  9,0    6  3     0.000072849 12422  Q  WS 314836992 + 1024 [qemu-img]
  9,0    6  4     0.000106276 12422  Q  WS 314838016 + 1024 [qemu-img]
and the number of requests sent to the disk (which can be calculated by
counting the lines in the blktrace output) is reduced by about a factor of two.

Both qemu-img and qemu-io are affected, while qemu-kvm is not. The guest
does its job well and real requests come properly aligned (to the page size).

Changes from v5:
- found justification from kernel point of view
- fixed checkpatch warnings in the patch 2

Changes from v4:
- patches reordered
- dropped conversion from 512 to BDRV_SECTOR_SIZE
- getpagesize() is replaced with MAX(4096, getpagesize()) as suggested by
  Kevin

Changes from v3:
- portable way to calculate system page size used
- 512/4096 values are replaced with proper macros/values

Changes from v2:
- opt_mem_alignment is split to opt_mem_alignment for bounce buffering
  and min_mem_alignment to check buffers coming from guest.

Changes from v1:
- enforce 4096 alignment in qemu_(try_)blockalign; avoid touching the
  bdrv_qiov_is_aligned path so as not to enforce additional bounce
  buffering, as suggested by Paolo
- reduced 10% to 5% in the patch description to better fit the 180 vs 189
  difference

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>


* [Qemu-devel] [PATCH 1/2] block: minimal bounce buffer alignment
  2015-05-12  5:47 [Qemu-devel] [PATCH v6 0/2] block: enforce minimal 4096 alignment in qemu_blockalign Denis V. Lunev
@ 2015-05-12  5:47 ` Denis V. Lunev
  2015-05-12 10:29   ` Kevin Wolf
  2015-05-12  5:47 ` [Qemu-devel] [PATCH 2/2] block: align bounce buffers to page Denis V. Lunev
  1 sibling, 1 reply; 9+ messages in thread
From: Denis V. Lunev @ 2015-05-12  5:47 UTC (permalink / raw)
  Cc: Kevin Wolf, qemu-block, qemu-devel, Stefan Hajnoczi,
	Paolo Bonzini, Denis V. Lunev

The patch introduces a new concept: a minimal memory alignment for bounce
buffers. The original so-called "optimal" value is actually the minimal
required value for alignment. It should be used to validate that an IOVec
is properly aligned and a bounce buffer is not required.

However, from a performance point of view, it would be better if bounce
buffers or IOVecs allocated by QEMU were aligned more strictly.

The patch does not change any alignment value yet.

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
---
 block.c                   | 11 +++++++++++
 block/io.c                |  7 ++++++-
 block/raw-posix.c         |  1 +
 include/block/block.h     |  2 ++
 include/block/block_int.h |  3 +++
 5 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/block.c b/block.c
index 7904098..e293907 100644
--- a/block.c
+++ b/block.c
@@ -113,6 +113,16 @@ size_t bdrv_opt_mem_align(BlockDriverState *bs)
     return bs->bl.opt_mem_alignment;
 }
 
+size_t bdrv_min_mem_align(BlockDriverState *bs)
+{
+    if (!bs || !bs->drv) {
+        /* 4k should be on the safe side */
+        return 4096;
+    }
+
+    return bs->bl.min_mem_alignment;
+}
+
 /* check if the path starts with "<protocol>:" */
 int path_has_protocol(const char *path)
 {
@@ -890,6 +900,7 @@ static int bdrv_open_common(BlockDriverState *bs, BlockDriverState *file,
     }
 
     assert(bdrv_opt_mem_align(bs) != 0);
+    assert(bdrv_min_mem_align(bs) != 0);
     assert((bs->request_alignment != 0) || bs->sg);
     return 0;
 
diff --git a/block/io.c b/block/io.c
index 1ce62c4..908a3d1 100644
--- a/block/io.c
+++ b/block/io.c
@@ -201,8 +201,10 @@ void bdrv_refresh_limits(BlockDriverState *bs, Error **errp)
         }
         bs->bl.opt_transfer_length = bs->file->bl.opt_transfer_length;
         bs->bl.max_transfer_length = bs->file->bl.max_transfer_length;
+        bs->bl.min_mem_alignment = bs->file->bl.min_mem_alignment;
         bs->bl.opt_mem_alignment = bs->file->bl.opt_mem_alignment;
     } else {
+        bs->bl.min_mem_alignment = 512;
         bs->bl.opt_mem_alignment = 512;
     }
 
@@ -221,6 +223,9 @@ void bdrv_refresh_limits(BlockDriverState *bs, Error **errp)
         bs->bl.opt_mem_alignment =
             MAX(bs->bl.opt_mem_alignment,
                 bs->backing_hd->bl.opt_mem_alignment);
+        bs->bl.min_mem_alignment =
+            MAX(bs->bl.min_mem_alignment,
+                bs->backing_hd->bl.min_mem_alignment);
     }
 
     /* Then let the driver override it */
@@ -2489,7 +2494,7 @@ void *qemu_try_blockalign0(BlockDriverState *bs, size_t size)
 bool bdrv_qiov_is_aligned(BlockDriverState *bs, QEMUIOVector *qiov)
 {
     int i;
-    size_t alignment = bdrv_opt_mem_align(bs);
+    size_t alignment = bdrv_min_mem_align(bs);
 
     for (i = 0; i < qiov->niov; i++) {
         if ((uintptr_t) qiov->iov[i].iov_base % alignment) {
diff --git a/block/raw-posix.c b/block/raw-posix.c
index 24d8582..7083924 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -725,6 +725,7 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
     BDRVRawState *s = bs->opaque;
 
     raw_probe_alignment(bs, s->fd, errp);
+    bs->bl.min_mem_alignment = s->buf_align;
     bs->bl.opt_mem_alignment = s->buf_align;
 }
 
diff --git a/include/block/block.h b/include/block/block.h
index 7d1a717..c1c963e 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -440,6 +440,8 @@ void bdrv_img_create(const char *filename, const char *fmt,
 
 /* Returns the alignment in bytes that is required so that no bounce buffer
  * is required throughout the stack */
+size_t bdrv_min_mem_align(BlockDriverState *bs);
+/* Returns optimal alignment in bytes for bounce buffer */
 size_t bdrv_opt_mem_align(BlockDriverState *bs);
 void bdrv_set_guest_block_size(BlockDriverState *bs, int align);
 void *qemu_blockalign(BlockDriverState *bs, size_t size);
diff --git a/include/block/block_int.h b/include/block/block_int.h
index db29b74..f004378 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -313,6 +313,9 @@ typedef struct BlockLimits {
     int max_transfer_length;
 
     /* memory alignment so that no bounce buffer is needed */
+    size_t min_mem_alignment;
+
+    /* memory alignment for bounce buffer */
     size_t opt_mem_alignment;
 } BlockLimits;
 
-- 
1.9.1


* [Qemu-devel] [PATCH 2/2] block: align bounce buffers to page
  2015-05-12  5:47 [Qemu-devel] [PATCH v6 0/2] block: enforce minimal 4096 alignment in qemu_blockalign Denis V. Lunev
  2015-05-12  5:47 ` [Qemu-devel] [PATCH 1/2] block: minimal bounce buffer alignment Denis V. Lunev
@ 2015-05-12  5:47 ` Denis V. Lunev
  2015-05-12 10:27   ` Kevin Wolf
  1 sibling, 1 reply; 9+ messages in thread
From: Denis V. Lunev @ 2015-05-12  5:47 UTC (permalink / raw)
  Cc: Kevin Wolf, qemu-block, qemu-devel, Stefan Hajnoczi,
	Paolo Bonzini, Denis V. Lunev

The following sequence
    int fd = open(argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644);
    for (i = 0; i < 100000; i++)
            write(fd, buf, 4096);
performs 5% better if buf is aligned to 4096 bytes.

The difference is quite reliable.

On the other hand, we do not want at the moment to enforce bounce
buffering if the guest request is aligned to 512 bytes.

The patch changes the default bounce buffer optimal alignment to
MAX(page size, 4k). 4k is chosen as the maximal known sector size on
real HDDs.

The justification for the performance improvement is quite interesting.
From the kernel's point of view, each request to the disk was split
in two. This can be seen with blktrace like this:
  9,0   11  1     0.000000000 11151  Q  WS 312737792 + 1023 [qemu-img]
  9,0   11  2     0.000007938 11151  Q  WS 312738815 + 8 [qemu-img]
  9,0   11  3     0.000030735 11151  Q  WS 312738823 + 1016 [qemu-img]
  9,0   11  4     0.000032482 11151  Q  WS 312739839 + 8 [qemu-img]
  9,0   11  5     0.000041379 11151  Q  WS 312739847 + 1016 [qemu-img]
  9,0   11  6     0.000042818 11151  Q  WS 312740863 + 8 [qemu-img]
  9,0   11  7     0.000051236 11151  Q  WS 312740871 + 1017 [qemu-img]
  9,0    5  1     0.169071519 11151  Q  WS 312741888 + 1023 [qemu-img]
After the patch the pattern becomes normal:
  9,0    6  1     0.000000000 12422  Q  WS 314834944 + 1024 [qemu-img]
  9,0    6  2     0.000038527 12422  Q  WS 314835968 + 1024 [qemu-img]
  9,0    6  3     0.000072849 12422  Q  WS 314836992 + 1024 [qemu-img]
  9,0    6  4     0.000106276 12422  Q  WS 314838016 + 1024 [qemu-img]
and the number of requests sent to the disk (which can be calculated by
counting the lines in the blktrace output) is reduced by about a factor of two.

Both qemu-img and qemu-io are affected, while qemu-kvm is not. The guest
does its job well and real requests come properly aligned (to the page size).

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
---
 block.c           |  8 ++++----
 block/io.c        |  2 +-
 block/raw-posix.c | 14 ++++++++------
 3 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/block.c b/block.c
index e293907..325f727 100644
--- a/block.c
+++ b/block.c
@@ -106,8 +106,8 @@ int is_windows_drive(const char *filename)
 size_t bdrv_opt_mem_align(BlockDriverState *bs)
 {
     if (!bs || !bs->drv) {
-        /* 4k should be on the safe side */
-        return 4096;
+        /* page size or 4k (hdd sector size) should be on the safe side */
+        return MAX(4096, getpagesize());
     }
 
     return bs->bl.opt_mem_alignment;
@@ -116,8 +116,8 @@ size_t bdrv_opt_mem_align(BlockDriverState *bs)
 size_t bdrv_min_mem_align(BlockDriverState *bs)
 {
     if (!bs || !bs->drv) {
-        /* 4k should be on the safe side */
-        return 4096;
+        /* page size or 4k (hdd sector size) should be on the safe side */
+        return MAX(4096, getpagesize());
     }
 
     return bs->bl.min_mem_alignment;
diff --git a/block/io.c b/block/io.c
index 908a3d1..071652c 100644
--- a/block/io.c
+++ b/block/io.c
@@ -205,7 +205,7 @@ void bdrv_refresh_limits(BlockDriverState *bs, Error **errp)
         bs->bl.opt_mem_alignment = bs->file->bl.opt_mem_alignment;
     } else {
         bs->bl.min_mem_alignment = 512;
-        bs->bl.opt_mem_alignment = 512;
+        bs->bl.opt_mem_alignment = getpagesize();
     }
 
     if (bs->backing_hd) {
diff --git a/block/raw-posix.c b/block/raw-posix.c
index 7083924..04f3d4e 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -301,6 +301,7 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
 {
     BDRVRawState *s = bs->opaque;
     char *buf;
+    size_t max_align = MAX(MAX_BLOCKSIZE, getpagesize());
 
     /* For /dev/sg devices the alignment is not really used.
        With buffered I/O, we don't have any restrictions. */
@@ -330,9 +331,9 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
     /* If we could not get the sizes so far, we can only guess them */
     if (!s->buf_align) {
         size_t align;
-        buf = qemu_memalign(MAX_BLOCKSIZE, 2 * MAX_BLOCKSIZE);
-        for (align = 512; align <= MAX_BLOCKSIZE; align <<= 1) {
-            if (raw_is_io_aligned(fd, buf + align, MAX_BLOCKSIZE)) {
+        buf = qemu_memalign(max_align, 2 * max_align);
+        for (align = 512; align <= max_align; align <<= 1) {
+            if (raw_is_io_aligned(fd, buf + align, max_align)) {
                 s->buf_align = align;
                 break;
             }
@@ -342,8 +343,8 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
 
     if (!bs->request_alignment) {
         size_t align;
-        buf = qemu_memalign(s->buf_align, MAX_BLOCKSIZE);
-        for (align = 512; align <= MAX_BLOCKSIZE; align <<= 1) {
+        buf = qemu_memalign(s->buf_align, max_align);
+        for (align = 512; align <= max_align; align <<= 1) {
             if (raw_is_io_aligned(fd, buf, align)) {
                 bs->request_alignment = align;
                 break;
@@ -726,7 +727,9 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
 
     raw_probe_alignment(bs, s->fd, errp);
     bs->bl.min_mem_alignment = s->buf_align;
-    bs->bl.opt_mem_alignment = s->buf_align;
+    if (bs->bl.min_mem_alignment > bs->bl.opt_mem_alignment) {
+        bs->bl.opt_mem_alignment = bs->bl.min_mem_alignment;
+    }
 }
 
 static int check_for_dasd(int fd)
-- 
1.9.1


* Re: [Qemu-devel] [PATCH 2/2] block: align bounce buffers to page
  2015-05-12  5:47 ` [Qemu-devel] [PATCH 2/2] block: align bounce buffers to page Denis V. Lunev
@ 2015-05-12 10:27   ` Kevin Wolf
  2015-05-12 10:36     ` [Qemu-devel] [Qemu-block] " Denis V. Lunev
  2015-05-12 10:50     ` [Qemu-devel] " Paolo Bonzini
  0 siblings, 2 replies; 9+ messages in thread
From: Kevin Wolf @ 2015-05-12 10:27 UTC (permalink / raw)
  To: Denis V. Lunev; +Cc: Paolo Bonzini, Stefan Hajnoczi, qemu-devel, qemu-block

Am 12.05.2015 um 07:47 hat Denis V. Lunev geschrieben:
> The following sequence
>     int fd = open(argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644);
>     for (i = 0; i < 100000; i++)
>             write(fd, buf, 4096);
> performs 5% better if buf is aligned to 4096 bytes.
> 
> The difference is quite reliable.
> 
> On the other hand we do not want at the moment to enforce bounce
> buffering if guest request is aligned to 512 bytes.
> 
> The patch changes default bounce buffer optimal alignment to
> MAX(page size, 4k). 4k is chosen as maximal known sector size on real
> HDD.
> 
> The justification of the performance improve is quite interesting.
> From the kernel point of view each request to the disk was split
> by two. This could be seen by blktrace like this:
>   9,0   11  1     0.000000000 11151  Q  WS 312737792 + 1023 [qemu-img]
>   9,0   11  2     0.000007938 11151  Q  WS 312738815 + 8 [qemu-img]
>   9,0   11  3     0.000030735 11151  Q  WS 312738823 + 1016 [qemu-img]
>   9,0   11  4     0.000032482 11151  Q  WS 312739839 + 8 [qemu-img]
>   9,0   11  5     0.000041379 11151  Q  WS 312739847 + 1016 [qemu-img]
>   9,0   11  6     0.000042818 11151  Q  WS 312740863 + 8 [qemu-img]
>   9,0   11  7     0.000051236 11151  Q  WS 312740871 + 1017 [qemu-img]
>   9,0    5  1     0.169071519 11151  Q  WS 312741888 + 1023 [qemu-img]
> After the patch the pattern becomes normal:
>   9,0    6  1     0.000000000 12422  Q  WS 314834944 + 1024 [qemu-img]
>   9,0    6  2     0.000038527 12422  Q  WS 314835968 + 1024 [qemu-img]
>   9,0    6  3     0.000072849 12422  Q  WS 314836992 + 1024 [qemu-img]
>   9,0    6  4     0.000106276 12422  Q  WS 314838016 + 1024 [qemu-img]
> and the amount of requests sent to disk (could be calculated counting
> number of lines in the output of blktrace) is reduced about 2 times.
> 
> Both qemu-img and qemu-io are affected while qemu-kvm is not. The guest
> does his job well and real requests comes properly aligned (to page).
> 
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Paolo Bonzini <pbonzini@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  block.c           |  8 ++++----
>  block/io.c        |  2 +-
>  block/raw-posix.c | 14 ++++++++------
>  3 files changed, 13 insertions(+), 11 deletions(-)
> 
> diff --git a/block.c b/block.c
> index e293907..325f727 100644
> --- a/block.c
> +++ b/block.c
> @@ -106,8 +106,8 @@ int is_windows_drive(const char *filename)
>  size_t bdrv_opt_mem_align(BlockDriverState *bs)
>  {
>      if (!bs || !bs->drv) {
> -        /* 4k should be on the safe side */
> -        return 4096;
> +        /* page size or 4k (hdd sector size) should be on the safe side */
> +        return MAX(4096, getpagesize());
>      }
>  
>      return bs->bl.opt_mem_alignment;
> @@ -116,8 +116,8 @@ size_t bdrv_opt_mem_align(BlockDriverState *bs)
>  size_t bdrv_min_mem_align(BlockDriverState *bs)
>  {
>      if (!bs || !bs->drv) {
> -        /* 4k should be on the safe side */
> -        return 4096;
> +        /* page size or 4k (hdd sector size) should be on the safe side */
> +        return MAX(4096, getpagesize());
>      }
>  
>      return bs->bl.min_mem_alignment;
> diff --git a/block/io.c b/block/io.c
> index 908a3d1..071652c 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -205,7 +205,7 @@ void bdrv_refresh_limits(BlockDriverState *bs, Error **errp)
>          bs->bl.opt_mem_alignment = bs->file->bl.opt_mem_alignment;
>      } else {
>          bs->bl.min_mem_alignment = 512;
> -        bs->bl.opt_mem_alignment = 512;
> +        bs->bl.opt_mem_alignment = getpagesize();
>      }
>  
>      if (bs->backing_hd) {

I think it would make more sense to keep this specific to the raw-posix
driver. After all, it's only the kernel page cache that we optimise
here. Other backends probably don't take advantage of page alignment.

> diff --git a/block/raw-posix.c b/block/raw-posix.c
> index 7083924..04f3d4e 100644
> --- a/block/raw-posix.c
> +++ b/block/raw-posix.c
> @@ -301,6 +301,7 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
>  {
>      BDRVRawState *s = bs->opaque;
>      char *buf;
> +    size_t max_align = MAX(MAX_BLOCKSIZE, getpagesize());
>  
>      /* For /dev/sg devices the alignment is not really used.
>         With buffered I/O, we don't have any restrictions. */
> @@ -330,9 +331,9 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
>      /* If we could not get the sizes so far, we can only guess them */
>      if (!s->buf_align) {
>          size_t align;
> -        buf = qemu_memalign(MAX_BLOCKSIZE, 2 * MAX_BLOCKSIZE);
> -        for (align = 512; align <= MAX_BLOCKSIZE; align <<= 1) {
> -            if (raw_is_io_aligned(fd, buf + align, MAX_BLOCKSIZE)) {
> +        buf = qemu_memalign(max_align, 2 * max_align);
> +        for (align = 512; align <= max_align; align <<= 1) {
> +            if (raw_is_io_aligned(fd, buf + align, max_align)) {
>                  s->buf_align = align;
>                  break;
>              }
> @@ -342,8 +343,8 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
>  
>      if (!bs->request_alignment) {
>          size_t align;
> -        buf = qemu_memalign(s->buf_align, MAX_BLOCKSIZE);
> -        for (align = 512; align <= MAX_BLOCKSIZE; align <<= 1) {
> +        buf = qemu_memalign(s->buf_align, max_align);
> +        for (align = 512; align <= max_align; align <<= 1) {
>              if (raw_is_io_aligned(fd, buf, align)) {
>                  bs->request_alignment = align;
>                  break;
> @@ -726,7 +727,9 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
>  
>      raw_probe_alignment(bs, s->fd, errp);
>      bs->bl.min_mem_alignment = s->buf_align;
> -    bs->bl.opt_mem_alignment = s->buf_align;
> +    if (bs->bl.min_mem_alignment > bs->bl.opt_mem_alignment) {
> +        bs->bl.opt_mem_alignment = bs->bl.min_mem_alignment;
> +    }

Or, if you want to keep the getpagesize() initialisation as a generic
fallback just in case, I would still suggest to be explicit here instead
of relying on the default, like this:

    bs->bl.opt_mem_alignment = MAX(s->buf_align, getpagesize()).

Kevin


* Re: [Qemu-devel] [PATCH 1/2] block: minimal bounce buffer alignment
  2015-05-12  5:47 ` [Qemu-devel] [PATCH 1/2] block: minimal bounce buffer alignment Denis V. Lunev
@ 2015-05-12 10:29   ` Kevin Wolf
  0 siblings, 0 replies; 9+ messages in thread
From: Kevin Wolf @ 2015-05-12 10:29 UTC (permalink / raw)
  To: Denis V. Lunev; +Cc: Paolo Bonzini, Stefan Hajnoczi, qemu-devel, qemu-block

Am 12.05.2015 um 07:47 hat Denis V. Lunev geschrieben:
> The patch introduces new concept: minimal memory alignment for bounce
> buffers. Original so called "optimal" value is actually minimal required
> value for aligment. It should be used for validation that the IOVec
> is properly aligned and bounce buffer is not required.
> 
> Though, from the performance point of view, it would be better if
> bounce buffer or IOVec allocated by QEMU will be aligned stricter.
> 
> The patch does not change any alignment value yet.
> 
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Paolo Bonzini <pbonzini@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>

Reviewed-by: Kevin Wolf <kwolf@redhat.com>


* Re: [Qemu-devel] [Qemu-block] [PATCH 2/2] block: align bounce buffers to page
  2015-05-12 10:27   ` Kevin Wolf
@ 2015-05-12 10:36     ` Denis V. Lunev
  2015-05-12 13:08       ` Kevin Wolf
  2015-05-12 10:50     ` [Qemu-devel] " Paolo Bonzini
  1 sibling, 1 reply; 9+ messages in thread
From: Denis V. Lunev @ 2015-05-12 10:36 UTC (permalink / raw)
  To: Kevin Wolf, Denis V. Lunev
  Cc: Paolo Bonzini, qemu-block, qemu-devel, Stefan Hajnoczi

On 12/05/15 13:27, Kevin Wolf wrote:
> Am 12.05.2015 um 07:47 hat Denis V. Lunev geschrieben:
>> The following sequence
>>      int fd = open(argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644);
>>      for (i = 0; i < 100000; i++)
>>              write(fd, buf, 4096);
>> performs 5% better if buf is aligned to 4096 bytes.
>>
>> The difference is quite reliable.
>>
>> On the other hand we do not want at the moment to enforce bounce
>> buffering if guest request is aligned to 512 bytes.
>>
>> The patch changes default bounce buffer optimal alignment to
>> MAX(page size, 4k). 4k is chosen as maximal known sector size on real
>> HDD.
>>
>> The justification of the performance improve is quite interesting.
>>  From the kernel point of view each request to the disk was split
>> by two. This could be seen by blktrace like this:
>>    9,0   11  1     0.000000000 11151  Q  WS 312737792 + 1023 [qemu-img]
>>    9,0   11  2     0.000007938 11151  Q  WS 312738815 + 8 [qemu-img]
>>    9,0   11  3     0.000030735 11151  Q  WS 312738823 + 1016 [qemu-img]
>>    9,0   11  4     0.000032482 11151  Q  WS 312739839 + 8 [qemu-img]
>>    9,0   11  5     0.000041379 11151  Q  WS 312739847 + 1016 [qemu-img]
>>    9,0   11  6     0.000042818 11151  Q  WS 312740863 + 8 [qemu-img]
>>    9,0   11  7     0.000051236 11151  Q  WS 312740871 + 1017 [qemu-img]
>>    9,0    5  1     0.169071519 11151  Q  WS 312741888 + 1023 [qemu-img]
>> After the patch the pattern becomes normal:
>>    9,0    6  1     0.000000000 12422  Q  WS 314834944 + 1024 [qemu-img]
>>    9,0    6  2     0.000038527 12422  Q  WS 314835968 + 1024 [qemu-img]
>>    9,0    6  3     0.000072849 12422  Q  WS 314836992 + 1024 [qemu-img]
>>    9,0    6  4     0.000106276 12422  Q  WS 314838016 + 1024 [qemu-img]
>> and the amount of requests sent to disk (could be calculated counting
>> number of lines in the output of blktrace) is reduced about 2 times.
>>
>> Both qemu-img and qemu-io are affected while qemu-kvm is not. The guest
>> does his job well and real requests comes properly aligned (to page).
>>
>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>> CC: Paolo Bonzini <pbonzini@redhat.com>
>> CC: Kevin Wolf <kwolf@redhat.com>
>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>> ---
>>   block.c           |  8 ++++----
>>   block/io.c        |  2 +-
>>   block/raw-posix.c | 14 ++++++++------
>>   3 files changed, 13 insertions(+), 11 deletions(-)
>>
>> diff --git a/block.c b/block.c
>> index e293907..325f727 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -106,8 +106,8 @@ int is_windows_drive(const char *filename)
>>   size_t bdrv_opt_mem_align(BlockDriverState *bs)
>>   {
>>       if (!bs || !bs->drv) {
>> -        /* 4k should be on the safe side */
>> -        return 4096;
>> +        /* page size or 4k (hdd sector size) should be on the safe side */
>> +        return MAX(4096, getpagesize());
>>       }
>>
>>       return bs->bl.opt_mem_alignment;
>> @@ -116,8 +116,8 @@ size_t bdrv_opt_mem_align(BlockDriverState *bs)
>>   size_t bdrv_min_mem_align(BlockDriverState *bs)
>>   {
>>       if (!bs || !bs->drv) {
>> -        /* 4k should be on the safe side */
>> -        return 4096;
>> +        /* page size or 4k (hdd sector size) should be on the safe side */
>> +        return MAX(4096, getpagesize());
>>       }
>>
>>       return bs->bl.min_mem_alignment;
>> diff --git a/block/io.c b/block/io.c
>> index 908a3d1..071652c 100644
>> --- a/block/io.c
>> +++ b/block/io.c
>> @@ -205,7 +205,7 @@ void bdrv_refresh_limits(BlockDriverState *bs, Error **errp)
>>           bs->bl.opt_mem_alignment = bs->file->bl.opt_mem_alignment;
>>       } else {
>>           bs->bl.min_mem_alignment = 512;
>> -        bs->bl.opt_mem_alignment = 512;
>> +        bs->bl.opt_mem_alignment = getpagesize();
>>       }
>>
>>       if (bs->backing_hd) {
>
> I think it would make more sense to keep this specific to the raw-posix
> driver. After all, it's only the kernel page cache that we optimise
> here. Other backends probably don't take advantage of page alignment.
>
>> diff --git a/block/raw-posix.c b/block/raw-posix.c
>> index 7083924..04f3d4e 100644
>> --- a/block/raw-posix.c
>> +++ b/block/raw-posix.c
>> @@ -301,6 +301,7 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
>>   {
>>       BDRVRawState *s = bs->opaque;
>>       char *buf;
>> +    size_t max_align = MAX(MAX_BLOCKSIZE, getpagesize());
>>
>>       /* For /dev/sg devices the alignment is not really used.
>>          With buffered I/O, we don't have any restrictions. */
>> @@ -330,9 +331,9 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
>>       /* If we could not get the sizes so far, we can only guess them */
>>       if (!s->buf_align) {
>>           size_t align;
>> -        buf = qemu_memalign(MAX_BLOCKSIZE, 2 * MAX_BLOCKSIZE);
>> -        for (align = 512; align <= MAX_BLOCKSIZE; align <<= 1) {
>> -            if (raw_is_io_aligned(fd, buf + align, MAX_BLOCKSIZE)) {
>> +        buf = qemu_memalign(max_align, 2 * max_align);
>> +        for (align = 512; align <= max_align; align <<= 1) {
>> +            if (raw_is_io_aligned(fd, buf + align, max_align)) {
>>                   s->buf_align = align;
>>                   break;
>>               }
>> @@ -342,8 +343,8 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
>>
>>       if (!bs->request_alignment) {
>>           size_t align;
>> -        buf = qemu_memalign(s->buf_align, MAX_BLOCKSIZE);
>> -        for (align = 512; align <= MAX_BLOCKSIZE; align <<= 1) {
>> +        buf = qemu_memalign(s->buf_align, max_align);
>> +        for (align = 512; align <= max_align; align <<= 1) {
>>               if (raw_is_io_aligned(fd, buf, align)) {
>>                   bs->request_alignment = align;
>>                   break;
>> @@ -726,7 +727,9 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
>>
>>       raw_probe_alignment(bs, s->fd, errp);
>>       bs->bl.min_mem_alignment = s->buf_align;
>> -    bs->bl.opt_mem_alignment = s->buf_align;
>> +    if (bs->bl.min_mem_alignment > bs->bl.opt_mem_alignment) {
>> +        bs->bl.opt_mem_alignment = bs->bl.min_mem_alignment;
>> +    }
>
> Or, if you want to keep the getpagesize() initialisation as a generic
> fallback just in case, I would still suggest to be explicit here instead
> of relying on the default, like this:
>
>      bs->bl.opt_mem_alignment = MAX(s->buf_align, getpagesize()).
>
> Kevin
>
Definitely, I can do this if it is a strict requirement. I have not
performed any real testing on Windows and other platforms, but from my
point of view we will be on the safe side with this alignment.

Please note that I do not add any new allocation or any new alignment
check. The patch just forces the alignment of an allocation which will
be performed in any case, and this approach simply matches the I/O
coming from the guest with the I/O initiated by qemu-img/qemu-io. All
guest operations (both Windows and Linux) really come page-aligned by
address and offset nowadays.

This approach is safe. It does not bring any additional (significant)
overhead.

Den


* Re: [Qemu-devel] [PATCH 2/2] block: align bounce buffers to page
  2015-05-12 10:27   ` Kevin Wolf
  2015-05-12 10:36     ` [Qemu-devel] [Qemu-block] " Denis V. Lunev
@ 2015-05-12 10:50     ` Paolo Bonzini
  1 sibling, 0 replies; 9+ messages in thread
From: Paolo Bonzini @ 2015-05-12 10:50 UTC (permalink / raw)
  To: Kevin Wolf, Denis V. Lunev; +Cc: Stefan Hajnoczi, qemu-devel, qemu-block



On 12/05/2015 12:27, Kevin Wolf wrote:
> I think it would make more sense to keep this specific to the raw-posix
> driver. After all, it's only the kernel page cache that we optimise
> here. Other backends probably don't take advantage of page alignment.

I don't think it makes sense to keep it raw-posix-specific, though.
It's not the page cache that we optimize for, because this is with
O_DIRECT.  If anything, making the buffer page aligned means that it
spans one fewer physical page, and may thus save a bit on TLB misses.

Paolo


* Re: [Qemu-devel] [Qemu-block] [PATCH 2/2] block: align bounce buffers to page
  2015-05-12 10:36     ` [Qemu-devel] [Qemu-block] " Denis V. Lunev
@ 2015-05-12 13:08       ` Kevin Wolf
  2015-05-12 13:13         ` Denis V. Lunev
  0 siblings, 1 reply; 9+ messages in thread
From: Kevin Wolf @ 2015-05-12 13:08 UTC (permalink / raw)
  To: Denis V. Lunev
  Cc: Denis V. Lunev, qemu-block, qemu-devel, Stefan Hajnoczi, Paolo Bonzini

On 12.05.2015 at 12:36, Denis V. Lunev wrote:
> On 12/05/15 13:27, Kevin Wolf wrote:
> >On 12.05.2015 at 07:47, Denis V. Lunev wrote:
> >>The following sequence
> >>     int fd = open(argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644);
> >>     for (i = 0; i < 100000; i++)
> >>             write(fd, buf, 4096);
> >>performs 5% better if buf is aligned to 4096 bytes.
> >>
> >>The difference is quite reliable.
> >>
> >>On the other hand we do not want at the moment to enforce bounce
> >>buffering if guest request is aligned to 512 bytes.
> >>
> >>The patch changes the default bounce buffer optimal alignment to
> >>MAX(page size, 4k). 4k is chosen as the maximal known sector size on
> >>real HDDs.
> >>
> >>The justification of the performance improvement is quite interesting.
> >>From the kernel point of view, each request to the disk was split in
> >>two. This can be seen with blktrace like this:
> >>   9,0   11  1     0.000000000 11151  Q  WS 312737792 + 1023 [qemu-img]
> >>   9,0   11  2     0.000007938 11151  Q  WS 312738815 + 8 [qemu-img]
> >>   9,0   11  3     0.000030735 11151  Q  WS 312738823 + 1016 [qemu-img]
> >>   9,0   11  4     0.000032482 11151  Q  WS 312739839 + 8 [qemu-img]
> >>   9,0   11  5     0.000041379 11151  Q  WS 312739847 + 1016 [qemu-img]
> >>   9,0   11  6     0.000042818 11151  Q  WS 312740863 + 8 [qemu-img]
> >>   9,0   11  7     0.000051236 11151  Q  WS 312740871 + 1017 [qemu-img]
> >>   9,0    5  1     0.169071519 11151  Q  WS 312741888 + 1023 [qemu-img]
> >>After the patch the pattern becomes normal:
> >>   9,0    6  1     0.000000000 12422  Q  WS 314834944 + 1024 [qemu-img]
> >>   9,0    6  2     0.000038527 12422  Q  WS 314835968 + 1024 [qemu-img]
> >>   9,0    6  3     0.000072849 12422  Q  WS 314836992 + 1024 [qemu-img]
> >>   9,0    6  4     0.000106276 12422  Q  WS 314838016 + 1024 [qemu-img]
> >>and the number of requests sent to the disk (which can be calculated by
> >>counting the lines in the blktrace output) is reduced by about a factor
> >>of two.
> >>
> >>Both qemu-img and qemu-io are affected while qemu-kvm is not. The guest
> >>does its job well and real requests come properly aligned (to a page).
> >>
> >>Signed-off-by: Denis V. Lunev <den@openvz.org>
> >>CC: Paolo Bonzini <pbonzini@redhat.com>
> >>CC: Kevin Wolf <kwolf@redhat.com>
> >>CC: Stefan Hajnoczi <stefanha@redhat.com>
> >>---
> >>  block.c           |  8 ++++----
> >>  block/io.c        |  2 +-
> >>  block/raw-posix.c | 14 ++++++++------
> >>  3 files changed, 13 insertions(+), 11 deletions(-)
> >>
> >>diff --git a/block.c b/block.c
> >>index e293907..325f727 100644
> >>--- a/block.c
> >>+++ b/block.c
> >>@@ -106,8 +106,8 @@ int is_windows_drive(const char *filename)
> >>  size_t bdrv_opt_mem_align(BlockDriverState *bs)
> >>  {
> >>      if (!bs || !bs->drv) {
> >>-        /* 4k should be on the safe side */
> >>-        return 4096;
> >>+        /* page size or 4k (hdd sector size) should be on the safe side */
> >>+        return MAX(4096, getpagesize());
> >>      }
> >>
> >>      return bs->bl.opt_mem_alignment;
> >>@@ -116,8 +116,8 @@ size_t bdrv_opt_mem_align(BlockDriverState *bs)
> >>  size_t bdrv_min_mem_align(BlockDriverState *bs)
> >>  {
> >>      if (!bs || !bs->drv) {
> >>-        /* 4k should be on the safe side */
> >>-        return 4096;
> >>+        /* page size or 4k (hdd sector size) should be on the safe side */
> >>+        return MAX(4096, getpagesize());
> >>      }
> >>
> >>      return bs->bl.min_mem_alignment;
> >>diff --git a/block/io.c b/block/io.c
> >>index 908a3d1..071652c 100644
> >>--- a/block/io.c
> >>+++ b/block/io.c
> >>@@ -205,7 +205,7 @@ void bdrv_refresh_limits(BlockDriverState *bs, Error **errp)
> >>          bs->bl.opt_mem_alignment = bs->file->bl.opt_mem_alignment;
> >>      } else {
> >>          bs->bl.min_mem_alignment = 512;
> >>-        bs->bl.opt_mem_alignment = 512;
> >>+        bs->bl.opt_mem_alignment = getpagesize();
> >>      }
> >>
> >>      if (bs->backing_hd) {
> >
> >I think it would make more sense to keep this specific to the raw-posix
> >driver. After all, it's only the kernel page cache that we optimise
> >here. Other backends probably don't take advantage of page alignment.
> >
> >>diff --git a/block/raw-posix.c b/block/raw-posix.c
> >>index 7083924..04f3d4e 100644
> >>--- a/block/raw-posix.c
> >>+++ b/block/raw-posix.c
> >>@@ -301,6 +301,7 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
> >>  {
> >>      BDRVRawState *s = bs->opaque;
> >>      char *buf;
> >>+    size_t max_align = MAX(MAX_BLOCKSIZE, getpagesize());
> >>
> >>      /* For /dev/sg devices the alignment is not really used.
> >>         With buffered I/O, we don't have any restrictions. */
> >>@@ -330,9 +331,9 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
> >>      /* If we could not get the sizes so far, we can only guess them */
> >>      if (!s->buf_align) {
> >>          size_t align;
> >>-        buf = qemu_memalign(MAX_BLOCKSIZE, 2 * MAX_BLOCKSIZE);
> >>-        for (align = 512; align <= MAX_BLOCKSIZE; align <<= 1) {
> >>-            if (raw_is_io_aligned(fd, buf + align, MAX_BLOCKSIZE)) {
> >>+        buf = qemu_memalign(max_align, 2 * max_align);
> >>+        for (align = 512; align <= max_align; align <<= 1) {
> >>+            if (raw_is_io_aligned(fd, buf + align, max_align)) {
> >>                  s->buf_align = align;
> >>                  break;
> >>              }
> >>@@ -342,8 +343,8 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
> >>
> >>      if (!bs->request_alignment) {
> >>          size_t align;
> >>-        buf = qemu_memalign(s->buf_align, MAX_BLOCKSIZE);
> >>-        for (align = 512; align <= MAX_BLOCKSIZE; align <<= 1) {
> >>+        buf = qemu_memalign(s->buf_align, max_align);
> >>+        for (align = 512; align <= max_align; align <<= 1) {
> >>              if (raw_is_io_aligned(fd, buf, align)) {
> >>                  bs->request_alignment = align;
> >>                  break;
> >>@@ -726,7 +727,9 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
> >>
> >>      raw_probe_alignment(bs, s->fd, errp);
> >>      bs->bl.min_mem_alignment = s->buf_align;
> >>-    bs->bl.opt_mem_alignment = s->buf_align;
> >>+    if (bs->bl.min_mem_alignment > bs->bl.opt_mem_alignment) {
> >>+        bs->bl.opt_mem_alignment = bs->bl.min_mem_alignment;
> >>+    }
> >
> >Or, if you want to keep the getpagesize() initialisation as a generic
> >fallback just in case, I would still suggest to be explicit here instead
> >of relying on the default, like this:
> >
> >     bs->bl.opt_mem_alignment = MAX(s->buf_align, getpagesize()).
> >
> >Kevin
> >
> Definitely, I can do this if it is a strict requirement. I have not
> performed any real testing on Windows or other platforms, but from my
> point of view we will be on the safe side with this alignment.

Yes, it certainly won't hurt as a default, so I'm okay with keeping it
in block.c. I would only like to have it explicit in raw-posix, too,
because the justification you use in the commit message is specific to
raw-posix (or, to be more precise, specific to raw-posix on Linux).

Paolo is right that I missed that the page cache isn't involved, but
then it must be the Linux block layer that splits the requests as you
reported. That's still raw-posix only.

For other backends (like network protocols), defaulting to pagesize
shouldn't hurt and possibly there are some effects that make it an
improvement there as well, but for raw-posix we actually have a good
reason to do so and to be explicit about it in the driver.

> Please note that I do not add any new allocation or any new
> alignment check. The patch just forces alignment of an
> allocation that will be performed in any case, and this
> approach matches I/O coming from the guest with I/O initiated
> by qemu-img/qemu-io. All guest operations (both Windows and
> Linux) are in fact page aligned by address and offset
> nowadays.
> 
> This approach is safe and does not bring any significant
> additional overhead.

Yes, I understand that. :-)

Kevin


* Re: [Qemu-devel] [Qemu-block] [PATCH 2/2] block: align bounce buffers to page
  2015-05-12 13:08       ` Kevin Wolf
@ 2015-05-12 13:13         ` Denis V. Lunev
  0 siblings, 0 replies; 9+ messages in thread
From: Denis V. Lunev @ 2015-05-12 13:13 UTC (permalink / raw)
  To: Kevin Wolf, Denis V. Lunev
  Cc: Paolo Bonzini, qemu-block, qemu-devel, Stefan Hajnoczi

On 12/05/15 16:08, Kevin Wolf wrote:
> On 12.05.2015 at 12:36, Denis V. Lunev wrote:
>> On 12/05/15 13:27, Kevin Wolf wrote:
>>> On 12.05.2015 at 07:47, Denis V. Lunev wrote:
>>>> The following sequence
>>>>      int fd = open(argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644);
>>>>      for (i = 0; i < 100000; i++)
>>>>              write(fd, buf, 4096);
>>>> performs 5% better if buf is aligned to 4096 bytes.
>>>>
>>>> The difference is quite reliable.
>>>>
>>>> On the other hand we do not want at the moment to enforce bounce
>>>> buffering if guest request is aligned to 512 bytes.
>>>>
>>>> The patch changes the default bounce buffer optimal alignment to
>>>> MAX(page size, 4k). 4k is chosen as the maximal known sector size on
>>>> real HDDs.
>>>>
>>>> The justification of the performance improvement is quite interesting.
>>>> From the kernel point of view, each request to the disk was split in
>>>> two. This can be seen with blktrace like this:
>>>>    9,0   11  1     0.000000000 11151  Q  WS 312737792 + 1023 [qemu-img]
>>>>    9,0   11  2     0.000007938 11151  Q  WS 312738815 + 8 [qemu-img]
>>>>    9,0   11  3     0.000030735 11151  Q  WS 312738823 + 1016 [qemu-img]
>>>>    9,0   11  4     0.000032482 11151  Q  WS 312739839 + 8 [qemu-img]
>>>>    9,0   11  5     0.000041379 11151  Q  WS 312739847 + 1016 [qemu-img]
>>>>    9,0   11  6     0.000042818 11151  Q  WS 312740863 + 8 [qemu-img]
>>>>    9,0   11  7     0.000051236 11151  Q  WS 312740871 + 1017 [qemu-img]
>>>>    9,0    5  1     0.169071519 11151  Q  WS 312741888 + 1023 [qemu-img]
>>>> After the patch the pattern becomes normal:
>>>>    9,0    6  1     0.000000000 12422  Q  WS 314834944 + 1024 [qemu-img]
>>>>    9,0    6  2     0.000038527 12422  Q  WS 314835968 + 1024 [qemu-img]
>>>>    9,0    6  3     0.000072849 12422  Q  WS 314836992 + 1024 [qemu-img]
>>>>    9,0    6  4     0.000106276 12422  Q  WS 314838016 + 1024 [qemu-img]
>>>> and the number of requests sent to the disk (which can be calculated by
>>>> counting the lines in the blktrace output) is reduced by about a factor
>>>> of two.
>>>>
>>>> Both qemu-img and qemu-io are affected while qemu-kvm is not. The guest
>>>> does its job well and real requests come properly aligned (to a page).
>>>>
>>>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>>>> CC: Paolo Bonzini <pbonzini@redhat.com>
>>>> CC: Kevin Wolf <kwolf@redhat.com>
>>>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>>>> ---
>>>>   block.c           |  8 ++++----
>>>>   block/io.c        |  2 +-
>>>>   block/raw-posix.c | 14 ++++++++------
>>>>   3 files changed, 13 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/block.c b/block.c
>>>> index e293907..325f727 100644
>>>> --- a/block.c
>>>> +++ b/block.c
>>>> @@ -106,8 +106,8 @@ int is_windows_drive(const char *filename)
>>>>   size_t bdrv_opt_mem_align(BlockDriverState *bs)
>>>>   {
>>>>       if (!bs || !bs->drv) {
>>>> -        /* 4k should be on the safe side */
>>>> -        return 4096;
>>>> +        /* page size or 4k (hdd sector size) should be on the safe side */
>>>> +        return MAX(4096, getpagesize());
>>>>       }
>>>>
>>>>       return bs->bl.opt_mem_alignment;
>>>> @@ -116,8 +116,8 @@ size_t bdrv_opt_mem_align(BlockDriverState *bs)
>>>>   size_t bdrv_min_mem_align(BlockDriverState *bs)
>>>>   {
>>>>       if (!bs || !bs->drv) {
>>>> -        /* 4k should be on the safe side */
>>>> -        return 4096;
>>>> +        /* page size or 4k (hdd sector size) should be on the safe side */
>>>> +        return MAX(4096, getpagesize());
>>>>       }
>>>>
>>>>       return bs->bl.min_mem_alignment;
>>>> diff --git a/block/io.c b/block/io.c
>>>> index 908a3d1..071652c 100644
>>>> --- a/block/io.c
>>>> +++ b/block/io.c
>>>> @@ -205,7 +205,7 @@ void bdrv_refresh_limits(BlockDriverState *bs, Error **errp)
>>>>           bs->bl.opt_mem_alignment = bs->file->bl.opt_mem_alignment;
>>>>       } else {
>>>>           bs->bl.min_mem_alignment = 512;
>>>> -        bs->bl.opt_mem_alignment = 512;
>>>> +        bs->bl.opt_mem_alignment = getpagesize();
>>>>       }
>>>>
>>>>       if (bs->backing_hd) {
>>> I think it would make more sense to keep this specific to the raw-posix
>>> driver. After all, it's only the kernel page cache that we optimise
>>> here. Other backends probably don't take advantage of page alignment.
>>>
>>>> diff --git a/block/raw-posix.c b/block/raw-posix.c
>>>> index 7083924..04f3d4e 100644
>>>> --- a/block/raw-posix.c
>>>> +++ b/block/raw-posix.c
>>>> @@ -301,6 +301,7 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
>>>>   {
>>>>       BDRVRawState *s = bs->opaque;
>>>>       char *buf;
>>>> +    size_t max_align = MAX(MAX_BLOCKSIZE, getpagesize());
>>>>
>>>>       /* For /dev/sg devices the alignment is not really used.
>>>>          With buffered I/O, we don't have any restrictions. */
>>>> @@ -330,9 +331,9 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
>>>>       /* If we could not get the sizes so far, we can only guess them */
>>>>       if (!s->buf_align) {
>>>>           size_t align;
>>>> -        buf = qemu_memalign(MAX_BLOCKSIZE, 2 * MAX_BLOCKSIZE);
>>>> -        for (align = 512; align <= MAX_BLOCKSIZE; align <<= 1) {
>>>> -            if (raw_is_io_aligned(fd, buf + align, MAX_BLOCKSIZE)) {
>>>> +        buf = qemu_memalign(max_align, 2 * max_align);
>>>> +        for (align = 512; align <= max_align; align <<= 1) {
>>>> +            if (raw_is_io_aligned(fd, buf + align, max_align)) {
>>>>                   s->buf_align = align;
>>>>                   break;
>>>>               }
>>>> @@ -342,8 +343,8 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
>>>>
>>>>       if (!bs->request_alignment) {
>>>>           size_t align;
>>>> -        buf = qemu_memalign(s->buf_align, MAX_BLOCKSIZE);
>>>> -        for (align = 512; align <= MAX_BLOCKSIZE; align <<= 1) {
>>>> +        buf = qemu_memalign(s->buf_align, max_align);
>>>> +        for (align = 512; align <= max_align; align <<= 1) {
>>>>               if (raw_is_io_aligned(fd, buf, align)) {
>>>>                   bs->request_alignment = align;
>>>>                   break;
>>>> @@ -726,7 +727,9 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
>>>>
>>>>       raw_probe_alignment(bs, s->fd, errp);
>>>>       bs->bl.min_mem_alignment = s->buf_align;
>>>> -    bs->bl.opt_mem_alignment = s->buf_align;
>>>> +    if (bs->bl.min_mem_alignment > bs->bl.opt_mem_alignment) {
>>>> +        bs->bl.opt_mem_alignment = bs->bl.min_mem_alignment;
>>>> +    }
>>> Or, if you want to keep the getpagesize() initialisation as a generic
>>> fallback just in case, I would still suggest to be explicit here instead
>>> of relying on the default, like this:
>>>
>>>      bs->bl.opt_mem_alignment = MAX(s->buf_align, getpagesize()).
>>>
>>> Kevin
>>>
>> Definitely, I can do this if it is a strict requirement. I have not
>> performed any real testing on Windows or other platforms, but from my
>> point of view we will be on the safe side with this alignment.
> Yes, it certainly won't hurt as a default, so I'm okay with keeping it
> in block.c. I would only like to have it explicit in raw-posix, too,
> because the justification you use in the commit message is specific to
> raw-posix (or, to be more precise, specific to raw-posix on Linux).
>
> Paolo is right that I missed that the page cache isn't involved, but
> then it must be the Linux block layer that splits the requests as you
> reported. That's still raw-posix only.
>
> For other backends (like network protocols), defaulting to pagesize
> shouldn't hurt and possibly there are some effects that make it an
> improvement there as well, but for raw-posix we actually have a good
> reason to do so and to be explicit about it in the driver.

OK, makes sense.

>> Please note that I do not add any new allocation or any new
>> alignment check. The patch just forces alignment of an
>> allocation that will be performed in any case, and this
>> approach matches I/O coming from the guest with I/O initiated
>> by qemu-img/qemu-io. All guest operations (both Windows and
>> Linux) are in fact page aligned by address and offset
>> nowadays.
>>
>> This approach is safe and does not bring any significant
>> additional overhead.
> Yes, I understand that. :-)
>
> Kevin



Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
2015-05-12  5:47 [Qemu-devel] [PATCH v6 0/2] block: enforce minimal 4096 alignment in qemu_blockalign Denis V. Lunev
2015-05-12  5:47 ` [Qemu-devel] [PATCH 1/2] block: minimal bounce buffer alignment Denis V. Lunev
2015-05-12 10:29   ` Kevin Wolf
2015-05-12  5:47 ` [Qemu-devel] [PATCH 2/2] block: align bounce buffers to page Denis V. Lunev
2015-05-12 10:27   ` Kevin Wolf
2015-05-12 10:36     ` [Qemu-devel] [Qemu-block] " Denis V. Lunev
2015-05-12 13:08       ` Kevin Wolf
2015-05-12 13:13         ` Denis V. Lunev
2015-05-12 10:50     ` [Qemu-devel] " Paolo Bonzini
