[Qemu-devel] [PATCH 0/7] Initial VHDX support (and a bug fix for QCOW)

All of lore.kernel.org
 help / color / mirror / Atom feed

* [Qemu-devel] [PATCH 0/7] Initial VHDX support (and a bug fix for QCOW)
@ 2013-03-06 14:46 Jeff Cody
  2013-03-06 14:47 ` [Qemu-devel] [PATCH 1/7] block: only force IO completion in .bdrv_truncate if we are shrinking Jeff Cody
                   ` (6 more replies)
  0 siblings, 7 replies; 49+ messages in thread
From: Jeff Cody @ 2013-03-06 14:46 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, sw, stefanha

Note: Patch 1/7 is a bug fix for QCOW, that is also needed for VHDX

This adds the initial support for VHDX image files.  It supports both
read and write operations, across fixed and dynamic files.

Notably, the following is not yet supported:
    * Differencing files
    * Log replay (so we will refuse to open any images that are not 'clean')
    * .bdrv_create()

This series, and additional work to add in the above features, may be found
at:

https://github.com/codyprime/qemu-kvm-jtc/tree/jtc-vhdx-v1
git://github.com/codyprime/qemu-kvm-jtc.git, branch jtc-vhdx-v1

Jeff Cody (7):
  block: only force IO completion in .bdrv_truncate if we are shrinking
  qemu: add castagnoli crc32c checksum algorithm
  block: vhdx header for the QEMU support of VHDX images
  block: initial VHDX driver support framework - supports open and probe
  block: add read-only support to VHDX image format.
  block: add header update capability for VHDX images.
  block: add write support for VHDX images

 block.c               |   10 +-
 block/Makefile.objs   |    1 +
 block/vhdx.c          | 1137 +++++++++++++++++++++++++++++++++++++++++++++++++
 block/vhdx.h          |  326 ++++++++++++++
 include/qemu/crc32c.h |   35 ++
 util/Makefile.objs    |    1 +
 util/crc32c.c         |  115 +++++
 7 files changed, 1623 insertions(+), 2 deletions(-)
 create mode 100644 block/vhdx.c
 create mode 100644 block/vhdx.h
 create mode 100644 include/qemu/crc32c.h
 create mode 100644 util/crc32c.c

-- 
1.8.1.4

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [Qemu-devel] [PATCH 1/7] block: only force IO completion in .bdrv_truncate if we are shrinking
  2013-03-06 14:46 [Qemu-devel] [PATCH 0/7] Initial VHDX support (and a bug fix for QCOW) Jeff Cody
@ 2013-03-06 14:47 ` Jeff Cody
  2013-03-06 17:50   ` Peter Lieven
  2013-03-06 14:47 ` [Qemu-devel] [PATCH 2/7] qemu: add castagnoli crc32c checksum algorithm Jeff Cody
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 49+ messages in thread
From: Jeff Cody @ 2013-03-06 14:47 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, sw, pl, stefanha, pbonzini

Commit 9a665b2b made bdrv_truncate() call bdrv_drain_all(), but this breaks
QCOW images, as well other future image formats (such as VHDX) that may call
bdrv_truncate(bs->file) from within a read/write operation.  For example, QCOW
will cause an assert, due to tracked_requests not being empty (since the
read/write that called bdrv_truncate() is still in progress).

This modifies the check to only force the bdrv_drain_all() call if the BDS to
be truncated is not growable, or if we are shrinking the BDS.  Otherwise, if we
are just growing the file, allow it to happen without forcing a call to
bdrv_drain_all().

Signed-off-by: Jeff Cody <jcody@redhat.com>
---
 block.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/block.c b/block.c
index 124a9eb..dce844c 100644
--- a/block.c
+++ b/block.c
@@ -2450,8 +2450,14 @@ int bdrv_truncate(BlockDriverState *bs, int64_t offset)
     if (bdrv_in_use(bs))
         return -EBUSY;
 
-    /* There better not be any in-flight IOs when we truncate the device. */
-    bdrv_drain_all();
+    /* Don't force a drain if we are just growing the file - otherwise,
+     * using bdrv_truncate() from within a block driver in a read/write
+     * operation will cause an assert, because bdrv_drain_all() will assert if
+     * a tracked_request is still in progress. */
+    if (!bs->growable || offset < bdrv_getlength(bs)) {
+        /* There better not be any in-flight IOs when we truncate the device. */
+        bdrv_drain_all();
+    }
 
     ret = drv->bdrv_truncate(bs, offset);
     if (ret == 0) {
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [Qemu-devel] [PATCH 2/7] qemu: add castagnoli crc32c checksum algorithm
  2013-03-06 14:46 [Qemu-devel] [PATCH 0/7] Initial VHDX support (and a bug fix for QCOW) Jeff Cody
  2013-03-06 14:47 ` [Qemu-devel] [PATCH 1/7] block: only force IO completion in .bdrv_truncate if we are shrinking Jeff Cody
@ 2013-03-06 14:47 ` Jeff Cody
  2013-03-06 14:47 ` [Qemu-devel] [PATCH 3/7] block: vhdx header for the QEMU support of VHDX images Jeff Cody
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 49+ messages in thread
From: Jeff Cody @ 2013-03-06 14:47 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, sw, stefanha

This adds the Castagnoli CRC32C algorithm, using the 0x11EDC6F41
polynomial.

This is extracted from the linux kernel cryptographic crc32.c module.

The algorithm is based on:

Castagnoli93: Guy Castagnoli and Stefan Braeuer and Martin Herrman
             "Optimization of Cyclic Redundancy-Check Codes with 24
              and 32 Parity Bits", IEEE Transactions on Communication,
              Volume 41, Number 6, June 1993

Signed-off-by: Jeff Cody <jcody@redhat.com>
---
 include/qemu/crc32c.h |  35 +++++++++++++++
 util/Makefile.objs    |   1 +
 util/crc32c.c         | 115 ++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 151 insertions(+)
 create mode 100644 include/qemu/crc32c.h
 create mode 100644 util/crc32c.c

diff --git a/include/qemu/crc32c.h b/include/qemu/crc32c.h
new file mode 100644
index 0000000..56d1c3b
--- /dev/null
+++ b/include/qemu/crc32c.h
@@ -0,0 +1,35 @@
+/*
+ *  Castagnoli CRC32C Checksum Algorithm
+ *
+ *  Polynomial: 0x11EDC6F41
+ *
+ *  Castagnoli93: Guy Castagnoli and Stefan Braeuer and Martin Herrman
+ *               "Optimization of Cyclic Redundancy-Check Codes with 24
+ *                 and 32 Parity Bits",IEEE Transactions on Communication,
+ *                Volume 41, Number 6, June 1993
+ *
+ *  Copyright (c) 2013 Red Hat, Inc.,
+ *
+ *  Authors:
+ *   Jeff Cody <jcody@redhat.com>
+ *
+ *  Based on the Linux kernel cryptographic crc32c module,
+ *
+ *  Copyright (c) 2004 Cisco Systems, Inc.
+ *  Copyright (c) 2008 Herbert Xu <herbert@gondor.apana.org.au>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ */
+
+#ifndef QEMU_CRC32_H
+#define QEMU_CRC32_H
+
+#include "qemu-common.h"
+
+uint32_t crc32c(uint32_t crc, const uint8_t *data, unsigned int length);
+
+#endif
diff --git a/util/Makefile.objs b/util/Makefile.objs
index 495a178..3dd2eda 100644
--- a/util/Makefile.objs
+++ b/util/Makefile.objs
@@ -8,3 +8,4 @@ util-obj-y += error.o qemu-error.o
 util-obj-$(CONFIG_POSIX) += compatfd.o
 util-obj-y += iov.o aes.o qemu-config.o qemu-sockets.o uri.o notify.o
 util-obj-y += qemu-option.o qemu-progress.o
+util-obj-y += crc32c.o
diff --git a/util/crc32c.c b/util/crc32c.c
new file mode 100644
index 0000000..8866327
--- /dev/null
+++ b/util/crc32c.c
@@ -0,0 +1,115 @@
+/*
+ *  Castagnoli CRC32C Checksum Algorithm
+ *
+ *  Polynomial: 0x11EDC6F41
+ *
+ *  Castagnoli93: Guy Castagnoli and Stefan Braeuer and Martin Herrman
+ *               "Optimization of Cyclic Redundancy-Check Codes with 24
+ *                 and 32 Parity Bits",IEEE Transactions on Communication,
+ *                Volume 41, Number 6, June 1993
+ *
+ *  Copyright (c) 2013 Red Hat, Inc.,
+ *
+ *  Authors:
+ *   Jeff Cody <jcody@redhat.com>
+ *
+ *  Based on the Linux kernel cryptographic crc32c module,
+ *
+ *  Copyright (c) 2004 Cisco Systems, Inc.
+ *  Copyright (c) 2008 Herbert Xu <herbert@gondor.apana.org.au>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ */
+
+#include "qemu-common.h"
+#include "qemu/crc32c.h"
+
+/*
+ * This is the CRC-32C table
+ * Generated with:
+ * width = 32 bits
+ * poly = 0x1EDC6F41
+ * reflect input bytes = true
+ * reflect output bytes = true
+ */
+
+static const uint32_t crc32c_table[256] = {
+    0x00000000L, 0xF26B8303L, 0xE13B70F7L, 0x1350F3F4L,
+    0xC79A971FL, 0x35F1141CL, 0x26A1E7E8L, 0xD4CA64EBL,
+    0x8AD958CFL, 0x78B2DBCCL, 0x6BE22838L, 0x9989AB3BL,
+    0x4D43CFD0L, 0xBF284CD3L, 0xAC78BF27L, 0x5E133C24L,
+    0x105EC76FL, 0xE235446CL, 0xF165B798L, 0x030E349BL,
+    0xD7C45070L, 0x25AFD373L, 0x36FF2087L, 0xC494A384L,
+    0x9A879FA0L, 0x68EC1CA3L, 0x7BBCEF57L, 0x89D76C54L,
+    0x5D1D08BFL, 0xAF768BBCL, 0xBC267848L, 0x4E4DFB4BL,
+    0x20BD8EDEL, 0xD2D60DDDL, 0xC186FE29L, 0x33ED7D2AL,
+    0xE72719C1L, 0x154C9AC2L, 0x061C6936L, 0xF477EA35L,
+    0xAA64D611L, 0x580F5512L, 0x4B5FA6E6L, 0xB93425E5L,
+    0x6DFE410EL, 0x9F95C20DL, 0x8CC531F9L, 0x7EAEB2FAL,
+    0x30E349B1L, 0xC288CAB2L, 0xD1D83946L, 0x23B3BA45L,
+    0xF779DEAEL, 0x05125DADL, 0x1642AE59L, 0xE4292D5AL,
+    0xBA3A117EL, 0x4851927DL, 0x5B016189L, 0xA96AE28AL,
+    0x7DA08661L, 0x8FCB0562L, 0x9C9BF696L, 0x6EF07595L,
+    0x417B1DBCL, 0xB3109EBFL, 0xA0406D4BL, 0x522BEE48L,
+    0x86E18AA3L, 0x748A09A0L, 0x67DAFA54L, 0x95B17957L,
+    0xCBA24573L, 0x39C9C670L, 0x2A993584L, 0xD8F2B687L,
+    0x0C38D26CL, 0xFE53516FL, 0xED03A29BL, 0x1F682198L,
+    0x5125DAD3L, 0xA34E59D0L, 0xB01EAA24L, 0x42752927L,
+    0x96BF4DCCL, 0x64D4CECFL, 0x77843D3BL, 0x85EFBE38L,
+    0xDBFC821CL, 0x2997011FL, 0x3AC7F2EBL, 0xC8AC71E8L,
+    0x1C661503L, 0xEE0D9600L, 0xFD5D65F4L, 0x0F36E6F7L,
+    0x61C69362L, 0x93AD1061L, 0x80FDE395L, 0x72966096L,
+    0xA65C047DL, 0x5437877EL, 0x4767748AL, 0xB50CF789L,
+    0xEB1FCBADL, 0x197448AEL, 0x0A24BB5AL, 0xF84F3859L,
+    0x2C855CB2L, 0xDEEEDFB1L, 0xCDBE2C45L, 0x3FD5AF46L,
+    0x7198540DL, 0x83F3D70EL, 0x90A324FAL, 0x62C8A7F9L,
+    0xB602C312L, 0x44694011L, 0x5739B3E5L, 0xA55230E6L,
+    0xFB410CC2L, 0x092A8FC1L, 0x1A7A7C35L, 0xE811FF36L,
+    0x3CDB9BDDL, 0xCEB018DEL, 0xDDE0EB2AL, 0x2F8B6829L,
+    0x82F63B78L, 0x709DB87BL, 0x63CD4B8FL, 0x91A6C88CL,
+    0x456CAC67L, 0xB7072F64L, 0xA457DC90L, 0x563C5F93L,
+    0x082F63B7L, 0xFA44E0B4L, 0xE9141340L, 0x1B7F9043L,
+    0xCFB5F4A8L, 0x3DDE77ABL, 0x2E8E845FL, 0xDCE5075CL,
+    0x92A8FC17L, 0x60C37F14L, 0x73938CE0L, 0x81F80FE3L,
+    0x55326B08L, 0xA759E80BL, 0xB4091BFFL, 0x466298FCL,
+    0x1871A4D8L, 0xEA1A27DBL, 0xF94AD42FL, 0x0B21572CL,
+    0xDFEB33C7L, 0x2D80B0C4L, 0x3ED04330L, 0xCCBBC033L,
+    0xA24BB5A6L, 0x502036A5L, 0x4370C551L, 0xB11B4652L,
+    0x65D122B9L, 0x97BAA1BAL, 0x84EA524EL, 0x7681D14DL,
+    0x2892ED69L, 0xDAF96E6AL, 0xC9A99D9EL, 0x3BC21E9DL,
+    0xEF087A76L, 0x1D63F975L, 0x0E330A81L, 0xFC588982L,
+    0xB21572C9L, 0x407EF1CAL, 0x532E023EL, 0xA145813DL,
+    0x758FE5D6L, 0x87E466D5L, 0x94B49521L, 0x66DF1622L,
+    0x38CC2A06L, 0xCAA7A905L, 0xD9F75AF1L, 0x2B9CD9F2L,
+    0xFF56BD19L, 0x0D3D3E1AL, 0x1E6DCDEEL, 0xEC064EEDL,
+    0xC38D26C4L, 0x31E6A5C7L, 0x22B65633L, 0xD0DDD530L,
+    0x0417B1DBL, 0xF67C32D8L, 0xE52CC12CL, 0x1747422FL,
+    0x49547E0BL, 0xBB3FFD08L, 0xA86F0EFCL, 0x5A048DFFL,
+    0x8ECEE914L, 0x7CA56A17L, 0x6FF599E3L, 0x9D9E1AE0L,
+    0xD3D3E1ABL, 0x21B862A8L, 0x32E8915CL, 0xC083125FL,
+    0x144976B4L, 0xE622F5B7L, 0xF5720643L, 0x07198540L,
+    0x590AB964L, 0xAB613A67L, 0xB831C993L, 0x4A5A4A90L,
+    0x9E902E7BL, 0x6CFBAD78L, 0x7FAB5E8CL, 0x8DC0DD8FL,
+    0xE330A81AL, 0x115B2B19L, 0x020BD8EDL, 0xF0605BEEL,
+    0x24AA3F05L, 0xD6C1BC06L, 0xC5914FF2L, 0x37FACCF1L,
+    0x69E9F0D5L, 0x9B8273D6L, 0x88D28022L, 0x7AB90321L,
+    0xAE7367CAL, 0x5C18E4C9L, 0x4F48173DL, 0xBD23943EL,
+    0xF36E6F75L, 0x0105EC76L, 0x12551F82L, 0xE03E9C81L,
+    0x34F4F86AL, 0xC69F7B69L, 0xD5CF889DL, 0x27A40B9EL,
+    0x79B737BAL, 0x8BDCB4B9L, 0x988C474DL, 0x6AE7C44EL,
+    0xBE2DA0A5L, 0x4C4623A6L, 0x5F16D052L, 0xAD7D5351L
+};
+
+
+uint32_t crc32c(uint32_t crc, const uint8_t *data, unsigned int length)
+{
+    while (length--) {
+        crc = crc32c_table[(crc ^ *data++) & 0xFFL] ^ (crc >> 8);
+    }
+    return crc^0xffffffff;
+}
+
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [Qemu-devel] [PATCH 3/7] block: vhdx header for the QEMU support of VHDX images
  2013-03-06 14:46 [Qemu-devel] [PATCH 0/7] Initial VHDX support (and a bug fix for QCOW) Jeff Cody
  2013-03-06 14:47 ` [Qemu-devel] [PATCH 1/7] block: only force IO completion in .bdrv_truncate if we are shrinking Jeff Cody
  2013-03-06 14:47 ` [Qemu-devel] [PATCH 2/7] qemu: add castagnoli crc32c checksum algorithm Jeff Cody
@ 2013-03-06 14:47 ` Jeff Cody
  2013-03-07 13:15   ` Stefan Hajnoczi
  2013-03-06 14:48 ` [Qemu-devel] [PATCH 4/7] block: initial VHDX driver support framework - supports open and probe Jeff Cody
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 49+ messages in thread
From: Jeff Cody @ 2013-03-06 14:47 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, sw, stefanha

This is based on Microsoft's VHDX specification:
    "VHDX Format Specification v0.95", published 4/12/2012
    https://www.microsoft.com/en-us/download/details.aspx?id=29681

These structures define the various header, metadata, and other
block structures defined in the VHDX specification.

Signed-off-by: Jeff Cody <jcody@redhat.com>
---
 block/vhdx.h | 326 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 326 insertions(+)
 create mode 100644 block/vhdx.h

diff --git a/block/vhdx.h b/block/vhdx.h
new file mode 100644
index 0000000..0955eef
--- /dev/null
+++ b/block/vhdx.h
@@ -0,0 +1,326 @@
+/*
+ * Block driver for Hyper-V VHDX Images
+ *
+ * Copyright (c) 2013 Red Hat, Inc.,
+ *
+ * Authors:
+ *  Jeff Cody <jcody@redhat.com>
+ *
+ *  This is based on the "VHDX Format Specification v0.95", published 4/12/2012
+ *  by Microsoft:
+ *      https://www.microsoft.com/en-us/download/details.aspx?id=29681
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2 or later.
+ * See the COPYING.LIB file in the top-level directory.
+ *
+ */
+
+#ifndef BLOCK_VHDX_H
+#define BLOCK_VHDX_H
+
+/* Structures and fields present in the VHDX file */
+
+/* The header section has the following blocks,
+ * each block is 64KB:
+ *
+ * _____________________________________________________________________________
+ * | File Id. |   Header 1    | Header 2   | Region Table |  Reserved (768KB)  |
+ * |----------|---------------|------------|--------------|--------------------|
+ * |          |               |            |              |                    |
+ * 0.........64KB...........128KB........192KB..........256KB................1MB
+ */
+
+#define VHDX_HEADER_BLOCK_SIZE      (64*1024)
+
+#define VHDX_FILE_ID_OFFSET         0
+#define VHDX_HEADER1_OFFSET         (VHDX_HEADER_BLOCK_SIZE*1)
+#define VHDX_HEADER2_OFFSET         (VHDX_HEADER_BLOCK_SIZE*2)
+#define VHDX_REGION_TABLE_OFFSET    (VHDX_HEADER_BLOCK_SIZE*3)
+
+
+/*
+ * A note on the use of MS-GUID fields.  For more details on the GUID,
+ * please see: https://en.wikipedia.org/wiki/Globally_unique_identifier.
+ *
+ * The VHDX specification only states that these are MS GUIDs, and which
+ * bytes are data1-data4. It makes no mention of what algorithm should be used
+ * to generate the GUID, nor what standard.  However, looking at the specified
+ * known GUID fields, it appears the GUIDs are:
+ *  Standard/DCE GUID type  (noted by 10b in the MSB of byte 0 of .data4)
+ *  Random algorithm        (noted by 0x4XXX for .data3)
+ */
+
+/* ---- HEADER SECTION STRUCTURES ---- */
+
+/* Important note: these structures are as defined in the VHDX specification,
+ * including byte order and size.  However, without being packed structures,
+ * they will not match 1:1 data read from disk.  Rather than use potentially
+ * non-portable packed structures, data is copied from read buffers into
+ * the structures below.  However, for reference, please refrain from
+ * modifying these structures to something that does not represent the spec */
+
+#define VHDX_FILE_ID_MAGIC 0x656C696678646876  /* 'vhdxfile' */
+typedef struct vhdx_file_identifier {
+    uint64_t    signature;              /* "vhdxfile" in ASCII */
+    uint16_t    creator[256];           /* optional; utf-16 string to identify
+                                           the vhdx file creator.  Diagnotistic
+                                           only */
+} vhdx_file_identifier;
+
+
+/* the guid is a 16 byte unique ID - the definition for this used by
+ * Microsoft is not just 16 bytes though - it is a structure that is defined,
+ * so we need to follow it here so that endianness does not trip us up */
+
+typedef struct ms_guid {
+    uint32_t    data1;
+    uint16_t    data2;
+    uint16_t    data3;
+    uint8_t     data4[8];
+} ms_guid;
+
+#define guid_eq(a, b) \
+    (memcmp(&(a), &(b), sizeof(ms_guid)) == 0)
+
+#define VHDX_HEADER_SIZE (4*1024)   /* although the vhdx_header struct in disk
+                                       is only 582 bytes, for purposes of crc
+                                       the header is the first 4KB of the 64KB
+                                       block */
+
+#define VHDX_HDR_MAGIC 0x64616568   /* 'head' */
+typedef struct QEMU_PACKED vhdx_header {
+    uint32_t    signature;              /* "head" in ASCII */
+    uint32_t    checksum;               /* CRC-32C hash of the whole header */
+    uint64_t    sequence_number;        /* Seq number of this header.  Each
+                                           VHDX file has 2 of these headers,
+                                           and only the header with the highest
+                                           sequence number is valid */
+    ms_guid     file_write_guid;       /* 128 bit unique identifier. Must be
+                                           updated to new, unique value before
+                                           the first modification is made to
+                                           file */
+    ms_guid     data_write_guid;        /* 128 bit unique identifier. Must be
+                                           updated to new, unique value before
+                                           the first modification is made to
+                                           visible data.   Visbile data is
+                                           defined as:
+                                                    - system & user metadata
+                                                    - raw block data
+                                                    - disk size
+                                                    - any change that will
+                                                      cause the virtual disk
+                                                      sector read to differ
+
+                                           This does not need to change if
+                                           blocks are re-arranged */
+    ms_guid     log_guid;               /* 128 bit unique identifier. If zero,
+                                           there is no valid log. If non-zero,
+                                           log entries with this guid are
+                                           valid. */
+    uint16_t    log_version;            /* version of the log format. Mustn't be
+                                           zero, unless log_guid is also zero */
+    uint16_t    version;                /* version of th evhdx file.  Currently,
+                                           only supported version is "1" */
+    uint32_t    log_length;             /* length of the log.  Must be multiple
+                                           of 1MB */
+    uint64_t    log_offset;             /* byte offset in the file of the log.
+                                           Must also be a multiple of 1MB */
+} vhdx_header;
+
+/* 4KB in packed data size, not to be used except for initial data read */
+typedef struct QEMU_PACKED vhdx_header_padded {
+    vhdx_header header;
+    uint8_t     reserved[502];          /* per the VHDX spec */
+    uint8_t     reserved_[3514];        /* for the initial packed struct read */
+} vhdx_header_padded;
+
+/* Header for the region table block */
+#define VHDX_RT_MAGIC 0x69676572  /* 'regi ' */
+typedef struct QEMU_PACKED vhdx_region_table_header {
+    uint32_t    signature;              /* "regi" in ASCII */
+    uint32_t    checksum;               /* CRC-32C hash of the 64KB table */
+    uint32_t    entry_count;            /* number of valid entries */
+    uint32_t    reserved;
+} vhdx_region_table_header;
+
+/* Individual region table entry.  There may be a maximum of 2047 of these
+ *
+ *  There are two known region table properties.  Both are required.
+ *  BAT (block allocation table):  2DC27766F62342009D64115E9BFD4A08
+ *  Metadata:                      8B7CA20647904B9AB8FE575F050F886E
+ */
+#define VHDX_REGION_ENTRY_REQUIRED  0x01    /* if set, parser must understand
+                                               this entry in order to open
+                                               file */
+typedef struct QEMU_PACKED vhdx_region_table_entry {
+    ms_guid     guid;                   /* 128-bit unique identifier */
+    uint64_t    file_offset;            /* offset of the object in the file.
+                                           Must be multiple of 1MB */
+    uint32_t    length;                 /* length, in bytes, of the object */
+    uint32_t    data_bits;
+} vhdx_region_table_entry;
+
+
+/* ---- LOG ENTRY STRUCTURES ---- */
+
+#define VHDX_LOGE_MAGIC 0x65676F6C /* 'loge' */
+typedef struct QEMU_PACKED vhdx_log_entry_header {
+    uint32_t    signature;              /* "loge" in ASCII */
+    uint32_t    checksum;               /* CRC-32C hash of the 64KB table */
+    uint32_t    entry_length;           /* length in bytes, multiple of 1MB */
+    uint32_t    tail;                   /* byte offset of first log entry of a
+                                           seq, where this entry is the last
+                                           entry */
+    uint64_t    sequence_number;        /* incremented with each log entry.
+                                           May not be zero. */
+    uint32_t    descriptor_count;       /* number of descriptors in this log
+                                           entry, must be >= 0 */
+    uint32_t    reserved;
+    ms_guid     log_guid;               /* value of the log_guid from
+                                           vhdx_header.  If not found in
+                                           vhdx_header, it is invalid */
+    uint64_t    flushed_file_offset;    /* see spec for full details - this
+                                           sould be vhdx file size in bytes */
+    uint64_t    last_file_offset;       /* size in bytes that all allocated
+                                           file structures fit into */
+} vhdx_log_entry_header;
+
+#define VHDX_ZERO_MGIC 0x6F72657A /* 'zero' */
+typedef struct QEMU_PACKED vhdx_log_zero_descriptor {
+    uint32_t    zero_signature;         /* "zero" in ASCII */
+    uint32_t    reserver;
+    uint64_t    zero_length;            /* length of the section to zero */
+    uint64_t    file_offset;            /* file offset to write zeros - multiple
+                                           of 4kB */
+    uint64_t    sequence_number;        /* must match same field in
+                                           vhdx_log_entry_header */
+} vhdx_log_zero_descriptor;
+
+#define VHDX_DATA_MAGIC 0x63736564 /* 'desc' */
+typedef struct QEMU_PACKED vhdx_log_data_descriptor {
+    uint32_t    data_signature;         /* "desc" in ASCII */
+    uint32_t    trailing_bytes;         /* bytes 4092-4096 of the data sector */
+    uint64_t    leading_bytes;          /* bytes 0-7 of the data sector */
+    uint64_t    file_offset;            /* file offset where the data described
+                                           herein is written */
+    uint64_t    sequence_number;        /* must match the sequence number field
+                                           in entry header */
+} vhdx_log_data_descriptor;
+
+#define VHDX_DATAS_MAGIC 0x61746164 /* 'data' */
+typedef struct QEMU_PACKED vhdx_log_data_sector {
+    uint32_t    data_signature;         /* "data" in ASCII */
+    uint32_t    sequence_high;          /* 4 MSB of 8 byte sequence_number */
+    uint8_t     data[4084];             /* raw data, bytes 8-4091 (inclusive).
+                                           see the data descriptor field for the
+                                           other mising bytes */
+    uint32_t    sequence_low;           /* 4 LSB of 8 byte sequence_number */
+} vhdx_log_data_sector;
+
+
+
+/* block states - different state values depending on whether it is a
+ * payload block, or a sector block. */
+
+#define PAYLOAD_BLOCK_NOT_PRESENT       0
+#define PAYLOAD_BLOCK_UNDEFINED         1
+#define PAYLOAD_BLOCK_ZERO              2
+#define PAYLOAD_BLOCK_UNMAPPED          5
+#define PAYLOAD_BLOCK_FULL_PRESENT      6
+#define PAYLOAD_BLOCK_PARTIALLY_PRESENT 7
+
+#define SB_BLOCK_NOT_PRESENT    0
+#define SB_BLOCK_PRESENT        6
+
+/* per the spec */
+#define VHDX_MAX_SECTORS_PER_BLOCK  (1<<23)
+
+/* upper 44 bits are the file offset in 1MB units lower 3 bits are the state
+   other bits are reserved */
+#define VHDX_BAT_STATE_BIT_MASK 0x07
+#define VHDX_BAT_FILE_OFF_BITS (64-44)
+typedef uint64_t vhdx_bat_entry;
+
+/* ---- METADATA REGION STRUCTURES ---- */
+
+#define VHDX_METADATA_ENTRY_SIZE 32
+#define VHDX_METADATA_MAX_ENTRIES 2047  /* not including the header */
+#define VHDX_METADATA_TABLE_MAX_SIZE \
+    (VHDX_METADATA_ENTRY_SIZE * (VHDX_METADATA_MAX_ENTRIES+1))
+#define VHDX_METADATA_MAGIC 0x617461646174656D /* 'metadata' */
+typedef struct QEMU_PACKED vhdx_metadata_table_header {
+    uint64_t    signature;              /* "metadata" in ASCII */
+    uint16_t    reserved;
+    uint16_t    entry_count;            /* number table entries. <= 2047 */
+    uint32_t    reserved2[5];
+} vhdx_metadata_table_header;
+
+#define VHDX_META_FLAGS_IS_USER         0x01    /* max 1024 entries */
+#define VHDX_META_FLAGS_IS_VIRTUAL_DISK 0x02    /* virtual disk metadata if set,
+                                                   otherwise file metdata */
+#define VHDX_META_FLAGS_IS_REQUIRED     0x04    /* parse must understand this
+                                                   entry to open the file */
+typedef struct QEMU_PACKED vhdx_metadata_table_entry {
+    ms_guid     item_id;                /* 128-bit identifier for metadata */
+    uint32_t    offset;                 /* byte offset of the metadata.  At
+                                           least 64kB.  Relative to start of
+                                           metadata region */
+                                        /* note: if length = 0, so is offset */
+    uint32_t    length;                 /* length of metadata. <= 1MB. */
+    uint32_t    data_bits;      /* least-significant 3 bits are flags, the
+                                   rest are reserved (see above) */
+    uint32_t    reserved2;
+} vhdx_metadata_table_entry;
+
+#define VHDX_PARAMS_LEAVE_BLOCKS_ALLOCED 0x01   /* Do not change any blocks to
+                                                   be BLOCK_NOT_PRESENT.
+                                                   If set indicates a fixed
+                                                   size VHDX file */
+#define VHDX_PARAMS_HAS_PARENT           0x02    /* has parent / backing file */
+typedef struct QEMU_PACKED vhdx_file_parameters {
+    uint32_t    block_size;             /* size of each payload block, always
+                                           power of 2, <= 256MB and >= 1MB. */
+    uint32_t data_bits;     /* least-significant 2 bits are flags, the rest
+                               are reserved (see above) */
+} vhdx_file_parameters;
+
+typedef struct QEMU_PACKED vhdx_virtual_disk_size {
+    uint64_t    virtual_disk_size;      /* Size of the virtual disk, in bytes.
+                                           Must be multiple of the sector size,
+                                           max of 64TB */
+} vhdx_virtual_disk_size;
+
+typedef struct QEMU_PACKED vhdx_page83_data {
+    uint8_t     page_83_data[16];       /* unique id for scsi devices that
+                                           support page 0x83 */
+} vhdx_page83_data;
+
+typedef struct QEMU_PACKED vhdx_virtual_disk_logical_sector_size {
+    uint32_t    logical_sector_size;    /* virtual disk sector size (in bytes).
+                                           Can only be 512 or 4096 bytes */
+} vhdx_virtual_disk_logical_sector_size;
+
+typedef struct QEMU_PACKED vhdx_virtual_disk_physical_sector_size {
+    uint32_t    physical_sector_size;   /* physical sector size (in bytes).
+                                           Can only be 512 or 4096 bytes */
+} vhdx_virtual_disk_physical_sector_size;
+
+typedef struct QEMU_PACKED vhdx_parent_locator_header {
+    uint8_t     locator_type[16];       /* type of the parent virtual disk. */
+    uint16_t    reserved;
+    uint16_t    key_value_count;        /* number of key/value pairs for this
+                                           locator */
+} vhdx_parent_locator_header;
+
+/* key and value strings are UNICODE strings, UTF-16 LE encoding, no NULs */
+typedef struct QEMU_PACKED vhdx_parent_locator_entry {
+    uint32_t    key_offset;             /* offset in metadata for key, > 0 */
+    uint32_t    value_offset;           /* offset in metadata for value, >0 */
+    uint16_t    key_length;             /* length of entry key, > 0 */
+    uint16_t    value_length;           /* length of entry value, > 0 */
+} vhdx_parent_locator_entry;
+
+
+/* ----- END VHDX SPECIFICATION STRUCTURES ---- */
+
+#endif
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [Qemu-devel] [PATCH 4/7] block: initial VHDX driver support framework - supports open and probe
  2013-03-06 14:46 [Qemu-devel] [PATCH 0/7] Initial VHDX support (and a bug fix for QCOW) Jeff Cody
                   ` (2 preceding siblings ...)
  2013-03-06 14:47 ` [Qemu-devel] [PATCH 3/7] block: vhdx header for the QEMU support of VHDX images Jeff Cody
@ 2013-03-06 14:48 ` Jeff Cody
  2013-03-07 14:30   ` Stefan Hajnoczi
  2013-03-06 14:48 ` [Qemu-devel] [PATCH 5/7] block: add read-only support to VHDX image format Jeff Cody
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 49+ messages in thread
From: Jeff Cody @ 2013-03-06 14:48 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, sw, stefanha

This is the initial block driver framework for VHDX image support
(i.e. Hyper-V image file formats), that supports opening VHDX files, and
parsing the headers.

This commit does not yet enable:
    - reading
    - writing
    - updating the header
    - differencing files (images with parents)
    - log replay / dirty logs (only clean images)

This is based on Microsoft's VHDX specification:
    "VHDX Format Specification v0.95", published 4/12/2012
    https://www.microsoft.com/en-us/download/details.aspx?id=29681

Signed-off-by: Jeff Cody <jcody@redhat.com>
---
 block/Makefile.objs |   1 +
 block/vhdx.c        | 717 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 718 insertions(+)
 create mode 100644 block/vhdx.c

diff --git a/block/Makefile.objs b/block/Makefile.objs
index c067f38..a819577 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -2,6 +2,7 @@ block-obj-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o vvfat
 block-obj-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o qcow2-cache.o
 block-obj-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o
 block-obj-y += qed-check.o
+block-obj-y += vhdx.o
 block-obj-y += parallels.o blkdebug.o blkverify.o
 block-obj-$(CONFIG_WIN32) += raw-win32.o win32-aio.o
 block-obj-$(CONFIG_POSIX) += raw-posix.o
diff --git a/block/vhdx.c b/block/vhdx.c
new file mode 100644
index 0000000..7c9cd99
--- /dev/null
+++ b/block/vhdx.c
@@ -0,0 +1,717 @@
+/*
+ * Block driver for Hyper-V VHDX Images
+ *
+ * Copyright (c) 2013 Red Hat, Inc.,
+ *
+ * Authors:
+ *  Jeff Cody <jcody@redhat.com>
+ *
+ *  This is based on the "VHDX Format Specification v0.95", published 4/12/2012
+ *  by Microsoft:
+ *      https://www.microsoft.com/en-us/download/details.aspx?id=29681
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2 or later.
+ * See the COPYING.LIB file in the top-level directory.
+ *
+ */
+
+#include "qemu-common.h"
+#include "block/block_int.h"
+#include "qemu/module.h"
+#include "qemu/crc32c.h"
+#include "block/vhdx.h"
+
+
+#define leguid_to_cpus(guid) do { \
+    le32_to_cpus(&(guid)->data1); \
+    le16_to_cpus(&(guid)->data2); \
+    le16_to_cpus(&(guid)->data3); } while (0)
+
+/* Several metadata and region table data entries are identified by
+ * guids in  a MS-specific GUID format. */
+
+
+/* ------- Known Region Table GUIDs ---------------------- */
+static const ms_guid bat_guid =      { .data1 = 0x2dc27766,
+                                       .data2 = 0xf623,
+                                       .data3 = 0x4200,
+                                       .data4 = { 0x9d, 0x64, 0x11, 0x5e,
+                                                  0x9b, 0xfd, 0x4a, 0x08} };
+
+static const ms_guid metadata_guid = { .data1 = 0x8b7ca206,
+                                       .data2 = 0x4790,
+                                       .data3 = 0x4b9a,
+                                       .data4 = { 0xb8, 0xfe, 0x57, 0x5f,
+                                                  0x05, 0x0f, 0x88, 0x6e} };
+
+
+
+/* ------- Known Metadata Entry GUIDs ---------------------- */
+static const ms_guid file_param_guid =   { .data1 = 0xcaa16737,
+                                           .data2 = 0xfa36,
+                                           .data3 = 0x4d43,
+                                           .data4 = { 0xb3, 0xb6, 0x33, 0xf0,
+                                                      0xaa, 0x44, 0xe7, 0x6b} };
+
+static const ms_guid virtual_size_guid = { .data1 = 0x2FA54224,
+                                           .data2 = 0xcd1b,
+                                           .data3 = 0x4876,
+                                           .data4 = { 0xb2, 0x11, 0x5d, 0xbe,
+                                                      0xd8, 0x3b, 0xf4, 0xb8} };
+
+static const ms_guid page83_guid =       { .data1 = 0xbeca12ab,
+                                           .data2 = 0xb2e6,
+                                           .data3 = 0x4523,
+                                           .data4 = { 0x93, 0xef, 0xc3, 0x09,
+                                                      0xe0, 0x00, 0xc7, 0x46} };
+
+static const ms_guid logical_sector_guid = {.data1 = 0x8141bf1d,
+                                            .data2 = 0xa96f,
+                                            .data3 = 0x4709,
+                                           .data4 = { 0xba, 0x47, 0xf2, 0x33,
+                                                      0xa8, 0xfa, 0xab, 0x5f} };
+
+static const ms_guid phys_sector_guid =  { .data1 = 0xcda348c7,
+                                           .data2 = 0x445d,
+                                           .data3 = 0x4471,
+                                           .data4 = { 0x9c, 0xc9, 0xe9, 0x88,
+                                                      0x52, 0x51, 0xc5, 0x56} };
+
+static const ms_guid parent_locator_guid = {.data1 = 0xa8d35f2d,
+                                            .data2 = 0xb30b,
+                                            .data3 = 0x454d,
+                                           .data4 = { 0xab, 0xf7, 0xd3, 0xd8,
+                                                      0x48, 0x34, 0xab, 0x0c} };
+
+/* Each parent type must have a valid GUID; this is for parent images
+ * of type 'VHDX'.  If we were to allow e.g. a QCOW2 parent, we would
+ * need to make up our own QCOW2 GUID type */
+static const ms_guid parent_vhdx_guid = { .data1 = 0xb04aefb7,
+                                          .data2 = 0xd19e,
+                                          .data3 = 0x4a81,
+                                          .data4 = { 0xb7, 0x89, 0x25, 0xb8,
+                                                     0xe9, 0x44, 0x59, 0x13} };
+
+
+#define META_FILE_PARAMETER_PRESENT      0x01
+#define META_VIRTUAL_DISK_SIZE_PRESENT   0x02
+#define META_PAGE_83_PRESENT             0x04
+#define META_LOGICAL_SECTOR_SIZE_PRESENT 0x08
+#define META_PHYS_SECTOR_SIZE_PRESENT    0x10
+#define META_PARENT_LOCATOR_PRESENT      0x20
+
+#define META_ALL_PRESENT    \
+    (META_FILE_PARAMETER_PRESENT | META_VIRTUAL_DISK_SIZE_PRESENT | \
+     META_PAGE_83_PRESENT | META_LOGICAL_SECTOR_SIZE_PRESENT | \
+     META_PHYS_SECTOR_SIZE_PRESENT)
+
+typedef struct vhdx_metadata_entries {
+    vhdx_metadata_table_entry file_parameters_entry;
+    vhdx_metadata_table_entry virtual_disk_size_entry;
+    vhdx_metadata_table_entry page83_data_entry;
+    vhdx_metadata_table_entry logical_sector_size_entry;
+    vhdx_metadata_table_entry phys_sector_size_entry;
+    vhdx_metadata_table_entry parent_locator_entry;
+    uint16_t present;
+} vhdx_metadata_entries;
+
+
+typedef struct BDRVVHDXState {
+    CoMutex lock;
+
+    int curr_header;
+    vhdx_header *headers[2];
+
+    vhdx_region_table_header rt;
+    vhdx_region_table_entry bat_rt;         /* region table for the BAT */
+    vhdx_region_table_entry metadata_rt;    /* region table for the metadata */
+    vhdx_region_table_entry *unknown_rt;
+    unsigned int unknown_rt_size;
+
+    vhdx_metadata_table_header  metadata_hdr;
+    vhdx_metadata_entries metadata_entries;
+
+    vhdx_file_parameters params;
+    uint32_t block_size;
+    uint32_t block_size_bits;
+    uint32_t sectors_per_block;
+    uint32_t sectors_per_block_bits;
+
+    uint64_t virtual_disk_size;
+    uint32_t logical_sector_size;
+    uint32_t physical_sector_size;
+
+    uint64_t chunk_ratio;
+    uint32_t chunk_ratio_bits;
+    uint32_t logical_sector_size_bits;
+
+    uint32_t bat_entries;
+    vhdx_bat_entry *bat;
+    uint64_t bat_offset;
+
+    vhdx_parent_locator_header parent_header;
+    vhdx_parent_locator_entry *parent_entries;
+
+} BDRVVHDXState;
+
+/* Validates the checksum of the buffer, with an in-place CRC.
+ *
+ * Zero is substituted during crc calculation for the original crc field,
+ * and the crc field is restored afterwards.  But the buffer will be modifed
+ * during the calculation, so this may not be not suitable for multi-threaded
+ * use.
+ *
+ * crc_offset: byte offset in buf of the buffer crc
+ * buf: buffer pointer
+ * size: size of buffer (must be > crc_offset+4)
+ *
+ * returns true if checksum is valid, false otherwise
+ */
+static bool vhdx_checksum_is_valid(uint8_t *buf, size_t size, int crc_offset)
+{
+    uint32_t crc_orig;
+    uint32_t crc;
+
+    assert(buf != NULL);
+    assert(size > (crc_offset+4));
+
+    memcpy(&crc_orig, buf+crc_offset, sizeof(crc_orig));
+    memset(buf+crc_offset, 0, sizeof(crc_orig));
+
+    crc = crc32c(0xffffffff, buf, size);
+
+    memcpy(buf+crc_offset, &crc_orig, sizeof(crc_orig));
+
+    crc_orig = le32_to_cpu(crc_orig);
+    return crc == crc_orig;
+}
+
+/*
+ * Per the MS VHDX Specification, for every VHDX file:
+ *      - The header section is fixed size - 1 MB
+ *      - The header section is always the first "object"
+ *      - The first 64KB of the header is the File Identifier
+ *      - The first uint64 (8 bytes) is the VHDX Signature ("vhdxfile")
+ *      - The following 512 bytes constitute a UTF-16 string identifiying the
+ *        software that created the file, and is optional and diagnostic only.
+ *
+ *  Therefore, we probe by looking for the vhdxfile signature "vhdxfile"
+ */
+static int vhdx_probe(const uint8_t *buf, int buf_size, const char *filename)
+{
+    if (buf_size >= 8 && !strncmp((char *)buf, "vhdxfile", 8)) {
+        return 100;
+    }
+    return 0;
+}
+
+
+/* All VHDX structures on disk are little endian */
+static void vhdx_header_le_import(vhdx_header *h)
+{
+    assert(h != NULL);
+
+    le32_to_cpus(&h->signature);
+    le32_to_cpus(&h->checksum);
+    le64_to_cpus(&h->sequence_number);
+
+    leguid_to_cpus(&h->file_write_guid);
+    leguid_to_cpus(&h->data_write_guid);
+    leguid_to_cpus(&h->log_guid);
+
+    le16_to_cpus(&h->log_version);
+    le16_to_cpus(&h->version);
+    le32_to_cpus(&h->log_length);
+    le64_to_cpus(&h->log_offset);
+}
+
+
+/* opens the specified header block from the VHDX file header section */
+static int vhdx_parse_header(BlockDriverState *bs, BDRVVHDXState *s)
+{
+    int ret = 0;
+    vhdx_header *header1;
+    vhdx_header *header2;
+    uint64_t h1_seq = 0;
+    uint64_t h2_seq = 0;
+    uint8_t *buffer;
+
+    header1 = g_malloc(sizeof(vhdx_header));
+    header2 = g_malloc(sizeof(vhdx_header));
+
+    buffer = g_malloc(VHDX_HEADER_SIZE);
+
+    s->headers[0] = header1;
+    s->headers[1] = header2;
+
+    /* We have to read the whole VHDX_HEADER_SIZE instead of
+     * sizeof(vhdx_header), because the checksum is over the whole
+     * region */
+    ret = bdrv_pread(bs->file, VHDX_HEADER1_OFFSET, buffer, VHDX_HEADER_SIZE);
+    if (ret < 0) {
+        goto fail;
+    }
+    /* copy over just the relevant portion that we need */
+    memcpy(header1, buffer, sizeof(vhdx_header));
+    vhdx_header_le_import(header1);
+
+    if (vhdx_checksum_is_valid(buffer, VHDX_HEADER_SIZE, 4) &&
+        header1->signature == VHDX_HDR_MAGIC) {
+        h1_seq = header1->sequence_number;
+    }
+
+    ret = bdrv_pread(bs->file, VHDX_HEADER2_OFFSET, buffer, VHDX_HEADER_SIZE);
+    if (ret < 0) {
+        goto fail;
+    }
+    /* copy over just the relevant portion that we need */
+    memcpy(header2, buffer, sizeof(vhdx_header));
+    vhdx_header_le_import(header2);
+
+    if (vhdx_checksum_is_valid(buffer, VHDX_HEADER_SIZE, 4) &&
+        header2->signature == VHDX_HDR_MAGIC) {
+        h2_seq = header2->sequence_number;
+    }
+
+    if (h1_seq > h2_seq) {
+        s->curr_header = 0;
+    } else if (h2_seq > h1_seq) {
+        s->curr_header = 1;
+    } else {
+        printf("NO VALID HEADER\n");
+        ret = -EINVAL;
+        goto fail;
+    }
+
+    ret = 0;
+
+    goto exit;
+
+fail:
+    g_free(header1);
+    g_free(header2);
+    s->headers[0] = NULL;
+    s->headers[1] = NULL;
+exit:
+    g_free(buffer);
+    return ret;
+}
+
+/* Parse the replay log.  Per the VHDX spec, if the log is present
+ * it must be replayed prior to opening the file, even read-only.
+ *
+ * If read-only, we must replay the log in RAM (or refuse to open
+ * a dirty VHDX file read-only */
+static int vhdx_parse_log(BlockDriverState *bs, BDRVVHDXState *s)
+{
+    int ret = 0;
+    int i;
+    vhdx_header *hdr;
+
+    hdr = s->headers[s->curr_header];
+
+    /* either either the log guid, or log length is zero,
+     * then a replay log is present */
+    for (i = 0; i < sizeof(hdr->log_guid.data4); i++) {
+        ret |= hdr->log_guid.data4[i];
+    }
+    if (hdr->log_guid.data1 == 0 &&
+        hdr->log_guid.data2 == 0 &&
+        hdr->log_guid.data3 == 0 &&
+        ret == 0) {
+        goto exit;
+    }
+
+    if (hdr->log_length == 0) {
+        goto exit;
+    }
+
+    /* there is a log present, but we don't support that yet */
+    ret = -ENOTSUP;
+
+exit:
+    return ret;
+}
+
+static int vhdx_open_region_tables(BlockDriverState *bs, BDRVVHDXState *s)
+{
+    int ret = 0;
+    uint8_t *buffer;
+    int offset = 0;
+    vhdx_region_table_entry rt_entry;
+    int i;
+
+    /* We have to read the whole 64KB block, because the crc32 is over the
+     * whole block */
+    buffer = g_malloc(VHDX_HEADER_BLOCK_SIZE);
+
+    ret = bdrv_pread(bs->file, VHDX_REGION_TABLE_OFFSET, buffer,
+                    VHDX_HEADER_BLOCK_SIZE);
+    if (ret < 0) {
+        goto fail;
+    }
+    memcpy(&s->rt, buffer, sizeof(s->rt));
+    le32_to_cpus(&s->rt.signature);
+    le32_to_cpus(&s->rt.checksum);
+    le32_to_cpus(&s->rt.entry_count);
+    le32_to_cpus(&s->rt.reserved);
+    offset += sizeof(s->rt);
+
+    if (!vhdx_checksum_is_valid(buffer, VHDX_HEADER_BLOCK_SIZE, 4) ||
+        s->rt.signature != VHDX_RT_MAGIC) {
+        ret = -EINVAL;
+        goto fail;
+    }
+
+    for (i = 0; i < s->rt.entry_count; i++) {
+        memcpy(&rt_entry, buffer+offset, sizeof(rt_entry));
+        offset += sizeof(rt_entry);
+
+        leguid_to_cpus(&rt_entry.guid);
+        le64_to_cpus(&rt_entry.file_offset);
+        le32_to_cpus(&rt_entry.length);
+        le32_to_cpus(&rt_entry.data_bits);
+
+        /* see if we recognize the entry */
+        if (guid_eq(rt_entry.guid, bat_guid)) {
+            s->bat_rt = rt_entry;
+            continue;
+        }
+
+        if (guid_eq(rt_entry.guid, metadata_guid)) {
+            s->metadata_rt = rt_entry;
+            continue;
+        }
+
+        if (rt_entry.data_bits & VHDX_REGION_ENTRY_REQUIRED) {
+            /* cannot read vhdx file - required region table entry that
+             * we do not understand.  per spec, we must fail to open */
+            ret = -ENOTSUP;
+            goto fail;
+        }
+    }
+    ret = 0;
+
+fail:
+    g_free(buffer);
+    return ret;
+}
+
+
+
+/* Metadata initial parser
+ *
+ * This loads all the metadata entry fields.  This may cause additional
+ * fields to be processed (e.g. parent locator, etc..).
+ *
+ * There are 5 Metadata items that are always required:
+ *      - File Parameters (block size, has a parent)
+ *      - Virtual Disk Size (size, in bytes, of the virtual drive)
+ *      - Page 83 Data (scsi page 83 guid)
+ *      - Logical Sector Size (logical sector size in bytes, either 512 or
+ *                             4096.  We only support 512 currently)
+ *      - Physical Sector Size (512 or 4096)
+ *
+ * Also, if the File Parameters indicate this is a differencing file,
+ * we must also look for the Parent Locator metadata item.
+ */
+static int vhdx_parse_metadata(BlockDriverState *bs, BDRVVHDXState *s)
+{
+    int ret = 0;
+    uint8_t *buffer;
+    int offset = 0;
+    int i = 0;
+    uint32_t block_size, sectors_per_block, logical_sector_size;
+    uint64_t chunk_ratio;
+    vhdx_metadata_table_entry md_entry;
+
+    buffer = g_malloc(VHDX_METADATA_TABLE_MAX_SIZE);
+
+    ret = bdrv_pread(bs->file, s->metadata_rt.file_offset, buffer,
+                     VHDX_METADATA_TABLE_MAX_SIZE);
+    if (ret < 0) {
+        goto fail_no_free;
+    }
+    memcpy(&s->metadata_hdr, buffer, sizeof(s->metadata_hdr));
+    offset += sizeof(s->metadata_hdr);
+
+    le64_to_cpus(&s->metadata_hdr.signature);
+    le16_to_cpus(&s->metadata_hdr.reserved);
+    le16_to_cpus(&s->metadata_hdr.entry_count);
+
+    if (s->metadata_hdr.signature != VHDX_METADATA_MAGIC) {
+        ret = -EINVAL;
+        goto fail_no_free;
+    }
+
+    s->metadata_entries.present = 0;
+
+    for (i = 0; i < s->metadata_hdr.entry_count; i++) {
+        memcpy(&md_entry, buffer+offset, sizeof(md_entry));
+        offset += sizeof(md_entry);
+
+        leguid_to_cpus(&md_entry.item_id);
+        le32_to_cpus(&md_entry.offset);
+        le32_to_cpus(&md_entry.length);
+        le32_to_cpus(&md_entry.data_bits);
+        le32_to_cpus(&md_entry.reserved2);
+
+        if (guid_eq(md_entry.item_id, file_param_guid)) {
+            s->metadata_entries.file_parameters_entry = md_entry;
+            s->metadata_entries.present |= META_FILE_PARAMETER_PRESENT;
+            continue;
+        }
+
+        if (guid_eq(md_entry.item_id, virtual_size_guid)) {
+            s->metadata_entries.virtual_disk_size_entry = md_entry;
+            s->metadata_entries.present |= META_VIRTUAL_DISK_SIZE_PRESENT;
+            continue;
+        }
+
+        if (guid_eq(md_entry.item_id, page83_guid)) {
+            s->metadata_entries.page83_data_entry = md_entry;
+            s->metadata_entries.present |= META_PAGE_83_PRESENT;
+            continue;
+        }
+
+        if (guid_eq(md_entry.item_id, logical_sector_guid)) {
+            s->metadata_entries.logical_sector_size_entry = md_entry;
+            s->metadata_entries.present |= META_LOGICAL_SECTOR_SIZE_PRESENT;
+            continue;
+        }
+
+        if (guid_eq(md_entry.item_id, phys_sector_guid)) {
+            s->metadata_entries.phys_sector_size_entry = md_entry;
+            s->metadata_entries.present |= META_PHYS_SECTOR_SIZE_PRESENT;
+            continue;
+        }
+
+        if (guid_eq(md_entry.item_id, parent_locator_guid)) {
+            s->metadata_entries.parent_locator_entry = md_entry;
+            s->metadata_entries.present |= META_PARENT_LOCATOR_PRESENT;
+            continue;
+        }
+
+        if (md_entry.data_bits & VHDX_META_FLAGS_IS_REQUIRED) {
+            /* cannot read vhdx file - required region table entry that
+             * we do not understand.  per spec, we must fail to open */
+            ret = -ENOTSUP;
+            goto exit;
+        }
+    }
+
+    if (s->metadata_entries.present != META_ALL_PRESENT) {
+        ret = -ENOTSUP;
+        goto exit;
+    }
+
+    ret = bdrv_pread(bs->file,
+                     s->metadata_entries.file_parameters_entry.offset
+                                         + s->metadata_rt.file_offset,
+                     &s->params,
+                     sizeof(s->params));
+
+    le32_to_cpus(&s->params.block_size);
+    le32_to_cpus(&s->params.data_bits);
+
+
+    /* We now have the file parameters, so we can tell if this is a
+     * differencing file (i.e.. has_parent), is dynamic or fixed
+     * sized (leave_blocks_allocated), and the block size */
+
+    /* The parent locator required iff the file parameters has_parent set */
+    if (s->params.data_bits & VHDX_PARAMS_HAS_PARENT) {
+        if (s->metadata_entries.present & ~META_PARENT_LOCATOR_PRESENT) {
+            /* TODO: parse  parent locator fields */
+            ret = -ENOTSUP; /* temp, until differencing files are supported */
+            goto exit;
+        } else {
+            /* if has_parent is set, but there is not parent locator present,
+             * then that is an invalid combination */
+            ret = -EINVAL;
+            goto exit;
+        }
+    }
+
+    /* determine virtual disk size, logical sector size,
+     * and phys sector size */
+
+    ret = bdrv_pread(bs->file,
+                     s->metadata_entries.virtual_disk_size_entry.offset
+                                           + s->metadata_rt.file_offset,
+                     &s->virtual_disk_size,
+                     sizeof(uint64_t));
+    if (ret < 0) {
+        goto exit;
+    }
+    ret = bdrv_pread(bs->file,
+                     s->metadata_entries.logical_sector_size_entry.offset
+                                             + s->metadata_rt.file_offset,
+                     &s->logical_sector_size,
+                     sizeof(uint32_t));
+    if (ret < 0) {
+        goto exit;
+    }
+    ret = bdrv_pread(bs->file,
+                     s->metadata_entries.phys_sector_size_entry.offset
+                                          + s->metadata_rt.file_offset,
+                     &s->physical_sector_size,
+                     sizeof(uint32_t));
+    if (ret < 0) {
+        goto exit;
+    }
+
+    le64_to_cpus(&s->virtual_disk_size);
+    le32_to_cpus(&s->logical_sector_size);
+    le32_to_cpus(&s->physical_sector_size);
+
+    /* both block_size and sector_size are guaranteed powers of 2 */
+    s->sectors_per_block = s->params.block_size / s->logical_sector_size;
+    s->chunk_ratio = (VHDX_MAX_SECTORS_PER_BLOCK) *
+                     (uint64_t)s->logical_sector_size /
+                     (uint64_t)s->params.block_size;
+
+
+    /* These values are ones we will want to use for division / multiplication
+     * later on, and they are all guaranteed (per the spec) to be powers of 2,
+     * so we can take advantage of that for shift operations during
+     * reads/writes */
+    logical_sector_size = s->logical_sector_size;
+    while (logical_sector_size >>= 1) {
+        s->logical_sector_size_bits++;
+    }
+    sectors_per_block = s->sectors_per_block;
+    while (sectors_per_block >>= 1) {
+        s->sectors_per_block_bits++;
+    }
+    chunk_ratio = s->chunk_ratio;
+    while (chunk_ratio >>= 1) {
+        s->chunk_ratio_bits++;
+    }
+    block_size = s->params.block_size;
+    while (block_size >>= 1) {
+        s->block_size_bits++;
+    }
+
+    if (s->logical_sector_size != BDRV_SECTOR_SIZE) {
+        printf("VHDX error - QEMU only supports 512 byte sector sizes\n");
+        ret = -ENOTSUP;
+        goto exit;
+    }
+
+    ret = 0;
+
+exit:
+    g_free(buffer);
+fail_no_free:
+    return ret;
+}
+
+static int vhdx_open(BlockDriverState *bs, int flags)
+{
+    BDRVVHDXState *s = bs->opaque;
+    int ret = 0;
+    int i;
+
+    s->bat = NULL;
+
+    qemu_co_mutex_init(&s->lock);
+
+    ret = vhdx_parse_header(bs, s);
+    if (ret) {
+        goto fail;
+    }
+
+    ret = vhdx_parse_log(bs, s);
+    if (ret) {
+        goto fail;
+    }
+
+    ret = vhdx_open_region_tables(bs, s);
+    if (ret) {
+        goto fail;
+    }
+
+    ret = vhdx_parse_metadata(bs, s);
+    if (ret) {
+        goto fail;
+    }
+    s->block_size = s->params.block_size;
+
+    /* the VHDX spec dictates that virtual_disk_size is always a multiple of
+     * logical_sector_size */
+    bs->total_sectors = s->virtual_disk_size / s->logical_sector_size;
+
+    s->bat_offset = s->bat_rt.file_offset;
+    s->bat_entries = s->bat_rt.length / sizeof(vhdx_bat_entry);
+    s->bat = g_malloc(s->bat_rt.length);
+
+    ret = bdrv_pread(bs->file, s->bat_offset, s->bat, s->bat_rt.length);
+
+    for (i = 0; i < s->bat_entries; i++) {
+        le64_to_cpus(&s->bat[i]);
+    }
+
+    if (flags & BDRV_O_RDWR) {
+        ret = -ENOTSUP;
+        goto fail;
+    }
+
+    /* TODO: differencing files, read, write */
+
+    return 0;
+fail:
+    g_free(s->bat);
+    return ret;
+}
+
+static int vhdx_reopen_prepare(BDRVReopenState *state,
+                               BlockReopenQueue *queue, Error **errp)
+{
+    return 0;
+}
+
+
+static coroutine_fn int vhdx_co_readv(BlockDriverState *bs, int64_t sector_num,
+                                      int nb_sectors, QEMUIOVector *qiov)
+{
+    return -ENOTSUP;
+}
+
+
+
+static coroutine_fn int vhdx_co_writev(BlockDriverState *bs, int64_t sector_num,
+                                      int nb_sectors, QEMUIOVector *qiov)
+{
+    return -ENOTSUP;
+}
+
+
+static void vhdx_close(BlockDriverState *bs)
+{
+    BDRVVHDXState *s = bs->opaque;
+
+    g_free(s->headers[0]);
+    g_free(s->headers[1]);
+    g_free(s->bat);
+    g_free(s->parent_entries);
+}
+
+static BlockDriver bdrv_vhdx = {
+    .format_name    = "vhdx",
+    .instance_size  = sizeof(BDRVVHDXState),
+
+    .bdrv_probe             = vhdx_probe,
+    .bdrv_open              = vhdx_open,
+    .bdrv_close             = vhdx_close,
+    .bdrv_reopen_prepare    = vhdx_reopen_prepare,
+    .bdrv_co_readv          = vhdx_co_readv,
+    .bdrv_co_writev         = vhdx_co_writev,
+};
+
+static void bdrv_vhdx_init(void)
+{
+    bdrv_register(&bdrv_vhdx);
+}
+
+block_init(bdrv_vhdx_init);
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [Qemu-devel] [PATCH 5/7] block: add read-only support to VHDX image format.
  2013-03-06 14:46 [Qemu-devel] [PATCH 0/7] Initial VHDX support (and a bug fix for QCOW) Jeff Cody
                   ` (3 preceding siblings ...)
  2013-03-06 14:48 ` [Qemu-devel] [PATCH 4/7] block: initial VHDX driver support framework - supports open and probe Jeff Cody
@ 2013-03-06 14:48 ` Jeff Cody
  2013-03-06 14:48 ` [Qemu-devel] [PATCH 6/7] block: add header update capability for VHDX images Jeff Cody
  2013-03-06 14:48 ` [Qemu-devel] [PATCH 7/7] block: add write support " Jeff Cody
  6 siblings, 0 replies; 49+ messages in thread
From: Jeff Cody @ 2013-03-06 14:48 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, sw, stefanha

This adds in read-only support to the VHDX image format.  This supports
reads for fixed-size, and dynamic sized VHDX images.

Differencing files are still unsupported.

The image must be opened without BDRV_O_RDWR set, because we do not
yet update the headers.  I.e., pass 'readonly=on' in the drive image
options from the QEMU commandline.

Signed-off-by: Jeff Cody <jcody@redhat.com>
---
 block/vhdx.c | 123 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 121 insertions(+), 2 deletions(-)

diff --git a/block/vhdx.c b/block/vhdx.c
index 7c9cd99..97775d2 100644
--- a/block/vhdx.c
+++ b/block/vhdx.c
@@ -116,6 +116,17 @@ typedef struct vhdx_metadata_entries {
 } vhdx_metadata_entries;
 
 
+typedef struct vhdx_sector_info {
+    uint32_t bat_idx;       /* BAT entry index */
+    uint32_t sectors_avail; /* sectors available in payload block */
+    uint32_t bytes_left;    /* bytes left in the block after data to r/w */
+    uint32_t bytes_avail;   /* bytes available in payload block */
+    uint64_t file_offset;   /* absolute offset in bytes, in file */
+    uint64_t block_offset;  /* block offset, in bytes */
+} vhdx_sector_info;
+
+
+
 typedef struct BDRVVHDXState {
     CoMutex lock;
 
@@ -657,7 +668,7 @@ static int vhdx_open(BlockDriverState *bs, int flags)
         goto fail;
     }
 
-    /* TODO: differencing files, read, write */
+    /* TODO: differencing files, write */
 
     return 0;
 fail:
@@ -672,10 +683,118 @@ static int vhdx_reopen_prepare(BDRVReopenState *state,
 }
 
 
+/*
+ * Perform sector to block offset translations, to get various
+ * sector and file offsets into the image.  See vhdx_sector_info
+ */
+static void vhdx_block_translate(BDRVVHDXState *s, int64_t sector_num,
+                                 int nb_sectors, vhdx_sector_info *sinfo)
+{
+    uint32_t block_offset;
+
+    sinfo->bat_idx = sector_num >> s->sectors_per_block_bits;
+    /* effectively a modulo - this gives us the offset into the block
+     * (in sector sizes) for our sector number */
+    block_offset = sector_num - (sinfo->bat_idx << s->sectors_per_block_bits);
+    /* the chunk ratio gives us the interleaving of the sector
+     * bitmaps, so we need to advance our page block index by the
+     * sector bitmaps entry number */
+    sinfo->bat_idx += sinfo->bat_idx >> s->chunk_ratio_bits;
+
+    /* the number of sectors we can read/write in this cycle */
+    sinfo->sectors_avail = s->sectors_per_block - block_offset;
+
+    sinfo->bytes_left = sinfo->sectors_avail << s->logical_sector_size_bits;
+
+    if (sinfo->sectors_avail > nb_sectors) {
+        sinfo->sectors_avail = nb_sectors;
+    }
+
+    sinfo->bytes_avail = sinfo->sectors_avail << s->logical_sector_size_bits;
+
+    sinfo->file_offset = s->bat[sinfo->bat_idx] >> VHDX_BAT_FILE_OFF_BITS;
+
+    sinfo->block_offset = block_offset << s->logical_sector_size_bits;
+
+    /* The file offset must be past the header section, so must be > 0 */
+    if (sinfo->file_offset == 0) {
+        return;
+    }
+
+    /* block offset is the offset in vhdx logical sectors, in
+     * the payload data block. Convert that to a byte offset
+     * in the block, and add in the payload data block offset
+     * in the file, in bytes, to get the final read address */
+
+    sinfo->file_offset <<= 20;  /* now in bytes, rather than 1MB units */
+    sinfo->file_offset += sinfo->block_offset;
+}
+
+
+
 static coroutine_fn int vhdx_co_readv(BlockDriverState *bs, int64_t sector_num,
                                       int nb_sectors, QEMUIOVector *qiov)
 {
-    return -ENOTSUP;
+    BDRVVHDXState *s = bs->opaque;
+    int ret = 0;
+    vhdx_sector_info sinfo;
+    uint64_t bytes_done = 0;
+    QEMUIOVector hd_qiov;
+
+    qemu_iovec_init(&hd_qiov, qiov->niov);
+
+    qemu_co_mutex_lock(&s->lock);
+
+    while (nb_sectors > 0) {
+        /* We are a differencing file, so we need to inspect the sector bitmap
+         * to see if we have the data or not */
+        if (s->params.data_bits & VHDX_PARAMS_HAS_PARENT) {
+            /* not supported yet */
+            ret = -ENOTSUP;
+            goto exit;
+        } else {
+            vhdx_block_translate(s, sector_num, nb_sectors, &sinfo);
+
+            qemu_iovec_reset(&hd_qiov);
+            qemu_iovec_concat(&hd_qiov, qiov,  bytes_done, sinfo.bytes_avail);
+
+            /* check the payload block state */
+            switch (s->bat[sinfo.bat_idx] & VHDX_BAT_STATE_BIT_MASK) {
+            case PAYLOAD_BLOCK_NOT_PRESENT: /* fall through */
+            case PAYLOAD_BLOCK_UNDEFINED:   /* fall through */
+            case PAYLOAD_BLOCK_UNMAPPED:    /* fall through */
+            case PAYLOAD_BLOCK_ZERO:
+                /* return zero */
+                qemu_iovec_memset(&hd_qiov, 0, 0, sinfo.bytes_avail);
+                break;
+            case PAYLOAD_BLOCK_FULL_PRESENT:
+                qemu_co_mutex_unlock(&s->lock);
+                ret = bdrv_co_readv(bs->file,
+                                    sinfo.file_offset >> BDRV_SECTOR_BITS,
+                                    sinfo.sectors_avail, &hd_qiov);
+                qemu_co_mutex_lock(&s->lock);
+                if (ret < 0) {
+                    goto exit;
+                }
+                break;
+            case PAYLOAD_BLOCK_PARTIALLY_PRESENT:
+                /* we don't yet support difference files, fall through
+                 * to error */
+            default:
+                ret = -EIO;
+                goto exit;
+                break;
+            }
+            nb_sectors -= sinfo.sectors_avail;
+            sector_num += sinfo.sectors_avail;
+            bytes_done += sinfo.bytes_avail;
+        }
+    }
+    ret = 0;
+exit:
+    qemu_co_mutex_unlock(&s->lock);
+    qemu_iovec_destroy(&hd_qiov);
+    return ret;
 }
 
 
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [Qemu-devel] [PATCH 6/7] block: add header update capability for VHDX images.
  2013-03-06 14:46 [Qemu-devel] [PATCH 0/7] Initial VHDX support (and a bug fix for QCOW) Jeff Cody
                   ` (4 preceding siblings ...)
  2013-03-06 14:48 ` [Qemu-devel] [PATCH 5/7] block: add read-only support to VHDX image format Jeff Cody
@ 2013-03-06 14:48 ` Jeff Cody
  2013-03-07 14:50   ` Stefan Hajnoczi
  2013-03-06 14:48 ` [Qemu-devel] [PATCH 7/7] block: add write support " Jeff Cody
  6 siblings, 1 reply; 49+ messages in thread
From: Jeff Cody @ 2013-03-06 14:48 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, sw, stefanha

This adds the ability to update the headers in a VHDX image, including
generating a new MS-compatible GUID, and checksum.

Signed-off-by: Jeff Cody <jcody@redhat.com>
---
 block/vhdx.c | 165 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 163 insertions(+), 2 deletions(-)

diff --git a/block/vhdx.c b/block/vhdx.c
index 97775d2..13d1e7f 100644
--- a/block/vhdx.c
+++ b/block/vhdx.c
@@ -27,6 +27,11 @@
     le16_to_cpus(&(guid)->data2); \
     le16_to_cpus(&(guid)->data3); } while (0)
 
+#define cpu_to_leguids(guid) do { \
+    cpu_to_le32s(&(guid)->data1); \
+    cpu_to_le16s(&(guid)->data2); \
+    cpu_to_le16s(&(guid)->data3); } while (0)
+
 /* Several metadata and region table data entries are identified by
  * guids in  a MS-specific GUID format. */
 
@@ -160,11 +165,35 @@ typedef struct BDRVVHDXState {
     vhdx_bat_entry *bat;
     uint64_t bat_offset;
 
+    ms_guid session_guid;
+
+
     vhdx_parent_locator_header parent_header;
     vhdx_parent_locator_entry *parent_entries;
 
 } BDRVVHDXState;
 
+/* Calculates new checksum.
+ *
+ * Zero is substituted during crc calculation for the original crc field
+ * crc_offset: byte offset in buf of the buffer crc
+ * buf: buffer pointer
+ * size: size of buffer (must be > crc_offset+4)
+ */
+static uint32_t vhdx_update_checksum(uint8_t *buf, size_t size, int crc_offset)
+{
+    uint32_t crc;
+
+    assert(buf != NULL);
+    assert(size > (crc_offset+4));
+
+    memset(buf+crc_offset, 0, sizeof(crc));
+    crc =  crc32c(0xffffffff, buf, size);
+    memcpy(buf+crc_offset, &crc, sizeof(crc));
+
+    return crc;
+}
+
 /* Validates the checksum of the buffer, with an in-place CRC.
  *
  * Zero is substituted during crc calculation for the original crc field,
@@ -198,6 +227,33 @@ static bool vhdx_checksum_is_valid(uint8_t *buf, size_t size, int crc_offset)
 }
 
 /*
+ * This generates a UUID that is compliant with the MS GUIDs used
+ * in the VHDX spec (and elsewhere).
+ *
+ * We can do this with uuid_generate if uuid.h is present,
+ * however not all systems have uuid and the generation is
+ * pretty straightforward for the DCE + random usage case
+ *
+ */
+static void vhdx_guid_generate(ms_guid *guid)
+{
+    assert(guid != NULL);
+
+    int i;
+
+    guid->data1 = g_random_int();
+    guid->data2 = g_random_int_range(0, 0xffff);
+    guid->data3 = g_random_int_range(0, 0x0fff);
+    guid->data3 |= 0x4000; /* denotes random algorithm */
+
+    guid->data4[0] = g_random_int_range(0, 0x3f);
+    guid->data4[0] |= 0x80; /* denotes DCE type */
+    for (i = 1; i < sizeof(guid->data4); i++) {
+        guid->data4[i] = g_random_int_range(0, 0xff);
+    }
+}
+
+/*
  * Per the MS VHDX Specification, for every VHDX file:
  *      - The header section is fixed size - 1 MB
  *      - The header section is always the first "object"
@@ -236,6 +292,107 @@ static void vhdx_header_le_import(vhdx_header *h)
     le64_to_cpus(&h->log_offset);
 }
 
+/* All VHDX structures on disk are little endian */
+static void vhdx_header_le_export(vhdx_header *orig_h, vhdx_header *new_h)
+{
+    assert(orig_h != NULL);
+    assert(new_h != NULL);
+
+    new_h->signature       = cpu_to_le32(orig_h->signature);
+    new_h->checksum        = cpu_to_le32(orig_h->checksum);
+    new_h->sequence_number = cpu_to_le64(orig_h->sequence_number);
+
+    memcpy(&new_h->file_write_guid, &orig_h->file_write_guid, sizeof(ms_guid));
+    memcpy(&new_h->data_write_guid, &orig_h->data_write_guid, sizeof(ms_guid));
+    memcpy(&new_h->log_guid,        &orig_h->log_guid,        sizeof(ms_guid));
+
+    cpu_to_leguids(&new_h->file_write_guid);
+    cpu_to_leguids(&new_h->data_write_guid);
+    cpu_to_leguids(&new_h->log_guid);
+
+    new_h->log_version     = cpu_to_le16(orig_h->log_version);
+    new_h->version         = cpu_to_le16(orig_h->version);
+    new_h->log_length      = cpu_to_le32(orig_h->log_length);
+    new_h->log_offset      = cpu_to_le64(orig_h->log_offset);
+}
+
+/* Update the VHDX headers
+ *
+ * This follows the VHDX spec procedures for header updates.
+ *
+ *  - non-current header is updated with largest sequence number
+ */
+static int vhdx_update_header(BlockDriverState *bs, BDRVVHDXState *s, bool rw)
+{
+    int ret = 0;
+    int hdr_idx = 0;
+    uint64_t header_offset = VHDX_HEADER1_OFFSET;
+
+    vhdx_header *active_header;
+    vhdx_header *inactive_header;
+    vhdx_header header_le;
+    uint8_t *buffer;
+
+    /* operate on the non-current header */
+    if (s->curr_header == 0) {
+        hdr_idx = 1;
+        header_offset = VHDX_HEADER2_OFFSET;
+    }
+
+    active_header   = s->headers[s->curr_header];
+    inactive_header = s->headers[hdr_idx];
+
+    inactive_header->sequence_number = active_header->sequence_number + 1;
+
+    /* a new file guid must be generate before any file write, including
+     * headers */
+    memcpy(&inactive_header->file_write_guid, &s->session_guid,
+           sizeof(ms_guid));
+
+    /* a new data guid only needs to be generate before any guest-visisble
+     * writes, so update it if the image is opened r/w. */
+    if (rw) {
+        vhdx_guid_generate(&inactive_header->data_write_guid);
+    }
+
+    /* the header checksum is not over just the packed size of vhdx_header,
+     * but rather over the entire 'reserved' range for the header, which is
+     * 4KB (VHDX_HEADER_SIZE). */
+
+    buffer = g_malloc(VHDX_HEADER_SIZE);
+    /* we can't assume the extra reserved bytes are 0 */
+    ret = bdrv_pread(bs->file, header_offset, buffer, VHDX_HEADER_SIZE);
+    if (ret < 0) {
+        goto fail;
+    }
+    /* overwrite the actual vhdx_header portion */
+    memcpy(buffer, inactive_header, sizeof(vhdx_header));
+    inactive_header->checksum = vhdx_update_checksum(buffer,
+                                                     VHDX_HEADER_SIZE, 4);
+    vhdx_header_le_export(inactive_header, &header_le);
+    bdrv_pwrite_sync(bs->file, header_offset, &header_le, sizeof(vhdx_header));
+    s->curr_header = hdr_idx;
+
+fail:
+    g_free(buffer);
+    return ret;
+}
+
+/*
+ * The VHDX spec calls for header updates to be performed twice, so that both
+ * the current and non-current header have valid info
+ */
+static int vhdx_update_headers(BlockDriverState *bs, BDRVVHDXState *s, bool rw)
+{
+    int ret;
+
+    ret = vhdx_update_header(bs, s, rw);
+    if (ret < 0) {
+        return ret;
+    }
+    ret = vhdx_update_header(bs, s, rw);
+    return ret;
+}
 
 /* opens the specified header block from the VHDX file header section */
 static int vhdx_parse_header(BlockDriverState *bs, BDRVVHDXState *s)
@@ -628,6 +785,11 @@ static int vhdx_open(BlockDriverState *bs, int flags)
 
     qemu_co_mutex_init(&s->lock);
 
+    /* This is used for any header updates, for the file_write_guid.
+     * The spec dictates that a new value should be used for the first
+     * header update */
+    vhdx_guid_generate(&s->session_guid);
+
     ret = vhdx_parse_header(bs, s);
     if (ret) {
         goto fail;
@@ -664,8 +826,7 @@ static int vhdx_open(BlockDriverState *bs, int flags)
     }
 
     if (flags & BDRV_O_RDWR) {
-        ret = -ENOTSUP;
-        goto fail;
+        vhdx_update_headers(bs, s, false);
     }
 
     /* TODO: differencing files, write */
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [Qemu-devel] [PATCH 7/7] block: add write support for VHDX images
  2013-03-06 14:46 [Qemu-devel] [PATCH 0/7] Initial VHDX support (and a bug fix for QCOW) Jeff Cody
                   ` (5 preceding siblings ...)
  2013-03-06 14:48 ` [Qemu-devel] [PATCH 6/7] block: add header update capability for VHDX images Jeff Cody
@ 2013-03-06 14:48 ` Jeff Cody
  2013-03-07 15:59   ` Stefan Hajnoczi
  6 siblings, 1 reply; 49+ messages in thread
From: Jeff Cody @ 2013-03-06 14:48 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, sw, stefanha

This adds in write support for VHDX images.  This supports writing to
both dynamic and fixed VHDX images.  Differencing images are still
unsupported.

Signed-off-by: Jeff Cody <jcody@redhat.com>
---
 block/vhdx.c | 144 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 142 insertions(+), 2 deletions(-)

diff --git a/block/vhdx.c b/block/vhdx.c
index 13d1e7f..316a5d8 100644
--- a/block/vhdx.c
+++ b/block/vhdx.c
@@ -165,6 +165,7 @@ typedef struct BDRVVHDXState {
     vhdx_bat_entry *bat;
     uint64_t bat_offset;
 
+    bool first_visible_write;
     ms_guid session_guid;
 
 
@@ -782,6 +783,7 @@ static int vhdx_open(BlockDriverState *bs, int flags)
     int i;
 
     s->bat = NULL;
+    s->first_visible_write = true;
 
     qemu_co_mutex_init(&s->lock);
 
@@ -829,7 +831,7 @@ static int vhdx_open(BlockDriverState *bs, int flags)
         vhdx_update_headers(bs, s, false);
     }
 
-    /* TODO: differencing files, write */
+    /* TODO: differencing files */
 
     return 0;
 fail:
@@ -958,12 +960,150 @@ exit:
     return ret;
 }
 
+/*
+ * Allocate a new payload block at the end of the file.
+ *
+ * Allocation will happen at 1MB alignment inside the file
+ *
+ * Returns the file offset start of the new payload block
+ */
+static int vhdx_allocate_block(BlockDriverState *bs, BDRVVHDXState *s,
+                                    uint64_t *new_offset)
+{
+    *new_offset = bdrv_getlength(bs->file);
+
+    /* per the spec, the address for a block is in units of 1MB */
+    if (*new_offset % (1024*1024)) {
+        *new_offset = ((*new_offset >> 20) + 1) << 20;  /* round up to 1MB */
+    }
+    return bdrv_truncate(bs->file, *new_offset + s->block_size);
+}
+
+/*
+ * Update the BAT tablet entry with the new file offset, and the new entry
+ * state */
+static int vhdx_update_bat_table_entry(BlockDriverState *bs, BDRVVHDXState *s,
+                                       vhdx_sector_info *sinfo, int state)
+{
+    uint64_t bat_tmp;
+    uint64_t bat_entry_offset;
+
+    /* The BAT entry is a uint64, with 44 bits for the file offset in units of
+     * 1MB, and 3 bits for the block state. */
+    s->bat[sinfo->bat_idx]  = ((sinfo->file_offset>>20) <<
+                               VHDX_BAT_FILE_OFF_BITS);
+
+    s->bat[sinfo->bat_idx] |= state & VHDX_BAT_STATE_BIT_MASK;
 
+    bat_tmp = cpu_to_le64(s->bat[sinfo->bat_idx]);
+    bat_entry_offset = s->bat_offset + sinfo->bat_idx * sizeof(vhdx_bat_entry);
+
+    return bdrv_pwrite_sync(bs->file, bat_entry_offset, &bat_tmp,
+                            sizeof(vhdx_bat_entry));
+}
 
 static coroutine_fn int vhdx_co_writev(BlockDriverState *bs, int64_t sector_num,
                                       int nb_sectors, QEMUIOVector *qiov)
 {
-    return -ENOTSUP;
+    int ret = -ENOTSUP;
+    BDRVVHDXState *s = bs->opaque;
+    vhdx_sector_info sinfo;
+    uint64_t bytes_done = 0;
+    QEMUIOVector hd_qiov;
+
+    qemu_iovec_init(&hd_qiov, qiov->niov);
+
+    qemu_co_mutex_lock(&s->lock);
+
+    /* Per the spec, on the first write of guest-visible data to the file the
+     * data write guid must be updated in the header */
+    if (s->first_visible_write) {
+        s->first_visible_write = false;
+        vhdx_update_headers(bs, s, true);
+    }
+
+    while (nb_sectors > 0) {
+        if (s->params.data_bits & VHDX_PARAMS_HAS_PARENT) {
+            /* not supported yet */
+            ret = -ENOTSUP;
+            goto exit;
+        } else {
+            vhdx_block_translate(s, sector_num, nb_sectors, &sinfo);
+
+            qemu_iovec_reset(&hd_qiov);
+            qemu_iovec_concat(&hd_qiov, qiov,  bytes_done, sinfo.bytes_avail);
+            /* check the payload block state */
+            switch (s->bat[sinfo.bat_idx] & VHDX_BAT_STATE_BIT_MASK) {
+            case PAYLOAD_BLOCK_ZERO:
+                /* in this case, we need to preserve zero writes for
+                 * data that is not part of this write, so we must pad
+                 * the rest of the buffer to zeroes */
+
+                /* if we are on a posix system with ftruncate() that extends
+                 * a file, then it is zero-filled for us.  On Win32, the raw
+                 * layer uses SetFilePointer and SetFileEnd, which does not
+                 * zero fill AFAIK */
+
+                /* TODO: queue another write of zero buffers if the host OS does
+                 * not zero-fill on file extension */
+
+                /* fall through */
+            case PAYLOAD_BLOCK_NOT_PRESENT: /* fall through */
+            case PAYLOAD_BLOCK_UNMAPPED:    /* fall through */
+            case PAYLOAD_BLOCK_UNDEFINED:   /* fall through */
+                ret = vhdx_allocate_block(bs, s, &sinfo.file_offset);
+                if (ret < 0) {
+                    goto exit;
+                }
+                /* once we support differencing files, this may also be
+                 * partially present */
+                /* update block state to the newly specified state */
+                ret = vhdx_update_bat_table_entry(bs, s, &sinfo,
+                                                  PAYLOAD_BLOCK_FULL_PRESENT);
+                if (ret < 0) {
+                    goto exit;
+                }
+                /* since we just allocated a block, file_offset is the
+                 * beginning of the payload block. It needs to be the
+                 * write address, which includes the offset into the block */
+                sinfo.file_offset += sinfo.block_offset;
+                /* fall through */
+            case PAYLOAD_BLOCK_FULL_PRESENT:
+                /* if the file offset address is in the header zone,
+                 * there is a problem */
+                if (sinfo.file_offset < (1024*1024)) {
+                    ret = -EFAULT;
+                    goto exit;
+                }
+                /* block exists, so we can just overwrite it */
+                qemu_co_mutex_unlock(&s->lock);
+                ret = bdrv_co_writev(bs->file,
+                                    sinfo.file_offset>>BDRV_SECTOR_BITS,
+                                    sinfo.sectors_avail, &hd_qiov);
+                qemu_co_mutex_lock(&s->lock);
+                if (ret < 0) {
+                    goto exit;
+                }
+                break;
+            case PAYLOAD_BLOCK_PARTIALLY_PRESENT:
+                /* we don't yet support difference files, fall through
+                 * to error */
+            default:
+                ret = -EIO;
+                goto exit;
+                break;
+            }
+
+            nb_sectors -= sinfo.sectors_avail;
+            sector_num += sinfo.sectors_avail;
+            bytes_done += sinfo.bytes_avail;
+
+        }
+    }
+
+exit:
+    qemu_co_mutex_unlock(&s->lock);
+    return ret;
 }
 
 
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH 1/7] block: only force IO completion in .bdrv_truncate if we are shrinking
  2013-03-06 14:47 ` [Qemu-devel] [PATCH 1/7] block: only force IO completion in .bdrv_truncate if we are shrinking Jeff Cody
@ 2013-03-06 17:50   ` Peter Lieven
  2013-03-06 18:06     ` Paolo Bonzini
  2013-03-06 18:32     ` Jeff Cody
  0 siblings, 2 replies; 49+ messages in thread
From: Peter Lieven @ 2013-03-06 17:50 UTC (permalink / raw)
  To: Jeff Cody; +Cc: kwolf, sw, qemu-devel, stefanha, pbonzini

Am 06.03.2013 15:47, schrieb Jeff Cody:
> Commit 9a665b2b made bdrv_truncate() call bdrv_drain_all(), but this breaks
> QCOW images, as well other future image formats (such as VHDX) that may call
> bdrv_truncate(bs->file) from within a read/write operation.  For example, QCOW
> will cause an assert, due to tracked_requests not being empty (since the
> read/write that called bdrv_truncate() is still in progress).
> 
> This modifies the check to only force the bdrv_drain_all() call if the BDS to
> be truncated is not growable, or if we are shrinking the BDS.  Otherwise, if we
> are just growing the file, allow it to happen without forcing a call to
> bdrv_drain_all().

This will actually break iscsi_truncate(). Paolo requested that the iscsi_tuncate command
should use sync I/O for rereading the LUN capacity. Using sync I/O is only possible if
there are no async requests in-flight.

Looking at the source I have not found a place where bs->growable is set to 0 for any
block driver, maybe I miss something. Having bs->growable for iSCSI would also be ok.

Shouldn't it be possible to call bdrv_drain_all() any time? There are other places
where this is called. One I have in mind is e.g. if you cancel an ongoing block migration.

Peter

> 
> Signed-off-by: Jeff Cody <jcody@redhat.com>
> ---
>  block.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/block.c b/block.c
> index 124a9eb..dce844c 100644
> --- a/block.c
> +++ b/block.c
> @@ -2450,8 +2450,14 @@ int bdrv_truncate(BlockDriverState *bs, int64_t offset)
>      if (bdrv_in_use(bs))
>          return -EBUSY;
>  
> -    /* There better not be any in-flight IOs when we truncate the device. */
> -    bdrv_drain_all();
> +    /* Don't force a drain if we are just growing the file - otherwise,
> +     * using bdrv_truncate() from within a block driver in a read/write
> +     * operation will cause an assert, because bdrv_drain_all() will assert if
> +     * a tracked_request is still in progress. */
> +    if (!bs->growable || offset < bdrv_getlength(bs)) {
> +        /* There better not be any in-flight IOs when we truncate the device. */
> +        bdrv_drain_all();
> +    }
>  
>      ret = drv->bdrv_truncate(bs, offset);
>      if (ret == 0) {
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH 1/7] block: only force IO completion in .bdrv_truncate if we are shrinking
  2013-03-06 17:50   ` Peter Lieven
@ 2013-03-06 18:06     ` Paolo Bonzini
  2013-03-06 18:14       ` Jeff Cody
  2013-03-06 18:27       ` Peter Lieven
  2013-03-06 18:32     ` Jeff Cody
  1 sibling, 2 replies; 49+ messages in thread
From: Paolo Bonzini @ 2013-03-06 18:06 UTC (permalink / raw)
  To: Peter Lieven; +Cc: kwolf, sw, Jeff Cody, qemu-devel, stefanha

Il 06/03/2013 18:50, Peter Lieven ha scritto:
>> > Commit 9a665b2b made bdrv_truncate() call bdrv_drain_all(), but this breaks
>> > QCOW images, as well other future image formats (such as VHDX) that may call
>> > bdrv_truncate(bs->file) from within a read/write operation.  For example, QCOW
>> > will cause an assert, due to tracked_requests not being empty (since the
>> > read/write that called bdrv_truncate() is still in progress).

I'm not sure such bdrv_truncate calls are necessary.  QCOW2 doesn't have
them (almost; there is one in qcow2_write_compressed, I'm not even sure
that one is necessary though), and I think QCOW's breaks using it with a
block device as a backing file.

Paolo

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH 1/7] block: only force IO completion in .bdrv_truncate if we are shrinking
  2013-03-06 18:06     ` Paolo Bonzini
@ 2013-03-06 18:14       ` Jeff Cody
  2013-03-06 18:31         ` Paolo Bonzini
       [not found]         ` <51378A23.5090301@dlhnet.de>
  2013-03-06 18:27       ` Peter Lieven
  1 sibling, 2 replies; 49+ messages in thread
From: Jeff Cody @ 2013-03-06 18:14 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kwolf, sw, Peter Lieven, qemu-devel, stefanha

On Wed, Mar 06, 2013 at 07:06:24PM +0100, Paolo Bonzini wrote:
> Il 06/03/2013 18:50, Peter Lieven ha scritto:
> >> > Commit 9a665b2b made bdrv_truncate() call bdrv_drain_all(), but this breaks
> >> > QCOW images, as well other future image formats (such as VHDX) that may call
> >> > bdrv_truncate(bs->file) from within a read/write operation.  For example, QCOW
> >> > will cause an assert, due to tracked_requests not being empty (since the
> >> > read/write that called bdrv_truncate() is still in progress).
> 
> I'm not sure such bdrv_truncate calls are necessary.  QCOW2 doesn't have
> them (almost; there is one in qcow2_write_compressed, I'm not even sure
> that one is necessary though), and I think QCOW's breaks using it with a
> block device as a backing file.
> 
> Paolo

QCOW breaks with it using a normal raw posix file as a device.  As a
test: qemu-img create -f qcow test.qcow 5G.  Now run qemu with that
drive mounted, and try to partition and format it.  QEMU now asserts.

The nicety of being able to using truncate during a write call,
especially for VHDX (which can have relatively large block/cluster
sizes), so to grow the file sparsely in a dynamically allocated file.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH 1/7] block: only force IO completion in .bdrv_truncate if we are shrinking
  2013-03-06 18:06     ` Paolo Bonzini
  2013-03-06 18:14       ` Jeff Cody
@ 2013-03-06 18:27       ` Peter Lieven
  2013-03-07  8:57         ` Kevin Wolf
  1 sibling, 1 reply; 49+ messages in thread
From: Peter Lieven @ 2013-03-06 18:27 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kwolf, sw, Jeff Cody, qemu-devel, stefanha

Am 06.03.2013 19:06, schrieb Paolo Bonzini:
> Il 06/03/2013 18:50, Peter Lieven ha scritto:
>>>> Commit 9a665b2b made bdrv_truncate() call bdrv_drain_all(), but this breaks
>>>> QCOW images, as well other future image formats (such as VHDX) that may call
>>>> bdrv_truncate(bs->file) from within a read/write operation.  For example, QCOW
>>>> will cause an assert, due to tracked_requests not being empty (since the
>>>> read/write that called bdrv_truncate() is still in progress).
> 
> I'm not sure such bdrv_truncate calls are necessary.  QCOW2 doesn't have
> them (almost; there is one in qcow2_write_compressed, I'm not even sure
> that one is necessary though), and I think QCOW's breaks using it with a
> block device as a backing file.

I think we have to check the sense of bs->growable nevertheless. It is set
to 1 for all drivers which can't be right?!

int bdrv_file_open(BlockDriverState **pbs, const char *filename, int flags)
{
    BlockDriverState *bs;
    BlockDriver *drv;
    int ret;

    drv = bdrv_find_protocol(filename);
    if (!drv) {
        return -ENOENT;
    }

    bs = bdrv_new("");
    ret = bdrv_open_common(bs, NULL, filename, flags, drv);
    if (ret < 0) {
        bdrv_delete(bs);
        return ret;
    }
    bs->growable = 1;
    *pbs = bs;
    return 0;
}

I think each driver should set it accordingly on its own.

Peter


> 
> Paolo
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH 1/7] block: only force IO completion in .bdrv_truncate if we are shrinking
  2013-03-06 18:14       ` Jeff Cody
@ 2013-03-06 18:31         ` Paolo Bonzini
  2013-03-06 18:48           ` Jeff Cody
       [not found]         ` <51378A23.5090301@dlhnet.de>
  1 sibling, 1 reply; 49+ messages in thread
From: Paolo Bonzini @ 2013-03-06 18:31 UTC (permalink / raw)
  To: Jeff Cody; +Cc: kwolf, sw, Peter Lieven, qemu-devel, stefanha

Il 06/03/2013 19:14, Jeff Cody ha scritto:
> QCOW breaks with it using a normal raw posix file as a device.  As a
> test: qemu-img create -f qcow test.qcow 5G.  Now run qemu with that
> drive mounted, and try to partition and format it.  QEMU now asserts.
> 
> The nicety of being able to using truncate during a write call,
> especially for VHDX (which can have relatively large block/cluster
> sizes), so to grow the file sparsely in a dynamically allocated file.

Perhaps we need two APIs, "truncate" and "revalidate".

Truncate should be a no-op if (!bs->growable).

Revalidate could be called by the block_resize monitor command with no
size specified.

Paolo

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH 1/7] block: only force IO completion in .bdrv_truncate if we are shrinking
  2013-03-06 17:50   ` Peter Lieven
  2013-03-06 18:06     ` Paolo Bonzini
@ 2013-03-06 18:32     ` Jeff Cody
  2013-03-06 20:22       ` Paolo Bonzini
  1 sibling, 1 reply; 49+ messages in thread
From: Jeff Cody @ 2013-03-06 18:32 UTC (permalink / raw)
  To: Peter Lieven; +Cc: kwolf, sw, qemu-devel, stefanha, pbonzini

On Wed, Mar 06, 2013 at 06:50:56PM +0100, Peter Lieven wrote:
> Am 06.03.2013 15:47, schrieb Jeff Cody:
> > Commit 9a665b2b made bdrv_truncate() call bdrv_drain_all(), but this breaks
> > QCOW images, as well other future image formats (such as VHDX) that may call
> > bdrv_truncate(bs->file) from within a read/write operation.  For example, QCOW
> > will cause an assert, due to tracked_requests not being empty (since the
> > read/write that called bdrv_truncate() is still in progress).
> > 
> > This modifies the check to only force the bdrv_drain_all() call if the BDS to
> > be truncated is not growable, or if we are shrinking the BDS.  Otherwise, if we
> > are just growing the file, allow it to happen without forcing a call to
> > bdrv_drain_all().
> 
> This will actually break iscsi_truncate(). Paolo requested that the iscsi_tuncate command
> should use sync I/O for rereading the LUN capacity. Using sync I/O is only possible if
> there are no async requests in-flight.
> 
> Looking at the source I have not found a place where bs->growable is set to 0 for any
> block driver, maybe I miss something. Having bs->growable for iSCSI would also be ok.
> 
> Shouldn't it be possible to call bdrv_drain_all() any time? There are other places
> where this is called. One I have in mind is e.g. if you cancel an ongoing block migration.
> 

That is a good point - what happens to QCOW now, if there is a block
job in progress (e.g. block-commit, block-stream, etc...)?  I would
imagine -EBUSY gets thrown, since bdrv_truncate() checks
bdrv_in_use().

> 
> > 
> > Signed-off-by: Jeff Cody <jcody@redhat.com>
> > ---
> >  block.c | 10 ++++++++--
> >  1 file changed, 8 insertions(+), 2 deletions(-)
> > 
> > diff --git a/block.c b/block.c
> > index 124a9eb..dce844c 100644
> > --- a/block.c
> > +++ b/block.c
> > @@ -2450,8 +2450,14 @@ int bdrv_truncate(BlockDriverState *bs, int64_t offset)
> >      if (bdrv_in_use(bs))
> >          return -EBUSY;
> >  
> > -    /* There better not be any in-flight IOs when we truncate the device. */
> > -    bdrv_drain_all();
> > +    /* Don't force a drain if we are just growing the file - otherwise,
> > +     * using bdrv_truncate() from within a block driver in a read/write
> > +     * operation will cause an assert, because bdrv_drain_all() will assert if
> > +     * a tracked_request is still in progress. */
> > +    if (!bs->growable || offset < bdrv_getlength(bs)) {
> > +        /* There better not be any in-flight IOs when we truncate the device. */
> > +        bdrv_drain_all();
> > +    }
> >  
> >      ret = drv->bdrv_truncate(bs, offset);
> >      if (ret == 0) {
> > 
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH 1/7] block: only force IO completion in .bdrv_truncate if we are shrinking
       [not found]           ` <20130306184217.GC3743@localhost.localdomain>
@ 2013-03-06 18:46             ` Peter Lieven
  0 siblings, 0 replies; 49+ messages in thread
From: Peter Lieven @ 2013-03-06 18:46 UTC (permalink / raw)
  To: Jeff Cody; +Cc: Kevin Wolf, Paolo Bonzini, Stefan Hajnoczi, qemu-devel, sw

Am 06.03.2013 19:42, schrieb Jeff Cody:
> On Wed, Mar 06, 2013 at 07:25:39PM +0100, Peter Lieven wrote:
>> Am 06.03.2013 19:14, schrieb Jeff Cody:
>>> On Wed, Mar 06, 2013 at 07:06:24PM +0100, Paolo Bonzini wrote:
>>>> Il 06/03/2013 18:50, Peter Lieven ha scritto:
>>>>>>> Commit 9a665b2b made bdrv_truncate() call bdrv_drain_all(), but this breaks
>>>>>>> QCOW images, as well other future image formats (such as VHDX) that may call
>>>>>>> bdrv_truncate(bs->file) from within a read/write operation.  For example, QCOW
>>>>>>> will cause an assert, due to tracked_requests not being empty (since the
>>>>>>> read/write that called bdrv_truncate() is still in progress).
>>>>
>>>> I'm not sure such bdrv_truncate calls are necessary.  QCOW2 doesn't have
>>>> them (almost; there is one in qcow2_write_compressed, I'm not even sure
>>>> that one is necessary though), and I think QCOW's breaks using it with a
>>>> block device as a backing file.
>>>>
>>>> Paolo
>>>
>>> QCOW breaks with it using a normal raw posix file as a device.  As a
>>> test: qemu-img create -f qcow test.qcow 5G.  Now run qemu with that
>>> drive mounted, and try to partition and format it.  QEMU now asserts.
>>>
>>> The nicety of being able to using truncate during a write call,
>>> especially for VHDX (which can have relatively large block/cluster
>>> sizes), so to grow the file sparsely in a dynamically allocated file.
>>>
>>
>> why don't you call vhdx_truncate() there instead of bdrv_truncate()?
>>
>> Peter
>>
> 
> What I want to do is grow the underlying file, so bdrv_truncate() is
> called on bs->file.  vhdx_truncate() would primarily care about
> updating the vhdx file structures / headers to reflect the new file
> size (which is what vhdx_block_allocate() does), but I need to
> actually make the underlying file larger (and ideally, in a sparse
> fashion). 
> 

As Paolo suggested we probably need to fix the underlying misuage of
bdrv_truncate() for non growable devices and for that we additionally
need to set bs->growable not to 1 per default, but only if the devices
are really growable: qcow2, raw (only on regular file), vhdx? I maybe
missing some.

Paolo?

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH 1/7] block: only force IO completion in .bdrv_truncate if we are shrinking
  2013-03-06 18:31         ` Paolo Bonzini
@ 2013-03-06 18:48           ` Jeff Cody
  2013-03-06 19:03             ` Peter Lieven
  0 siblings, 1 reply; 49+ messages in thread
From: Jeff Cody @ 2013-03-06 18:48 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kwolf, sw, Peter Lieven, qemu-devel, stefanha

On Wed, Mar 06, 2013 at 07:31:51PM +0100, Paolo Bonzini wrote:
> Il 06/03/2013 19:14, Jeff Cody ha scritto:
> > QCOW breaks with it using a normal raw posix file as a device.  As a
> > test: qemu-img create -f qcow test.qcow 5G.  Now run qemu with that
> > drive mounted, and try to partition and format it.  QEMU now asserts.
> > 
> > The nicety of being able to using truncate during a write call,
> > especially for VHDX (which can have relatively large block/cluster
> > sizes), so to grow the file sparsely in a dynamically allocated file.
> 
> Perhaps we need two APIs, "truncate" and "revalidate".
> 
> Truncate should be a no-op if (!bs->growable).
> 
> Revalidate could be called by the block_resize monitor command with no
> size specified.
> 
> Paolo

I think that is a good solution.  Is it better to have "truncate" and
"revalidate", or "truncate" and "grow", with grow being a subset of
truncate, with fewer restrictions?  There may still be operations
where it is OK to grow a file, but not OK to shrink it.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH 1/7] block: only force IO completion in .bdrv_truncate if we are shrinking
  2013-03-06 18:48           ` Jeff Cody
@ 2013-03-06 19:03             ` Peter Lieven
  2013-03-06 20:39               ` Paolo Bonzini
  0 siblings, 1 reply; 49+ messages in thread
From: Peter Lieven @ 2013-03-06 19:03 UTC (permalink / raw)
  To: Jeff Cody; +Cc: kwolf, Paolo Bonzini, qemu-devel, stefanha, sw

Am 06.03.2013 19:48, schrieb Jeff Cody:
> On Wed, Mar 06, 2013 at 07:31:51PM +0100, Paolo Bonzini wrote:
>> Il 06/03/2013 19:14, Jeff Cody ha scritto:
>>> QCOW breaks with it using a normal raw posix file as a device.  As a
>>> test: qemu-img create -f qcow test.qcow 5G.  Now run qemu with that
>>> drive mounted, and try to partition and format it.  QEMU now asserts.
>>>
>>> The nicety of being able to using truncate during a write call,
>>> especially for VHDX (which can have relatively large block/cluster
>>> sizes), so to grow the file sparsely in a dynamically allocated file.
>>
>> Perhaps we need two APIs, "truncate" and "revalidate".
>>
>> Truncate should be a no-op if (!bs->growable).
>>
>> Revalidate could be called by the block_resize monitor command with no
>> size specified.
>>
>> Paolo
> 
> I think that is a good solution.  Is it better to have "truncate" and
> "revalidate", or "truncate" and "grow", with grow being a subset of
> truncate, with fewer restrictions?  There may still be operations
> where it is OK to grow a file, but not OK to shrink it.
> 

Or as a first step:

a) Call brdv_drain_all() only if the device is shrinked (independently of !bs->growable)
b) Call brdv_drain_all() inside iscsi_truncate() because it is a special requirement there
c) Fix the value of bs->growable for all drivers

Peter

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH 1/7] block: only force IO completion in .bdrv_truncate if we are shrinking
  2013-03-06 18:32     ` Jeff Cody
@ 2013-03-06 20:22       ` Paolo Bonzini
  0 siblings, 0 replies; 49+ messages in thread
From: Paolo Bonzini @ 2013-03-06 20:22 UTC (permalink / raw)
  To: Jeff Cody; +Cc: kwolf, sw, Peter Lieven, qemu-devel, stefanha

Il 06/03/2013 19:32, Jeff Cody ha scritto:
> On Wed, Mar 06, 2013 at 06:50:56PM +0100, Peter Lieven wrote:
>> Looking at the source I have not found a place where bs->growable is set to 0 for any
>> block driver, maybe I miss something. Having bs->growable for iSCSI would also be ok.
>>
>> Shouldn't it be possible to call bdrv_drain_all() any time? There are other places
>> where this is called. One I have in mind is e.g. if you cancel an ongoing block migration.
> 
> That is a good point - what happens to QCOW now, if there is a block
> job in progress (e.g. block-commit, block-stream, etc...)?  I would
> imagine -EBUSY gets thrown, since bdrv_truncate() checks
> bdrv_in_use().

No, bs->file is not marked in use.  Only bs is.

Paolo

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH 1/7] block: only force IO completion in .bdrv_truncate if we are shrinking
  2013-03-06 19:03             ` Peter Lieven
@ 2013-03-06 20:39               ` Paolo Bonzini
  2013-03-07  8:50                 ` Kevin Wolf
  2013-03-07  8:53                 ` Peter Lieven
  0 siblings, 2 replies; 49+ messages in thread
From: Paolo Bonzini @ 2013-03-06 20:39 UTC (permalink / raw)
  To: Peter Lieven; +Cc: kwolf, sw, Jeff Cody, qemu-devel, stefanha

Il 06/03/2013 20:03, Peter Lieven ha scritto:
> Am 06.03.2013 19:48, schrieb Jeff Cody:
>> On Wed, Mar 06, 2013 at 07:31:51PM +0100, Paolo Bonzini wrote:
>>> Il 06/03/2013 19:14, Jeff Cody ha scritto:
>>>> QCOW breaks with it using a normal raw posix file as a device.  As a
>>>> test: qemu-img create -f qcow test.qcow 5G.  Now run qemu with that
>>>> drive mounted, and try to partition and format it.  QEMU now asserts.
>>>>
>>>> The nicety of being able to using truncate during a write call,
>>>> especially for VHDX (which can have relatively large block/cluster
>>>> sizes), so to grow the file sparsely in a dynamically allocated file.
>>>
>>> Perhaps we need two APIs, "truncate" and "revalidate".
>>>
>>> Truncate should be a no-op if (!bs->growable).
>>>
>>> Revalidate could be called by the block_resize monitor command with no
>>> size specified.
>>>
>>> Paolo
>>
>> I think that is a good solution.  Is it better to have "truncate" and
>> "revalidate", or "truncate" and "grow", with grow being a subset of
>> truncate, with fewer restrictions?  There may still be operations
>> where it is OK to grow a file, but not OK to shrink it.
> 
> Or as a first step:
> 
> a) Call brdv_drain_all() only if the device is shrinked (independently of !bs->growable)
> b) Call brdv_drain_all() inside iscsi_truncate() because it is a special requirement there
> c) Fix the value of bs->growable for all drivers

Let's start from (c).  bdrv_file_open sets bs->growable = 1.  I think it
should be removed and only the file protocol should set it.

Then we can add bdrv_revalidate and, for block_resize, call
bdrv_revalidate+bdrv_truncate.  For bs->growable = 0 &&
!bs->drv->bdrv_truncate, bdrv_truncate can just check that the actual
size is the same or bigger as the one requested, and fail otherwise.

Paolo

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH 1/7] block: only force IO completion in .bdrv_truncate if we are shrinking
  2013-03-06 20:39               ` Paolo Bonzini
@ 2013-03-07  8:50                 ` Kevin Wolf
  2013-03-07  8:56                   ` Peter Lieven
  2013-03-07 16:45                   ` Paolo Bonzini
  2013-03-07  8:53                 ` Peter Lieven
  1 sibling, 2 replies; 49+ messages in thread
From: Kevin Wolf @ 2013-03-07  8:50 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: sw, Jeff Cody, Peter Lieven, qemu-devel, stefanha

Am 06.03.2013 um 21:39 hat Paolo Bonzini geschrieben:
> Il 06/03/2013 20:03, Peter Lieven ha scritto:
> > Am 06.03.2013 19:48, schrieb Jeff Cody:
> >> On Wed, Mar 06, 2013 at 07:31:51PM +0100, Paolo Bonzini wrote:
> >>> Il 06/03/2013 19:14, Jeff Cody ha scritto:
> >>>> QCOW breaks with it using a normal raw posix file as a device.  As a
> >>>> test: qemu-img create -f qcow test.qcow 5G.  Now run qemu with that
> >>>> drive mounted, and try to partition and format it.  QEMU now asserts.
> >>>>
> >>>> The nicety of being able to using truncate during a write call,
> >>>> especially for VHDX (which can have relatively large block/cluster
> >>>> sizes), so to grow the file sparsely in a dynamically allocated file.
> >>>
> >>> Perhaps we need two APIs, "truncate" and "revalidate".
> >>>
> >>> Truncate should be a no-op if (!bs->growable).
> >>>
> >>> Revalidate could be called by the block_resize monitor command with no
> >>> size specified.
> >>>
> >>> Paolo
> >>
> >> I think that is a good solution.  Is it better to have "truncate" and
> >> "revalidate", or "truncate" and "grow", with grow being a subset of
> >> truncate, with fewer restrictions?  There may still be operations
> >> where it is OK to grow a file, but not OK to shrink it.

What semantics would the both operations have? Is truncate the same as
it used to be? I don't really understand what "revalidate" would do, it
sounds like a read-only operation from its name?

> > Or as a first step:
> > 
> > a) Call brdv_drain_all() only if the device is shrinked (independently of !bs->growable)
> > b) Call brdv_drain_all() inside iscsi_truncate() because it is a special requirement there
> > c) Fix the value of bs->growable for all drivers
> 
> Let's start from (c).  bdrv_file_open sets bs->growable = 1.  I think it
> should be removed and only the file protocol should set it.

This is probably right.

> Then we can add bdrv_revalidate and, for block_resize, call
> bdrv_revalidate+bdrv_truncate.  For bs->growable = 0 &&
> !bs->drv->bdrv_truncate, bdrv_truncate can just check that the actual
> size is the same or bigger as the one requested, and fail otherwise.

This one not so much. bs->growable does not mean that you can use
bdrv_truncate. It rather means that you may write beyond the end of the
file even without truncating it first. Mabye bs->auto_grow would be a
better for it.

So bs->growable == true implies that bdrv_truncate() should be allowed
as well, because obviously changing the BDS size is possbile, but it's
not true the other way round.

Kevin

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH 1/7] block: only force IO completion in .bdrv_truncate if we are shrinking
  2013-03-06 20:39               ` Paolo Bonzini
  2013-03-07  8:50                 ` Kevin Wolf
@ 2013-03-07  8:53                 ` Peter Lieven
  2013-03-07  8:59                   ` Kevin Wolf
  2013-03-07 16:09                   ` Paolo Bonzini
  1 sibling, 2 replies; 49+ messages in thread
From: Peter Lieven @ 2013-03-07  8:53 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kwolf, sw, Jeff Cody, qemu-devel, stefanha

On 06.03.2013 21:39, Paolo Bonzini wrote:
> Il 06/03/2013 20:03, Peter Lieven ha scritto:
>> Am 06.03.2013 19:48, schrieb Jeff Cody:
>>> On Wed, Mar 06, 2013 at 07:31:51PM +0100, Paolo Bonzini wrote:
>>>> Il 06/03/2013 19:14, Jeff Cody ha scritto:
>>>>> QCOW breaks with it using a normal raw posix file as a device.  As a
>>>>> test: qemu-img create -f qcow test.qcow 5G.  Now run qemu with that
>>>>> drive mounted, and try to partition and format it.  QEMU now asserts.
>>>>>
>>>>> The nicety of being able to using truncate during a write call,
>>>>> especially for VHDX (which can have relatively large block/cluster
>>>>> sizes), so to grow the file sparsely in a dynamically allocated file.
>>>>
>>>> Perhaps we need two APIs, "truncate" and "revalidate".
>>>>
>>>> Truncate should be a no-op if (!bs->growable).
>>>>
>>>> Revalidate could be called by the block_resize monitor command with no
>>>> size specified.
>>>>
>>>> Paolo
>>>
>>> I think that is a good solution.  Is it better to have "truncate" and
>>> "revalidate", or "truncate" and "grow", with grow being a subset of
>>> truncate, with fewer restrictions?  There may still be operations
>>> where it is OK to grow a file, but not OK to shrink it.
>>
>> Or as a first step:
>>
>> a) Call brdv_drain_all() only if the device is shrinked (independently of !bs->growable)
>> b) Call brdv_drain_all() inside iscsi_truncate() because it is a special requirement there
>> c) Fix the value of bs->growable for all drivers
>
> Let's start from (c).  bdrv_file_open sets bs->growable = 1.  I think it
> should be removed and only the file protocol should set it.
>
> Then we can add bdrv_revalidate and, for block_resize, call
> bdrv_revalidate+bdrv_truncate.  For bs->growable = 0 &&
> !bs->drv->bdrv_truncate, bdrv_truncate can just check that the actual
> size is the same or bigger as the one requested, and fail otherwise.
>
> Paolo
>

Regarding brd_drain_all(). Is the fix right to call it only on device shrink?
In this case it has to be added to iscsi_truncate as well.

Peter

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH 1/7] block: only force IO completion in .bdrv_truncate if we are shrinking
  2013-03-07  8:50                 ` Kevin Wolf
@ 2013-03-07  8:56                   ` Peter Lieven
  2013-03-07  9:03                     ` Kevin Wolf
  2013-03-07 16:45                   ` Paolo Bonzini
  1 sibling, 1 reply; 49+ messages in thread
From: Peter Lieven @ 2013-03-07  8:56 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Paolo Bonzini, Jeff Cody, qemu-devel, stefanha, sw

On 07.03.2013 09:50, Kevin Wolf wrote:
> Am 06.03.2013 um 21:39 hat Paolo Bonzini geschrieben:
>> Il 06/03/2013 20:03, Peter Lieven ha scritto:
>>> Am 06.03.2013 19:48, schrieb Jeff Cody:
>>>> On Wed, Mar 06, 2013 at 07:31:51PM +0100, Paolo Bonzini wrote:
>>>>> Il 06/03/2013 19:14, Jeff Cody ha scritto:
>>>>>> QCOW breaks with it using a normal raw posix file as a device.  As a
>>>>>> test: qemu-img create -f qcow test.qcow 5G.  Now run qemu with that
>>>>>> drive mounted, and try to partition and format it.  QEMU now asserts.
>>>>>>
>>>>>> The nicety of being able to using truncate during a write call,
>>>>>> especially for VHDX (which can have relatively large block/cluster
>>>>>> sizes), so to grow the file sparsely in a dynamically allocated file.
>>>>>
>>>>> Perhaps we need two APIs, "truncate" and "revalidate".
>>>>>
>>>>> Truncate should be a no-op if (!bs->growable).
>>>>>
>>>>> Revalidate could be called by the block_resize monitor command with no
>>>>> size specified.
>>>>>
>>>>> Paolo
>>>>
>>>> I think that is a good solution.  Is it better to have "truncate" and
>>>> "revalidate", or "truncate" and "grow", with grow being a subset of
>>>> truncate, with fewer restrictions?  There may still be operations
>>>> where it is OK to grow a file, but not OK to shrink it.
>
> What semantics would the both operations have? Is truncate the same as
> it used to be? I don't really understand what "revalidate" would do, it
> sounds like a read-only operation from its name?
>
>>> Or as a first step:
>>>
>>> a) Call brdv_drain_all() only if the device is shrinked (independently of !bs->growable)
>>> b) Call brdv_drain_all() inside iscsi_truncate() because it is a special requirement there
>>> c) Fix the value of bs->growable for all drivers
>>
>> Let's start from (c).  bdrv_file_open sets bs->growable = 1.  I think it
>> should be removed and only the file protocol should set it.
>
> This is probably right.

If bs->growable is 1 for all drivers, whats the fix status of CVE-2008-0928? This
flag was introduced as a fix for this problem.

bdrv_check_byte_request() does nothing useful if bs->growable is 1.

Peter

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH 1/7] block: only force IO completion in .bdrv_truncate if we are shrinking
  2013-03-06 18:27       ` Peter Lieven
@ 2013-03-07  8:57         ` Kevin Wolf
  0 siblings, 0 replies; 49+ messages in thread
From: Kevin Wolf @ 2013-03-07  8:57 UTC (permalink / raw)
  To: Peter Lieven; +Cc: Paolo Bonzini, Jeff Cody, qemu-devel, stefanha, sw

Am 06.03.2013 um 19:27 hat Peter Lieven geschrieben:
> Am 06.03.2013 19:06, schrieb Paolo Bonzini:
> > Il 06/03/2013 18:50, Peter Lieven ha scritto:
> >>>> Commit 9a665b2b made bdrv_truncate() call bdrv_drain_all(), but this breaks
> >>>> QCOW images, as well other future image formats (such as VHDX) that may call
> >>>> bdrv_truncate(bs->file) from within a read/write operation.  For example, QCOW
> >>>> will cause an assert, due to tracked_requests not being empty (since the
> >>>> read/write that called bdrv_truncate() is still in progress).
> > 
> > I'm not sure such bdrv_truncate calls are necessary.  QCOW2 doesn't have
> > them (almost; there is one in qcow2_write_compressed, I'm not even sure
> > that one is necessary though), and I think QCOW's breaks using it with a
> > block device as a backing file.
> 
> I think we have to check the sense of bs->growable nevertheless. It is set
> to 1 for all drivers which can't be right?!

For everything that goes through bdrv_file_open(), which are protocol
drivers, not format drivers. It is required for files so that formats
can write past EOF; for other drivers that can't actually grow their
backing storage it doesn't hurt because they will return an eror anyway
when you write to it.

"bdrv_truncate() works" and "bs->growable == true" are totally different
things, so even though it doesn't look right at the first sight,
bs->growable does its job correctly.

In your other mail you're talking about setting it for raw, qcow2, VHDX.
This would be discussing on the entirely wrong level, this isn't about
formats, but about protocols (file, host_device, nbd, iscsi, http,
vvfat...), that are below the format drivers.

Kevin

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH 1/7] block: only force IO completion in .bdrv_truncate if we are shrinking
  2013-03-07  8:53                 ` Peter Lieven
@ 2013-03-07  8:59                   ` Kevin Wolf
  2013-03-07 16:09                   ` Paolo Bonzini
  1 sibling, 0 replies; 49+ messages in thread
From: Kevin Wolf @ 2013-03-07  8:59 UTC (permalink / raw)
  To: Peter Lieven; +Cc: Paolo Bonzini, Jeff Cody, qemu-devel, stefanha, sw

Am 07.03.2013 um 09:53 hat Peter Lieven geschrieben:
> On 06.03.2013 21:39, Paolo Bonzini wrote:
> >Il 06/03/2013 20:03, Peter Lieven ha scritto:
> >>Am 06.03.2013 19:48, schrieb Jeff Cody:
> >>>On Wed, Mar 06, 2013 at 07:31:51PM +0100, Paolo Bonzini wrote:
> >>>>Il 06/03/2013 19:14, Jeff Cody ha scritto:
> >>>>>QCOW breaks with it using a normal raw posix file as a device.  As a
> >>>>>test: qemu-img create -f qcow test.qcow 5G.  Now run qemu with that
> >>>>>drive mounted, and try to partition and format it.  QEMU now asserts.
> >>>>>
> >>>>>The nicety of being able to using truncate during a write call,
> >>>>>especially for VHDX (which can have relatively large block/cluster
> >>>>>sizes), so to grow the file sparsely in a dynamically allocated file.
> >>>>
> >>>>Perhaps we need two APIs, "truncate" and "revalidate".
> >>>>
> >>>>Truncate should be a no-op if (!bs->growable).
> >>>>
> >>>>Revalidate could be called by the block_resize monitor command with no
> >>>>size specified.
> >>>>
> >>>>Paolo
> >>>
> >>>I think that is a good solution.  Is it better to have "truncate" and
> >>>"revalidate", or "truncate" and "grow", with grow being a subset of
> >>>truncate, with fewer restrictions?  There may still be operations
> >>>where it is OK to grow a file, but not OK to shrink it.
> >>
> >>Or as a first step:
> >>
> >>a) Call brdv_drain_all() only if the device is shrinked (independently of !bs->growable)
> >>b) Call brdv_drain_all() inside iscsi_truncate() because it is a special requirement there
> >>c) Fix the value of bs->growable for all drivers
> >
> >Let's start from (c).  bdrv_file_open sets bs->growable = 1.  I think it
> >should be removed and only the file protocol should set it.
> >
> >Then we can add bdrv_revalidate and, for block_resize, call
> >bdrv_revalidate+bdrv_truncate.  For bs->growable = 0 &&
> >!bs->drv->bdrv_truncate, bdrv_truncate can just check that the actual
> >size is the same or bigger as the one requested, and fail otherwise.
> >
> >Paolo
> >
> 
> Regarding brd_drain_all(). Is the fix right to call it only on device shrink?
> In this case it has to be added to iscsi_truncate as well.

The real fix would bdrv_drain(bs). I hope we're not too far away from
that today, even though we're not quite there.

Kevin

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH 1/7] block: only force IO completion in .bdrv_truncate if we are shrinking
  2013-03-07  8:56                   ` Peter Lieven
@ 2013-03-07  9:03                     ` Kevin Wolf
  2013-03-07  9:16                       ` Peter Lieven
  0 siblings, 1 reply; 49+ messages in thread
From: Kevin Wolf @ 2013-03-07  9:03 UTC (permalink / raw)
  To: Peter Lieven; +Cc: Paolo Bonzini, Jeff Cody, qemu-devel, stefanha, sw

Am 07.03.2013 um 09:56 hat Peter Lieven geschrieben:
> On 07.03.2013 09:50, Kevin Wolf wrote:
> >Am 06.03.2013 um 21:39 hat Paolo Bonzini geschrieben:
> >>Il 06/03/2013 20:03, Peter Lieven ha scritto:
> >>>Am 06.03.2013 19:48, schrieb Jeff Cody:
> >>>>On Wed, Mar 06, 2013 at 07:31:51PM +0100, Paolo Bonzini wrote:
> >>>>>Il 06/03/2013 19:14, Jeff Cody ha scritto:
> >>>>>>QCOW breaks with it using a normal raw posix file as a device.  As a
> >>>>>>test: qemu-img create -f qcow test.qcow 5G.  Now run qemu with that
> >>>>>>drive mounted, and try to partition and format it.  QEMU now asserts.
> >>>>>>
> >>>>>>The nicety of being able to using truncate during a write call,
> >>>>>>especially for VHDX (which can have relatively large block/cluster
> >>>>>>sizes), so to grow the file sparsely in a dynamically allocated file.
> >>>>>
> >>>>>Perhaps we need two APIs, "truncate" and "revalidate".
> >>>>>
> >>>>>Truncate should be a no-op if (!bs->growable).
> >>>>>
> >>>>>Revalidate could be called by the block_resize monitor command with no
> >>>>>size specified.
> >>>>>
> >>>>>Paolo
> >>>>
> >>>>I think that is a good solution.  Is it better to have "truncate" and
> >>>>"revalidate", or "truncate" and "grow", with grow being a subset of
> >>>>truncate, with fewer restrictions?  There may still be operations
> >>>>where it is OK to grow a file, but not OK to shrink it.
> >
> >What semantics would the both operations have? Is truncate the same as
> >it used to be? I don't really understand what "revalidate" would do, it
> >sounds like a read-only operation from its name?
> >
> >>>Or as a first step:
> >>>
> >>>a) Call brdv_drain_all() only if the device is shrinked (independently of !bs->growable)
> >>>b) Call brdv_drain_all() inside iscsi_truncate() because it is a special requirement there
> >>>c) Fix the value of bs->growable for all drivers
> >>
> >>Let's start from (c).  bdrv_file_open sets bs->growable = 1.  I think it
> >>should be removed and only the file protocol should set it.
> >
> >This is probably right.
> 
> If bs->growable is 1 for all drivers, whats the fix status of CVE-2008-0928? This
> flag was introduced as a fix for this problem.
> 
> bdrv_check_byte_request() does nothing useful if bs->growable is 1.

Don't ignore the difference between bdrv_open() and bdrv_file_open().
Typically you have two BDSes: On top there is e.g. a qcow2 BDS that is
opened through bdrv_open() and has bs->growable = false. Its bs->file is
using the file protocol (raw-posix driver) and opened by
bdrv_file_open(). This one has bs->file->growable = true so that qcow2
can write to newly allocated areas without calling bdrv_truncate()
first.

Kevin

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH 1/7] block: only force IO completion in .bdrv_truncate if we are shrinking
  2013-03-07  9:03                     ` Kevin Wolf
@ 2013-03-07  9:16                       ` Peter Lieven
  2013-03-07  9:22                         ` Kevin Wolf
  0 siblings, 1 reply; 49+ messages in thread
From: Peter Lieven @ 2013-03-07  9:16 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Paolo Bonzini, Jeff Cody, qemu-devel, stefanha, sw


Am 07.03.2013 um 10:03 schrieb Kevin Wolf <kwolf@redhat.com>:

> Am 07.03.2013 um 09:56 hat Peter Lieven geschrieben:
>> On 07.03.2013 09:50, Kevin Wolf wrote:
>>> Am 06.03.2013 um 21:39 hat Paolo Bonzini geschrieben:
>>>> Il 06/03/2013 20:03, Peter Lieven ha scritto:
>>>>> Am 06.03.2013 19:48, schrieb Jeff Cody:
>>>>>> On Wed, Mar 06, 2013 at 07:31:51PM +0100, Paolo Bonzini wrote:
>>>>>>> Il 06/03/2013 19:14, Jeff Cody ha scritto:
>>>>>>>> QCOW breaks with it using a normal raw posix file as a device.  As a
>>>>>>>> test: qemu-img create -f qcow test.qcow 5G.  Now run qemu with that
>>>>>>>> drive mounted, and try to partition and format it.  QEMU now asserts.
>>>>>>>> 
>>>>>>>> The nicety of being able to using truncate during a write call,
>>>>>>>> especially for VHDX (which can have relatively large block/cluster
>>>>>>>> sizes), so to grow the file sparsely in a dynamically allocated file.
>>>>>>> 
>>>>>>> Perhaps we need two APIs, "truncate" and "revalidate".
>>>>>>> 
>>>>>>> Truncate should be a no-op if (!bs->growable).
>>>>>>> 
>>>>>>> Revalidate could be called by the block_resize monitor command with no
>>>>>>> size specified.
>>>>>>> 
>>>>>>> Paolo
>>>>>> 
>>>>>> I think that is a good solution.  Is it better to have "truncate" and
>>>>>> "revalidate", or "truncate" and "grow", with grow being a subset of
>>>>>> truncate, with fewer restrictions?  There may still be operations
>>>>>> where it is OK to grow a file, but not OK to shrink it.
>>> 
>>> What semantics would the both operations have? Is truncate the same as
>>> it used to be? I don't really understand what "revalidate" would do, it
>>> sounds like a read-only operation from its name?
>>> 
>>>>> Or as a first step:
>>>>> 
>>>>> a) Call brdv_drain_all() only if the device is shrinked (independently of !bs->growable)
>>>>> b) Call brdv_drain_all() inside iscsi_truncate() because it is a special requirement there
>>>>> c) Fix the value of bs->growable for all drivers
>>>> 
>>>> Let's start from (c).  bdrv_file_open sets bs->growable = 1.  I think it
>>>> should be removed and only the file protocol should set it.
>>> 
>>> This is probably right.
>> 
>> If bs->growable is 1 for all drivers, whats the fix status of CVE-2008-0928? This
>> flag was introduced as a fix for this problem.
>> 
>> bdrv_check_byte_request() does nothing useful if bs->growable is 1.
> 
> Don't ignore the difference between bdrv_open() and bdrv_file_open().
> Typically you have two BDSes: On top there is e.g. a qcow2 BDS that is
> opened through bdrv_open() and has bs->growable = false. Its bs->file is
> using the file protocol (raw-posix driver) and opened by
> bdrv_file_open(). This one has bs->file->growable = true so that qcow2
> can write to newly allocated areas without calling bdrv_truncate()
> first.

Sorry, I have to admin I am little confused by what is happening in bdrv_open().

However, what I can say is that bs->growable is 1 for an iSCSI backed
harddrive and I wonder how this can happen if bdrv_file_open is not used for
opening it because that is the only place where bs->growable is set to 1.

cmdline:
x86_64-softmmu/qemu-system-x86_64 -k de -enable-kvm -m 1024 -drive if=virtio,file=iscsi://172.21.200.31/iqn.2001-05.com.equallogic:0-8a0906-16470e107-713001aa6de511e0-001-test/0 -vnc :1 -boot dc -monitor stdio

Peter

> 
> Kevin

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH 1/7] block: only force IO completion in .bdrv_truncate if we are shrinking
  2013-03-07  9:16                       ` Peter Lieven
@ 2013-03-07  9:22                         ` Kevin Wolf
  2013-03-07  9:25                           ` Peter Lieven
  0 siblings, 1 reply; 49+ messages in thread
From: Kevin Wolf @ 2013-03-07  9:22 UTC (permalink / raw)
  To: Peter Lieven; +Cc: Paolo Bonzini, Jeff Cody, qemu-devel, stefanha, sw

Am 07.03.2013 um 10:16 hat Peter Lieven geschrieben:
> >> If bs->growable is 1 for all drivers, whats the fix status of CVE-2008-0928? This
> >> flag was introduced as a fix for this problem.
> >> 
> >> bdrv_check_byte_request() does nothing useful if bs->growable is 1.
> > 
> > Don't ignore the difference between bdrv_open() and bdrv_file_open().
> > Typically you have two BDSes: On top there is e.g. a qcow2 BDS that is
> > opened through bdrv_open() and has bs->growable = false. Its bs->file is
> > using the file protocol (raw-posix driver) and opened by
> > bdrv_file_open(). This one has bs->file->growable = true so that qcow2
> > can write to newly allocated areas without calling bdrv_truncate()
> > first.
> 
> Sorry, I have to admin I am little confused by what is happening in bdrv_open().
> 
> However, what I can say is that bs->growable is 1 for an iSCSI backed
> harddrive and I wonder how this can happen if bdrv_file_open is not used for
> opening it because that is the only place where bs->growable is set to 1.
> 
> cmdline:
> x86_64-softmmu/qemu-system-x86_64 -k de -enable-kvm -m 1024 -drive if=virtio,file=iscsi://172.21.200.31/iqn.2001-05.com.equallogic:0-8a0906-16470e107-713001aa6de511e0-001-test/0 -vnc :1 -boot dc -monitor stdio

It is used for the iscsi driver. You have a raw BDS (growable == false)
on top of an iscsi one (growable == true).

Kevin

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH 1/7] block: only force IO completion in .bdrv_truncate if we are shrinking
  2013-03-07  9:22                         ` Kevin Wolf
@ 2013-03-07  9:25                           ` Peter Lieven
  2013-03-07 10:00                             ` Kevin Wolf
  0 siblings, 1 reply; 49+ messages in thread
From: Peter Lieven @ 2013-03-07  9:25 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Paolo Bonzini, Jeff Cody, qemu-devel, stefanha, sw


Am 07.03.2013 um 10:22 schrieb Kevin Wolf <kwolf@redhat.com>:

> Am 07.03.2013 um 10:16 hat Peter Lieven geschrieben:
>>>> If bs->growable is 1 for all drivers, whats the fix status of CVE-2008-0928? This
>>>> flag was introduced as a fix for this problem.
>>>> 
>>>> bdrv_check_byte_request() does nothing useful if bs->growable is 1.
>>> 
>>> Don't ignore the difference between bdrv_open() and bdrv_file_open().
>>> Typically you have two BDSes: On top there is e.g. a qcow2 BDS that is
>>> opened through bdrv_open() and has bs->growable = false. Its bs->file is
>>> using the file protocol (raw-posix driver) and opened by
>>> bdrv_file_open(). This one has bs->file->growable = true so that qcow2
>>> can write to newly allocated areas without calling bdrv_truncate()
>>> first.
>> 
>> Sorry, I have to admin I am little confused by what is happening in bdrv_open().
>> 
>> However, what I can say is that bs->growable is 1 for an iSCSI backed
>> harddrive and I wonder how this can happen if bdrv_file_open is not used for
>> opening it because that is the only place where bs->growable is set to 1.
>> 
>> cmdline:
>> x86_64-softmmu/qemu-system-x86_64 -k de -enable-kvm -m 1024 -drive if=virtio,file=iscsi://172.21.200.31/iqn.2001-05.com.equallogic:0-8a0906-16470e107-713001aa6de511e0-001-test/0 -vnc :1 -boot dc -monitor stdio
> 
> It is used for the iscsi driver. You have a raw BDS (growable == false)
> on top of an iscsi one (growable == true).

Ok, but growable == true is wrong for the iSCSI driver isn`t it?

Peter

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH 1/7] block: only force IO completion in .bdrv_truncate if we are shrinking
  2013-03-07  9:25                           ` Peter Lieven
@ 2013-03-07 10:00                             ` Kevin Wolf
  2013-03-07 10:22                               ` Peter Lieven
  0 siblings, 1 reply; 49+ messages in thread
From: Kevin Wolf @ 2013-03-07 10:00 UTC (permalink / raw)
  To: Peter Lieven; +Cc: Paolo Bonzini, Jeff Cody, qemu-devel, stefanha, sw

Am 07.03.2013 um 10:25 hat Peter Lieven geschrieben:
> 
> Am 07.03.2013 um 10:22 schrieb Kevin Wolf <kwolf@redhat.com>:
> 
> > Am 07.03.2013 um 10:16 hat Peter Lieven geschrieben:
> >>>> If bs->growable is 1 for all drivers, whats the fix status of CVE-2008-0928? This
> >>>> flag was introduced as a fix for this problem.
> >>>> 
> >>>> bdrv_check_byte_request() does nothing useful if bs->growable is 1.
> >>> 
> >>> Don't ignore the difference between bdrv_open() and bdrv_file_open().
> >>> Typically you have two BDSes: On top there is e.g. a qcow2 BDS that is
> >>> opened through bdrv_open() and has bs->growable = false. Its bs->file is
> >>> using the file protocol (raw-posix driver) and opened by
> >>> bdrv_file_open(). This one has bs->file->growable = true so that qcow2
> >>> can write to newly allocated areas without calling bdrv_truncate()
> >>> first.
> >> 
> >> Sorry, I have to admin I am little confused by what is happening in bdrv_open().
> >> 
> >> However, what I can say is that bs->growable is 1 for an iSCSI backed
> >> harddrive and I wonder how this can happen if bdrv_file_open is not used for
> >> opening it because that is the only place where bs->growable is set to 1.
> >> 
> >> cmdline:
> >> x86_64-softmmu/qemu-system-x86_64 -k de -enable-kvm -m 1024 -drive if=virtio,file=iscsi://172.21.200.31/iqn.2001-05.com.equallogic:0-8a0906-16470e107-713001aa6de511e0-001-test/0 -vnc :1 -boot dc -monitor stdio
> > 
> > It is used for the iscsi driver. You have a raw BDS (growable == false)
> > on top of an iscsi one (growable == true).
> 
> Ok, but growable == true is wrong for the iSCSI driver isn`t it?

I guess it depends on your definition. If you say that growable includes
the capability of growing the image, then yes, it's wrong. If you only
interpret it as permission to write beyond EOF (if the driver supports
that), then it's right even though any such attempt will result in an
error.

Practically speaking, the difference is between -EIO returned from
bdrv_check_byte_request() and -EIO from iscsi_aio_write16_cb(), i.e.
the result is the same.

Kevin

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH 1/7] block: only force IO completion in .bdrv_truncate if we are shrinking
  2013-03-07 10:00                             ` Kevin Wolf
@ 2013-03-07 10:22                               ` Peter Lieven
  0 siblings, 0 replies; 49+ messages in thread
From: Peter Lieven @ 2013-03-07 10:22 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Paolo Bonzini, Jeff Cody, qemu-devel, stefanha, sw


Am 07.03.2013 um 11:00 schrieb Kevin Wolf <kwolf@redhat.com>:

> Am 07.03.2013 um 10:25 hat Peter Lieven geschrieben:
>> 
>> Am 07.03.2013 um 10:22 schrieb Kevin Wolf <kwolf@redhat.com>:
>> 
>>> Am 07.03.2013 um 10:16 hat Peter Lieven geschrieben:
>>>>>> If bs->growable is 1 for all drivers, whats the fix status of CVE-2008-0928? This
>>>>>> flag was introduced as a fix for this problem.
>>>>>> 
>>>>>> bdrv_check_byte_request() does nothing useful if bs->growable is 1.
>>>>> 
>>>>> Don't ignore the difference between bdrv_open() and bdrv_file_open().
>>>>> Typically you have two BDSes: On top there is e.g. a qcow2 BDS that is
>>>>> opened through bdrv_open() and has bs->growable = false. Its bs->file is
>>>>> using the file protocol (raw-posix driver) and opened by
>>>>> bdrv_file_open(). This one has bs->file->growable = true so that qcow2
>>>>> can write to newly allocated areas without calling bdrv_truncate()
>>>>> first.
>>>> 
>>>> Sorry, I have to admin I am little confused by what is happening in bdrv_open().
>>>> 
>>>> However, what I can say is that bs->growable is 1 for an iSCSI backed
>>>> harddrive and I wonder how this can happen if bdrv_file_open is not used for
>>>> opening it because that is the only place where bs->growable is set to 1.
>>>> 
>>>> cmdline:
>>>> x86_64-softmmu/qemu-system-x86_64 -k de -enable-kvm -m 1024 -drive if=virtio,file=iscsi://172.21.200.31/iqn.2001-05.com.equallogic:0-8a0906-16470e107-713001aa6de511e0-001-test/0 -vnc :1 -boot dc -monitor stdio
>>> 
>>> It is used for the iscsi driver. You have a raw BDS (growable == false)
>>> on top of an iscsi one (growable == true).
>> 
>> Ok, but growable == true is wrong for the iSCSI driver isn`t it?
> 
> I guess it depends on your definition. If you say that growable includes
> the capability of growing the image, then yes, it's wrong. If you only
> interpret it as permission to write beyond EOF (if the driver supports
> that), then it's right even though any such attempt will result in an
> error.
> 
> Practically speaking, the difference is between -EIO returned from
> bdrv_check_byte_request() and -EIO from iscsi_aio_write16_cb(), i.e.
> the result is the same.

Yes, but there are many broken storage implementations outside which might react
freaky if there is a read/write beyond the LUN size or with negative offset.
The check_request would block such requests earlier protecting the infrastructure.
I have a case open with the vendor of a storage system where I am able to crash
the storage sending illegal requests. I personally would feel more comfortable
if illegal requests where just not possible.

Peter

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH 3/7] block: vhdx header for the QEMU support of VHDX images
  2013-03-06 14:47 ` [Qemu-devel] [PATCH 3/7] block: vhdx header for the QEMU support of VHDX images Jeff Cody
@ 2013-03-07 13:15   ` Stefan Hajnoczi
  2013-03-07 13:43     ` Jeff Cody
  0 siblings, 1 reply; 49+ messages in thread
From: Stefan Hajnoczi @ 2013-03-07 13:15 UTC (permalink / raw)
  To: Jeff Cody; +Cc: kwolf, sw, qemu-devel

On Wed, Mar 06, 2013 at 09:47:41AM -0500, Jeff Cody wrote:
> +#define VHDX_ZERO_MGIC 0x6F72657A /* 'zero' */

s/MGIC/MAGIC/

> +typedef struct QEMU_PACKED vhdx_log_zero_descriptor {
> +    uint32_t    zero_signature;         /* "zero" in ASCII */
> +    uint32_t    reserver;

s/reserver/reserved/ ?

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH 3/7] block: vhdx header for the QEMU support of VHDX images
  2013-03-07 13:15   ` Stefan Hajnoczi
@ 2013-03-07 13:43     ` Jeff Cody
  0 siblings, 0 replies; 49+ messages in thread
From: Jeff Cody @ 2013-03-07 13:43 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: kwolf, sw, qemu-devel

On Thu, Mar 07, 2013 at 02:15:42PM +0100, Stefan Hajnoczi wrote:
> On Wed, Mar 06, 2013 at 09:47:41AM -0500, Jeff Cody wrote:
> > +#define VHDX_ZERO_MGIC 0x6F72657A /* 'zero' */
> 
> s/MGIC/MAGIC/
> 
> > +typedef struct QEMU_PACKED vhdx_log_zero_descriptor {
> > +    uint32_t    zero_signature;         /* "zero" in ASCII */
> > +    uint32_t    reserver;
> 
> s/reserver/reserved/ ?

Yes, thanks (on both counts).

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH 4/7] block: initial VHDX driver support framework - supports open and probe
  2013-03-06 14:48 ` [Qemu-devel] [PATCH 4/7] block: initial VHDX driver support framework - supports open and probe Jeff Cody
@ 2013-03-07 14:30   ` Stefan Hajnoczi
  2013-03-07 15:23     ` Jeff Cody
  0 siblings, 1 reply; 49+ messages in thread
From: Stefan Hajnoczi @ 2013-03-07 14:30 UTC (permalink / raw)
  To: Jeff Cody; +Cc: kwolf, sw, qemu-devel

On Wed, Mar 06, 2013 at 09:48:11AM -0500, Jeff Cody wrote:
> +#define leguid_to_cpus(guid) do { \
> +    le32_to_cpus(&(guid)->data1); \
> +    le16_to_cpus(&(guid)->data2); \
> +    le16_to_cpus(&(guid)->data3); } while (0)

This should be a function.  Please avoid macros.

> +static const ms_guid logical_sector_guid = {.data1 = 0x8141bf1d,
> +                                            .data2 = 0xa96f,
> +                                            .data3 = 0x4709,
> +                                           .data4 = { 0xba, 0x47, 0xf2, 0x33,

Indentation

> +                                                      0xa8, 0xfa, 0xab, 0x5f} };
> +
> +static const ms_guid phys_sector_guid =  { .data1 = 0xcda348c7,
> +                                           .data2 = 0x445d,
> +                                           .data3 = 0x4471,
> +                                           .data4 = { 0x9c, 0xc9, 0xe9, 0x88,
> +                                                      0x52, 0x51, 0xc5, 0x56} };
> +
> +static const ms_guid parent_locator_guid = {.data1 = 0xa8d35f2d,
> +                                            .data2 = 0xb30b,
> +                                            .data3 = 0x454d,
> +                                           .data4 = { 0xab, 0xf7, 0xd3, 0xd8,

Indentation

> +typedef struct BDRVVHDXState {
> +    CoMutex lock;
> +
> +    int curr_header;
> +    vhdx_header *headers[2];
> +
> +    vhdx_region_table_header rt;
> +    vhdx_region_table_entry bat_rt;         /* region table for the BAT */
> +    vhdx_region_table_entry metadata_rt;    /* region table for the metadata */
> +    vhdx_region_table_entry *unknown_rt;
> +    unsigned int unknown_rt_size;

This field isn't used yet but "unsigned int" for something that has
"size" in its name is suspicious.  Should it be size_t (memory) or
uint64_t (disk)?

> +static int vhdx_probe(const uint8_t *buf, int buf_size, const char *filename)
> +{
> +    if (buf_size >= 8 && !strncmp((char *)buf, "vhdxfile", 8)) {

memcmp() saves you the trouble of casting buf (which should stay const,
BTW).

> +static int vhdx_parse_metadata(BlockDriverState *bs, BDRVVHDXState *s)
> +{
> +    int ret = 0;
> +    uint8_t *buffer;
> +    int offset = 0;
> +    int i = 0;
> +    uint32_t block_size, sectors_per_block, logical_sector_size;
> +    uint64_t chunk_ratio;
> +    vhdx_metadata_table_entry md_entry;
> +
> +    buffer = g_malloc(VHDX_METADATA_TABLE_MAX_SIZE);

Please use qemu_blockalign() to avoid bounce buffers in case the memory
is not 512-byte aligned.  Use qemu_vfree() instead of g_free().

This applies to all I/O buffers that the block driver allocates.

> +
> +    ret = bdrv_pread(bs->file, s->metadata_rt.file_offset, buffer,
> +                     VHDX_METADATA_TABLE_MAX_SIZE);
> +    if (ret < 0) {
> +        goto fail_no_free;
> +    }
> +    memcpy(&s->metadata_hdr, buffer, sizeof(s->metadata_hdr));
> +    offset += sizeof(s->metadata_hdr);
> +
> +    le64_to_cpus(&s->metadata_hdr.signature);
> +    le16_to_cpus(&s->metadata_hdr.reserved);
> +    le16_to_cpus(&s->metadata_hdr.entry_count);
> +
> +    if (s->metadata_hdr.signature != VHDX_METADATA_MAGIC) {
> +        ret = -EINVAL;
> +        goto fail_no_free;
> +    }
> +
> +    s->metadata_entries.present = 0;
> +
> +    for (i = 0; i < s->metadata_hdr.entry_count; i++) {
> +        memcpy(&md_entry, buffer+offset, sizeof(md_entry));

Read beyond end of buffer if metadata_hdr.entry_count is wrong.

We cannot trust the image file - it can be corrupt or malicious.  Checks
are required for everything that can be validated.  QEMU cannot do
anything that would cause a crash or memory corruption on invalid input.

> +    ret = bdrv_pread(bs->file,
> +                     s->metadata_entries.file_parameters_entry.offset
> +                                         + s->metadata_rt.file_offset,
> +                     &s->params,
> +                     sizeof(s->params));

Missing error code path when ret < 0.

> +    /* determine virtual disk size, logical sector size,
> +     * and phys sector size */
> +
> +    ret = bdrv_pread(bs->file,
> +                     s->metadata_entries.virtual_disk_size_entry.offset
> +                                           + s->metadata_rt.file_offset,
> +                     &s->virtual_disk_size,
> +                     sizeof(uint64_t));
> +    if (ret < 0) {
> +        goto exit;
> +    }
> +    ret = bdrv_pread(bs->file,
> +                     s->metadata_entries.logical_sector_size_entry.offset
> +                                             + s->metadata_rt.file_offset,
> +                     &s->logical_sector_size,
> +                     sizeof(uint32_t));
> +    if (ret < 0) {
> +        goto exit;
> +    }
> +    ret = bdrv_pread(bs->file,
> +                     s->metadata_entries.phys_sector_size_entry.offset
> +                                          + s->metadata_rt.file_offset,
> +                     &s->physical_sector_size,
> +                     sizeof(uint32_t));
> +    if (ret < 0) {
> +        goto exit;
> +    }
> +
> +    le64_to_cpus(&s->virtual_disk_size);
> +    le32_to_cpus(&s->logical_sector_size);
> +    le32_to_cpus(&s->physical_sector_size);
> +
> +    /* both block_size and sector_size are guaranteed powers of 2 */
> +    s->sectors_per_block = s->params.block_size / s->logical_sector_size;

Divide-by-zero if file is corrupt/malicious.

> +    s->chunk_ratio = (VHDX_MAX_SECTORS_PER_BLOCK) *
> +                     (uint64_t)s->logical_sector_size /
> +                     (uint64_t)s->params.block_size;

Divide-by-zero if file is corrupt/malicious.

> +
> +
> +    /* These values are ones we will want to use for division / multiplication
> +     * later on, and they are all guaranteed (per the spec) to be powers of 2,
> +     * so we can take advantage of that for shift operations during
> +     * reads/writes */

Before doing that, let's verify that they are powers of two and fail if
they violate the spec.

if (x & x - 1) {
    ret = -EINVAL;
    goto exit;
}

> +static int vhdx_open(BlockDriverState *bs, int flags)
> +{
> +    BDRVVHDXState *s = bs->opaque;
> +    int ret = 0;
> +    int i;
> +
> +    s->bat = NULL;
> +
> +    qemu_co_mutex_init(&s->lock);
> +
> +    ret = vhdx_parse_header(bs, s);
> +    if (ret) {
> +        goto fail;
> +    }
> +
> +    ret = vhdx_parse_log(bs, s);
> +    if (ret) {
> +        goto fail;
> +    }
> +
> +    ret = vhdx_open_region_tables(bs, s);
> +    if (ret) {
> +        goto fail;
> +    }
> +
> +    ret = vhdx_parse_metadata(bs, s);
> +    if (ret) {
> +        goto fail;
> +    }
> +    s->block_size = s->params.block_size;
> +
> +    /* the VHDX spec dictates that virtual_disk_size is always a multiple of
> +     * logical_sector_size */
> +    bs->total_sectors = s->virtual_disk_size / s->logical_sector_size;
> +
> +    s->bat_offset = s->bat_rt.file_offset;
> +    s->bat_entries = s->bat_rt.length / sizeof(vhdx_bat_entry);
> +    s->bat = g_malloc(s->bat_rt.length);
> +
> +    ret = bdrv_pread(bs->file, s->bat_offset, s->bat, s->bat_rt.length);
> +
> +    for (i = 0; i < s->bat_entries; i++) {
> +        le64_to_cpus(&s->bat[i]);
> +    }

How does BAT size scale when the image size is increased?  QCOW2 and QED
use caches for metadata that would be too large or wasteful to keep in
memory.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH 6/7] block: add header update capability for VHDX images.
  2013-03-06 14:48 ` [Qemu-devel] [PATCH 6/7] block: add header update capability for VHDX images Jeff Cody
@ 2013-03-07 14:50   ` Stefan Hajnoczi
  2013-03-07 15:33     ` Jeff Cody
  0 siblings, 1 reply; 49+ messages in thread
From: Stefan Hajnoczi @ 2013-03-07 14:50 UTC (permalink / raw)
  To: Jeff Cody; +Cc: kwolf, sw, qemu-devel

On Wed, Mar 06, 2013 at 09:48:33AM -0500, Jeff Cody wrote:
> This adds the ability to update the headers in a VHDX image, including
> generating a new MS-compatible GUID, and checksum.
> 
> Signed-off-by: Jeff Cody <jcody@redhat.com>
> ---
>  block/vhdx.c | 165 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 163 insertions(+), 2 deletions(-)
> 
> diff --git a/block/vhdx.c b/block/vhdx.c
> index 97775d2..13d1e7f 100644
> --- a/block/vhdx.c
> +++ b/block/vhdx.c
> @@ -27,6 +27,11 @@
>      le16_to_cpus(&(guid)->data2); \
>      le16_to_cpus(&(guid)->data3); } while (0)
>  
> +#define cpu_to_leguids(guid) do { \
> +    cpu_to_le32s(&(guid)->data1); \
> +    cpu_to_le16s(&(guid)->data2); \
> +    cpu_to_le16s(&(guid)->data3); } while (0)
> +

Please use a function instead of a macro.

> +/* Calculates new checksum.
> + *
> + * Zero is substituted during crc calculation for the original crc field
> + * crc_offset: byte offset in buf of the buffer crc
> + * buf: buffer pointer
> + * size: size of buffer (must be > crc_offset+4)
> + */
> +static uint32_t vhdx_update_checksum(uint8_t *buf, size_t size, int crc_offset)
> +{
> +    uint32_t crc;
> +
> +    assert(buf != NULL);
> +    assert(size > (crc_offset+4));
> +
> +    memset(buf+crc_offset, 0, sizeof(crc));
> +    crc =  crc32c(0xffffffff, buf, size);
> +    memcpy(buf+crc_offset, &crc, sizeof(crc));

Worth a comment that the checksum is in CPU endianness.  Must use
vhdx_header_le_export() before writing to file.  That implies the buffer
must also be in CPU endianness otherwise vhdx_header_le_export() will
byteswap unwanted fields.

> +
> +    return crc;
> +}
> +
>  /* Validates the checksum of the buffer, with an in-place CRC.
>   *
>   * Zero is substituted during crc calculation for the original crc field,
> @@ -198,6 +227,33 @@ static bool vhdx_checksum_is_valid(uint8_t *buf, size_t size, int crc_offset)
>  }
>  
>  /*
> + * This generates a UUID that is compliant with the MS GUIDs used
> + * in the VHDX spec (and elsewhere).
> + *
> + * We can do this with uuid_generate if uuid.h is present,
> + * however not all systems have uuid and the generation is

Do you mean libuuid is not available on all host platforms supported by
QEMU, or just that you wanted to avoid the dependency?

Since other parts of QEMU already use libuuid I'd rather see it used
than reimplemented in VHDX.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH 4/7] block: initial VHDX driver support framework - supports open and probe
  2013-03-07 14:30   ` Stefan Hajnoczi
@ 2013-03-07 15:23     ` Jeff Cody
  2013-03-07 16:12       ` Stefan Hajnoczi
  0 siblings, 1 reply; 49+ messages in thread
From: Jeff Cody @ 2013-03-07 15:23 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: kwolf, sw, qemu-devel

On Thu, Mar 07, 2013 at 03:30:44PM +0100, Stefan Hajnoczi wrote:
> On Wed, Mar 06, 2013 at 09:48:11AM -0500, Jeff Cody wrote:
> > +#define leguid_to_cpus(guid) do { \
> > +    le32_to_cpus(&(guid)->data1); \
> > +    le16_to_cpus(&(guid)->data2); \
> > +    le16_to_cpus(&(guid)->data3); } while (0)
> 
> This should be a function.  Please avoid macros.
> 

OK

> > +static const ms_guid logical_sector_guid = {.data1 = 0x8141bf1d,
> > +                                            .data2 = 0xa96f,
> > +                                            .data3 = 0x4709,
> > +                                           .data4 = { 0xba, 0x47, 0xf2, 0x33,
> 
> Indentation
> 

OK.  I was bumping up against the 80 char limit, on this and the other
one you flagged.  I'll figure something out.

> > +                                                      0xa8, 0xfa, 0xab, 0x5f} };
> > +
> > +static const ms_guid phys_sector_guid =  { .data1 = 0xcda348c7,
> > +                                           .data2 = 0x445d,
> > +                                           .data3 = 0x4471,
> > +                                           .data4 = { 0x9c, 0xc9, 0xe9, 0x88,
> > +                                                      0x52, 0x51, 0xc5, 0x56} };
> > +
> > +static const ms_guid parent_locator_guid = {.data1 = 0xa8d35f2d,
> > +                                            .data2 = 0xb30b,
> > +                                            .data3 = 0x454d,
> > +                                           .data4 = { 0xab, 0xf7, 0xd3, 0xd8,
> 
> Indentation
> 
> > +typedef struct BDRVVHDXState {
> > +    CoMutex lock;
> > +
> > +    int curr_header;
> > +    vhdx_header *headers[2];
> > +
> > +    vhdx_region_table_header rt;
> > +    vhdx_region_table_entry bat_rt;         /* region table for the BAT */
> > +    vhdx_region_table_entry metadata_rt;    /* region table for the metadata */
> > +    vhdx_region_table_entry *unknown_rt;
> > +    unsigned int unknown_rt_size;
> 
> This field isn't used yet but "unsigned int" for something that has
> "size" in its name is suspicious.  Should it be size_t (memory) or
> uint64_t (disk)?
> 

Yes, good catch - although I'll just drop them, since they are not
used.

> > +static int vhdx_probe(const uint8_t *buf, int buf_size, const char *filename)
> > +{
> > +    if (buf_size >= 8 && !strncmp((char *)buf, "vhdxfile", 8)) {
> 
> memcmp() saves you the trouble of casting buf (which should stay const,
> BTW).
> 

Good point; thanks.

> > +static int vhdx_parse_metadata(BlockDriverState *bs, BDRVVHDXState *s)
> > +{
> > +    int ret = 0;
> > +    uint8_t *buffer;
> > +    int offset = 0;
> > +    int i = 0;
> > +    uint32_t block_size, sectors_per_block, logical_sector_size;
> > +    uint64_t chunk_ratio;
> > +    vhdx_metadata_table_entry md_entry;
> > +
> > +    buffer = g_malloc(VHDX_METADATA_TABLE_MAX_SIZE);
> 
> Please use qemu_blockalign() to avoid bounce buffers in case the memory
> is not 512-byte aligned.  Use qemu_vfree() instead of g_free().
> 
> This applies to all I/O buffers that the block driver allocates.
> 

OK.

> > +
> > +    ret = bdrv_pread(bs->file, s->metadata_rt.file_offset, buffer,
> > +                     VHDX_METADATA_TABLE_MAX_SIZE);
> > +    if (ret < 0) {
> > +        goto fail_no_free;
> > +    }
> > +    memcpy(&s->metadata_hdr, buffer, sizeof(s->metadata_hdr));
> > +    offset += sizeof(s->metadata_hdr);
> > +
> > +    le64_to_cpus(&s->metadata_hdr.signature);
> > +    le16_to_cpus(&s->metadata_hdr.reserved);
> > +    le16_to_cpus(&s->metadata_hdr.entry_count);
> > +
> > +    if (s->metadata_hdr.signature != VHDX_METADATA_MAGIC) {
> > +        ret = -EINVAL;
> > +        goto fail_no_free;
> > +    }
> > +
> > +    s->metadata_entries.present = 0;
> > +
> > +    for (i = 0; i < s->metadata_hdr.entry_count; i++) {
> > +        memcpy(&md_entry, buffer+offset, sizeof(md_entry));
> 
> Read beyond end of buffer if metadata_hdr.entry_count is wrong.
> 
> We cannot trust the image file - it can be corrupt or malicious.  Checks
> are required for everything that can be validated.  QEMU cannot do
> anything that would cause a crash or memory corruption on invalid input.
> 

Good point. I'll add a check before the loop - if entry_count is
wrong, we can't really trust the file anyway, so should probably
return -EINVAL.

> > +    ret = bdrv_pread(bs->file,
> > +                     s->metadata_entries.file_parameters_entry.offset
> > +                                         + s->metadata_rt.file_offset,
> > +                     &s->params,
> > +                     sizeof(s->params));
> 
> Missing error code path when ret < 0.
> 

Thanks.

> > +    /* determine virtual disk size, logical sector size,
> > +     * and phys sector size */
> > +
> > +    ret = bdrv_pread(bs->file,
> > +                     s->metadata_entries.virtual_disk_size_entry.offset
> > +                                           + s->metadata_rt.file_offset,
> > +                     &s->virtual_disk_size,
> > +                     sizeof(uint64_t));
> > +    if (ret < 0) {
> > +        goto exit;
> > +    }
> > +    ret = bdrv_pread(bs->file,
> > +                     s->metadata_entries.logical_sector_size_entry.offset
> > +                                             + s->metadata_rt.file_offset,
> > +                     &s->logical_sector_size,
> > +                     sizeof(uint32_t));
> > +    if (ret < 0) {
> > +        goto exit;
> > +    }
> > +    ret = bdrv_pread(bs->file,
> > +                     s->metadata_entries.phys_sector_size_entry.offset
> > +                                          + s->metadata_rt.file_offset,
> > +                     &s->physical_sector_size,
> > +                     sizeof(uint32_t));
> > +    if (ret < 0) {
> > +        goto exit;
> > +    }
> > +
> > +    le64_to_cpus(&s->virtual_disk_size);
> > +    le32_to_cpus(&s->logical_sector_size);
> > +    le32_to_cpus(&s->physical_sector_size);
> > +
> > +    /* both block_size and sector_size are guaranteed powers of 2 */
> > +    s->sectors_per_block = s->params.block_size / s->logical_sector_size;
> 
> Divide-by-zero if file is corrupt/malicious.
> 
> > +    s->chunk_ratio = (VHDX_MAX_SECTORS_PER_BLOCK) *
> > +                     (uint64_t)s->logical_sector_size /
> > +                     (uint64_t)s->params.block_size;
> 
> Divide-by-zero if file is corrupt/malicious.
> 

Thanks, will add checks.

> > +
> > +
> > +    /* These values are ones we will want to use for division / multiplication
> > +     * later on, and they are all guaranteed (per the spec) to be powers of 2,
> > +     * so we can take advantage of that for shift operations during
> > +     * reads/writes */
> 
> Before doing that, let's verify that they are powers of two and fail if
> they violate the spec.
> 
> if (x & x - 1) {
>     ret = -EINVAL;
>     goto exit;
> }
> 

Good point, I will do that.

> > +static int vhdx_open(BlockDriverState *bs, int flags)
> > +{
> > +    BDRVVHDXState *s = bs->opaque;
> > +    int ret = 0;
> > +    int i;
> > +
> > +    s->bat = NULL;
> > +
> > +    qemu_co_mutex_init(&s->lock);
> > +
> > +    ret = vhdx_parse_header(bs, s);
> > +    if (ret) {
> > +        goto fail;
> > +    }
> > +
> > +    ret = vhdx_parse_log(bs, s);
> > +    if (ret) {
> > +        goto fail;
> > +    }
> > +
> > +    ret = vhdx_open_region_tables(bs, s);
> > +    if (ret) {
> > +        goto fail;
> > +    }
> > +
> > +    ret = vhdx_parse_metadata(bs, s);
> > +    if (ret) {
> > +        goto fail;
> > +    }
> > +    s->block_size = s->params.block_size;
> > +
> > +    /* the VHDX spec dictates that virtual_disk_size is always a multiple of
> > +     * logical_sector_size */
> > +    bs->total_sectors = s->virtual_disk_size / s->logical_sector_size;
> > +
> > +    s->bat_offset = s->bat_rt.file_offset;
> > +    s->bat_entries = s->bat_rt.length / sizeof(vhdx_bat_entry);
> > +    s->bat = g_malloc(s->bat_rt.length);
> > +
> > +    ret = bdrv_pread(bs->file, s->bat_offset, s->bat, s->bat_rt.length);
> > +
> > +    for (i = 0; i < s->bat_entries; i++) {
> > +        le64_to_cpus(&s->bat[i]);
> > +    }
> 
> How does BAT size scale when the image size is increased?  QCOW2 and QED
> use caches for metadata that would be too large or wasteful to keep in
> memory.

The BAT size is dependent on the virtual disk size, and the block
size.  The block size is allowed to range from 1MB - 256MB.  There is
one BAT entry per block.

In practice, the large block size keeps the BAT entry table reasonable
(for a 127GB file, the block size created by Hyper-V is 32MB, so the
table is pretty small - 32KB).

However, I don't see anything in the spec that forces the block size
to be larger, for a large virtual disk size.  So for the max size of
64TB, and the smallest block size of 1MB, keeping the BAT in memory
would indeed be excessive.

I'll re-read the spec, and see if there is anything that ties the
block size and virtual size together.  If not, I'll have to add
caching.

Thanks for the detailed review!

Jeff

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH 6/7] block: add header update capability for VHDX images.
  2013-03-07 14:50   ` Stefan Hajnoczi
@ 2013-03-07 15:33     ` Jeff Cody
  2013-03-07 16:15       ` Stefan Hajnoczi
  0 siblings, 1 reply; 49+ messages in thread
From: Jeff Cody @ 2013-03-07 15:33 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: kwolf, sw, qemu-devel

On Thu, Mar 07, 2013 at 03:50:56PM +0100, Stefan Hajnoczi wrote:
> On Wed, Mar 06, 2013 at 09:48:33AM -0500, Jeff Cody wrote:
> > This adds the ability to update the headers in a VHDX image, including
> > generating a new MS-compatible GUID, and checksum.
> > 
> > Signed-off-by: Jeff Cody <jcody@redhat.com>
> > ---
> >  block/vhdx.c | 165 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 163 insertions(+), 2 deletions(-)
> > 
> > diff --git a/block/vhdx.c b/block/vhdx.c
> > index 97775d2..13d1e7f 100644
> > --- a/block/vhdx.c
> > +++ b/block/vhdx.c
> > @@ -27,6 +27,11 @@
> >      le16_to_cpus(&(guid)->data2); \
> >      le16_to_cpus(&(guid)->data3); } while (0)
> >  
> > +#define cpu_to_leguids(guid) do { \
> > +    cpu_to_le32s(&(guid)->data1); \
> > +    cpu_to_le16s(&(guid)->data2); \
> > +    cpu_to_le16s(&(guid)->data3); } while (0)
> > +
> 
> Please use a function instead of a macro.
> 

OK.

> > +/* Calculates new checksum.
> > + *
> > + * Zero is substituted during crc calculation for the original crc field
> > + * crc_offset: byte offset in buf of the buffer crc
> > + * buf: buffer pointer
> > + * size: size of buffer (must be > crc_offset+4)
> > + */
> > +static uint32_t vhdx_update_checksum(uint8_t *buf, size_t size, int crc_offset)
> > +{
> > +    uint32_t crc;
> > +
> > +    assert(buf != NULL);
> > +    assert(size > (crc_offset+4));
> > +
> > +    memset(buf+crc_offset, 0, sizeof(crc));
> > +    crc =  crc32c(0xffffffff, buf, size);
> > +    memcpy(buf+crc_offset, &crc, sizeof(crc));
> 
> Worth a comment that the checksum is in CPU endianness.  Must use
> vhdx_header_le_export() before writing to file.  That implies the buffer
> must also be in CPU endianness otherwise vhdx_header_le_export() will
> byteswap unwanted fields.
>

OK.

> > +
> > +    return crc;
> > +}
> > +
> >  /* Validates the checksum of the buffer, with an in-place CRC.
> >   *
> >   * Zero is substituted during crc calculation for the original crc field,
> > @@ -198,6 +227,33 @@ static bool vhdx_checksum_is_valid(uint8_t *buf, size_t size, int crc_offset)
> >  }
> >  
> >  /*
> > + * This generates a UUID that is compliant with the MS GUIDs used
> > + * in the VHDX spec (and elsewhere).
> > + *
> > + * We can do this with uuid_generate if uuid.h is present,
> > + * however not all systems have uuid and the generation is
> 
> Do you mean libuuid is not available on all host platforms supported by
> QEMU, or just that you wanted to avoid the dependency?
>

The former, I am assuming by the configure check for uuid and
CONFIG_UUID.

> Since other parts of QEMU already use libuuid I'd rather see it used
> than reimplemented in VHDX.

There are two other users of uuid.h - block/vpc.c and block/vdi.c.
Both uses are wrapped in an #ifdef for CONFIG_UUID, and if
uuid_generate is not present they just leave the uuid field with
zeroes.

At least for the VHDX spec, there are several places where a GUID /
UUID of zero has a special meaning.  So if CONFIG_UUID is not present,
something needs to be generated - and generating a valid GUID that is
compatible with the spec is pretty trivial, which is why I just went
ahead and did that.

I also thought about adding this into util/ or the like, so that if
uuid.h is not available, vdi and vpc could use it.  However, I wasn't
certain of the requirements for vdi and vpc itself, so I didn't do
that yet.


Jeff

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH 7/7] block: add write support for VHDX images
  2013-03-06 14:48 ` [Qemu-devel] [PATCH 7/7] block: add write support " Jeff Cody
@ 2013-03-07 15:59   ` Stefan Hajnoczi
  2013-03-07 16:05     ` Jeff Cody
  0 siblings, 1 reply; 49+ messages in thread
From: Stefan Hajnoczi @ 2013-03-07 15:59 UTC (permalink / raw)
  To: Jeff Cody; +Cc: kwolf, sw, qemu-devel

On Wed, Mar 06, 2013 at 09:48:43AM -0500, Jeff Cody wrote:
> @@ -958,12 +960,150 @@ exit:
>      return ret;
>  }
>  
> +/*
> + * Allocate a new payload block at the end of the file.
> + *
> + * Allocation will happen at 1MB alignment inside the file
> + *
> + * Returns the file offset start of the new payload block
> + */
> +static int vhdx_allocate_block(BlockDriverState *bs, BDRVVHDXState *s,
> +                                    uint64_t *new_offset)
> +{
> +    *new_offset = bdrv_getlength(bs->file);
> +
> +    /* per the spec, the address for a block is in units of 1MB */
> +    if (*new_offset % (1024*1024)) {
> +        *new_offset = ((*new_offset >> 20) + 1) << 20;  /* round up to 1MB */
> +    }
> +    return bdrv_truncate(bs->file, *new_offset + s->block_size);
> +}

Is it necessary to resize the file?  Why not just write at EOF to grow
it?

> +/*
> + * Update the BAT tablet entry with the new file offset, and the new entry
> + * state */
> +static int vhdx_update_bat_table_entry(BlockDriverState *bs, BDRVVHDXState *s,
> +                                       vhdx_sector_info *sinfo, int state)
> +{
> +    uint64_t bat_tmp;
> +    uint64_t bat_entry_offset;
> +
> +    /* The BAT entry is a uint64, with 44 bits for the file offset in units of
> +     * 1MB, and 3 bits for the block state. */
> +    s->bat[sinfo->bat_idx]  = ((sinfo->file_offset>>20) <<
> +                               VHDX_BAT_FILE_OFF_BITS);
> +
> +    s->bat[sinfo->bat_idx] |= state & VHDX_BAT_STATE_BIT_MASK;
>  
> +    bat_tmp = cpu_to_le64(s->bat[sinfo->bat_idx]);
> +    bat_entry_offset = s->bat_offset + sinfo->bat_idx * sizeof(vhdx_bat_entry);
> +
> +    return bdrv_pwrite_sync(bs->file, bat_entry_offset, &bat_tmp,
> +                            sizeof(vhdx_bat_entry));
> +}
>  
>  static coroutine_fn int vhdx_co_writev(BlockDriverState *bs, int64_t sector_num,
>                                        int nb_sectors, QEMUIOVector *qiov)
>  {
> -    return -ENOTSUP;
> +    int ret = -ENOTSUP;
> +    BDRVVHDXState *s = bs->opaque;
> +    vhdx_sector_info sinfo;
> +    uint64_t bytes_done = 0;
> +    QEMUIOVector hd_qiov;
> +
> +    qemu_iovec_init(&hd_qiov, qiov->niov);
> +
> +    qemu_co_mutex_lock(&s->lock);
> +
> +    /* Per the spec, on the first write of guest-visible data to the file the
> +     * data write guid must be updated in the header */
> +    if (s->first_visible_write) {
> +        s->first_visible_write = false;
> +        vhdx_update_headers(bs, s, true);
> +    }
> +
> +    while (nb_sectors > 0) {
> +        if (s->params.data_bits & VHDX_PARAMS_HAS_PARENT) {
> +            /* not supported yet */
> +            ret = -ENOTSUP;
> +            goto exit;
> +        } else {
> +            vhdx_block_translate(s, sector_num, nb_sectors, &sinfo);
> +
> +            qemu_iovec_reset(&hd_qiov);
> +            qemu_iovec_concat(&hd_qiov, qiov,  bytes_done, sinfo.bytes_avail);
> +            /* check the payload block state */
> +            switch (s->bat[sinfo.bat_idx] & VHDX_BAT_STATE_BIT_MASK) {
> +            case PAYLOAD_BLOCK_ZERO:
> +                /* in this case, we need to preserve zero writes for
> +                 * data that is not part of this write, so we must pad
> +                 * the rest of the buffer to zeroes */
> +
> +                /* if we are on a posix system with ftruncate() that extends
> +                 * a file, then it is zero-filled for us.  On Win32, the raw
> +                 * layer uses SetFilePointer and SetFileEnd, which does not
> +                 * zero fill AFAIK */
> +
> +                /* TODO: queue another write of zero buffers if the host OS does
> +                 * not zero-fill on file extension */
> +
> +                /* fall through */
> +            case PAYLOAD_BLOCK_NOT_PRESENT: /* fall through */
> +            case PAYLOAD_BLOCK_UNMAPPED:    /* fall through */
> +            case PAYLOAD_BLOCK_UNDEFINED:   /* fall through */
> +                ret = vhdx_allocate_block(bs, s, &sinfo.file_offset);
> +                if (ret < 0) {
> +                    goto exit;
> +                }
> +                /* once we support differencing files, this may also be
> +                 * partially present */
> +                /* update block state to the newly specified state */
> +                ret = vhdx_update_bat_table_entry(bs, s, &sinfo,
> +                                                  PAYLOAD_BLOCK_FULL_PRESENT);
> +                if (ret < 0) {
> +                    goto exit;
> +                }

The BAT table entry must be written after data is already on disk.
Otherwise a crash partway through would leave the BAT table pointing to
undefined data.

Do you need to use the log to ensure this?

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH 7/7] block: add write support for VHDX images
  2013-03-07 15:59   ` Stefan Hajnoczi
@ 2013-03-07 16:05     ` Jeff Cody
  0 siblings, 0 replies; 49+ messages in thread
From: Jeff Cody @ 2013-03-07 16:05 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: kwolf, sw, qemu-devel

On Thu, Mar 07, 2013 at 04:59:05PM +0100, Stefan Hajnoczi wrote:
> On Wed, Mar 06, 2013 at 09:48:43AM -0500, Jeff Cody wrote:
> > @@ -958,12 +960,150 @@ exit:
> >      return ret;
> >  }
> >  
> > +/*
> > + * Allocate a new payload block at the end of the file.
> > + *
> > + * Allocation will happen at 1MB alignment inside the file
> > + *
> > + * Returns the file offset start of the new payload block
> > + */
> > +static int vhdx_allocate_block(BlockDriverState *bs, BDRVVHDXState *s,
> > +                                    uint64_t *new_offset)
> > +{
> > +    *new_offset = bdrv_getlength(bs->file);
> > +
> > +    /* per the spec, the address for a block is in units of 1MB */
> > +    if (*new_offset % (1024*1024)) {
> > +        *new_offset = ((*new_offset >> 20) + 1) << 20;  /* round up to 1MB */
> > +    }
> > +    return bdrv_truncate(bs->file, *new_offset + s->block_size);
> > +}
> 
> Is it necessary to resize the file?  Why not just write at EOF to grow
> it?
> 

After the recent bdrv_truncate() discussion, I think that may be best.

> > +/*
> > + * Update the BAT tablet entry with the new file offset, and the new entry
> > + * state */
> > +static int vhdx_update_bat_table_entry(BlockDriverState *bs, BDRVVHDXState *s,
> > +                                       vhdx_sector_info *sinfo, int state)
> > +{
> > +    uint64_t bat_tmp;
> > +    uint64_t bat_entry_offset;
> > +
> > +    /* The BAT entry is a uint64, with 44 bits for the file offset in units of
> > +     * 1MB, and 3 bits for the block state. */
> > +    s->bat[sinfo->bat_idx]  = ((sinfo->file_offset>>20) <<
> > +                               VHDX_BAT_FILE_OFF_BITS);
> > +
> > +    s->bat[sinfo->bat_idx] |= state & VHDX_BAT_STATE_BIT_MASK;
> >  
> > +    bat_tmp = cpu_to_le64(s->bat[sinfo->bat_idx]);
> > +    bat_entry_offset = s->bat_offset + sinfo->bat_idx * sizeof(vhdx_bat_entry);
> > +
> > +    return bdrv_pwrite_sync(bs->file, bat_entry_offset, &bat_tmp,
> > +                            sizeof(vhdx_bat_entry));
> > +}
> >  
> >  static coroutine_fn int vhdx_co_writev(BlockDriverState *bs, int64_t sector_num,
> >                                        int nb_sectors, QEMUIOVector *qiov)
> >  {
> > -    return -ENOTSUP;
> > +    int ret = -ENOTSUP;
> > +    BDRVVHDXState *s = bs->opaque;
> > +    vhdx_sector_info sinfo;
> > +    uint64_t bytes_done = 0;
> > +    QEMUIOVector hd_qiov;
> > +
> > +    qemu_iovec_init(&hd_qiov, qiov->niov);
> > +
> > +    qemu_co_mutex_lock(&s->lock);
> > +
> > +    /* Per the spec, on the first write of guest-visible data to the file the
> > +     * data write guid must be updated in the header */
> > +    if (s->first_visible_write) {
> > +        s->first_visible_write = false;
> > +        vhdx_update_headers(bs, s, true);
> > +    }
> > +
> > +    while (nb_sectors > 0) {
> > +        if (s->params.data_bits & VHDX_PARAMS_HAS_PARENT) {
> > +            /* not supported yet */
> > +            ret = -ENOTSUP;
> > +            goto exit;
> > +        } else {
> > +            vhdx_block_translate(s, sector_num, nb_sectors, &sinfo);
> > +
> > +            qemu_iovec_reset(&hd_qiov);
> > +            qemu_iovec_concat(&hd_qiov, qiov,  bytes_done, sinfo.bytes_avail);
> > +            /* check the payload block state */
> > +            switch (s->bat[sinfo.bat_idx] & VHDX_BAT_STATE_BIT_MASK) {
> > +            case PAYLOAD_BLOCK_ZERO:
> > +                /* in this case, we need to preserve zero writes for
> > +                 * data that is not part of this write, so we must pad
> > +                 * the rest of the buffer to zeroes */
> > +
> > +                /* if we are on a posix system with ftruncate() that extends
> > +                 * a file, then it is zero-filled for us.  On Win32, the raw
> > +                 * layer uses SetFilePointer and SetFileEnd, which does not
> > +                 * zero fill AFAIK */
> > +
> > +                /* TODO: queue another write of zero buffers if the host OS does
> > +                 * not zero-fill on file extension */
> > +
> > +                /* fall through */
> > +            case PAYLOAD_BLOCK_NOT_PRESENT: /* fall through */
> > +            case PAYLOAD_BLOCK_UNMAPPED:    /* fall through */
> > +            case PAYLOAD_BLOCK_UNDEFINED:   /* fall through */
> > +                ret = vhdx_allocate_block(bs, s, &sinfo.file_offset);
> > +                if (ret < 0) {
> > +                    goto exit;
> > +                }
> > +                /* once we support differencing files, this may also be
> > +                 * partially present */
> > +                /* update block state to the newly specified state */
> > +                ret = vhdx_update_bat_table_entry(bs, s, &sinfo,
> > +                                                  PAYLOAD_BLOCK_FULL_PRESENT);
> > +                if (ret < 0) {
> > +                    goto exit;
> > +                }
> 
> The BAT table entry must be written after data is already on disk.
> Otherwise a crash partway through would leave the BAT table pointing to
> undefined data.
> 
> Do you need to use the log to ensure this?

Yes, indeed.  The log support is what I am working on now - all
metadata and BAT writes are supposed to go through the log, for this
reason.  The next version will have log support, and so I will use
that for the BAT writes (and any other metadata writes, except the
header).

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH 1/7] block: only force IO completion in .bdrv_truncate if we are shrinking
  2013-03-07  8:53                 ` Peter Lieven
  2013-03-07  8:59                   ` Kevin Wolf
@ 2013-03-07 16:09                   ` Paolo Bonzini
  2013-03-08  7:53                     ` Peter Lieven
  1 sibling, 1 reply; 49+ messages in thread
From: Paolo Bonzini @ 2013-03-07 16:09 UTC (permalink / raw)
  To: Peter Lieven; +Cc: kwolf, sw, Jeff Cody, qemu-devel, stefanha

Il 07/03/2013 09:53, Peter Lieven ha scritto:
>>
>> Then we can add bdrv_revalidate and, for block_resize, call
>> bdrv_revalidate+bdrv_truncate.  For bs->growable = 0 &&
>> !bs->drv->bdrv_truncate, bdrv_truncate can just check that the actual
>> size is the same or bigger as the one requested, and fail otherwise.
>>
> 
> Regarding brd_drain_all(). Is the fix right to call it only on device
> shrink?

I think the fix is to only call it for the monitor command.  Optionally,
when shrinking, assert that there are no requests in flight.

Paolo

> In this case it has to be added to iscsi_truncate as well.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH 4/7] block: initial VHDX driver support framework - supports open and probe
  2013-03-07 15:23     ` Jeff Cody
@ 2013-03-07 16:12       ` Stefan Hajnoczi
  2013-03-08  8:35         ` Kevin Wolf
  0 siblings, 1 reply; 49+ messages in thread
From: Stefan Hajnoczi @ 2013-03-07 16:12 UTC (permalink / raw)
  To: Jeff Cody; +Cc: kwolf, sw, qemu-devel

On Thu, Mar 07, 2013 at 10:23:54AM -0500, Jeff Cody wrote:
> On Thu, Mar 07, 2013 at 03:30:44PM +0100, Stefan Hajnoczi wrote:
> > On Wed, Mar 06, 2013 at 09:48:11AM -0500, Jeff Cody wrote:
> > > +    ret = bdrv_pread(bs->file, s->bat_offset, s->bat, s->bat_rt.length);
> > > +
> > > +    for (i = 0; i < s->bat_entries; i++) {
> > > +        le64_to_cpus(&s->bat[i]);
> > > +    }
> > 
> > How does BAT size scale when the image size is increased?  QCOW2 and QED
> > use caches for metadata that would be too large or wasteful to keep in
> > memory.
> 
> The BAT size is dependent on the virtual disk size, and the block
> size.  The block size is allowed to range from 1MB - 256MB.  There is
> one BAT entry per block.
> 
> In practice, the large block size keeps the BAT entry table reasonable
> (for a 127GB file, the block size created by Hyper-V is 32MB, so the
> table is pretty small - 32KB).
> 
> However, I don't see anything in the spec that forces the block size
> to be larger, for a large virtual disk size.  So for the max size of
> 64TB, and the smallest block size of 1MB, keeping the BAT in memory
> would indeed be excessive.
> 
> I'll re-read the spec, and see if there is anything that ties the
> block size and virtual size together.  If not, I'll have to add
> caching.

BTW the qcow2 cache code can be reused.

Stefan

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH 6/7] block: add header update capability for VHDX images.
  2013-03-07 15:33     ` Jeff Cody
@ 2013-03-07 16:15       ` Stefan Hajnoczi
  0 siblings, 0 replies; 49+ messages in thread
From: Stefan Hajnoczi @ 2013-03-07 16:15 UTC (permalink / raw)
  To: Jeff Cody; +Cc: kwolf, sw, qemu-devel

On Thu, Mar 07, 2013 at 10:33:49AM -0500, Jeff Cody wrote:
> On Thu, Mar 07, 2013 at 03:50:56PM +0100, Stefan Hajnoczi wrote:
> > On Wed, Mar 06, 2013 at 09:48:33AM -0500, Jeff Cody wrote:
> > > This adds the ability to update the headers in a VHDX image, including
> > > generating a new MS-compatible GUID, and checksum.
> > > 
> > > Signed-off-by: Jeff Cody <jcody@redhat.com>
> > > ---
> > >  block/vhdx.c | 165 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
> > >  1 file changed, 163 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/block/vhdx.c b/block/vhdx.c
> > > index 97775d2..13d1e7f 100644
> > > --- a/block/vhdx.c
> > > +++ b/block/vhdx.c
> > > @@ -27,6 +27,11 @@
> > >      le16_to_cpus(&(guid)->data2); \
> > >      le16_to_cpus(&(guid)->data3); } while (0)
> > >  
> > > +#define cpu_to_leguids(guid) do { \
> > > +    cpu_to_le32s(&(guid)->data1); \
> > > +    cpu_to_le16s(&(guid)->data2); \
> > > +    cpu_to_le16s(&(guid)->data3); } while (0)
> > > +
> > 
> > Please use a function instead of a macro.
> > 
> 
> OK.
> 
> > > +/* Calculates new checksum.
> > > + *
> > > + * Zero is substituted during crc calculation for the original crc field
> > > + * crc_offset: byte offset in buf of the buffer crc
> > > + * buf: buffer pointer
> > > + * size: size of buffer (must be > crc_offset+4)
> > > + */
> > > +static uint32_t vhdx_update_checksum(uint8_t *buf, size_t size, int crc_offset)
> > > +{
> > > +    uint32_t crc;
> > > +
> > > +    assert(buf != NULL);
> > > +    assert(size > (crc_offset+4));
> > > +
> > > +    memset(buf+crc_offset, 0, sizeof(crc));
> > > +    crc =  crc32c(0xffffffff, buf, size);
> > > +    memcpy(buf+crc_offset, &crc, sizeof(crc));
> > 
> > Worth a comment that the checksum is in CPU endianness.  Must use
> > vhdx_header_le_export() before writing to file.  That implies the buffer
> > must also be in CPU endianness otherwise vhdx_header_le_export() will
> > byteswap unwanted fields.
> >
> 
> OK.
> 
> > > +
> > > +    return crc;
> > > +}
> > > +
> > >  /* Validates the checksum of the buffer, with an in-place CRC.
> > >   *
> > >   * Zero is substituted during crc calculation for the original crc field,
> > > @@ -198,6 +227,33 @@ static bool vhdx_checksum_is_valid(uint8_t *buf, size_t size, int crc_offset)
> > >  }
> > >  
> > >  /*
> > > + * This generates a UUID that is compliant with the MS GUIDs used
> > > + * in the VHDX spec (and elsewhere).
> > > + *
> > > + * We can do this with uuid_generate if uuid.h is present,
> > > + * however not all systems have uuid and the generation is
> > 
> > Do you mean libuuid is not available on all host platforms supported by
> > QEMU, or just that you wanted to avoid the dependency?
> >
> 
> The former, I am assuming by the configure check for uuid and
> CONFIG_UUID.

No, the configure check just lets you build without libuuid, if you
don't want to install the headers :).  In block/vpc.c it's an optional
dependency.

> > Since other parts of QEMU already use libuuid I'd rather see it used
> > than reimplemented in VHDX.
> 
> There are two other users of uuid.h - block/vpc.c and block/vdi.c.
> Both uses are wrapped in an #ifdef for CONFIG_UUID, and if
> uuid_generate is not present they just leave the uuid field with
> zeroes.
> 
> At least for the VHDX spec, there are several places where a GUID /
> UUID of zero has a special meaning.  So if CONFIG_UUID is not present,
> something needs to be generated - and generating a valid GUID that is
> compatible with the spec is pretty trivial, which is why I just went
> ahead and did that.
> 
> I also thought about adding this into util/ or the like, so that if
> uuid.h is not available, vdi and vpc could use it.  However, I wasn't
> certain of the requirements for vdi and vpc itself, so I didn't do
> that yet.

Exactly, in the general case the problem becomes harder and then you're
really reimplementing libuuid.

I suggest using libuuid and building in block/vhdx.o only when it is
available.

Stefa

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH 1/7] block: only force IO completion in .bdrv_truncate if we are shrinking
  2013-03-07  8:50                 ` Kevin Wolf
  2013-03-07  8:56                   ` Peter Lieven
@ 2013-03-07 16:45                   ` Paolo Bonzini
  1 sibling, 0 replies; 49+ messages in thread
From: Paolo Bonzini @ 2013-03-07 16:45 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: sw, Jeff Cody, Peter Lieven, qemu-devel, stefanha

Il 07/03/2013 09:50, Kevin Wolf ha scritto:
> Am 06.03.2013 um 21:39 hat Paolo Bonzini geschrieben:
>> Il 06/03/2013 20:03, Peter Lieven ha scritto:
>>> Am 06.03.2013 19:48, schrieb Jeff Cody:
>>>> On Wed, Mar 06, 2013 at 07:31:51PM +0100, Paolo Bonzini wrote:
>>>>> Il 06/03/2013 19:14, Jeff Cody ha scritto:
>>>>>> QCOW breaks with it using a normal raw posix file as a device.  As a
>>>>>> test: qemu-img create -f qcow test.qcow 5G.  Now run qemu with that
>>>>>> drive mounted, and try to partition and format it.  QEMU now asserts.
>>>>>>
>>>>>> The nicety of being able to using truncate during a write call,
>>>>>> especially for VHDX (which can have relatively large block/cluster
>>>>>> sizes), so to grow the file sparsely in a dynamically allocated file.
>>>>>
>>>>> Perhaps we need two APIs, "truncate" and "revalidate".
>>>>>
>>>>> Truncate should be a no-op if (!bs->growable).
>>>>>
>>>>> Revalidate could be called by the block_resize monitor command with no
>>>>> size specified.
>>>>
>>>> I think that is a good solution.  Is it better to have "truncate" and
>>>> "revalidate", or "truncate" and "grow", with grow being a subset of
>>>> truncate, with fewer restrictions?  There may still be operations
>>>> where it is OK to grow a file, but not OK to shrink it.
> 
> What semantics would the both operations have? Is truncate the same as
> it used to be? I don't really understand what "revalidate" would do, it
> sounds like a read-only operation from its name?

Revalidate would update the current size.  Files fetch it on all calls
to bdrv_getlength, but other backends cache it.  It would visit the BDS
->file chain from the bottom (bs->file->file->file...) to the top
implicitly, with no need for an explicit recursion in the callback (like
we do for flush_to_os).  Before starting the recursion, bdrv_revalidate
does a bdrv_drain_all, so using the current iscsi_truncate for
iscsi_revalidate would be fine.

The block_resize command will always call revalidate before doing
anything.  Then block_resize will try to do a bdrv_truncate if the
callback is there.

Another change that could make sense, is to make bdrv_truncate succeed
if there is no bdrv_truncate callback but bs is larger than the
requested size.  This would be a difference from today's

    if (!drv->bdrv_truncate)
        return -ENOTSUP;

And it could even do the same if the callback is there, but returns
ENOTSUP.  It would simplify some code, like this in raw-posix.c's
raw_truncate:

    } else if (S_ISCHR(st.st_mode) || S_ISBLK(st.st_mode)) {
       if (offset > raw_getlength(bs)) {
           return -EINVAL;
       }
    } else {
        return -ENOTSUP;
    }

The "else if" branch can just go away.

The remaining question is about usage of bdrv_truncate in the formats.
One important aspect is that both QCOW and VHDX, in addition to using
bdrv_truncate to extend the file, rely on bdrv_getlength to fetch the
last _used_ byte of the file, not the available space.  I think we can
say that bdrv_getlength behaves like that iff bs->growable.  If so, QCOW
and VHDX should fail to open for write if the underlying file has
!bs->growable.  Which will never fail right now, but it will as soon as
we implement this:

>> Let's start from (c).  bdrv_file_open sets bs->growable = 1.  I think it
>> should be removed and only the file protocol should set it.
> 
> This is probably right.

Good.  More precisely, raw_open should only set it if the file is a
regular file.

Paolo

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH 1/7] block: only force IO completion in .bdrv_truncate if we are shrinking
  2013-03-07 16:09                   ` Paolo Bonzini
@ 2013-03-08  7:53                     ` Peter Lieven
  2013-03-08  9:23                       ` Paolo Bonzini
  0 siblings, 1 reply; 49+ messages in thread
From: Peter Lieven @ 2013-03-08  7:53 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kwolf, sw, Jeff Cody, qemu-devel, stefanha

On 07.03.2013 17:09, Paolo Bonzini wrote:
> Il 07/03/2013 09:53, Peter Lieven ha scritto:
>>>
>>> Then we can add bdrv_revalidate and, for block_resize, call
>>> bdrv_revalidate+bdrv_truncate.  For bs->growable = 0 &&
>>> !bs->drv->bdrv_truncate, bdrv_truncate can just check that the actual
>>> size is the same or bigger as the one requested, and fail otherwise.
>>>
>>
>> Regarding brd_drain_all(). Is the fix right to call it only on device
>> shrink?
>
> I think the fix is to only call it for the monitor command.  Optionally,
> when shrinking, assert that there are no requests in flight.

Okay.

What is the plan? just fix this or fix the whole thing (which seems to be some
work).

The suggested patch from Jeff will break iscsi_truncate as bs->growable is 1 currently.

Peter

>
> Paolo
>
>> In this case it has to be added to iscsi_truncate as well.
>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH 4/7] block: initial VHDX driver support framework - supports open and probe
  2013-03-07 16:12       ` Stefan Hajnoczi
@ 2013-03-08  8:35         ` Kevin Wolf
  0 siblings, 0 replies; 49+ messages in thread
From: Kevin Wolf @ 2013-03-08  8:35 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: sw, Jeff Cody, qemu-devel

Am 07.03.2013 um 17:12 hat Stefan Hajnoczi geschrieben:
> On Thu, Mar 07, 2013 at 10:23:54AM -0500, Jeff Cody wrote:
> > On Thu, Mar 07, 2013 at 03:30:44PM +0100, Stefan Hajnoczi wrote:
> > > On Wed, Mar 06, 2013 at 09:48:11AM -0500, Jeff Cody wrote:
> > > > +    ret = bdrv_pread(bs->file, s->bat_offset, s->bat, s->bat_rt.length);
> > > > +
> > > > +    for (i = 0; i < s->bat_entries; i++) {
> > > > +        le64_to_cpus(&s->bat[i]);
> > > > +    }
> > > 
> > > How does BAT size scale when the image size is increased?  QCOW2 and QED
> > > use caches for metadata that would be too large or wasteful to keep in
> > > memory.
> > 
> > The BAT size is dependent on the virtual disk size, and the block
> > size.  The block size is allowed to range from 1MB - 256MB.  There is
> > one BAT entry per block.
> > 
> > In practice, the large block size keeps the BAT entry table reasonable
> > (for a 127GB file, the block size created by Hyper-V is 32MB, so the
> > table is pretty small - 32KB).
> > 
> > However, I don't see anything in the spec that forces the block size
> > to be larger, for a large virtual disk size.  So for the max size of
> > 64TB, and the smallest block size of 1MB, keeping the BAT in memory
> > would indeed be excessive.
> > 
> > I'll re-read the spec, and see if there is anything that ties the
> > block size and virtual size together.  If not, I'll have to add
> > caching.
> 
> BTW the qcow2 cache code can be reused.

The add-cow series has patches to make it work outside of qcow2. I'm not
sure if there was something wrong with this part of the series, but at
least it could be a starting point.

Kevin

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH 1/7] block: only force IO completion in .bdrv_truncate if we are shrinking
  2013-03-08  7:53                     ` Peter Lieven
@ 2013-03-08  9:23                       ` Paolo Bonzini
  2013-03-08  9:35                         ` Kevin Wolf
  0 siblings, 1 reply; 49+ messages in thread
From: Paolo Bonzini @ 2013-03-08  9:23 UTC (permalink / raw)
  To: Peter Lieven; +Cc: kwolf, sw, Jeff Cody, qemu-devel, stefanha

Il 08/03/2013 08:53, Peter Lieven ha scritto:
>>
>> I think the fix is to only call it for the monitor command.  Optionally,
>> when shrinking, assert that there are no requests in flight.
> 
> Okay.
> 
> What is the plan? just fix this or fix the whole thing (which seems to
> be some
> work).
> 
> The suggested patch from Jeff will break iscsi_truncate as bs->growable
> is 1 currently.

I guess it's up to Kevin and Jeff.

Paolo

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH 1/7] block: only force IO completion in .bdrv_truncate if we are shrinking
  2013-03-08  9:23                       ` Paolo Bonzini
@ 2013-03-08  9:35                         ` Kevin Wolf
  2013-03-08 11:46                           ` Peter Lieven
  0 siblings, 1 reply; 49+ messages in thread
From: Kevin Wolf @ 2013-03-08  9:35 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: sw, Jeff Cody, Peter Lieven, qemu-devel, stefanha

Am 08.03.2013 um 10:23 hat Paolo Bonzini geschrieben:
> Il 08/03/2013 08:53, Peter Lieven ha scritto:
> >>
> >> I think the fix is to only call it for the monitor command.  Optionally,
> >> when shrinking, assert that there are no requests in flight.
> > 
> > Okay.
> > 
> > What is the plan? just fix this or fix the whole thing (which seems to
> > be some
> > work).
> > 
> > The suggested patch from Jeff will break iscsi_truncate as bs->growable
> > is 1 currently.
> 
> I guess it's up to Kevin and Jeff.

Someone needs to send a fix for the qcow1 (and probably vmdk)
regression, otherwise I'll have to revert the patch.

Kevin

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH 1/7] block: only force IO completion in .bdrv_truncate if we are shrinking
  2013-03-08  9:35                         ` Kevin Wolf
@ 2013-03-08 11:46                           ` Peter Lieven
  2013-03-08 11:56                             ` Kevin Wolf
  0 siblings, 1 reply; 49+ messages in thread
From: Peter Lieven @ 2013-03-08 11:46 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Paolo Bonzini, Jeff Cody, qemu-devel, stefanha, sw


Am 08.03.2013 um 10:35 schrieb Kevin Wolf <kwolf@redhat.com>:

> Am 08.03.2013 um 10:23 hat Paolo Bonzini geschrieben:
>> Il 08/03/2013 08:53, Peter Lieven ha scritto:
>>>> 
>>>> I think the fix is to only call it for the monitor command.  Optionally,
>>>> when shrinking, assert that there are no requests in flight.
>>> 
>>> Okay.
>>> 
>>> What is the plan? just fix this or fix the whole thing (which seems to
>>> be some
>>> work).
>>> 
>>> The suggested patch from Jeff will break iscsi_truncate as bs->growable
>>> is 1 currently.
>> 
>> I guess it's up to Kevin and Jeff.
> 
> Someone needs to send a fix for the qcow1 (and probably vmdk)
> regression, otherwise I'll have to revert the patch.

What about Paolos suggestion to call bdrv_drain_all() from the block_resize
command and not in bdrv_truncate? It is not a the complete solution, but it
will fix the regression. However, what happens if someone resizes a qcow2
device?

Peter

> 
> Kevin

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH 1/7] block: only force IO completion in .bdrv_truncate if we are shrinking
  2013-03-08 11:46                           ` Peter Lieven
@ 2013-03-08 11:56                             ` Kevin Wolf
  2013-03-09  9:36                               ` Peter Lieven
  0 siblings, 1 reply; 49+ messages in thread
From: Kevin Wolf @ 2013-03-08 11:56 UTC (permalink / raw)
  To: Peter Lieven; +Cc: Paolo Bonzini, Jeff Cody, qemu-devel, stefanha, sw

Am 08.03.2013 um 12:46 hat Peter Lieven geschrieben:
> 
> Am 08.03.2013 um 10:35 schrieb Kevin Wolf <kwolf@redhat.com>:
> 
> > Am 08.03.2013 um 10:23 hat Paolo Bonzini geschrieben:
> >> Il 08/03/2013 08:53, Peter Lieven ha scritto:
> >>>> 
> >>>> I think the fix is to only call it for the monitor command.  Optionally,
> >>>> when shrinking, assert that there are no requests in flight.
> >>> 
> >>> Okay.
> >>> 
> >>> What is the plan? just fix this or fix the whole thing (which seems to
> >>> be some
> >>> work).
> >>> 
> >>> The suggested patch from Jeff will break iscsi_truncate as bs->growable
> >>> is 1 currently.
> >> 
> >> I guess it's up to Kevin and Jeff.
> > 
> > Someone needs to send a fix for the qcow1 (and probably vmdk)
> > regression, otherwise I'll have to revert the patch.
> 
> What about Paolos suggestion to call bdrv_drain_all() from the block_resize
> command and not in bdrv_truncate? It is not a the complete solution, but it
> will fix the regression. However, what happens if someone resizes a qcow2
> device?

I suppose you mean qcow1? It doesn't support resizing images, but even
if it did, this shouldn't be a problem: The problematic case is the call
of bdrv_drain_all() during a read/write function, which would have to
wait for itself to complete. As soon as you restrict the
bdrv_drain_all() to the monitor, waiting for in-flight I/Os should just
work.

Kevin

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH 1/7] block: only force IO completion in .bdrv_truncate if we are shrinking
  2013-03-08 11:56                             ` Kevin Wolf
@ 2013-03-09  9:36                               ` Peter Lieven
  0 siblings, 0 replies; 49+ messages in thread
From: Peter Lieven @ 2013-03-09  9:36 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Paolo Bonzini, Jeff Cody, qemu-devel, stefanha, sw

Am 08.03.2013 12:56, schrieb Kevin Wolf:
> Am 08.03.2013 um 12:46 hat Peter Lieven geschrieben:
>>
>> Am 08.03.2013 um 10:35 schrieb Kevin Wolf <kwolf@redhat.com>:
>>
>>> Am 08.03.2013 um 10:23 hat Paolo Bonzini geschrieben:
>>>> Il 08/03/2013 08:53, Peter Lieven ha scritto:
>>>>>>
>>>>>> I think the fix is to only call it for the monitor command.  Optionally,
>>>>>> when shrinking, assert that there are no requests in flight.
>>>>>
>>>>> Okay.
>>>>>
>>>>> What is the plan? just fix this or fix the whole thing (which seems to
>>>>> be some
>>>>> work).
>>>>>
>>>>> The suggested patch from Jeff will break iscsi_truncate as bs->growable
>>>>> is 1 currently.
>>>>
>>>> I guess it's up to Kevin and Jeff.
>>>
>>> Someone needs to send a fix for the qcow1 (and probably vmdk)
>>> regression, otherwise I'll have to revert the patch.
>>
>> What about Paolos suggestion to call bdrv_drain_all() from the block_resize
>> command and not in bdrv_truncate? It is not a the complete solution, but it
>> will fix the regression. However, what happens if someone resizes a qcow2
>> device?
> 
> I suppose you mean qcow1? It doesn't support resizing images, but even
> if it did, this shouldn't be a problem: The problematic case is the call
> of bdrv_drain_all() during a read/write function, which would have to
> wait for itself to complete. As soon as you restrict the
> bdrv_drain_all() to the monitor, waiting for in-flight I/Os should just
> work.

Ok, then please ignore / revert the patch that added bdrv_drain_all() in bdrv_truncate() and
I will sent a new one that adds bdrv_drain_all() to qmp_block_resize().

Peter


> 
> Kevin
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2013-03-09  9:36 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-03-06 14:46 [Qemu-devel] [PATCH 0/7] Initial VHDX support (and a bug fix for QCOW) Jeff Cody
2013-03-06 14:47 ` [Qemu-devel] [PATCH 1/7] block: only force IO completion in .bdrv_truncate if we are shrinking Jeff Cody
2013-03-06 17:50   ` Peter Lieven
2013-03-06 18:06     ` Paolo Bonzini
2013-03-06 18:14       ` Jeff Cody
2013-03-06 18:31         ` Paolo Bonzini
2013-03-06 18:48           ` Jeff Cody
2013-03-06 19:03             ` Peter Lieven
2013-03-06 20:39               ` Paolo Bonzini
2013-03-07  8:50                 ` Kevin Wolf
2013-03-07  8:56                   ` Peter Lieven
2013-03-07  9:03                     ` Kevin Wolf
2013-03-07  9:16                       ` Peter Lieven
2013-03-07  9:22                         ` Kevin Wolf
2013-03-07  9:25                           ` Peter Lieven
2013-03-07 10:00                             ` Kevin Wolf
2013-03-07 10:22                               ` Peter Lieven
2013-03-07 16:45                   ` Paolo Bonzini
2013-03-07  8:53                 ` Peter Lieven
2013-03-07  8:59                   ` Kevin Wolf
2013-03-07 16:09                   ` Paolo Bonzini
2013-03-08  7:53                     ` Peter Lieven
2013-03-08  9:23                       ` Paolo Bonzini
2013-03-08  9:35                         ` Kevin Wolf
2013-03-08 11:46                           ` Peter Lieven
2013-03-08 11:56                             ` Kevin Wolf
2013-03-09  9:36                               ` Peter Lieven
     [not found]         ` <51378A23.5090301@dlhnet.de>
     [not found]           ` <20130306184217.GC3743@localhost.localdomain>
2013-03-06 18:46             ` Peter Lieven
2013-03-06 18:27       ` Peter Lieven
2013-03-07  8:57         ` Kevin Wolf
2013-03-06 18:32     ` Jeff Cody
2013-03-06 20:22       ` Paolo Bonzini
2013-03-06 14:47 ` [Qemu-devel] [PATCH 2/7] qemu: add castagnoli crc32c checksum algorithm Jeff Cody
2013-03-06 14:47 ` [Qemu-devel] [PATCH 3/7] block: vhdx header for the QEMU support of VHDX images Jeff Cody
2013-03-07 13:15   ` Stefan Hajnoczi
2013-03-07 13:43     ` Jeff Cody
2013-03-06 14:48 ` [Qemu-devel] [PATCH 4/7] block: initial VHDX driver support framework - supports open and probe Jeff Cody
2013-03-07 14:30   ` Stefan Hajnoczi
2013-03-07 15:23     ` Jeff Cody
2013-03-07 16:12       ` Stefan Hajnoczi
2013-03-08  8:35         ` Kevin Wolf
2013-03-06 14:48 ` [Qemu-devel] [PATCH 5/7] block: add read-only support to VHDX image format Jeff Cody
2013-03-06 14:48 ` [Qemu-devel] [PATCH 6/7] block: add header update capability for VHDX images Jeff Cody
2013-03-07 14:50   ` Stefan Hajnoczi
2013-03-07 15:33     ` Jeff Cody
2013-03-07 16:15       ` Stefan Hajnoczi
2013-03-06 14:48 ` [Qemu-devel] [PATCH 7/7] block: add write support " Jeff Cody
2013-03-07 15:59   ` Stefan Hajnoczi
2013-03-07 16:05     ` Jeff Cody

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.