From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:55710) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1YLowY-0002Sm-V6 for qemu-devel@nongnu.org; Thu, 12 Feb 2015 03:16:17 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1YLowT-0000su-Hr for qemu-devel@nongnu.org; Thu, 12 Feb 2015 03:16:14 -0500 Received: from mail-ig0-x233.google.com ([2607:f8b0:4001:c05::233]:47470) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1YLowT-0000sf-AO for qemu-devel@nongnu.org; Thu, 12 Feb 2015 03:16:09 -0500 Received: by mail-ig0-f179.google.com with SMTP id l13so2203772iga.0 for ; Thu, 12 Feb 2015 00:16:08 -0800 (PST) Date: Thu, 12 Feb 2015 16:16:03 +0800 From: Liu Yuan Message-ID: <20150212081603.GQ7801@ubuntu-trusty> References: <20150210031051.GB3859@ubuntu-trusty> <54D9BFAA.8050307@lab.ntt.co.jp> <20150210085819.GC18727@ubuntu-trusty> <54D9D5D1.2060601@lab.ntt.co.jp> <20150210103558.GF18727@ubuntu-trusty> <877fvn649i.wl%mitake.hitoshi@lab.ntt.co.jp> <20150212070049.GL7801@ubuntu-trusty> <87386b6132.wl%mitake.hitoshi@lab.ntt.co.jp> <20150212074217.GN7801@ubuntu-trusty> <87y4o34ke4.wl%mitake.hitoshi@lab.ntt.co.jp> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <87y4o34ke4.wl%mitake.hitoshi@lab.ntt.co.jp> Subject: Re: [Qemu-devel] [sheepdog] [PATCH v4] sheepdog: selectable object size support List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Hitoshi Mitake Cc: kwolf@redhat.com, sheepdog@lists.wpkg.org, Teruaki Ishizaki , qemu-devel@nongnu.org, stefanha@redhat.com On Thu, Feb 12, 2015 at 05:13:55PM +0900, Hitoshi Mitake wrote: > At Thu, 12 Feb 2015 15:42:17 +0800, > Liu Yuan wrote: > > > > On Thu, Feb 12, 2015 at 04:28:01PM +0900, Hitoshi Mitake wrote: > > > At Thu, 12 Feb 2015 15:00:49 +0800, > > > Liu Yuan wrote: > > > > > > > > On Thu, Feb 12, 2015 at 03:19:21PM +0900, Hitoshi Mitake wrote: > > > > > At Tue, 10 Feb 2015 18:35:58 +0800, > > > > > Liu Yuan wrote: > > > > > > > > > > > > On Tue, Feb 10, 2015 at 06:56:33PM +0900, Teruaki Ishizaki wrote: > > > > > > > (2015/02/10 17:58), Liu Yuan wrote: > > > > > > > >On Tue, Feb 10, 2015 at 05:22:02PM +0900, Teruaki Ishizaki wrote: > > > > > > > >>(2015/02/10 12:10), Liu Yuan wrote: > > > > > > > >>>On Tue, Jan 27, 2015 at 05:35:27PM +0900, Teruaki Ishizaki wrote: > > > > > > > >>>>Previously, qemu block driver of sheepdog used hard-coded VDI object size. > > > > > > > >>>>This patch enables users to handle "block_size_shift" value for > > > > > > > >>>>calculating VDI object size. > > > > > > > >>>> > > > > > > > >>>>When you start qemu, you don't need to specify additional command option. > > > > > > > >>>> > > > > > > > >>>>But when you create the VDI which doesn't have default object size > > > > > > > >>>>with qemu-img command, you specify block_size_shift option. > > > > > > > >>>> > > > > > > > >>>>If you want to create a VDI of 8MB(1 << 23) object size, > > > > > > > >>>>you need to specify following command option. > > > > > > > >>>> > > > > > > > >>>> # qemu-img create -o block_size_shift=23 sheepdog:test1 100M > > > > > > > >>>> > > > > > > > >>>>In addition, when you don't specify qemu-img command option, > > > > > > > >>>>a default value of sheepdog cluster is used for creating VDI. > > > > > > > >>>> > > > > > > > >>>> # qemu-img create sheepdog:test2 100M > > > > > > > >>>> > > > > > > > >>>>Signed-off-by: Teruaki Ishizaki > > > > > > > >>>>--- > > > > > > > >>>>V4: > > > > > > > >>>> - Limit a read/write buffer size for creating a preallocated VDI. > > > > > > > >>>> - Replace a parse function for the block_size_shift option. > > > > > > > >>>> - Fix an error message. > > > > > > > >>>> > > > > > > > >>>>V3: > > > > > > > >>>> - Delete the needless operation of buffer. > > > > > > > >>>> - Delete the needless operations of request header. > > > > > > > >>>> for SD_OP_GET_CLUSTER_DEFAULT. > > > > > > > >>>> - Fix coding style problems. > > > > > > > >>>> > > > > > > > >>>>V2: > > > > > > > >>>> - Fix coding style problem (white space). > > > > > > > >>>> - Add members, store_policy and block_size_shift to struct SheepdogVdiReq. > > > > > > > >>>> - Initialize request header to use block_size_shift specified by user. > > > > > > > >>>>--- > > > > > > > >>>> block/sheepdog.c | 138 ++++++++++++++++++++++++++++++++++++++------- > > > > > > > >>>> include/block/block_int.h | 1 + > > > > > > > >>>> 2 files changed, 119 insertions(+), 20 deletions(-) > > > > > > > >>>> > > > > > > > >>>>diff --git a/block/sheepdog.c b/block/sheepdog.c > > > > > > > >>>>index be3176f..a43b947 100644 > > > > > > > >>>>--- a/block/sheepdog.c > > > > > > > >>>>+++ b/block/sheepdog.c > > > > > > > >>>>@@ -37,6 +37,7 @@ > > > > > > > >>>> #define SD_OP_READ_VDIS 0x15 > > > > > > > >>>> #define SD_OP_FLUSH_VDI 0x16 > > > > > > > >>>> #define SD_OP_DEL_VDI 0x17 > > > > > > > >>>>+#define SD_OP_GET_CLUSTER_DEFAULT 0x18 > > > > > > > >>>> > > > > > > > >>>> #define SD_FLAG_CMD_WRITE 0x01 > > > > > > > >>>> #define SD_FLAG_CMD_COW 0x02 > > > > > > > >>>>@@ -167,7 +168,8 @@ typedef struct SheepdogVdiReq { > > > > > > > >>>> uint32_t base_vdi_id; > > > > > > > >>>> uint8_t copies; > > > > > > > >>>> uint8_t copy_policy; > > > > > > > >>>>- uint8_t reserved[2]; > > > > > > > >>>>+ uint8_t store_policy; > > > > > > > >>>>+ uint8_t block_size_shift; > > > > > > > >>>> uint32_t snapid; > > > > > > > >>>> uint32_t type; > > > > > > > >>>> uint32_t pad[2]; > > > > > > > >>>>@@ -186,6 +188,21 @@ typedef struct SheepdogVdiRsp { > > > > > > > >>>> uint32_t pad[5]; > > > > > > > >>>> } SheepdogVdiRsp; > > > > > > > >>>> > > > > > > > >>>>+typedef struct SheepdogClusterRsp { > > > > > > > >>>>+ uint8_t proto_ver; > > > > > > > >>>>+ uint8_t opcode; > > > > > > > >>>>+ uint16_t flags; > > > > > > > >>>>+ uint32_t epoch; > > > > > > > >>>>+ uint32_t id; > > > > > > > >>>>+ uint32_t data_length; > > > > > > > >>>>+ uint32_t result; > > > > > > > >>>>+ uint8_t nr_copies; > > > > > > > >>>>+ uint8_t copy_policy; > > > > > > > >>>>+ uint8_t block_size_shift; > > > > > > > >>>>+ uint8_t __pad1; > > > > > > > >>>>+ uint32_t __pad2[6]; > > > > > > > >>>>+} SheepdogClusterRsp; > > > > > > > >>>>+ > > > > > > > >>>> typedef struct SheepdogInode { > > > > > > > >>>> char name[SD_MAX_VDI_LEN]; > > > > > > > >>>> char tag[SD_MAX_VDI_TAG_LEN]; > > > > > > > >>>>@@ -1544,6 +1561,7 @@ static int do_sd_create(BDRVSheepdogState *s, uint32_t *vdi_id, int snapshot, > > > > > > > >>>> hdr.vdi_size = s->inode.vdi_size; > > > > > > > >>>> hdr.copy_policy = s->inode.copy_policy; > > > > > > > >>>> hdr.copies = s->inode.nr_copies; > > > > > > > >>>>+ hdr.block_size_shift = s->inode.block_size_shift; > > > > > > > >>>> > > > > > > > >>>> ret = do_req(fd, s->aio_context, (SheepdogReq *)&hdr, buf, &wlen, &rlen); > > > > > > > >>>> > > > > > > > >>>>@@ -1569,9 +1587,12 @@ static int do_sd_create(BDRVSheepdogState *s, uint32_t *vdi_id, int snapshot, > > > > > > > >>>> static int sd_prealloc(const char *filename, Error **errp) > > > > > > > >>>> { > > > > > > > >>>> BlockDriverState *bs = NULL; > > > > > > > >>>>+ BDRVSheepdogState *base = NULL; > > > > > > > >>>>+ unsigned long buf_size; > > > > > > > >>>> uint32_t idx, max_idx; > > > > > > > >>>>+ uint32_t object_size; > > > > > > > >>>> int64_t vdi_size; > > > > > > > >>>>- void *buf = g_malloc0(SD_DATA_OBJ_SIZE); > > > > > > > >>>>+ void *buf = NULL; > > > > > > > >>>> int ret; > > > > > > > >>>> > > > > > > > >>>> ret = bdrv_open(&bs, filename, NULL, NULL, BDRV_O_RDWR | BDRV_O_PROTOCOL, > > > > > > > >>>>@@ -1585,18 +1606,24 @@ static int sd_prealloc(const char *filename, Error **errp) > > > > > > > >>>> ret = vdi_size; > > > > > > > >>>> goto out; > > > > > > > >>>> } > > > > > > > >>>>- max_idx = DIV_ROUND_UP(vdi_size, SD_DATA_OBJ_SIZE); > > > > > > > >>>>+ > > > > > > > >>>>+ base = bs->opaque; > > > > > > > >>>>+ object_size = (UINT32_C(1) << base->inode.block_size_shift); > > > > > > > >>>>+ buf_size = MIN(object_size, SD_DATA_OBJ_SIZE); > > > > > > > >>>>+ buf = g_malloc0(buf_size); > > > > > > > >>>>+ > > > > > > > >>>>+ max_idx = DIV_ROUND_UP(vdi_size, buf_size); > > > > > > > >>>> > > > > > > > >>>> for (idx = 0; idx < max_idx; idx++) { > > > > > > > >>>> /* > > > > > > > >>>> * The created image can be a cloned image, so we need to read > > > > > > > >>>> * a data from the source image. > > > > > > > >>>> */ > > > > > > > >>>>- ret = bdrv_pread(bs, idx * SD_DATA_OBJ_SIZE, buf, SD_DATA_OBJ_SIZE); > > > > > > > >>>>+ ret = bdrv_pread(bs, idx * buf_size, buf, buf_size); > > > > > > > >>>> if (ret < 0) { > > > > > > > >>>> goto out; > > > > > > > >>>> } > > > > > > > >>>>- ret = bdrv_pwrite(bs, idx * SD_DATA_OBJ_SIZE, buf, SD_DATA_OBJ_SIZE); > > > > > > > >>>>+ ret = bdrv_pwrite(bs, idx * buf_size, buf, buf_size); > > > > > > > >>>> if (ret < 0) { > > > > > > > >>>> goto out; > > > > > > > >>>> } > > > > > > > >>>>@@ -1669,6 +1696,21 @@ static int parse_redundancy(BDRVSheepdogState *s, const char *opt) > > > > > > > >>>> return 0; > > > > > > > >>>> } > > > > > > > >>>> > > > > > > > >>>>+static int parse_block_size_shift(BDRVSheepdogState *s, QemuOpts *opt) > > > > > > > >>>>+{ > > > > > > > >>>>+ struct SheepdogInode *inode = &s->inode; > > > > > > > >>>>+ inode->block_size_shift = > > > > > > > >>>>+ (uint8_t)qemu_opt_get_number_del(opt, "block_size_shift", 0); > > > > > > > >>>>+ if (inode->block_size_shift == 0) { > > > > > > > >>>>+ /* block_size_shift is set for cluster default value by sheepdog */ > > > > > > > >>>>+ return 0; > > > > > > > >>>>+ } else if (inode->block_size_shift < 20 || inode->block_size_shift > 31) { > > > > > > > >>>>+ return -EINVAL; > > > > > > > >>>>+ } > > > > > > > >>>>+ > > > > > > > >>>>+ return 0; > > > > > > > >>>>+} > > > > > > > >>>>+ > > > > > > > >>>> static int sd_create(const char *filename, QemuOpts *opts, > > > > > > > >>>> Error **errp) > > > > > > > >>>> { > > > > > > > >>>>@@ -1679,6 +1721,7 @@ static int sd_create(const char *filename, QemuOpts *opts, > > > > > > > >>>> BDRVSheepdogState *s; > > > > > > > >>>> char tag[SD_MAX_VDI_TAG_LEN]; > > > > > > > >>>> uint32_t snapid; > > > > > > > >>>>+ uint64_t max_vdi_size; > > > > > > > >>>> bool prealloc = false; > > > > > > > >>>> > > > > > > > >>>> s = g_new0(BDRVSheepdogState, 1); > > > > > > > >>>>@@ -1718,9 +1761,10 @@ static int sd_create(const char *filename, QemuOpts *opts, > > > > > > > >>>> } > > > > > > > >>>> } > > > > > > > >>>> > > > > > > > >>>>- if (s->inode.vdi_size > SD_MAX_VDI_SIZE) { > > > > > > > >>>>- error_setg(errp, "too big image size"); > > > > > > > >>>>- ret = -EINVAL; > > > > > > > >>>>+ ret = parse_block_size_shift(s, opts); > > > > > > > >>>>+ if (ret < 0) { > > > > > > > >>>>+ error_setg(errp, "Invalid block_size_shift:" > > > > > > > >>>>+ " '%s'. please use 20-31", buf); > > > > > > > >>>> goto out; > > > > > > > >>>> } > > > > > > > >>>> > > > > > > > >>>>@@ -1757,6 +1801,47 @@ static int sd_create(const char *filename, QemuOpts *opts, > > > > > > > >>>> } > > > > > > > >>>> > > > > > > > >>>> s->aio_context = qemu_get_aio_context(); > > > > > > > >>>>+ > > > > > > > >>>>+ /* if block_size_shift is not specified, get cluster default value */ > > > > > > > >>>>+ if (s->inode.block_size_shift == 0) { > > > > > > > >>> > > > > > > > >>>Seems that we don't keep backward compatibility for new QEMU and old sheep that > > > > > > > >>>doesn't support SD_OP_GET_CLUSTER_DEFAULT. > > > > > > > >>Then, I'll add the operation that if block_size_shift from > > > > > > > >>SD_OP_GET_CLUSTER_DEFAULT is zero, block_size_shift is set to 22. > > > > > > > >>Is it OK? > > > > > > > > > > > > > > > >If sheep doesn't support SD_OP_GET_CLUSTER_DEFAULT, sheep will return > > > > > > > >SD_RES_INVALID_PARMS. So to keep backward compatibility, we shouldn't issue > > > > > > > >SD_OP_GET_CLUSTER_DEFAULT to sheep daemon. > > > > > > > OK, I think that the way is better. > > > > > > > > > > > > > > > >My point is that, after user upgrading the sheep and still stick with old QEMU, > > > > > > > >(I guess many users will), any operations in the past sholudn't fail right after > > > > > > > >upgrading. > > > > > > > > > > > > > > > >> > > > > > > > >>> > > > > > > > >>>What will happen if old QEMU with new sheep that support block_size_shift? Most > > > > > > > >>>distributions will ship the old stable qemu that wouldn't be aware of > > > > > > > >>>block_size_shift. > > > > > > > >>If old QEMU with new sheep is used, VDI is created with cluster default > > > > > > > >>block_size_shift and accessed as 4MB object_size. > > > > > > > >>So I think that for backward compatibility, users must do cluster > > > > > > > >>format command with default block_size_shift 22 equal to > > > > > > > >>4MB object_size. > > > > > > > > > > > > > > > >how old QEMU know about block_size_shift? For old QEMU, block_size_shift is > > > > > > > >encoded as 0 and then send the create request to sheep. Does sheep can handle > > > > > > > >block_size_shift = 0? You know, we can't pass any value to old QEMU for > > > > > > > >block_size_shift. > > > > > > > Sheep can handle block_size_shift = 0, when users create new VDI. > > > > > > > Old QEMU does do_sd_create() without setting hdr.block_size_shift and > > > > > > > sends a request SD_OP_NEW_VDI. > > > > > > > If block_size_shift is set to zero, new sheep sets cluster default value > > > > > > > in cluster_new_vdi() like copies and copy_policy. > > > > > > > > > > > > Okay, so new sheep can handle old QEMU request. By the way, how about the > > > > > > suggestion in my last email? (x + 1) * 4MB stuff... > > > > > > > > > > I think specifying object size in multiple of 4MB is overkill. Because > > > > > we can specify both of block_size_shift and a number of objects which > > > > > belong to VDI. More precisely, > > > > > 2 ^ block_size_shift * #objects = VDI size > > > > > we can choose the block_size_shift between 20 and 30, and #objects > > > > > from 1 < 2 ^ 20. > > > > > # #objects is specified via VDI size implicitly > > > > > > > > I can't understand this paragraph. If object size is fixed, we can calculate > > > > # of object easily. It has nothing to do with block_size_shift. > > > > > > I wanted to say that we can choose both of block_size_shift and # of > > > objects from wide range, so granualarity is flexible and steps between > > > VDI sizes are reasonably small. > > > > > > > > > > > > The granualarity of VDI sizes seems to be really fine (even > > > > > block_size_shift == 30, step is 1GB). So supporting block size with > > > > > multiple of 4MB will not provide practical benefit. > > > > > > > > Prabably yes, but finer granularity doesn't cause trouble and gives more > > > > flexibility. We can allow 12MB at user's will, for example. Even it doesn't > > > > provide practical benefits, what benefit block_size_shift will provide over > > > > (x + 1) * 4MB scheme? My point is, give a wider range won't hurt and will > > > > pose less constaint in the future. > > > > > > > > The main reason I suggest (x + 1) * 4MB, is that we can easily keep backward > > > > compatibility because x = 0 means 4MB, but with this patch proposal, x = 22 > > > > means 4MB. > > > > > > Utilizing block_size_shift of inode doesn't break > > > compatibility. Because older versions of sheepdog initialize with > > > inode->block_size_shift with 22. > > > > Older version? What about the old sheep that doesn't support block_size_shift? > > If I remember well, block_size_shift is a new thing introduced by > > fd9e4a28fad7c16b2237f4f48c9d264af192e40e, at Dec 12, 2014, very new. This means > > all our stable sheep won't have any idea of what block_size_shift is. > > > > The main trouble is old QEMU, which will always set inode->block_size_shift as 0 > > and expect the object size is 4MB. > > > > Think about following case: > > > > 1 Old qemu and old sheep runs a long time with many vdis that has 4MB object. > > 2 Users upgrade sheep into new sheep but doesn't upgrade QEMU. This is quit > > normal case because many users use stable QEMU branch. > > 3 He still use qemu-img to create new vdi and expect the object size is 4MB > > If new sheep doesn't set object size as 4MB, this qemu block driver will > > malfunction. > > Older qemu drivers doesn't use inode->block_size_shift at all even for > VDI creation (parameters for creation are passed via > SheepdogVdiReq). So it doesn't break compatibility. > > I tested qemu-img create with qemu 1.0 and the master branch of > sheepdog but there's no problem. Am I missing something? I guess I misunderstand how the master sheep handles block_size_shift = 0 case. Sorry for the trouble. Yuan