From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:52822) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1YLoiG-00047A-I3 for qemu-devel@nongnu.org; Thu, 12 Feb 2015 03:01:30 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1YLoiB-0004yk-7O for qemu-devel@nongnu.org; Thu, 12 Feb 2015 03:01:28 -0500 Received: from tama50.ecl.ntt.co.jp ([129.60.39.147]:56499) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1YLoiA-0004yY-FB for qemu-devel@nongnu.org; Thu, 12 Feb 2015 03:01:23 -0500 Message-ID: <54DC5DC1.80109@lab.ntt.co.jp> Date: Thu, 12 Feb 2015 17:01:05 +0900 From: Teruaki Ishizaki MIME-Version: 1.0 References: <1422347727-13006-1-git-send-email-ishizaki.teruaki@lab.ntt.co.jp> <20150210031051.GB3859@ubuntu-trusty> <54D9BFAA.8050307@lab.ntt.co.jp> <20150210085819.GC18727@ubuntu-trusty> <54D9D5D1.2060601@lab.ntt.co.jp> <20150210103558.GF18727@ubuntu-trusty> <877fvn649i.wl%mitake.hitoshi@lab.ntt.co.jp> <20150212070049.GL7801@ubuntu-trusty> <87386b6132.wl%mitake.hitoshi@lab.ntt.co.jp> <20150212074217.GN7801@ubuntu-trusty> In-Reply-To: <20150212074217.GN7801@ubuntu-trusty> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Subject: Re: [Qemu-devel] [sheepdog] [PATCH v4] sheepdog: selectable object size support List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Liu Yuan , Hitoshi Mitake Cc: kwolf@redhat.com, sheepdog@lists.wpkg.org, qemu-devel@nongnu.org, stefanha@redhat.com (2015/02/12 16:42), Liu Yuan wrote: > On Thu, Feb 12, 2015 at 04:28:01PM +0900, Hitoshi Mitake wrote: >> At Thu, 12 Feb 2015 15:00:49 +0800, >> Liu Yuan wrote: >>> >>> On Thu, Feb 12, 2015 at 03:19:21PM +0900, Hitoshi Mitake wrote: >>>> At Tue, 10 Feb 2015 18:35:58 +0800, >>>> Liu Yuan wrote: >>>>> >>>>> On Tue, Feb 10, 2015 at 06:56:33PM +0900, Teruaki Ishizaki wrote: >>>>>> (2015/02/10 17:58), Liu Yuan wrote: >>>>>>> On Tue, Feb 10, 2015 at 05:22:02PM +0900, Teruaki Ishizaki wrote: >>>>>>>> (2015/02/10 12:10), Liu Yuan wrote: >>>>>>>>> On Tue, Jan 27, 2015 at 05:35:27PM +0900, Teruaki Ishizaki wrote: >>>>>>>>>> Previously, qemu block driver of sheepdog used hard-coded VDI object size. >>>>>>>>>> This patch enables users to handle "block_size_shift" value for >>>>>>>>>> calculating VDI object size. >>>>>>>>>> >>>>>>>>>> When you start qemu, you don't need to specify additional command option. >>>>>>>>>> >>>>>>>>>> But when you create the VDI which doesn't have default object size >>>>>>>>>> with qemu-img command, you specify block_size_shift option. >>>>>>>>>> >>>>>>>>>> If you want to create a VDI of 8MB(1 << 23) object size, >>>>>>>>>> you need to specify following command option. >>>>>>>>>> >>>>>>>>>> # qemu-img create -o block_size_shift=23 sheepdog:test1 100M >>>>>>>>>> >>>>>>>>>> In addition, when you don't specify qemu-img command option, >>>>>>>>>> a default value of sheepdog cluster is used for creating VDI. >>>>>>>>>> >>>>>>>>>> # qemu-img create sheepdog:test2 100M >>>>>>>>>> >>>>>>>>>> Signed-off-by: Teruaki Ishizaki >>>>>>>>>> --- >>>>>>>>>> V4: >>>>>>>>>> - Limit a read/write buffer size for creating a preallocated VDI. >>>>>>>>>> - Replace a parse function for the block_size_shift option. >>>>>>>>>> - Fix an error message. >>>>>>>>>> >>>>>>>>>> V3: >>>>>>>>>> - Delete the needless operation of buffer. >>>>>>>>>> - Delete the needless operations of request header. >>>>>>>>>> for SD_OP_GET_CLUSTER_DEFAULT. >>>>>>>>>> - Fix coding style problems. >>>>>>>>>> >>>>>>>>>> V2: >>>>>>>>>> - Fix coding style problem (white space). >>>>>>>>>> - Add members, store_policy and block_size_shift to struct SheepdogVdiReq. >>>>>>>>>> - Initialize request header to use block_size_shift specified by user. >>>>>>>>>> --- >>>>>>>>>> block/sheepdog.c | 138 ++++++++++++++++++++++++++++++++++++++------- >>>>>>>>>> include/block/block_int.h | 1 + >>>>>>>>>> 2 files changed, 119 insertions(+), 20 deletions(-) >>>>>>>>>> >>>>>>>>>> diff --git a/block/sheepdog.c b/block/sheepdog.c >>>>>>>>>> index be3176f..a43b947 100644 >>>>>>>>>> --- a/block/sheepdog.c >>>>>>>>>> +++ b/block/sheepdog.c >>>>>>>>>> @@ -37,6 +37,7 @@ >>>>>>>>>> #define SD_OP_READ_VDIS 0x15 >>>>>>>>>> #define SD_OP_FLUSH_VDI 0x16 >>>>>>>>>> #define SD_OP_DEL_VDI 0x17 >>>>>>>>>> +#define SD_OP_GET_CLUSTER_DEFAULT 0x18 >>>>>>>>>> >>>>>>>>>> #define SD_FLAG_CMD_WRITE 0x01 >>>>>>>>>> #define SD_FLAG_CMD_COW 0x02 >>>>>>>>>> @@ -167,7 +168,8 @@ typedef struct SheepdogVdiReq { >>>>>>>>>> uint32_t base_vdi_id; >>>>>>>>>> uint8_t copies; >>>>>>>>>> uint8_t copy_policy; >>>>>>>>>> - uint8_t reserved[2]; >>>>>>>>>> + uint8_t store_policy; >>>>>>>>>> + uint8_t block_size_shift; >>>>>>>>>> uint32_t snapid; >>>>>>>>>> uint32_t type; >>>>>>>>>> uint32_t pad[2]; >>>>>>>>>> @@ -186,6 +188,21 @@ typedef struct SheepdogVdiRsp { >>>>>>>>>> uint32_t pad[5]; >>>>>>>>>> } SheepdogVdiRsp; >>>>>>>>>> >>>>>>>>>> +typedef struct SheepdogClusterRsp { >>>>>>>>>> + uint8_t proto_ver; >>>>>>>>>> + uint8_t opcode; >>>>>>>>>> + uint16_t flags; >>>>>>>>>> + uint32_t epoch; >>>>>>>>>> + uint32_t id; >>>>>>>>>> + uint32_t data_length; >>>>>>>>>> + uint32_t result; >>>>>>>>>> + uint8_t nr_copies; >>>>>>>>>> + uint8_t copy_policy; >>>>>>>>>> + uint8_t block_size_shift; >>>>>>>>>> + uint8_t __pad1; >>>>>>>>>> + uint32_t __pad2[6]; >>>>>>>>>> +} SheepdogClusterRsp; >>>>>>>>>> + >>>>>>>>>> typedef struct SheepdogInode { >>>>>>>>>> char name[SD_MAX_VDI_LEN]; >>>>>>>>>> char tag[SD_MAX_VDI_TAG_LEN]; >>>>>>>>>> @@ -1544,6 +1561,7 @@ static int do_sd_create(BDRVSheepdogState *s, uint32_t *vdi_id, int snapshot, >>>>>>>>>> hdr.vdi_size = s->inode.vdi_size; >>>>>>>>>> hdr.copy_policy = s->inode.copy_policy; >>>>>>>>>> hdr.copies = s->inode.nr_copies; >>>>>>>>>> + hdr.block_size_shift = s->inode.block_size_shift; >>>>>>>>>> >>>>>>>>>> ret = do_req(fd, s->aio_context, (SheepdogReq *)&hdr, buf, &wlen, &rlen); >>>>>>>>>> >>>>>>>>>> @@ -1569,9 +1587,12 @@ static int do_sd_create(BDRVSheepdogState *s, uint32_t *vdi_id, int snapshot, >>>>>>>>>> static int sd_prealloc(const char *filename, Error **errp) >>>>>>>>>> { >>>>>>>>>> BlockDriverState *bs = NULL; >>>>>>>>>> + BDRVSheepdogState *base = NULL; >>>>>>>>>> + unsigned long buf_size; >>>>>>>>>> uint32_t idx, max_idx; >>>>>>>>>> + uint32_t object_size; >>>>>>>>>> int64_t vdi_size; >>>>>>>>>> - void *buf = g_malloc0(SD_DATA_OBJ_SIZE); >>>>>>>>>> + void *buf = NULL; >>>>>>>>>> int ret; >>>>>>>>>> >>>>>>>>>> ret = bdrv_open(&bs, filename, NULL, NULL, BDRV_O_RDWR | BDRV_O_PROTOCOL, >>>>>>>>>> @@ -1585,18 +1606,24 @@ static int sd_prealloc(const char *filename, Error **errp) >>>>>>>>>> ret = vdi_size; >>>>>>>>>> goto out; >>>>>>>>>> } >>>>>>>>>> - max_idx = DIV_ROUND_UP(vdi_size, SD_DATA_OBJ_SIZE); >>>>>>>>>> + >>>>>>>>>> + base = bs->opaque; >>>>>>>>>> + object_size = (UINT32_C(1) << base->inode.block_size_shift); >>>>>>>>>> + buf_size = MIN(object_size, SD_DATA_OBJ_SIZE); >>>>>>>>>> + buf = g_malloc0(buf_size); >>>>>>>>>> + >>>>>>>>>> + max_idx = DIV_ROUND_UP(vdi_size, buf_size); >>>>>>>>>> >>>>>>>>>> for (idx = 0; idx < max_idx; idx++) { >>>>>>>>>> /* >>>>>>>>>> * The created image can be a cloned image, so we need to read >>>>>>>>>> * a data from the source image. >>>>>>>>>> */ >>>>>>>>>> - ret = bdrv_pread(bs, idx * SD_DATA_OBJ_SIZE, buf, SD_DATA_OBJ_SIZE); >>>>>>>>>> + ret = bdrv_pread(bs, idx * buf_size, buf, buf_size); >>>>>>>>>> if (ret < 0) { >>>>>>>>>> goto out; >>>>>>>>>> } >>>>>>>>>> - ret = bdrv_pwrite(bs, idx * SD_DATA_OBJ_SIZE, buf, SD_DATA_OBJ_SIZE); >>>>>>>>>> + ret = bdrv_pwrite(bs, idx * buf_size, buf, buf_size); >>>>>>>>>> if (ret < 0) { >>>>>>>>>> goto out; >>>>>>>>>> } >>>>>>>>>> @@ -1669,6 +1696,21 @@ static int parse_redundancy(BDRVSheepdogState *s, const char *opt) >>>>>>>>>> return 0; >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> +static int parse_block_size_shift(BDRVSheepdogState *s, QemuOpts *opt) >>>>>>>>>> +{ >>>>>>>>>> + struct SheepdogInode *inode = &s->inode; >>>>>>>>>> + inode->block_size_shift = >>>>>>>>>> + (uint8_t)qemu_opt_get_number_del(opt, "block_size_shift", 0); >>>>>>>>>> + if (inode->block_size_shift == 0) { >>>>>>>>>> + /* block_size_shift is set for cluster default value by sheepdog */ >>>>>>>>>> + return 0; >>>>>>>>>> + } else if (inode->block_size_shift < 20 || inode->block_size_shift > 31) { >>>>>>>>>> + return -EINVAL; >>>>>>>>>> + } >>>>>>>>>> + >>>>>>>>>> + return 0; >>>>>>>>>> +} >>>>>>>>>> + >>>>>>>>>> static int sd_create(const char *filename, QemuOpts *opts, >>>>>>>>>> Error **errp) >>>>>>>>>> { >>>>>>>>>> @@ -1679,6 +1721,7 @@ static int sd_create(const char *filename, QemuOpts *opts, >>>>>>>>>> BDRVSheepdogState *s; >>>>>>>>>> char tag[SD_MAX_VDI_TAG_LEN]; >>>>>>>>>> uint32_t snapid; >>>>>>>>>> + uint64_t max_vdi_size; >>>>>>>>>> bool prealloc = false; >>>>>>>>>> >>>>>>>>>> s = g_new0(BDRVSheepdogState, 1); >>>>>>>>>> @@ -1718,9 +1761,10 @@ static int sd_create(const char *filename, QemuOpts *opts, >>>>>>>>>> } >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> - if (s->inode.vdi_size > SD_MAX_VDI_SIZE) { >>>>>>>>>> - error_setg(errp, "too big image size"); >>>>>>>>>> - ret = -EINVAL; >>>>>>>>>> + ret = parse_block_size_shift(s, opts); >>>>>>>>>> + if (ret < 0) { >>>>>>>>>> + error_setg(errp, "Invalid block_size_shift:" >>>>>>>>>> + " '%s'. please use 20-31", buf); >>>>>>>>>> goto out; >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> @@ -1757,6 +1801,47 @@ static int sd_create(const char *filename, QemuOpts *opts, >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> s->aio_context = qemu_get_aio_context(); >>>>>>>>>> + >>>>>>>>>> + /* if block_size_shift is not specified, get cluster default value */ >>>>>>>>>> + if (s->inode.block_size_shift == 0) { >>>>>>>>> >>>>>>>>> Seems that we don't keep backward compatibility for new QEMU and old sheep that >>>>>>>>> doesn't support SD_OP_GET_CLUSTER_DEFAULT. >>>>>>>> Then, I'll add the operation that if block_size_shift from >>>>>>>> SD_OP_GET_CLUSTER_DEFAULT is zero, block_size_shift is set to 22. >>>>>>>> Is it OK? >>>>>>> >>>>>>> If sheep doesn't support SD_OP_GET_CLUSTER_DEFAULT, sheep will return >>>>>>> SD_RES_INVALID_PARMS. So to keep backward compatibility, we shouldn't issue >>>>>>> SD_OP_GET_CLUSTER_DEFAULT to sheep daemon. >>>>>> OK, I think that the way is better. >>>>>>> >>>>>>> My point is that, after user upgrading the sheep and still stick with old QEMU, >>>>>>> (I guess many users will), any operations in the past sholudn't fail right after >>>>>>> upgrading. >>>>>>> >>>>>>>> >>>>>>>>> >>>>>>>>> What will happen if old QEMU with new sheep that support block_size_shift? Most >>>>>>>>> distributions will ship the old stable qemu that wouldn't be aware of >>>>>>>>> block_size_shift. >>>>>>>> If old QEMU with new sheep is used, VDI is created with cluster default >>>>>>>> block_size_shift and accessed as 4MB object_size. >>>>>>>> So I think that for backward compatibility, users must do cluster >>>>>>>> format command with default block_size_shift 22 equal to >>>>>>>> 4MB object_size. >>>>>>> >>>>>>> how old QEMU know about block_size_shift? For old QEMU, block_size_shift is >>>>>>> encoded as 0 and then send the create request to sheep. Does sheep can handle >>>>>>> block_size_shift = 0? You know, we can't pass any value to old QEMU for >>>>>>> block_size_shift. >>>>>> Sheep can handle block_size_shift = 0, when users create new VDI. >>>>>> Old QEMU does do_sd_create() without setting hdr.block_size_shift and >>>>>> sends a request SD_OP_NEW_VDI. >>>>>> If block_size_shift is set to zero, new sheep sets cluster default value >>>>>> in cluster_new_vdi() like copies and copy_policy. >>>>> >>>>> Okay, so new sheep can handle old QEMU request. By the way, how about the >>>>> suggestion in my last email? (x + 1) * 4MB stuff... >>>> >>>> I think specifying object size in multiple of 4MB is overkill. Because >>>> we can specify both of block_size_shift and a number of objects which >>>> belong to VDI. More precisely, >>>> 2 ^ block_size_shift * #objects = VDI size >>>> we can choose the block_size_shift between 20 and 30, and #objects >>>> from 1 < 2 ^ 20. >>>> # #objects is specified via VDI size implicitly >>> >>> I can't understand this paragraph. If object size is fixed, we can calculate >>> # of object easily. It has nothing to do with block_size_shift. >> >> I wanted to say that we can choose both of block_size_shift and # of >> objects from wide range, so granualarity is flexible and steps between >> VDI sizes are reasonably small. >> >>> >>>> The granualarity of VDI sizes seems to be really fine (even >>>> block_size_shift == 30, step is 1GB). So supporting block size with >>>> multiple of 4MB will not provide practical benefit. >>> >>> Prabably yes, but finer granularity doesn't cause trouble and gives more >>> flexibility. We can allow 12MB at user's will, for example. Even it doesn't >>> provide practical benefits, what benefit block_size_shift will provide over >>> (x + 1) * 4MB scheme? My point is, give a wider range won't hurt and will >>> pose less constaint in the future. >>> >>> The main reason I suggest (x + 1) * 4MB, is that we can easily keep backward >>> compatibility because x = 0 means 4MB, but with this patch proposal, x = 22 >>> means 4MB. >> >> Utilizing block_size_shift of inode doesn't break >> compatibility. Because older versions of sheepdog initialize with >> inode->block_size_shift with 22. > > Older version? What about the old sheep that doesn't support block_size_shift? > If I remember well, block_size_shift is a new thing introduced by > fd9e4a28fad7c16b2237f4f48c9d264af192e40e, at Dec 12, 2014, very new. This means > all our stable sheep won't have any idea of what block_size_shift is. Old sheepdog initialize with inode->block_size_shift with 22. The implementation of Sheepdog 0.9 is following ======= static struct sd_inode *alloc_inode(const struct vdi_iocb *iocb, uint32_t new_snapid, uint32_t new_vid, uint32_t *data_vdi_id, struct generation_reference *gref) { struct sd_inode *new = xzalloc(sizeof(*new)); unsigned long block_size = SD_DATA_OBJ_SIZE; ...(snip) new->block_size_shift = find_next_bit(&block_size, BITS_PER_LONG, 0); ======= It means that block_size_shift is initialized with 22. > > The main trouble is old QEMU, which will always set inode->block_size_shift as 0 > and expect the object size is 4MB. > > Think about following case: > > 1 Old qemu and old sheep runs a long time with many vdis that has 4MB object. > 2 Users upgrade sheep into new sheep but doesn't upgrade QEMU. This is quit > normal case because many users use stable QEMU branch. > 3 He still use qemu-img to create new vdi and expect the object size is 4MB > If new sheep doesn't set object size as 4MB, this qemu block driver will > malfunction. As I said, Qe/sd new old new CDS Ign old CDS NULL So, when we use old Qemu with new Sheepdog, we should do cluster format with default block_size_shift 22.