From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:57009) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1fEFYU-00083q-A8 for qemu-devel@nongnu.org; Thu, 03 May 2018 10:49:59 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1fEFYS-0007po-TH for qemu-devel@nongnu.org; Thu, 03 May 2018 10:49:58 -0400 References: <20180501211336.986372-1-eblake@redhat.com> <20180501211336.986372-4-eblake@redhat.com> <26f63bd9-4f91-fd46-631c-5f12c04fc9e0@virtuozzo.com> From: Eric Blake Message-ID: <49fac5ae-4a46-65bf-e889-b5bf979391f3@redhat.com> Date: Thu, 3 May 2018 09:49:48 -0500 MIME-Version: 1.0 In-Reply-To: <26f63bd9-4f91-fd46-631c-5f12c04fc9e0@virtuozzo.com> Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: quoted-printable Subject: Re: [Qemu-devel] [PATCH 3/4] nbd/client: Support requests of additional block sizing info List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Vladimir Sementsov-Ogievskiy , qemu-devel@nongnu.org Cc: qemu-block@nongnu.org, edgar.kaziakhmedov@virtuozzo.com, Paolo Bonzini , Kevin Wolf , Max Reitz On 05/03/2018 04:17 AM, Vladimir Sementsov-Ogievskiy wrote: > 02.05.2018 00:13, Eric Blake wrote: >> The NBD spec is clarifying [1] that a server may want to advertise >> different limits for READ/WRITE (in our case, 32M) than for >> TRIM/ZERO (in our case, nearly 4G).=C2=A0 Implement the client >> side support for these alternate limits, by always requesting >> the new information (a compliant server must ignore the >> request if it is unknown), and by validating anything reported >> by the server before populating the block layer limits. >> >> [1] https://lists.debian.org/nbd/2018/03/msg00048.html >> >> Signed-off-by: Eric Blake >> >> +++ b/nbd/client.c >> @@ -337,16 +337,18 @@ static int nbd_opt_go(QIOChannel *ioc, const=20 >> char *wantname, >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 info->flags =3D 0; >> >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 trace_nbd_opt_go_start(wantname); >> -=C2=A0=C2=A0=C2=A0 buf =3D g_malloc(4 + len + 2 + 2 * info->request_s= izes + 1); >> +=C2=A0=C2=A0=C2=A0 buf =3D g_malloc(4 + len + 2 + 3 * 2 * info->reque= st_sizes + 1); >=20 > Hmm, what "+1" means? Doesn't appear to be necessary, but a little slop never hurts. Better=20 might be struct-ifying things rather than open-coding offsets, or even=20 using a vectored approach rather than requiring a single buffer. But if=20 useful, that's refactoring that is independent of this patch, while this=20 is just demonstrating the bare minimum implementation of the new feature. >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 stl_be_p(buf, len); >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 memcpy(buf + 4, wantname, len); >> -=C2=A0=C2=A0=C2=A0 /* At most one request, everything else up to serv= er */ >> -=C2=A0=C2=A0=C2=A0 stw_be_p(buf + 4 + len, info->request_sizes); >> +=C2=A0=C2=A0=C2=A0 /* Either 0 or 3 requests, everything else up to s= erver */ >> +=C2=A0=C2=A0=C2=A0 stw_be_p(buf + 4 + len, 3 * info->request_sizes); >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if (info->request_sizes) { >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 stw_be_p(buf + = 4 + len + 2, NBD_INFO_BLOCK_SIZE); >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 stw_be_p(buf + 4 + len + 2= + 2, NBD_INFO_TRIM_SIZE); >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 stw_be_p(buf + 4 + len + 2= + 2 + 2, NBD_INFO_ZERO_SIZE); >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 } >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 error =3D nbd_send_option_request(ioc, = NBD_OPT_GO, >> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 4 + le= n + 2 + 2 *=20 >> info->request_sizes, >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 4 + le= n + 2 + 3 * 2 *=20 >> info->request_sizes, >> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= buf, errp); >=20 > magic chunk) Change looks correct. Is it worth introducing a buf_len=20 > variable, to > not retype it? And increment as we go? Yeah, it might make things more legible. >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 be= 32_to_cpus(&info->max_trim); >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if= (!info->min_trim || !info->max_trim || >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0 (info->max_trim !=3D UINT32_MAX && >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0 !QEMU_IS_ALIGNED(info->max_trim, info->min_tr= im))) { >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0 error_setg(errp, "server trim sizes %" PRIu32 "/%"=20 >> PRIu32 >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0 " are not valid", info->min_trim,=20 >> info->max_trim); >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0 nbd_send_opt_abort(ioc); >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0 return -1; >> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 } >=20 > Didn't you forget to check that min_trim is a power of two? Nope. The NBD spec wording specifically uses "The minimum trim size=20 SHOULD be a power of 2, and MUST be at least as large as the preferred=20 block size advertised in `NBD_INFO_BLOCK_SIZE`", in part because there=20 ARE existing hardware iscsi devices that have an advertised=20 preferred/maximum trim size of exactly 15 megabytes (you can request=20 smaller trims 1M at a time, and the device then caches things to=20 eventually report unmapped slices at a 15M boundary). For more=20 information on this odd hardware, look in the qemu logs for the Dell=20 Equallogic device, such as commit 3482b9bc. Since the NBD 'min trim' is advisory, and merely telling the client the=20 ideal alignments to use for a trim most likely to have an effect (on=20 Equallogic, a 1M trim by itself might not trim anything unless=20 neighboring areas are also trimmed, while an aligned 15M trim is=20 immediately observable), an NBD server wrapping such an iscsi device may=20 prefer to mirror the hardware's preferred alignment of 15M, rather than=20 inventing a minimum alignment of 1M, in what it advertises in=20 NBD_INFO_TRIM_SIZE. And maybe I should tweak the NBD spec addition to call things "preferred=20 trim/zero" rather than "min trim/zero", if that makes things any easier=20 to grasp on first read, and better matches the iscsi concept. >=20 > hmm, so, now nbd_opt_go() is full of very-very similar code parts, whic= h=20 > may worth some refactoring now or in future. >=20 > Are we going to add similar limits for BLOCK_STATUS? and for CACHE? I=20 > have an idea: why not to make an universal option for it in > protocol? Something like >=20 > option INFO_CMD_SIZE: > =C2=A0 uint16_t command > =C2=A0 uint32_t min_block > =C2=A0 uint32_t max_block >=20 > and server can send several such options. Most of semantics for these=20 > additional limits are common, so it will simplify both documentation > and realization.. >=20 > I'll copy this to nbd thread and it is better to discuss it in it. Yep, I've followed up on that thread, but will post a v2 of this=20 proposal along those lines. --=20 Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org