From: Dario Faggioli
Subject: Re: [PATCH v10 8/9] libxl: vnuma nodes placement bits
Date: Fri, 12 Sep 2014 18:51:58 +0200
Message-ID: <1410540718.3531.91.camel@Solace.lan>
In-Reply-To: <1409718258-3276-6-git-send-email-ufimtseva@gmail.com>
References: <1409718258-3276-1-git-send-email-ufimtseva@gmail.com>
 <1409718258-3276-6-git-send-email-ufimtseva@gmail.com>
To: Elena Ufimtseva
Cc: keir@xen.org, Ian.Campbell@citrix.com,
 stefano.stabellini@eu.citrix.com, george.dunlap@eu.citrix.com,
 msw@linux.com, lccycc123@gmail.com, ian.jackson@eu.citrix.com,
 xen-devel@lists.xen.org, JBeulich@suse.com, Wei Liu

On Wed, 2014-09-03 at 00:24 -0400, Elena Ufimtseva wrote:
> --- a/tools/libxl/libxl_dom.c
> +++ b/tools/libxl/libxl_dom.c
> @@ -23,6 +23,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  libxl_domain_type libxl__domain_type(libxl__gc *gc, uint32_t domid)
>  {
> @@ -227,12 +228,114 @@ static void hvm_set_conf_params(xc_interface *handle, uint32_t domid,
>                          libxl_defbool_val(info->u.hvm.nested_hvm));
>  }
>  
> +/* sets vnode_to_pnode map. */
> +static int libxl__init_vnode_to_pnode(libxl__gc *gc, uint32_t domid,
> +                      libxl_domain_build_info *info)
>
Indentation.

> +{
> +    unsigned int i, n;
> +    int nr_nodes = 0;
> +    uint64_t *vnodes_mem;
> +    unsigned long long *nodes_claim = NULL;
> +    libxl_numainfo *ninfo = NULL;
> +
> +    if (info->vnuma_vnodemap == NULL) {
> +        info->vnuma_vnodemap = libxl__calloc(gc, info->vnodes,
> +                                             sizeof(*info->vnuma_vnodemap));
> +    }
> +
So, help me a bit, do you mind? :-D

If I'm not wrong, vnuma_vnodemap is NULL here (only?) if no vnode to
pnode mapping has been specified in the config file, and so you have
to allocate it for the first time. In this case, info->vnodes will be
1, so you're allocating a 1 element array (which is not wrong, I'm
just asking).

Is this correct? If yes, I repeat what I said while reviewing one of
the other patches: I think this can be done, and is better done, in
libxl__domain_build_info_setdefault(). Or am I missing something?

> +    /* default setting. */
> +    for (i = 0; i < info->vnodes; i++)
> +        info->vnuma_vnodemap[i] = LIBXC_VNUMA_NO_NODE;
> +
This too, of course. BTW, LIBXC_ ?

> +    /* Get NUMA info. */
> +    ninfo = libxl_get_numainfo(CTX, &nr_nodes);
> +    if (ninfo == NULL)
> +        return ERROR_FAIL;
>
The error reporting is ok, this time. However, I think you should
hold the error code in an rc variable, have a 'goto out;' here and,
at the 'out' label, issue the proper cleanups. There may not be much
to clean up at this very stage, but it's a pattern that is pretty
much always followed in libxl.

> +    /* Nothing to see if only one NUMA node. */
> +    if (nr_nodes <= 1)
> +        return 0;
>
And in fact, once you're here, you've got at least the
libxl_numainfo_list_free() disposal function to call. That's why I
was saying above: have an rc variable and return only at the bottom,
using a goto label.
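Something along these lines is what I mean (just a sketch, untested,
with the actual mapping logic elided):

    static int libxl__init_vnode_to_pnode(libxl__gc *gc, uint32_t domid,
                                          libxl_domain_build_info *info)
    {
        libxl_numainfo *ninfo = NULL;
        int nr_nodes = 0;
        int rc = 0;

        /* Get NUMA info. */
        ninfo = libxl_get_numainfo(CTX, &nr_nodes);
        if (ninfo == NULL) {
            rc = ERROR_FAIL;
            goto out;
        }

        /* Nothing to do if there is only one NUMA node. */
        if (nr_nodes <= 1)
            goto out;

        /*
         * ... the actual vnode to pnode fitting goes here, setting
         * rc and 'goto out'-ing on failure ...
         */

     out:
        libxl_numainfo_list_free(ninfo, nr_nodes);
        return rc;
    }

This way there is a single exit path, and whatever needs disposing of
gets disposed of in one place only.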
> +    vnodes_mem = info->vnuma_mem;
> +    /*
> +     * TODO: change algorithm. The current just fits the nodes
> +     * by its memory sizes. If no p-node found, will be used default
> +     * value of LIBXC_VNUMA_NO_NODE.
> +     */
> +    nodes_claim = libxl__calloc(gc, info->vnodes, sizeof(*nodes_claim));
> +    if ( !nodes_claim )
> +        return ERROR_FAIL;
> +
Same as above, about the 'goto out'.

However, rather than changing the algorithm, I think this last part of
the function (the one below this very comment) should just go away or,
at most, be moved somewhere else. I'll try as hard as I can to explain
how I think this code and the existing NUMA automatic placement should
interact. To achieve that, I may also consider sending some code, in
the form of a draft patch (but that will likely happen on Monday).

> +    libxl_for_each_set_bit(n, info->nodemap)
> +    {
> +        for (i = 0; i < info->vnodes; i++)
> +        {
> +            unsigned long mem_sz = vnodes_mem[i] << 20;
> +            if ((nodes_claim[n] + mem_sz <= ninfo[n].free) &&
> +                /* vnode was not set yet. */
> +                (info->vnuma_vnodemap[i] == LIBXC_VNUMA_NO_NODE) )
> +            {
> +                info->vnuma_vnodemap[i] = n;
> +                nodes_claim[n] += mem_sz;
> +            }
> +        }
> +    }
> +
So, AFAIUI, in the absence of any indication from the user, you are
trying to 'compact' things as much as possible, i.e., putting the
vnodes on the smallest possible number of pnodes. This is fine, but
it does not belong here (more details below).

> +    return 0;
> +}
> +
> +/*
> + * Builds vnode memory regions from configuration info
> + * for vnuma nodes with multiple regions.
> + */
> +static int libxl__build_vnuma_ranges(libxl__gc *gc,
> +                                     uint32_t domid,
> +                                     /* IN: mem sizes in megabytes */
> +                                     libxl_domain_build_info *b_info,
> +                                     /* OUT: linux NUMA blocks addresses */
> +                                     vmemrange_t **memrange)
>
This is going away, right?

>  int libxl__build_pre(libxl__gc *gc, uint32_t domid,
>               libxl_domain_config *d_config, libxl__domain_build_state *state)
>  {
>      libxl_domain_build_info *const info = &d_config->b_info;
>      libxl_ctx *ctx = libxl__gc_owner(gc);
>      char *xs_domid, *con_domid;
> +    struct vmemrange *memrange;
>      int rc;
>  
>      if (xc_domain_max_vcpus(ctx->xch, domid, info->max_vcpus) != 0) {
> @@ -240,6 +343,20 @@ int libxl__build_pre(libxl__gc *gc, uint32_t domid,
>          return ERROR_FAIL;
>      }
>  
> +    if (libxl__build_vnuma_ranges(gc, domid, info, &memrange) != 0) {
> +        LOG(DETAIL, "Failed to build vnuma nodes memory ranges.\n");
> +        return ERROR_FAIL;
> +
> +    }
> +
> +    /*
> +     * NUMA placement and vNUMA autoplacement handling:
> +     * If numa_placement is set to default, do not use vnode to pnode
> +     * mapping as automatic placement algorithm will find best numa nodes.
> +     * If numa_placement is not used, we can try and use domain vnode
> +     * to pnode mask.
> +     */
> +
>      /*
>       * Check if the domain has any CPU or node affinity already. If not, try
>       * to build up the latter via automatic NUMA placement. In fact, in case
> @@ -298,7 +415,33 @@ int libxl__build_pre(libxl__gc *gc, uint32_t domid,
>                                 NULL, &cpumap_soft);
>  
>          libxl_bitmap_dispose(&cpumap_soft);
> +
> +        /*
> +         * If vnode_to_pnode mask was defined, dont use it if we automatically
> +         * place domain on NUMA nodes, just give warning.
> +         */
> +        if (!libxl_defbool_val(info->vnuma_autoplacement)) {
> +            LOG(INFO, "Automatic NUMA placement for domain is turned on. \
> +                vnode to physical nodes mapping will not be used.");
> +        }
> +        if (libxl__init_vnode_to_pnode(gc, domid, info) < 0) {
> +            LOG(ERROR, "Failed to build vnode to pnode map\n");
> +            return ERROR_FAIL;
> +        }
> +    } else {
> +        if (!libxl_defbool_val(info->vnuma_autoplacement)) {
> +            if (!libxl__vnodemap_is_usable(gc, info)) {
> +                LOG(ERROR, "Defined vnode to pnode domain map cannot be used.\n");
> +                return ERROR_FAIL;
> +            }
> +        } else {
> +            if (libxl__init_vnode_to_pnode(gc, domid, info) < 0) {
> +                LOG(ERROR, "Failed to build vnode to pnode map.\n");
> +                return ERROR_FAIL;
> +            }
> +        }
>      }
> +
The additional vnuma_autoplacement bool switch is going away, so this
is changing a lot in the next version, and I'll comment directly on
that.

About the interactions between automatic placement and vNUMA, here's
what I think. There should exist the following situations:

 1) the user does not specify *any* vNUMA topology (or explicitly asks
    for 1 node, which is pointless, but possible) in the config file;
 2) the user specifies a vNUMA topology, and also a vnode to pnode
    map, in the config file;
 3) the user specifies a vNUMA topology, but _not_ a vnode to pnode
    map, in the config file.

Am I missing anything?

In all of the above cases, I think libxc should always receive a well
filled and consistent dom->vnode_to_pnode structure, and build the
domain image according to that. What changes is where and how we
populate such a vnode_to_pnode structure. Here's how I'd proceed:

 1) nothing to do, wrt vNUMA. In my opinion,
    libxl__domain_build_info_setdefault() should already set all the
    vNUMA libxl fields to match this situation, and we should not
    touch them in this case. libxl__dom_vnuma_init() (see below) will
    do the libxl-->libxc 'translation', and we're done. Note that this
    does not have any relationship or dependency with automatic NUMA
    placement, and with the existing numa_placement boolean build_info
    switch. That may or may not be enabled, for reasons vNUMA does not
    have to care about or deal with!

 2) quite similar to 1: we have to do something, but nothing too fancy
    or complicated. In fact, the values of the vNUMA libxl fields are
    again already set to what we want to just pass down to libxc...
    The only difference is that they've been set by xl, rather than by
    libxl__domain_build_info_setdefault(). :-)
    The only *practical* difference is that, as soon as we recognize
    that we're being explicitly passed a vnode to pnode map, we should
    disable automatic placement. That can easily happen by means of
    the existing boolean numa_placement switch in build_info. What I'd
    do is set it to false from within xl, in the same code branch
    where the vnode_to_pnode config entry is parsed (look at what
    happens when, in xl, we find a hard or soft affinity being
    explicitly set); there's a sketch of this right after the list.

 3) in this case, since the user has not been explicit about vnode to
    pnode placement, we need to figure that out ourselves, and that is
    the job of NUMA automatic placement. What we hence do is leave the
    automatic placement enabled, and modify the algorithm in order for
    it to:
     - take into account the vNUMA topology, when looking for an
       optimal placement solution,
     - provide a valid vnode_to_pnode map, as an output.
    At that point, the resulting vnode_to_pnode map will be passed
    down to libxc, as in all the previous cases.
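Just to show what I mean for 2), here's a sketch of the xl side
(untested; I'm making up the "vnuma_vnodemap" config key and the
variable names, what matters is only the last statement):

    /* in xl's config parsing, where the vnode to pnode map is handled */
    XLU_ConfigList *vnodemap;
    int num;

    if (!xlu_cfg_get_list(config, "vnuma_vnodemap", &vnodemap, &num, 0)) {
        /* ... parse the list into b_info->vnuma_vnodemap ... */

        /*
         * An explicit vnode to pnode map means the user has already
         * decided about placement: turn automatic placement off,
         * like we do when hard or soft affinity is specified.
         */
        libxl_defbool_set(&b_info->numa_placement, false);
    }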
I understand that this may look complex, but this is probably one of
the cases where the explanation is more complex than the code itself!
:-P In any case, I felt it was important to provide my view on how
this should be implemented.

I also appreciate that modifying the automatic placement logic,
especially, may be a bit hard for someone who is not me. So, if Elena
wants to have a go at this, please, be my guest. :-) In case I don't
see anything on xen-devel, as I said, I'll hack up a prototype on
Monday, and send it to the list.

> +/*
> + * Function fills the xc_dom_image with memory sizes for later
> + * use in domain memory allocator. No regions here are being
> + * filled and used in allocator as the regions do belong to one node.
> + */
> +static int libxl__dom_vnuma_init(struct libxl_domain_build_info *info,
> +                                 struct xc_dom_image *dom)
> +{
> +    errno = ERROR_INVAL;
> +
> +    if (info->vnodes == 0)
> +        return -1;
> +
> +    info->vmemranges = info->vnodes;
> +
> +    dom->vnode_to_pnode = (unsigned int *)malloc(
> +        info->vnodes * sizeof(*info->vnuma_vnodemap));
> +    dom->numa_memszs = (uint64_t *)malloc(
> +        info->vnodes * sizeof(*info->vnuma_mem));
> +
This is still libxl, so I think libxl__malloc() and friends are still
the right calls to use. If you don't want the memory you're
allocating to be garbage collected automatically, just use NOGC.

> +    errno = ERROR_FAIL;
> +    if ( dom->numa_memszs == NULL || dom->vnode_to_pnode == NULL ) {
> +        info->vnodes = 0;
> +        if (dom->vnode_to_pnode)
> +            free(dom->vnode_to_pnode);
> +        if (dom->numa_memszs)
> +            free(dom->numa_memszs);
> +        return -1;
> +    }
>
No errno, and return only libxl error codes in libxl. However, if you
use the proper libxl__*alloc() variant, error handling will just go
away! :-)
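Something like this, I mean (a sketch again; since the libxl
allocators abort on allocation failure rather than returning NULL,
the checks and the errno fiddling just disappear):

    static int libxl__dom_vnuma_init(struct libxl_domain_build_info *info,
                                     struct xc_dom_image *dom)
    {
        if (info->vnodes == 0)
            return ERROR_INVAL;

        info->vmemranges = info->vnodes;

        /*
         * NOGC: this memory is handed over to libxc, so it must not
         * be tied to a gc and freed behind our back.
         */
        dom->vnode_to_pnode = libxl__calloc(NOGC, info->vnodes,
                                            sizeof(*dom->vnode_to_pnode));
        dom->numa_memszs = libxl__calloc(NOGC, info->vnodes,
                                         sizeof(*dom->numa_memszs));

        /* ... then the two memcpy()s and the dom->vnodes assignment,
         * as in the quoted hunk below ... */

        return 0;
    }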
> +    memcpy(dom->numa_memszs, info->vnuma_mem,
> +           sizeof(*info->vnuma_mem) * info->vnodes);
> +    memcpy(dom->vnode_to_pnode, info->vnuma_vnodemap,
> +           sizeof(*info->vnuma_vnodemap) * info->vnodes);
> +
> +    dom->vnodes = info->vnodes;
> +
> +    return 0;
> +}
> +

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)