On Mon, 2015-02-16 at 17:38 +0000, Wei Liu wrote:
> On Mon, Feb 16, 2015 at 04:51:43PM +0000, Dario Faggioli wrote:
> > So, if you're saying that, if we use a bitmap, we should write
> > somewhere how libxl would use it, I certainly agree. Up to what
> > level of detail we should do that, at that point, I'm not sure. I
> > think I'd be fine, as a user, with finding it written that "the
> > memory of the vnode will be allocated out of the pnodes specified
> > in the bitmap", with not much further detail, especially
> > considering the use case for the feature.
> >
> This is of course OK. And the simplest implementation of this
> strategy is to pass the node information on to Xen and let Xen
> decide which of the several specified nodes to allocate from. This
> would be trivial.
>
Exactly. This is, by the way, what happens right now if the user
explicitly sets a hard or a soft affinity for the vcpus that spans
multiple nodes. The domain's node affinity is dynamically calculated
to be, say, 0110, and memory is striped across nodes #1 and #2.

The end result is, most of the time, a (nearly) even distribution
but, for instance, if we run out of free RAM on #1, the allocation
will continue on #2, making things uneven, yet still in line with
what the user asked for. That behaviour would hence be consistent
with the already existing, non-vNUMA case.

> I think having a vnode able to map to several pnodes would be good.
> I'm just trying to figure out if a single bitmap is enough to cover
> all the sensible use cases, and what we should say about that
> interface.
>
Well, ideally, you could turn the pnode map into another 'nested
list', making it possible to specify, for vnode #2, not only that you
want the memory from pnodes #0 and #4, but that you want 0.5G from
the former and 1.5G from the latter. However, this:
 - makes the interface very complicated to specify, understand and
   parse;
 - requires non-trivial changes inside Xen, which are of course
   possible, but are they worth it?
 - brings very few benefits, as you won't have fine-grained enough
   control of what that 0.5G of memory is comprised of (it's all
   within the same vnode!!), nor of what the guest OS will put there,
   e.g., across reboots. So, really, rather useless, IMO.

So, yes, I think having a bitmap would be enough for now (for a
while, actually! :-D)
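To make this concrete, here is a minimal sketch, built on the
existing libxl_bitmap helpers, of how such a per-vnode pnode bitmap
could be constructed. The function name and the very idea of a
"pnodes" bitmap field are illustrative assumptions only (the series,
as it stands, uses a single integer pnode):

    #include <libxl.h>
    #include <libxl_utils.h>

    /* Illustrative only: mark pnodes #0 and #4 as the candidate
     * allocation targets for one vnode.  Xen would then be free to
     * stripe the vnode's memory over whichever of the set pnodes
     * still has free RAM. */
    static int set_vnode_pnodes(libxl_ctx *ctx, libxl_bitmap *pnodes,
                                int nr_pnodes)
    {
        libxl_bitmap_init(pnodes);
        if (libxl_bitmap_alloc(ctx, pnodes, nr_pnodes))
            return -1;
        libxl_bitmap_set(pnodes, 0); /* allow allocation from pnode #0 */
        libxl_bitmap_set(pnodes, 4); /* ... and from pnode #4 */
        return 0;
    }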
> > As I said, if there is only 1GB free on all pnodes, the user will
> > be allowed to specify a set of pnodes for the vnodes, instead of
> > not being able to use vNUMA at all, no matter how libxl (or
> > whoever else) will actually split the memory, in this, previous or
> > future versions of Xen... This is the scenario I'm talking about,
> > and in such a scenario, knowing how the split happens does not
> > really help much; it is just the _possibility_ of splitting that
> > helps...
> >
> I don't see this problem that way, though.
>
> Basically you're saying, a user wants to use vNUMA, then at some
> point he / she finds out there is not enough memory in each
> specified pnode to accommodate his / her requirement, then he / she
> changes the configuration on the fly.
>
Mmm... no, I wasn't thinking about 'on the fly' changes. But perhaps
I don't get what you mean by 'on the fly'.

> In reality, if you have mass deployment you probably won't do that.
> You might just want to use the same configuration all the time.
>
Yes, and given you have the same VMs every time, and especially if
deployment also happens in the same (or a very similar) order every
time, you'll *always* end up with only 1G left free on all pnodes,
and hence you will *never* be able to turn on vnode-to-pnode mapping
for that last VM you are deploying, the one with 2G vnodes.

> > > > If we allow the user (or the automatic placement algorithm) to
> > > > specify a bitmap of pnodes for each vnode, he could put, say,
> > > > vnode #1 on pnodes #0 and #2, which maybe are really close (in
> > > > terms of NUMA distances) to each other, and vnode #2 on pnodes
> > > > #5 and #6 (close to each other too). This would give worse
> > > > performance than having each vnode on just one pnode but, most
> > > > likely, better performance than the scenario described right
> > > > above.
> > > >
> > > I get what you mean. So by writing the above paragraphs, you
> > > sort of confirm that there still are too many implications in
> > > the algorithms, right? A user cannot just tell from the
> > > interface what the behaviour is going to be.
> > >
> > A user can tell that, if he wants a vnode 2GB wide, and there is
> > no pnode with 2GB free, but the sum of the free memory in pnodes
> > #4 and #6 is >= 2GB, he can still use vNUMA, by paying the (small
> > or high, depending on more factors) price of having that vnode
> > split in two (or more!).
> >
> What if #4 and #6 do have > 2GB of RAM each? What will the
> behaviour be?
>
Xen decides, while allocating, as it does right now.

> Does it imply better or worse performance? Again, I'm thinking
> about migrating the same configuration to another version of Xen,
> or even just to another host that has enough memory.
>
> I guess the best we can say (at this point, if we're to use a
> bitmap) is that memory will be allocated from the nodes specified,
> and the user should not expect any specific behaviour -- that
> basically is telling the user not to specify multiple nodes...
>
So, vNUMA is a performance optimization. Actually, no: vNUMA is a
feature someone may want independently from performance, but mapping
vNUMA nodes to pNUMA nodes is *definitely* a performance
optimization.

In my experience, performance depends on a lot of things and factors.
There are very few features that just "give better or worse
performance". There are some, sure, but I don't think vNUMA falls in
this category. For instance, in this case, performance will depend on
the workload, on the host load conditions, on the order in which you
boot the VMs, and on other constraints (e.g., whether or not you use
devices attached to different IO controllers on IONUMA boxes), and
that is the case already.

We should work hard not to cause performance regressions, i.e., not
to make things worse for people just upgrading Xen and not changing
anything else; but, in this case, not changing *anything* *else*
means taking into account all the factors above (or more!). Then yes,
if one really manages to keep all the involved variables and factors
fixed, just upgrading Xen should not degrade performance. Actually,
it would ideally make things better... isn't that what we're supposed
to do here all day? :-P :-P

IOW, in this case, if one wants maximum determinism, he can just set
_only_ one bit in the bitmap and forget about it. OTOH, if one can
trade a slight (potential) degradation in performance (which of
course also means less determinism) for increased flexibility, the
bitmap would help.
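And, to illustrate the "only one bit" end of that trade-off, here is
a rough sketch, with a made-up function name, of how a caller like xl
could enforce a weight of 1 on the bitmap while the libxl API itself
keeps carrying a bitmap. Again, this is only an assumption of how it
might look, not code from the series:

    #include <stdio.h>
    #include <libxl.h>
    #include <libxl_utils.h>

    /* Illustrative only: if more than one pnode is set for a vnode,
     * keep just the first set bit and warn, i.e., enforce the rigid
     * one-pnode semantics at the xl level while the API stays a
     * bitmap. */
    static void enforce_single_pnode(libxl_bitmap *pnodes)
    {
        int bit, first = -1;

        if (libxl_bitmap_count_set(pnodes) <= 1)
            return;

        libxl_for_each_set_bit(bit, *pnodes) {
            if (first < 0)
                first = bit;                     /* keep the first pnode */
            else
                libxl_bitmap_reset(pnodes, bit); /* drop the others */
        }
        fprintf(stderr,
                "WARNING: multiple pnodes specified, using only #%d\n",
                first);
    }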
Allow me another example: migration, since you mention it yourself.

Assume we have vnode #0 mapped on pnode #2 and vnode #1 mapped on
pnode #4 on host X, with the distance between these pnodes being 20.
We migrate the VM to host Y, and there is not enough free memory on
any couple of pnodes at distance 20, so we pick two at distance 40:
this will alter the performance.

What if there is free memory in two such pnodes, because they're
bigger on Y than on X, but there happens to be more contention on the
pcpus of those nodes on Y than on X (as a consequence of them being
bigger, or because on Y there are more domains that are small
memory-wise but have more vcpus, or ...)? This will alter the
performance (making things worse). What if there is less contention?
This will alter the performance (making things better). What if there
are not even two pnodes with enough free memory, no matter the
distance, and we need to turn vnode-to-pnode mapping off? This will
alter the performance. What if the new host is non-NUMA? This will
alter the performance.

So, in summary: vnode-to-pnode mapping is helping performance for
your workload? Jolly good. You migrate the VM(s)? All bets are off!
You upgrade Xen? Well, if you don't change anything else, things
should be equal or better; if you change something else, there is the
chance that you will need to re-evaluate the performance and adapt
the workload... but that's the case already, no matter whether a
bitmap or an integer is used.

> > I think there would be room for some increased user satisfaction
> > in this, even without knowing much and/or being in control of how
> > exactly the split happens, as there are chances for performance to
> > be (if the thing is used properly) better than in the no-vNUMA
> > case, which is what we're after.
> >
> > > You can of course say the algorithm is fixed but I don't think
> > > we want to do that?
> > >
> > I don't want to, but I don't think it's needed.
> >
> > Anyway, I'm more than OK if we want to defer the discussion to
> > after this series is in. It will require a further change in the
> > interface, but I don't think that would be a terrible price to
> > pay, if we decide the feature is worth it.
> >
> > Or, and that was the other thing I was suggesting, we can have the
> > bitmap in vnode_info from now on, but then only accept ints in the
> > xl config parsing, and enforce the weight of the bitmap to be 1
> > (and perhaps print a warning) for now. This would not require
> > changing the API in future; it would just be a matter of changing
> > the xl config file parsing. The "problem" would still stand for
> > libxl callers other than xl, though, I know.
> >
> Note that the uint32_t mapping has a very rigid semantics.
>
It has. Whether or not this rigidity buys you more consistent
performance, for example across migrations, is questionable though,
and my opinion is that no, it does not.

> As long as you give me a well-defined semantics of that bitmap I'm
> fine with this. Otherwise I feel more comfortable with the
> interface as it is.
>
The well-defined semantics is: <>

BTW, sorry for being so long... I at least hope I've expressed my
view clearly enough. Let me also repeat that I'm fine leaving this
alone and (perhaps) coming back to it later, when the series is
merged.

Thanks and Regards,
Dario