On Fri, Feb 19, 2021 at 05:04:23PM -0300, Daniel Henrique Barboza wrote: > > > On 2/16/21 11:31 PM, David Gibson wrote: > > On Thu, Feb 11, 2021 at 07:52:46PM -0300, Daniel Henrique Barboza wrote: > > > Handling errors in memory hotunplug in the pSeries machine is more complex > > > than any other device type, because there are all the complications that other > > > devices has, and more. > > > > > > For instance, determining a timeout for a DIMM hotunplug must consider if it's a > > > Hash-MMU or a Radix-MMU guest, because Hash guests takes longer to hotunplug DIMMs. > > > The size of the DIMM is also a factor, given that longer DIMMs naturally takes > > > longer to be hotunplugged from the kernel. And there's also the guest memory usage to > > > be considered: if there's a process that is consuming memory that would be lost by > > > the DIMM unplug, the kernel will postpone the unplug process until the process > > > finishes, and then initiate the regular hotunplug process. The first two > > > considerations are manageable, but the last one is a deal breaker. > > > > > > There is no sane way for the pSeries machine to determine the memory load in the guest > > > when attempting a DIMM hotunplug - and even if there was a way, the guest can start > > > using all the RAM in the middle of the unplug process and invalidate our previous > > > assumptions - and in result we can't even begin to calculate a timeout for the > > > operation. This means that we can't implement a viable timeout mechanism for memory > > > unplug in pSeries. > > > > > > Going back to why we would consider an unplug timeout, the reason is that we can't > > > know if the kernel is giving up the unplug. Turns out that, sometimes, we can. > > > Consider a failed memory hotunplug attempt where the kernel will error out with > > > the following message: > > > > > > 'pseries-hotplug-mem: Memory indexed-count-remove failed, adding any removed LMBs' > > > > > > This happens when there is a LMB that the kernel gave up in removing, and the LMBs > > > marked for removal of the same DIMM are now being added back. This process happens > > > > We need to be a little careful about terminology here. From the > > guest's point of view, there's no such thing as a DIMM, only LMBs. > > What the guest is doing here is essentially rejecting a single "index > > + number" DRC unplug request, which corresponds to one DIMM on the > > qemu side. > > I'll reword this paragraph to avoid using "DIMM" in the context of the guest > kernel. > > > > > > in the pseries kernel in [1], dlpar_memory_remove_by_ic() into dlpar_add_lmb(), and > > > after that update_lmb_associativity_index(). In this function, the kernel is configuring > > > the LMB DRC connector again. Note that this is a valid usage in LOPAR, as stated in > > > section "ibm,configure-connector RTAS Call": > > > > > > 'A subsequent sequence of calls to ibm,configure-connector with the same entry from > > > the “ibm,drc-indexes” or “ibm,drc-info” property will restart the configuration of > > > devices which were not completely configured.' > > > > > > We can use this kernel behavior in our favor. If a DRC connector reconfiguration > > > for a LMB that we marked as unplug pending happens, this indicates that the kernel > > > changed its mind about the unplug and is reasserting that it will keep using the > > > DIMM. In this case, it's safe to assume that the whole DIMM unplug was cancelled. > > > > > > This patch hops into rtas_ibm_configure_connector() and, in the scenario described > > > above, clear the unplug state for the DIMM device. This will not solve all the > > > problems we still have with memory unplug, but it will cover this case where the > > > kernel reconfigures LMBs after a failed unplug. We are a bit more resilient, > > > without using an unreliable timeout, and we didn't make the remaining error cases > > > any worse. > > > > I wonder if we could use this as a beginning of a hotplug failure > > reporting mechanism. As noted, this is explicitly allowed by PAPR and > > I think in general it makes sense that a configure-connector would > > re-assert that the guest is using the resource and we can't unplug it. > > > > I think it's worth looking into it. The kernel already does that in case of hotunplug > failure of LMBs (at least in this particular case), so it's a matter of evaluating > how hard it is to do the same for e.g. CPUs. > > > > Could we extend guests to do an indicative configure-connector on any > > unplug it knows it can't complete? Or if configure-connector is too > > disruptive could we use an (extra) H_SET_INDICATOR to "UNISOLATE" > > state? If I'm reading right, that should be both permitted and a no-op > > for existing PAPR implementations, so it should be a pretty safe way > > to add that indication. > > A quick look in LOPAR shows that set_indicator can be used to report > hotplug failures (which is a surprise to me, I wasn't aware of it): > > ----- > (Table 13.7, R1-13.5.3.4-4.) > > For all DR options: If this is a DR operation that involves the user insert- > ing a DR entity, then if the firmware can determine that the inserted entity > would cause a system disturbance, then the set-indicator RTAS call must > not unisolate the entity and must return an error status which is unique to the > particular error. > ----- > > The wording 'would cause a system disturbance' seems generic on purpose, giving > the firmware/platform the prerrogative to refuse a hotplug for any given > reason. Right. This is about firmware/platform detected errors, which is of less interest to us at the moment than OS detected errors. > I also read that there is no rule against using set_indicator with "unisolate" to > an already unisolated resource. It would be a no-op. > > I don't think we would find fierce opposition if we propose an addendum such as: > > "For all DR options: If this is a DR operation that involves the user removing > ing a DR entity, then if the partition operational system can determine that > removing the entity would cause a system disturbance, then the set-indicator RTAS > call can be used with the 'unisolate' value to inform the platform that the entity > can not be removed at that time." > > This would be enough to accomplish what we're aiming for using set_indicator and > unisolate. Then we have 2 options to signal a failed unplug - configure-connector > and unisolating the device. The guest can use whichever is easier or makes more > sense. Sounds good, let's do it. Because it's a no-op currently, I also think we can go ahead with an implementation even without waiting for a PAPR change to be approved. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson