From mboxrd@z Thu Jan 1 00:00:00 1970 From: "Chen, Tiejun" Subject: Re: [RFC][PATCH 04/13] tools/libxl: detect and avoid conflicts with RDM Date: Mon, 18 May 2015 09:06:04 +0800 Message-ID: <55593AFC.60507@intel.com> References: <1428657724-3498-1-git-send-email-tiejun.chen@intel.com> <1428657724-3498-5-git-send-email-tiejun.chen@intel.com> <20150508144343.GO3848@zion.uk.xensource.com> <555063D1.7030907@intel.com> <20150511113247.GC30997@zion.uk.xensource.com> <55545C81.8070509@intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii"; Format="flowed" Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <55545C81.8070509@intel.com> List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org To: Wei Liu Cc: kevin.tian@intel.com, ian.campbell@citrix.com, andrew.cooper3@citrix.com, tim@xen.org, xen-devel@lists.xen.org, stefano.stabellini@citrix.com, JBeulich@suse.com, yang.z.zhang@intel.com, Ian.Jackson@eu.citrix.com List-Id: xen-devel@lists.xenproject.org Wei, Any comments? Thanks Tiejun On 2015/5/14 16:27, Chen, Tiejun wrote: > On 2015/5/11 19:32, Wei Liu wrote: >> On Mon, May 11, 2015 at 04:09:53PM +0800, Chen, Tiejun wrote: >>> On 2015/5/8 22:43, Wei Liu wrote: >>>> Sorry for the late review. This series fell through the crack. >>>> >>> >>> Thanks for your review. >>> >>>> On Fri, Apr 10, 2015 at 05:21:55PM +0800, Tiejun Chen wrote: >>>>> While building a VM, HVM domain builder provides struct >>>>> hvm_info_table{} >>>>> to help hvmloader. Currently it includes two fields to construct guest >>>>> e820 table by hvmloader, low_mem_pgend and high_mem_pgend. So we >>>>> should >>>>> check them to fix any conflict with RAM. >>>>> >>>>> RMRR can reside in address space beyond 4G theoretically, but we never >>>>> see this in real world. So in order to avoid breaking highmem layout >>>> >>>> How does this break highmem layout? >>> >>> In most cases highmen is always continuous like [4G, ...] but RMRR is >>> theoretically beyond 4G but very rarely. Especially we don't see this >>> happened in real world. So we don't want to such a case breaking the >>> highmem. >>> >> >> The problem is we take this approach just because this rarely happens >> *now* is not future proof. It needs to be clearly documented somewhere >> in the manual (or any other Intel docs) and be referenced in the code. >> Otherwise we might end up in a situation that no-one knows how it is >> supposed to work and no-one can fix it if it breaks in the future, that >> is, every single device on earth requires RMRR > 4G overnight (I'm >> exaggerating). >> >> Or you can just make it works with highmem. How much more work do you >> envisage? >> >> (If my above comment makes no sense do let me know. I only have very >> shallow understanding of RMRR) > > Maybe I'm misleading you :) > > I don't mean RMRR > 4G is not allowed to work in our implementation. > What I'm saying is that our *policy* is just simple for this kind of > rare highmem case... > >> >>>> >>>>> we don't solve highmem conflict. Note this means highmem rmrr could >>>>> still >>>>> be supported if no conflict. >>>>> > > Like these two sentences above. > >>>> >>>> Aren't you actively trying to avoid conflict in libxl? >>> >>> RMRR is fixed by BIOS so we can't aovid conflict. Here we want to >>> adopt some >>> good policies to address RMRR. In the case of highmemt, that simple >>> policy >>> should be enough? >>> >> >> Whatever policy you and HV maintainers agree on. Just clearly document >> it. > > Do you mean I should brief this patch description into one separate > document? > >> >>>> >>>>> But in the case of lowmem, RMRR probably scatter the whole RAM space. >>>>> Especially multiple RMRR entries would worsen this to lead a >>>>> complicated >>>>> memory layout. And then its hard to extend hvm_info_table{} to work >>>>> hvmloader out. So here we're trying to figure out a simple solution to >>>>> avoid breaking existing layout. So when a conflict occurs, >>>>> >>>>> #1. Above a predefined boundary (default 2G) >>>>> - move lowmem_end below reserved region to solve conflict; >>>>> >>>> >>>> I hope this "predefined boundary" is user tunable. I will check >>>> later in >>>> this patch if it is the case. >>> >>> Yes. As we clarified in that comments, >>> >>> * TODO: Its batter to provide a config parameter for this boundary >>> value. >>> >>> This mean I would provide a patch address this since currently I just >>> think >>> this is not a big deal? >>> >> >> Yes please provide a config option to override that. It's reasonable >> that user wants to change that. > > Okay. > >> >>>> >>>>> #2 Below a predefined boundary (default 2G) >>>>> - Check force/try policy. >>>>> "force" policy leads to fail libxl. Note when both policies >>>>> are specified on a given region, 'force' is always preferred. >>>>> "try" policy issue a warning message and also mask this >>>>> entry INVALID >>>>> to indicate we shouldn't expose this entry to hvmloader. >>>>> >>>>> Signed-off-by: Tiejun Chen >>>>> --- >>>>> tools/libxc/include/xenctrl.h | 8 ++ >>>>> tools/libxc/include/xenguest.h | 3 +- >>>>> tools/libxc/xc_domain.c | 40 +++++++++ >>>>> tools/libxc/xc_hvm_build_x86.c | 28 +++--- >>>>> tools/libxl/libxl_create.c | 2 +- >>>>> tools/libxl/libxl_dm.c | 195 >>>>> +++++++++++++++++++++++++++++++++++++++++ >>>>> tools/libxl/libxl_dom.c | 27 +++++- >>>>> tools/libxl/libxl_internal.h | 11 ++- >>>>> tools/libxl/libxl_types.idl | 7 ++ >>>>> 9 files changed, 303 insertions(+), 18 deletions(-) >>>>> >>>>> diff --git a/tools/libxc/include/xenctrl.h >>>>> b/tools/libxc/include/xenctrl.h >>>>> index 59bbe06..299b95f 100644 >>>>> --- a/tools/libxc/include/xenctrl.h >>>>> +++ b/tools/libxc/include/xenctrl.h >>>>> @@ -2053,6 +2053,14 @@ int xc_get_device_group(xc_interface *xch, >>>>> uint32_t *num_sdevs, >>>>> uint32_t *sdev_array); >>>>> >>>>> +struct xen_reserved_device_memory >>>>> +*xc_device_get_rdm(xc_interface *xch, >>>>> + uint32_t flag, >>>>> + uint16_t seg, >>>>> + uint8_t bus, >>>>> + uint8_t devfn, >>>>> + unsigned int *nr_entries); >>>>> + >>>>> int xc_test_assign_device(xc_interface *xch, >>>>> uint32_t domid, >>>>> uint32_t machine_sbdf); >> >> [...] >> >>>> >>>>> uint64_t mem_target; /* Memory target in bytes. */ >>>>> uint64_t mmio_size; /* Size of the MMIO hole in >>>>> bytes. */ >>>>> const char *image_file_name; /* File name of the image to >>>>> load. */ >>>>> diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c >>>>> index 4f8383e..85b18ea 100644 >>>>> --- a/tools/libxc/xc_domain.c >>>>> +++ b/tools/libxc/xc_domain.c >>>>> @@ -1665,6 +1665,46 @@ int xc_assign_device( >>>>> return do_domctl(xch, &domctl); >>>>> } >>>>> >>>>> +struct xen_reserved_device_memory >>>>> +*xc_device_get_rdm(xc_interface *xch, >>>>> + uint32_t flag, >>>>> + uint16_t seg, >>>>> + uint8_t bus, >>>>> + uint8_t devfn, >>>>> + unsigned int *nr_entries) >>>>> +{ >>>>> + struct xen_reserved_device_memory *xrdm = NULL; >>>>> + int rc = xc_reserved_device_memory_map(xch, flag, seg, bus, >>>>> devfn, xrdm, >>>>> + nr_entries); >>>>> + >>>>> + if ( rc < 0 ) >>>>> + { >>>>> + if ( errno == ENOBUFS ) >>>>> + { >>>>> + if ( (xrdm = malloc(*nr_entries * >>>>> + >>>>> sizeof(xen_reserved_device_memory_t))) == NULL ) >>>>> + { >>>>> + PERROR("Could not allocate memory."); >>>>> + goto out; >>>>> + } >>>> >>>> Don't you leak origin xrdm in this case? >>> >>> The caller to xc_device_get_rdm always frees this. >>> >> >> I think I misunderstood how this function works. I thought xrdm was >> passed in by caller, which is clearly not the case. Sorry! >> >> >> In that case, the `if ( rc < 0 )' is not needed because the call should >> always return rc < 0. An assert is good enough. > > assert(rc < 0)? But we can't presume the user always pass '0' firstly, > and additionally, we may have no any RMRR indeed. > > So I guess what you want is, > > assert( rc <=0 ); > if ( !rc ) > goto xrdm; > > if ( errno == ENOBUFS ) > ... > > Right? > >> >>>> >>>> And, this style is not very good. Shouldn't the caller allocate enough >>>> memory before hand? >>> >>> Are you saying the caller to xc_device_get_rdm()? If so, any caller >>> don't >>> know this, too. >>> >> >> I see. >> >>> Actually this is just a wrapper of that fundamental hypercall, >>> xc_reserved_device_memory_map() in patch #2, and based on that, we >>> always >>> have to first call this to inquire how much memory we really need. >>> And this >>> is why we have this wrapper since we don't want to duplicate more codes. >>> >>> One error handler of this wrapper is just handling ENOBUFS since the >>> caller >>> never know how much memory we should allocate. So oftentimes we >>> always set >>> 'entries = 0' to inquire firstly. >>> >>> Here Jan suggested we may need to figure out a good way to consolidate >>> xc_reserved_device_memory_map() and its wrapper, xc_device_get_rdm(). >>> >>> But in some ways, that wrapper likes a static function so we just >>> need to >>> move this into that file associated to its caller, right? >>> >> >> Yes, if there is only one user at the moment, make a static function. > > Thanks. > >> >>>> >>>>> + rc = xc_reserved_device_memory_map(xch, flag, seg, >>>>> bus, devfn, xrdm, >>>>> + nr_entries); >>>>> + if ( rc ) >>>>> + { >>>>> + PERROR("Could not get reserved device memory maps."); >>>>> + free(xrdm); >>>>> + xrdm = NULL; >>>>> + } >>>>> + } >>>>> + else { >>>>> + PERROR("Could not get reserved device memory maps."); >>>>> + } >>>>> + } >>>>> + >>>>> + out: >>>>> + return xrdm; >>>>> +} >>>>> + >>>>> int xc_get_device_group( >>>>> xc_interface *xch, >>>>> uint32_t domid, >>>>> diff --git a/tools/libxc/xc_hvm_build_x86.c >>>>> b/tools/libxc/xc_hvm_build_x86.c >>>>> index c81a25b..3f87bb3 100644 >>>>> --- a/tools/libxc/xc_hvm_build_x86.c >>>>> +++ b/tools/libxc/xc_hvm_build_x86.c >>>>> @@ -89,19 +89,16 @@ static int modules_init(struct >>>>> xc_hvm_build_args *args, >>>>> } >>>>> >>>>> static void build_hvm_info(void *hvm_info_page, uint64_t mem_size, >>>>> - uint64_t mmio_start, uint64_t mmio_size) >>>>> + uint64_t lowmem_end) >>>>> { >>>>> struct hvm_info_table *hvm_info = (struct hvm_info_table *) >>>>> (((unsigned char *)hvm_info_page) + HVM_INFO_OFFSET); >>>>> - uint64_t lowmem_end = mem_size, highmem_end = 0; >>>>> + uint64_t highmem_end = 0; >>>>> uint8_t sum; >>>>> int i; >>>>> >>>>> - if ( lowmem_end > mmio_start ) >>>>> - { >>>>> - highmem_end = (1ull<<32) + (lowmem_end - mmio_start); >>>>> - lowmem_end = mmio_start; >>>>> - } >>>>> + if ( mem_size > lowmem_end ) >>>>> + highmem_end = (1ull<<32) + (mem_size - lowmem_end); >>>>> >>>>> memset(hvm_info_page, 0, PAGE_SIZE); >> >> [...] >> >>>> >>>>> + struct xc_hvm_build_args *args) >>>> >>>> This function does more than "checking", so a better name is needed. >>>> >>>> May be you should split this function to one "build" function and one >>>> "check" function? What do you think? >>> >>> We'd better keep this big one since this can make our policy >>> understandable, >>> but I agree we need to rename this like, >>> >>> libxl__domain_device_construct_rdm() >>> >>> construct = check + build :) >> >> I'm fine with this. > > Thanks. > >> >>> >>>> >>>>> +{ >>>>> + int i, j, conflict; >>>>> + libxl_ctx *ctx = libxl__gc_owner(gc); >>>> >>>> You can just use CTX macro. >> >> [...] >> >>>> >>>>> + if ((type == LIBXL_RDM_RESERVE_TYPE_NONE) && >>>>> !d_config->num_pcidevs) >>>>> + return 0; >>>>> + >>>>> + /* Collect all rdm info if exist. */ >>>>> + xrdm = xc_device_get_rdm(ctx->xch, LIBXL_RDM_RESERVE_TYPE_HOST, >>>>> + 0, 0, 0, &nr_all_rdms); >>>>> + if (!nr_all_rdms) >>>>> + return 0; >>>>> + d_config->rdms = libxl__calloc(gc, nr_all_rdms, >>>>> + sizeof(libxl_device_rdm)); >>>> >>>> Note that if you use "gc" here the allocated memory will be, well, >>>> garbage collected at some point. If you don't want them to be gc'ed you >>>> should use NOGC. >>> >>> Sorry, what does that mean by 'garbage collected'? >>> >> >> That means the memory allocated with gc will be freed at some point by >> GC_FREE, because those memory regions are meant to be temporary and used >> internally. >> >> When entering a libxl public API function (those start wtih >> libxl_), that function calls GC_INIT to initialise a garbage collector. >> When that function exits, it calls GC_FREE to free all memory allocated >> with that gc. > > Thanks for sharing this good info to me! > >> >> Since d_config is very likely to be used by libxl user (xl, libvirt >> etc), you probably don't want to fill it with gc allocated memory. >> > > Yes. > >>>> >>>>> + memset(d_config->rdms, 0, sizeof(libxl_device_rdm)); >>>>> + >>>>> + /* Query all RDM entries in this platform */ >>>>> + if (type == LIBXL_RDM_RESERVE_TYPE_HOST) { >>>>> + d_config->num_rdms = nr_all_rdms; >>>>> + for (i = 0; i < d_config->num_rdms; i++) { >>>>> + d_config->rdms[i].start = >>>>> + (uint64_t)xrdm[i].start_pfn << >>>>> XC_PAGE_SHIFT; >>>>> + d_config->rdms[i].size = >>>>> + (uint64_t)xrdm[i].nr_pages << >>>>> XC_PAGE_SHIFT; >>>>> + d_config->rdms[i].flag = d_config->b_info.rdm.reserve; >>>>> + } >>>>> + } else { >>>>> + d_config->num_rdms = 0; >>>>> + } >>>> >>>> And you should move d_config->rdms = libxl__calloc inside that `if'. >>>> That is, don't allocate memory if you don't need it. >>> >>> We can't since in all cases we need to preallocate this, and then we >>> will >>> handle this according to our policy. >>> >> >> How would it ever be used again if you set d_config->num_rdms to 0? How >> do you know the exact size of your array again? > > If we don't consider multiple devices shares one rdm entry, our workflow > can be showed as follows: > > #1. We always preallocate all rdms[] but with memset(). > #2. Then we have two cases for that global rule, > > #2.1 If we really need all according to our global rule, we would set > all rdms[] with all real rdm info and set d_config->num_rdms. > #2.2 If we have no such a global rule to obtain all, we just clear > d_config->num_rdms. > > #3. Then for per device rule > > #3.1 From #2.1, we just need to check if we should change one given rdm > entry's policy if this given entry is just associated to this device. > #3.2 From 2.2, obviously we just need to fill rdms one by one. Of > course, its very possible that we don't fill all rdms since all > passthroued devices might not have no rdm at all or they just occupy > some. But anyway, finally we sync d_config->num_rdms. > >> >>>> >>>>> + free(xrdm); >>>>> + >>>>> + /* Query RDM entries per-device */ >>>>> + for (i = 0; i < d_config->num_pcidevs; i++) { >>>>> + unsigned int nr_entries = 0; >>> >>> Maybe I should restate this, >>> unsigned int nr_entries; >>> >>>>> + bool new = true; >>>>> + seg = d_config->pcidevs[i].domain; >>>>> + bus = d_config->pcidevs[i].bus; >>>>> + devfn = PCI_DEVFN(d_config->pcidevs[i].dev, >>>>> d_config->pcidevs[i].func); >>>>> + nr_entries = 0; >>>> >>>> You've already initialised this variable. >>> >>> We need to set this as zero to start. >>> >> >> Either of the tow works for me. Just don't want redundant >> initialisation. > > Right. > >> >>>> >>>>> + xrdm = xc_device_get_rdm(ctx->xch, >>>>> LIBXL_RDM_RESERVE_TYPE_NONE, >>>>> + seg, bus, devfn, &nr_entries); >>>>> + /* No RDM to associated with this device. */ >>>>> + if (!nr_entries) >>>>> + continue; >>>>> + >>>>> + /* Need to check whether this entry is already saved in >>>>> the array. >>>>> + * This could come from two cases: >>>>> + * >>>>> + * - user may configure to get all RMRRs in this >>>>> platform, which >>>>> + * is already queried before this point >>>> >>>> Formatting. >>> >>> Are you saying this? >> >> I mean you need to move "is already..." to the right go align with >> previous line. > > Fixed. > >> >>> >>> + /* Need to check whether this entry is already saved in the >>> array. >>> >>> => >> >> The CODING_STYLE in libxl doesn't seem to enforce this, so you can just >> follow other examples. >> >>> /* >>> >>> * Need to check whether this entry is already saved in the >>> array. >>> * This could come from two cases: >>> >>>> >>>>> + * - or two assigned devices may share one RMRR entry >>>>> + * >>>>> + * different policies may be configured on the same RMRR >>>>> due to above >>>>> + * two cases. We choose a simple policy to always favor >>>>> stricter policy >>>>> + */ >>>>> + for (j = 0; j < d_config->num_rdms; j++) { >>>>> + if (d_config->rdms[j].start == >>>>> + (uint64_t)xrdm[0].start_pfn << >>>>> XC_PAGE_SHIFT) >>>>> + { >>>>> + if (d_config->rdms[j].flag != >>>>> LIBXL_RDM_RESERVE_FLAG_FORCE) >>>>> + d_config->rdms[j].flag = >>>>> d_config->pcidevs[i].rdm_reserve; >>>>> + new = false; >>>>> + break; >>>>> + } >>>>> + } >>>>> + >>>>> + if (new) { >>>>> + if (d_config->num_rdms > nr_all_rdms - 1) { >>>>> + LIBXL__LOG(CTX, LIBXL__LOG_ERROR, "Exceed rdm >>>>> array boundary!\n"); >>>> >>>> LOG(ERROR, ...) >>> >>> Fixed. >>> >>>> >>>>> + free(xrdm); >>>>> + return -1; >>>> >>>> Please use goto out idiom. >>> >>> We just have two 'return -1' differently so I'm not sure its worth doing >>> this. >>> >> >> Yes, please comply with libxl idiom. >> >>>> >>>>> + } >>>>> + >>>>> + /* >>>>> + * This is a new entry. >>>>> + */ >>>> >>>> /* This is a new entry. */ >>> >>> Fixed. >>> >>>> >>>>> + d_config->rdms[d_config->num_rdms].start = >>>>> + (uint64_t)xrdm[0].start_pfn << >>>>> XC_PAGE_SHIFT; >>>>> + d_config->rdms[d_config->num_rdms].size = >>>>> + (uint64_t)xrdm[0].nr_pages << >>>>> XC_PAGE_SHIFT; >>>>> + d_config->rdms[d_config->num_rdms].flag = >>>>> d_config->pcidevs[i].rdm_reserve; >>>>> + d_config->num_rdms++; >>>> >>>> Does this work? I don't see you reallocate memory. >>> >>> Like I replied above we always preallocate this at the beginning. >>> >> >> Ah, OK. >> >> But please don't do this. It's hard to see you don't overrun the >> buffer. Please allocate memory only when you need it. > > Sorry I don't understand this. As I mention above, we don't know how > many rdm entries we really need to allocate. So this is why we'd like to > preallocate all rdms at the beginning. Then this can cover all cases, > global policy, (global policy & per device) and only per device. Even if > multiple devices shares one rdm we also need to avoid duplicating a new... > > >> >>>> >>>>> + } >>>>> + free(xrdm); >>>> >>>> Bug: you free xrdm several times. >>> >>> No any conflict. >>> >>> What I did is that I would free once I finish to calling every >>> xc_device_get_rdm(). >>> >> >> OK. I misread. Sorry. >> >> >> [...] >> >>> >>>> >>>>> + /* Next step is to check and avoid potential conflict between >>>>> RDM entries >>>>> + * and guest RAM. To avoid intrusive impact to existing memory >>>>> layout >>>>> + * {lowmem, mmio, highmem} which is passed around various >>>>> function blocks, >>>>> + * below conflicts are not handled which are rare and handling >>>>> them would >>>>> + * lead to a more scattered layout: >>>>> + * - RMRR in highmem area (>4G) >>>>> + * - RMRR lower than a defined memory boundary (e.g. 2G) >>>>> + * Otherwise for conflicts between boundary and 4G, we'll >>>>> simply move lowmem >>>>> + * end below reserved region to solve conflict. >>>>> + * >>>>> + * If a conflict is detected on a given RMRR entry, an error >>>>> will be >>>>> + * returned. >>>>> + * If 'force' policy is specified. Or conflict is treated as a >>>>> warning if >>>>> + * 'try' policy is specified, and we also mark this as INVALID >>>>> not to expose >>>>> + * this entry to hvmloader. >>>>> + * >>>>> + * Firstly we should check the case of rdm < 4G because we may >>>>> need to >>>>> + * expand highmem_end. >>>> >>>> Is this strategy agreed in previous discussion? How future-proof is >>>> this >>> >>> Yes, this is based on that design. >>> >> >> OK. >> >> [...] >> >>>>> >>>>> int libxl__build_hvm(libxl__gc *gc, uint32_t domid, >>>>> - libxl_domain_build_info *info, >>>>> + libxl_domain_config *d_config, >>>>> libxl__domain_build_state *state) >>>>> { >>>>> libxl_ctx *ctx = libxl__gc_owner(gc); >>>>> struct xc_hvm_build_args args = {}; >>>>> int ret, rc = ERROR_FAIL; >>>>> + libxl_domain_build_info *const info = &d_config->b_info; >>>>> + uint64_t rdm_mem_boundary, mmio_start; >> >> I didn't mention this in the first pass. You seem to have inserted some >> tabs? We use space to indent. >> >> > > Okay. > > Thanks > Tiejun > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel > >