From mboxrd@z Thu Jan  1 00:00:00 1970
From: "Chen, Tiejun" <tiejun.chen@intel.com>
Subject: Re: [RFC][PATCH 04/13] tools/libxl: detect and avoid
 conflicts with RDM
Date: Thu, 14 May 2015 16:27:45 +0800
Message-ID: <55545C81.8070509@intel.com>
References: <1428657724-3498-1-git-send-email-tiejun.chen@intel.com>
	<1428657724-3498-5-git-send-email-tiejun.chen@intel.com>
	<20150508144343.GO3848@zion.uk.xensource.com>
	<555063D1.7030907@intel.com>
	<20150511113247.GC30997@zion.uk.xensource.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; Format="flowed"
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xen.org>
In-Reply-To: <20150511113247.GC30997@zion.uk.xensource.com>
List-Unsubscribe: <http://lists.xen.org/cgi-bin/mailman/options/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xen.org>
List-Help: <mailto:xen-devel-request@lists.xen.org?subject=help>
List-Subscribe: <http://lists.xen.org/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=subscribe>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Wei Liu <wei.liu2@citrix.com>
Cc: kevin.tian@intel.com, ian.campbell@citrix.com, andrew.cooper3@citrix.com, tim@xen.org, xen-devel@lists.xen.org, stefano.stabellini@citrix.com, JBeulich@suse.com, yang.z.zhang@intel.com, Ian.Jackson@eu.citrix.com
List-Id: xen-devel@lists.xenproject.org

On 2015/5/11 19:32, Wei Liu wrote:
> On Mon, May 11, 2015 at 04:09:53PM +0800, Chen, Tiejun wrote:
>> On 2015/5/8 22:43, Wei Liu wrote:
>>> Sorry for the late review. This series fell through the crack.
>>>
>>
>> Thanks for your review.
>>
>>> On Fri, Apr 10, 2015 at 05:21:55PM +0800, Tiejun Chen wrote:
>>>> While building a VM, HVM domain builder provides struct hvm_info_table{}
>>>> to help hvmloader. Currently it includes two fields to construct guest
>>>> e820 table by hvmloader, low_mem_pgend and high_mem_pgend. So we should
>>>> check them to fix any conflict with RAM.
>>>>
>>>> RMRR can reside in address space beyond 4G theoretically, but we never
>>>> see this in real world. So in order to avoid breaking highmem layout
>>>
>>> How does this break highmem layout?
>>
>> In most cases highmen is always continuous like [4G, ...] but RMRR is
>> theoretically beyond 4G but very rarely. Especially we don't see this
>> happened in real world. So we don't want to such a case breaking the
>> highmem.
>>
>
> The problem is  we take this approach just because this rarely happens
> *now* is not future proof.  It needs to be clearly documented somewhere
> in the manual (or any other Intel docs) and be referenced in the code.
> Otherwise we might end up in a situation that no-one knows how it is
> supposed to work and no-one can fix it if it breaks in the future, that
> is, every single device on earth requires RMRR > 4G overnight (I'm
> exaggerating).
>
> Or you can just make it works with highmem. How much more work do you
> envisage?
>
> (If my above comment makes no sense do let me know. I only have very
> shallow understanding of RMRR)

Maybe I'm misleading you :)

I don't mean RMRR > 4G is not allowed to work in our implementation. 
What I'm saying is that our *policy* is just simple for this kind of 
rare highmem case...

>
>>>
>>>> we don't solve highmem conflict. Note this means highmem rmrr could still
>>>> be supported if no conflict.
>>>>

Like these two sentences above.

>>>
>>> Aren't you actively trying to avoid conflict in libxl?
>>
>> RMRR is fixed by BIOS so we can't aovid conflict. Here we want to adopt some
>> good policies to address RMRR. In the case of highmemt, that simple policy
>> should be enough?
>>
>
> Whatever policy you and HV maintainers agree on. Just clearly document
> it.

Do you mean I should brief this patch description into one separate 
document?

>
>>>
>>>> But in the case of lowmem, RMRR probably scatter the whole RAM space.
>>>> Especially multiple RMRR entries would worsen this to lead a complicated
>>>> memory layout. And then its hard to extend hvm_info_table{} to work
>>>> hvmloader out. So here we're trying to figure out a simple solution to
>>>> avoid breaking existing layout. So when a conflict occurs,
>>>>
>>>>      #1. Above a predefined boundary (default 2G)
>>>>          - move lowmem_end below reserved region to solve conflict;
>>>>
>>>
>>> I hope this "predefined boundary" is user tunable. I will check later in
>>> this patch if it is the case.
>>
>> Yes. As we clarified in that comments,
>>
>> * TODO: Its batter to provide a config parameter for this boundary value.
>>
>> This mean I would provide a patch address this since currently I just think
>> this is not a big deal?
>>
>
> Yes please provide a config option to override that. It's reasonable
> that user wants to change that.

Okay.

>
>>>
>>>>      #2 Below a predefined boundary (default 2G)
>>>>          - Check force/try policy.
>>>>          "force" policy leads to fail libxl. Note when both policies
>>>>          are specified on a given region, 'force' is always preferred.
>>>>          "try" policy issue a warning message and also mask this entry INVALID
>>>>          to indicate we shouldn't expose this entry to hvmloader.
>>>>
>>>> Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
>>>> ---
>>>>   tools/libxc/include/xenctrl.h  |   8 ++
>>>>   tools/libxc/include/xenguest.h |   3 +-
>>>>   tools/libxc/xc_domain.c        |  40 +++++++++
>>>>   tools/libxc/xc_hvm_build_x86.c |  28 +++---
>>>>   tools/libxl/libxl_create.c     |   2 +-
>>>>   tools/libxl/libxl_dm.c         | 195 +++++++++++++++++++++++++++++++++++++++++
>>>>   tools/libxl/libxl_dom.c        |  27 +++++-
>>>>   tools/libxl/libxl_internal.h   |  11 ++-
>>>>   tools/libxl/libxl_types.idl    |   7 ++
>>>>   9 files changed, 303 insertions(+), 18 deletions(-)
>>>>
>>>> diff --git a/tools/libxc/include/xenctrl.h b/tools/libxc/include/xenctrl.h
>>>> index 59bbe06..299b95f 100644
>>>> --- a/tools/libxc/include/xenctrl.h
>>>> +++ b/tools/libxc/include/xenctrl.h
>>>> @@ -2053,6 +2053,14 @@ int xc_get_device_group(xc_interface *xch,
>>>>                        uint32_t *num_sdevs,
>>>>                        uint32_t *sdev_array);
>>>>
>>>> +struct xen_reserved_device_memory
>>>> +*xc_device_get_rdm(xc_interface *xch,
>>>> +                   uint32_t flag,
>>>> +                   uint16_t seg,
>>>> +                   uint8_t bus,
>>>> +                   uint8_t devfn,
>>>> +                   unsigned int *nr_entries);
>>>> +
>>>>   int xc_test_assign_device(xc_interface *xch,
>>>>                             uint32_t domid,
>>>>                             uint32_t machine_sbdf);
>
> [...]
>
>>>
>>>>       uint64_t mem_target;         /* Memory target in bytes. */
>>>>       uint64_t mmio_size;          /* Size of the MMIO hole in bytes. */
>>>>       const char *image_file_name; /* File name of the image to load. */
>>>> diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
>>>> index 4f8383e..85b18ea 100644
>>>> --- a/tools/libxc/xc_domain.c
>>>> +++ b/tools/libxc/xc_domain.c
>>>> @@ -1665,6 +1665,46 @@ int xc_assign_device(
>>>>       return do_domctl(xch, &domctl);
>>>>   }
>>>>
>>>> +struct xen_reserved_device_memory
>>>> +*xc_device_get_rdm(xc_interface *xch,
>>>> +                   uint32_t flag,
>>>> +                   uint16_t seg,
>>>> +                   uint8_t bus,
>>>> +                   uint8_t devfn,
>>>> +                   unsigned int *nr_entries)
>>>> +{
>>>> +    struct xen_reserved_device_memory *xrdm = NULL;
>>>> +    int rc = xc_reserved_device_memory_map(xch, flag, seg, bus, devfn, xrdm,
>>>> +                                           nr_entries);
>>>> +
>>>> +    if ( rc < 0 )
>>>> +    {
>>>> +        if ( errno == ENOBUFS )
>>>> +        {
>>>> +            if ( (xrdm = malloc(*nr_entries *
>>>> +                                sizeof(xen_reserved_device_memory_t))) == NULL )
>>>> +            {
>>>> +                PERROR("Could not allocate memory.");
>>>> +                goto out;
>>>> +            }
>>>
>>> Don't you leak origin xrdm in this case?
>>
>> The caller to xc_device_get_rdm always frees this.
>>
>
> I think I misunderstood how this function works. I thought xrdm was
> passed in by caller, which is clearly not the case. Sorry!
>
>
> In that case, the `if ( rc < 0 )' is not needed because the call should
> always return rc < 0. An assert is good enough.

assert(rc < 0)? But we can't presume the user always pass '0' firstly, 
and additionally, we may have no any RMRR indeed.

So I guess what you want is,

assert( rc <=0 );
if ( !rc )
     goto xrdm;

if ( errno == ENOBUFS )
...

Right?

>
>>>
>>> And, this style is not very good. Shouldn't the caller allocate enough
>>> memory before hand?
>>
>> Are you saying the caller to xc_device_get_rdm()? If so, any caller don't
>> know this, too.
>>
>
> I see.
>
>> Actually this is just a wrapper of that fundamental hypercall,
>> xc_reserved_device_memory_map() in patch #2, and based on that, we always
>> have to first call this to inquire how much memory we really need. And this
>> is why we have this wrapper since we don't want to duplicate more codes.
>>
>> One error handler of this wrapper is just handling ENOBUFS since the caller
>> never know how much memory we should allocate. So oftentimes we always set
>> 'entries = 0' to inquire firstly.
>>
>> Here Jan suggested we may need to figure out a good way to consolidate
>> xc_reserved_device_memory_map() and its wrapper, xc_device_get_rdm().
>>
>> But in some ways, that wrapper likes a static function so we just need to
>> move this into that file associated to its caller, right?
>>
>
> Yes, if there is only one user at the moment, make a static function.

Thanks.

>
>>>
>>>> +            rc = xc_reserved_device_memory_map(xch, flag, seg, bus, devfn, xrdm,
>>>> +                                               nr_entries);
>>>> +            if ( rc )
>>>> +            {
>>>> +                PERROR("Could not get reserved device memory maps.");
>>>> +                free(xrdm);
>>>> +                xrdm = NULL;
>>>> +            }
>>>> +        }
>>>> +        else {
>>>> +            PERROR("Could not get reserved device memory maps.");
>>>> +        }
>>>> +    }
>>>> +
>>>> + out:
>>>> +    return xrdm;
>>>> +}
>>>> +
>>>>   int xc_get_device_group(
>>>>       xc_interface *xch,
>>>>       uint32_t domid,
>>>> diff --git a/tools/libxc/xc_hvm_build_x86.c b/tools/libxc/xc_hvm_build_x86.c
>>>> index c81a25b..3f87bb3 100644
>>>> --- a/tools/libxc/xc_hvm_build_x86.c
>>>> +++ b/tools/libxc/xc_hvm_build_x86.c
>>>> @@ -89,19 +89,16 @@ static int modules_init(struct xc_hvm_build_args *args,
>>>>   }
>>>>
>>>>   static void build_hvm_info(void *hvm_info_page, uint64_t mem_size,
>>>> -                           uint64_t mmio_start, uint64_t mmio_size)
>>>> +                           uint64_t lowmem_end)
>>>>   {
>>>>       struct hvm_info_table *hvm_info = (struct hvm_info_table *)
>>>>           (((unsigned char *)hvm_info_page) + HVM_INFO_OFFSET);
>>>> -    uint64_t lowmem_end = mem_size, highmem_end = 0;
>>>> +    uint64_t highmem_end = 0;
>>>>       uint8_t sum;
>>>>       int i;
>>>>
>>>> -    if ( lowmem_end > mmio_start )
>>>> -    {
>>>> -        highmem_end = (1ull<<32) + (lowmem_end - mmio_start);
>>>> -        lowmem_end = mmio_start;
>>>> -    }
>>>> +    if ( mem_size > lowmem_end )
>>>> +        highmem_end = (1ull<<32) + (mem_size - lowmem_end);
>>>>
>>>>       memset(hvm_info_page, 0, PAGE_SIZE);
>
> [...]
>
>>>
>>>> +                                   struct xc_hvm_build_args *args)
>>>
>>> This function does more than "checking", so a better name is needed.
>>>
>>> May be you should split this function to one "build" function and one
>>> "check" function? What do you think?
>>
>> We'd better keep this big one since this can make our policy understandable,
>> but I agree we need to rename this like,
>>
>> libxl__domain_device_construct_rdm()
>>
>> construct = check + build :)
>
> I'm fine with this.

Thanks.

>
>>
>>>
>>>> +{
>>>> +    int i, j, conflict;
>>>> +    libxl_ctx *ctx = libxl__gc_owner(gc);
>>>
>>> You can just use CTX macro.
>
> [...]
>
>>>
>>>> +    if ((type == LIBXL_RDM_RESERVE_TYPE_NONE) && !d_config->num_pcidevs)
>>>> +        return 0;
>>>> +
>>>> +    /* Collect all rdm info if exist. */
>>>> +    xrdm = xc_device_get_rdm(ctx->xch, LIBXL_RDM_RESERVE_TYPE_HOST,
>>>> +                             0, 0, 0, &nr_all_rdms);
>>>> +    if (!nr_all_rdms)
>>>> +        return 0;
>>>> +    d_config->rdms = libxl__calloc(gc, nr_all_rdms,
>>>> +                                   sizeof(libxl_device_rdm));
>>>
>>> Note that if you use "gc" here the allocated memory will be, well,
>>> garbage collected at some point. If you don't want them to be gc'ed you
>>> should use NOGC.
>>
>> Sorry, what does that mean by 'garbage collected'?
>>
>
> That means the memory allocated with gc will be freed at some point by
> GC_FREE, because those memory regions are meant to be temporary and used
> internally.
>
> When entering a libxl public API function (those start wtih
> libxl_), that function calls GC_INIT to initialise a garbage collector.
> When that function exits, it calls GC_FREE to free all memory allocated
> with that gc.

Thanks for sharing this good info to me!

>
> Since d_config is very likely to be used by libxl user (xl, libvirt
> etc), you probably don't want to fill it with gc allocated memory.
>

Yes.

>>>
>>>> +    memset(d_config->rdms, 0, sizeof(libxl_device_rdm));
>>>> +
>>>> +    /* Query all RDM entries in this platform */
>>>> +    if (type == LIBXL_RDM_RESERVE_TYPE_HOST) {
>>>> +        d_config->num_rdms = nr_all_rdms;
>>>> +        for (i = 0; i < d_config->num_rdms; i++) {
>>>> +            d_config->rdms[i].start =
>>>> +                                (uint64_t)xrdm[i].start_pfn << XC_PAGE_SHIFT;
>>>> +            d_config->rdms[i].size =
>>>> +                                (uint64_t)xrdm[i].nr_pages << XC_PAGE_SHIFT;
>>>> +            d_config->rdms[i].flag = d_config->b_info.rdm.reserve;
>>>> +        }
>>>> +    } else {
>>>> +        d_config->num_rdms = 0;
>>>> +    }
>>>
>>> And you should move d_config->rdms = libxl__calloc inside that `if'.
>>> That is, don't allocate memory if you don't need it.
>>
>> We can't since in all cases we need to preallocate this, and then we will
>> handle this according to our policy.
>>
>
> How would it ever be used again if you set d_config->num_rdms to 0? How
> do you know the exact size of your array again?

If we don't consider multiple devices shares one rdm entry, our workflow 
can be showed as follows:

#1. We always preallocate all rdms[] but with memset().
#2. Then we have two cases for that global rule,

#2.1 If we really need all according to our global rule, we would set 
all rdms[] with all real rdm info and set d_config->num_rdms.
#2.2 If we have no such a global rule to obtain all, we just clear 
d_config->num_rdms.

#3. Then for per device rule

#3.1 From #2.1, we just need to check if we should change one given rdm 
entry's policy if this given entry is just associated to this device.
#3.2 From 2.2, obviously we just need to fill rdms one by one. Of 
course, its very possible that we don't fill all rdms since all 
passthroued devices might not have no rdm at all or they just occupy 
some. But anyway, finally we sync d_config->num_rdms.

>
>>>
>>>> +    free(xrdm);
>>>> +
>>>> +    /* Query RDM entries per-device */
>>>> +    for (i = 0; i < d_config->num_pcidevs; i++) {
>>>> +        unsigned int nr_entries = 0;
>>
>> Maybe I should restate this,
>> 	unsigned int nr_entries;
>>
>>>> +        bool new = true;
>>>> +        seg = d_config->pcidevs[i].domain;
>>>> +        bus = d_config->pcidevs[i].bus;
>>>> +        devfn = PCI_DEVFN(d_config->pcidevs[i].dev, d_config->pcidevs[i].func);
>>>> +        nr_entries = 0;
>>>
>>> You've already initialised this variable.
>>
>> We need to set this as zero to start.
>>
>
> Either of the tow works for me. Just don't want redundant
> initialisation.

Right.

>
>>>
>>>> +        xrdm = xc_device_get_rdm(ctx->xch, LIBXL_RDM_RESERVE_TYPE_NONE,
>>>> +                                 seg, bus, devfn, &nr_entries);
>>>> +        /* No RDM to associated with this device. */
>>>> +        if (!nr_entries)
>>>> +            continue;
>>>> +
>>>> +        /* Need to check whether this entry is already saved in the array.
>>>> +         * This could come from two cases:
>>>> +         *
>>>> +         *   - user may configure to get all RMRRs in this platform, which
>>>> +         * is already queried before this point
>>>
>>> Formatting.
>>
>> Are you saying this?
>
> I mean you need to move "is already..." to the right go align with
> previous line.

Fixed.

>
>>
>> +        /* Need to check whether this entry is already saved in the array.
>>
>> =>
>
> The CODING_STYLE in libxl doesn't seem to enforce this, so you can just
> follow other examples.
>
>>          /*
>>
>>           * Need to check whether this entry is already saved in the array.
>>           * This could come from two cases:
>>
>>>
>>>> +         *   - or two assigned devices may share one RMRR entry
>>>> +         *
>>>> +         * different policies may be configured on the same RMRR due to above
>>>> +         * two cases. We choose a simple policy to always favor stricter policy
>>>> +         */
>>>> +        for (j = 0; j < d_config->num_rdms; j++) {
>>>> +            if (d_config->rdms[j].start ==
>>>> +                                (uint64_t)xrdm[0].start_pfn << XC_PAGE_SHIFT)
>>>> +             {
>>>> +                if (d_config->rdms[j].flag != LIBXL_RDM_RESERVE_FLAG_FORCE)
>>>> +                    d_config->rdms[j].flag = d_config->pcidevs[i].rdm_reserve;
>>>> +                new = false;
>>>> +                break;
>>>> +            }
>>>> +        }
>>>> +
>>>> +        if (new) {
>>>> +            if (d_config->num_rdms > nr_all_rdms - 1) {
>>>> +                LIBXL__LOG(CTX, LIBXL__LOG_ERROR, "Exceed rdm array boundary!\n");
>>>
>>> LOG(ERROR, ...)
>>
>> Fixed.
>>
>>>
>>>> +                free(xrdm);
>>>> +                return -1;
>>>
>>> Please use goto out idiom.
>>
>> We just have two 'return -1' differently so I'm not sure its worth doing
>> this.
>>
>
> Yes, please comply with libxl idiom.
>
>>>
>>>> +            }
>>>> +
>>>> +            /*
>>>> +             * This is a new entry.
>>>> +             */
>>>
>>> /* This is a new entry. */
>>
>> Fixed.
>>
>>>
>>>> +            d_config->rdms[d_config->num_rdms].start =
>>>> +                                (uint64_t)xrdm[0].start_pfn << XC_PAGE_SHIFT;
>>>> +            d_config->rdms[d_config->num_rdms].size =
>>>> +                                (uint64_t)xrdm[0].nr_pages << XC_PAGE_SHIFT;
>>>> +            d_config->rdms[d_config->num_rdms].flag = d_config->pcidevs[i].rdm_reserve;
>>>> +            d_config->num_rdms++;
>>>
>>> Does this work? I don't see you reallocate memory.
>>
>> Like I replied above we always preallocate this at the beginning.
>>
>
> Ah, OK.
>
> But please don't do this. It's hard to see you don't overrun the
> buffer. Please allocate memory only when you need it.

Sorry I don't understand this. As I mention above, we don't know how 
many rdm entries we really need to allocate. So this is why we'd like to 
preallocate all rdms at the beginning. Then this can cover all cases, 
global policy, (global policy & per device) and only per device. Even if 
multiple devices shares one rdm we also need to avoid duplicating a new...


>
>>>
>>>> +        }
>>>> +        free(xrdm);
>>>
>>> Bug: you free xrdm several times.
>>
>> No any conflict.
>>
>> What I did is that I would free once I finish to calling every
>> xc_device_get_rdm().
>>
>
> OK. I misread. Sorry.
>
>
> [...]
>
>>
>>>
>>>> +    /* Next step is to check and avoid potential conflict between RDM entries
>>>> +     * and guest RAM. To avoid intrusive impact to existing memory layout
>>>> +     * {lowmem, mmio, highmem} which is passed around various function blocks,
>>>> +     * below conflicts are not handled which are rare and handling them would
>>>> +     * lead to a more scattered layout:
>>>> +     *  - RMRR in highmem area (>4G)
>>>> +     *  - RMRR lower than a defined memory boundary (e.g. 2G)
>>>> +     * Otherwise for conflicts between boundary and 4G, we'll simply move lowmem
>>>> +     * end below reserved region to solve conflict.
>>>> +     *
>>>> +     * If a conflict is detected on a given RMRR entry, an error will be
>>>> +     * returned.
>>>> +     * If 'force' policy is specified. Or conflict is treated as a warning if
>>>> +     * 'try' policy is specified, and we also mark this as INVALID not to expose
>>>> +     * this entry to hvmloader.
>>>> +     *
>>>> +     * Firstly we should check the case of rdm < 4G because we may need to
>>>> +     * expand highmem_end.
>>>
>>> Is this strategy agreed in previous discussion? How future-proof is this
>>
>> Yes, this is based on that design.
>>
>
> OK.
>
> [...]
>
>>>>
>>>>   int libxl__build_hvm(libxl__gc *gc, uint32_t domid,
>>>> -              libxl_domain_build_info *info,
>>>> +              libxl_domain_config *d_config,
>>>>                 libxl__domain_build_state *state)
>>>>   {
>>>>       libxl_ctx *ctx = libxl__gc_owner(gc);
>>>>       struct xc_hvm_build_args args = {};
>>>>       int ret, rc = ERROR_FAIL;
>>>> +    libxl_domain_build_info *const info = &d_config->b_info;
>>>> +    uint64_t rdm_mem_boundary, mmio_start;
>
> I didn't mention this in the first pass. You seem to have inserted some
> tabs? We use space to indent.
>
>

Okay.

Thanks
Tiejun