Re: [PATCH v12 09/11] hmat acpi: Build System Locality Latency and Bandwidth Information Structure(s)

From: Igor Mammedov <imammedo@redhat.com>
To: Tao Xu <tao3.xu@intel.com>
Cc: "ehabkost@redhat.com" <ehabkost@redhat.com>,
	"Liu, Jingqi" <jingqi.liu@intel.com>,
	"Du, Fan" <fan.du@intel.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	"jonathan.cameron@huawei.com" <jonathan.cameron@huawei.com>,
	"Williams, Dan J" <dan.j.williams@intel.com>
Subject: Re: [PATCH v12 09/11] hmat acpi: Build System Locality Latency and Bandwidth Information Structure(s)
Date: Thu, 17 Oct 2019 16:17:14 +0200	[thread overview]
Message-ID: <20191017161714.191e5d48@redhat.com> (raw)
In-Reply-To: <2e5b2843-5a82-4152-873a-27f469175c6b@intel.com>

On Tue, 15 Oct 2019 13:40:54 +0800
Tao Xu <tao3.xu@intel.com> wrote:

> On 10/15/2019 8:59 AM, Tao Xu wrote:
> > On 10/14/2019 5:00 PM, Igor Mammedov wrote:  
> >> On Sat, 12 Oct 2019 11:04:03 +0800
> >> Tao Xu <tao3.xu@intel.com> wrote:
> >>  
> >>> On 10/11/2019 10:08 PM, Igor Mammedov wrote:  
> >>>> On Thu, 10 Oct 2019 14:53:56 +0800
> >>>> Tao Xu <tao3.xu@intel.com> wrote:  
> >>>>> On 10/3/2019 10:41 PM, Igor Mammedov wrote:  
> >>>>>> On Fri, 20 Sep 2019 15:43:47 +0800
> >>>>>> Tao Xu <tao3.xu@intel.com> wrote:  
> >>>>>>> From: Liu Jingqi <jingqi.liu@intel.com>
> >>>>>>>
> >>>>>>> This structure describes the memory access latency and bandwidth
> >>>>>>> information from various memory access initiator proximity domains.
> >>>>>>> The latency and bandwidth numbers represented in this structure
> >>>>>>> correspond to rated latency and bandwidth for the platform.
> >>>>>>> The software could use this information as hint for optimization.
> >>>>>>>
> >>>>>>> Signed-off-by: Liu Jingqi <jingqi.liu@intel.com>
> >>>>>>> Signed-off-by: Tao Xu <tao3.xu@intel.com>
> >>>>>>> ---
> >>>>>>>
> >>>>>>> Changes in v12:
> >>>>>>>        - Fix a bug that if HMAT is enabled and without hmat-lb 
> >>>>>>> setting,
> >>>>>>>          QEMU will crash. (reported by Danmei Wei)
> >>>>>>>
> >>>>>>> Changes in v11:
> >>>>>>>        - Calculate base in build_hmat_lb().
> >>>>>>> ---
> >>>>>>>     hw/acpi/hmat.c | 126 
> >>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++-
> >>>>>>>     hw/acpi/hmat.h |   2 +
> >>>>>>>     2 files changed, 127 insertions(+), 1 deletion(-)
> >>>>>>>
> >>>>>>> diff --git a/hw/acpi/hmat.c b/hw/acpi/hmat.c
> >>>>>>> index 1368fce7ee..e7be849581 100644
> >>>>>>> --- a/hw/acpi/hmat.c
> >>>>>>> +++ b/hw/acpi/hmat.c
> >>>>>>> @@ -27,6 +27,7 @@
> >>>>>>>     #include "qemu/osdep.h"
> >>>>>>>     #include "sysemu/numa.h"
> >>>>>>>     #include "hw/acpi/hmat.h"
> >>>>>>> +#include "qemu/error-report.h"
> >>>>>>>     /*
> >>>>>>>      * ACPI 6.3:
> >>>>>>> @@ -67,11 +68,105 @@ static void build_hmat_mpda(GArray 
> >>>>>>> *table_data, uint16_t flags, int initiator,
> >>>>>>>         build_append_int_noprefix(table_data, 0, 8);
> >>>>>>>     }
> >>>>>>> +static bool entry_overflow(uint64_t *lb_data, uint64_t base, int 
> >>>>>>> len)
> >>>>>>> +{
> >>>>>>> +    int i;
> >>>>>>> +
> >>>>>>> +    for (i = 0; i < len; i++) {
> >>>>>>> +        if (lb_data[i] / base >= UINT16_MAX) {
> >>>>>>> +            return true;
> >>>>>>> +        }
> >>>>>>> +    }
> >>>>>>> +
> >>>>>>> +    return false;
> >>>>>>> +}  
> >>>>>> I suggest to do this check at CLI parsing time  
> >>>>>>> +/*
> >>>>>>> + * ACPI 6.3: 5.2.27.4 System Locality Latency and Bandwidth 
> >>>>>>> Information
> >>>>>>> + * Structure: Table 5-146
> >>>>>>> + */
> >>>>>>> +static void build_hmat_lb(GArray *table_data, HMAT_LB_Info 
> >>>>>>> *hmat_lb,
> >>>>>>> +                          uint32_t num_initiator, uint32_t 
> >>>>>>> num_target,
> >>>>>>> +                          uint32_t *initiator_list, int type)
> >>>>>>> +{
> >>>>>>> +    uint8_t mask = 0x0f;
> >>>>>>> +    uint32_t s = num_initiator;
> >>>>>>> +    uint32_t t = num_target;  
> >>>>>> drop this locals and use arguments directly  
> >>>>>>> +    uint64_t base = 1;
> >>>>>>> +    uint64_t *lb_data;
> >>>>>>> +    int i, unit;
> >>>>>>> +
> >>>>>>> +    /* Type */
> >>>>>>> +    build_append_int_noprefix(table_data, 1, 2);
> >>>>>>> +    /* Reserved */
> >>>>>>> +    build_append_int_noprefix(table_data, 0, 2);
> >>>>>>> +    /* Length */
> >>>>>>> +    build_append_int_noprefix(table_data, 32 + 4 * s + 4 * t + 2 
> >>>>>>> * s * t, 4);  
> >>>>>>                                                 ^^^^
> >>>>>> to me above looks like /dev/random output, absolutely unreadable.
> >>>>>> Suggest to use local var (like: lb_length) for expression with 
> >>>>>> comments
> >>>>>> beside magic numbers.  
> >>>>>>> +    /* Flags: Bits [3:0] Memory Hierarchy, Bits[7:4] Reserved */
> >>>>>>> +    build_append_int_noprefix(table_data, hmat_lb->hierarchy & 
> >>>>>>> mask, 1);  
> >>>>>>
> >>>>>> why do you need to use mask here?  
> >>>>> Because Bits[7:4] Reserved, so I use mask to keep it reserved.  
> >>>>
> >>>> these bits are not user provided and set to 0, if they get set it's
> >>>> programming error and instead of masking problem out QEMU should abort,
> >>>> I suggest replace masking with assert(!foo>>x).  
> >>>>>>> +    /* Data Type */
> >>>>>>> +    build_append_int_noprefix(table_data, hmat_lb->data_type, 1);  
> >>>>>>
> >>>>>> Isn't hmat_lb->data_type and passed argument 'type' the same?  
> >>>>> Yes, I will drop 'type'.  
> >>>>>>> +    /* Reserved */
> >>>>>>> +    build_append_int_noprefix(table_data, 0, 2);
> >>>>>>> +    /* Number of Initiator Proximity Domains (s) */
> >>>>>>> +    build_append_int_noprefix(table_data, s, 4);
> >>>>>>> +    /* Number of Target Proximity Domains (t) */
> >>>>>>> +    build_append_int_noprefix(table_data, t, 4);
> >>>>>>> +    /* Reserved */
> >>>>>>> +    build_append_int_noprefix(table_data, 0, 4);
> >>>>>>> +
> >>>>>>> +    if (HMAT_IS_LATENCY(type)) {
> >>>>>>> +        unit = 1000;
> >>>>>>> +        lb_data = hmat_lb->latency;
> >>>>>>> +    } else {
> >>>>>>> +        unit = 1024;
> >>>>>>> +        lb_data = hmat_lb->bandwidth;
> >>>>>>> +    }
> >>>>>>> +
> >>>>>>> +    while (entry_overflow(lb_data, base, s * t)) {
> >>>>>>> +        for (i = 0; i < s * t; i++) {
> >>>>>>> +            if (!QEMU_IS_ALIGNED(lb_data[i], unit * base)) {
> >>>>>>> +                error_report("Invalid latency/bandwidth input, 
> >>>>>>> all "
> >>>>>>> +                "latencies/bandwidths should be specified in the 
> >>>>>>> same units.");
> >>>>>>> +                exit(1);
> >>>>>>> +            }
> >>>>>>> +        }
> >>>>>>> +        base *= unit;
> >>>>>>> +    }  
> >>>>>> Can you clarify what you are trying to check here?  
> >>>>> This part I use entry_overflow() to check if uint16 can store 
> >>>>> entry. If
> >>>>> can't store and the entries matrix can be divisible by unit * base, 
> >>>>> then
> >>>>> base will be unit * base.
> >>>>>
> >>>>> For example, if lb_data[i] are 1048576(1TB/s) and 1024(1GB/s), unit is
> >>>>> 1024, so 1048576 is bigger than UINT16_MAX, and can be divisible by 
> >>>>> 1024
> >>>>> * 1, so base is 1024 and entries are 1024 and 1 (see entry =
> >>>>> hmat_lb->latency[i] / base;). The benefit is even user input different
> >>>>> unit(TB/s vs GB/s), we can still store the data as far as possible.  
> >>>>
> >>>> Is it possible instead of doing multiple iterations over lb_data
> >>>> until it finds valid base, just go over lb_data once to find MIN/MAX
> >>>> and then calculate base using it. Error out with max/min offending
> >>>> values if it's not possible to compress the range into uint16_t?  
> >>>
> >>> Although we tell user input same unit data, such as use 1GB/s 3GB/s. If
> >>> user input data such as 1048575, 1048576(1TB/s) and 1024(1GB/s), then we
> >>> will get 1024 * (1023 1024 1). I am wondering if it is appropriate
> >>> because we lose a float number(0.999020). But in our codes, it will
> >>> raise error.  
> >> I do not understand what you are trying to say here, could you rephrase
> >> it, so the problem would be more clear, please?
> >>  
> > Sorry, I mean how we treat the data cannot be divisible if we use 
> > max/min as base. For another example, If user input the data(including 3 
> > bandwidths) : 9GB/s 5GB/s 3GB/s. Then max/min result is 3. But entries 
> > should be uint16, (5GB/s)/3 we can only get 1GB/s, then we should raise 
> > error(overflow).
> > But if this patch, we will get the base is 1GB/s.  
> I understand the MIN/MAX means, in the case above, we get MAX is 9GB/s, 
> MIN is 3GB/s, then I use code below to calculate :
> 
>      while (max_data >= UINT16_MAX) {
>          if (!QEMU_IS_ALIGNED(max_data, unit * base) ||
>              !QEMU_IS_ALIGNED(min_data, unit * base) {
>                  error_report("Invalid latency/bandwidth input.");
>                  exit(1);
>          }
>          base *= unit;
>      }
this check won't cover, entries in between min and max.
Maybe using range bitmap the time of parsing bandwidth/latency CLI option
would work:

   parse_numa_hmat_lb(...) {
      ...
      if (bw && !ALIGNED(value, 1MB))
          error fatal("should be 1MB aligned")

      sub_table->range_bitmap |= value;

      last_bit = find_last_bit(sub_table->range_bitmap)
      first_bit = find_first_bit(sub_table->range_bitmap)
      if ((last_bit - first_bit) > UINT16_BITS)
          error_fatal("value (%d) should not differ from
                      previously entered values on more that UNINT16_MAX")

      sub_table->base = bit_2_base(first_bit)
      sub_table[x] = value
      ...
   }

it should
  1: error out at the first option which value deviates too
     much from previously parsed options for sub-table
  2: recalculate 'base' value for sub-table