Re: [Qemu-devel] [PATCH v9 09/11] numa: Extend the CLI to provide memory latency and bandwidth information

From: Eric Blake <eblake@redhat.com>
To: Tao <tao3.xu@intel.com>, imammedo@redhat.com, ehabkost@redhat.com
Cc: jingqi.liu@intel.com, fan.du@intel.com, qemu-devel@nongnu.org,
	daniel@linux.ibm.com, Markus Armbruster <armbru@redhat.com>,
	jonathan.cameron@huawei.com, dan.j.williams@intel.com
Subject: Re: [Qemu-devel] [PATCH v9 09/11] numa: Extend the CLI to provide memory latency and bandwidth information
Date: Tue, 13 Aug 2019 10:11:32 -0500	[thread overview]
Message-ID: <be305a8a-f1f4-7084-4bb7-4174d530483d@redhat.com> (raw)
In-Reply-To: <20190809065731.9097-10-tao3.xu@intel.com>

[-- Attachment #1.1: Type: text/plain, Size: 7132 bytes --]

On 8/9/19 1:57 AM, Tao wrote:
> From: Liu Jingqi <jingqi.liu@intel.com>
> 
> Add -numa hmat-lb option to provide System Locality Latency and
> Bandwidth Information. These memory attributes help to build
> System Locality Latency and Bandwidth Information Structure(s)
> in ACPI Heterogeneous Memory Attribute Table (HMAT).
> 
> Signed-off-by: Liu Jingqi <jingqi.liu@intel.com>
> Signed-off-by: Tao Xu <tao3.xu@intel.com>
> ---
> 
> Changes in v9:
>     - change the CLI input way, make it more user firendly (Daniel Black)
>     use latency=NUM[p|n|u]s and bandwidth=NUM[M|G|P](B/s) as input and drop
>     the base-lat and base-bw input.

Why are you hand-rolling yet another scaling parser instead of reusing
one that's already in-tree?

> +++ b/hw/core/numa.c

> +void parse_numa_hmat_lb(MachineState *ms, NumaHmatLBOptions *node,
> +                        Error **errp)
> +{

> +    if (node->has_latency) {
> +        hmat_lb = ms->numa_state->hmat_lb[node->hierarchy][node->data_type];
> +
> +        if (!hmat_lb) {
> +            hmat_lb = g_malloc0(sizeof(*hmat_lb));
> +            ms->numa_state->hmat_lb[node->hierarchy][node->data_type] = hmat_lb;
> +        } else if (hmat_lb->latency[node->initiator][node->target]) {
> +            error_setg(errp, "Duplicate configuration of the latency for "
> +                       "initiator=%" PRIu16 " and target=%" PRIu16 ".",
> +                       node->initiator, node->target);
> +            return;
> +        }
> +
> +        ret = qemu_strtoui(node->latency, &endptr, 10, &latency);
> +        if (ret < 0) {
> +            error_setg(errp, "Invalid latency %s", node->latency);
> +            return;
> +        }
> +
> +        if (*endptr == '\0') {
> +            base_lat = 1;
> +        } else if (*(endptr + 1) == 's') {
> +            switch (*endptr) {
> +            case 'p':
> +                base_lat = 1;
> +                break;
> +            case 'n':
> +                base_lat = PICO_PER_NSEC;
> +                break;
> +            case 'u':
> +                base_lat = PICO_PER_USEC;
> +                break;

Hmm - this is a different scaling than any of our existing parsers
(which assume multiples k/M/G..., not subdivisions u/n/s)

> +    if (node->has_bandwidth) {
> +        hmat_lb = ms->numa_state->hmat_lb[node->hierarchy][node->data_type];
> +
> +        if (!hmat_lb) {
> +            hmat_lb = g_malloc0(sizeof(*hmat_lb));
> +            ms->numa_state->hmat_lb[node->hierarchy][node->data_type] = hmat_lb;
> +        } else if (hmat_lb->bandwidth[node->initiator][node->target]) {
> +            error_setg(errp, "Duplicate configuration of the bandwidth for "
> +                       "initiator=%" PRIu16 " and target=%" PRIu16 ".",
> +                       node->initiator, node->target);
> +            return;
> +        }
> +
> +        ret = qemu_strtoui(node->bandwidth, &endptr, 10, &bandwidth);
> +        if (ret < 0) {
> +            error_setg(errp, "Invalid bandwidth %s", node->bandwidth);
> +            return;
> +        }
> +
> +        switch (toupper(*endptr)) {
> +        case '\0':
> +        case 'M':
> +            base_bw = 1;
> +            break;
> +        case 'G':
> +            base_bw = UINT64_C(1) << 10;
> +            break;
> +        case 'P':
> +            base_bw = UINT64_C(1) << 20;
> +            break;

But this one, in addition to being wrong (P is 1<<30, not 1<<20), should
definitely be reusing qemu_strtosz_metric() or similar (look in
util/cutils.c).

> +++ b/qapi/machine.json
> @@ -377,10 +377,12 @@
>  #
>  # @cpu: property based CPU(s) to node mapping (Since: 2.10)
>  #
> +# @hmat-lb: memory latency and bandwidth information (Since: 4.2)
> +#
>  # Since: 2.1
>  ##
>  { 'enum': 'NumaOptionsType',
> -  'data': [ 'node', 'dist', 'cpu' ] }
> +  'data': [ 'node', 'dist', 'cpu', 'hmat-lb' ] }
>  

> +##
> +# @HmatLBDataType:
> +#
> +# Data type in the System Locality Latency
> +# and Bandwidth Information Structure of HMAT (Heterogeneous
> +# Memory Attribute Table)
> +#
> +# For more information of @HmatLBDataType see
> +# the chapter 5.2.27.4: Table 5-142:  Field "Data Type" of ACPI 6.3 spec.
> +#
> +# @access-latency: access latency (picoseconds)
> +#
> +# @read-latency: read latency (picoseconds)
> +#
> +# @write-latency: write latency (picoseconds)
> +#
> +# @access-bandwidth: access bandwidth (MB/s)
> +#
> +# @read-bandwidth: read bandwidth (MB/s)
> +#
> +# @write-bandwidth: write bandwidth (MB/s)

Are these really the best scales?

> +
> +##
> +# @NumaHmatLBOptions:
> +#
> +# Set the system locality latency and bandwidth information
> +# between Initiator and Target proximity Domains.
> +#
> +# For more information of @NumaHmatLBOptions see
> +# the chapter 5.2.27.4: Table 5-142 of ACPI 6.3 spec.
> +#
> +# @initiator: the Initiator Proximity Domain.
> +#
> +# @target: the Target Proximity Domain.
> +#
> +# @hierarchy: the Memory Hierarchy. Indicates the performance
> +#             of memory or side cache.
> +#
> +# @data-type: presents the type of data, access/read/write
> +#             latency or hit latency.
> +#
> +# @latency: the value of latency from @initiator to @target proximity domain,
> +#           the latency units are "ps(picosecond)", "ns(nanosecond)" or
> +#           "us(microsecond)".
> +#
> +# @bandwidth: the value of bandwidth between @initiator and @target proximity
> +#             domain, the bandwidth units are "MB(/s)","GB(/s)" or "PB(/s)".
> +#
> +# Since: 4.2
> +##
> +{ 'struct': 'NumaHmatLBOptions',
> +    'data': {
> +    'initiator': 'uint16',
> +    'target': 'uint16',
> +    'hierarchy': 'HmatLBMemoryHierarchy',
> +    'data-type': 'HmatLBDataType',
> +    '*latency': 'str',
> +    '*bandwidth': 'str' }}

...and then parsing strings instead of taking raw integers?  Parsing
strings is okay for HMP, but for QMP, our goal should be a single
representation with no additional sugar on top.  Latency and bandwidth
should be int in a single scale.

> +++ b/qemu-options.hx
> @@ -164,16 +164,19 @@ DEF("numa", HAS_ARG, QEMU_OPTION_numa,
>      "-numa node[,mem=size][,cpus=firstcpu[-lastcpu]][,nodeid=node][,initiator=node]\n"
>      "-numa node[,memdev=id][,cpus=firstcpu[-lastcpu]][,nodeid=node][,initiator=node]\n"
>      "-numa dist,src=source,dst=destination,val=distance\n"
> -    "-numa cpu,node-id=node[,socket-id=x][,core-id=y][,thread-id=z]\n",
> +    "-numa cpu,node-id=node[,socket-id=x][,core-id=y][,thread-id=z]\n"
> +    "-numa hmat-lb,initiator=node,target=node,hierarchy=memory|first-level|second-level|third-level,data-type=access-latency|read-latency|write-latency[,latency=lat][,bandwidth=bw]\n",

Command-line parsing can then take human-written scaled numbers, and
pre-convert them into the single scale accepted by the QMP interface.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]