From mboxrd@z Thu Jan 1 00:00:00 1970
From: Dan Williams
Date: Fri, 22 Dec 2017 14:53:42 -0800
Subject: Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
In-Reply-To: <39cbe02a-d309-443d-54c9-678a0799342d@gmail.com>
References: <20171214021019.13579-1-ross.zwisler@linux.intel.com>
 <20171214130032.GK16951@dhcp22.suse.cz>
 <20171218203547.GA2366@linux.intel.com>
 <20171220181937.GB12236@bombadil.infradead.org>
 <2da89d31-27a3-34ab-2dbb-92403c8215ec@intel.com>
 <20171220211649.GA32200@bombadil.infradead.org>
 <20171220212408.GA8308@linux.intel.com>
 <20171220224105.GA27258@linux.intel.com>
 <39cbe02a-d309-443d-54c9-678a0799342d@gmail.com>
To: Brice Goglin
Cc: Ross Zwisler, Matthew Wilcox, Dave Hansen, Michal Hocko,
 "linux-kernel@vger.kernel.org", "Anaczkowski, Lukasz", "Box, David E",
 "Kogut, Jaroslaw", "Koss, Marcin", "Koziej, Artur", "Lahtinen, Joonas",
 "Moore, Robert", "Nachimuthu, Murugasamy", "Odzioba, Lukasz",
 "Rafael J. Wysocki", "Rafael J. Wysocki", "Schmauss, Erik",
 "Verma, Vishal L", "Zheng, Lv", Andrew Morton, Balbir Singh,
 Jerome Glisse, John Hubbard, Len Brown, Tim Chen, devel@acpica.org,
 Linux ACPI, Linux MM, "linux-nvdimm@lists.01.org", Linux API,
 linuxppc-dev
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"

On Thu, Dec 21, 2017 at 12:31 PM, Brice Goglin wrote:
> On 20/12/2017 at 23:41, Ross Zwisler wrote:
[..]
> Hello
>
> I can confirm that HPC runtimes are going to use these patches (at least
> all runtimes that use hwloc for topology discovery, but that's the vast
> majority of HPC anyway).
>
> We really didn't like KNL exposing a hacky SLIT table [1]. We had to
> explicitly detect that specific crazy table to find out which NUMA nodes
> were local to which cores, and to find out which NUMA nodes were
> HBM/MCDRAM or DDR. And then we had to hide the SLIT values from the
> application because the reported latencies didn't match reality. Quite
> annoying.
>
> With Ross' patches, we can easily get what we need:
> * which NUMA nodes are local to which CPUs? /sys/devices/system/node/
> can only report a single local node per CPU (doesn't work for KNL and
> upcoming architectures with HBM+DDR+...)
> * which NUMA nodes are slow/fast (for both bandwidth and latency)
> And we can still look at SLIT under /sys/devices/system/node if really
> needed.
>
> And of course having this in sysfs is much better than parsing ACPI
> tables that are only accessible to root :)

On this point, it's not clear to me that we should allow these sysfs
entries to be world readable. Given that /proc/iomem now hides physical
address information from non-root, we at least need to be careful not
to undo that with new sysfs HMAT attributes. Once you need to be root
for this info, is parsing binary HMAT vs sysfs a blocker for the HPC
use case?

Perhaps we can enlist /proc/iomem or a similar enumeration interface
to tell userspace the NUMA node and whether the kernel thinks it has
better or worse performance characteristics relative to base
system-RAM, i.e. new IORES_DESC_* values. I'm worried that if we start
publishing absolute numbers in sysfs, userspace will default to looking
for specific magic numbers in sysfs vs asking the kernel for memory
that has performance characteristics relative to base "System RAM". In
other words, the absolute performance information that the HMAT
publishes is useful to the kernel, but it's not clear that userspace
needs that vs a relative indicator for making NUMA node preference
decisions.
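
For readers following the thread, here is a rough userspace sketch of
the two styles of lookup being contrasted. It is an illustration only
and not part of any posted patch. It assumes the existing per-node
"cpulist" and "distance" files under /sys/devices/system/node/, and the
helper names (parse_cpulist, read_nodes, relative_indicator) are made
up for this example. The relative check simply compares one number
against a baseline; no kernel interface that returns such an indicator
exists in this thread.

```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a kernel cpulist string such as '0-3,8' into a list of CPU ids."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # memory-only nodes have an empty cpulist
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def read_nodes(root="/sys/devices/system/node"):
    """Return {node_id: {"cpus": [...], "distance": [...]}} read from sysfs.

    "distance" is the SLIT row for that node, one entry per node.
    """
    nodes = {}
    for node_dir in sorted(Path(root).glob("node[0-9]*")):
        node_id = int(node_dir.name[len("node"):])
        nodes[node_id] = {
            "cpus": parse_cpulist((node_dir / "cpulist").read_text()),
            "distance": [int(x) for x in (node_dir / "distance").read_text().split()],
        }
    return nodes


def relative_indicator(value, base, higher_is_better):
    """Hypothetical helper: classify a value against base system RAM.

    Returns "better", "worse" or "same" instead of exposing the raw number.
    """
    if value == base:
        return "same"
    is_better = value > base if higher_is_better else value < base
    return "better" if is_better else "worse"
```

A consumer written against relative_indicator() only ever asks "is this
node better or worse than base system RAM for latency or bandwidth?",
which is the kind of decision the relative approach is meant to
encourage, rather than matching specific absolute values.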