linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] clarify and cleanup CPU and NUMA topology ABIs
@ 2021-04-29  7:03 Tian Tao
  2021-04-29  7:03 ` [PATCH 1/2] CPU, NUMA topology ABIs: clarify the overflow issue of sysfs pagebuf Tian Tao
  2021-04-29  7:03 ` [PATCH 2/2] Documentation/ABI: Move the topology-related sysfs interface to the right place Tian Tao
  0 siblings, 2 replies; 15+ messages in thread
From: Tian Tao @ 2021-04-29  7:03 UTC (permalink / raw)
  To: corbet, gregkh, song.bao.hua; +Cc: linux-doc, linux-kernel, Tian Tao

patch #1: clarify the overflow issue of sysfs pagebuf, and Move the
presence of BUILD_BUG_ON to more sensible place.

patch #2: move the interface that exists under
/sys/devices/system/cpu/cpuX/topology/ to the more logical
Documentation/ABI/ file that can be properly parsed and
displayed to the user space.

these two patches are a follow-up to Greg's comments in the below threads:
https://lore.kernel.org/lkml/YFRGIedW1fUlnmi+@kroah.com/
https://lore.kernel.org/lkml/YFR2kwakbcGiI37w@kroah.com/

Tian Tao (2):
  CPU, NUMA topology ABIs: clarify the overflow issue of sysfs pagebuf
  Documentation/ABI: Move the topology-related sysfs interface to the
    right place

 Documentation/ABI/stable/sysfs-devices-node       |   5 +-
 Documentation/ABI/stable/sysfs-devices-system-cpu | 142 ++++++++++++++++++++++
 Documentation/admin-guide/cputopology.rst         |  83 +------------
 drivers/base/node.c                               |   3 -
 include/linux/cpumask.h                           |   6 +
 5 files changed, 154 insertions(+), 85 deletions(-)

-- 
2.7.4


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/2] CPU, NUMA topology ABIs: clarify the overflow issue of sysfs pagebuf
  2021-04-29  7:03 [PATCH 0/2] clarify and cleanup CPU and NUMA topology ABIs Tian Tao
@ 2021-04-29  7:03 ` Tian Tao
  2021-04-29 14:21   ` Dave Hansen
  2021-05-06 23:16   ` kernel test robot
  2021-04-29  7:03 ` [PATCH 2/2] Documentation/ABI: Move the topology-related sysfs interface to the right place Tian Tao
  1 sibling, 2 replies; 15+ messages in thread
From: Tian Tao @ 2021-04-29  7:03 UTC (permalink / raw)
  To: corbet, gregkh, song.bao.hua
  Cc: linux-doc, linux-kernel, Tian Tao, Rafael J. Wysocki,
	Peter Zijlstra, Valentin Schneider, Dave Hansen,
	Daniel Bristot de Oliveira

Both numa node and cpu use cpu bitmap like 3,ffffffff to expose hardware
topology. When cpu number is large, the page buffer of sysfs will over-
flow. This doesn't really happen nowadays as the maximum NR_CPUS is 8196
for X86_64 and 4096 for ARM64 since 8196 * 9 / 32 = 2305 is still smaller
than 4KB page size.

So the existing BUILD_BUG_ON() in drivers/base/node.c is pretty much
preventing future problems similar with Y2K when hardware gets more
and more CPUs.

On the other hand, it should be more sensible to move the guard to common
code which can protect both cpu and numa:
/sys/devices/system/cpu/cpu0/topology/die_cpus etc.
/sys/devices/system/node/node0/cpumap etc.

Topology bitmap mask strings shouldn't be larger than PAGE_SIZE as
lstopo and numactl depend on them. But other ABIs exposing cpu lists
are not really used by common applications, so this patch also marks
those lists could be trimmed as there is no any guarantee those lists
are always less than PAGE_SIZE especially a list could be like this:
0, 3, 5, 7, 9, 11... etc.

Signed-off-by: Tian Tao <tiantao6@hisilicon.com>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
---
 Documentation/ABI/stable/sysfs-devices-node |  5 ++++-
 Documentation/admin-guide/cputopology.rst   | 15 +++++++++++++++
 drivers/base/node.c                         |  3 ---
 include/linux/cpumask.h                     |  6 ++++++
 4 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
index 484fc04..82dfe64 100644
--- a/Documentation/ABI/stable/sysfs-devices-node
+++ b/Documentation/ABI/stable/sysfs-devices-node
@@ -47,7 +47,10 @@ What:		/sys/devices/system/node/nodeX/cpulist
 Date:		October 2002
 Contact:	Linux Memory Management list <linux-mm@kvack.org>
 Description:
-		The CPUs associated to the node.
+		The CPUs associated to the node. The format is like 0-3,
+		8-11, 14,17. maximum size is PAGE_SIZE, so the tail
+		of the string will be trimmed while its size is larger
+		than PAGE_SIZE.
 
 What:		/sys/devices/system/node/nodeX/meminfo
 Date:		October 2002
diff --git a/Documentation/admin-guide/cputopology.rst b/Documentation/admin-guide/cputopology.rst
index b90dafc..4538d78 100644
--- a/Documentation/admin-guide/cputopology.rst
+++ b/Documentation/admin-guide/cputopology.rst
@@ -44,6 +44,9 @@ core_cpus:
 core_cpus_list:
 
 	human-readable list of CPUs within the same core.
+	The format is like 0-3, 8-11, 14,17. The maximum size is PAGE_SIZE,
+	so the tail of the string will be trimmed while its size is larger
+	than PAGE_SIZE.
 	(deprecated name: "thread_siblings_list");
 
 package_cpus:
@@ -54,6 +57,9 @@ package_cpus:
 package_cpus_list:
 
 	human-readable list of CPUs sharing the same physical_package_id.
+	The format is like 0-3, 8-11, 14,17. The maximum size is PAGE_SIZE,
+	so the tail of the string will be trimmed while its size is larger
+	than PAGE_SIZE.
 	(deprecated name: "core_siblings_list")
 
 die_cpus:
@@ -63,6 +69,9 @@ die_cpus:
 die_cpus_list:
 
 	human-readable list of CPUs within the same die.
+	The format is like 0-3, 8-11, 14,17. The maximum size is PAGE_SIZE,
+	so the tail of the string will be trimmed while its size is larger
+	than PAGE_SIZE.
 
 book_siblings:
 
@@ -73,6 +82,9 @@ book_siblings_list:
 
 	human-readable list of cpuX's hardware threads within the same
 	book_id.
+	The format is like 0-3, 8-11, 14,17. The maximum size is PAGE_SIZE,
+	so the tail of the string will be trimmed while its size is larger
+	than PAGE_SIZE.
 
 drawer_siblings:
 
@@ -83,6 +95,9 @@ drawer_siblings_list:
 
 	human-readable list of cpuX's hardware threads within the same
 	drawer_id.
+	The format is like 0-3, 8-11, 14,17. The maximum size is PAGE_SIZE,
+	so the tail of the string will be trimmed while its size is larger
+	than PAGE_SIZE.
 
 Architecture-neutral, drivers/base/topology.c, exports these attributes.
 However, the book and drawer related sysfs files will only be created if
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 2c36f61d..e24530c3 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -33,9 +33,6 @@ static ssize_t node_read_cpumap(struct device *dev, bool list, char *buf)
 	cpumask_var_t mask;
 	struct node *node_dev = to_node(dev);
 
-	/* 2008/04/07: buf currently PAGE_SIZE, need 9 chars per 32 bits. */
-	BUILD_BUG_ON((NR_CPUS/32 * 9) > (PAGE_SIZE-1));
-
 	if (!alloc_cpumask_var(&mask, GFP_KERNEL))
 		return 0;
 
diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index bfc4690..1882477 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -12,6 +12,7 @@
 #include <linux/bitmap.h>
 #include <linux/atomic.h>
 #include <linux/bug.h>
+#include <asm/page.h>
 
 /* Don't assign or return these: may not be this big! */
 typedef struct cpumask { DECLARE_BITMAP(bits, NR_CPUS); } cpumask_t;
@@ -979,6 +980,11 @@ static inline bool cpu_dying(unsigned int cpu)
 static inline ssize_t
 cpumap_print_to_pagebuf(bool list, char *buf, const struct cpumask *mask)
 {
+	/*
+	 * 32bits requires 9bytes: "ff,ffffffff", thus, too many CPUs will
+	 * cause the overflow of sysfs pagebuf
+	 */
+	BUILD_BUG_ON((NR_CPUS/32 * 9) > (PAGE_SIZE-1));
 	return bitmap_print_to_pagebuf(list, buf, cpumask_bits(mask),
 				      nr_cpu_ids);
 }
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/2] Documentation/ABI: Move the topology-related sysfs interface to the right place
  2021-04-29  7:03 [PATCH 0/2] clarify and cleanup CPU and NUMA topology ABIs Tian Tao
  2021-04-29  7:03 ` [PATCH 1/2] CPU, NUMA topology ABIs: clarify the overflow issue of sysfs pagebuf Tian Tao
@ 2021-04-29  7:03 ` Tian Tao
  1 sibling, 0 replies; 15+ messages in thread
From: Tian Tao @ 2021-04-29  7:03 UTC (permalink / raw)
  To: corbet, gregkh, song.bao.hua; +Cc: linux-doc, linux-kernel, Tian Tao

Move the interface that exists under
/sys/devices/system/cpu/cpuX/topology/ to the more logical
Documentation/ABI/ file that can be properly parsed and
displayed to the user space

Signed-off-by: Tian Tao <tiantao6@hisilicon.com>
---
 Documentation/ABI/stable/sysfs-devices-system-cpu | 142 ++++++++++++++++++++++
 Documentation/admin-guide/cputopology.rst         |  98 +--------------
 2 files changed, 144 insertions(+), 96 deletions(-)

diff --git a/Documentation/ABI/stable/sysfs-devices-system-cpu b/Documentation/ABI/stable/sysfs-devices-system-cpu
index 33c133e..f63a72d 100644
--- a/Documentation/ABI/stable/sysfs-devices-system-cpu
+++ b/Documentation/ABI/stable/sysfs-devices-system-cpu
@@ -1,3 +1,7 @@
+Export CPU topology info via sysfs. Items (attributes) are similar
+to /proc/cpuinfo output of some architectures. They reside in
+/sys/devices/system/cpu/cpuX/topology/:
+
 What: 		/sys/devices/system/cpu/dscr_default
 Date:		13-May-2014
 KernelVersion:	v3.15.0
@@ -23,3 +27,141 @@ Description:	Default value for the Data Stream Control Register (DSCR) on
 		here).
 		If set by a process it will be inherited by child processes.
 Values:		64 bit unsigned integer (bit field)
+
+What:           /sys/devices/system/cpu/cpuX/topology/physical_package_id
+Date:           19-Mar-2021
+KernelVersion:  v5.12
+Contact:
+Description:    physical package id of cpuX. Typically corresponds to a physical
+                socket number, but the actual value is architecture and platform
+                dependent.
+Values:         64 bit unsigned integer (bit field)
+
+What:           /sys/devices/system/cpu/cpuX/topology/die_id
+Date:           19-Mar-2021
+KernelVersion:  v5.12
+Contact:
+Description:    the CPU die ID of cpuX. Typically it is the hardware platform's
+                identifier (rather than the kernel's). The actual value is
+                architecture and platform dependent.
+Values:         64 bit unsigned integer (bit field)
+
+What:           /sys/devices/system/cpu/cpuX/topology/core_id
+Date:           19-Mar-2021
+KernelVersion:  v5.12
+Contact:
+Description:    the CPU core ID of cpuX. Typically it is the hardware platform's
+                identifier (rather than the kernel's). The actual value is
+                architecture and platform dependent.
+Values:         64 bit unsigned integer (bit field)
+
+What:           /sys/devices/system/cpu/cpuX/topology/book_id
+Date:           19-Mar-2021
+KernelVersion:  v5.12
+Contact:
+Description:    the book ID of cpuX. Typically it is the hardware platform's
+                identifier (rather than the kernel's). The actual value is
+                architecture and platform dependent. it's only used on s390.
+Values:         64 bit unsigned integer (bit field)
+
+What:           /sys/devices/system/cpu/cpuX/topology/drawer_id
+Date:           19-Mar-2021
+KernelVersion:  v5.12
+Contact:
+Description:    the drawer ID of cpuX. Typically it is the hardware platform's
+                identifier (rather than the kernel's). The actual value is
+                architecture and platform dependent. it's only used on s390.
+Values:         64 bit unsigned integer (bit field)
+
+What:           /sys/devices/system/cpu/cpuX/topology/core_cpus
+Date:           19-Mar-2021
+KernelVersion:  v5.12
+Contact:
+Description:    internal kernel map of CPUs within the same core.
+                (deprecated name: "thread_siblings")
+Values:         hexadecimal bitmask.
+
+What:           /sys/devices/system/cpu/cpuX/topology/core_cpus_list
+Date:           19-Mar-2021
+KernelVersion:  v5.12
+Contact:
+Description:    human-readable list of CPUs within the same core.
+                The format is like 0-3, 8-11, 14,17. The maximum size is PAGE_SIZE,
+	        so the tail of the string will be trimmed while its size is larger
+                than PAGE_SIZE.
+                (deprecated name: "thread_siblings_list").
+Values:         decimal list.
+
+What:           /sys/devices/system/cpu/cpuX/topology/package_cpus
+Date:           19-Mar-2021
+KernelVersion:  v5.12
+Contact:
+Description:    internal kernel map of the CPUs sharing the same physical_package_id.
+                (deprecated name: "core_siblings").
+Values:         hexadecimal bitmask.
+
+What:           /sys/devices/system/cpu/cpuX/topology/package_cpus_list
+Date:           19-Mar-2021
+KernelVersion:  v5.12
+Contact:
+Description:    human-readable list of CPUs sharing the same physical_package_id.
+                The format is like 0-3, 8-11, 14,17. The maximum size is PAGE_SIZE,
+                so the tail of the string will be trimmed while its size is larger
+                than PAGE_SIZE.
+                (deprecated name: "core_siblings_list")
+Values:         decimal list.
+
+What:           /sys/devices/system/cpu/cpuX/topology/die_cpus
+Date:           19-Mar-2021
+KernelVersion:  v5.12
+Contact:
+Description:    internal kernel map of CPUs within the same die.
+Values:         hexadecimal bitmask.
+
+What:           /sys/devices/system/cpu/cpuX/topology/die_cpus_list
+Date:           19-Mar-2021
+KernelVersion:  v5.12
+Contact:
+Description:    human-readable list of CPUs within the same die.
+                The format is like 0-3, 8-11, 14,17. The maximum size is PAGE_SIZE,
+                so the tail of the string will be trimmed while its size is larger
+                than PAGE_SIZE.
+Values:         decimal list.
+
+What:           /sys/devices/system/cpu/cpuX/topology/book_siblings
+Date:           19-Mar-2021
+KernelVersion:  v5.12
+Contact:
+Description:    internal kernel map of cpuX's hardware threads within the same
+                book_id. it's only used on s390.
+Values:         hexadecimal bitmask.
+
+What:           /sys/devices/system/cpu/cpuX/topology/book_siblings_list
+Date:           19-Mar-2021
+KernelVersion:  v5.12
+Contact:
+Description:    human-readable list of cpuX's hardware threads within the same
+                book_id. it's only used on s390.
+                The format is like 0-3, 8-11, 14,17. The maximum size is PAGE_SIZE,
+                so the tail of the string will be trimmed while its size is larger
+                than PAGE_SIZE.
+Values:         decimal list.
+
+What:           /sys/devices/system/cpu/cpuX/topology/drawer_siblings
+Date:           19-Mar-2021
+KernelVersion:  v5.12
+Contact:
+Description:    internal kernel map of cpuX's hardware threads within the same
+                drawer_id. it's only used on s390.
+Values:         hexadecimal bitmask.
+
+What:           /sys/devices/system/cpu/cpuX/topology/drawer_siblings_list
+Date:           19-Mar-2021
+KernelVersion:  v5.12
+Contact:
+Description:    human-readable list of cpuX's hardware threads within the same
+                drawer_id.
+                The format is like 0-3, 8-11, 14,17. The maximum size is PAGE_SIZE,
+                so the tail of the string will be trimmed while its size is larger
+                than PAGE_SIZE. it's only used on s390.
+Values:         decimal list.
diff --git a/Documentation/admin-guide/cputopology.rst b/Documentation/admin-guide/cputopology.rst
index 4538d78..de1995c 100644
--- a/Documentation/admin-guide/cputopology.rst
+++ b/Documentation/admin-guide/cputopology.rst
@@ -2,102 +2,8 @@
 How CPU topology info is exported via sysfs
 ===========================================
 
-Export CPU topology info via sysfs. Items (attributes) are similar
-to /proc/cpuinfo output of some architectures.  They reside in
-/sys/devices/system/cpu/cpuX/topology/:
-
-physical_package_id:
-
-	physical package id of cpuX. Typically corresponds to a physical
-	socket number, but the actual value is architecture and platform
-	dependent.
-
-die_id:
-
-	the CPU die ID of cpuX. Typically it is the hardware platform's
-	identifier (rather than the kernel's).  The actual value is
-	architecture and platform dependent.
-
-core_id:
-
-	the CPU core ID of cpuX. Typically it is the hardware platform's
-	identifier (rather than the kernel's).  The actual value is
-	architecture and platform dependent.
-
-book_id:
-
-	the book ID of cpuX. Typically it is the hardware platform's
-	identifier (rather than the kernel's).	The actual value is
-	architecture and platform dependent.
-
-drawer_id:
-
-	the drawer ID of cpuX. Typically it is the hardware platform's
-	identifier (rather than the kernel's).	The actual value is
-	architecture and platform dependent.
-
-core_cpus:
-
-	internal kernel map of CPUs within the same core.
-	(deprecated name: "thread_siblings")
-
-core_cpus_list:
-
-	human-readable list of CPUs within the same core.
-	The format is like 0-3, 8-11, 14,17. The maximum size is PAGE_SIZE,
-	so the tail of the string will be trimmed while its size is larger
-	than PAGE_SIZE.
-	(deprecated name: "thread_siblings_list");
-
-package_cpus:
-
-	internal kernel map of the CPUs sharing the same physical_package_id.
-	(deprecated name: "core_siblings")
-
-package_cpus_list:
-
-	human-readable list of CPUs sharing the same physical_package_id.
-	The format is like 0-3, 8-11, 14,17. The maximum size is PAGE_SIZE,
-	so the tail of the string will be trimmed while its size is larger
-	than PAGE_SIZE.
-	(deprecated name: "core_siblings_list")
-
-die_cpus:
-
-	internal kernel map of CPUs within the same die.
-
-die_cpus_list:
-
-	human-readable list of CPUs within the same die.
-	The format is like 0-3, 8-11, 14,17. The maximum size is PAGE_SIZE,
-	so the tail of the string will be trimmed while its size is larger
-	than PAGE_SIZE.
-
-book_siblings:
-
-	internal kernel map of cpuX's hardware threads within the same
-	book_id.
-
-book_siblings_list:
-
-	human-readable list of cpuX's hardware threads within the same
-	book_id.
-	The format is like 0-3, 8-11, 14,17. The maximum size is PAGE_SIZE,
-	so the tail of the string will be trimmed while its size is larger
-	than PAGE_SIZE.
-
-drawer_siblings:
-
-	internal kernel map of cpuX's hardware threads within the same
-	drawer_id.
-
-drawer_siblings_list:
-
-	human-readable list of cpuX's hardware threads within the same
-	drawer_id.
-	The format is like 0-3, 8-11, 14,17. The maximum size is PAGE_SIZE,
-	so the tail of the string will be trimmed while its size is larger
-	than PAGE_SIZE.
+Export CPU topology info via sysfs. Please refer to the ABI file:
+Documentation/ABI/stable/sysfs-devices-system-cpu.
 
 Architecture-neutral, drivers/base/topology.c, exports these attributes.
 However, the book and drawer related sysfs files will only be created if
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] CPU, NUMA topology ABIs: clarify the overflow issue of sysfs pagebuf
  2021-04-29  7:03 ` [PATCH 1/2] CPU, NUMA topology ABIs: clarify the overflow issue of sysfs pagebuf Tian Tao
@ 2021-04-29 14:21   ` Dave Hansen
  2021-04-29 15:46     ` Greg KH
  2021-04-29 21:08     ` Song Bao Hua (Barry Song)
  2021-05-06 23:16   ` kernel test robot
  1 sibling, 2 replies; 15+ messages in thread
From: Dave Hansen @ 2021-04-29 14:21 UTC (permalink / raw)
  To: Tian Tao, corbet, gregkh, song.bao.hua
  Cc: linux-doc, linux-kernel, Rafael J. Wysocki, Peter Zijlstra,
	Valentin Schneider, Dave Hansen, Daniel Bristot de Oliveira

On 4/29/21 12:03 AM, Tian Tao wrote:
> diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
> index 484fc04..82dfe64 100644
> --- a/Documentation/ABI/stable/sysfs-devices-node
> +++ b/Documentation/ABI/stable/sysfs-devices-node
> @@ -47,7 +47,10 @@ What:		/sys/devices/system/node/nodeX/cpulist
>  Date:		October 2002
>  Contact:	Linux Memory Management list <linux-mm@kvack.org>
>  Description:
> -		The CPUs associated to the node.
> +		The CPUs associated to the node. The format is like 0-3,
> +		8-11, 14,17. maximum size is PAGE_SIZE, so the tail
> +		of the string will be trimmed while its size is larger
> +		than PAGE_SIZE.

I think it's pretty arguable that truncating output on a real system is
an ABI break.  Doing this would make the interface rather useless.

Don't we need a real solution rather than throwing up our hands?

Do we think >PAGE_SIZE data out of a sysfs file is a worse ABI break or
something?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] CPU, NUMA topology ABIs: clarify the overflow issue of sysfs pagebuf
  2021-04-29 14:21   ` Dave Hansen
@ 2021-04-29 15:46     ` Greg KH
  2021-04-29 21:08     ` Song Bao Hua (Barry Song)
  1 sibling, 0 replies; 15+ messages in thread
From: Greg KH @ 2021-04-29 15:46 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Tian Tao, corbet, song.bao.hua, linux-doc, linux-kernel,
	Rafael J. Wysocki, Peter Zijlstra, Valentin Schneider,
	Dave Hansen, Daniel Bristot de Oliveira

On Thu, Apr 29, 2021 at 07:21:13AM -0700, Dave Hansen wrote:
> On 4/29/21 12:03 AM, Tian Tao wrote:
> > diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
> > index 484fc04..82dfe64 100644
> > --- a/Documentation/ABI/stable/sysfs-devices-node
> > +++ b/Documentation/ABI/stable/sysfs-devices-node
> > @@ -47,7 +47,10 @@ What:		/sys/devices/system/node/nodeX/cpulist
> >  Date:		October 2002
> >  Contact:	Linux Memory Management list <linux-mm@kvack.org>
> >  Description:
> > -		The CPUs associated to the node.
> > +		The CPUs associated to the node. The format is like 0-3,
> > +		8-11, 14,17. maximum size is PAGE_SIZE, so the tail
> > +		of the string will be trimmed while its size is larger
> > +		than PAGE_SIZE.
> 
> I think it's pretty arguable that truncating output on a real system is
> an ABI break.  Doing this would make the interface rather useless.
> 
> Don't we need a real solution rather than throwing up our hands?
> 
> Do we think >PAGE_SIZE data out of a sysfs file is a worse ABI break or
> something?

There is a real way to get > PAGE_SIZE out of a sysfs file.  The LED
developers had to do this when they ran into this same exact problem.
Make it a binary sysfs file and promise to NEVER create such a file
again in the future :)

thanks,

gre gk-h

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: [PATCH 1/2] CPU, NUMA topology ABIs: clarify the overflow issue of sysfs pagebuf
  2021-04-29 14:21   ` Dave Hansen
  2021-04-29 15:46     ` Greg KH
@ 2021-04-29 21:08     ` Song Bao Hua (Barry Song)
  2021-04-29 21:38       ` Dave Hansen
  1 sibling, 1 reply; 15+ messages in thread
From: Song Bao Hua (Barry Song) @ 2021-04-29 21:08 UTC (permalink / raw)
  To: Dave Hansen, tiantao (H), corbet, gregkh
  Cc: linux-doc, linux-kernel, Rafael J. Wysocki, Peter Zijlstra,
	Valentin Schneider, Dave Hansen, Daniel Bristot de Oliveira



> -----Original Message-----
> From: Dave Hansen [mailto:dave.hansen@intel.com]
> Sent: Friday, April 30, 2021 2:21 AM
> To: tiantao (H) <tiantao6@hisilicon.com>; corbet@lwn.net;
> gregkh@linuxfoundation.org; Song Bao Hua (Barry Song)
> <song.bao.hua@hisilicon.com>
> Cc: linux-doc@vger.kernel.org; linux-kernel@vger.kernel.org; Rafael J.
> Wysocki <rafael@kernel.org>; Peter Zijlstra <peterz@infradead.org>; Valentin
> Schneider <valentin.schneider@arm.com>; Dave Hansen
> <dave.hansen@linux.intel.com>; Daniel Bristot de Oliveira <bristot@redhat.com>
> Subject: Re: [PATCH 1/2] CPU, NUMA topology ABIs: clarify the overflow issue
> of sysfs pagebuf
> 
> On 4/29/21 12:03 AM, Tian Tao wrote:
> > diff --git a/Documentation/ABI/stable/sysfs-devices-node
> b/Documentation/ABI/stable/sysfs-devices-node
> > index 484fc04..82dfe64 100644
> > --- a/Documentation/ABI/stable/sysfs-devices-node
> > +++ b/Documentation/ABI/stable/sysfs-devices-node
> > @@ -47,7 +47,10 @@ What:		/sys/devices/system/node/nodeX/cpulist
> >  Date:		October 2002
> >  Contact:	Linux Memory Management list <linux-mm@kvack.org>
> >  Description:
> > -		The CPUs associated to the node.
> > +		The CPUs associated to the node. The format is like 0-3,
> > +		8-11, 14,17. maximum size is PAGE_SIZE, so the tail
> > +		of the string will be trimmed while its size is larger
> > +		than PAGE_SIZE.
> 
> I think it's pretty arguable that truncating output on a real system is
> an ABI break.  Doing this would make the interface rather useless.
> 
> Don't we need a real solution rather than throwing up our hands?
> 
> Do we think >PAGE_SIZE data out of a sysfs file is a worse ABI break or
> something?

This kind of cpu list ABIs have been there for many years but have 
never been documented well.

We have two ABIs:
xxx_cpus - in format like 3333333333
xxx_cpus_list - in format like 0,3,5,7,9,11,13....

xxx_cpus_list is another human-readable version of xxx_cpus. It doesn't
include any more useful information than xxx_cpus.

xxx_cpus won't overflow based on BUILD_BUG_ON and maximum NR_CPUS
in kconfig nowadays.

if people all agree the trimmed list is a break of ABI, I think we may
totally remove this list. For these days, this list probably has never
overflowed but literally this could happen.

thoughts?

Thanks
Barry

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] CPU, NUMA topology ABIs: clarify the overflow issue of sysfs pagebuf
  2021-04-29 21:08     ` Song Bao Hua (Barry Song)
@ 2021-04-29 21:38       ` Dave Hansen
  2021-04-29 22:32         ` Song Bao Hua (Barry Song)
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Hansen @ 2021-04-29 21:38 UTC (permalink / raw)
  To: Song Bao Hua (Barry Song), tiantao (H), corbet, gregkh
  Cc: linux-doc, linux-kernel, Rafael J. Wysocki, Peter Zijlstra,
	Valentin Schneider, Dave Hansen, Daniel Bristot de Oliveira

On 4/29/21 2:08 PM, Song Bao Hua (Barry Song) wrote:
>> Do we think >PAGE_SIZE data out of a sysfs file is a worse ABI break or
>> something?
> This kind of cpu list ABIs have been there for many years but have 
> never been documented well.
> 
> We have two ABIs:
> xxx_cpus - in format like 3333333333
> xxx_cpus_list - in format like 0,3,5,7,9,11,13....
> 
> xxx_cpus_list is another human-readable version of xxx_cpus. It doesn't
> include any more useful information than xxx_cpus.
> 
> xxx_cpus won't overflow based on BUILD_BUG_ON and maximum NR_CPUS
> in kconfig nowadays.
> 
> if people all agree the trimmed list is a break of ABI, I think we may
> totally remove this list. For these days, this list probably has never
> overflowed but literally this could happen.
> 
> thoughts?

From what Greg said, it sounds like removing the BUILD_BUG_ON(), making
it a binary sysfs file, and making it support arbitrary lengths is the
way to go.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: [PATCH 1/2] CPU, NUMA topology ABIs: clarify the overflow issue of sysfs pagebuf
  2021-04-29 21:38       ` Dave Hansen
@ 2021-04-29 22:32         ` Song Bao Hua (Barry Song)
  2021-04-29 22:38           ` Dave Hansen
  0 siblings, 1 reply; 15+ messages in thread
From: Song Bao Hua (Barry Song) @ 2021-04-29 22:32 UTC (permalink / raw)
  To: Dave Hansen, tiantao (H), corbet, gregkh
  Cc: linux-doc, linux-kernel, Rafael J. Wysocki, Peter Zijlstra,
	Valentin Schneider, Dave Hansen, Daniel Bristot de Oliveira



> -----Original Message-----
> From: Dave Hansen [mailto:dave.hansen@intel.com]
> Sent: Friday, April 30, 2021 9:39 AM
> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>; tiantao (H)
> <tiantao6@hisilicon.com>; corbet@lwn.net; gregkh@linuxfoundation.org
> Cc: linux-doc@vger.kernel.org; linux-kernel@vger.kernel.org; Rafael J.
> Wysocki <rafael@kernel.org>; Peter Zijlstra <peterz@infradead.org>; Valentin
> Schneider <valentin.schneider@arm.com>; Dave Hansen
> <dave.hansen@linux.intel.com>; Daniel Bristot de Oliveira <bristot@redhat.com>
> Subject: Re: [PATCH 1/2] CPU, NUMA topology ABIs: clarify the overflow issue
> of sysfs pagebuf
> 
> On 4/29/21 2:08 PM, Song Bao Hua (Barry Song) wrote:
> >> Do we think >PAGE_SIZE data out of a sysfs file is a worse ABI break or
> >> something?
> > This kind of cpu list ABIs have been there for many years but have
> > never been documented well.
> >
> > We have two ABIs:
> > xxx_cpus - in format like 3333333333
> > xxx_cpus_list - in format like 0,3,5,7,9,11,13....
> >
> > xxx_cpus_list is another human-readable version of xxx_cpus. It doesn't
> > include any more useful information than xxx_cpus.
> >
> > xxx_cpus won't overflow based on BUILD_BUG_ON and maximum NR_CPUS
> > in kconfig nowadays.
> >
> > if people all agree the trimmed list is a break of ABI, I think we may
> > totally remove this list. For these days, this list probably has never
> > overflowed but literally this could happen.
> >
> > thoughts?
> 
> From what Greg said, it sounds like removing the BUILD_BUG_ON(), making
> it a binary sysfs file, and making it support arbitrary lengths is the
> way to go.

I am actually more concerned on xxx_cpus_list than xxx_cpus.

xxx_cpus has never overflowed. Though there is a BUILD_BUG_ON(), but the
maximum NR_CPUS is 8096, it is still far from overflow.
8096 /32 * 9 = 2277
as 2277 < 4096

While NR_CPUS gets to ~14500, for a 4KB page, xxx_cpus will overflow.
But I don't know when hardware will reach there. If we reach there,
the existing code to describe topology and schedule tasks might also
need rework.

On the other hand, lscpu, hwloc, numactl etc are using the existing
hex bitmap ABI:

$ strace lscpu  2>&1 | grep topology
faccessat(AT_FDCWD,
"/sys/devices/system/cpu/cpu0/topology/thread_siblings", F_OK) = 0
openat(AT_FDCWD,
"/sys/devices/system/cpu/cpu0/topology/thread_siblings",
O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD,
"/sys/devices/system/cpu/cpu0/topology/core_siblings",
O_RDONLY|O_CLOEXEC) = 3
faccessat(AT_FDCWD,
"/sys/devices/system/cpu/cpu0/topology/book_siblings", F_OK) = -1
ENOENT (No such file or directory)
faccessat(AT_FDCWD,
"/sys/devices/system/cpu/cpu1/topology/thread_siblings", F_OK) = 0
openat(AT_FDCWD,
...

$ strace numactl --hardware  2>&1 | grep cpu
openat(AT_FDCWD, "/sys/devices/system/cpu",
O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/sys/devices/system/node/node0/cpumap", O_RDONLY) = 3
openat(AT_FDCWD, "/sys/devices/system/node/node1/cpumap", O_RDONLY) = 3
openat(AT_FDCWD, "/sys/devices/system/node/node2/cpumap", O_RDONLY) = 3
openat(AT_FDCWD, "/sys/devices/system/node/node3/cpumap", O_RDONLY) = 3

If we move to binary, it means we have to change those applications.

For this moment, I'd argue we keep cpu bitmap ABI as is and defer this
issue to when we really get so many cores.

Thanks
Barry

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] CPU, NUMA topology ABIs: clarify the overflow issue of sysfs pagebuf
  2021-04-29 22:32         ` Song Bao Hua (Barry Song)
@ 2021-04-29 22:38           ` Dave Hansen
  2021-04-29 22:48             ` Song Bao Hua (Barry Song)
                               ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Dave Hansen @ 2021-04-29 22:38 UTC (permalink / raw)
  To: Song Bao Hua (Barry Song), tiantao (H), corbet, gregkh
  Cc: linux-doc, linux-kernel, Rafael J. Wysocki, Peter Zijlstra,
	Valentin Schneider, Dave Hansen, Daniel Bristot de Oliveira

On 4/29/21 3:32 PM, Song Bao Hua (Barry Song) wrote:
> $ strace numactl --hardware  2>&1 | grep cpu
> openat(AT_FDCWD, "/sys/devices/system/cpu",
> O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
> openat(AT_FDCWD, "/sys/devices/system/node/node0/cpumap", O_RDONLY) = 3
> openat(AT_FDCWD, "/sys/devices/system/node/node1/cpumap", O_RDONLY) = 3
> openat(AT_FDCWD, "/sys/devices/system/node/node2/cpumap", O_RDONLY) = 3
> openat(AT_FDCWD, "/sys/devices/system/node/node3/cpumap", O_RDONLY) = 3
> 
> If we move to binary, it means we have to change those applications.

I thought Greg was saying to using a sysfs binary attribute using
something like like sysfs_create_bin_file().  Those don't have the
PAGE_SIZE limitation.  But, there's also nothing to keep us from spewing
nice human-readable text via the "binary" file.

We don't need to change the file format, just the internal kernel API
that we produce the files with.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: [PATCH 1/2] CPU, NUMA topology ABIs: clarify the overflow issue of sysfs pagebuf
  2021-04-29 22:38           ` Dave Hansen
@ 2021-04-29 22:48             ` Song Bao Hua (Barry Song)
  2021-04-30  6:05             ` gregkh
  2021-07-23 11:20             ` Song Bao Hua (Barry Song)
  2 siblings, 0 replies; 15+ messages in thread
From: Song Bao Hua (Barry Song) @ 2021-04-29 22:48 UTC (permalink / raw)
  To: Dave Hansen, tiantao (H), corbet, gregkh
  Cc: linux-doc, linux-kernel, Rafael J. Wysocki, Peter Zijlstra,
	Valentin Schneider, Dave Hansen, Daniel Bristot de Oliveira



> -----Original Message-----
> From: Dave Hansen [mailto:dave.hansen@intel.com]
> Sent: Friday, April 30, 2021 10:39 AM
> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>; tiantao (H)
> <tiantao6@hisilicon.com>; corbet@lwn.net; gregkh@linuxfoundation.org
> Cc: linux-doc@vger.kernel.org; linux-kernel@vger.kernel.org; Rafael J.
> Wysocki <rafael@kernel.org>; Peter Zijlstra <peterz@infradead.org>; Valentin
> Schneider <valentin.schneider@arm.com>; Dave Hansen
> <dave.hansen@linux.intel.com>; Daniel Bristot de Oliveira <bristot@redhat.com>
> Subject: Re: [PATCH 1/2] CPU, NUMA topology ABIs: clarify the overflow issue
> of sysfs pagebuf
> 
> On 4/29/21 3:32 PM, Song Bao Hua (Barry Song) wrote:
> > $ strace numactl --hardware  2>&1 | grep cpu
> > openat(AT_FDCWD, "/sys/devices/system/cpu",
> > O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
> > openat(AT_FDCWD, "/sys/devices/system/node/node0/cpumap", O_RDONLY) = 3
> > openat(AT_FDCWD, "/sys/devices/system/node/node1/cpumap", O_RDONLY) = 3
> > openat(AT_FDCWD, "/sys/devices/system/node/node2/cpumap", O_RDONLY) = 3
> > openat(AT_FDCWD, "/sys/devices/system/node/node3/cpumap", O_RDONLY) = 3
> >
> > If we move to binary, it means we have to change those applications.
> 
> I thought Greg was saying to using a sysfs binary attribute using
> something like like sysfs_create_bin_file().  Those don't have the
> PAGE_SIZE limitation.  But, there's also nothing to keep us from spewing
> nice human-readable text via the "binary" file.
> 
> We don't need to change the file format, just the internal kernel API
> that we produce the files with.

Dave, thanks for clarification. Sounds a way to go.

Barry

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] CPU, NUMA topology ABIs: clarify the overflow issue of sysfs pagebuf
  2021-04-29 22:38           ` Dave Hansen
  2021-04-29 22:48             ` Song Bao Hua (Barry Song)
@ 2021-04-30  6:05             ` gregkh
  2021-07-23 11:20             ` Song Bao Hua (Barry Song)
  2 siblings, 0 replies; 15+ messages in thread
From: gregkh @ 2021-04-30  6:05 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Song Bao Hua (Barry Song), tiantao (H),
	corbet, linux-doc, linux-kernel, Rafael J. Wysocki,
	Peter Zijlstra, Valentin Schneider, Dave Hansen,
	Daniel Bristot de Oliveira

On Thu, Apr 29, 2021 at 03:38:39PM -0700, Dave Hansen wrote:
> On 4/29/21 3:32 PM, Song Bao Hua (Barry Song) wrote:
> > $ strace numactl --hardware  2>&1 | grep cpu
> > openat(AT_FDCWD, "/sys/devices/system/cpu",
> > O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
> > openat(AT_FDCWD, "/sys/devices/system/node/node0/cpumap", O_RDONLY) = 3
> > openat(AT_FDCWD, "/sys/devices/system/node/node1/cpumap", O_RDONLY) = 3
> > openat(AT_FDCWD, "/sys/devices/system/node/node2/cpumap", O_RDONLY) = 3
> > openat(AT_FDCWD, "/sys/devices/system/node/node3/cpumap", O_RDONLY) = 3
> > 
> > If we move to binary, it means we have to change those applications.
> 
> I thought Greg was saying to using a sysfs binary attribute using
> something like like sysfs_create_bin_file().  Those don't have the
> PAGE_SIZE limitation.  But, there's also nothing to keep us from spewing
> nice human-readable text via the "binary" file.

That is correct.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] CPU, NUMA topology ABIs: clarify the overflow issue of sysfs pagebuf
  2021-04-29  7:03 ` [PATCH 1/2] CPU, NUMA topology ABIs: clarify the overflow issue of sysfs pagebuf Tian Tao
  2021-04-29 14:21   ` Dave Hansen
@ 2021-05-06 23:16   ` kernel test robot
  1 sibling, 0 replies; 15+ messages in thread
From: kernel test robot @ 2021-05-06 23:16 UTC (permalink / raw)
  To: Tian Tao, corbet, gregkh, song.bao.hua
  Cc: kbuild-all, linux-doc, linux-kernel, Tian Tao, Rafael J. Wysocki,
	Peter Zijlstra, Valentin Schneider, Dave Hansen

[-- Attachment #1: Type: text/plain, Size: 17540 bytes --]

Hi Tian,

I love your patch! Yet something to improve:

[auto build test ERROR on lwn/docs-next]
[also build test ERROR on driver-core/driver-core-testing linus/master v5.12 next-20210506]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Tian-Tao/clarify-and-cleanup-CPU-and-NUMA-topology-ABIs/20210429-150451
base:   git://git.lwn.net/linux-2.6 docs-next
config: mips-allyesconfig (attached as .config)
compiler: mips-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/357a0babc7ad904e3099bf86034af88cf5ce2a70
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Tian-Tao/clarify-and-cleanup-CPU-and-NUMA-topology-ABIs/20210429-150451
        git checkout 357a0babc7ad904e3099bf86034af88cf5ce2a70
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross W=1 ARCH=mips 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   In file included from arch/mips/include/asm/io.h:31,
                    from arch/mips/include/asm/page.h:200,
                    from include/linux/cpumask.h:15,
                    from include/linux/smp.h:13,
                    from arch/mips/include/asm/cpu-type.h:12,
                    from arch/mips/include/asm/timex.h:19,
                    from include/linux/timex.h:65,
                    from include/linux/time32.h:13,
                    from include/linux/time.h:60,
                    from include/linux/compat.h:10,
                    from arch/mips/kernel/asm-offsets.c:12:
>> arch/mips/include/asm/processor.h:269:2: error: unknown type name 'cpumask_t'
     269 |  cpumask_t user_cpus_allowed;
         |  ^~~~~~~~~
   arch/mips/kernel/asm-offsets.c:26:6: warning: no previous prototype for 'output_ptreg_defines' [-Wmissing-prototypes]
      26 | void output_ptreg_defines(void)
         |      ^~~~~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:78:6: warning: no previous prototype for 'output_task_defines' [-Wmissing-prototypes]
      78 | void output_task_defines(void)
         |      ^~~~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:93:6: warning: no previous prototype for 'output_thread_info_defines' [-Wmissing-prototypes]
      93 | void output_thread_info_defines(void)
         |      ^~~~~~~~~~~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:110:6: warning: no previous prototype for 'output_thread_defines' [-Wmissing-prototypes]
     110 | void output_thread_defines(void)
         |      ^~~~~~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:138:6: warning: no previous prototype for 'output_thread_fpu_defines' [-Wmissing-prototypes]
     138 | void output_thread_fpu_defines(void)
         |      ^~~~~~~~~~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:181:6: warning: no previous prototype for 'output_mm_defines' [-Wmissing-prototypes]
     181 | void output_mm_defines(void)
         |      ^~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:220:6: warning: no previous prototype for 'output_sc_defines' [-Wmissing-prototypes]
     220 | void output_sc_defines(void)
         |      ^~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:255:6: warning: no previous prototype for 'output_signal_defined' [-Wmissing-prototypes]
     255 | void output_signal_defined(void)
         |      ^~~~~~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:322:6: warning: no previous prototype for 'output_pbe_defines' [-Wmissing-prototypes]
     322 | void output_pbe_defines(void)
         |      ^~~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:334:6: warning: no previous prototype for 'output_pm_defines' [-Wmissing-prototypes]
     334 | void output_pm_defines(void)
         |      ^~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:348:6: warning: no previous prototype for 'output_kvm_defines' [-Wmissing-prototypes]
     348 | void output_kvm_defines(void)
         |      ^~~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:392:6: warning: no previous prototype for 'output_cps_defines' [-Wmissing-prototypes]
     392 | void output_cps_defines(void)
         |      ^~~~~~~~~~~~~~~~~~
--
   In file included from arch/mips/include/asm/io.h:31,
                    from arch/mips/include/asm/page.h:200,
                    from include/linux/cpumask.h:15,
                    from include/linux/smp.h:13,
                    from arch/mips/include/asm/cpu-type.h:12,
                    from arch/mips/include/asm/timex.h:19,
                    from include/linux/timex.h:65,
                    from include/linux/time32.h:13,
                    from include/linux/time.h:60,
                    from include/linux/compat.h:10,
                    from arch/mips/kernel/asm-offsets.c:12:
>> arch/mips/include/asm/processor.h:269:2: error: unknown type name 'cpumask_t'
     269 |  cpumask_t user_cpus_allowed;
         |  ^~~~~~~~~
   arch/mips/kernel/asm-offsets.c:26:6: warning: no previous prototype for 'output_ptreg_defines' [-Wmissing-prototypes]
      26 | void output_ptreg_defines(void)
         |      ^~~~~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:78:6: warning: no previous prototype for 'output_task_defines' [-Wmissing-prototypes]
      78 | void output_task_defines(void)
         |      ^~~~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:93:6: warning: no previous prototype for 'output_thread_info_defines' [-Wmissing-prototypes]
      93 | void output_thread_info_defines(void)
         |      ^~~~~~~~~~~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:110:6: warning: no previous prototype for 'output_thread_defines' [-Wmissing-prototypes]
     110 | void output_thread_defines(void)
         |      ^~~~~~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:138:6: warning: no previous prototype for 'output_thread_fpu_defines' [-Wmissing-prototypes]
     138 | void output_thread_fpu_defines(void)
         |      ^~~~~~~~~~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:181:6: warning: no previous prototype for 'output_mm_defines' [-Wmissing-prototypes]
     181 | void output_mm_defines(void)
         |      ^~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:220:6: warning: no previous prototype for 'output_sc_defines' [-Wmissing-prototypes]
     220 | void output_sc_defines(void)
         |      ^~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:255:6: warning: no previous prototype for 'output_signal_defined' [-Wmissing-prototypes]
     255 | void output_signal_defined(void)
         |      ^~~~~~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:322:6: warning: no previous prototype for 'output_pbe_defines' [-Wmissing-prototypes]
     322 | void output_pbe_defines(void)
         |      ^~~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:334:6: warning: no previous prototype for 'output_pm_defines' [-Wmissing-prototypes]
     334 | void output_pm_defines(void)
         |      ^~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:348:6: warning: no previous prototype for 'output_kvm_defines' [-Wmissing-prototypes]
     348 | void output_kvm_defines(void)
         |      ^~~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:392:6: warning: no previous prototype for 'output_cps_defines' [-Wmissing-prototypes]
     392 | void output_cps_defines(void)
         |      ^~~~~~~~~~~~~~~~~~
   make[2]: *** [scripts/Makefile.build:116: arch/mips/kernel/asm-offsets.s] Error 1
   make[2]: Target '__build' not remade because of errors.
   make[1]: *** [Makefile:1233: prepare0] Error 2
   make[1]: Target 'modules_prepare' not remade because of errors.
   make: *** [Makefile:215: __sub-make] Error 2
   make: Target 'modules_prepare' not remade because of errors.
--
   In file included from arch/mips/include/asm/io.h:31,
                    from arch/mips/include/asm/page.h:200,
                    from include/linux/cpumask.h:15,
                    from include/linux/smp.h:13,
                    from arch/mips/include/asm/cpu-type.h:12,
                    from arch/mips/include/asm/timex.h:19,
                    from include/linux/timex.h:65,
                    from include/linux/time32.h:13,
                    from include/linux/time.h:60,
                    from include/linux/compat.h:10,
                    from arch/mips/kernel/asm-offsets.c:12:
>> arch/mips/include/asm/processor.h:269:2: error: unknown type name 'cpumask_t'
     269 |  cpumask_t user_cpus_allowed;
         |  ^~~~~~~~~
   arch/mips/kernel/asm-offsets.c:26:6: warning: no previous prototype for 'output_ptreg_defines' [-Wmissing-prototypes]
      26 | void output_ptreg_defines(void)
         |      ^~~~~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:78:6: warning: no previous prototype for 'output_task_defines' [-Wmissing-prototypes]
      78 | void output_task_defines(void)
         |      ^~~~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:93:6: warning: no previous prototype for 'output_thread_info_defines' [-Wmissing-prototypes]
      93 | void output_thread_info_defines(void)
         |      ^~~~~~~~~~~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:110:6: warning: no previous prototype for 'output_thread_defines' [-Wmissing-prototypes]
     110 | void output_thread_defines(void)
         |      ^~~~~~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:138:6: warning: no previous prototype for 'output_thread_fpu_defines' [-Wmissing-prototypes]
     138 | void output_thread_fpu_defines(void)
         |      ^~~~~~~~~~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:181:6: warning: no previous prototype for 'output_mm_defines' [-Wmissing-prototypes]
     181 | void output_mm_defines(void)
         |      ^~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:220:6: warning: no previous prototype for 'output_sc_defines' [-Wmissing-prototypes]
     220 | void output_sc_defines(void)
         |      ^~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:255:6: warning: no previous prototype for 'output_signal_defined' [-Wmissing-prototypes]
     255 | void output_signal_defined(void)
         |      ^~~~~~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:322:6: warning: no previous prototype for 'output_pbe_defines' [-Wmissing-prototypes]
     322 | void output_pbe_defines(void)
         |      ^~~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:334:6: warning: no previous prototype for 'output_pm_defines' [-Wmissing-prototypes]
     334 | void output_pm_defines(void)
         |      ^~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:348:6: warning: no previous prototype for 'output_kvm_defines' [-Wmissing-prototypes]
     348 | void output_kvm_defines(void)
         |      ^~~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:392:6: warning: no previous prototype for 'output_cps_defines' [-Wmissing-prototypes]
     392 | void output_cps_defines(void)
         |      ^~~~~~~~~~~~~~~~~~
   make[2]: *** [scripts/Makefile.build:116: arch/mips/kernel/asm-offsets.s] Error 1
   make[2]: Target '__build' not remade because of errors.
   make[1]: *** [Makefile:1233: prepare0] Error 2
   make[1]: Target 'prepare' not remade because of errors.
   make: *** [Makefile:215: __sub-make] Error 2
   make: Target 'prepare' not remade because of errors.


vim +/cpumask_t +269 arch/mips/include/asm/processor.h

e50c0a8fa60da9a include/asm-mips/processor.h      Ralf Baechle   2005-05-31  242  
^1da177e4c3f415 include/asm-mips/processor.h      Linus Torvalds 2005-04-16  243  /*
^1da177e4c3f415 include/asm-mips/processor.h      Linus Torvalds 2005-04-16  244   * If you change thread_struct remember to change the #defines below too!
^1da177e4c3f415 include/asm-mips/processor.h      Linus Torvalds 2005-04-16  245   */
^1da177e4c3f415 include/asm-mips/processor.h      Linus Torvalds 2005-04-16  246  struct thread_struct {
^1da177e4c3f415 include/asm-mips/processor.h      Linus Torvalds 2005-04-16  247  	/* Saved main processor registers. */
^1da177e4c3f415 include/asm-mips/processor.h      Linus Torvalds 2005-04-16  248  	unsigned long reg16;
^1da177e4c3f415 include/asm-mips/processor.h      Linus Torvalds 2005-04-16  249  	unsigned long reg17, reg18, reg19, reg20, reg21, reg22, reg23;
^1da177e4c3f415 include/asm-mips/processor.h      Linus Torvalds 2005-04-16  250  	unsigned long reg29, reg30, reg31;
^1da177e4c3f415 include/asm-mips/processor.h      Linus Torvalds 2005-04-16  251  
^1da177e4c3f415 include/asm-mips/processor.h      Linus Torvalds 2005-04-16  252  	/* Saved cp0 stuff. */
^1da177e4c3f415 include/asm-mips/processor.h      Linus Torvalds 2005-04-16  253  	unsigned long cp0_status;
^1da177e4c3f415 include/asm-mips/processor.h      Linus Torvalds 2005-04-16  254  
2725f3778fddb70 arch/mips/include/asm/processor.h Paul Burton    2018-11-07  255  #ifdef CONFIG_MIPS_FP_SUPPORT
^1da177e4c3f415 include/asm-mips/processor.h      Linus Torvalds 2005-04-16  256  	/* Saved fpu/fpu emulator stuff. */
37cddff8e330a87 arch/mips/include/asm/processor.h Paul Burton    2014-07-11  257  	struct mips_fpu_struct fpu FPU_ALIGN;
432c6bacbd0c16e arch/mips/include/asm/processor.h Paul Burton    2016-07-08  258  	/* Assigned branch delay slot 'emulation' frame */
432c6bacbd0c16e arch/mips/include/asm/processor.h Paul Burton    2016-07-08  259  	atomic_t bd_emu_frame;
432c6bacbd0c16e arch/mips/include/asm/processor.h Paul Burton    2016-07-08  260  	/* PC of the branch from a branch delay slot 'emulation' */
432c6bacbd0c16e arch/mips/include/asm/processor.h Paul Burton    2016-07-08  261  	unsigned long bd_emu_branch_pc;
432c6bacbd0c16e arch/mips/include/asm/processor.h Paul Burton    2016-07-08  262  	/* PC to continue from following a branch delay slot 'emulation' */
432c6bacbd0c16e arch/mips/include/asm/processor.h Paul Burton    2016-07-08  263  	unsigned long bd_emu_cont_pc;
aebdc6ff3b2e793 arch/mips/include/asm/processor.h Yousong Zhou   2020-03-24  264  #endif
f088fc84f94c1a3 include/asm-mips/processor.h      Ralf Baechle   2006-04-05  265  #ifdef CONFIG_MIPS_MT_FPAFF
f088fc84f94c1a3 include/asm-mips/processor.h      Ralf Baechle   2006-04-05  266  	/* Emulated instruction count */
f088fc84f94c1a3 include/asm-mips/processor.h      Ralf Baechle   2006-04-05  267  	unsigned long emulated_fp;
f088fc84f94c1a3 include/asm-mips/processor.h      Ralf Baechle   2006-04-05  268  	/* Saved per-thread scheduler affinity mask */
f088fc84f94c1a3 include/asm-mips/processor.h      Ralf Baechle   2006-04-05 @269  	cpumask_t user_cpus_allowed;
f088fc84f94c1a3 include/asm-mips/processor.h      Ralf Baechle   2006-04-05  270  #endif /* CONFIG_MIPS_MT_FPAFF */
^1da177e4c3f415 include/asm-mips/processor.h      Linus Torvalds 2005-04-16  271  
e50c0a8fa60da9a include/asm-mips/processor.h      Ralf Baechle   2005-05-31  272  	/* Saved state of the DSP ASE, if available. */
e50c0a8fa60da9a include/asm-mips/processor.h      Ralf Baechle   2005-05-31  273  	struct mips_dsp_state dsp;
e50c0a8fa60da9a include/asm-mips/processor.h      Ralf Baechle   2005-05-31  274  
6aa3524c182c01b arch/mips/include/asm/processor.h David Daney    2008-09-23  275  	/* Saved watch register state, if available. */
6aa3524c182c01b arch/mips/include/asm/processor.h David Daney    2008-09-23  276  	union mips_watch_reg_state watch;
6aa3524c182c01b arch/mips/include/asm/processor.h David Daney    2008-09-23  277  
^1da177e4c3f415 include/asm-mips/processor.h      Linus Torvalds 2005-04-16  278  	/* Other stuff associated with the thread. */
^1da177e4c3f415 include/asm-mips/processor.h      Linus Torvalds 2005-04-16  279  	unsigned long cp0_badvaddr;	/* Last user fault */
^1da177e4c3f415 include/asm-mips/processor.h      Linus Torvalds 2005-04-16  280  	unsigned long cp0_baduaddr;	/* Last kernel fault accessing USEG */
^1da177e4c3f415 include/asm-mips/processor.h      Linus Torvalds 2005-04-16  281  	unsigned long error_code;
e3b28831c18c6c9 arch/mips/include/asm/processor.h Ralf Baechle   2015-07-28  282  	unsigned long trap_nr;
b5e00af81f298f4 arch/mips/include/asm/processor.h David Daney    2008-12-11  283  #ifdef CONFIG_CPU_CAVIUM_OCTEON
b5e00af81f298f4 arch/mips/include/asm/processor.h David Daney    2008-12-11  284  	struct octeon_cop2_state cp2 __attribute__ ((__aligned__(128)));
b5e00af81f298f4 arch/mips/include/asm/processor.h David Daney    2008-12-11  285  	struct octeon_cvmseg_state cvmseg __attribute__ ((__aligned__(128)));
5649d37c2b23ad6 arch/mips/include/asm/processor.h Jayachandran C 2013-06-10  286  #endif
5649d37c2b23ad6 arch/mips/include/asm/processor.h Jayachandran C 2013-06-10  287  #ifdef CONFIG_CPU_XLP
5649d37c2b23ad6 arch/mips/include/asm/processor.h Jayachandran C 2013-06-10  288  	struct nlm_cop2_state cp2;
b5e00af81f298f4 arch/mips/include/asm/processor.h David Daney    2008-12-11  289  #endif
e50c0a8fa60da9a include/asm-mips/processor.h      Ralf Baechle   2005-05-31  290  	struct mips_abi *abi;
^1da177e4c3f415 include/asm-mips/processor.h      Linus Torvalds 2005-04-16  291  };
^1da177e4c3f415 include/asm-mips/processor.h      Linus Torvalds 2005-04-16  292  

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 69892 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: [PATCH 1/2] CPU, NUMA topology ABIs: clarify the overflow issue of sysfs pagebuf
  2021-04-29 22:38           ` Dave Hansen
  2021-04-29 22:48             ` Song Bao Hua (Barry Song)
  2021-04-30  6:05             ` gregkh
@ 2021-07-23 11:20             ` Song Bao Hua (Barry Song)
  2021-07-23 11:28               ` gregkh
  2 siblings, 1 reply; 15+ messages in thread
From: Song Bao Hua (Barry Song) @ 2021-07-23 11:20 UTC (permalink / raw)
  To: Dave Hansen, tiantao (H), corbet, gregkh
  Cc: linux-doc, linux-kernel, Rafael J. Wysocki, Peter Zijlstra,
	Valentin Schneider, Dave Hansen, Daniel Bristot de Oliveira,
	Linuxarm



> -----Original Message-----
> From: Dave Hansen [mailto:dave.hansen@intel.com]
> Sent: Friday, April 30, 2021 10:39 AM
> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>; tiantao (H)
> <tiantao6@hisilicon.com>; corbet@lwn.net; gregkh@linuxfoundation.org
> Cc: linux-doc@vger.kernel.org; linux-kernel@vger.kernel.org; Rafael J.
> Wysocki <rafael@kernel.org>; Peter Zijlstra <peterz@infradead.org>; Valentin
> Schneider <valentin.schneider@arm.com>; Dave Hansen
> <dave.hansen@linux.intel.com>; Daniel Bristot de Oliveira <bristot@redhat.com>
> Subject: Re: [PATCH 1/2] CPU, NUMA topology ABIs: clarify the overflow issue
> of sysfs pagebuf
> 
> On 4/29/21 3:32 PM, Song Bao Hua (Barry Song) wrote:
> > $ strace numactl --hardware  2>&1 | grep cpu
> > openat(AT_FDCWD, "/sys/devices/system/cpu",
> > O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
> > openat(AT_FDCWD, "/sys/devices/system/node/node0/cpumap", O_RDONLY) = 3
> > openat(AT_FDCWD, "/sys/devices/system/node/node1/cpumap", O_RDONLY) = 3
> > openat(AT_FDCWD, "/sys/devices/system/node/node2/cpumap", O_RDONLY) = 3
> > openat(AT_FDCWD, "/sys/devices/system/node/node3/cpumap", O_RDONLY) = 3
> >
> > If we move to binary, it means we have to change those applications.
> 
> I thought Greg was saying to using a sysfs binary attribute using
> something like like sysfs_create_bin_file().  Those don't have the
> PAGE_SIZE limitation.  But, there's also nothing to keep us from spewing
> nice human-readable text via the "binary" file.
> 
> We don't need to change the file format, just the internal kernel API
> that we produce the files with.

Sorry for waking-up the old thread.

I am not sure how common a regular device_attribute will be larger than
4KB and have to work around by bin_attribute. But I wrote a prototype
which can initially support large regular sysfs entry and be able to
fill the entire buffer by only one show() entry. The other words to say,
we don't need to call read() of bin_attribute multiple times for a large
regular file. Right now, only read-only attribute is supported.

Subject: [RFC PATCH] sysfs: support regular device attr which can be larger than
 PAGE_SIZE

A regular sysfs ABI could be more than 4KB, right now, we are using
bin_attribute to workaround and break this limit. A better solution
would be supporting long device attribute. In this case, we will
still be able to enjoy the advantages of buffering and seeking of
seq file and only need to fill the entire buffer of sysfs entry
once.

Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
---
 drivers/base/core.c      |  2 +-
 fs/seq_file.c            | 22 ++++++++++++++++++++++
 fs/sysfs/file.c          | 40 +++++++++++++++++++++++++++++++++++++++-
 fs/sysfs/group.c         |  4 ++--
 include/linux/device.h   | 20 ++++++++++++++++++++
 include/linux/seq_file.h |  1 +
 include/linux/sysfs.h    |  1 +
 7 files changed, 86 insertions(+), 4 deletions(-)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index cadcade65825..0cd4ed165154 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -2053,7 +2053,7 @@ static ssize_t dev_attr_show(struct kobject *kobj, struct attribute *attr,
 
 	if (dev_attr->show)
 		ret = dev_attr->show(dev, dev_attr, buf);
-	if (ret >= (ssize_t)PAGE_SIZE) {
+	if (ret >= (ssize_t)PAGE_SIZE && !(attr->mode & SYSFS_LONGATTR)) {
 		printk("dev_attr_show: %pS returned bad count\n",
 				dev_attr->show);
 	}
diff --git a/fs/seq_file.c b/fs/seq_file.c
index b117b212ef28..9054615f8f19 100644
--- a/fs/seq_file.c
+++ b/fs/seq_file.c
@@ -84,6 +84,28 @@ int seq_open(struct file *file, const struct seq_operations *op)
 }
 EXPORT_SYMBOL(seq_open);
 
+/**
+ *	seq_open_prealloc_buf -	allow seq_file users to preallocate buf
+ *	@file: file we initialize
+ *	@size: the maximum size of the file
+ *
+ *	apply to those scenerios single_open_size() doesn't apply. for
+ *	example, while a regular sysfs entry is more than PAGE_SIZE;
+ *	this will permit users to fill the entire buffer by calling
+ *	device_attr show() only once
+ */
+int seq_open_prealloc_buf(struct file *file, unsigned long size)
+{
+	void *buf = seq_buf_alloc(size);
+	if (buf)
+		return -ENOMEM;
+
+	((struct seq_file *)file->private_data)->buf = buf;
+	((struct seq_file *)file->private_data)->size = size;
+
+	return 0;
+}
+
 static int traverse(struct seq_file *m, loff_t offset)
 {
 	loff_t pos = 0;
diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c
index 9aefa7779b29..09ae12c2326c 100644
--- a/fs/sysfs/file.c
+++ b/fs/sysfs/file.c
@@ -16,6 +16,7 @@
 #include <linux/mutex.h>
 #include <linux/seq_file.h>
 #include <linux/mm.h>
+#include <linux/device.h> /* for device's longattr support */
 
 #include "sysfs.h"
 
@@ -202,6 +203,32 @@ void sysfs_notify(struct kobject *kobj, const char *dir, const char *attr)
 }
 EXPORT_SYMBOL_GPL(sysfs_notify);
 
+static int sysfs_kf_longattr_seq_show(struct seq_file *sf, void *v)
+{
+	struct kernfs_open_file *of = sf->private;
+	struct kobject *kobj = of->kn->parent->priv;
+	const struct sysfs_ops *ops = sysfs_file_ops(of->kn);
+	ssize_t count;
+	char *buf;
+
+	count = seq_get_buf(sf, &buf);
+
+	if (ops->show) {
+		count = ops->show(kobj, of->kn->priv, buf);
+		if (count < 0)
+			return count;
+	}
+
+	seq_commit(sf, count);
+	return 0;
+}
+
+static int sysfs_longattr_open(struct kernfs_open_file *of)
+{
+	struct device_long_attribute *lattr = (struct device_long_attribute *)of->kn->priv;
+	return seq_open_prealloc_buf(of->file, lattr->size);
+}
+
 static const struct kernfs_ops sysfs_file_kfops_empty = {
 };
 
@@ -223,6 +250,11 @@ static const struct kernfs_ops sysfs_prealloc_kfops_ro = {
 	.prealloc	= true,
 };
 
+static struct kernfs_ops sysfs_longattr_kfops_ro = {
+	.open		= sysfs_longattr_open,
+	.seq_show 	= sysfs_kf_longattr_seq_show,
+};
+
 static const struct kernfs_ops sysfs_prealloc_kfops_wo = {
 	.write		= sysfs_kf_write,
 	.prealloc	= true,
@@ -276,6 +308,8 @@ int sysfs_add_file_mode_ns(struct kernfs_node *parent,
 		if (sysfs_ops->show && sysfs_ops->store) {
 			if (mode & SYSFS_PREALLOC)
 				ops = &sysfs_prealloc_kfops_rw;
+			else if (mode & SYSFS_LONGATTR)
+				ops = &sysfs_longattr_kfops_ro;
 			else
 				ops = &sysfs_file_kfops_rw;
 		} else if (sysfs_ops->show) {
@@ -291,7 +325,11 @@ int sysfs_add_file_mode_ns(struct kernfs_node *parent,
 		} else
 			ops = &sysfs_file_kfops_empty;
 
-		size = PAGE_SIZE;
+		if (mode & SYSFS_LONGATTR) {
+			size = ((struct device_long_attribute *)attr)->size;
+		} else {
+			size = PAGE_SIZE;
+		}
 	} else {
 		struct bin_attribute *battr = (void *)attr;
 
diff --git a/fs/sysfs/group.c b/fs/sysfs/group.c
index 64e6a6698935..2d80b05550d1 100644
--- a/fs/sysfs/group.c
+++ b/fs/sysfs/group.c
@@ -56,11 +56,11 @@ static int create_files(struct kernfs_node *parent, struct kobject *kobj,
 					continue;
 			}
 
-			WARN(mode & ~(SYSFS_PREALLOC | 0664),
+			WARN(mode & ~(SYSFS_LONGATTR | SYSFS_PREALLOC | 0664),
 			     "Attribute %s: Invalid permissions 0%o\n",
 			     (*attr)->name, mode);
 
-			mode &= SYSFS_PREALLOC | 0664;
+			mode &= SYSFS_LONGATTR | SYSFS_PREALLOC | 0664;
 			error = sysfs_add_file_mode_ns(parent, *attr, false,
 						       mode, uid, gid, NULL);
 			if (unlikely(error))
diff --git a/include/linux/device.h b/include/linux/device.h
index 59940f1744c1..791a3d25f0bb 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -150,6 +150,26 @@ ssize_t device_store_bool(struct device *dev, struct device_attribute *attr,
 	struct device_attribute dev_attr_##_name =		\
 		__ATTR_IGNORE_LOCKDEP(_name, _mode, _show, _store)
 
+/*
+ * for those devices whose attributes are larger than 4KB but still want
+ * to take the advantages of regular sysfs, like show() method is able to
+ * fill the entire buffer by one read operation
+ */
+struct device_long_attribute {
+	struct device_attribute attr;
+	size_t size;
+};
+
+#define __LONG_ATTR_RO(_name, _size) {                                 \
+       .attr.attr = {.name = __stringify(_name),                       \
+		     .mode = SYSFS_LONGATTR | 0444 },                  \
+       .attr.show   = _name##_show,                                    \
+       .size = _size,                                                  \
+}
+
+#define LONG_ATTR_RO(_name, _size) \
+struct device_long_attribute dev_attr_##_name = __LONG_ATTR_RO(_name, _size)
+
 int device_create_file(struct device *device,
 		       const struct device_attribute *entry);
 void device_remove_file(struct device *dev,
diff --git a/include/linux/seq_file.h b/include/linux/seq_file.h
index dd99569595fd..e7357fc91c1c 100644
--- a/include/linux/seq_file.h
+++ b/include/linux/seq_file.h
@@ -106,6 +106,7 @@ void seq_pad(struct seq_file *m, char c);
 
 char *mangle_path(char *s, const char *p, const char *esc);
 int seq_open(struct file *, const struct seq_operations *);
+int seq_open_prealloc_buf(struct file *, unsigned long);
 ssize_t seq_read(struct file *, char __user *, size_t, loff_t *);
 ssize_t seq_read_iter(struct kiocb *iocb, struct iov_iter *iter);
 loff_t seq_lseek(struct file *, loff_t, int);
diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
index a12556a4b93a..9198afd46fb0 100644
--- a/include/linux/sysfs.h
+++ b/include/linux/sysfs.h
@@ -97,6 +97,7 @@ struct attribute_group {
  */
 
 #define SYSFS_PREALLOC 010000
+#define SYSFS_LONGATTR 020000
 
 #define __ATTR(_name, _mode, _show, _store) {				\
 	.attr = {.name = __stringify(_name),				\
-- 
2.25.1

very simple way to use it:
Add some long attribute by:

LONG_ATTR_RO(xxx, 2 * PAGE_SIZE);
LONG_ATTR_RO(yyy, 2 * PAGE_SIZE);
....
Then add xxx and yyy to attribute list as we usually do
for normal device_attribute:
struct attribute *attrs[] = {
	&dev_attr_xxx.attr.attr,
	&dev_attr_yyy.attr.attr,
}

Not quite sure if it is valuable. If it is yes, I will
split the code and send a RFC patchset.

Thanks
Barry


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] CPU, NUMA topology ABIs: clarify the overflow issue of sysfs pagebuf
  2021-07-23 11:20             ` Song Bao Hua (Barry Song)
@ 2021-07-23 11:28               ` gregkh
  2021-07-23 12:48                 ` Song Bao Hua (Barry Song)
  0 siblings, 1 reply; 15+ messages in thread
From: gregkh @ 2021-07-23 11:28 UTC (permalink / raw)
  To: Song Bao Hua (Barry Song)
  Cc: Dave Hansen, tiantao (H),
	corbet, linux-doc, linux-kernel, Rafael J. Wysocki,
	Peter Zijlstra, Valentin Schneider, Dave Hansen,
	Daniel Bristot de Oliveira, Linuxarm

On Fri, Jul 23, 2021 at 11:20:19AM +0000, Song Bao Hua (Barry Song) wrote:
> 
> 
> > -----Original Message-----
> > From: Dave Hansen [mailto:dave.hansen@intel.com]
> > Sent: Friday, April 30, 2021 10:39 AM
> > To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>; tiantao (H)
> > <tiantao6@hisilicon.com>; corbet@lwn.net; gregkh@linuxfoundation.org
> > Cc: linux-doc@vger.kernel.org; linux-kernel@vger.kernel.org; Rafael J.
> > Wysocki <rafael@kernel.org>; Peter Zijlstra <peterz@infradead.org>; Valentin
> > Schneider <valentin.schneider@arm.com>; Dave Hansen
> > <dave.hansen@linux.intel.com>; Daniel Bristot de Oliveira <bristot@redhat.com>
> > Subject: Re: [PATCH 1/2] CPU, NUMA topology ABIs: clarify the overflow issue
> > of sysfs pagebuf
> > 
> > On 4/29/21 3:32 PM, Song Bao Hua (Barry Song) wrote:
> > > $ strace numactl --hardware  2>&1 | grep cpu
> > > openat(AT_FDCWD, "/sys/devices/system/cpu",
> > > O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
> > > openat(AT_FDCWD, "/sys/devices/system/node/node0/cpumap", O_RDONLY) = 3
> > > openat(AT_FDCWD, "/sys/devices/system/node/node1/cpumap", O_RDONLY) = 3
> > > openat(AT_FDCWD, "/sys/devices/system/node/node2/cpumap", O_RDONLY) = 3
> > > openat(AT_FDCWD, "/sys/devices/system/node/node3/cpumap", O_RDONLY) = 3
> > >
> > > If we move to binary, it means we have to change those applications.
> > 
> > I thought Greg was saying to using a sysfs binary attribute using
> > something like like sysfs_create_bin_file().  Those don't have the
> > PAGE_SIZE limitation.  But, there's also nothing to keep us from spewing
> > nice human-readable text via the "binary" file.
> > 
> > We don't need to change the file format, just the internal kernel API
> > that we produce the files with.
> 
> Sorry for waking-up the old thread.
> 
> I am not sure how common a regular device_attribute will be larger than
> 4KB and have to work around by bin_attribute. But I wrote a prototype
> which can initially support large regular sysfs entry and be able to
> fill the entire buffer by only one show() entry. The other words to say,
> we don't need to call read() of bin_attribute multiple times for a large
> regular file. Right now, only read-only attribute is supported.
> 
> Subject: [RFC PATCH] sysfs: support regular device attr which can be larger than
>  PAGE_SIZE
> 
> A regular sysfs ABI could be more than 4KB, right now, we are using
> bin_attribute to workaround and break this limit. A better solution
> would be supporting long device attribute. In this case, we will
> still be able to enjoy the advantages of buffering and seeking of
> seq file and only need to fill the entire buffer of sysfs entry
> once.

No, please no.  I WANT people to run into this problem and realize that
it went totally wrong because they should not be having more than one
"value" in a sysfs file like this.

Let's not make it easy on people please, moving to a bin attribute is a
big deal, let's keep it that way.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: [PATCH 1/2] CPU, NUMA topology ABIs: clarify the overflow issue of sysfs pagebuf
  2021-07-23 11:28               ` gregkh
@ 2021-07-23 12:48                 ` Song Bao Hua (Barry Song)
  0 siblings, 0 replies; 15+ messages in thread
From: Song Bao Hua (Barry Song) @ 2021-07-23 12:48 UTC (permalink / raw)
  To: gregkh
  Cc: Dave Hansen, tiantao (H),
	corbet, linux-doc, linux-kernel, Rafael J. Wysocki,
	Peter Zijlstra, Valentin Schneider, Dave Hansen,
	Daniel Bristot de Oliveira, Linuxarm



> -----Original Message-----
> From: gregkh@linuxfoundation.org [mailto:gregkh@linuxfoundation.org]
> Sent: Friday, July 23, 2021 11:29 PM
> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>
> Cc: Dave Hansen <dave.hansen@intel.com>; tiantao (H) <tiantao6@hisilicon.com>;
> corbet@lwn.net; linux-doc@vger.kernel.org; linux-kernel@vger.kernel.org;
> Rafael J. Wysocki <rafael@kernel.org>; Peter Zijlstra <peterz@infradead.org>;
> Valentin Schneider <valentin.schneider@arm.com>; Dave Hansen
> <dave.hansen@linux.intel.com>; Daniel Bristot de Oliveira
> <bristot@redhat.com>; Linuxarm <linuxarm@huawei.com>
> Subject: Re: [PATCH 1/2] CPU, NUMA topology ABIs: clarify the overflow issue
> of sysfs pagebuf
> 
> On Fri, Jul 23, 2021 at 11:20:19AM +0000, Song Bao Hua (Barry Song) wrote:
> >
> >
> > > -----Original Message-----
> > > From: Dave Hansen [mailto:dave.hansen@intel.com]
> > > Sent: Friday, April 30, 2021 10:39 AM
> > > To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>; tiantao (H)
> > > <tiantao6@hisilicon.com>; corbet@lwn.net; gregkh@linuxfoundation.org
> > > Cc: linux-doc@vger.kernel.org; linux-kernel@vger.kernel.org; Rafael J.
> > > Wysocki <rafael@kernel.org>; Peter Zijlstra <peterz@infradead.org>;
> Valentin
> > > Schneider <valentin.schneider@arm.com>; Dave Hansen
> > > <dave.hansen@linux.intel.com>; Daniel Bristot de Oliveira
> <bristot@redhat.com>
> > > Subject: Re: [PATCH 1/2] CPU, NUMA topology ABIs: clarify the overflow issue
> > > of sysfs pagebuf
> > >
> > > On 4/29/21 3:32 PM, Song Bao Hua (Barry Song) wrote:
> > > > $ strace numactl --hardware  2>&1 | grep cpu
> > > > openat(AT_FDCWD, "/sys/devices/system/cpu",
> > > > O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
> > > > openat(AT_FDCWD, "/sys/devices/system/node/node0/cpumap", O_RDONLY) =
> 3
> > > > openat(AT_FDCWD, "/sys/devices/system/node/node1/cpumap", O_RDONLY) =
> 3
> > > > openat(AT_FDCWD, "/sys/devices/system/node/node2/cpumap", O_RDONLY) =
> 3
> > > > openat(AT_FDCWD, "/sys/devices/system/node/node3/cpumap", O_RDONLY) =
> 3
> > > >
> > > > If we move to binary, it means we have to change those applications.
> > >
> > > I thought Greg was saying to using a sysfs binary attribute using
> > > something like like sysfs_create_bin_file().  Those don't have the
> > > PAGE_SIZE limitation.  But, there's also nothing to keep us from spewing
> > > nice human-readable text via the "binary" file.
> > >
> > > We don't need to change the file format, just the internal kernel API
> > > that we produce the files with.
> >
> > Sorry for waking-up the old thread.
> >
> > I am not sure how common a regular device_attribute will be larger than
> > 4KB and have to work around by bin_attribute. But I wrote a prototype
> > which can initially support large regular sysfs entry and be able to
> > fill the entire buffer by only one show() entry. The other words to say,
> > we don't need to call read() of bin_attribute multiple times for a large
> > regular file. Right now, only read-only attribute is supported.
> >
> > Subject: [RFC PATCH] sysfs: support regular device attr which can be larger
> than
> >  PAGE_SIZE
> >
> > A regular sysfs ABI could be more than 4KB, right now, we are using
> > bin_attribute to workaround and break this limit. A better solution
> > would be supporting long device attribute. In this case, we will
> > still be able to enjoy the advantages of buffering and seeking of
> > seq file and only need to fill the entire buffer of sysfs entry
> > once.
> 
> No, please no.  I WANT people to run into this problem and realize that
> it went totally wrong because they should not be having more than one
> "value" in a sysfs file like this.
> 
> Let's not make it easy on people please, moving to a bin attribute is a
> big deal, let's keep it that way.

Ok. Got it. Thanks for clarification. That was the experiment I made
recently when I almost got totally stuck.

> 
> thanks,
> 
> greg k-h

Thanks
Barry

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2021-07-23 12:49 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-29  7:03 [PATCH 0/2] clarify and cleanup CPU and NUMA topology ABIs Tian Tao
2021-04-29  7:03 ` [PATCH 1/2] CPU, NUMA topology ABIs: clarify the overflow issue of sysfs pagebuf Tian Tao
2021-04-29 14:21   ` Dave Hansen
2021-04-29 15:46     ` Greg KH
2021-04-29 21:08     ` Song Bao Hua (Barry Song)
2021-04-29 21:38       ` Dave Hansen
2021-04-29 22:32         ` Song Bao Hua (Barry Song)
2021-04-29 22:38           ` Dave Hansen
2021-04-29 22:48             ` Song Bao Hua (Barry Song)
2021-04-30  6:05             ` gregkh
2021-07-23 11:20             ` Song Bao Hua (Barry Song)
2021-07-23 11:28               ` gregkh
2021-07-23 12:48                 ` Song Bao Hua (Barry Song)
2021-05-06 23:16   ` kernel test robot
2021-04-29  7:03 ` [PATCH 2/2] Documentation/ABI: Move the topology-related sysfs interface to the right place Tian Tao

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).