linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v1] drivers/base/memory.c: Don't store end_section_nr in memory blocks
@ 2019-07-31 12:22 David Hildenbrand
  2019-07-31 12:43 ` Michal Hocko
  2019-07-31 20:57 ` Andrew Morton
  0 siblings, 2 replies; 18+ messages in thread
From: David Hildenbrand @ 2019-07-31 12:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Greg Kroah-Hartman,
	Rafael J. Wysocki, Andrew Morton, Pavel Tatashin, Michal Hocko,
	Dan Williams, Oscar Salvador

Each memory block spans the same amount of sections/pages/bytes. The size
is determined before the first memory block is created. No need to store
what we can easily calculate - and the calculations even look simpler now.

While at it, fix the variable naming in register_mem_sect_under_node() -
we no longer talk about a single section.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/base/memory.c  | 1 -
 drivers/base/node.c    | 9 ++++-----
 include/linux/memory.h | 3 ++-
 mm/memory_hotplug.c    | 2 +-
 4 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 154d5d4a0779..cb80f2bdd7de 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -670,7 +670,6 @@ static int init_memory_block(struct memory_block **memory,
 		return -ENOMEM;
 
 	mem->start_section_nr = block_id * sections_per_block;
-	mem->end_section_nr = mem->start_section_nr + sections_per_block - 1;
 	mem->state = state;
 	start_pfn = section_nr_to_pfn(mem->start_section_nr);
 	mem->phys_device = arch_get_memory_phys_device(start_pfn);
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 840c95baa1d8..e9a504e7c8c2 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -756,13 +756,12 @@ static int __ref get_nid_for_pfn(unsigned long pfn)
 static int register_mem_sect_under_node(struct memory_block *mem_blk,
 					 void *arg)
 {
+	unsigned long start_pfn = section_nr_to_pfn(mem_blk->start_section_nr);
+	unsigned long end_pfn = start_pfn + PAGES_PER_MEMORY_BLOCK - 1;
 	int ret, nid = *(int *)arg;
-	unsigned long pfn, sect_start_pfn, sect_end_pfn;
+	unsigned long pfn;
 
-	sect_start_pfn = section_nr_to_pfn(mem_blk->start_section_nr);
-	sect_end_pfn = section_nr_to_pfn(mem_blk->end_section_nr);
-	sect_end_pfn += PAGES_PER_SECTION - 1;
-	for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
+	for (pfn = start_pfn; pfn <= end_pfn; pfn++) {
 		int page_nid;
 
 		/*
diff --git a/include/linux/memory.h b/include/linux/memory.h
index 02e633f3ede0..16d2c0979976 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -25,7 +25,6 @@
 
 struct memory_block {
 	unsigned long start_section_nr;
-	unsigned long end_section_nr;
 	unsigned long state;		/* serialized by the dev->lock */
 	int section_count;		/* serialized by mem_sysfs_mutex */
 	int online_type;		/* for passing data to online routine */
@@ -40,6 +39,8 @@ int arch_get_memory_phys_device(unsigned long start_pfn);
 unsigned long memory_block_size_bytes(void);
 int set_memory_block_size_order(unsigned int order);
 
+#define PAGES_PER_MEMORY_BLOCK (memory_block_size_bytes() / PAGE_SIZE)
+
 /* These states are exposed to userspace as text strings in sysfs */
 #define	MEM_ONLINE		(1<<0) /* exposed to userspace */
 #define	MEM_GOING_OFFLINE	(1<<1) /* exposed to userspace */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 9a82e12bd0e7..db33a0ffcb1f 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1650,7 +1650,7 @@ static int check_memblock_offlined_cb(struct memory_block *mem, void *arg)
 		phys_addr_t beginpa, endpa;
 
 		beginpa = PFN_PHYS(section_nr_to_pfn(mem->start_section_nr));
-		endpa = PFN_PHYS(section_nr_to_pfn(mem->end_section_nr + 1))-1;
+		endpa = beginpa + memory_block_size_bytes() - 1;
 		pr_warn("removing memory fails, because memory [%pa-%pa] is onlined\n",
 			&beginpa, &endpa);
 
-- 
2.21.0


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1] drivers/base/memory.c: Don't store end_section_nr in memory blocks
  2019-07-31 12:22 [PATCH v1] drivers/base/memory.c: Don't store end_section_nr in memory blocks David Hildenbrand
@ 2019-07-31 12:43 ` Michal Hocko
  2019-07-31 13:12   ` David Hildenbrand
  2019-07-31 20:57 ` Andrew Morton
  1 sibling, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2019-07-31 12:43 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Greg Kroah-Hartman, Rafael J. Wysocki,
	Andrew Morton, Pavel Tatashin, Dan Williams, Oscar Salvador

On Wed 31-07-19 14:22:13, David Hildenbrand wrote:
> Each memory block spans the same amount of sections/pages/bytes. The size
> is determined before the first memory block is created. No need to store
> what we can easily calculate - and the calculations even look simpler now.

While this cleanup helps a bit, I am not sure this is really worth
bothering. I guess we can agree when I say that the memblock interface
is suboptimal (to put it mildly).  Shouldn't we strive for making it
a real hotplug API in the future? What do I mean by that? Why should
be any memblock fixed in size? Shouldn't we have use hotplugable units
instead (aka pfn range that userspace can work with sensibly)? Do we
know of any existing userspace that would depend on the current single
section res. 2GB sized memblocks?

All that being said, I do not oppose to the patch but can we start
thinking about the underlying memblock limitations rather than micro
cleanups?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1] drivers/base/memory.c: Don't store end_section_nr in memory blocks
  2019-07-31 12:43 ` Michal Hocko
@ 2019-07-31 13:12   ` David Hildenbrand
  2019-07-31 13:25     ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: David Hildenbrand @ 2019-07-31 13:12 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, Greg Kroah-Hartman, Rafael J. Wysocki,
	Andrew Morton, Pavel Tatashin, Dan Williams, Oscar Salvador

On 31.07.19 14:43, Michal Hocko wrote:
> On Wed 31-07-19 14:22:13, David Hildenbrand wrote:
>> Each memory block spans the same amount of sections/pages/bytes. The size
>> is determined before the first memory block is created. No need to store
>> what we can easily calculate - and the calculations even look simpler now.
> 
> While this cleanup helps a bit, I am not sure this is really worth
> bothering. I guess we can agree when I say that the memblock interface
> is suboptimal (to put it mildly).  Shouldn't we strive for making it
> a real hotplug API in the future? What do I mean by that? Why should
> be any memblock fixed in size? Shouldn't we have use hotplugable units
> instead (aka pfn range that userspace can work with sensibly)? Do we
> know of any existing userspace that would depend on the current single
> section res. 2GB sized memblocks?

Short story: It is already ABI (e.g.,
/sys/devices/system/memory/block_size_bytes) - around since 2005 (!) -
since we had memory block devices.

I suspect that it is mainly manually used. But I might be wrong.


Long story:

How would you want to number memory blocks? At least no longer by phys
index. For now, memory blocks are ordered and numbered by their block id.

Admins might want to online parts of a DIMM MOVABLE/NORMAL, to more
reliably use huge pages but still have enough space for kernel memory
(e.g., page tables). They might like that a DIMM is actually a set of
memory blocks instead of one big chunk.

IOW: You can consider it a restriction to add e.g., DIMMs only in one
bigger chunks.

> 
> All that being said, I do not oppose to the patch but can we start
> thinking about the underlying memblock limitations rather than micro
> cleanups?

I am pro cleaning up what we have right now, not expect it to eventually
change some-when in the future. (btw, I highly doubt it will change)

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1] drivers/base/memory.c: Don't store end_section_nr in memory blocks
  2019-07-31 13:12   ` David Hildenbrand
@ 2019-07-31 13:25     ` Michal Hocko
  2019-07-31 13:42       ` David Hildenbrand
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2019-07-31 13:25 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Greg Kroah-Hartman, Rafael J. Wysocki,
	Andrew Morton, Pavel Tatashin, Dan Williams, Oscar Salvador

On Wed 31-07-19 15:12:12, David Hildenbrand wrote:
> On 31.07.19 14:43, Michal Hocko wrote:
> > On Wed 31-07-19 14:22:13, David Hildenbrand wrote:
> >> Each memory block spans the same amount of sections/pages/bytes. The size
> >> is determined before the first memory block is created. No need to store
> >> what we can easily calculate - and the calculations even look simpler now.
> > 
> > While this cleanup helps a bit, I am not sure this is really worth
> > bothering. I guess we can agree when I say that the memblock interface
> > is suboptimal (to put it mildly).  Shouldn't we strive for making it
> > a real hotplug API in the future? What do I mean by that? Why should
> > be any memblock fixed in size? Shouldn't we have use hotplugable units
> > instead (aka pfn range that userspace can work with sensibly)? Do we
> > know of any existing userspace that would depend on the current single
> > section res. 2GB sized memblocks?
> 
> Short story: It is already ABI (e.g.,
> /sys/devices/system/memory/block_size_bytes) - around since 2005 (!) -
> since we had memory block devices.
> 
> I suspect that it is mainly manually used. But I might be wrong.

Any pointer to the real userspace depending on it? Most usecases I am
aware of rely on udev events and either onlining or offlining the memory
in the handler.

I know we have documented this as an ABI and it is really _sad_ that
this ABI didn't get through normal scrutiny any user visible interface
should go through but these are sins of the past...

> Long story:
> 
> How would you want to number memory blocks? At least no longer by phys
> index. For now, memory blocks are ordered and numbered by their block id.

memory_${mem_section_nr_of_start_pfn}

> Admins might want to online parts of a DIMM MOVABLE/NORMAL, to more
> reliably use huge pages but still have enough space for kernel memory
> (e.g., page tables). They might like that a DIMM is actually a set of
> memory blocks instead of one big chunk.

They might. Do they though? There are many theoretical usecases but
let's face it, there is a cost given to the current state. E.g. the
number of memblock directories is already quite large on machines with a
lot of memory even though they use large blocks. That has negative
implications already (e.g. the number of events you get, any iteration
on the /sys etc.). Also 2G memblocks are quite arbitrary and they
already limit the above usecase some, right?

> IOW: You can consider it a restriction to add e.g., DIMMs only in one
> bigger chunks.
> 
> > 
> > All that being said, I do not oppose to the patch but can we start
> > thinking about the underlying memblock limitations rather than micro
> > cleanups?
> 
> I am pro cleaning up what we have right now, not expect it to eventually
> change some-when in the future. (btw, I highly doubt it will change)

I do agree, but having the memblock fixed size doesn't really go along
with variable memblock size if we ever go there. But as I've said I am
not really against the patch.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1] drivers/base/memory.c: Don't store end_section_nr in memory blocks
  2019-07-31 13:25     ` Michal Hocko
@ 2019-07-31 13:42       ` David Hildenbrand
  2019-07-31 14:04         ` David Hildenbrand
  2019-07-31 14:14         ` Michal Hocko
  0 siblings, 2 replies; 18+ messages in thread
From: David Hildenbrand @ 2019-07-31 13:42 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, Greg Kroah-Hartman, Rafael J. Wysocki,
	Andrew Morton, Pavel Tatashin, Dan Williams, Oscar Salvador

On 31.07.19 15:25, Michal Hocko wrote:
> On Wed 31-07-19 15:12:12, David Hildenbrand wrote:
>> On 31.07.19 14:43, Michal Hocko wrote:
>>> On Wed 31-07-19 14:22:13, David Hildenbrand wrote:
>>>> Each memory block spans the same amount of sections/pages/bytes. The size
>>>> is determined before the first memory block is created. No need to store
>>>> what we can easily calculate - and the calculations even look simpler now.
>>>
>>> While this cleanup helps a bit, I am not sure this is really worth
>>> bothering. I guess we can agree when I say that the memblock interface
>>> is suboptimal (to put it mildly).  Shouldn't we strive for making it
>>> a real hotplug API in the future? What do I mean by that? Why should
>>> be any memblock fixed in size? Shouldn't we have use hotplugable units
>>> instead (aka pfn range that userspace can work with sensibly)? Do we
>>> know of any existing userspace that would depend on the current single
>>> section res. 2GB sized memblocks?
>>
>> Short story: It is already ABI (e.g.,
>> /sys/devices/system/memory/block_size_bytes) - around since 2005 (!) -
>> since we had memory block devices.
>>
>> I suspect that it is mainly manually used. But I might be wrong.
> 
> Any pointer to the real userspace depending on it? Most usecases I am
> aware of rely on udev events and either onlining or offlining the memory
> in the handler.

Yes, that's also what I know - onlining and triggering kexec().

On s390x, admins online sub-increments to selectively add memory to a VM
- but we could still emulate that by adding memory for that use case in
the kernel in the current granularity. See

https://books.google.de/books?id=afq4CgAAQBAJ&pg=PA117&lpg=PA117&dq=/sys/devices/system/memory/block_size_bytes&source=bl&ots=iYk_vW5O4G&sig=ACfU3U0s-O-SOVaQO-7HpKO5Hj866w9Pxw&hl=de&sa=X&ved=2ahUKEwjOjPqIot_jAhVPfZoKHcxpAqcQ6AEwB3oECAgQAQ#v=onepage&q=%2Fsys%2Fdevices%2Fsystem%2Fmemory%2Fblock_size_bytes&f=false

> 
> I know we have documented this as an ABI and it is really _sad_ that
> this ABI didn't get through normal scrutiny any user visible interface
> should go through but these are sins of the past...

A quick google search indicates that

Kata containers queries the block size:
https://github.com/kata-containers/runtime/issues/796

Powerpc userspace queries it:
https://groups.google.com/forum/#!msg/powerpc-utils-devel/dKjZCqpTxus/AwkstV2ABwAJ

I can imagine that ppc dynamic memory onlines only pieces of added
memory - DIMMs AFAIK (haven't looked at the details).

There might be more users.

> 
>> Long story:
>>
>> How would you want to number memory blocks? At least no longer by phys
>> index. For now, memory blocks are ordered and numbered by their block id.
> 
> memory_${mem_section_nr_of_start_pfn}
> 

Fair enough, although this could break some scripts where people
manually offline/online specific blocks. (but who knows what
people/scripts do :( )

>> Admins might want to online parts of a DIMM MOVABLE/NORMAL, to more
>> reliably use huge pages but still have enough space for kernel memory
>> (e.g., page tables). They might like that a DIMM is actually a set of
>> memory blocks instead of one big chunk.
> 
> They might. Do they though? There are many theoretical usecases but
> let's face it, there is a cost given to the current state. E.g. the
> number of memblock directories is already quite large on machines with a
> lot of memory even though they use large blocks. That has negative
> implications already (e.g. the number of events you get, any iteration
> on the /sys etc.). Also 2G memblocks are quite arbitrary and they
> already limit the above usecase some, right?

I mean there are other theoretical issues: Onlining a very big DIMM in
one shot might trigger OOM, while slowly adding/onlining would currently
works. Who knows if that is relevant in practice.

Also, it would break the current use case of memtrace, which removes
memory in a granularity that wasn't added. But luckily, memtrace is an
exception :)

> 
>> IOW: You can consider it a restriction to add e.g., DIMMs only in one
>> bigger chunks.
>>
>>>
>>> All that being said, I do not oppose to the patch but can we start
>>> thinking about the underlying memblock limitations rather than micro
>>> cleanups?
>>
>> I am pro cleaning up what we have right now, not expect it to eventually
>> change some-when in the future. (btw, I highly doubt it will change)
> 
> I do agree, but having the memblock fixed size doesn't really go along
> with variable memblock size if we ever go there. But as I've said I am
> not really against the patch.

Fair enough, for now I am not convinced that we will actually see
variable memory blocks in the near future.

Thanks for the discussion (I was thinking about the same concept a while
back when trying to find out if there could be an easy way to identify
which memory blocks belong to a single DIMM you want to eventually
unplug and therefore online it all to the MOVABLE zone).

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1] drivers/base/memory.c: Don't store end_section_nr in memory blocks
  2019-07-31 13:42       ` David Hildenbrand
@ 2019-07-31 14:04         ` David Hildenbrand
  2019-07-31 14:15           ` Michal Hocko
  2019-07-31 14:14         ` Michal Hocko
  1 sibling, 1 reply; 18+ messages in thread
From: David Hildenbrand @ 2019-07-31 14:04 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, Greg Kroah-Hartman, Rafael J. Wysocki,
	Andrew Morton, Pavel Tatashin, Dan Williams, Oscar Salvador

On 31.07.19 15:42, David Hildenbrand wrote:
> On 31.07.19 15:25, Michal Hocko wrote:
>> On Wed 31-07-19 15:12:12, David Hildenbrand wrote:
>>> On 31.07.19 14:43, Michal Hocko wrote:
>>>> On Wed 31-07-19 14:22:13, David Hildenbrand wrote:
>>>>> Each memory block spans the same amount of sections/pages/bytes. The size
>>>>> is determined before the first memory block is created. No need to store
>>>>> what we can easily calculate - and the calculations even look simpler now.
>>>>
>>>> While this cleanup helps a bit, I am not sure this is really worth
>>>> bothering. I guess we can agree when I say that the memblock interface
>>>> is suboptimal (to put it mildly).  Shouldn't we strive for making it
>>>> a real hotplug API in the future? What do I mean by that? Why should
>>>> be any memblock fixed in size? Shouldn't we have use hotplugable units
>>>> instead (aka pfn range that userspace can work with sensibly)? Do we
>>>> know of any existing userspace that would depend on the current single
>>>> section res. 2GB sized memblocks?
>>>
>>> Short story: It is already ABI (e.g.,
>>> /sys/devices/system/memory/block_size_bytes) - around since 2005 (!) -
>>> since we had memory block devices.
>>>
>>> I suspect that it is mainly manually used. But I might be wrong.
>>
>> Any pointer to the real userspace depending on it? Most usecases I am
>> aware of rely on udev events and either onlining or offlining the memory
>> in the handler.
> 
> Yes, that's also what I know - onlining and triggering kexec().
> 
> On s390x, admins online sub-increments to selectively add memory to a VM
> - but we could still emulate that by adding memory for that use case in
> the kernel in the current granularity. See
> 
> https://books.google.de/books?id=afq4CgAAQBAJ&pg=PA117&lpg=PA117&dq=/sys/devices/system/memory/block_size_bytes&source=bl&ots=iYk_vW5O4G&sig=ACfU3U0s-O-SOVaQO-7HpKO5Hj866w9Pxw&hl=de&sa=X&ved=2ahUKEwjOjPqIot_jAhVPfZoKHcxpAqcQ6AEwB3oECAgQAQ#v=onepage&q=%2Fsys%2Fdevices%2Fsystem%2Fmemory%2Fblock_size_bytes&f=false
> 
>>
>> I know we have documented this as an ABI and it is really _sad_ that
>> this ABI didn't get through normal scrutiny any user visible interface
>> should go through but these are sins of the past...
> 
> A quick google search indicates that
> 
> Kata containers queries the block size:
> https://github.com/kata-containers/runtime/issues/796
> 
> Powerpc userspace queries it:
> https://groups.google.com/forum/#!msg/powerpc-utils-devel/dKjZCqpTxus/AwkstV2ABwAJ

FWIW, powerpc-utils also uses the "removable" property - which means
we're also stuck with that unfortunately. :(

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1] drivers/base/memory.c: Don't store end_section_nr in memory blocks
  2019-07-31 13:42       ` David Hildenbrand
  2019-07-31 14:04         ` David Hildenbrand
@ 2019-07-31 14:14         ` Michal Hocko
  2019-07-31 14:21           ` David Hildenbrand
  1 sibling, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2019-07-31 14:14 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Greg Kroah-Hartman, Rafael J. Wysocki,
	Andrew Morton, Pavel Tatashin, Dan Williams, Oscar Salvador

On Wed 31-07-19 15:42:53, David Hildenbrand wrote:
> On 31.07.19 15:25, Michal Hocko wrote:
[...]
> > I know we have documented this as an ABI and it is really _sad_ that
> > this ABI didn't get through normal scrutiny any user visible interface
> > should go through but these are sins of the past...
> 
> A quick google search indicates that
> 
> Kata containers queries the block size:
> https://github.com/kata-containers/runtime/issues/796
> 
> Powerpc userspace queries it:
> https://groups.google.com/forum/#!msg/powerpc-utils-devel/dKjZCqpTxus/AwkstV2ABwAJ
> 
> I can imagine that ppc dynamic memory onlines only pieces of added
> memory - DIMMs AFAIK (haven't looked at the details).
> 
> There might be more users.

Thanks! I suspect most of them are just using the information because
they do not have anything better.

Thinking about it some more, I believe that we can reasonably provide
both APIs controlable by a command line parameter for backwards
compatibility. It is the hotplug code to control sysfs APIs.  E.g.
create one sysfs entry per add_memory_resource for the new semantic.

It is some time since I've checked the ACPI side of the matter but that
code shouldn't really depend on a particular size of the memblock
either when trigerring udev events. I might be wrong here of course.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1] drivers/base/memory.c: Don't store end_section_nr in memory blocks
  2019-07-31 14:04         ` David Hildenbrand
@ 2019-07-31 14:15           ` Michal Hocko
  2019-07-31 14:23             ` David Hildenbrand
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2019-07-31 14:15 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Greg Kroah-Hartman, Rafael J. Wysocki,
	Andrew Morton, Pavel Tatashin, Dan Williams, Oscar Salvador

On Wed 31-07-19 16:04:10, David Hildenbrand wrote:
> On 31.07.19 15:42, David Hildenbrand wrote:
[...]
> > Powerpc userspace queries it:
> > https://groups.google.com/forum/#!msg/powerpc-utils-devel/dKjZCqpTxus/AwkstV2ABwAJ
> 
> FWIW, powerpc-utils also uses the "removable" property - which means
> we're also stuck with that unfortunately. :(

Yeah, I am aware of that and I strongly suspect this is actually in use
because it is terribly unreliable for any non-idle system. There is
simply no way to find out whether something is offlinable than to try
it.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1] drivers/base/memory.c: Don't store end_section_nr in memory blocks
  2019-07-31 14:14         ` Michal Hocko
@ 2019-07-31 14:21           ` David Hildenbrand
  2019-07-31 14:37             ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: David Hildenbrand @ 2019-07-31 14:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, Greg Kroah-Hartman, Rafael J. Wysocki,
	Andrew Morton, Pavel Tatashin, Dan Williams, Oscar Salvador

On 31.07.19 16:14, Michal Hocko wrote:
> On Wed 31-07-19 15:42:53, David Hildenbrand wrote:
>> On 31.07.19 15:25, Michal Hocko wrote:
> [...]
>>> I know we have documented this as an ABI and it is really _sad_ that
>>> this ABI didn't get through normal scrutiny any user visible interface
>>> should go through but these are sins of the past...
>>
>> A quick google search indicates that
>>
>> Kata containers queries the block size:
>> https://github.com/kata-containers/runtime/issues/796
>>
>> Powerpc userspace queries it:
>> https://groups.google.com/forum/#!msg/powerpc-utils-devel/dKjZCqpTxus/AwkstV2ABwAJ
>>
>> I can imagine that ppc dynamic memory onlines only pieces of added
>> memory - DIMMs AFAIK (haven't looked at the details).
>>
>> There might be more users.
> 
> Thanks! I suspect most of them are just using the information because
> they do not have anything better.

powerpc-utils actually seem to use the fine-grained API to dynamically
manage memory assignment to the VM.

> 
> Thinking about it some more, I believe that we can reasonably provide
> both APIs controlable by a command line parameter for backwards
> compatibility. It is the hotplug code to control sysfs APIs.  E.g.
> create one sysfs entry per add_memory_resource for the new semantic.

Yeah, but the real question is: who needs it. I can only think about
some DIMM scenarios (some, not all). I would be interested in more use
cases. Of course, to provide and maintain two APIs we need a good reason.

(one sysfs per add_memory_resource() won't cover all DIMMs completely as
far as I remember - I might be wrong, I remember there could be a
sequence of add_memory(). Also, some DIMMs might actually overlap with
memory indicated during boot - complicated stuff)

> 
> It is some time since I've checked the ACPI side of the matter but that
> code shouldn't really depend on a particular size of the memblock
> either when trigerring udev events. I might be wrong here of course.

It only has to respect the alignment/size restriction when calling
add_memory() right now. That would map to a "minimum block size"

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1] drivers/base/memory.c: Don't store end_section_nr in memory blocks
  2019-07-31 14:15           ` Michal Hocko
@ 2019-07-31 14:23             ` David Hildenbrand
  0 siblings, 0 replies; 18+ messages in thread
From: David Hildenbrand @ 2019-07-31 14:23 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, Greg Kroah-Hartman, Rafael J. Wysocki,
	Andrew Morton, Pavel Tatashin, Dan Williams, Oscar Salvador

On 31.07.19 16:15, Michal Hocko wrote:
> On Wed 31-07-19 16:04:10, David Hildenbrand wrote:
>> On 31.07.19 15:42, David Hildenbrand wrote:
> [...]
>>> Powerpc userspace queries it:
>>> https://groups.google.com/forum/#!msg/powerpc-utils-devel/dKjZCqpTxus/AwkstV2ABwAJ
>>
>> FWIW, powerpc-utils also uses the "removable" property - which means
>> we're also stuck with that unfortunately. :(
> 
> Yeah, I am aware of that and I strongly suspect this is actually in use
> because it is terribly unreliable for any non-idle system. There is
> simply no way to find out whether something is offlinable than to try
> it.

According to the introducing commit "removable" is only used to not even
try some memory blocks (IOW to save cpu cycles) - not treated as a
guaranteed (which is correct).

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1] drivers/base/memory.c: Don't store end_section_nr in memory blocks
  2019-07-31 14:21           ` David Hildenbrand
@ 2019-07-31 14:37             ` Michal Hocko
  2019-07-31 14:43               ` David Hildenbrand
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2019-07-31 14:37 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Greg Kroah-Hartman, Rafael J. Wysocki,
	Andrew Morton, Pavel Tatashin, Dan Williams, Oscar Salvador

On Wed 31-07-19 16:21:46, David Hildenbrand wrote:
[...]
> > Thinking about it some more, I believe that we can reasonably provide
> > both APIs controlable by a command line parameter for backwards
> > compatibility. It is the hotplug code to control sysfs APIs.  E.g.
> > create one sysfs entry per add_memory_resource for the new semantic.
> 
> Yeah, but the real question is: who needs it. I can only think about
> some DIMM scenarios (some, not all). I would be interested in more use
> cases. Of course, to provide and maintain two APIs we need a good reason.

Well, my 3TB machine that has 7 movable nodes could really go with less
than
$ find /sys/devices/system/memory -name "memory*" | wc -l
1729

when it doesn't really make any sense to offline less than a
hotremovable entity which is the whole node effectivelly. I have seen
reports where a similarly large machine chocked on boot just because of
too many udev events...

In other words allowing smaller granularity is a nice toy but real
usecases usually work with the whole hotplugable entity (e.g. the whole
ACPI container).

> (one sysfs per add_memory_resource() won't cover all DIMMs completely as
> far as I remember - I might be wrong, I remember there could be a
> sequence of add_memory(). Also, some DIMMs might actually overlap with
> memory indicated during boot - complicated stuff)

Which is something we have to live with anyway due to nodes interleaving.
So nothing really new.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1] drivers/base/memory.c: Don't store end_section_nr in memory blocks
  2019-07-31 14:37             ` Michal Hocko
@ 2019-07-31 14:43               ` David Hildenbrand
  2019-08-01  6:13                 ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: David Hildenbrand @ 2019-07-31 14:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, Greg Kroah-Hartman, Rafael J. Wysocki,
	Andrew Morton, Pavel Tatashin, Dan Williams, Oscar Salvador

On 31.07.19 16:37, Michal Hocko wrote:
> On Wed 31-07-19 16:21:46, David Hildenbrand wrote:
> [...]
>>> Thinking about it some more, I believe that we can reasonably provide
>>> both APIs controlable by a command line parameter for backwards
>>> compatibility. It is the hotplug code to control sysfs APIs.  E.g.
>>> create one sysfs entry per add_memory_resource for the new semantic.
>>
>> Yeah, but the real question is: who needs it. I can only think about
>> some DIMM scenarios (some, not all). I would be interested in more use
>> cases. Of course, to provide and maintain two APIs we need a good reason.
> 
> Well, my 3TB machine that has 7 movable nodes could really go with less
> than
> $ find /sys/devices/system/memory -name "memory*" | wc -l
> 1729>

The question is if it would be sufficient to increase the memory block
size even further for these kinds of systems (e.g., via a boot parameter
- I think we have that on uv systems) instead of having blocks of
different sizes. Say, 128GB blocks because you're not going to hotplug
128MB DIMMs into such a system - at least that's my guess ;)

> when it doesn't really make any sense to offline less than a
> hotremovable entity which is the whole node effectivelly. I have seen
> reports where a similarly large machine chocked on boot just because of
> too many udev events...>
> In other words allowing smaller granularity is a nice toy but real
> usecases usually work with the whole hotplugable entity (e.g. the whole
> ACPI container).

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1] drivers/base/memory.c: Don't store end_section_nr in memory blocks
  2019-07-31 12:22 [PATCH v1] drivers/base/memory.c: Don't store end_section_nr in memory blocks David Hildenbrand
  2019-07-31 12:43 ` Michal Hocko
@ 2019-07-31 20:57 ` Andrew Morton
  2019-08-01  6:48   ` David Hildenbrand
  1 sibling, 1 reply; 18+ messages in thread
From: Andrew Morton @ 2019-07-31 20:57 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Greg Kroah-Hartman, Rafael J. Wysocki,
	Pavel Tatashin, Michal Hocko, Dan Williams, Oscar Salvador

On Wed, 31 Jul 2019 14:22:13 +0200 David Hildenbrand <david@redhat.com> wrote:

> Each memory block spans the same amount of sections/pages/bytes. The size
> is determined before the first memory block is created. No need to store
> what we can easily calculate - and the calculations even look simpler now.
> 
> While at it, fix the variable naming in register_mem_sect_under_node() -
> we no longer talk about a single section.
> 
> ...
>
> --- a/include/linux/memory.h
> +++ b/include/linux/memory.h
> @@ -40,6 +39,8 @@ int arch_get_memory_phys_device(unsigned long start_pfn);
>  unsigned long memory_block_size_bytes(void);
>  int set_memory_block_size_order(unsigned int order);
>  
> +#define PAGES_PER_MEMORY_BLOCK (memory_block_size_bytes() / PAGE_SIZE)

Please let's not hide function calls inside macros which look like
compile-time constants!  Adding "()" to the macro would be a bit
better.  Making it a regular old inline C function would be better
still.  But I'd suggest just open-coding this at the macro's single
callsite.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1] drivers/base/memory.c: Don't store end_section_nr in memory blocks
  2019-07-31 14:43               ` David Hildenbrand
@ 2019-08-01  6:13                 ` Michal Hocko
  2019-08-01  7:00                   ` David Hildenbrand
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2019-08-01  6:13 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Greg Kroah-Hartman, Rafael J. Wysocki,
	Andrew Morton, Pavel Tatashin, Dan Williams, Oscar Salvador

On Wed 31-07-19 16:43:58, David Hildenbrand wrote:
> On 31.07.19 16:37, Michal Hocko wrote:
> > On Wed 31-07-19 16:21:46, David Hildenbrand wrote:
> > [...]
> >>> Thinking about it some more, I believe that we can reasonably provide
> >>> both APIs controlable by a command line parameter for backwards
> >>> compatibility. It is the hotplug code to control sysfs APIs.  E.g.
> >>> create one sysfs entry per add_memory_resource for the new semantic.
> >>
> >> Yeah, but the real question is: who needs it. I can only think about
> >> some DIMM scenarios (some, not all). I would be interested in more use
> >> cases. Of course, to provide and maintain two APIs we need a good reason.
> > 
> > Well, my 3TB machine that has 7 movable nodes could really go with less
> > than
> > $ find /sys/devices/system/memory -name "memory*" | wc -l
> > 1729>
> 
> The question is if it would be sufficient to increase the memory block
> size even further for these kinds of systems (e.g., via a boot parameter
> - I think we have that on uv systems) instead of having blocks of
> different sizes. Say, 128GB blocks because you're not going to hotplug
> 128MB DIMMs into such a system - at least that's my guess ;)

The system has
[    0.000000] ACPI: SRAT: Node 1 PXM 1 [mem 0x10000000000-0x17fffffffff]
[    0.000000] ACPI: SRAT: Node 2 PXM 2 [mem 0x80000000000-0x87fffffffff]
[    0.000000] ACPI: SRAT: Node 3 PXM 3 [mem 0x90000000000-0x97fffffffff]
[    0.000000] ACPI: SRAT: Node 4 PXM 4 [mem 0x100000000000-0x107fffffffff]
[    0.000000] ACPI: SRAT: Node 5 PXM 5 [mem 0x110000000000-0x117fffffffff]
[    0.000000] ACPI: SRAT: Node 6 PXM 6 [mem 0x180000000000-0x183fffffffff]
[    0.000000] ACPI: SRAT: Node 7 PXM 7 [mem 0x190000000000-0x191fffffffff]

hotplugable memory. I would love to have those 7 memory blocks to work
with. Any smaller grained split is just not helping as the platform will
not be able to hotremove it anyway.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1] drivers/base/memory.c: Don't store end_section_nr in memory blocks
  2019-07-31 20:57 ` Andrew Morton
@ 2019-08-01  6:48   ` David Hildenbrand
  0 siblings, 0 replies; 18+ messages in thread
From: David Hildenbrand @ 2019-08-01  6:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Greg Kroah-Hartman, Rafael J. Wysocki,
	Pavel Tatashin, Michal Hocko, Dan Williams, Oscar Salvador

On 31.07.19 22:57, Andrew Morton wrote:
> On Wed, 31 Jul 2019 14:22:13 +0200 David Hildenbrand <david@redhat.com> wrote:
> 
>> Each memory block spans the same amount of sections/pages/bytes. The size
>> is determined before the first memory block is created. No need to store
>> what we can easily calculate - and the calculations even look simpler now.
>>
>> While at it, fix the variable naming in register_mem_sect_under_node() -
>> we no longer talk about a single section.
>>
>> ...
>>
>> --- a/include/linux/memory.h
>> +++ b/include/linux/memory.h
>> @@ -40,6 +39,8 @@ int arch_get_memory_phys_device(unsigned long start_pfn);
>>  unsigned long memory_block_size_bytes(void);
>>  int set_memory_block_size_order(unsigned int order);
>>  
>> +#define PAGES_PER_MEMORY_BLOCK (memory_block_size_bytes() / PAGE_SIZE)
> 
> Please let's not hide function calls inside macros which look like
> compile-time constants!  Adding "()" to the macro would be a bit
> better.  Making it a regular old inline C function would be better
> still.  But I'd suggest just open-coding this at the macro's single
> callsite.
> 

Sure, makes sense. Thanks!

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1] drivers/base/memory.c: Don't store end_section_nr in memory blocks
  2019-08-01  6:13                 ` Michal Hocko
@ 2019-08-01  7:00                   ` David Hildenbrand
  2019-08-01  8:27                     ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: David Hildenbrand @ 2019-08-01  7:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, Greg Kroah-Hartman, Rafael J. Wysocki,
	Andrew Morton, Pavel Tatashin, Dan Williams, Oscar Salvador

On 01.08.19 08:13, Michal Hocko wrote:
> On Wed 31-07-19 16:43:58, David Hildenbrand wrote:
>> On 31.07.19 16:37, Michal Hocko wrote:
>>> On Wed 31-07-19 16:21:46, David Hildenbrand wrote:
>>> [...]
>>>>> Thinking about it some more, I believe that we can reasonably provide
>>>>> both APIs controlable by a command line parameter for backwards
>>>>> compatibility. It is the hotplug code to control sysfs APIs.  E.g.
>>>>> create one sysfs entry per add_memory_resource for the new semantic.
>>>>
>>>> Yeah, but the real question is: who needs it. I can only think about
>>>> some DIMM scenarios (some, not all). I would be interested in more use
>>>> cases. Of course, to provide and maintain two APIs we need a good reason.
>>>
>>> Well, my 3TB machine that has 7 movable nodes could really go with less
>>> than
>>> $ find /sys/devices/system/memory -name "memory*" | wc -l
>>> 1729>
>>
>> The question is if it would be sufficient to increase the memory block
>> size even further for these kinds of systems (e.g., via a boot parameter
>> - I think we have that on uv systems) instead of having blocks of
>> different sizes. Say, 128GB blocks because you're not going to hotplug
>> 128MB DIMMs into such a system - at least that's my guess ;)
> 
> The system has
> [    0.000000] ACPI: SRAT: Node 1 PXM 1 [mem 0x10000000000-0x17fffffffff]
> [    0.000000] ACPI: SRAT: Node 2 PXM 2 [mem 0x80000000000-0x87fffffffff]
> [    0.000000] ACPI: SRAT: Node 3 PXM 3 [mem 0x90000000000-0x97fffffffff]
> [    0.000000] ACPI: SRAT: Node 4 PXM 4 [mem 0x100000000000-0x107fffffffff]
> [    0.000000] ACPI: SRAT: Node 5 PXM 5 [mem 0x110000000000-0x117fffffffff]
> [    0.000000] ACPI: SRAT: Node 6 PXM 6 [mem 0x180000000000-0x183fffffffff]
> [    0.000000] ACPI: SRAT: Node 7 PXM 7 [mem 0x190000000000-0x191fffffffff]
> 
> hotplugable memory. I would love to have those 7 memory blocks to work
> with. Any smaller grained split is just not helping as the platform will
> not be able to hotremove it anyway.
> 

So the smallest granularity in your system is indeed 128GB (btw, nice
system, I wish I had something like that), the biggest one 512GB.

Using a memory block size of 128GB would imply on a 3TB system 24 memory
blocks - which is tolerable IMHO. Especially, performance-wise there
shouldn't be a real difference to 7 blocks. Hotunplug triggered via ACPI
will take care of offlining the right DIMMs.

Of course, 7 blocks would be nicer, but as discussed, not possible with
the current ABI.

What we could do right now is finally make "cat
/sys/devices/system/memory/memory99/phys_device" indicate on x86-64 to
which DIMM an added memory range belongs (if applicable). For now, it's
only used on s390x. We could store for each memory block the
"phys_index" - a.k.a. section number of the lowest memory block of a
add_memory() range.

This would at least allow user space to identify all memory blocks that
logically belong together (DIMM) without ABI changes.

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1] drivers/base/memory.c: Don't store end_section_nr in memory blocks
  2019-08-01  7:00                   ` David Hildenbrand
@ 2019-08-01  8:27                     ` Michal Hocko
  2019-08-01  8:36                       ` David Hildenbrand
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2019-08-01  8:27 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Greg Kroah-Hartman, Rafael J. Wysocki,
	Andrew Morton, Pavel Tatashin, Dan Williams, Oscar Salvador

On Thu 01-08-19 09:00:45, David Hildenbrand wrote:
> On 01.08.19 08:13, Michal Hocko wrote:
> > On Wed 31-07-19 16:43:58, David Hildenbrand wrote:
> >> On 31.07.19 16:37, Michal Hocko wrote:
> >>> On Wed 31-07-19 16:21:46, David Hildenbrand wrote:
> >>> [...]
> >>>>> Thinking about it some more, I believe that we can reasonably provide
> >>>>> both APIs controlable by a command line parameter for backwards
> >>>>> compatibility. It is the hotplug code to control sysfs APIs.  E.g.
> >>>>> create one sysfs entry per add_memory_resource for the new semantic.
> >>>>
> >>>> Yeah, but the real question is: who needs it. I can only think about
> >>>> some DIMM scenarios (some, not all). I would be interested in more use
> >>>> cases. Of course, to provide and maintain two APIs we need a good reason.
> >>>
> >>> Well, my 3TB machine that has 7 movable nodes could really go with less
> >>> than
> >>> $ find /sys/devices/system/memory -name "memory*" | wc -l
> >>> 1729>
> >>
> >> The question is if it would be sufficient to increase the memory block
> >> size even further for these kinds of systems (e.g., via a boot parameter
> >> - I think we have that on uv systems) instead of having blocks of
> >> different sizes. Say, 128GB blocks because you're not going to hotplug
> >> 128MB DIMMs into such a system - at least that's my guess ;)
> > 
> > The system has
> > [    0.000000] ACPI: SRAT: Node 1 PXM 1 [mem 0x10000000000-0x17fffffffff]
> > [    0.000000] ACPI: SRAT: Node 2 PXM 2 [mem 0x80000000000-0x87fffffffff]
> > [    0.000000] ACPI: SRAT: Node 3 PXM 3 [mem 0x90000000000-0x97fffffffff]
> > [    0.000000] ACPI: SRAT: Node 4 PXM 4 [mem 0x100000000000-0x107fffffffff]
> > [    0.000000] ACPI: SRAT: Node 5 PXM 5 [mem 0x110000000000-0x117fffffffff]
> > [    0.000000] ACPI: SRAT: Node 6 PXM 6 [mem 0x180000000000-0x183fffffffff]
> > [    0.000000] ACPI: SRAT: Node 7 PXM 7 [mem 0x190000000000-0x191fffffffff]
> > 
> > hotplugable memory. I would love to have those 7 memory blocks to work
> > with. Any smaller grained split is just not helping as the platform will
> > not be able to hotremove it anyway.
> > 
> 
> So the smallest granularity in your system is indeed 128GB (btw, nice
> system, I wish I had something like that), the biggest one 512GB.
> 
> Using a memory block size of 128GB would imply on a 3TB system 24 memory
> blocks - which is tolerable IMHO. Especially, performance-wise there
> shouldn't be a real difference to 7 blocks. Hotunplug triggered via ACPI
> will take care of offlining the right DIMMs.

The problem with a fixed size memblock is that you might not know how
much memory you will have until much later after the boot. For example,
it should be quite reasonable to expect that this particular machine
would boot with node 0 only and have additional boards with memory added
during runtime. How big the memblock should be then? And I believe that
the virtualization usecase is similar in that regards. You get memory on
demand.
 
> Of course, 7 blocks would be nicer, but as discussed, not possible with
> the current ABI.

As I've said, if we want to move forward we have to change the API we
have right now. With backward compatible option of course.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v1] drivers/base/memory.c: Don't store end_section_nr in memory blocks
  2019-08-01  8:27                     ` Michal Hocko
@ 2019-08-01  8:36                       ` David Hildenbrand
  0 siblings, 0 replies; 18+ messages in thread
From: David Hildenbrand @ 2019-08-01  8:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, Greg Kroah-Hartman, Rafael J. Wysocki,
	Andrew Morton, Pavel Tatashin, Dan Williams, Oscar Salvador

On 01.08.19 10:27, Michal Hocko wrote:
> On Thu 01-08-19 09:00:45, David Hildenbrand wrote:
>> On 01.08.19 08:13, Michal Hocko wrote:
>>> On Wed 31-07-19 16:43:58, David Hildenbrand wrote:
>>>> On 31.07.19 16:37, Michal Hocko wrote:
>>>>> On Wed 31-07-19 16:21:46, David Hildenbrand wrote:
>>>>> [...]
>>>>>>> Thinking about it some more, I believe that we can reasonably provide
>>>>>>> both APIs controlable by a command line parameter for backwards
>>>>>>> compatibility. It is the hotplug code to control sysfs APIs.  E.g.
>>>>>>> create one sysfs entry per add_memory_resource for the new semantic.
>>>>>>
>>>>>> Yeah, but the real question is: who needs it. I can only think about
>>>>>> some DIMM scenarios (some, not all). I would be interested in more use
>>>>>> cases. Of course, to provide and maintain two APIs we need a good reason.
>>>>>
>>>>> Well, my 3TB machine that has 7 movable nodes could really go with less
>>>>> than
>>>>> $ find /sys/devices/system/memory -name "memory*" | wc -l
>>>>> 1729>
>>>>
>>>> The question is if it would be sufficient to increase the memory block
>>>> size even further for these kinds of systems (e.g., via a boot parameter
>>>> - I think we have that on uv systems) instead of having blocks of
>>>> different sizes. Say, 128GB blocks because you're not going to hotplug
>>>> 128MB DIMMs into such a system - at least that's my guess ;)
>>>
>>> The system has
>>> [    0.000000] ACPI: SRAT: Node 1 PXM 1 [mem 0x10000000000-0x17fffffffff]
>>> [    0.000000] ACPI: SRAT: Node 2 PXM 2 [mem 0x80000000000-0x87fffffffff]
>>> [    0.000000] ACPI: SRAT: Node 3 PXM 3 [mem 0x90000000000-0x97fffffffff]
>>> [    0.000000] ACPI: SRAT: Node 4 PXM 4 [mem 0x100000000000-0x107fffffffff]
>>> [    0.000000] ACPI: SRAT: Node 5 PXM 5 [mem 0x110000000000-0x117fffffffff]
>>> [    0.000000] ACPI: SRAT: Node 6 PXM 6 [mem 0x180000000000-0x183fffffffff]
>>> [    0.000000] ACPI: SRAT: Node 7 PXM 7 [mem 0x190000000000-0x191fffffffff]
>>>
>>> hotplugable memory. I would love to have those 7 memory blocks to work
>>> with. Any smaller grained split is just not helping as the platform will
>>> not be able to hotremove it anyway.
>>>
>>
>> So the smallest granularity in your system is indeed 128GB (btw, nice
>> system, I wish I had something like that), the biggest one 512GB.
>>
>> Using a memory block size of 128GB would imply on a 3TB system 24 memory
>> blocks - which is tolerable IMHO. Especially, performance-wise there
>> shouldn't be a real difference to 7 blocks. Hotunplug triggered via ACPI
>> will take care of offlining the right DIMMs.
> 
> The problem with a fixed size memblock is that you might not know how
> much memory you will have until much later after the boot. For example,
> it should be quite reasonable to expect that this particular machine
> would boot with node 0 only and have additional boards with memory added
> during runtime. How big the memblock should be then? And I believe that
> the virtualization usecase is similar in that regards. You get memory on
> demand.
>  

Well, via a kernel parameter you could make it configurable (just as on
UV systems). Not optimal though, but would work in many scenarios.

I see virtualization environments rather moving away from inflexible
huge DIMMs towards hotplugging smaller granularities (hyper-v balloon,
xen balloon, virtio-mem).

>> Of course, 7 blocks would be nicer, but as discussed, not possible with
>> the current ABI.
> 
> As I've said, if we want to move forward we have to change the API we
> have right now. With backward compatible option of course.
> 

I am not convinced a new API is really worth it yet.

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2019-08-01  8:36 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-07-31 12:22 [PATCH v1] drivers/base/memory.c: Don't store end_section_nr in memory blocks David Hildenbrand
2019-07-31 12:43 ` Michal Hocko
2019-07-31 13:12   ` David Hildenbrand
2019-07-31 13:25     ` Michal Hocko
2019-07-31 13:42       ` David Hildenbrand
2019-07-31 14:04         ` David Hildenbrand
2019-07-31 14:15           ` Michal Hocko
2019-07-31 14:23             ` David Hildenbrand
2019-07-31 14:14         ` Michal Hocko
2019-07-31 14:21           ` David Hildenbrand
2019-07-31 14:37             ` Michal Hocko
2019-07-31 14:43               ` David Hildenbrand
2019-08-01  6:13                 ` Michal Hocko
2019-08-01  7:00                   ` David Hildenbrand
2019-08-01  8:27                     ` Michal Hocko
2019-08-01  8:36                       ` David Hildenbrand
2019-07-31 20:57 ` Andrew Morton
2019-08-01  6:48   ` David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).