* Percpu allocator: CPU hotplug support
@ 2021-04-22  0:44 Alexey Makhalov
  2021-04-22  1:10 ` Roman Gushchin
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Alexey Makhalov @ 2021-04-22  0:44 UTC (permalink / raw)
  To: linux-mm, Dennis Zhou, Tejun Heo, Christoph Lameter; +Cc: Roman Gushchin

The current implementation of the percpu allocator uses the total number of possible CPUs (nr_cpu_ids) to
determine the number of units to allocate per chunk. Every alloc_percpu() request of N bytes therefore
allocates N*nr_cpu_ids bytes, even if the number of present CPUs is much smaller. The percpu allocator grows
by adding chunks while keeping the number of units per chunk constant. It is done this way to
simplify CPU hotplug/remove: the per-cpu area is already preallocated for every possible CPU.
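For illustration, a minimal caller looks roughly like this (struct and function names are made up; the
point is that the request is a few bytes, but the allocator reserves one unit per *possible* CPU, i.e.
about sizeof(struct counter) * nr_cpu_ids bytes plus chunk metadata):

#include <linux/init.h>
#include <linux/percpu.h>

struct counter {
        u64 hits;
};

static struct counter __percpu *counters;

static int __init counter_init(void)
{
        /* Request is sizeof(struct counter) per CPU; space is reserved
         * for all nr_cpu_ids possible CPUs, not just the present ones. */
        counters = alloc_percpu(struct counter);
        if (!counters)
                return -ENOMEM;
        return 0;
}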

Problem: This behavior can lead to inefficient memory usage for big server machines and VMs,
where nr_cpu_ids is huge.

Example from my experiment:
2 vCPU VM with hotplug support (up to 128):
[    0.105989] smpboot: Allowing 128 CPUs, 126 hotplug CPUs
By creating a huge number of active and/or dying memory cgroups, I can generate active percpu
allocations of 100 MB (per single CPU) including fragmentation overhead. But in that case total
percpu memory consumption (reported in /proc/meminfo) is 12.8 GB, i.e. 100 MB multiplied by all 128
possible CPUs. BTW, chunks are filled to ~75% in my experiment, so fragmentation is not a concern.
Out of 12.8 GB:
 - 0.2 GB are actually used by present vCPUs, and
 - 12.6 GB are "wasted"!

I've seen production VMs where Percpu consumes 16-20 GB of memory. Roman reported 100 GB.
There are ways to reduce the "wasted" memory overhead, such as disabling CPU hotplug, reducing the
maximum number of CPUs reported by the hypervisor and/or firmware, or using the possible_cpus= kernel
parameter. But none of them eliminates the fundamental issue of "wasted" memory.
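For example, assuming a VM that is never meant to grow beyond 4 vCPUs, the possible mask can be capped
from the kernel command line:

possible_cpus=4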

Suggestion: support scaling of percpu chunks by the number of units, i.e. allocate/deallocate
units within existing chunks on CPU hotplug/remove events.

Any thoughts? Thanks! --Alexey



* Re: Percpu allocator: CPU hotplug support
  2021-04-22  0:44 Percpu allocator: CPU hotplug support Alexey Makhalov
@ 2021-04-22  1:10 ` Roman Gushchin
  2021-04-22  1:33 ` Dennis Zhou
  2021-04-29 11:39 ` Pratik Sampat
  2 siblings, 0 replies; 7+ messages in thread
From: Roman Gushchin @ 2021-04-22  1:10 UTC (permalink / raw)
  To: Alexey Makhalov; +Cc: linux-mm, Dennis Zhou, Tejun Heo, Christoph Lameter

On Thu, Apr 22, 2021 at 12:44:37AM +0000, Alexey Makhalov wrote:
> Current implementation of percpu allocator uses total possible number of CPUs (nr_cpu_ids) to
> get number of units to allocate per chunk. Every alloc_percpu() request of N bytes will allocate
> N*nr_cpu_ids bytes even if the number of present CPUs is much less. Percpu allocator grows by
> number of chunks keeping number of units per chunk constant. This is done in that way to
> simplify CPU hotplug/remove to have per-cpu area preallocated.
> 
> Problem: This behavior can lead to inefficient memory usage for big server machines and VMs,
> where nr_cpu_ids is huge.
> 
> Example from my experiment:
> 2 vCPU VM with hotplug support (up to 128):

Maybe I'm missing something, but I find the setup very strange.
Who needs a 2 CPU machine which *maybe* can be extended to a 128 CPU machine
on the fly?

> [    0.105989] smpboot: Allowing 128 CPUs, 126 hotplug CPUs
> By creating huge amount of active or/and dying memory cgroups, I can generate active percpu
> allocations of 100 MB (per single CPU) including fragmentation overhead. But in that case total
> percpu memory consumption (reported in /proc/meminfo) will be 12.8 GB. BTW, chunks are
> filled by ~75% in my experiment, so fragmentation is not a concern.
> Out of 12.8 GB:
>  - 0.2 GB are actually used by present vCPUs, and
>  - 12.6 GB are "wasted"!
> 
> I've seen production VMs consuming 16-20 GB of memory by Percpu. Roman reported 100 GB.

My case is completely different and has nothing to do with this problem:
the machine had a huge number of outstanding percpu allocations, caused by
another problem.

> There are solutions to reduce "wasted" memory overhead such as: disabling CPU hotplug; reducing
> number of maximum CPUs reported by hypervisor or/and firmware; using possible_cpus= kernel
> parameter. But it won't eliminate fundamental issue with "wasted" memory.
> 
> Suggestion: To support percpu chunks scaling by number of units there. To allocate/deallocate new
> units for existing chunks on CPU hotplug/remove event.

I guess most users don't have this problem because the number of possible
cpus and the actual number of cpus are usually equal or not that different.
Someone who really depends on such a setup can try implementing it, but I'm
not sure it's trivial/possible to do without adding overhead for the majority
of users.



* Re: Percpu allocator: CPU hotplug support
  2021-04-22  0:44 Percpu allocator: CPU hotplug support Alexey Makhalov
  2021-04-22  1:10 ` Roman Gushchin
@ 2021-04-22  1:33 ` Dennis Zhou
  2021-04-22  7:45   ` Laurent Dufour
  2021-04-29 11:39 ` Pratik Sampat
  2 siblings, 1 reply; 7+ messages in thread
From: Dennis Zhou @ 2021-04-22  1:33 UTC (permalink / raw)
  To: Alexey Makhalov; +Cc: linux-mm, Tejun Heo, Christoph Lameter, Roman Gushchin

Hello,

On Thu, Apr 22, 2021 at 12:44:37AM +0000, Alexey Makhalov wrote:
> Current implementation of percpu allocator uses total possible number of CPUs (nr_cpu_ids) to
> get number of units to allocate per chunk. Every alloc_percpu() request of N bytes will allocate
> N*nr_cpu_ids bytes even if the number of present CPUs is much less. Percpu allocator grows by
> number of chunks keeping number of units per chunk constant. This is done in that way to
> simplify CPU hotplug/remove to have per-cpu area preallocated.
> 
> Problem: This behavior can lead to inefficient memory usage for big server machines and VMs,
> where nr_cpu_ids is huge.
> 
> Example from my experiment:
> 2 vCPU VM with hotplug support (up to 128):
> [    0.105989] smpboot: Allowing 128 CPUs, 126 hotplug CPUs
> By creating huge amount of active or/and dying memory cgroups, I can generate active percpu
> allocations of 100 MB (per single CPU) including fragmentation overhead. But in that case total
> percpu memory consumption (reported in /proc/meminfo) will be 12.8 GB. BTW, chunks are
> filled by ~75% in my experiment, so fragmentation is not a concern.
> Out of 12.8 GB:
>  - 0.2 GB are actually used by present vCPUs, and
>  - 12.6 GB are "wasted"!
> 
> I've seen production VMs consuming 16-20 GB of memory by Percpu. Roman reported 100 GB.
> There are solutions to reduce "wasted" memory overhead such as: disabling CPU hotplug; reducing
> number of maximum CPUs reported by hypervisor or/and firmware; using possible_cpus= kernel
> parameter. But it won't eliminate fundamental issue with "wasted" memory.
> 
> Suggestion: To support percpu chunks scaling by number of units there. To allocate/deallocate new
> units for existing chunks on CPU hotplug/remove event.
> 

Idk. In theory it sounds doable. In practice I'm not so sure. The two
problems off the top of my head:
1) What happens if we can't allocate new pages when a cpu is onlined?
2) It's possible users set particular conditions in percpu variables
that are not tied to just statistics summing (such as the cpu
runqueues). Users would have to provide online init and exit functions
which could get weird.
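A rough sketch of what such per-allocation callbacks could look like (all names below are hypothetical;
nothing like this exists in mm/percpu.c today):

struct percpu_ops {
        void (*cpu_init)(void *unit, unsigned int cpu);  /* called when a cpu comes online */
        void (*cpu_exit)(void *unit, unsigned int cpu);  /* called before a cpu goes offline */
};

void __percpu *alloc_percpu_cb(size_t size, size_t align,
                               const struct percpu_ops *ops);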

As Roman mentioned, I think it would be much better to not have the
large discrepancy between the cpu_online_mask and the cpu_possible_mask.

Thanks,
Dennis



* Re: Percpu allocator: CPU hotplug support
  2021-04-22  1:33 ` Dennis Zhou
@ 2021-04-22  7:45   ` Laurent Dufour
  2021-04-22  8:22     ` Alexey Makhalov
  0 siblings, 1 reply; 7+ messages in thread
From: Laurent Dufour @ 2021-04-22  7:45 UTC (permalink / raw)
  To: Dennis Zhou, Alexey Makhalov
  Cc: linux-mm, Tejun Heo, Christoph Lameter, Roman Gushchin,
	Aneesh Kumar K.V, Srikar Dronamraju

On 22/04/2021 at 03:33, Dennis Zhou wrote:
> Hello,
> 
> On Thu, Apr 22, 2021 at 12:44:37AM +0000, Alexey Makhalov wrote:
>> Current implementation of percpu allocator uses total possible number of CPUs (nr_cpu_ids) to
>> get number of units to allocate per chunk. Every alloc_percpu() request of N bytes will allocate
>> N*nr_cpu_ids bytes even if the number of present CPUs is much less. Percpu allocator grows by
>> number of chunks keeping number of units per chunk constant. This is done in that way to
>> simplify CPU hotplug/remove to have per-cpu area preallocated.
>>
>> Problem: This behavior can lead to inefficient memory usage for big server machines and VMs,
>> where nr_cpu_ids is huge.
>>
>> Example from my experiment:
>> 2 vCPU VM with hotplug support (up to 128):
>> [    0.105989] smpboot: Allowing 128 CPUs, 126 hotplug CPUs
>> By creating huge amount of active or/and dying memory cgroups, I can generate active percpu
>> allocations of 100 MB (per single CPU) including fragmentation overhead. But in that case total
>> percpu memory consumption (reported in /proc/meminfo) will be 12.8 GB. BTW, chunks are
>> filled by ~75% in my experiment, so fragmentation is not a concern.
>> Out of 12.8 GB:
>>   - 0.2 GB are actually used by present vCPUs, and
>>   - 12.6 GB are "wasted"!
>>
>> I've seen production VMs consuming 16-20 GB of memory by Percpu. Roman reported 100 GB.
>> There are solutions to reduce "wasted" memory overhead such as: disabling CPU hotplug; reducing
>> number of maximum CPUs reported by hypervisor or/and firmware; using possible_cpus= kernel
>> parameter. But it won't eliminate fundamental issue with "wasted" memory.
>>
>> Suggestion: To support percpu chunks scaling by number of units there. To allocate/deallocate new
>> units for existing chunks on CPU hotplug/remove event.
>>
> 
> Idk. In theory it sounds doable. In practice I'm not so sure. The two
> problems off the top of my head:
> 1) What happens if we can't allocate new pages when a cpu is onlined?
> 2) It's possible users set particular conditions in percpu variables
> that are not tied to just statistics summing (such as the cpu
> runqueues). Users would have to provide online init and exit functions
> which could get weird.
> 
> As Roman mentioned, I think it would be much better to not have the
> large discrepancy between the cpu_online_mask and the cpu_possible_mask.

Indeed, it is quite common on PowerPC to set up a VM with a high number of possible 
CPUs but with a reasonable number of online CPUs. This allows the user to scale 
up the VM when needed.

For instance, we may see up to 1024 possible CPUs while the online number is 
*only* 128.

Cheers,
Laurent.




* Re: Percpu allocator: CPU hotplug support
  2021-04-22  7:45   ` Laurent Dufour
@ 2021-04-22  8:22     ` Alexey Makhalov
  2021-04-22 17:52       ` Vlastimil Babka
  0 siblings, 1 reply; 7+ messages in thread
From: Alexey Makhalov @ 2021-04-22  8:22 UTC (permalink / raw)
  To: linux-mm, Laurent Dufour
  Cc: Dennis Zhou, Tejun Heo, Christoph Lameter, Roman Gushchin,
	Aneesh Kumar K.V, Srikar Dronamraju

Hello,

> On Apr 22, 2021, at 12:45 AM, Laurent Dufour <ldufour@linux.ibm.com> wrote:
> 
> On 22/04/2021 at 03:33, Dennis Zhou wrote:
>> Hello,
>> On Thu, Apr 22, 2021 at 12:44:37AM +0000, Alexey Makhalov wrote:
>>> Current implementation of percpu allocator uses total possible number of CPUs (nr_cpu_ids) to
>>> get number of units to allocate per chunk. Every alloc_percpu() request of N bytes will allocate
>>> N*nr_cpu_ids bytes even if the number of present CPUs is much less. Percpu allocator grows by
>>> number of chunks keeping number of units per chunk constant. This is done in that way to
>>> simplify CPU hotplug/remove to have per-cpu area preallocated.
>>> 
>>> Problem: This behavior can lead to inefficient memory usage for big server machines and VMs,
>>> where nr_cpu_ids is huge.
>>> 
>>> Example from my experiment:
>>> 2 vCPU VM with hotplug support (up to 128):
>>> [    0.105989] smpboot: Allowing 128 CPUs, 126 hotplug CPUs
>>> By creating huge amount of active or/and dying memory cgroups, I can generate active percpu
>>> allocations of 100 MB (per single CPU) including fragmentation overhead. But in that case total
>>> percpu memory consumption (reported in /proc/meminfo) will be 12.8 GB. BTW, chunks are
>>> filled by ~75% in my experiment, so fragmentation is not a concern.
>>> Out of 12.8 GB:
>>>  - 0.2 GB are actually used by present vCPUs, and
>>>  - 12.6 GB are "wasted"!
>>> 
>>> I've seen production VMs consuming 16-20 GB of memory by Percpu. Roman reported 100 GB.
>>> There are solutions to reduce "wasted" memory overhead such as: disabling CPU hotplug; reducing
>>> number of maximum CPUs reported by hypervisor or/and firmware; using possible_cpus= kernel
>>> parameter. But it won't eliminate fundamental issue with "wasted" memory.
>>> 
>>> Suggestion: To support percpu chunks scaling by number of units there. To allocate/deallocate new
>>> units for existing chunks on CPU hotplug/remove event.
>>> 
>> Idk. In theory it sounds doable. In practice I'm not so sure. The two
>> problems off the top of my head:
>> 1) What happens if we can't allocate new pages when a cpu is onlined?
Simply put: onlining the CPU can fail with an error on allocation failure. Or, potentially, the online can be retried later once memory becomes available.
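For reference, a failing "online" callback already aborts the onlining of that CPU; a sketch using the
existing cpuhp API (the pcpu_populate_units_for() helper is hypothetical):

static int percpu_cpu_online(unsigned int cpu)
{
        /* Hypothetical: allocate the new units for this cpu;
         * returning -ENOMEM here fails the CPU online operation. */
        return pcpu_populate_units_for(cpu);
}

ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mm/percpu:online",
                        percpu_cpu_online, NULL);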

>> 2) It's possible users set particular conditions in percpu variables
>> that are not tied to just statistics summing (such as the cpu
>> runqueues). Users would have to provide online init and exit functions
>> which could get weird.
I do not think an online init/exit function is the right approach.
There are many places in Linux where percpu data gets initialized right after being allocated:

ptr = alloc_percpu(struct foo);  /* 'struct foo' stands in for the caller's type */
for_each_possible_cpu(cpu) {
        initialize(per_cpu_ptr(ptr, cpu));
}

Let's keep all such instances untouched. The hope is that initialize() only touches the contents of the
percpu area without allocating substructures; if it does allocate, it should be redesigned.
BTW, this loop also does extra work (runtime overhead) to initialize areas for possible CPUs which might never arrive.

The proposal:
 - In case possible_cpus > online_cpus, add one additional unit (call it A) to each chunk, holding the
   initialized image of the percpu data for not-yet-present CPUs.
 - for_each_possible_cpu(cpu) from the snippet above should go through all online CPUs + 1 (for unit A).
 - On new CPU #N arrival, percpu should allocate the corresponding unit N and initialize its contents
   from unit A. Repeat for all chunks (rough sketch below).
 - On CPU #D departure, release unit D from the chunks, keeping unit A intact.
 - In case possible_cpus > online_cpus, the overhead is +1 unit (for unit A), while the current overhead
   is +(possible_cpus - online_cpus) units.
 - In case possible_cpus == online_cpus (no CPU hotplug), do not allocate unit A and keep the percpu
   allocator as it is now - no overhead.
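A rough sketch of the online path under this proposal (all function and field names below are made up
purely for illustration; nothing like this exists in mm/percpu.c today):

static int pcpu_unit_online(unsigned int cpu)
{
        struct pcpu_chunk *chunk;

        /* Walk every chunk and materialize unit N for the incoming CPU. */
        list_for_each_entry(chunk, &pcpu_all_chunks, list) {
                void *unit_n = pcpu_alloc_unit(chunk, cpu);

                if (!unit_n)
                        return -ENOMEM;  /* fail the CPU online */

                /* Copy the template image from unit A into the new unit N. */
                memcpy(unit_n, pcpu_unit_a(chunk), pcpu_unit_size(chunk));
        }
        return 0;
}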

Does it fully cover the 2nd concern?

>> As Roman mentioned, I think it would be much better to not have the
>> large discrepancy between the cpu_online_mask and the cpu_possible_mask.
> 
> Indeed it is quite common on PowerPC to set a VM with a possible high number of CPUs but with a reasonnable number of online CPUs. This allows the user to scale up its VM when needed.
> 
> For instance we may see up to 1024 possible CPUs while the online number is *only* 128.
Agree. In VMs, vCPUs are just threads/processes on the host and can easily be added/removed on demand.

Thanks,
—Alexey





* Re: Percpu allocator: CPU hotplug support
  2021-04-22  8:22     ` Alexey Makhalov
@ 2021-04-22 17:52       ` Vlastimil Babka
  0 siblings, 0 replies; 7+ messages in thread
From: Vlastimil Babka @ 2021-04-22 17:52 UTC (permalink / raw)
  To: Alexey Makhalov, linux-mm, Laurent Dufour
  Cc: Dennis Zhou, Tejun Heo, Christoph Lameter, Roman Gushchin,
	Aneesh Kumar K.V, Srikar Dronamraju

On 4/22/21 10:22 AM, Alexey Makhalov wrote:
> Hello,
> 
>>> 2) It's possible users set particular conditions in percpu variables
>>> that are not tied to just statistics summing (such as the cpu
>>> runqueues). Users would have to provide online init and exit functions
>>> which could get weird.
> I do not think online init/exit function is a right approach.
> There are many places in the Linux where percpu data get initialized right after got allocated:
> ptr = alloc_percpu();
> for_each_possible_cpu(cpu) {
>         initialize (per_cpu_ptr(ptr, cpu));
> }
> Let’s keep all such instances untouched. Hope initialize() just touch content of percpu area without allocating substructures. If so - it should be redesigned.

I'm afraid that 'hope' won't get us far. For example, in mm/page_alloc.c we
use INIT_LIST_HEAD() for percpu structures, which means they are initialized to
empty list_heads consisting of two "self-pointers", and you can't just memcpy that
elsewhere.
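To spell out why (illustrative only):

struct list_head head;

INIT_LIST_HEAD(&head);
/* Now head.next == head.prev == &head. memcpy()ing this unit into another
 * CPU's unit leaves the copy pointing at the template's address instead of
 * at itself, so the copied list is silently corrupt. */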

You could try to special-case this stuff in your "initialize N from A" approach,
but it becomes rather fragile, so we would indeed need callbacks for proper
init/exit on online/offline.

> BTW, this loop does extra work (runtime overhead) to initialize areas for possible cpus which might never arrive.
> 
> The proposal:
>  - in case of possible_cpus > online_cpus, add additional unit (call it A) to the chunks which will contain initialized image of percpu data for possible cpus.
>  - for_each_possible_cpu(cpu) from snippet above should go through all online cpus + 1 (for unit A).
>  - on new CPU #N arrival, percpu should allocate corresponding unit N and initialize its content by data from unit A. Repeat for all chunks.
>  - on CPU D departure - release unit D from the chunks, keeping unit A intact.
>  - in case of possible_cpus > online_cpus, overhead will be +1 (for unit A), while current overhead is +(possible_cpus-online_cpus).
>  - in case of possible_cpus == online_cpus (no CPU hotplug) - do not allocate unit A, keep percpu allocator as it is now - no overhead.
> 
> Does it fully cover 2nd concern?
> 
>>> As Roman mentioned, I think it would be much better to not have the
>>> large discrepancy between the cpu_online_mask and the cpu_possible_mask.
>> 
>> Indeed it is quite common on PowerPC to set a VM with a possible high number of CPUs but with a reasonnable number of online CPUs. This allows the user to scale up its VM when needed.

Yeah somehow it's always PowerPC with this kind of possible vs online problem :)
Last time I recall it was SLUB page order.

So I'm not against the hotplug support, but it really won't be simple.

>> For instance we may see up to 1024 possible CPUs while the online number is *only* 128.
> Agree. In VMs, vCPUs there are just threads/processes on the host and can be easily added/removed on demand.
> 
> Thanks,
> —Alexey
> 
> 
> 




* Re: Percpu allocator: CPU hotplug support
  2021-04-22  0:44 Percpu allocator: CPU hotplug support Alexey Makhalov
  2021-04-22  1:10 ` Roman Gushchin
  2021-04-22  1:33 ` Dennis Zhou
@ 2021-04-29 11:39 ` Pratik Sampat
  2 siblings, 0 replies; 7+ messages in thread
From: Pratik Sampat @ 2021-04-29 11:39 UTC (permalink / raw)
  To: Alexey Makhalov, linux-mm
  Cc: Dennis Zhou, Roman Gushchin, Vlastimil Babka, Christoph Lameter,
	ldufour, Tejun Heo, Aneesh Kumar K.V, Srikar Dronamraju,
	pratik.r.sampat

Hello

On 22/04/21 6:14 am, Alexey Makhalov wrote:
> Current implementation of percpu allocator uses total possible number of CPUs (nr_cpu_ids) to
> get number of units to allocate per chunk. Every alloc_percpu() request of N bytes will allocate
> N*nr_cpu_ids bytes even if the number of present CPUs is much less. Percpu allocator grows by
> number of chunks keeping number of units per chunk constant. This is done in that way to
> simplify CPU hotplug/remove to have per-cpu area preallocated.
>
> Problem: This behavior can lead to inefficient memory usage for big server machines and VMs,
> where nr_cpu_ids is huge.
>
> Example from my experiment:
> 2 vCPU VM with hotplug support (up to 128):
> [    0.105989] smpboot: Allowing 128 CPUs, 126 hotplug CPUs
> By creating huge amount of active or/and dying memory cgroups, I can generate active percpu
> allocations of 100 MB (per single CPU) including fragmentation overhead. But in that case total
> percpu memory consumption (reported in /proc/meminfo) will be 12.8 GB. BTW, chunks are
> filled by ~75% in my experiment, so fragmentation is not a concern.
> Out of 12.8 GB:
>   - 0.2 GB are actually used by present vCPUs, and
>   - 12.6 GB are "wasted"!
>
> I've seen production VMs consuming 16-20 GB of memory by Percpu. Roman reported 100 GB.
> There are solutions to reduce "wasted" memory overhead such as: disabling CPU hotplug; reducing
> number of maximum CPUs reported by hypervisor or/and firmware; using possible_cpus= kernel
> parameter. But it won't eliminate fundamental issue with "wasted" memory.
>
> Suggestion: To support percpu chunks scaling by number of units there. To allocate/deallocate new
> units for existing chunks on CPU hotplug/remove event.
>
> Any thoughts? Thanks! --Alexey
>
>
I've run some traces around memory cgroups to determine the memory consumed by
the Percpu allocator and the major contributors to these allocations, by either
creating an empty memory cgroup or an empty container.

There are 4 memcg percpu allocation charges I see when I create a cgroup
attached to a memory controller. They seem to belong to mm/memcontrol.c's
lruvec_stat and vmstats.

I've run this experiment in 2 configurations on a POWER9 box
1. cpus=16 (present), maxcpus=16   (possible)
2. cpus=16 (present), maxcpus=1024 (possible)

On system boot,
Maxcpus    Sum percpu charges(MB)
16         2.4979
1024       159.86

0 MB container setup (empty parallel container setup that just spawns and spins)
Maxcpus    per container avg(MB)
16         0.0398
1024       2.5507

The difference in cgroup charges, although quite small in absolute numbers,
wastes memory proportionally when the cgroup or container setup is scaled
to, say, 10,000 containers.
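Extrapolating the per-container averages above linearly: 0.0398 MB * 10,000 is roughly 0.4 GB with
maxcpus=16, versus 2.5507 MB * 10,000, roughly 25 GB, with maxcpus=1024.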

If memory cgroups are the point of focus, would it make sense to optimize only
those callers to be hotplug aware, rather than attempting to optimize the
whole percpu allocator?

Thanks,
Pratik




