* [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining
@ 2023-01-25  7:34 Leonardo Bras
From: Leonardo Bras @ 2023-01-25  7:34 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Marcelo Tosatti
  Cc: Leonardo Bras, cgroups, linux-mm, linux-kernel

Disclaimer:
a - The cover letter got bigger than expected, so I had to split it into
    sections to better organize myself. I am not very comfortable with it.
b - The performance numbers below do not include patch 5/5 (Remove flags
    from memcg_stock_pcp), which could further improve performance for
    drain_all_stock(); I only noticed that optimization at the last minute.


0 - Motivation:
In the current codebase, when drain_all_stock() runs, it schedules a
drain_local_stock() work item on each cpu that has a percpu stock associated
with a descendant of the given root_memcg.

This happens even on 'isolated cpus', a feature commonly used by workloads
that are sensitive to interruptions and context switching, such as vRAN and
Industrial Control Systems.

Since this scheduling behavior is a problem for those workloads, the proposal
is to replace the current local_lock + schedule_work_on() solution with a
per-cpu spinlock, allowing drain_all_stock() to drain every stock remotely.
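
For illustration, below is a rough, simplified sketch of the idea (not the
actual patches: field and helper names mirror mm/memcontrol.c, but obj_stock
handling, flags and error paths are omitted):

	/*
	 * Simplified sketch only: a per-cpu spinlock protects the stock, so
	 * drain_all_stock() can drain every CPU's stock remotely instead of
	 * queueing drain_local_stock() work on each CPU.
	 */
	struct memcg_stock_pcp {
		spinlock_t stock_lock;		/* replaces local_lock_t */
		struct mem_cgroup *cached;
		unsigned int nr_pages;
		/* ... other members as in mm/memcontrol.c ... */
	};
	static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);

	static void drain_all_stock(struct mem_cgroup *root_memcg)
	{
		int cpu;

		for_each_online_cpu(cpu) {
			struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
			unsigned long flags;

			spin_lock_irqsave(&stock->stock_lock, flags);
			if (stock->cached &&
			    mem_cgroup_is_descendant(stock->cached, root_memcg))
				drain_stock(stock);	/* existing local drain helper */
			spin_unlock_irqrestore(&stock->stock_lock, flags);
		}
	}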

1 - Possible performance losses:
With a low number of schedule_work_on() calls, local_locks are supposed to
perform better than spinlocks, given the cacheline is always accessed by a
single CPU and there is no contention. The impact on those areas is analyzed
below:

1.1 - Cacheline usage
In the current implementation, drain_all_stock() remotely reads the percpu
memcg_stock cacheline of every online CPU, and remotely writes to all cpus
that pass the mem_cgroup_is_descendant() test. (stock->flags, stock->work)

With spinlocks, drain_all_stock() still remotely reads the percpu memcg_stock
cacheline of every online CPU, and remotely writes to all cpus that pass the
mem_cgroup_is_descendant() test. (stock->stock_lock, on top of the above)
While the spinlock may require extra acquire/release writes on some archs,
they all happen on an already-exclusive cacheline, so there is not much
overhead.

In both cases, the next local cpu read will require a fetch from memory (as
the line was invalidated by the remote write), and cacheline exclusivity must
be regained before the local write.

So, regarding cacheline usage, there should not be much impact.

1.2 - Contention
We can safely assume that drain_all_stock() does not run often. If that were
not the case, there would be a lot of scheduled work items, killing cpu
performance.

Since it does not run often, and it is the only function that accesses remote
percpu memcg_stock, contention should not be significant, and should cause
less impact than scheduling the (remote) work and running the scheduled work
(locally).


2 - Performance test results:
2.1 - Test System:
	- Non-virtualized AMD EPYC 7601 32-Core Processor, 128 CPUs distributed 
	in 8 NUMA nodes:  0-7,64-71 on node0, 8-15,72-79 on node1, and so on.
	- 256GB RAM, 2x32GB per NUMA node
	- For cpu isolation: use kernel cmdline:
	isolcpus=8-15,72-79 nohz_full=8-15,72-79
	- Memcg group created with:
		[Slice]
		MemoryAccounting=true
		MemoryLimit=1G
		MemoryMax=1G
		MemorySwapMax=1M
	- For pinning runs on a given CPU, I used 'taskset -c $cpunum command'
	- For restarting the memcg, the following was used:
	restart_memcg(){
	  systemctl stop test.slice
	  systemctl daemon-reload
	  systemctl start test.slice
	}

2.2 - Impact on functions that use memcg_stock:
	2.2.1 - Approach: Using perf or tracepoints gives a very coarse result,
	so it was preferred to count the total cpu clock cycles between entering
	and exiting the functions that use memcg_stock, on an isolated cpu.
	Something like this was used on x86:

	fun_name(){
		u64 clk = rdtsc_ordered();
		... <function does its job>
		clk = rdtsc_ordered() - clk;
		<percpu struct dbg>
		dbg->fun_name.cycles += clk;
		dbg->fun_name.count++;
	}
	
	The percpu statistics were then acquired via a char device after the 
	test finished, and an average function clock usage was calculated.
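
	For reference, a hypothetical layout of that percpu debug struct (the
	names here are made up for this cover letter and are not part of the
	patches) could be:

	struct fun_stats {
		u64 cycles;	/* sum of rdtsc_ordered() deltas */
		u64 count;	/* number of calls */
	};

	struct memcg_stock_dbg {
		struct fun_stats consume_stock;
		struct fun_stats refill_stock;
		struct fun_stats mod_objcg_state;
		struct fun_stats consume_obj_stock;
		struct fun_stats refill_obj_stock;
		struct fun_stats drain_all_stock;
	};
	static DEFINE_PER_CPU(struct memcg_stock_dbg, memcg_stock_dbg);

	Each instrumented function would then fetch 'dbg' with
	this_cpu_ptr(&memcg_stock_dbg) before updating its counters.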

	For the stress test, run "cat /proc/interrupts > /dev/null" in a loop
	of 1000000 iterations inside the memcg for each cpu tested:
	for cpu in $cpuset; do
		systemd-run --wait --slice=test.slice taskset -c $cpu bash -c "
		for k in {1..100000} ; do
		  cat /proc/interrupts > /dev/null;
		done"
	done
		
	For the drain_all_stock() test, it was necessary to restart the memcg
	(or cause an OOM) to call the function.
	
	2.2.2 - Results
	For 1 isolated CPU, pinned on cpu 8, with no drain_all_stock() calls,
	being STDDEV the standard deviation between the average on 6 runs,
	and Call Count the sum of calls on the 6 runs:
	
	Patched			Average clk	STDDEV		Call Count
	consume_stock:		63.75983475	0.1610502136	72167768
	refill_stock:		67.45708322	0.09732816852	23401756
	mod_objcg_state:	98.03841384	1.491628532	181292961
	consume_obj_stock:	63.2543456	0.04624513799	94846454
	refill_obj_stock:	78.56390025	0.3160306174	91732973
			
	Upstream		Average clk	STDDEV		Call Count
	consume_stock:		53.51201046	0.05200824438	71866536
	refill_stock:		65.46991584	0.1178078417	23401752
	mod_objcg_state:	84.95365055	1.371464414	181292707
	consume_obj_stock:	60.03619438	0.05944582207	94846327
	refill_obj_stock:	73.23757912	1.161933856	91732787
			
	Patched - Upstream	Diff (cycles)	Diff %	
	consume_stock:		10.24782429	19.15051258	
	refill_stock:		1.987167378	3.035237411	
	mod_objcg_state:	13.08476328	15.40223781	
	consume_obj_stock:	3.218151217	5.360351785	
	refill_obj_stock:	5.326321123	7.272661368	
	
	So, on average, the above patched functions are 2~13 clock cycles slower
	than upstream.
	
	On the other hand, drain_all_stock() is faster on the patched version,
	even though it does all the draining itself instead of scheduling the
	work on other CPUs:
	
	drain_all_stock
	cpus	Upstream 	Patched		Diff (cycles)	Diff(%)
	1	44331.10831	38978.03581	-5353.072507	-12.07520567
	8	43992.96512	39026.76654	-4966.198572	-11.2886198
	128	156274.6634	58053.87421	-98220.78915	-62.85138425
	
	(8 cpus being in the same NUMA node)
		
2.3 - Contention numbers
	2.3.1 - Approach
	On top of the patched version, I replaced the spin_lock_irqsave() in
	the functions that use memcg_stock with spin_lock_irqsave_cc(), which
	is defined as:
	
	#define spin_lock_irqsave_cc(l, flags)				\
		if (!spin_trylock_irqsave(l, flags)) {			\
			u64 clk = rdtsc_ordered();			\
			spin_lock_irqsave(l, flags);			\
			clk = rdtsc_ordered() - clk;			\
			pr_err("mydebug: cpu %d hit contention :"	\
			       " fun = %s, clk = %lld\n",		\
			       smp_processor_id(), __func__, clk);	\
		}

	So, in case of contention (the trylock failing), it records an
	approximate clock-cycle count spent waiting for the lock, and prints it
	to dmesg.
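
	As an illustration, a simplified consume_stock() using the macro would
	look like the sketch below (the real function also checks the charge
	batch size and differs in other details):

	static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
	{
		struct memcg_stock_pcp *stock;
		unsigned long flags;
		bool ret = false;

		stock = this_cpu_ptr(&memcg_stock);
		spin_lock_irqsave_cc(&stock->stock_lock, flags);

		if (memcg == stock->cached && stock->nr_pages >= nr_pages) {
			stock->nr_pages -= nr_pages;
			ret = true;
		}

		spin_unlock_irqrestore(&stock->stock_lock, flags);
		return ret;
	}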

	For the stress test, run "cat /proc/interrupts > /dev/null" in a loop
	of 1000000 iterations for each of the 128 cpus inside the memcg 
	(limit set to 20G):
	
	for cpu in {1..128}; do
		restart_memcg
		systemd-run --wait --slice=test.slice taskset -c $cpu bash -c "
		for k in {1..100000} ; do
		  cat /proc/interrupts > /dev/null;
		done"
	done

	This loop was repeated for over 8 hours.
	
	2.3.2- Results
	
	Function		# calls
	consume_stock:		15078323802
	refill_stock:		2495995683
	mod_objcg_state:	39765559905
	consume_obj_stock:	19882888224
	refill_obj_stock:	21025241793
	drain_all_stock:	592
	
	Contentions hit: 0
	
	(Other, more aggressive synthetic tests were also run, and even then
	contention was hit only a couple of times over several hours.)
	
2.4 - Syscall time measure
	2.4.1- Approach
	To measure the patchset's effect on syscall time, the following code
	was used (copied/adapted from
 https://lore.kernel.org/linux-mm/20220924152227.819815-1-atomlin@redhat.com/ ):
 
 	################
 	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/mman.h>
	#include <unistd.h>
	#include <string.h>

	typedef unsigned long long cycles_t;
	typedef unsigned long long usecs_t;
	typedef unsigned long long u64;

	#ifdef __x86_64__
	#define DECLARE_ARGS(val, low, high)    unsigned long low, high
	#define EAX_EDX_VAL(val, low, high)     ((low) | ((u64)(high) << 32))
	#define EAX_EDX_ARGS(val, low, high)    "a" (low), "d" (high)
	#define EAX_EDX_RET(val, low, high)     "=a" (low), "=d" (high)
	#else
	#define DECLARE_ARGS(val, low, high)    unsigned long long val
	#define EAX_EDX_VAL(val, low, high)     (val)
	#define EAX_EDX_ARGS(val, low, high)    "A" (val)
	#define EAX_EDX_RET(val, low, high)     "=A" (val)
	#endif

	static inline unsigned long long __rdtscll(void)
	{
		DECLARE_ARGS(val, low, high);

		/* cpuid serializes before rdtsc; it also clobbers ebx/ecx */
		asm volatile("cpuid; rdtsc"
			     : EAX_EDX_RET(val, low, high) : : "ebx", "ecx");

		return EAX_EDX_VAL(val, low, high);
	}

	#define rdtscll(val) do { (val) = __rdtscll(); } while (0)

	#define NRSYSCALLS 30000000
	#define NRSLEEPS   100000

	#define page_mmap()     mmap(NULL, 4096, PROT_READ|PROT_WRITE, \
			       MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)
	#define page_munmap(x)  munmap(x, 4096)

	int main(int argc, char *argv[])
	{
		cycles_t a, b, cycles;
		int i, syscall = 0;
		int *page;

		page = page_mmap();
		if (page == MAP_FAILED)
		        perror("mmap");

		if (page_munmap(page))
		        perror("munmap");

		if (argc != 2) {
		        printf("usage: %s {idle,syscall}\n", argv[0]);
		        exit(1);
		}

		rdtscll(a);

		if (strncmp("idle", argv[1], 4) == 0)
		        syscall = 0;
		else if (strncmp("syscall", argv[1], 7) == 0)
		        syscall = 1;
		else {
		      	printf("usage: %s {idle,syscall}\n", argv[0]);
		        exit(1);
		}

		if (syscall == 1) {
		        for (i = 0; i < NRSYSCALLS; i++) {
		                page = page_mmap();
		                if (page == MAP_FAILED)
		                        perror("mmap");

				#ifdef MY_WRITE
		                page[3] = i;
		                #endif

		                if (page_munmap(page))
		                        perror("munmap");
		        }
		} else {
		        for (i = 0; i < NRSLEEPS; i++)
		                usleep(10);
		}

		rdtscll(b);

		cycles = b - a;

		if (syscall == 1)
		        printf("cycles / syscall: %llu\n", cycles / (NRSYSCALLS * 2));
		else
		        printf("cycles / idle loop: %llu\n", cycles / NRSLEEPS);

		return 0;
	}
	################

	Running './my_test syscall' prints the average clock-cycle usage of the
	syscall pair (page_mmap() and page_munmap()).
	
	It was compiled in two versions: with -DMY_WRITE and without it.
	The difference is that the former also writes to the allocated page,
	causing it to fault.
	
	Each version was run 200 times, pinned to an isolated cpu.
	Then an average and standard deviation were calculated from those results.

	2.4.2- Results
	Patched: no_write	write
	AVG	2991.195	5746.495
	STDEV	27.77488427	40.55878512
	STDEV %	0.9285547838	0.7058004073	
	
	Upstream: no_write	write
	AVG	3012.22		5749.605
	STDEV	25.1186451	37.26206223
	STDEV %	0.8338914522	0.6480803851

	Pat - Up: no_write	write
	Diff	-21.025		-3.11
	Diff %	-0.6979901866	-0.05409067232
	
	Meaning the pair page_mmap() + page_munmap() + pagefault runs a tiny bit
	faster on the patched version, compared to upstream.
	
3 - Discussion on results	
	In 2.2 we see that every function that uses memcg_stock on the local
	cpu gets slower by a few cycles in the patched version, while the
	function that accesses it remotely (drain_all_stock()) gets faster.
	The difference becomes more pronounced as the cpu count grows and we
	consequently start dealing with memory shared across NUMA nodes.
	
	In 2.3 we see that contention is not a big issue, as expected in 1.2.
	This is probably because drain_all_stock() runs quite rarely in normal
	operation, while the other functions are quite fast.
	
	In 2.4 we can see that page_mmap() + page_munmap() + pagefault ran a
	tiny bit faster. This is probably because drain_all_stock() no longer
	schedules work on the running cpu, so that cpu does not get
	interrupted, possibly making up for the increased time spent in the
	local functions.
	
4- Conclusion
	Scheduling work on isolated cpus can be an issue for some workloads.
	Addressing it by replacing the local_lock in memcg_stock with a
	spinlock should not cause much impact on performance.
	
Leonardo Bras (5):
  mm/memcontrol: Align percpu memcg_stock to cache
  mm/memcontrol: Change stock_lock type from local_lock_t to spinlock_t
  mm/memcontrol: Reorder memcg_stock_pcp members to avoid holes
  mm/memcontrol: Perform all stock drain in current CPU
  mm/memcontrol: Remove flags from memcg_stock_pcp

 mm/memcontrol.c | 75 ++++++++++++++++++++-----------------------------
 1 file changed, 31 insertions(+), 44 deletions(-)

-- 
2.39.1


Thread overview: 48+ messages
2023-01-25  7:34 [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining Leonardo Bras
2023-01-25  7:34 ` [PATCH v2 1/5] mm/memcontrol: Align percpu memcg_stock to cache Leonardo Bras
2023-01-25  7:34 ` [PATCH v2 2/5] mm/memcontrol: Change stock_lock type from local_lock_t to spinlock_t Leonardo Bras
2023-01-25  7:35 ` [PATCH v2 3/5] mm/memcontrol: Reorder memcg_stock_pcp members to avoid holes Leonardo Bras
2023-01-25  7:35 ` [PATCH v2 4/5] mm/memcontrol: Perform all stock drain in current CPU Leonardo Bras
2023-01-25  7:35 ` [PATCH v2 5/5] mm/memcontrol: Remove flags from memcg_stock_pcp Leonardo Bras
2023-01-25  8:33 ` [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining Michal Hocko
2023-01-25 11:06   ` Leonardo Brás
2023-01-25 11:39     ` Michal Hocko
2023-01-25 18:22     ` Marcelo Tosatti
2023-01-25 23:14       ` Roman Gushchin
2023-01-26  7:41         ` Michal Hocko
2023-01-26 18:03           ` Marcelo Tosatti
2023-01-26 19:20             ` Michal Hocko
2023-01-27  0:32               ` Marcelo Tosatti
2023-01-27  6:58                 ` Michal Hocko
2023-02-01 18:31               ` Roman Gushchin
2023-01-26 23:12           ` Roman Gushchin
2023-01-27  7:11             ` Michal Hocko
2023-01-27  7:22               ` Leonardo Brás
2023-01-27  8:12                 ` Leonardo Brás
2023-01-27  9:23                   ` Michal Hocko
2023-01-27 13:03                   ` Frederic Weisbecker
2023-01-27 13:58               ` Michal Hocko
2023-01-27 18:18                 ` Roman Gushchin
2023-02-03 15:21                   ` Michal Hocko
2023-02-03 19:25                     ` Roman Gushchin
2023-02-13 13:36                       ` Michal Hocko
2023-01-27  7:14             ` Leonardo Brás
2023-01-27  7:20               ` Michal Hocko
2023-01-27  7:35                 ` Leonardo Brás
2023-01-27  9:29                   ` Michal Hocko
2023-01-27 19:29                     ` Leonardo Brás
2023-01-27 23:50                       ` Roman Gushchin
2023-01-26 18:19         ` Marcelo Tosatti
2023-01-27  5:40           ` Leonardo Brás
2023-01-26  7:45       ` Michal Hocko
2023-01-26 18:14         ` Marcelo Tosatti
2023-01-26 19:13           ` Michal Hocko
2023-01-27  6:55             ` Leonardo Brás
2023-01-31 11:35               ` Marcelo Tosatti
2023-02-01  4:36                 ` Leonardo Brás
2023-02-01 12:52                   ` Michal Hocko
2023-02-01 12:41                 ` Michal Hocko
2023-02-04  4:55                   ` Leonardo Brás
2023-02-05 19:49                     ` Roman Gushchin
2023-02-07  3:18                       ` Leonardo Brás
2023-02-08 19:23                         ` Roman Gushchin
