Re: [RFC] cgroup gets release after long time

From: Roman Gushchin <guro@fb.com>
To: Jiri Olsa <jolsa@redhat.com>
Cc: Tejun Heo <tj@kernel.org>, Li Zefan <lizefan@huawei.com>,
	Daniel Mack <daniel@zonque.org>,
	"cgroups@vger.kernel.org" <cgroups@vger.kernel.org>,
	"bpf@vger.kernel.org" <bpf@vger.kernel.org>,
	Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	"David S. Miller" <davem@davemloft.net>,
	Pavel Hrdina <phrdina@redhat.com>
Subject: Re: [RFC] cgroup gets release after long time
Date: Thu, 16 May 2019 15:22:33 +0000
Message-ID: <20190516152224.GA7163@castle.DHCP.thefacebook.com>
In-Reply-To: <20190516103915.GB27421@krava>

On Thu, May 16, 2019 at 12:39:15PM +0200, Jiri Olsa wrote:
> hi,
> Pavel reported an issue with bpf programs (attached to a cgroup)
> not being released when the cgroup is removed; they are still
> visible in the 'bpftool prog' list afterwards.

Hi Jiri!

Could you please try the patch from
https://github.com/rgushchin/linux/commit/f77afa1952d81a1afa6c4872d342bf6721e148e2 ?

It should solve the problem, and I'm about to post it upstream.

Thanks!

> 
> It seems like this is not bpf specific, because I was able
> to cut the bpf code from his example and still see delayed
> release of cgroup.
> 
> It happens only on cgroup2 fs (booted with the systemd.unified_cgroup_hierarchy=1
> kernel command line option); please check the attached program
> below and the following scenario:
> 
> TERM 1
> # gcc -o test test.c
> 
> 			TERM 2
> 			# cd /sys/kernel/debug/tracing
> 			# echo 1 > events/cgroup/cgroup_release/enable
> 
> TERM 1 -> create and remove group1
> # ./test group1
> qemu-system-x86_64: terminating on signal 15 from pid 1775 (./test)
> 
> 			TERM 2
> 			# cat trace_pipe
> 			<nothing>
> 
> TERM 1 -> create and remove group2
> # ./test group2
> qemu-system-x86_64: terminating on signal 15 from pid 1783 (./test)
> 
> 			TERM 2  - group1 being released
> 			# cat trace_pipe
> 			kworker/22:2-1135  [022] ....  2947.375526: cgroup_release: root=0 id=78 level=1 path=/group1
> 
> TERM 1 -> create and remove group3
> # ./test group3
> qemu-system-x86_64: terminating on signal 15 from pid 1798 (./test)
> 
> 			TERM 2 - group2 being released
> 			# cat trace_pipe
> 			kworker/22:2-1135  [022] ....  2947.375526: cgroup_release: root=0 id=78 level=1 path=/group1
> 			kworker/22:0-1787  [022] ....  2961.501261: cgroup_release: root=0 id=78 level=1 path=/group2
> 
> 
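
To put numbers on the delay seen in the scenario above, the trace_pipe
lines can be parsed rather than read by eye. The following is only a
minimal sketch: the struct and function names are hypothetical, and it
assumes the line layout is exactly the one shown in the quoted trace
output (timestamp, then "cgroup_release:", then root/id/level/path).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Fields of one cgroup_release line as printed in trace_pipe. */
struct release_event {
	double ts;		/* trace timestamp, in seconds */
	int root, id, level;
	char path[256];		/* cgroup path, e.g. "/group1" */
};

/*
 * Parse a line such as
 *   kworker/22:2-1135  [022] ....  2947.375526: cgroup_release: root=0 id=78 level=1 path=/group1
 * Returns 0 on success, -1 if the line is not a cgroup_release event.
 */
static int parse_release(const char *line, struct release_event *ev)
{
	const char *tag, *p;

	assert(line != NULL && ev != NULL);

	tag = strstr(line, ": cgroup_release:");
	if (!tag)
		return -1;

	/* The timestamp is the token that ends right before the tag. */
	p = tag;
	while (p > line && p[-1] != ' ' && p[-1] != '\t')
		p--;
	if (sscanf(p, "%lf", &ev->ts) != 1)
		return -1;

	/* The event payload follows the tag as key=value pairs. */
	if (sscanf(tag, ": cgroup_release: root=%d id=%d level=%d path=%255s",
		   &ev->root, &ev->id, &ev->level, ev->path) != 4)
		return -1;

	return 0;
}
```

Feeding each line read from trace_pipe to parse_release() and comparing
ev.ts with the time at which rmdir() returned would give the release
delay per cgroup.
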
> Looks like the release of the previous cgroup is triggered by creating
> another cgroup.  If I don't do anything, the cgroup is released
> (as the tracepoint shows) after about 90 seconds.
> 
> The cgroup_release tracepoint is triggered in css_release_work_fn,
> the same function where cgroup_bpf_put is called, hence the
> delay in releasing the bpf programs.
> 
> Is this expected or somehow configurable? It's confusing to see
> all the bpf programs from removed cgroups still around. In Pavel's
> setup there are about 100 of them.
> 
> Note, I could reproduce this only with qemu-kvm running in the child
> process in the example below.
> 
> thoughts? thanks,
> jirka
> 
> 
> ---
> #include <fcntl.h>
> #include <signal.h>
> #include <stdio.h>
> #include <string.h>
> #include <sys/stat.h>
> #include <sys/types.h>
> #include <unistd.h>
> 
> #define CGROUP_PATH "/sys/fs/cgroup"
> 
> int
> main(int argc, char **argv)
> {
> 	pid_t pid = -1;
> 	char path[1024];
> 	int rc;
> 
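> 	/* argv[1] names the cgroup to create; bail out if it is missing */
> 	if (argc < 2) {
> 		fprintf(stderr, "usage: %s <cgroup-name>\n", argv[0]);
> 		return 1;
> 	}
> 
> 	pid = fork();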
> 
> 	if (pid == 0) {
> 		execl("/usr/bin/qemu-kvm",
> 		      "/usr/bin/qemu-kvm",
> 		      "-display", "none",
> 		      NULL);
> 		fprintf(stderr, "failed to start qemu process\n");
> 		_exit(-1);
> 	} else {
> 		int filefd = -1;
> 		char proc[1024];
> 
> 		snprintf(path, 1024, "%s/%s", CGROUP_PATH, argv[1]);
> 
> 		sleep(1);
> 
> 		if (mkdir(path, 0755) < 0) {
> 			fprintf(stderr, "failed to create cgroup '%s'\n", path);
> 			return -1;
> 		}
> 
> 		snprintf(proc, 1024, "%s/cgroup.procs", path);
> 
> 		filefd = open(proc, O_WRONLY|O_TRUNC);
> 		if (filefd >= 0) {	/* 0 is a valid descriptor */
> 			dprintf(filefd, "%d", (int)pid);
> 			close(filefd);
> 		}
> 
> 		sleep(1);
> 	}
> 
> 	if (pid > 0)
> 		kill(pid, SIGTERM);
> 	do {
> 		rc = rmdir(path);
> 	} while (rc != 0);
> 
> 	return 0;
> }
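
A side note on the reproducer's final loop: it spins on rmdir() until it
succeeds, so any error other than "cgroup still populated" makes it loop
forever. A bounded variant could look like the sketch below. The name
rmdir_retry, the 10 ms pause and the retry limit are assumptions for
illustration, not part of the original test program; the sketch relies on
rmdir() of a cgroup that still has member processes failing with EBUSY.

```c
#include <assert.h>
#include <errno.h>
#include <time.h>
#include <unistd.h>

/*
 * Retry rmdir() while it fails with EBUSY, i.e. while the cgroup still
 * has member processes. Gives up after max_tries attempts or on any
 * other error. Returns 0 on success, -1 with errno set otherwise.
 */
static int rmdir_retry(const char *path, int max_tries)
{
	/* 10 ms between attempts; an arbitrary value for this sketch */
	const struct timespec delay = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 };
	int i;

	assert(path != NULL);

	for (i = 0; i < max_tries; i++) {
		if (rmdir(path) == 0)
			return 0;
		if (errno != EBUSY)
			return -1;	/* e.g. ENOENT: retrying will not help */
		nanosleep(&delay, NULL);
	}

	errno = EBUSY;	/* still busy after max_tries attempts */
	return -1;
}
```

In the test program this would replace the do/while loop, for example
as "if (rmdir_retry(path, 500) < 0) perror("rmdir");".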