Re: [trace-cmd Patch RFC] trace-cmd: top: A new interface to detect peak memory

From: Pratyush Anand <panand@redhat.com>
To: Steven Rostedt <rostedt@goodmis.org>
Cc: linux-kernel@vger.kernel.org, dyoung@redhat.com, xlpang@redhat.com
Subject: Re: [trace-cmd Patch RFC] trace-cmd: top: A new interface to detect peak memory
Date: Thu, 27 Apr 2017 19:32:43 +0530	[thread overview]
Message-ID: <727920fa-219c-6d8d-3ba6-2f0553b2cbdc@redhat.com> (raw)
In-Reply-To: <20170426100135.758cb098@gandalf.local.home>

Hi Steve,

On Wednesday 26 April 2017 07:31 PM, Steven Rostedt wrote:
>
> Sorry for the late reply. I finally have time to start looking at
> trace-cmd again.

Thanks a lot for your review :-)

>
> On Wed,  1 Mar 2017 20:32:37 +0530
> Pratyush Anand <panand@redhat.com> wrote:
>

[...]

> This is as nice idea, but it needs some clean ups. I tried it out just
> to play with it.
>
> First, I just did:
>
>  trace-cmd top
>
> And nothing happened. Either a help message needs to be printed, or it
> starts something. Maybe have it be interactive, that is, print out data
> as it comes in?
>

[...]

>
> But then for some reason it -p stopped doing anything. And -e didn't
> stop it either. But after killing it manually and removing the pid
> file, it appeared to work normal again. I can't seem to repeat it, so
> lets not worry about that now.
>

Yes, this version was sharing info between two processes using files. I have a 
new version [1] which has sharing through socket and that works better. Help 
issue is also fixed in that version.

> More below.
>

[...]

>> +void top_disable_events(void)
>> +{
>> +	char *path;
>> +	char c;
>> +	int fd;
>> +	int ret;
>> +
>> +	/*
>> +	 * WARNING: We want only few events related to memory allocation to
>> +	 * be enabled. Disable all events here. Latter, selective events
>> +	 * would be enabled
>
> trace-cmd reset does this. You may want to call that instead of doing
> it manually here. Someone else wanted to pull out trace-cmd reset into
> its own file. I'll have to get those patches. But then we can call that
> and enable tracing again.
>
> Or better yet, if we are only tracing memory events, just create your
> own instance and use that. Why bother with normal tracing?
>
>  mkdir /sys/kernel/debug/tracing/instances/trace-cmd-memory

Yes, it seems a good idea. Will do.

>
>> +	 */
>> +	c = '0';
>> +	path = get_instance_file(&top_instance, "events/enable");
>> +	fd = open(path, O_WRONLY);
>> +	if (fd < 0)
>> +		die("failed to open '%s'", path);
>> +	ret = write(fd, &c, 1);
>> +	close(fd);
>> +	tracecmd_put_tracing_file(path);
>> +	if (ret < 1)
>> +		die("failed to write 0 to events/enable");
>> +
>> +	path = get_instance_file(&top_instance, "tracing_on");
>> +	fd = open(path, O_WRONLY);
>> +	if (fd < 0)
>> +		die("failed to open '%s'", path);
>> +	ret = write(fd, &c, 1);
>> +	close(fd);
>> +	tracecmd_put_tracing_file(path);
>> +	if (ret < 1)
>> +		die("failed to write 0 to tracing_on\n");
>> +
>> +	path = get_instance_file(&top_instance, "set_event");
>> +	fd = open(path, O_WRONLY);
>> +	if (fd < 0)
>> +		die("failed to open '%s'", path);
>> +	close(fd);
>> +	tracecmd_put_tracing_file(path);
>> +}
>> +
>> +void top_exit(void)
>> +{
>> +	int i;
>> +	/*
>> +	 * free all the dynamic memories which could have been allocated by
>> +	 * this program
>> +	 */
>> +	if (kbuf)
>> +		kbuffer_free(kbuf);
>> +	if (record)
>> +		free_record(record);
>> +	if (pevent)
>> +		pevent_free(pevent);
>
> The above three functions all can handle a NULL pointer. Remove the if
> statements.

ok

>
>> +	if (top_task_stats) {
>> +		for (i = 0; i < top_task_cnt; i++)
>> +			free(top_task_stats[i].comm);
>> +		free(top_task_stats);
>> +	}
>> +
>> +	top_disable_events();
>> +	remove(TRACE_CMD_TOP_PID_FILE);
>> +}
>> +
>> +void top_initialize_events(void)
>> +{
>> +	char *path;
>> +	int fd;
>> +	char c;
>> +	int ret;
>> +	char *buf;
>> +	int size;
>> +	enum kbuffer_long_size long_size;
>> +	enum kbuffer_endian endian;
>> +	char id_str[8];
>> +
>
> Might want to look at trace_stream_init() and use that instead. It is
> made for live streaming of events.

OK, will look into trace_stream_init().

>
>> +	/* trace data is read from the trace_raw_pipe directly */
>> +	if (tracecmd_host_bigendian())
>> +		endian = KBUFFER_ENDIAN_BIG;
>> +	else
>> +		endian = KBUFFER_ENDIAN_LITTLE;
>> +

[...]

>> +void top_start(void)
>> +{
>> +	pid_t pid;
>> +	int fd, ret;
>> +
>> +	top_disable_events();
>> +	top_initialize_events();
>> +	page_size = getpagesize();
>> +
>> +	pid = fork();
>> +	if (!pid) {
>> +		do {
>> +			top_process_events();
>> +			sleep(1);
>
> We should probably just sleep on the buffer, waiting for data.

OK

>
>> +			if (set_print_only) {
>> +				top_print_stats();
>> +				set_print_only = false;
>> +			}
>> +			if (set_print_and_exit) {
>> +				top_print_stats();
>> +				top_exit();
>> +				break;
>> +			}
>> +		} while (1);
>> +	} else {
>> +		fd = open(TRACE_CMD_TOP_PID_FILE, O_WRONLY | O_CREAT);
>> +		if (fd < 0)
>> +			goto kill_child;
>> +		ret = write(fd, &pid, sizeof(pid));
>> +		if (ret < sizeof(pid))
>> +			goto close_fd;
>> +		close(fd);
>> +	}
>> +
>> +	return;
>> +
>> +close_fd:
>> +	close(fd);
>> +kill_child:
>> +	kill(pid, SIGUSR2);
>> +	die("Failed to start trace-top");
>> +	return;
>> +}
>> +

[...]

> Looks good, I think we can extend on this.

Thanks :-)

I will implement your review comments and will send next revision.
However, I had couple of observation which I was unable to justify:

# ./trace-cmd top -s /tmp/test
# ./trace-cmd top -p /tmp/test | grep trace-cmd
   15292           trace-cmd               22144           15808
Here,15292 is the pid of trace-cmd task
22144 KB is the peak memory usage
15808 KB is the current memory usage

Now check rss component from statm
# cat /proc/15292/statm
   50 35 23 7 0 12 0 36

This is a result on ARM64/64KB page size. Therefore, as per statm rss is 35 
pages = 35*64 = 2240KB
I patched my kernel [2] for test purpose, so that statm reports peak memory as 
well. Here, the last extra entry in statm output is peak memory and it is 36 
pages = 2304KB.
So, this is a huge difference between what has been reported by statm and what 
we get from trace-cmd.
I understand that `trace-cmd top` output would only be approximate, because 
many of the memory could be allocated by task and freed in interrupt context. 
So, many a time it can differ. But, is such a huge difference justified? If 
yes, can we count on the output of this utility to find early boot time oom 
issues?

[1] 
https://github.com/pratyushanand/trace-cmd/commit/602c2cd96aa613633ad20c6d382e41f7db37a2a4
[2] 
https://github.com/pratyushanand/linux/commit/197e2045361b6b70085c46c78ea6607d8afce517