All of lore.kernel.org
 help / color / mirror / Atom feed
* Help with trace-cmd/ftrace recording process ID information
@ 2017-07-17 19:18 Will Hawkins
  2017-07-17 19:53 ` Steven Rostedt
  0 siblings, 1 reply; 7+ messages in thread
From: Will Hawkins @ 2017-07-17 19:18 UTC (permalink / raw)
  To: linux-kernel, rostedt


Hello everyone, especially Mr. Rostedt,

I have had great success with ftrace debugging performance issues on 
Linux systems. The combination of ftrace and trace-cmd are absolutely 
amazing tools for digging in to exactly what is going on in a system and 
where performance problems exist.

I recently switched to a different host and attempted to run trace-cmd 
record to get a record of page faults:

/path/to//trace-cmd/trace-cmd record -e page_fault_user /bin/ls

When I "report" on that trace, I get entries like the following:

<...>-41850 [010] 27484983.185200: page_fault_user: 
address=__per_cpu_end ip=__per_cpu_end error_code=0x14

It's exactly what I want. However, it does not list the process that 
generated that fault. Instead, it uses <...>. I dug into the trace-cmd 
code and see where this is generated and why it is generated.

What I don't understand is why on a different system, when I run the 
same record command, I get the following output:

  ls-19887 [005] 2438162.263793: page_fault_user: 
address=__per_cpu_end ip=__per_cpu_end error_code=0x14

Again, it's exactly what I want and it lists the process name that 
generated the fault.

 From the code, I see that the <...> is printed instead of the name of 
the process when the pid is not in the pevent's command lines. What I 
can't seem to figure out is why the process would be in that list on one 
host and not on the other.

When I looked at the trace.dat file directly, I did notice that on the 
"good" host, there are a list of pids and names. On the "bad" host, 
there is no such list in the trace.dat file. I am sure that is the 
reason for the <...>s being printed, but I can't figure out why that 
list is not getting in the trace.dat file.

I gave a quick look to try to find where that pid/comm list is generated 
and written to the trace.dat file, but couldn't find anything.

I figured that I would send an email before I dug any further in case 
someone has seen this already. I am happy to pass along other pertinent 
information if it is helpful to debug the problem. I just don't want to 
spam the list with information that is irrelevant.

Again, the combination of ftrace/trace-cmd is borderline magic. I 
appreciate all the work that has gone into it!

Thanks in advance for helping me sort through this issue!

Will

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Help with trace-cmd/ftrace recording process ID information
  2017-07-17 19:18 Help with trace-cmd/ftrace recording process ID information Will Hawkins
@ 2017-07-17 19:53 ` Steven Rostedt
  2017-07-17 20:06   ` Will Hawkins
  0 siblings, 1 reply; 7+ messages in thread
From: Steven Rostedt @ 2017-07-17 19:53 UTC (permalink / raw)
  To: Will Hawkins; +Cc: linux-kernel

On Mon, 17 Jul 2017 15:18:18 -0400
Will Hawkins <hawkinsw@borlaugic.com> wrote:

> Hello everyone, especially Mr. Rostedt,
> 
> I have had great success with ftrace debugging performance issues on 
> Linux systems. The combination of ftrace and trace-cmd are absolutely 
> amazing tools for digging in to exactly what is going on in a system and 
> where performance problems exist.
> 
> I recently switched to a different host and attempted to run trace-cmd 
> record to get a record of page faults:
> 
> /path/to//trace-cmd/trace-cmd record -e page_fault_user /bin/ls
> 
> When I "report" on that trace, I get entries like the following:
> 
> <...>-41850 [010] 27484983.185200: page_fault_user: 
> address=__per_cpu_end ip=__per_cpu_end error_code=0x14
> 
> It's exactly what I want. However, it does not list the process that 
> generated that fault. Instead, it uses <...>. I dug into the trace-cmd 
> code and see where this is generated and why it is generated.
> 
> What I don't understand is why on a different system, when I run the 
> same record command, I get the following output:
> 
>   ls-19887 [005] 2438162.263793: page_fault_user: 
> address=__per_cpu_end ip=__per_cpu_end error_code=0x14
> 
> Again, it's exactly what I want and it lists the process name that 
> generated the fault.
> 
>  From the code, I see that the <...> is printed instead of the name of 
> the process when the pid is not in the pevent's command lines. What I 
> can't seem to figure out is why the process would be in that list on one 
> host and not on the other.

Are you using the same kernel version and trace-cmd version on both
hosts?

> 
> When I looked at the trace.dat file directly, I did notice that on the 
> "good" host, there are a list of pids and names. On the "bad" host, 
> there is no such list in the trace.dat file. I am sure that is the 
> reason for the <...>s being printed, but I can't figure out why that 
> list is not getting in the trace.dat file.
> 
> I gave a quick look to try to find where that pid/comm list is generated 
> and written to the trace.dat file, but couldn't find anything.
> 
> I figured that I would send an email before I dug any further in case 
> someone has seen this already. I am happy to pass along other pertinent 
> information if it is helpful to debug the problem. I just don't want to 
> spam the list with information that is irrelevant.
> 
> Again, the combination of ftrace/trace-cmd is borderline magic. I 
> appreciate all the work that has gone into it!
> 
> Thanks in advance for helping me sort through this issue!

The comm (the program name) is not saved at each event. Instead,
there's a "cache" of them. On a schedule switch, when tracing is
active, it will store a comm in a table. The trace file uses this list
too. When trace-cmd is finished tracing, it will read that table which
is located in the tracefs directory and the file is called
saved_cmdlines.

By default, it saves 128 comms. If you want more or less, you can
change the size by echoing in the new size number into the file
saved_cmdlines_size.

I'm not sure why trace-cmd didn't save that file, unless it was an
older version that did the recording.


-- Steve

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Help with trace-cmd/ftrace recording process ID information
  2017-07-17 19:53 ` Steven Rostedt
@ 2017-07-17 20:06   ` Will Hawkins
  2017-07-17 20:08     ` Will Hawkins
  2017-07-17 20:17     ` Steven Rostedt
  0 siblings, 2 replies; 7+ messages in thread
From: Will Hawkins @ 2017-07-17 20:06 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: linux-kernel

Thank you for your response!

On Mon, Jul 17, 2017 at 3:53 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Mon, 17 Jul 2017 15:18:18 -0400
> Will Hawkins <hawkinsw@borlaugic.com> wrote:
>
> > Hello everyone, especially Mr. Rostedt,
> >
> > I have had great success with ftrace debugging performance issues on
> > Linux systems. The combination of ftrace and trace-cmd are absolutely
> > amazing tools for digging in to exactly what is going on in a system and
> > where performance problems exist.
> >
> > I recently switched to a different host and attempted to run trace-cmd
> > record to get a record of page faults:
> >
> > /path/to//trace-cmd/trace-cmd record -e page_fault_user /bin/ls
> >
> > When I "report" on that trace, I get entries like the following:
> >
> > <...>-41850 [010] 27484983.185200: page_fault_user:
> > address=__per_cpu_end ip=__per_cpu_end error_code=0x14
> >
> > It's exactly what I want. However, it does not list the process that
> > generated that fault. Instead, it uses <...>. I dug into the trace-cmd
> > code and see where this is generated and why it is generated.
> >
> > What I don't understand is why on a different system, when I run the
> > same record command, I get the following output:
> >
> >   ls-19887 [005] 2438162.263793: page_fault_user:
> > address=__per_cpu_end ip=__per_cpu_end error_code=0x14
> >
> > Again, it's exactly what I want and it lists the process name that
> > generated the fault.
> >
> >  From the code, I see that the <...> is printed instead of the name of
> > the process when the pid is not in the pevent's command lines. What I
> > can't seem to figure out is why the process would be in that list on one
> > host and not on the other.
>
> Are you using the same kernel version and trace-cmd version on both
> hosts?

Roughly, yes.

Both are running (old) kernels 3.13 but are running from HEAD of
master from the main git repository of trace-cmd.

>
> >
> > When I looked at the trace.dat file directly, I did notice that on the
> > "good" host, there are a list of pids and names. On the "bad" host,
> > there is no such list in the trace.dat file. I am sure that is the
> > reason for the <...>s being printed, but I can't figure out why that
> > list is not getting in the trace.dat file.
> >
> > I gave a quick look to try to find where that pid/comm list is generated
> > and written to the trace.dat file, but couldn't find anything.
> >
> > I figured that I would send an email before I dug any further in case
> > someone has seen this already. I am happy to pass along other pertinent
> > information if it is helpful to debug the problem. I just don't want to
> > spam the list with information that is irrelevant.
> >
> > Again, the combination of ftrace/trace-cmd is borderline magic. I
> > appreciate all the work that has gone into it!
> >
> > Thanks in advance for helping me sort through this issue!
>
> The comm (the program name) is not saved at each event. Instead,
> there's a "cache" of them. On a schedule switch, when tracing is
> active, it will store a comm in a table. The trace file uses this list
> too. When trace-cmd is finished tracing, it will read that table which
> is located in the tracefs directory and the file is called
> saved_cmdlines.

This seems to be the problem:

On the "good" system, that file is up-to-date with cached PIDs and
comms. On the bad host, there are no cached entries from any of the
traces that I've run.

Because these are running old kernels, there is no saved_cmdlines_size
knob to turn. Do you have any idea why the saved_cmdlines would not be
getting updated appropriately on the "bad" host? I know this is not
ideal, but I can try to reboot that host and see if something is
simply wedged. The system has been online for almost a year, so it's
possible that something has gone wrong.

Any help you can offer would be great! Thank you, again, for your response!

Will

>
> By default, it saves 128 comms. If you want more or less, you can
> change the size by echoing in the new size number into the file
> saved_cmdlines_size.
>
> I'm not sure why trace-cmd didn't save that file, unless it was an
> older version that did the recording.
>
>
> -- Steve
>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Help with trace-cmd/ftrace recording process ID information
  2017-07-17 20:06   ` Will Hawkins
@ 2017-07-17 20:08     ` Will Hawkins
  2017-07-17 20:17     ` Steven Rostedt
  1 sibling, 0 replies; 7+ messages in thread
From: Will Hawkins @ 2017-07-17 20:08 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: linux-kernel

On Mon, Jul 17, 2017 at 4:06 PM, Will Hawkins <hawkinsw@borlaugic.com> wrote:
> Thank you for your response!
>
> On Mon, Jul 17, 2017 at 3:53 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
>>
>> On Mon, 17 Jul 2017 15:18:18 -0400
>> Will Hawkins <hawkinsw@borlaugic.com> wrote:
>>
>> > Hello everyone, especially Mr. Rostedt,
>> >
>> > I have had great success with ftrace debugging performance issues on
>> > Linux systems. The combination of ftrace and trace-cmd are absolutely
>> > amazing tools for digging in to exactly what is going on in a system and
>> > where performance problems exist.
>> >
>> > I recently switched to a different host and attempted to run trace-cmd
>> > record to get a record of page faults:
>> >
>> > /path/to//trace-cmd/trace-cmd record -e page_fault_user /bin/ls
>> >
>> > When I "report" on that trace, I get entries like the following:
>> >
>> > <...>-41850 [010] 27484983.185200: page_fault_user:
>> > address=__per_cpu_end ip=__per_cpu_end error_code=0x14
>> >
>> > It's exactly what I want. However, it does not list the process that
>> > generated that fault. Instead, it uses <...>. I dug into the trace-cmd
>> > code and see where this is generated and why it is generated.
>> >
>> > What I don't understand is why on a different system, when I run the
>> > same record command, I get the following output:
>> >
>> >   ls-19887 [005] 2438162.263793: page_fault_user:
>> > address=__per_cpu_end ip=__per_cpu_end error_code=0x14
>> >
>> > Again, it's exactly what I want and it lists the process name that
>> > generated the fault.
>> >
>> >  From the code, I see that the <...> is printed instead of the name of
>> > the process when the pid is not in the pevent's command lines. What I
>> > can't seem to figure out is why the process would be in that list on one
>> > host and not on the other.
>>
>> Are you using the same kernel version and trace-cmd version on both
>> hosts?
>
> Roughly, yes.
>
> Both are running (old) kernels 3.13 but are running from HEAD of
> master from the main git repository of trace-cmd.
>
>>
>> >
>> > When I looked at the trace.dat file directly, I did notice that on the
>> > "good" host, there are a list of pids and names. On the "bad" host,
>> > there is no such list in the trace.dat file. I am sure that is the
>> > reason for the <...>s being printed, but I can't figure out why that
>> > list is not getting in the trace.dat file.
>> >
>> > I gave a quick look to try to find where that pid/comm list is generated
>> > and written to the trace.dat file, but couldn't find anything.
>> >
>> > I figured that I would send an email before I dug any further in case
>> > someone has seen this already. I am happy to pass along other pertinent
>> > information if it is helpful to debug the problem. I just don't want to
>> > spam the list with information that is irrelevant.
>> >
>> > Again, the combination of ftrace/trace-cmd is borderline magic. I
>> > appreciate all the work that has gone into it!
>> >
>> > Thanks in advance for helping me sort through this issue!
>>
>> The comm (the program name) is not saved at each event. Instead,
>> there's a "cache" of them. On a schedule switch, when tracing is
>> active, it will store a comm in a table. The trace file uses this list
>> too. When trace-cmd is finished tracing, it will read that table which
>> is located in the tracefs directory and the file is called
>> saved_cmdlines.
>
> This seems to be the problem:
>
> On the "good" system, that file is up-to-date with cached PIDs and
> comms. On the bad host, there are no cached entries from any of the
> traces that I've run.
>
> Because these are running old kernels, there is no saved_cmdlines_size
> knob to turn. Do you have any idea why the saved_cmdlines would not be
> getting updated appropriately on the "bad" host? I know this is not
> ideal, but I can try to reboot that host and see if something is
> simply wedged. The system has been online for almost a year, so it's
> possible that something has gone wrong.
>
> Any help you can offer would be great! Thank you, again, for your response!

I failed to mention that the "bad" host does include the (limited and
stale) information from saved_cmdlines in the trace.dat file. I just
verified that. So, now the only open question is why that file is not
being updated appropriately.

Thanks again!
Will

>
> Will
>
>>
>> By default, it saves 128 comms. If you want more or less, you can
>> change the size by echoing in the new size number into the file
>> saved_cmdlines_size.
>>
>> I'm not sure why trace-cmd didn't save that file, unless it was an
>> older version that did the recording.
>>
>>
>> -- Steve
>>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Help with trace-cmd/ftrace recording process ID information
  2017-07-17 20:06   ` Will Hawkins
  2017-07-17 20:08     ` Will Hawkins
@ 2017-07-17 20:17     ` Steven Rostedt
  2017-07-17 22:31       ` Will Hawkins
  1 sibling, 1 reply; 7+ messages in thread
From: Steven Rostedt @ 2017-07-17 20:17 UTC (permalink / raw)
  To: Will Hawkins; +Cc: linux-kernel

On Mon, 17 Jul 2017 16:06:37 -0400
Will Hawkins <hawkinsw@borlaugic.com> wrote:


> This seems to be the problem:
> 
> On the "good" system, that file is up-to-date with cached PIDs and
> comms. On the bad host, there are no cached entries from any of the
> traces that I've run.
> 
> Because these are running old kernels, there is no saved_cmdlines_size
> knob to turn. Do you have any idea why the saved_cmdlines would not be
> getting updated appropriately on the "bad" host? I know this is not
> ideal, but I can try to reboot that host and see if something is

Yeah, a reboot may work.

> simply wedged. The system has been online for almost a year, so it's
> possible that something has gone wrong.
> 
> Any help you can offer would be great! Thank you, again, for your response!

The recording of command lines only happens when tracing is done, and
there were a few bugs with the older kernels that caused it to either
stop and never start again, or to simply just miss a bunch of recording.

It may be that it stopped and never started again, so you will only
have a stale file.

-- Steve

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Help with trace-cmd/ftrace recording process ID information
  2017-07-17 20:17     ` Steven Rostedt
@ 2017-07-17 22:31       ` Will Hawkins
  2017-07-18  0:45         ` Steven Rostedt
  0 siblings, 1 reply; 7+ messages in thread
From: Will Hawkins @ 2017-07-17 22:31 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: linux-kernel

On Mon, Jul 17, 2017 at 4:17 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Mon, 17 Jul 2017 16:06:37 -0400
> Will Hawkins <hawkinsw@borlaugic.com> wrote:
>
>
>> This seems to be the problem:
>>
>> On the "good" system, that file is up-to-date with cached PIDs and
>> comms. On the bad host, there are no cached entries from any of the
>> traces that I've run.
>>
>> Because these are running old kernels, there is no saved_cmdlines_size
>> knob to turn. Do you have any idea why the saved_cmdlines would not be
>> getting updated appropriately on the "bad" host? I know this is not
>> ideal, but I can try to reboot that host and see if something is
>
> Yeah, a reboot may work.
>
>> simply wedged. The system has been online for almost a year, so it's
>> possible that something has gone wrong.
>>
>> Any help you can offer would be great! Thank you, again, for your response!
>
> The recording of command lines only happens when tracing is done, and
> there were a few bugs with the older kernels that caused it to either
> stop and never start again, or to simply just miss a bunch of recording.
>
> It may be that it stopped and never started again, so you will only
> have a stale file.

This appears to have been the problem! I did a reboot and everything
is back to normal.

Is there a way to poke at the tracing infrastructure in the kernel to
get it to restart process recording? I would feel more comfortable
with a solution like that instead of rebooting, obviously. The 'ol
Windows "solution" makes me queazy :-)

Thanks again for your quick responses. I hope that I can repay you at
some point by contributing code to the great tools you've built!

Will

>
> -- Steve

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Help with trace-cmd/ftrace recording process ID information
  2017-07-17 22:31       ` Will Hawkins
@ 2017-07-18  0:45         ` Steven Rostedt
  0 siblings, 0 replies; 7+ messages in thread
From: Steven Rostedt @ 2017-07-18  0:45 UTC (permalink / raw)
  To: Will Hawkins; +Cc: linux-kernel

On Mon, 17 Jul 2017 18:31:14 -0400
Will Hawkins <hawkinsw@borlaugic.com> wrote:


> > It may be that it stopped and never started again, so you will only
> > have a stale file.  
> 
> This appears to have been the problem! I did a reboot and everything
> is back to normal.
> 
> Is there a way to poke at the tracing infrastructure in the kernel to
> get it to restart process recording? I would feel more comfortable
> with a solution like that instead of rebooting, obviously. The 'ol
> Windows "solution" makes me queazy :-)

Well, that's what happens when you run older kernels ;-)

The best I can do is to find the commit that fixed the issue and have
it back ported. Or you can try that. Start with the code that reads and
writes to saved_cmdlines and that will lead you to the map table. You
also need to look where the sched_switch tracepoint gets enabled and
disabled. It uses a ref count to know to enable and disable it, and the
ref count probably got out of sync. Need to find the place that
disabled it twice or something.

One way is to read the commits that affect all that since your kernel.

-- Steve


> 
> Thanks again for your quick responses. I hope that I can repay you at
> some point by contributing code to the great tools you've built!
> 
> Will
> 
> >
> > -- Steve  

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2017-07-18  0:45 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-07-17 19:18 Help with trace-cmd/ftrace recording process ID information Will Hawkins
2017-07-17 19:53 ` Steven Rostedt
2017-07-17 20:06   ` Will Hawkins
2017-07-17 20:08     ` Will Hawkins
2017-07-17 20:17     ` Steven Rostedt
2017-07-17 22:31       ` Will Hawkins
2017-07-18  0:45         ` Steven Rostedt

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.