All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: Emily Shaffer <emilyshaffer@google.com>
Cc: Jeff Hostetler <git@jeffhostetler.com>,
	git@vger.kernel.org, Junio C Hamano <gitster@pobox.com>,
	Bagas Sanjaya <bagasdotme@gmail.com>
Subject: Re: [PATCH v2] tr2: log parent process name
Date: Tue, 25 May 2021 01:33:27 +0200	[thread overview]
Message-ID: <877djnogbk.fsf@evledraar.gmail.com> (raw)
In-Reply-To: <YKgSc5OgVOt6HQqW@google.com>


On Fri, May 21 2021, Emily Shaffer wrote:

> On Fri, May 21, 2021 at 03:15:16PM -0400, Jeff Hostetler wrote:
>> On 5/20/21 5:05 PM, Emily Shaffer wrote:
>> > - I took a look at Jeff H's advice on using a "data_json" event to log
>> >    this and decided it would be a little more flexible to add a new event
>> >    instead. If we want, it'd be feasible to then shoehorn the GfW parent
>> >    tree stuff into this new event too. Doing it this way is definitely
>> >    easier to parse for Google's trace analysis system (which for now
>> >    completely skips "data_json" as it's polymorphic), and also - I think
>> >    - means that we can add more fields later on if we need to (thread
>> >    info, different fields than just /proc/n/comm like exec path, argv,
>> >    whatever).
>> 
>> I could argue both sides of this, so I guess it is fine either way.
>> 
>> In GFW I log a array of argv[0] strings in a generic "data_json" event.
>> I could also log additional "data_json" events with more structured
>> data if needed.
>> 
>> On the other hand, you're proposing a "cmd_ancestry" event with a
>> single array of strings.  You would have to expand the call signature
>> of the trace2_cmd_ancestry() API to add additional data and inside
>> tr2_tgt_event.c add additional fields to the JSON being composed.
>> 
>> So both are about equal.
>> 
>> (I'll avoid the temptation to make a snarky comment about fixing
>> your post processing. :-) :-) :-) )
>
> ;P
>
> (I don't have much to add - this is an accurate summary of what I
> thought about, too. Thanks for writing it out.)
>
>> 
>> It really doesn't matter one way or the other.
>> 
>> > - Jonathan N also pointed out to me that /proc/n/comm exists, and logs
>> >    the "command name" - excluding argv, excluding path, etc. It seems
>> 
>> So you're trying to log argv[0] of the process and not the full
>> command line.  That's what I'm doing.
>
> It's close to argv[0], yeah. POSIX docs indicate it might be truncated
> in a way that argv[0] hasn't been, but it also doesn't include the
> leading path (as far as I've seen). For example, a long-running helper
> script I use with mutt, right now (gaffing on line length in email to
> help with argv clarity, sorry):
>
>   $ ps aux | grep mutt
>   emilysh+ 4119883 0.0 0.0 6892 3600 pts/6 S+ 12:44 0:00 /bin/bash
> /usr/local/google/home/emilyshaffer/dotfiles/open-vim-in-new-split.sh
> /var/tmp/mutt-podkayne-413244-1263002-7433772284891386689
>   # comm is truncated to 15ch, except apparently in the cases of some
>   # kernel worker processes I saw with much longer names?
>   $ cat /proc/4119883/comm
>   open-vim-in-new
>   # exe is a link to the executable, which means bash as this is a
>   # script
>   $ ls -lha /proc/4119883/exe
>   lrwxrwxrwx 1 emilyshaffer primarygroup 0 May 21 12:44
>   /proc/4119883/exe -> /usr/bin/bash
>   # cmdline has the whole argv, separated on NUL so it runs together in
>   # editor
>   $ cat /proc/4119883/cmdline
>   /bin/bash/usr/local/google/home/emilyshaffer/dotfiles/open-vim-in-new-split.sh/var/tmp/mutt-podkayne-413244-1263002-7433772284891386689
>
> Jonathan N pointed out that the process name (the thing in 'comm') can
> also be manually manipulated by the process itself, and 'man procfs'
> also talks about 'PR_SET_NAME' and 'PR_GET_NAME' operations in
> 'prctl()', so that tracks. (It doesn't look like we can use prctl() to
> find out the names of processes besides the current process, though, so
> the procfs stuff is still needed. Dang.)
>
>> 
>> >    like this is a little more safe about excluding personal information
>> >    from the traces which take the form of "myscript.sh
>> >    --password=hunter2", but would still be worrisome for something like
>> >    "mysupersecretproject.sh". I'm not sure whether that means we still
>> >    want to guard it with a config flag, though.
>> 
>> You might check whether you get the name of the script or just get
>> a lot of entries with just "/usr/bin/bash".
>
> See above :)
>
>> There's lots of PII in the data stream to worry about.
>> The name of the command is just one aspect, but I digress.
>
> Yes, that's what we've noticed too, so a process name isn't worrying us
> that much more.
>
>> 
>> > - I also added a lot to the commit message; hopefully it's not too
>> >    rambly, but I hoped to explain why just setting GIT_TRACE2_PARENT_SID
>> >    wasn't going to cut it.
>> > - As for testing, I followed the lead of GfW's parentage info - "this
>> >    isn't portable so writing tests for it will suck, just scrub it from
>> >    the tests". Maybe it makes sense to do some more
>> >    platform-specific-ness in the test suite instead? I wasn't sure.
>> 
>> yeah, that's probably best.  Unless you can tokenize it properly
>> so that you can predict the results in a HEREDOC in the test source.
>> 
>> For example, you might try to test tracing a command (where a top-level
>> "git foo" (SPACE form) spawns a "git-foo" (DASHED form) and check the
>> output for the child.
>
> Yeah, I had trouble with even deciding when to attempt such a check or
> not.
>> > +	if (reason == TRACE2_PROCESS_INFO_STARTUP)
>> > +	{
>> > +		/*
>> > +		 * NEEDSWORK: we could do the entire ptree in an array instead,
>> > +		 * see compat/win32/trace2_win32_process_info.c.
>> > +		 */
>> > +		char *names[2];
>> > +		names[0] = get_process_name(getppid());
>> > +		names[1] = NULL;
>> 
>> You're only logging 1 parent.  That's fine to get started.
>> 
>> I'm logging IIRC 10 parents on GFW.  That might seem overkill,
>> but there are lots of intermediate parents that hide what is
>> happening.  For example, a "git push" might spawn "git remote-https"
>> which spawns "git-remote-https" which spawn "git send-pack" which
>> spawns "git pack-objects".
>> 
>> And that doesn't include who called push.
>> 
>> And it's not uncommon to see 2 or 3 "bash" entries in the array
>> because of the bash scripts being run.
>
> Agree. But it's expensive - I didn't find a handy library call to find
> "parent ID of given process ID", so I think we'd have to manipulate
> procfs; and so far I only see parent ID in summary infos like
> /proc/n/status or /proc/n/stat, which contain lots of other info too and
> would need parsing.

It sounds a bit like you're fumbling your way towards (re?)discovering:

    pstree -s <pid>

You can look at its implementation (or strace it) to see what it does,
and yes, on Linux there's no handy C library for this, iterating over
procfs is the library.

Aside from the privacy, PII, usefulness of this data etc. discussions in
this & related threads I don't think that per-se should be an issue on a
modern Linux system. After all we'd just need to do it once on
startup. For any sub-process we spawn we'd carry it forward after the
initial /usr/bin/git invocation.

  parent reply	other threads:[~2021-05-24 23:36 UTC|newest]

Thread overview: 87+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-05-07  0:29 [PATCH] tr2: log parent process name Emily Shaffer
2021-05-07  3:25 ` Bagas Sanjaya
2021-05-07 17:09 ` Emily Shaffer
2021-05-10 12:29 ` Ævar Arnfjörð Bjarmason
2021-05-11 21:31   ` Junio C Hamano
2021-05-14 22:06   ` Emily Shaffer
2021-05-16  3:48     ` Junio C Hamano
2021-05-17 20:17       ` Emily Shaffer
2021-05-11 17:28 ` Jeff Hostetler
2021-05-14 22:07   ` Emily Shaffer
2021-05-20 21:05 ` [PATCH v2] " Emily Shaffer
2021-05-20 21:36   ` Randall S. Becker
2021-05-20 23:23     ` Emily Shaffer
2021-05-21 13:20       ` Randall S. Becker
2021-05-21 16:24         ` Randall S. Becker
2021-05-21  2:09   ` Junio C Hamano
2021-05-21 19:02     ` Emily Shaffer
2021-05-21 23:22       ` Junio C Hamano
2021-05-24 18:37         ` Emily Shaffer
2021-05-21 19:15   ` Jeff Hostetler
2021-05-21 20:05     ` Emily Shaffer
2021-05-21 20:23       ` Randall S. Becker
2021-05-22 11:18       ` Jeff Hostetler
2021-05-24 23:33       ` Ævar Arnfjörð Bjarmason [this message]
2021-05-24 20:10   ` [PATCH v3] " Emily Shaffer
2021-05-24 20:49     ` Emily Shaffer
2021-05-25  3:54     ` Junio C Hamano
2021-05-25 13:33       ` Randall S. Becker
2021-06-08 18:58     ` [PATCH v4] " Emily Shaffer
2021-06-08 20:56       ` Emily Shaffer
2021-06-08 22:10       ` [PATCH v5] " Emily Shaffer
2021-06-08 22:16         ` Randall S. Becker
2021-06-08 22:24           ` Emily Shaffer
2021-06-08 22:39             ` Randall S. Becker
2021-06-09 20:17               ` Emily Shaffer
2021-06-16  8:42         ` Junio C Hamano
2021-06-28 16:45         ` Jeff Hostetler
2021-06-29 23:51           ` Emily Shaffer
2021-06-30  6:10             ` Ævar Arnfjörð Bjarmason
2021-07-22  0:21               ` Emily Shaffer
2021-07-22  1:27         ` [PATCH v6 0/2] " Emily Shaffer
2021-07-22  1:27           ` [PATCH v6 1/2] tr2: make process info collection platform-generic Emily Shaffer
2021-08-02  9:34             ` Ævar Arnfjörð Bjarmason
2021-07-22  1:27           ` [PATCH v6 2/2] tr2: log parent process name Emily Shaffer
2021-07-22 21:02             ` Junio C Hamano
2021-08-02  9:38             ` Ævar Arnfjörð Bjarmason
2021-08-02 12:45               ` Ævar Arnfjörð Bjarmason
2021-08-02 10:22             ` Ævar Arnfjörð Bjarmason
2021-08-02 12:47               ` Ævar Arnfjörð Bjarmason
2021-08-02 15:23               ` Jeff Hostetler
2021-08-02 16:10               ` Randall S. Becker
2021-08-02 18:41                 ` Ævar Arnfjörð Bjarmason
2021-08-25 23:19               ` [PATCH 0/6] tr2: plug memory leaks + logic errors + Win32 & Linux feature parity Ævar Arnfjörð Bjarmason
2021-08-25 23:19                 ` [PATCH 1/6] tr2: remove NEEDSWORK comment for "non-procfs" implementations Ævar Arnfjörð Bjarmason
2021-08-25 23:19                 ` [PATCH 2/6] tr2: clarify TRACE2_PROCESS_INFO_EXIT comment under Linux Ævar Arnfjörð Bjarmason
2021-08-25 23:19                 ` [PATCH 3/6] tr2: stop leaking "thread_name" memory Ævar Arnfjörð Bjarmason
2021-08-26  3:09                   ` Taylor Blau
2021-08-25 23:19                 ` [PATCH 4/6] tr2: fix memory leak & logic error in 2f732bf15e6 Ævar Arnfjörð Bjarmason
2021-08-26  3:21                   ` Taylor Blau
2021-08-25 23:19                 ` [PATCH 5/6] tr2: do compiler enum check in trace2_collect_process_info() Ævar Arnfjörð Bjarmason
2021-08-26  3:23                   ` Taylor Blau
2021-08-25 23:19                 ` [PATCH 6/6] tr2: log N parent process names on Linux Ævar Arnfjörð Bjarmason
2021-08-25 23:49                   ` Eric Sunshine
2021-08-26  4:07                   ` Taylor Blau
2021-08-26 12:24                     ` "I don't know what the author meant by that..." (was "Re: [PATCH 6/6] tr2: log N parent process names on Linux") Ævar Arnfjörð Bjarmason
2021-08-26 12:22                 ` [PATCH v2 0/6] tr2: plug memory leaks + logic errors + Win32 & Linux feature parity Ævar Arnfjörð Bjarmason
2021-08-26 12:22                   ` [PATCH v2 1/6] tr2: remove NEEDSWORK comment for "non-procfs" implementations Ævar Arnfjörð Bjarmason
2021-08-26 12:22                   ` [PATCH v2 2/6] tr2: clarify TRACE2_PROCESS_INFO_EXIT comment under Linux Ævar Arnfjörð Bjarmason
2021-08-26 12:22                   ` [PATCH v2 3/6] tr2: stop leaking "thread_name" memory Ævar Arnfjörð Bjarmason
2021-08-26 12:22                   ` [PATCH v2 4/6] tr2: fix memory leak & logic error in 2f732bf15e6 Ævar Arnfjörð Bjarmason
2021-08-26 15:58                     ` Eric Sunshine
2021-08-26 16:42                     ` Junio C Hamano
2021-08-26 12:22                   ` [PATCH v2 5/6] tr2: do compiler enum check in trace2_collect_process_info() Ævar Arnfjörð Bjarmason
2021-08-26 12:22                   ` [PATCH v2 6/6] tr2: log N parent process names on Linux Ævar Arnfjörð Bjarmason
2021-08-26 22:38                   ` [PATCH v2 0/6] tr2: plug memory leaks + logic errors + Win32 & Linux feature parity Taylor Blau
2021-08-27  8:02                   ` [PATCH v3 " Ævar Arnfjörð Bjarmason
2021-08-27  8:02                     ` [PATCH v3 1/6] tr2: remove NEEDSWORK comment for "non-procfs" implementations Ævar Arnfjörð Bjarmason
2021-08-27  8:02                     ` [PATCH v3 2/6] tr2: clarify TRACE2_PROCESS_INFO_EXIT comment under Linux Ævar Arnfjörð Bjarmason
2021-08-27  8:02                     ` [PATCH v3 3/6] tr2: stop leaking "thread_name" memory Ævar Arnfjörð Bjarmason
2021-08-27  8:02                     ` [PATCH v3 4/6] tr2: leave the parent list empty upon failure & don't leak memory Ævar Arnfjörð Bjarmason
2021-08-27  8:02                     ` [PATCH v3 5/6] tr2: do compiler enum check in trace2_collect_process_info() Ævar Arnfjörð Bjarmason
2021-08-27  8:02                     ` [PATCH v3 6/6] tr2: log N parent process names on Linux Ævar Arnfjörð Bjarmason
2021-08-31  0:17                     ` [PATCH v3 0/6] tr2: plug memory leaks + logic errors + Win32 & Linux feature parity Taylor Blau
2021-08-02 10:30             ` [PATCH v6 2/2] tr2: log parent process name Ævar Arnfjörð Bjarmason
2021-08-02 16:24               ` Junio C Hamano
2021-08-02 18:42                 ` Ævar Arnfjörð Bjarmason
2021-07-22 16:59           ` [PATCH v6 0/2] " Jeff Hostetler

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=877djnogbk.fsf@evledraar.gmail.com \
    --to=avarab@gmail.com \
    --cc=bagasdotme@gmail.com \
    --cc=emilyshaffer@google.com \
    --cc=git@jeffhostetler.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.