Re: [RFC PATCH] audit-testsuite: improve our chances of losing records in lost_reset

From: Richard Guy Briggs <rgb@redhat.com>
To: Paul Moore <paul@paul-moore.com>
Cc: linux-audit@redhat.com
Subject: Re: [RFC PATCH] audit-testsuite: improve our chances of losing records in lost_reset
Date: Fri, 14 Dec 2018 16:00:23 -0500	[thread overview]
Message-ID: <20181214210023.q2xmnt6ncz6dlhnn@madcap2.tricolour.ca> (raw)
In-Reply-To: <CAHC9VhQCxznE8PuwJRdLVEVTUsVaNuWwN3hQ83qxKbq990OkmQ@mail.gmail.com>

On 2018-12-14 15:35, Paul Moore wrote:
> On Fri, Dec 14, 2018 at 11:12 AM Richard Guy Briggs <rgb@redhat.com> wrote:
> > On 2018-12-14 10:53, Paul Moore wrote:
> > > On Thu, Dec 13, 2018 at 8:59 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > On 2018-12-13 18:23, Paul Moore wrote:
> > > > > On Thu, Dec 13, 2018 at 6:17 PM Paul Moore <paul@paul-moore.com> wrote:
> 
> ...
> 
> > > As an aside, have you spent any time trying to debug that wrong PID
> > > problem?  While not strictly audit related, that seems like a pretty
> > > serious Bash bug.  Or maybe it's a problem with the test?  I vaguely
> > > remember a discussion between you and Ondrej and some confusion around
> > > which Bash variable to use to fetch PIDs, but I may be mistaken.
> >
> > I haven't spent much time trying to debug that bash PID increment issue,
> > but it perplexed me since it was the identical technique used in
> > multiple tests in the audit-testsuite that has never caused an issue
> > previously on any of the same set of test machines.  This was for the
> > missing mount umount2 hang bug test
> > https://github.com/linux-audit/audit-testsuite/pull/76
> 
> Ah, I think I see the problem.  Unless I'm mistaken, you are talking
> about the shell/Bash command where the tests print the PID using "echo
> $$" or similar, yes?  As you likely already know, in Bash (and likely
> other shells as well), $$ is the PID of the running Bash process; in
> the two places where I saw it used in the existing audit-testsuite $$
> is being used to reference the Bash instance itself or something it
> exec's (which ends up inheriting the PID).  It looks like you are
> using $$ as a way to capture the PID of a child process being spawned
> by the shell, yes?  This may explain why you sometimes get lucky and
> $$+1 works for the PID.

I was always getting lucky with $$+1, but understandably uncomfortable
with it since others weren't so fortunate.

In the code that's there, that process is backgrounded, but I'm fairly
certain I tested without that and carefully checked the options (-f and
-s) to ensure it wasn't daemonizing or multithreading.  I was pretty
careful to set up exactly the same conditions for running that process
as other tests use, but that looks like the first thing to check next
time I try it.  It wouldn't be the first time I've missed something
obvious.

> > > This brings up to the next step: how do we want to address this?
> > >
> > > Prior to the queue rework that started in v4.10 things were a bit
> > > simpler and it looks like we always registered a lost record
> > > independent of the "audit=?" setting on the kernel command line and
> > > the current queue backlog.  While this would have made this test
> > > easier, it could result in some over counting problems in the cases
> > > where an auditd instance came along and read the "lost" records from
> > > the queue.  I don't think reverting to this behavior is ideal.
> > >
> > > I'm also not certain that recording lost records in the *not*
> > > "audit=1" case is a good solution either.  In the case where the
> > > system is not running an audit daemon they are almost guaranteed to
> > > hit the backlog at some point and then their system will start spewing
> > > scary looking audit log lost messages; we would surely win lots of
> > > friends this way.
> > >
> > > We could move the "audit=1" check (really it's an audit_default check)
> > > into audit_log_lost() and use it to squelch the printk() and then call
> > > audit_log_lost in kauditd_hold_skb() if we can't queue the record.
> > > This should preserve an accurate lost record count (well, until it
> > > rolls over, but that's not a new concern), prevent unnecessary scary
> > > lost record messages, and ensure a consistent audit_log_lost()
> > > operation (all the other callers I didn't mention in this mail).
> > >
> > > Or the simplest option is to just ignore this and require that the
> > > audit-testsuite be run on a system booted with "audit=1" :)
> > >
> > > I'm currently leaning towards moving the "audit=1" check into
> > > audit_log_lost(), what do you guys think?
> >
> > I'll need some time to digest all this.  My first reaction is that
> > requiring "audit=1" for the audit-testsuite is not the right answer and
> > as we've discussed previously, it makes sense to run the entire
> > testsuite both with and without "audit=1".
> 
> Yes, requiring "audit=1" is not an answer.  I put it there mostly as
> an attempt at humor, hence the smiley at the end, but clearly my
> abilities as a comic are lacking ;)

Sorry, email is limited...  I read :) and ;) differently.

> Anyway, no rush on looking it over; this isn't a -stable fix in my
> eyes so it's not a candidate for merging until after the upcoming
> merge window.  I think it will make a lot more sense once you go back
> and look at the code, reading about the problem in email probably
> makes is sound more complicated than it is.  I'm fairly certain the
> right solution is to move the "audit=1" check into audit_log_lost(),
> but I sent this out in case you or anyone else could think of a better
> solution.  In the meantime I'll throw together a quick patch (it
> should be pretty small) and add it to my testing queue.

I'll look at the code and your patch carefully when I can.

> paul moore

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635