On Mon, 2021-01-18 at 15:36 -0500, Paul Moore wrote:
On Mon, Jan 18, 2021 at 9:31 AM Steve Grubb <sgrubb@redhat.com> wrote:
On Monday, January 18, 2021 8:54:30 AM EST Paul Moore wrote:
I like the N of M concept, but there would be a LOT of change - especially for all the non-kernel event sources. The EOE would be the most seamless, but at a cost. My preference is to allow the 2 second 'timer' to be configurable.
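
To make the 'timer' concrete, something along these lines (a rough Python sketch, not the actual auparse code; the EventAssembler name and end_of_event_timeout option are made up for illustration) is what a configurable end-of-event timeout could look like in a user space consumer that groups records into events by serial and closes an event on EOE or on timeout:

import time

# Sketch only: group records by event serial, close the event on an EOE
# record or when the configurable timeout expires. Names are illustrative.
class EventAssembler:
    def __init__(self, end_of_event_timeout=2.0):
        self.timeout = end_of_event_timeout
        self.pending = {}                      # serial -> (first_seen, [records])

    def add_record(self, serial, record):
        first_seen, records = self.pending.setdefault(serial, (time.monotonic(), []))
        records.append(record)
        if record.startswith("type=EOE"):      # explicit end-of-event marker
            return self.pending.pop(serial)[1]
        return None                            # event still open

    def expire(self):
        # Flush events whose first record is older than the timeout.
        now = time.monotonic()
        stale = [s for s, (t, _) in self.pending.items() if now - t > self.timeout]
        return [self.pending.pop(s)[1] for s in stale]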

Agree with Burn, numbering the records coming up from the kernel is
going to be a real nightmare, and not something to consider lightly.
Especially when it sounds like we don't yet have a root cause for the
issue.

A very long time ago, we had numbered records. But it was decided that
there's no real point in it and we'd rather just save disk space.

With the current kernel code, adding numbered records is not something to
take lightly.

That's why I'm saying we had it and it was removed. I could imagine that if
you had auditing of the kill syscall enabled and a whole process group was
being killed, you could have hundreds of records that need numbering. There's
no good way to know in advance how many records make up the event.
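
If it helps to see how much that varies in practice, a quick scan of a raw audit.log (a rough sketch, not part of the tools; the log path is the usual default and typically needs root to read) can tally how many records each event serial ends up with:

import re
from collections import Counter

# Sketch: count records per event serial to show how widely events vary in
# size; nothing in the stream says up front how many records will follow.
SERIAL_RE = re.compile(r"msg=audit\(\d+\.\d+:(\d+)\)")

counts = Counter()
with open("/var/log/audit/audit.log") as f:        # usually requires root
    for line in f:
        m = SERIAL_RE.search(line)
        if m:
            counts[m.group(1)] += 1

print("largest events (serial, record count):")
for serial, n in counts.most_common(10):
    print(serial, n)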

You only mentioned disk space concerns so it wasn't clear to me that
you were in agreement about this being a bad idea.  Regardless, I'm
glad to see we are on the same page about this.

I know that the kernel does not serialize the events headed for user
space. But I'm curious how an event that's already in flight gets stuck
for 4 seconds before its next record goes out while other events jump
ahead of it?

Have you determined that the problem is the kernel?

I assume so because the kernel adds the timestamp and chooses what hits the
socket next. Auditd does no ordering of events. It just looks up the text
event ID, does some minor translation if the enriched format is being used,
and writes the record to disk. It can handle well over 100k records per second.
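
For reference, the per-record handling is roughly this trivial (a Python sketch of the idea, not auditd's actual code; the regex and field names are just for illustration): pull the type and the audit(time:serial) event ID out of the text and pass the rest through.

import re

# Sketch of minimal per-record parsing: record type, event ID, then the body.
RECORD_RE = re.compile(r"type=(\S+) msg=audit\((\d+\.\d+):(\d+)\): ?(.*)")

def parse_record(line):
    m = RECORD_RE.match(line)
    if not m:
        return None
    rtype, seconds, serial, body = m.groups()
    return {"type": rtype, "time": float(seconds), "serial": int(serial), "body": body}

print(parse_record('type=SYSCALL msg=audit(1611000000.123:456): arch=c000003e syscall=62 success=yes'))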

Feel free to insert the old joke about assumptions.

I guess I was hoping for a bit more understanding of the problem and
perhaps some actual data indicating the kernel was the source of the
problem.  Conjecture based on how things are supposed to work can be
misleading.

Initially it was looking like a userspace issue; is that no longer
the general thought?

I don't see how user space could cause this. Even if auditd were slow, it
shouldn't take 4 seconds to write to disk and then come back to read another
record. And even if it did, why would the newest record go out before completing
one that's in progress? Something in the kernel chooses what's next. I
suspect that might need looking at.

See above.

Also, is there a reliable reproducer yet?

I don't know of one. But I suppose we could modify ausearch to look for
examples of this.
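
For what it's worth, here's roughly what such a check could look like as a standalone script rather than an ausearch change (a sketch only; the 2 second threshold and the log path are assumptions): walk the raw log in file order and flag any record whose event ID is noticeably older than the newest event already seen, i.e. a record that went out well after newer events.

import re

# Sketch: flag records written to the log long after newer events started,
# which is the interleaving/delay symptom discussed in this thread.
EVENT_RE = re.compile(r"msg=audit\((\d+\.\d+):(\d+)\)")
THRESHOLD = 2.0                                    # seconds, assumed

newest = 0.0
with open("/var/log/audit/audit.log") as f:        # usually requires root
    for lineno, line in enumerate(f, 1):
        m = EVENT_RE.search(line)
        if not m:
            continue
        ts = float(m.group(1))
        if newest - ts > THRESHOLD:
            print(f"line {lineno}: event {m.group(1)}:{m.group(2)} appears "
                  f"{newest - ts:.1f}s after a newer event")
        newest = max(newest, ts)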

The kernel queuing is a rather complicated affair due to the need to
gracefully handle auditd failing, fallbacks to the console, and
multicast groups, all while handling extreme pressure (e.g. auditing
*every* syscall) and not destroying the responsiveness of the system
(we actually can still make forward progress if you are auditing
*every* syscall).  With that complexity comes a number of corner
cases, and I imagine there are a few where the system is under
extreme pressure and/or the auditd daemon is dead and/or starved of
CPU time.  As I know Richard is reading this, to be clear I'm talking
about the hold/retry queues and the UNICAST_RETRIES case.  The
delays you are talking about in this thread do seem severe, but perhaps
if the system is under enough pressure to cause the ordering issues in
the first place, such a delay is to be expected.
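
To illustrate the general effect (a toy model in Python, nothing like the actual kernel code; the names and the failure probability are invented), a retry queue for failed sends is enough on its own to let newer records reach the consumer ahead of an older record that had to wait for a retry pass:

import random
from collections import deque

# Toy model: records whose first send "fails" get parked on a retry queue and
# are delivered later, after newer records have already gone out.
def unicast_send(record, delivered):
    if random.random() < 0.3:          # stand-in for auditd being unresponsive
        return False
    delivered.append(record)
    return True

main_queue = deque(f"record-{i}" for i in range(10))
retry_queue = deque()
delivered = []

while main_queue or retry_queue:
    rec = main_queue.popleft() if main_queue else retry_queue.popleft()
    if not unicast_send(rec, delivered):
        retry_queue.append(rec)        # will only go out on a later pass

print(delivered)   # parked records come out after newer ones, i.e. out of order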

Anyway, my test setup isn't likely able to reproduce such a scenario
without some significant tweaks, so perhaps those of you who have seen
this problem (Burn, and anyone else?) could shed some light into the
state of the system when the ordering problem occurred.

I tend to have a rigorous auditing posture (see the rules loaded in https://github.com/linux-audit/audit-userspace/issues/148), which is not typical for most sites. Perhaps, Paul, you have hit the nail on the head by stating that this 'severe delay' is not that unreasonable given my rules posture and we just need to 'deal with it' in user space.
We still get the event data; I just need to adjust the user space tools to handle this occurrence.
As for what the system is doing, in my home case it's a CentOS 7 VM running a Tomcat service which only gets busy every 20 minutes, and the other is an HPE Z800 running CentOS 8 with 4-5 VMs that are mostly dormant. I can put any code on these hosts to assist in validating/testing the delay. Advise and I will run it.