* Kernel oops+crash on repeated auditd restarts
@ 2012-01-25 16:45 Valentin Avram
  2012-01-25 16:53 ` Peter Moody
  2012-01-25 19:20 ` Eric Paris
  0 siblings, 2 replies; 24+ messages in thread
From: Valentin Avram @ 2012-01-25 16:45 UTC (permalink / raw)
  To: linux-audit



Hello.

Did anybody ever experience kernel oopses and even kernel crashes (after a
while), by just restarting repeatedly the auditd daemon?

I ask this because i had this problem on Dell R610 servers running Gentoo
Linux kernels gentoo-sources-3.0.6 and gentoo-sources-2.6.37-r4 (see this
bug: https://bugs.gentoo.org/show_bug.cgi?id=389405 ).

The kernels are nothing special, just the vanilla 2.6.37 and 3.0.6 with a
few gentoo patches (see https://lkml.org/lkml/2011/11/28/330 ).

The auditd version is 2.1.3 (latest). The audit.rules file contains
basically the following rules:

-D
-w /etc -p wa -k etc-directory
[snip: same for /sbin, /bin, /usr/sbin, /usr/bin]
-a exit,never -F dir=/lib/rc -k skip-lib-rc
-w /lib -p wa -k lib-directory
-w /usr/lib -p wa -k usr-lib-directory
-a exit,never -F arch=b32 -S read [snip: -S for write,open,fstat,mmap etc.]
-k excluded-syscalls
-b 8192

The bug seems to be somewhere in the fsnotify kernel part, however Gentoo
kernel devs and ppl on lkml did not seem too interested, so.. did anybody
notice a similar behaviour? Or better yet, is anybody willing to run on one
of your servers this simple test: start the minimum server services, use a
similar audit.rules configuration, then start auditd and run in a shell the
following one-liner:

while :; do /etc/init.d/auditd stop ; sleep 5 ; /etc/init.d/auditd start ;
sleep 5 ; done
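
For anyone willing to run the test, a minimal sketch of the same loop that stops once the kernel log shows an oops. The function names, the grep patterns and the RUN_REPRO guard are assumptions made for illustration, not part of auditd; only the init script path comes from the one-liner above.

```shell
# has_oops: reads kernel log text on stdin and succeeds if it contains
# a NULL-dereference BUG line or an "Oops: NNNN" line.
has_oops() {
    grep -Eq 'BUG: unable to handle kernel|Oops: [0-9a-f]{4}'
}

# restart_until_oops: hypothetical wrapper around the stop/start loop.
# Note that an oops already present in dmesg ends the loop immediately.
restart_until_oops() {
    n=0
    while ! dmesg | has_oops; do
        /etc/init.d/auditd stop;  sleep 5
        /etc/init.d/auditd start; sleep 5
        n=$((n + 1))
    done
    echo "oops seen after $n restart cycles"
}

# Runs only when asked explicitly (as root, on a disposable test machine):
#   RUN_REPRO=1 sh repro.sh
if [ "${RUN_REPRO:-0}" = 1 ]; then restart_until_oops; fi
```

The cycle count it prints gives a rough figure to compare between kernel versions.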

This was enough to oops and crash the kernel in less than one hour on the
servers where I did the tests. If any similar behavior happens, I'd be very
interested to know the kernel version and distro.

Thank you for your time.




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Kernel oops+crash on repeated auditd restarts
  2012-01-25 16:45 Kernel oops+crash on repeated auditd restarts Valentin Avram
@ 2012-01-25 16:53 ` Peter Moody
  2012-01-25 19:20 ` Eric Paris
  1 sibling, 0 replies; 24+ messages in thread
From: Peter Moody @ 2012-01-25 16:53 UTC (permalink / raw)
  To: Valentin Avram; +Cc: linux-audit

Just flushing the rules (auditctl -D) would cause my ubuntu machine
running a 2.6.38 kernel to oops fairly regularly, maybe one in five
times. This was especially painful when testing new rules.




-- 
Peter Moody      Google    1.650.253.7306
Security Engineer  pgp:0xC3410038


* Re: Kernel oops+crash on repeated auditd restarts
  2012-01-25 16:45 Kernel oops+crash on repeated auditd restarts Valentin Avram
  2012-01-25 16:53 ` Peter Moody
@ 2012-01-25 19:20 ` Eric Paris
  2012-01-26  7:13   ` Valentin Avram
  1 sibling, 1 reply; 24+ messages in thread
From: Eric Paris @ 2012-01-25 19:20 UTC (permalink / raw)
  To: Valentin Avram; +Cc: linux-audit

On Wed, 2012-01-25 at 18:45 +0200, Valentin Avram wrote:

> Did anybody ever experience kernel oopses and even kernel crashes
> (after a while), by just restarting repeatedly the auditd daemon?

No, but I'll try to remember to take a look.  We did have a BUG() that
was recently fixed when using -w rules (as I recall).   But I've never
seen this particular NULL pointer bug.  We did recently fix a race in
fsnotify mark destruction that could be this, but those symptoms weren't
exactly the same.

I'm both the upstream Audit and fsnotify maintainer so I'm grumbly at
Gentoo for never letting me know this isn't working.  Where else did you
report this?  I'm wondering where all the information failure is
happening.

Can you send me any and all info you have?

I'll see if I can reproduce a problem here (but I'm a Fedora guy)


* Re: Kernel oops+crash on repeated auditd restarts
  2012-01-25 19:20 ` Eric Paris
@ 2012-01-26  7:13   ` Valentin Avram
  2012-02-08 16:11     ` Valentin Avram
  0 siblings, 1 reply; 24+ messages in thread
From: Valentin Avram @ 2012-01-26  7:13 UTC (permalink / raw)
  To: Eric Paris; +Cc: linux-audit



Please read below.

On Wed, Jan 25, 2012 at 9:20 PM, Eric Paris <eparis@redhat.com> wrote:

> On Wed, 2012-01-25 at 18:45 +0200, Valentin Avram wrote:
>
> > Did anybody ever experience kernel oopses and even kernel crashes
> > (after a while), by just restarting repeatedly the auditd daemon?
>
> No, but I'll try to remember to take a look.  We did have a BUG() that
> was recently fixed when using -w rules (as I recall).   But I've never
> seen this particular NULL pointer bug.  We did recently fix a race in
> fsnotify mark destruction that could be this, but those symptoms weren't
> exactly the same.
>
> I'm both the upstream Audit and fsnotify maintainer so I'm grumbly at
> Gentoo for never letting me know this isn't working.  Where else did you
> report this?  I'm wondering where all the information failure is
> happening.
>

I only reported the issue on Gentoo bugs and LKML (the two links i included
in the original email). The Gentoo guys at first did seem interested in the
bug and asked for a test with a kernel compiled with CONFIG_DEBUG_INFO and
CONFIG_DEBUG_LIST. After that test it looked like some list is getting
messed up somewhere (although I'm partly a C programmer, my knowledge of
kernel internals is limited). The LKML guys didn't even bother to answer.


> Can you send me any and all info you have?
>
>
All the information i had is posted on the Gentoo bug report. The two
machines i used to test the issue are now in production mode, so i can't do
any testing on them. However I'll soon have access to a new machine that
can stay in test mode for a while, where i plan to retest with Gentoo's
latest "stable-marked" kernel gentoo-sources-3.1.6.



> I'll see if I can reproduce a problem here (but I'm a Fedora guy)
>

At this moment I'm not sure whether it's an auditd issue, a kernel
issue, or both. However, if you're running a kernel lower than 3.0.7 and
auditd 2.1.3, I'd be very interested to know whether running the one-liner
I posted (auditd stop and start in a loop with a 5-second delay) will
eventually (in 1 hour or something close) crash the kernel completely (or
at least oops many times).

Thank you.





* Re: Kernel oops+crash on repeated auditd restarts
  2012-01-26  7:13   ` Valentin Avram
@ 2012-02-08 16:11     ` Valentin Avram
  2012-03-05  8:35       ` Valentin Avram
  0 siblings, 1 reply; 24+ messages in thread
From: Valentin Avram @ 2012-02-08 16:11 UTC (permalink / raw)
  To: Eric Paris; +Cc: linux-audit



Hello.

Fresh news: Gentoo's gentoo-sources-3.1.10-r1 with audit-2.1.3 still gives
oops using the simple "start ; sleep 5 ; stop ; sleep 5 ; repeat" one-liner.

Kernel oops after less than 5 minutes:

BUG: unable to handle kernel NULL pointer dereference at 00000004
IP: [<c10f2337>] fsnotify_mark_destroy+0x87/0x130
*pdpt = 0000000000000000 *pde = f000def8f000def8
Oops: 0002 [#1] SMP

Pid: 690, comm: fsnotify_mark Not tainted 3.1.10-gentoo-r1-drbd-version3 #1 Dell Inc. PowerEdge R610/0F0XJ6
EIP: 0060:[<c10f2337>] EFLAGS: 00010216 CPU: 3
EIP is at fsnotify_mark_destroy+0x87/0x130
EAX: f2e51708 EBX: f2415fa8 ECX: 00000000 EDX: f2e51744
ESI: f2f46c00 EDI: ffffffc4 EBP: c10ea000 ESP: f2415f90
 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Process fsnotify_mark (pid: 690, ti=f2414000 task=f2f46c00 task.ti=f2414000)
Stack:
 f2f46c00 00000000 f2f46c00 c1050150 f2415fa0 f2415fa0 f2e51744 f2e51744
 f2c47f68 00000000 c10f22b0 00000000 c104f854 00000000 00000000 00000000
 00000000 f2415fd4 f2415fd4 00000000 c104f7e0 f2c47f68 c15820b6 00000000
Call Trace:
 [<c1050150>] ? abort_exclusive_wait+0x90/0x90
 [<c10f22b0>] ? fsnotify_put_mark+0x20/0x20
 [<c104f854>] ? kthread+0x74/0x80
 [<c104f7e0>] ? kthread_flush_work_fn+0x10/0x10
 [<c15820b6>] ? kernel_thread_helper+0x6/0xd
Code: 34 1b 8b c1 e8 4b 2d f6 ff 8b 54 24 18 8d 42 c4 39 da 8b 48 3c 8d 79
c4 75 0e eb 2d 90 8d b4 26 00 00 00 00 89 f8 89 ef 8b 68 40
 69 04 89 4d 00 89 50 3c 89 50 40 e8 48 ff ff ff 8b 4f 3c 8d
EIP: [<c10f2337>] fsnotify_mark_destroy+0x87/0x130 SS:ESP 0068:f2415f90
CR2: 0000000000000004
---[ end trace d10081cf0e5b936c ]---
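
The value after "Oops:" in the trace above is the x86 page-fault error code. A minimal sketch for decoding its low three bits, assuming the standard x86 layout (bit 0: protection violation vs. page not present, bit 1: write vs. read, bit 2: user vs. kernel mode). The helper name decode_oops_code is made up for illustration and is not part of any kernel tooling.

```shell
# decode_oops_code: hypothetical helper; takes the hex value printed after
# "Oops:" and describes the low three bits of the page-fault error code.
decode_oops_code() {
    code=$(( 0x$1 ))
    if [ $(( code & 1 )) -ne 0 ]; then cause="protection violation"; else cause="page not present"; fi
    if [ $(( code & 2 )) -ne 0 ]; then access="write"; else access="read"; fi
    if [ $(( code & 4 )) -ne 0 ]; then mode="user mode"; else mode="kernel mode"; fi
    echo "$cause, $access, $mode"
}

decode_oops_code 0002   # the code from the trace above
```

Read this way, 0002 together with the fault address 00000004 would point to a kernel-mode write through a NULL-based pointer at offset 4.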

So far only one oops has occurred; however, the test server is doing almost
nothing right now. I'll install more services, retry, and post the results
back here.

On Thu, Jan 26, 2012 at 9:13 AM, Valentin Avram <aval13@gmail.com> wrote:

>
> All the information i had is posted on the Gentoo bug report. The two
> machines i used to test the issue are now in production mode, so i can't do
> any testing on them. However I'll soon have access to a new machine that
> can stay in test mode for a while, where i plan to retest with Gentoo's
> latest "stable-marked" kernel gentoo-sources-3.1.6.
>
>





* Re: Kernel oops+crash on repeated auditd restarts
  2012-02-08 16:11     ` Valentin Avram
@ 2012-03-05  8:35       ` Valentin Avram
  2012-03-28 20:51         ` Peter Moody
  0 siblings, 1 reply; 24+ messages in thread
From: Valentin Avram @ 2012-03-05  8:35 UTC (permalink / raw)
  To: Eric Paris; +Cc: linux-audit



Finally i found some time and spare server to retest the oops and list_add
corruptions i was getting with the 3.x kernels and auditd 2.1.3.

I tested now with gentoo's latest stable 3.2.1-gentoo-r2 and kernel.org's
3.2.9.

Both get the oops/BUG in the same way and after that, they keep pouring
list_add corruptions with audit_prune_tre(truncated?) and auditctl as comms.
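
The apparent truncation would be consistent with the kernel's task-name limit. A minimal sketch, assuming the usual TASK_COMM_LEN of 16 (15 visible characters plus a terminating NUL) and assuming the thread's full name is audit_prune_tree, which the log itself does not show. The helper name comm_of is hypothetical.

```shell
# comm_of: hypothetical helper; prints a task name cut to the 15 characters
# that fit in the comm field under the assumptions stated above.
comm_of() {
    printf '%.15s\n' "$1"
}

comm_of audit_prune_tree   # 16 characters in, 15 out
comm_of auditctl           # short names pass through unchanged
```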

Since this is not about Gentoo's kernel only, i'll post here the oops in
3.2.9 and also attach some list_add corruptions.

3.2.9 BUG:

kernel: [  301.240011] BUG: unable to handle kernel NULL pointer dereference at   (null)
kernel: [  301.240305] IP: [<c1238dd0>] __list_del_entry+0x20/0xe0
kernel: [  301.240481] *pdpt = 0000000000000000 *pde = f000ddc8f000ddc8
kernel: [  301.240698] Oops: 0000 [#1] SMP
kernel: [  301.240910]
kernel: [  301.241030] Pid: 642, comm: fsnotify_mark Not tainted 3.2.9-drbd-version3 #1 Dell Inc. PowerEdge 2950/0CX396
kernel: [  301.241370] EIP: 0060:[<c1238dd0>] EFLAGS: 00010287 CPU: 6
kernel: [  301.241498] EIP is at __list_del_entry+0x20/0xe0
kernel: [  301.241623] EAX: f4fae544 EBX: f47cffa4 ECX: ffffffff EDX: 00000000
kernel: [  301.241751] ESI: f4fae544 EDI: f4fae508 EBP: f47cff7c ESP: f47cff64
kernel: [  301.241879]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
kernel: [  301.242005] Process fsnotify_mark (pid: 642, ti=f47ce000 task=f4f47c00 task.ti=f47ce000)
kernel: [  301.242207] Stack:
kernel: [  301.242327]  c10813c0 f47cffa4 f4f47c00 f4e70888 f47cff7c f47cffa4 f47cffb8 c10f6976
kernel: [  301.242882]  ffffffc3 f4f47c00 f4f47c00 00000000 f4f47c00 c10530c0 f47cff9c f47cff9c
kernel: [  301.243438]  f4fae544 f4fae544 f4c47f58 00000000 c10f68f0 f47cffe4 c1052834 00000000
kernel: [  301.243995] Call Trace:
kernel: [  301.244119]  [<c10813c0>] ? rcu_check_callbacks+0x110/0x110
kernel: [  301.244248]  [<c10f6976>] fsnotify_mark_destroy+0x86/0x120
kernel: [  301.244377]  [<c10530c0>] ? abort_exclusive_wait+0x80/0x80
kernel: [  301.244504]  [<c10f68f0>] ? fsnotify_put_mark+0x30/0x30
kernel: [  301.244631]  [<c1052834>] kthread+0x74/0x80
kernel: [  301.244756]  [<c10527c0>] ? kthread_flush_work_fn+0x10/0x10
kernel: [  301.244885]  [<c1582ab6>] kernel_thread_helper+0x6/0xd
kernel: [  301.245011] Code: 55 f4 8b 45 f8 e9 75 ff ff ff 90 55 89 e5 53 83 ec 14 8b 08 8b 50 04 81 f9 00 01 10 00 74 24 81 fa 00 02 20 00 0f 84 8e 00 00 00 <8b> 1a 39 d8 75 62 8b 59 04 39 d8 75 35 89 51 04 89 0a 83 c4 14
kernel: [  301.248195] EIP: [<c1238dd0>] __list_del_entry+0x20/0xe0 SS:ESP 0068:f47cff64
kernel: [  301.248414] CR2: 0000000000000000
kernel: [  301.248538] ---[ end trace 15082dbfb353f84c ]---
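
Since the kernel was built with CONFIG_DEBUG_INFO, the faulting location can in principle be mapped back to a source line with gdb. A minimal sketch, assuming an unstripped vmlinux from the same build is available; the helper name gdb_list_cmd and the ./vmlinux path are assumptions for illustration.

```shell
# gdb_list_cmd: hypothetical helper; turns the "symbol+0xoff/0xlen" form
# printed in an oops into a gdb "list" expression for that address.
gdb_list_cmd() {
    loc=${1%%/*}            # drop the "/0xlen" function-size suffix
    printf 'list *(%s)\n' "$loc"
}

gdb_list_cmd '__list_del_entry+0x20/0xe0'
# Possible use, with the vmlinux path being an assumption:
#   gdb -batch -ex "$(gdb_list_cmd '__list_del_entry+0x20/0xe0')" ./vmlinux
```

The same helper can be fed the caller from the call trace, fsnotify_mark_destroy+0x86/0x120, to see which list operation led into __list_del_entry.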

The kernel was compiled with the following DEBUG support (the bolded ones,
marked here with asterisks, were requested by Gentoo's dev):
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_SLUB_DEBUG=y
CONFIG_HAVE_DMA_API_DEBUG=y
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_PNP_DEBUG_MESSAGES=y
CONFIG_AIC94XX_DEBUG=y
CONFIG_USB_DEBUG=y
CONFIG_DEBUG_KERNEL=y
CONFIG_SCHED_DEBUG=y
CONFIG_DEBUG_RT_MUTEXES=y
CONFIG_DEBUG_PI_LIST=y
CONFIG_DEBUG_BUGVERBOSE=y
*CONFIG_DEBUG_INFO=y*
CONFIG_DEBUG_MEMORY_INIT=y
*CONFIG_DEBUG_LIST=y*
CONFIG_DEBUG_STACKOVERFLOW=y
CONFIG_DEBUG_RODATA=y
CONFIG_DEBUG_RODATA_TEST=y

I attached the kernel config i used for 3.2.9 to generate this oops and
warnings.
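
For anyone repeating the test, a minimal sketch that checks whether the two requested options are enabled in a kernel config file. The helper name missing_debug_opts is hypothetical, and the /boot/config-$(uname -r) path is an assumption that varies by distro.

```shell
# missing_debug_opts: hypothetical helper; prints each requested option that
# is not set to "y" in the given kernel config file.
missing_debug_opts() {
    cfg=$1; shift
    for opt in "$@"; do
        grep -qx "${opt}=y" "$cfg" || echo "$opt"
    done
}

# Example against the running kernel's config, if it exists at this path
# (an assumption; some systems only expose /proc/config.gz):
cfg="/boot/config-$(uname -r)"
if [ -r "$cfg" ]; then
    missing_debug_opts "$cfg" CONFIG_DEBUG_INFO CONFIG_DEBUG_LIST
fi
```

No output means both options are set; otherwise the missing ones are listed one per line.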

From the list_add warnings that come after: out of the 805 warnings I
processed, after masking with XXXXX the PID and next= values that kept
changing in every one, I got 26 distinct MD5 sums. I also attached the
relevant files as an archive to this email.
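
The masking and hashing step can be sketched roughly as follows. This is not the script from the attached archive; it assumes each warning has been saved to its own file in one directory, and the function names and sed patterns are illustrative guesses at the fields that vary.

```shell
# mask_fields: replaces the PID and the next= pointer with XXXXX so that
# warnings differing only in those values become identical text.
mask_fields() {
    sed -E -e 's/Pid: [0-9]+/Pid: XXXXX/' -e 's/next=[0-9a-f]+/next=XXXXX/'
}

# count_types: expects one warning per file in the directory given as $1
# (a hypothetical layout) and prints the number of distinct MD5 sums
# among the masked warnings.
count_types() {
    for f in "$1"/*; do
        mask_fields < "$f" | md5sum
    done | sort -u | wc -l
}

# Possible use, with the directory name being an assumption:
#   count_types ./warnings
```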

The Gentoo bug I opened is sleeping; it seems nobody has the time to at
least test and confirm (or not) the problems I'm seeing (or everybody's
thinking that nobody would restart auditd so often, so the bug is not
that serious).

Thank you for your time.


[-- Attachment #2: parse_oops.tgz --]
[-- Type: application/x-gzip, Size: 1783 bytes --]

[-- Attachment #3: kernel_config.gz --]
[-- Type: application/x-gzip, Size: 15572 bytes --]





* Re: Kernel oops+crash on repeated auditd restarts
  2012-03-05  8:35       ` Valentin Avram
@ 2012-03-28 20:51         ` Peter Moody
  2012-03-28 22:42           ` Peter Moody
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Moody @ 2012-03-28 20:51 UTC (permalink / raw)
  To: Valentin Avram; +Cc: linux-audit

Are you still able to reliably reproduce this oops? I'm trying to
track this down because this bug (or a very similar bug) is causing
some significant headaches here at work, but I haven't had a lot of
luck. I'm using usermode linux, though, so that might be interfering
with things.




-- 
Peter Moody      Google    1.650.253.7306
Security Engineer  pgp:0xC3410038


* Re: Kernel oops+crash on repeated auditd restarts
  2012-03-28 20:51         ` Peter Moody
@ 2012-03-28 22:42           ` Peter Moody
  2012-03-29  1:14             ` Eric Paris
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Moody @ 2012-03-28 22:42 UTC (permalink / raw)
  To: Valentin Avram; +Cc: linux-audit

fyi: this patch [1] seems to fix the issue for me. The scenario described
in that commit's subject line would reliably oops my machine.

[1] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=fed474857efbed79cd390d0aee224231ca718f63




-- 
Peter Moody      Google    1.650.253.7306
Security Engineer  pgp:0xC3410038


* Re: Kernel oops+crash on repeated auditd restarts
  2012-03-28 22:42           ` Peter Moody
@ 2012-03-29  1:14             ` Eric Paris
  2012-03-29  6:44               ` Valentin Avram
  0 siblings, 1 reply; 24+ messages in thread
From: Eric Paris @ 2012-03-29  1:14 UTC (permalink / raw)
  To: Peter Moody; +Cc: linux-audit

That patch fixes a BUG().  The report has a NULL ptr deref and some
apparent list corruption....  Sadly they aren't the same....


* Re: Kernel oops+crash on repeated auditd restarts
  2012-03-29  1:14             ` Eric Paris
@ 2012-03-29  6:44               ` Valentin Avram
  2012-04-03 16:15                 ` Peter Moody
  0 siblings, 1 reply; 24+ messages in thread
From: Valentin Avram @ 2012-03-29  6:44 UTC (permalink / raw)
  To: Eric Paris; +Cc: linux-audit


[-- Attachment #1.1: Type: text/plain, Size: 6852 bytes --]

Yes, i know that patch. It made it into kernel 3.2.2. I tested it
successfully (oops in 3.2.1, no oops in 3.2.9), but this oops i'm seeing is
also in 3.2.9.

I monitored changelogs since 3.2.1 to 3.2.12 but there were no fixes either
in audit subsystem or in fsnotify. I'll try to reproduce in latest 3.2.13
and repost the oops, but i'm 99% confident it will be the same.

Sadly nobody except you seems to pay attention to this problem, probably
because it requires special conditions to reproduce (really, who starts and
stops auditd every 5 seconds on a production server?). We only ran into it
because one of our servers would randomly oops and then freeze about once a
month after stopping and then starting auditd every morning (and the
stop-start sequence was needed to work around a bug somewhere that would
hang a gzip running on a file outside a watched folder).

Anyway, as a last note, i have a feeling that the oops is not exactly
random, there is a pattern, just that i haven't figured it out completely
yet.

Will keep you up to date with the things i find out.

V.
On Mar 29, 2012 4:14 AM, "Eric Paris" <eparis@redhat.com> wrote:

> That patch fixes a BUG() .  The report has a NULL ptr deref and some
> apparent list corruption....  Sadly they aren't the same....
>
> On Wed, 2012-03-28 at 15:42 -0700, Peter Moody wrote:
> > fyi: this patch [1] seems to fix the issue for me. The explanation in
> > the subject would reliably oops my machine.
> >
> > [1]
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=fed474857efbed79cd390d0aee224231ca718f63
> >
> > On Wed, Mar 28, 2012 at 1:51 PM, Peter Moody <pmoody@google.com> wrote:
> > > Are you still able to reliably reproduce this oops? I'm trying to
> > > track this down because this bug (or a very similar bug) is causing
> > > some significant headaches here at work, but I haven't had a lot of
> > > luck. I'm using usermode linux, though, so that might be interfering
> > > with things.
> > >
> > > On Mon, Mar 5, 2012 at 12:35 AM, Valentin Avram <aval13@gmail.com>
> wrote:
> > >> Finally i found some time and spare server to retest the oops and
> list_add
> > >> corruptions i was getting with the 3.x kernels and auditd 2.1.3.
> > >>
> > >> I tested now with gentoo's latest stable 3.2.1-gentoo-r2 and
> kernel.org's
> > >> 3.2.9.
> > >>
> > >> Both get the oops/BUG in the same way and after that, they keep
> pouring
> > >> list_add corruptions with audit_prune_tre(truncated?) and auditctl as
> comms.
> > >>
> > >> Since this is not about Gentoo's kernel only, i'll post here the oops
> in
> > >> 3.2.9 and also attach some list_add corruptions.
> > >>
> > >> 3.2.9 BUG:
> > >>
> > >> kernel: [  301.240011] BUG: unable to handle kernel NULL pointer
> dereference
> > >> at   (null)
> > >> kernel: [  301.240305] IP: [<c1238dd0>] __list_del_entry+0x20/0xe0
> > >> kernel: [  301.240481] *pdpt = 0000000000000000 *pde =
> f000ddc8f000ddc8
> > >> kernel: [  301.240698] Oops: 0000 [#1] SMP
> > >> kernel: [  301.240910]
> > >> kernel: [  301.241030] Pid: 642, comm: fsnotify_mark Not tainted
> > >> 3.2.9-drbd-version3 #1 Dell Inc. PowerEdge 2950/0CX396
> > >> kernel: [  301.241370] EIP: 0060:[<c1238dd0>] EFLAGS: 00010287 CPU: 6
> > >> kernel: [  301.241498] EIP is at __list_del_entry+0x20/0xe0
> > >> kernel: [  301.241623] EAX: f4fae544 EBX: f47cffa4 ECX: ffffffff EDX:
> > >> 00000000
> > >> kernel: [  301.241751] ESI: f4fae544 EDI: f4fae508 EBP: f47cff7c ESP:
> > >> f47cff64
> > >> kernel: [  301.241879]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
> > >> kernel: [  301.242005] Process fsnotify_mark (pid: 642, ti=f47ce000
> > >> task=f4f47c00 task.ti=f47ce000)
> > >> kernel: [  301.242207] Stack:
> > >> kernel: [  301.242327]  c10813c0 f47cffa4 f4f47c00 f4e70888 f47cff7c
> > >> f47cffa4 f47cffb8 c10f6976
> > >> kernel: [  301.242882]  ffffffc3 f4f47c00 f4f47c00 00000000 f4f47c00
> > >> c10530c0 f47cff9c f47cff9c
> > >> kernel: [  301.243438]  f4fae544 f4fae544 f4c47f58 00000000 c10f68f0
> > >> f47cffe4 c1052834 00000000
> > >> kernel: [  301.243995] Call Trace:
> > >> kernel: [  301.244119]  [<c10813c0>] ? rcu_check_callbacks+0x110/0x110
> > >> kernel: [  301.244248]  [<c10f6976>] fsnotify_mark_destroy+0x86/0x120
> > >> kernel: [  301.244377]  [<c10530c0>] ? abort_exclusive_wait+0x80/0x80
> > >> kernel: [  301.244504]  [<c10f68f0>] ? fsnotify_put_mark+0x30/0x30
> > >> kernel: [  301.244631]  [<c1052834>] kthread+0x74/0x80
> > >> kernel: [  301.244756]  [<c10527c0>] ? kthread_flush_work_fn+0x10/0x10
> > >> kernel: [  301.244885]  [<c1582ab6>] kernel_thread_helper+0x6/0xd
> > >> kernel: [  301.245011] Code: 55 f4 8b 45 f8 e9 75 ff ff ff 90 55 89
> e5 53 83
> > >> ec 14 8b 08 8b 50 04 81 f9 00 01 10 00 74 24 81 fa 00 02 20 00 0f 84
> 8e 00
> > >> 00 00 <8b> 1a 39 d8 75 62 8b 59 04 39 d8 75 35 89 51 04 89 0a 83 c4 14
> > >> kernel: [  301.248195] EIP: [<c1238dd0>] __list_del_entry+0x20/0xe0
> SS:ESP
> > >> 0068:f47cff64
> > >> kernel: [  301.248414] CR2: 0000000000000000
> > >> kernel: [  301.248538] ---[ end trace 15082dbfb353f84c ]---
> > >>
> > >> The kernel was compiled with the following DEBUG support (the bolded
> one
> > >> were requested by Gentoo's Dev:
> > >> CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
> > >> CONFIG_SLUB_DEBUG=y
> > >> CONFIG_HAVE_DMA_API_DEBUG=y
> > >> CONFIG_X86_DEBUGCTLMSR=y
> > >> CONFIG_PNP_DEBUG_MESSAGES=y
> > >> CONFIG_AIC94XX_DEBUG=y
> > >> CONFIG_USB_DEBUG=y
> > >> CONFIG_DEBUG_KERNEL=y
> > >> CONFIG_SCHED_DEBUG=y
> > >> CONFIG_DEBUG_RT_MUTEXES=y
> > >> CONFIG_DEBUG_PI_LIST=y
> > >> CONFIG_DEBUG_BUGVERBOSE=y
> > >> CONFIG_DEBUG_INFO=y
> > >> CONFIG_DEBUG_MEMORY_INIT=y
> > >> CONFIG_DEBUG_LIST=y
> > >> CONFIG_DEBUG_STACKOVERFLOW=y
> > >> CONFIG_DEBUG_RODATA=y
> > >> CONFIG_DEBUG_RODATA_TEST=y
> > >>
> > >> I attached the kernel config i used for 3.2.9 to generate this oops
> and
> > >> warnings.
> > >>
> > >> From the list_add warnings that come after, out of 805 warnings i
> processed,
> > >> after masking with XXXXX the PID and next= values that kept changing
> in
> > >> every one, i got 26 types of MD5. I also attached the files relevant
> as an
> > >> archive to this email.
> > >>
> > >> The Gentoo bug i opened is sleeping, it seems nobody has the time to
> at
> > >> least test to confirm or not the problems i'm seeing (or everybody's
> > >> thinking that nobody would restart auditd so often, so the bug it's
> not that
> > >> serious).
> > >>
> > >>
> > >> Thank you for your time.
> > >>
> > >> On Wed, Feb 8, 2012 at 6:11 PM, Valentin Avram <aval13@gmail.com>
> wrote:
> > >>
> > >>
> > >> --
> > >> Linux-audit mailing list
> > >> Linux-audit@redhat.com
> > >> https://www.redhat.com/mailman/listinfo/linux-audit
> > >
> > >
> > >
> > > --
> > > Peter Moody      Google    1.650.253.7306
> > > Security Engineer  pgp:0xC3410038
> >
> >
> >
>
>
>

* Re: Kernel oops+crash on repeated auditd restarts
  2012-03-29  6:44               ` Valentin Avram
@ 2012-04-03 16:15                 ` Peter Moody
  2012-04-05 21:03                   ` Peter Moody
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Moody @ 2012-04-03 16:15 UTC (permalink / raw)
  To: Valentin Avram; +Cc: linux-audit

This may already be known, but the issue seems to be limited to watch
rules. With any watch rules, I can reliably crash my machine while
freeing a watch rule after only starting/stopping auditd a few times.
With no watch rules, I have no issues.
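
For anyone who wants to compare the two cases, here is a minimal sketch of
the stop/start loop from the first message in this thread, wrapped in a
function. The names restart_loop, AUDITD_INIT and DRY_RUN are made up for
illustration, and the init script path is the Gentoo one used earlier in the
thread, so it may differ on other distros. By default it only prints the
commands it would run.

```shell
#!/bin/sh
# Sketch of the auditd stop/start loop from the original report.
# AUDITD_INIT, DRY_RUN and restart_loop are illustrative names only.
AUDITD_INIT="${AUDITD_INIT:-/etc/init.d/auditd}"

restart_loop() {
    count="$1"      # number of stop/start cycles to run
    pause="$2"      # seconds to sleep after each stop and each start
    i=0
    while [ "$i" -lt "$count" ]; do
        for action in stop start; do
            if [ "${DRY_RUN:-1}" = "1" ]; then
                # Dry run (the default): print the command, do not run it.
                echo "$AUDITD_INIT $action"
            else
                "$AUDITD_INIT" "$action"
                sleep "$pause"
            fi
        done
        i=$((i + 1))
    done
}
```

Running "DRY_RUN=0 restart_loop 100 5" once with the -w lines present in
audit.rules and once with them removed should show whether the crash follows
the watch rules on a given machine.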

Cheers,
peter

On Wed, Mar 28, 2012 at 11:44 PM, Valentin Avram <aval13@gmail.com> wrote:
> Yes, i know that patch. It made it into kernel 3.2.2. I tested it
> successfully (oops in 3.2.1, no oops in 3.2.9), but this oops i'm seeing is
> also in 3.2.9.
>
> I monitored changelogs since 3.2.1 to 3.2.12 but there were no fixes either
> in audit subsystem or in fsnotify. I'll try to reproduce in latest 3.2.13
> and repost the oops, but i'm 99% confident it will be the same.
>
> Sadly nobody except you seems to pay attention to this problem, probably
> because it requires special conditions to reproduce (really, who starts and
> stops auditd every 5 seconds on a production server?). We only ran into it
> because one of our servers would randomly oops and then freeze about once a
> month after stopping and then starting auditd every morning (and the
> stop-start sequence was needed to work around a bug somewhere that would
> hang a gzip running on a file outside a watched folder).
>
> Anyway, as a last note, i have a feeling that the oops is not exactly
> random, there is a pattern, just that i haven't figured it out completely
> yet.
>
> Will keep you up to date with the things i find out.
>
> V.



-- 
Peter Moody      Google    1.650.253.7306
Security Engineer  pgp:0xC3410038

* Re: Kernel oops+crash on repeated auditd restarts
  2012-04-03 16:15                 ` Peter Moody
@ 2012-04-05 21:03                   ` Peter Moody
  2012-04-05 21:07                     ` Eric Paris
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Moody @ 2012-04-05 21:03 UTC (permalink / raw)
  To: Valentin Avram; +Cc: linux-audit

(please let me know if I should take this off-list)

One other thing (again, maybe already known), but this seems to be
exacerbated by SMP. On my machine, I can't reproduce the crash if I
boot with maxcpus=1.
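
To double-check which case a given boot is in, something like the sketch
below could be used. It pulls the maxcpus= value out of a kernel command line
string. maxcpus_from_cmdline is a made-up helper name, and /proc/cmdline is
the assumed source of that string on a running system.

```shell
#!/bin/sh
# Print the value of maxcpus= from a kernel command line string, or
# nothing if the parameter is absent. maxcpus_from_cmdline is a
# hypothetical helper, not part of any existing tool.
maxcpus_from_cmdline() (
    set -f                      # no globbing while splitting into words
    for word in $1; do
        case "$word" in
            maxcpus=*) echo "${word#maxcpus=}" ;;
        esac
    done
)

# Typical use on a running system:
#   maxcpus_from_cmdline "$(cat /proc/cmdline)"
```

An empty result means no maxcpus limit was requested on that boot.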

Still hunting.

Cheers,
peter

On Tue, Apr 3, 2012 at 9:15 AM, Peter Moody <pmoody@google.com> wrote:
> This may already be known, but the issue seems to be limited to watch
> rules. With any watch rules, I can reliably crash my machine while
> freeing a watch rule after only starting/stopping auditd a few times.
> With no watch rules, I have no issues.
>
> Cheers,
> peter



-- 
Peter Moody      Google    1.650.253.7306
Security Engineer  pgp:0xC3410038

* Re: Kernel oops+crash on repeated auditd restarts
  2012-04-05 21:03                   ` Peter Moody
@ 2012-04-05 21:07                     ` Eric Paris
  2012-04-17 17:56                       ` Peter Moody
  0 siblings, 1 reply; 24+ messages in thread
From: Eric Paris @ 2012-04-05 21:07 UTC (permalink / raw)
  To: Peter Moody; +Cc: linux-audit

please please please keep on list.  Everything you say might help track
it down!

On Thu, 2012-04-05 at 14:03 -0700, Peter Moody wrote:
> (please let me know if I should take this off-list)
> 
> One other thing (again, maybe already known), but this seems to be
> exacerbated by SMP. On my machine, I can't reproduce the crash if I
> boot with maxcpus=1.
> 
> Still hunting.
> 
> Cheers,
> peter

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Kernel oops+crash on repeated auditd restarts
  2012-04-05 21:07                     ` Eric Paris
@ 2012-04-17 17:56                       ` Peter Moody
  2012-04-17 18:24                         ` Peter Moody
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Moody @ 2012-04-17 17:56 UTC (permalink / raw)
  To: Eric Paris; +Cc: linux-audit

[-- Attachment #1: Type: text/plain, Size: 9665 bytes --]

Here's a trace with debugging turned way up plus a few extra printk's
added to fs/notify/mark.c. I'm looping through private_destroy_list
before and after the call to synchronize_srcu.
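
As a rough illustration of that kind of list walk, here is a userspace sketch (not the printk code I added to the kernel). The struct and the helpers list_init, list_add_tail and list_check are hypothetical stand-ins that only mimic the shape of the kernel's struct list_head; the check is similar in spirit to what CONFIG_DEBUG_LIST verifies, namely that every entry's next->prev points back at the entry.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace stand-in for the kernel's struct list_head (assumption). */
struct list_head {
	struct list_head *next, *prev;
};

/* Make head an empty circular list. */
static void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Append entry just before head, i.e. at the tail of the list. */
static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/*
 * Walk the list and verify that each node's next->prev points back at it.
 * Returns the number of entries, or -1 if a broken link is found.
 * limit bounds the walk so a corrupted cycle cannot loop forever.
 */
static int list_check(const struct list_head *head, int limit)
{
	const struct list_head *pos = head;
	int count = 0;

	assert(limit >= 0);
	do {
		if (pos->next == NULL || pos->next->prev != pos)
			return -1;
		pos = pos->next;
		if (pos != head)
			count++;
	} while (pos != head && count <= limit);

	return pos == head ? count : -1;
}
```

Calling a check like this on the private list before and after the grace period wait is one way to see at which point an entry goes bad; a NULL next pointer, as in the oops above (EDX and CR2 are both zero), is reported as -1.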

I can reproduce this reliably with kvm with 2 virtual processors:
Linux desktop 3.4.0-rc3-oops1+ #1 SMP Tue Apr 17 09:59:44 PDT 2012
x86_64 GNU/Linux

Cheers,
peter

On Thu, Apr 5, 2012 at 2:07 PM, Eric Paris <eparis@redhat.com> wrote:
> please please please keep on list.  Everything you say might help track
> it down!
>
> On Thu, 2012-04-05 at 14:03 -0700, Peter Moody wrote:
>> (please let me know if I should take this off-list)
>>
>> One other thing (again, maybe already known), but this seems to be
>> exacerbated by SMP. On my machine, I can't reproduce the crash if I
>> boot with maxcpus=1.
>>
>> Still hunting.
>>
>> Cheers,
>> peter
>>
>> On Tue, Apr 3, 2012 at 9:15 AM, Peter Moody <pmoody@google.com> wrote:
>> > This may already be known, but the issue seems to be limited to watch
>> > rules. With any watch rules, I can reliably crash my machine while
>> > freeing a watch rule after only starting/stopping auditd a few times.
>> > With no watch rules, I have no issues.
>> >
>> > Cheers,
>> > peter
>> >
>> > On Wed, Mar 28, 2012 at 11:44 PM, Valentin Avram <aval13@gmail.com> wrote:
>> >> Yes, i know that patch. It made it into kernel 3.2.2. I tested it
>> >> successfully (oops in 3.2.1, no oops in 3.2.9), but this oops i'm seeing is
>> >> also in 3.2.9.
>> >>
>> >> I monitored changelogs since 3.2.1 to 3.2.12 but there were no fixes either
>> >> in audit subsystem or in fsnotify. I'll try to reproduce in latest 3.2.13
>> >> and repost the oops, but i'm 99% confident it will be the same.
>> >>
>> >> Sadly nobody except you seems to pay attention to this problem, probably
>> >> because it requires special conditions to reproduce (really, who starts and
>> >> stops auditd every 5 seconds on a production server?). We only ran into it
>> >> because one of our servers would randomly oops and then freeze about once a
>> >> month after stopping and then starting auditd every morning (and the
>> >> stop-start sequence was needed to work around a bug somewhere that would
>> >> hang a gzip running on a file outside a watched folder).
>> >>
>> >> Anyway, as a last note, i have a feeling that the oops is not exactly
>> >> random, there is a pattern, just that i haven't figured it out completely
>> >> yet.
>> >>
>> >> Will keep you up to date with the things i find out.
>> >>
>> >> V.
>> >>
>> >> On Mar 29, 2012 4:14 AM, "Eric Paris" <eparis@redhat.com> wrote:
>> >>>
>> >>> That patch fixes a BUG().  The report has a NULL ptr deref and some
>> >>> apparent list corruption....  Sadly they aren't the same....
>> >>>
>> >>> On Wed, 2012-03-28 at 15:42 -0700, Peter Moody wrote:
>> >>> > fyi: this patch [1] seems to fix the issue for me. The explanation in
>> >>> > the subject would reliably oops my machine.
>> >>> >
>> >>> > [1]
>> >>> > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=fed474857efbed79cd390d0aee224231ca718f63
>> >>> >
>> >>> > On Wed, Mar 28, 2012 at 1:51 PM, Peter Moody <pmoody@google.com> wrote:
>> >>> > > Are you still able to reliably reproduce this oops? I'm trying to
>> >>> > > track this down because this bug (or a very similar bug) is causing
>> >>> > > some significant headaches here at work, but I haven't had a lot of
>> >>> > > luck. I'm using usermode linux, though, so that might be interfering
>> >>> > > with things.
>> >>> > >
>> >>> > > On Mon, Mar 5, 2012 at 12:35 AM, Valentin Avram <aval13@gmail.com>
>> >>> > > wrote:
>> >>> > >> Finally i found some time and a spare server to retest the oops and
>> >>> > >> list_add corruptions i was getting with the 3.x kernels and auditd 2.1.3.
>> >>> > >>
>> >>> > >> I tested now with gentoo's latest stable 3.2.1-gentoo-r2 and kernel.org's
>> >>> > >> 3.2.9.
>> >>> > >>
>> >>> > >> Both get the oops/BUG in the same way and after that, they keep pouring
>> >>> > >> list_add corruptions with audit_prune_tre(truncated?) and auditctl as
>> >>> > >> comms.
>> >>> > >>
>> >>> > >> Since this is not about Gentoo's kernel only, i'll post here the oops in
>> >>> > >> 3.2.9 and also attach some list_add corruptions.
>> >>> > >>
>> >>> > >> 3.2.9 BUG:
>> >>> > >>
>> >>> > >> kernel: [  301.240011] BUG: unable to handle kernel NULL pointer dereference at   (null)
>> >>> > >> kernel: [  301.240305] IP: [<c1238dd0>] __list_del_entry+0x20/0xe0
>> >>> > >> kernel: [  301.240481] *pdpt = 0000000000000000 *pde = f000ddc8f000ddc8
>> >>> > >> kernel: [  301.240698] Oops: 0000 [#1] SMP
>> >>> > >> kernel: [  301.240910]
>> >>> > >> kernel: [  301.241030] Pid: 642, comm: fsnotify_mark Not tainted 3.2.9-drbd-version3 #1 Dell Inc. PowerEdge 2950/0CX396
>> >>> > >> kernel: [  301.241370] EIP: 0060:[<c1238dd0>] EFLAGS: 00010287 CPU: 6
>> >>> > >> kernel: [  301.241498] EIP is at __list_del_entry+0x20/0xe0
>> >>> > >> kernel: [  301.241623] EAX: f4fae544 EBX: f47cffa4 ECX: ffffffff EDX: 00000000
>> >>> > >> kernel: [  301.241751] ESI: f4fae544 EDI: f4fae508 EBP: f47cff7c ESP: f47cff64
>> >>> > >> kernel: [  301.241879]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
>> >>> > >> kernel: [  301.242005] Process fsnotify_mark (pid: 642, ti=f47ce000 task=f4f47c00 task.ti=f47ce000)
>> >>> > >> kernel: [  301.242207] Stack:
>> >>> > >> kernel: [  301.242327]  c10813c0 f47cffa4 f4f47c00 f4e70888 f47cff7c f47cffa4 f47cffb8 c10f6976
>> >>> > >> kernel: [  301.242882]  ffffffc3 f4f47c00 f4f47c00 00000000 f4f47c00 c10530c0 f47cff9c f47cff9c
>> >>> > >> kernel: [  301.243438]  f4fae544 f4fae544 f4c47f58 00000000 c10f68f0 f47cffe4 c1052834 00000000
>> >>> > >> kernel: [  301.243995] Call Trace:
>> >>> > >> kernel: [  301.244119]  [<c10813c0>] ? rcu_check_callbacks+0x110/0x110
>> >>> > >> kernel: [  301.244248]  [<c10f6976>] fsnotify_mark_destroy+0x86/0x120
>> >>> > >> kernel: [  301.244377]  [<c10530c0>] ? abort_exclusive_wait+0x80/0x80
>> >>> > >> kernel: [  301.244504]  [<c10f68f0>] ? fsnotify_put_mark+0x30/0x30
>> >>> > >> kernel: [  301.244631]  [<c1052834>] kthread+0x74/0x80
>> >>> > >> kernel: [  301.244756]  [<c10527c0>] ? kthread_flush_work_fn+0x10/0x10
>> >>> > >> kernel: [  301.244885]  [<c1582ab6>] kernel_thread_helper+0x6/0xd
>> >>> > >> kernel: [  301.245011] Code: 55 f4 8b 45 f8 e9 75 ff ff ff 90 55 89 e5 53 83 ec 14 8b 08 8b 50 04 81 f9 00 01 10 00 74 24 81 fa 00 02 20 00 0f 84 8e 00 00 00 <8b> 1a 39 d8 75 62 8b 59 04 39 d8 75 35 89 51 04 89 0a 83 c4 14
>> >>> > >> kernel: [  301.248195] EIP: [<c1238dd0>] __list_del_entry+0x20/0xe0 SS:ESP 0068:f47cff64
>> >>> > >> kernel: [  301.248414] CR2: 0000000000000000
>> >>> > >> kernel: [  301.248538] ---[ end trace 15082dbfb353f84c ]---
>> >>> > >>
>> >>> > >> The kernel was compiled with the following DEBUG support (the bolded ones
>> >>> > >> were requested by Gentoo's dev):
>> >>> > >> CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
>> >>> > >> CONFIG_SLUB_DEBUG=y
>> >>> > >> CONFIG_HAVE_DMA_API_DEBUG=y
>> >>> > >> CONFIG_X86_DEBUGCTLMSR=y
>> >>> > >> CONFIG_PNP_DEBUG_MESSAGES=y
>> >>> > >> CONFIG_AIC94XX_DEBUG=y
>> >>> > >> CONFIG_USB_DEBUG=y
>> >>> > >> CONFIG_DEBUG_KERNEL=y
>> >>> > >> CONFIG_SCHED_DEBUG=y
>> >>> > >> CONFIG_DEBUG_RT_MUTEXES=y
>> >>> > >> CONFIG_DEBUG_PI_LIST=y
>> >>> > >> CONFIG_DEBUG_BUGVERBOSE=y
>> >>> > >> CONFIG_DEBUG_INFO=y
>> >>> > >> CONFIG_DEBUG_MEMORY_INIT=y
>> >>> > >> CONFIG_DEBUG_LIST=y
>> >>> > >> CONFIG_DEBUG_STACKOVERFLOW=y
>> >>> > >> CONFIG_DEBUG_RODATA=y
>> >>> > >> CONFIG_DEBUG_RODATA_TEST=y
>> >>> > >>
>> >>> > >> I attached the kernel config i used for 3.2.9 to generate this oops and
>> >>> > >> warnings.
>> >>> > >>
>> >>> > >> From the list_add warnings that come after, out of 805 warnings i
>> >>> > >> processed, after masking with XXXXX the PID and next= values that kept
>> >>> > >> changing in every one, i got 26 distinct MD5 sums. I also attached the
>> >>> > >> relevant files as an archive to this email.
>> >>> > >>
>> >>> > >> The Gentoo bug i opened is sleeping, it seems nobody has the time to at
>> >>> > >> least test to confirm or not the problems i'm seeing (or everybody's
>> >>> > >> thinking that nobody would restart auditd so often, so the bug is not
>> >>> > >> that serious).
>> >>> > >>
>> >>> > >>
>> >>> > >> Thank you for your time.
>> >>> > >>
>> >>> > >> On Wed, Feb 8, 2012 at 6:11 PM, Valentin Avram <aval13@gmail.com>
>> >>> > >> wrote:
>> >>> > >>
>> >>> > >>
>> >>> > >> --
>> >>> > >> Linux-audit mailing list
>> >>> > >> Linux-audit@redhat.com
>> >>> > >> https://www.redhat.com/mailman/listinfo/linux-audit
>> >>> > >
>> >>> > >
>> >>> > >
>> >>> > > --
>> >>> > > Peter Moody      Google    1.650.253.7306
>> >>> > > Security Engineer  pgp:0xC3410038
>> >>> >
>> >>> >
>> >>> >
>> >>>
>> >>>
>> >>
>> >
>> >
>> >
>> > --
>> > Peter Moody      Google    1.650.253.7306
>> > Security Engineer  pgp:0xC3410038
>>
>>
>>
>
>



-- 
Peter Moody      Google    1.650.253.7306
Security Engineer  pgp:0xC3410038

[-- Attachment #2: trace.gz --]
[-- Type: application/x-gzip, Size: 3060 bytes --]

[-- Attachment #3: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Kernel oops+crash on repeated auditd restarts
  2012-04-17 17:56                       ` Peter Moody
@ 2012-04-17 18:24                         ` Peter Moody
  2012-04-17 21:54                           ` Peter Moody
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Moody @ 2012-04-17 18:24 UTC (permalink / raw)
  To: Eric Paris; +Cc: linux-audit

[-- Attachment #1: Type: text/plain, Size: 10100 bytes --]

and my config.gz

On Tue, Apr 17, 2012 at 10:56 AM, Peter Moody <pmoody@google.com> wrote:
> Here's a trace with debugging turned way up plus a few extra printk's
> added to fs/notify/mark.c. I'm looping through private_destroy_list
> before and after the call to synchronize_srcu.
>
> I can reproduce this reliably with kvm with 2 virtual processors:
> Linux desktop 3.4.0-rc3-oops1+ #1 SMP Tue Apr 17 09:59:44 PDT 2012
> x86_64 GNU/Linux
>
> Cheers,
> peter
>



-- 
Peter Moody      Google    1.650.253.7306
Security Engineer  pgp:0xC3410038

[-- Attachment #2: config.gz --]
[-- Type: application/x-gzip, Size: 34351 bytes --]

[-- Attachment #3: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Kernel oops+crash on repeated auditd restarts
  2012-04-17 18:24                         ` Peter Moody
@ 2012-04-17 21:54                           ` Peter Moody
  2012-04-21  2:14                             ` Marcelo Cerri
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Moody @ 2012-04-17 21:54 UTC (permalink / raw)
  To: Eric Paris; +Cc: linux-audit

One last thing: moving synchronize_srcu(&fsnotify_mark_srcu) out of the
for (;;) loop in fs/notify/mark.c appears to solve the stability issues
for me. I don't know enough about kernel internals to tell whether
this does other bad things to my system.

Cheers,
peter

On Tue, Apr 17, 2012 at 11:24 AM, Peter Moody <pmoody@google.com> wrote:
> and my config.gz
>



-- 
Peter Moody      Google    1.650.253.7306
Security Engineer  pgp:0xC3410038

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Kernel oops+crash on repeated auditd restarts
  2012-04-17 21:54                           ` Peter Moody
@ 2012-04-21  2:14                             ` Marcelo Cerri
  2012-04-23 16:05                               ` Peter Moody
  2012-04-23 16:26                               ` Eric Paris
  0 siblings, 2 replies; 24+ messages in thread
From: Marcelo Cerri @ 2012-04-21  2:14 UTC (permalink / raw)
  To: Peter Moody, Valentin Avram; +Cc: linux-audit


I took a look at the source code and made some tests. It seems to be a
problem with the reference count of the fsnotify_mark structure.

This error occurs because the fsnotify_mark_destroy function
(which runs in a separate kthread) is trying to iterate over a mark
that has already been freed.

Looking at the fsnotify_destroy_mark function (not to be confused with
fsnotify_mark_destroy), which adds a mark to destroy_list to be freed
later by fsnotify_mark_destroy, I noticed that it does not take a
reference for the pointer it adds to destroy_list, and the callers
usually drop the references they hold right after calling
fsnotify_destroy_mark.

The patch below increments the reference count of a mark when it is
added to the destroy list. It seems to solve the issue and it doesn't
seem to cause any memory leak. Please, can you make some tests in your
environments and let me know if there is any problem with this patch.

Regarding the synchronize_srcu call, I don't think it's causing this
error. It probably just makes it more frequent because it forces all
the CPUs to schedule, giving someone else the chance to free the
mark.

---
 fs/notify/mark.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/notify/mark.c b/fs/notify/mark.c
index f104d56..2985fff 100644
--- a/fs/notify/mark.c
+++ b/fs/notify/mark.c
@@ -150,6 +150,7 @@ void fsnotify_destroy_mark(struct fsnotify_mark *mark)
    spin_unlock(&group->mark_lock);
    spin_unlock(&mark->lock);
 
+   fsnotify_get_mark(mark);
    spin_lock(&destroy_lock);
    list_add(&mark->destroy_list, &destroy_list);
    spin_unlock(&destroy_lock);
-- 
1.7.9.4


On Tue, 17 Apr 2012 14:54:29 -0700
Peter Moody <pmoody@google.com> wrote:

> Last thing. moving synchronize_srcu(&fsnotify_mark_srcu) out of the
> for(;;) loop in fs/notify/mark.c appears to solve the stability issues
> for me. I don't know enough about kernel internals to determine if
> this is doing lots of other bad things to my system or not.
> 
> Cheers,
> peter
> 

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: Kernel oops+crash on repeated auditd restarts
  2012-04-21  2:14                             ` Marcelo Cerri
@ 2012-04-23 16:05                               ` Peter Moody
  2012-04-23 16:26                               ` Eric Paris
  1 sibling, 0 replies; 24+ messages in thread
From: Peter Moody @ 2012-04-23 16:05 UTC (permalink / raw)
  To: Marcelo Cerri; +Cc: linux-audit

This works for me. Thanks, Marcelo!

Cheers,
peter

On Fri, Apr 20, 2012 at 7:14 PM, Marcelo Cerri
<mhcerri@linux.vnet.ibm.com> wrote:
>
> I took a look at the source code and made some tests. It seems to be a
> problem with the reference count of the fsnotify_mark structure.
>
> This error occurs because the fsnotify_mark_destroy function
> (which runs in a separate kthread) is trying to iterate over a mark
> that has already been freed.
>
> Looking at the fsnotify_destroy_mark function (not to be confused with
> fsnotify_mark_destroy), which adds a mark to destroy_list to be freed
> later by fsnotify_mark_destroy, I noticed that it does not take a
> reference for the pointer it adds to destroy_list, and the callers
> usually drop the references they hold right after calling
> fsnotify_destroy_mark.
>
> The patch below increments the reference count of a mark when it is
> added to the destroy list. It seems to solve the issue and it doesn't
> seem to cause any memory leak. Please, can you make some tests in your
> environments and let me know if there is any problem with this patch.
>
> Regarding the synchronize_srcu call, I don't think it's causing this
> error. It probably just makes it more frequent because it forces all
> the CPUs to schedule, giving someone else the chance to free the
> mark.
>
> ---
>  fs/notify/mark.c |    1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/fs/notify/mark.c b/fs/notify/mark.c
> index f104d56..2985fff 100644
> --- a/fs/notify/mark.c
> +++ b/fs/notify/mark.c
> @@ -150,6 +150,7 @@ void fsnotify_destroy_mark(struct fsnotify_mark *mark)
>    spin_unlock(&group->mark_lock);
>    spin_unlock(&mark->lock);
>
> +   fsnotify_get_mark(mark);
>    spin_lock(&destroy_lock);
>    list_add(&mark->destroy_list, &destroy_list);
>    spin_unlock(&destroy_lock);
> --
> 1.7.9.4
>
>



-- 
Peter Moody      Google    1.650.253.7306
Security Engineer  pgp:0xC3410038

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Kernel oops+crash on repeated auditd restarts
  2012-04-21  2:14                             ` Marcelo Cerri
  2012-04-23 16:05                               ` Peter Moody
@ 2012-04-23 16:26                               ` Eric Paris
  2012-04-24  1:27                                 ` Peter Moody
  2012-04-24  5:12                                 ` Marcelo Cerri
  1 sibling, 2 replies; 24+ messages in thread
From: Eric Paris @ 2012-04-23 16:26 UTC (permalink / raw)
  To: Marcelo Cerri; +Cc: linux-audit

On Fri, 2012-04-20 at 23:14 -0300, Marcelo Cerri wrote:

> The patch below increments the reference count of a mark when it is
> added to the destroy list. It seems to solve the issue and it doesn't
> seem to cause any memory leak. Please, can you make some tests in your
> environments and let me know if there is any problem with this patch.

That is almost certainly the wrong thing to do.  This test program
should show a memory leak with your patch.  If it doesn't show a memory
leak then something is screwed up in inotify as well.

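#include <errno.h>
#include <unistd.h>
#include <sys/inotify.h>

int main(void)
{
	int fd;
	int rc;
	struct inotify_event event[10];

	fd = inotify_init();
	if (fd < 0)
		return errno;

	/* Add and remove the same watch forever; each cycle allocates one mark. */
	while (1) {
		rc = inotify_add_watch(fd, "/tmp", IN_CLOSE_WRITE);
		if (rc < 0)
			return errno;

		/* rc is the watch descriptor; removing it queues the mark for destruction. */
		rc = inotify_rm_watch(fd, rc);
		if (rc)
			return errno;

		/* Drain the IN_IGNORED event generated by the removal. */
		rc = read(fd, event, sizeof(event));
		if (rc < 0)
			return errno;
	}

	return 0;
}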

The lifetime of an object is supposed to be from fsnotify_init_mark()
until its matching reference is dropped in fsnotify_mark_destroy().  It
sounds to me like we are calling put somewhere in the audit code when we
didn't previously call a get....

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Kernel oops+crash on repeated auditd restarts
  2012-04-23 16:26                               ` Eric Paris
@ 2012-04-24  1:27                                 ` Peter Moody
  2012-04-24  5:12                                 ` Marcelo Cerri
  1 sibling, 0 replies; 24+ messages in thread
From: Peter Moody @ 2012-04-24  1:27 UTC (permalink / raw)
  To: Eric Paris; +Cc: linux-audit

On Mon, Apr 23, 2012 at 9:26 AM, Eric Paris <eparis@redhat.com> wrote:
> On Fri, 2012-04-20 at 23:14 -0300, Marcelo Cerri wrote:
>
>> The patch below increments the reference count of a mark when it is
>> added to the destroy list. It seems to solve the issue and it doesn't
>> seem to cause any memory leak. Please, can you make some tests in your
>> environments and let me know if there is any problem with this patch.
>
> That is almost certainly the wrong thing to do.  This test program
> should show a memory leak with your patch.  If it doesn't show a memory
> leak then something is screwed up in inotify as well.
>
> #include <errno.h>
> #include <unistd.h>
> #include <sys/inotify.h>
>
> int main(void)
> {
>        int fd;
>        int rc;
>        struct inotify_event event[10];
>
>        fd = inotify_init();
>        if (fd < 0)
>                return errno;
>
>        while(1) {
>                rc = inotify_add_watch(fd, "/tmp", IN_CLOSE_WRITE);
>                if (rc < 0)
>                        return errno;
>
>                rc = inotify_rm_watch(fd, rc);
>                if (rc)
>                        return errno;
>
>                rc = read(fd, event, sizeof(event));
>                if (rc < 0)
>                        return errno;
>        }
>
>        return 0;
> }
>
> The lifetime of an object is supposed to be from fsnotify_init_mark()
> until its matching reference is dropped in fsnotify_mark_destroy().  It
> sounds to me like we are calling put somewhere in the audit code when we
> didn't previously call a get....
>

FWIW, bisecting points me to 75c1be487a690db43da2c1234fcacd84c982803c

75c1be487a690db43da2c1234fcacd84c982803c is the first bad commit
commit 75c1be487a690db43da2c1234fcacd84c982803c
Author: Eric Paris <eparis@redhat.com>
Date:   Wed Jul 28 10:18:38 2010 -0400

    fsnotify: srcu to protect read side of inode and vfsmount locks

    Currently reading the inode->i_fsnotify_marks or
    vfsmount->mnt_fsnotify_marks lists are protected by a spinlock on both the
    read and the write side.  This patch protects the read side of those lists
    with a new single srcu.

    Signed-off-by: Eric Paris <eparis@redhat.com>

:040000 040000 4b5d9b446eefaca96f8a89b8e9c2ef18da88534e 1abcff76e285ae57f5855b60857ef1708e937a0c M	fs
:040000 040000 a02d4ab5b164aa9282a342d73ebe3658f88b4539 3ca9f66ba26cc265d118e6c8558ff2214b9ed192 M	include

Cheers,
peter

-- 
Peter Moody      Google    1.650.253.7306
Security Engineer  pgp:0xC3410038


* Re: Kernel oops+crash on repeated auditd restarts
  2012-04-23 16:26                               ` Eric Paris
  2012-04-24  1:27                                 ` Peter Moody
@ 2012-04-24  5:12                                 ` Marcelo Cerri
  2012-04-24 18:31                                   ` Eric Paris
  1 sibling, 1 reply; 24+ messages in thread
From: Marcelo Cerri @ 2012-04-24  5:12 UTC (permalink / raw)
  To: Eric Paris; +Cc: linux-audit

On Mon, 23 Apr 2012 12:26:16 -0400
Eric Paris <eparis@redhat.com> wrote:

> On Fri, 2012-04-20 at 23:14 -0300, Marcelo Cerri wrote:
> 
> > The patch below increments the reference count of a mark when it is
> > added to the destroy list. It seems to solve the issue and it
> > doesn't seem to cause any memory leak. Could you please run some
> > tests in your environments and let me know if there is any problem
> > with this patch?
> 
> That is almost certainly the wrong thing to do.  This test program
> should show a memory leak with your patch.  If it doesn't show a
> memory leak then something is screwed up in inotify as well.

Sorry, I should have tested the other features that also make use of
fsnotify. You're right, my patch adds a memory leak for inotify (and
probably for dnotify and fanotify too).
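
To make the leak concrete, here is a minimal userspace sketch of the
reference counting involved. It is not kernel code: struct model_mark
and the model_* helpers are hypothetical stand-ins, and the sketch
assumes the simplified rule Eric describes, where a mark starts with
one reference at init time and the destroy path drops exactly that
reference.

```c
#include <assert.h>

/* Hypothetical stand-in for a mark; not the kernel's fsnotify_mark. */
struct model_mark {
	int refcnt;	/* outstanding references */
	int freed;	/* set to 1 once the last reference is dropped */
};

/* Lifetime starts with a single reference, as at init time. */
static void model_init(struct model_mark *m)
{
	m->refcnt = 1;
	m->freed = 0;
}

/* Drop one reference; the object is freed when the count hits zero. */
static void model_put(struct model_mark *m)
{
	assert(m->refcnt > 0);
	if (--m->refcnt == 0)
		m->freed = 1;
}

/*
 * Model one add-watch / remove-watch cycle, as in the inotify loop.
 * get_on_destroy_list mimics the earlier patch: one extra reference
 * is taken when the mark is queued for destruction.
 * Returns 1 if the mark is never freed (a leak), 0 otherwise.
 */
static int model_cycle_leaks(int get_on_destroy_list)
{
	struct model_mark m;

	model_init(&m);
	if (get_on_destroy_list)
		m.refcnt++;	/* extra get with no matching put */
	model_put(&m);		/* destroy path drops the init reference */
	return !m.freed;
}
```

In this model, model_cycle_leaks(1) reports a leak on every cycle,
while model_cycle_leaks(0) does not.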

> ...
> 
> The lifetime of an object is supposed to be from fsnotify_init_mark()
> until its matching reference is dropped in fsnotify_mark_destroy().
> It sounds to me like we are calling put somewhere in the audit code
> when we didn't previously call a get....
> 

Considering that the issue is specific to audit and seems to occur
only with watches on directories, I investigated audit_tree.c and
found a probable cause. untag_chunk() takes a reference to a mark at
the beginning of the function and releases it at the end (at the
"out" label). However, on the paths that jump to "out" after
destroying the mark, it calls fsnotify_put_mark() once more.
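
The imbalance can be sketched in userspace as follows. This is only
an illustration of the counting, not the kernel implementation:
struct toy_mark and the toy_* helpers are made-up names, and
toy_destroy() is assumed to drop the initial reference immediately,
whereas the real destroy path may defer that drop.

```c
#include <assert.h>
#include <stdbool.h>

/* Made-up stand-in for a mark; only the reference count is modelled. */
struct toy_mark {
	int refcnt;	/* outstanding references */
	bool freed;	/* true once the count has reached zero */
	int bad_puts;	/* puts issued after the mark was already freed */
};

static void toy_init(struct toy_mark *m)
{
	m->refcnt = 1;	/* reference held from init until destroy */
	m->freed = false;
	m->bad_puts = 0;
}

static void toy_get(struct toy_mark *m)
{
	assert(!m->freed);
	m->refcnt++;
}

static void toy_put(struct toy_mark *m)
{
	if (m->freed) {
		m->bad_puts++;	/* in the kernel this is a use-after-free */
		return;
	}
	if (--m->refcnt == 0)
		m->freed = true;
}

/* Assumed behaviour: destroying the mark drops the init reference. */
static void toy_destroy(struct toy_mark *m)
{
	toy_put(m);
}

/*
 * Mirrors the shape of the function described above.
 * extra_put selects the unpatched behaviour.
 * Returns the number of puts made on an already freed mark.
 */
static int toy_untag(bool extra_put)
{
	struct toy_mark m;

	toy_init(&m);
	toy_get(&m);		/* reference taken at the start */
	toy_destroy(&m);
	if (extra_put)
		toy_put(&m);	/* the put removed by the patch */
	toy_put(&m);		/* "out" label: matches the get above */
	return m.bad_puts;
}
```

With extra_put set, the last put lands on a mark whose count has
already reached zero; without it, the gets and puts balance.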

Peter and Valentin, can you test this new patch to check if it
solves the oops problem?

Eric, do you agree with this solution?

Regards,
Marcelo

---
 kernel/audit_tree.c |    2 --
 1 file changed, 2 deletions(-)

diff --git a/kernel/audit_tree.c b/kernel/audit_tree.c
index 5bf0790..b5bd9f9 100644
--- a/kernel/audit_tree.c
+++ b/kernel/audit_tree.c
@@ -250,7 +250,6 @@ static void untag_chunk(struct node *p)
        spin_unlock(&hash_lock);
        spin_unlock(&entry->lock);
        fsnotify_destroy_mark(entry);
-       fsnotify_put_mark(entry);
        goto out;
    }
 
@@ -293,7 +292,6 @@ static void untag_chunk(struct node *p)
    spin_unlock(&hash_lock);
    spin_unlock(&entry->lock);
    fsnotify_destroy_mark(entry);
-   fsnotify_put_mark(entry);
    goto out;
 
 Fallback:
-- 
1.7.9.4


* Re: Kernel oops+crash on repeated auditd restarts
  2012-04-24  5:12                                 ` Marcelo Cerri
@ 2012-04-24 18:31                                   ` Eric Paris
  2012-04-24 18:38                                     ` Peter Moody
  0 siblings, 1 reply; 24+ messages in thread
From: Eric Paris @ 2012-04-24 18:31 UTC (permalink / raw)
  To: Marcelo Cerri; +Cc: linux-audit

On Tue, 2012-04-24 at 02:12 -0300, Marcelo Cerri wrote:
> On Mon, 23 Apr 2012 12:26:16 -0400, Eric Paris <eparis@redhat.com> wrote:

> Considering that the issue is specific to audit and it seems to occur
> only with watches on directories, I investigated the audit_tree.c file
> and found a probable cause. The untag_chunk() holds a reference to a
> mark at the beginning of the function and releases it at the end of it (on
> the label out). However when it jumps to the "out" label, it calls
> fsnotify_put_mark once more.
> 
> Peter and Valentin, can you test this new patch to check if it
> solves the oops problem?
> 
> Eric, do you agree with this solution?
> 
> Regards,
> Marcelo
> 
> ---
>  kernel/audit_tree.c |    2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/kernel/audit_tree.c b/kernel/audit_tree.c
> index 5bf0790..b5bd9f9 100644
> --- a/kernel/audit_tree.c
> +++ b/kernel/audit_tree.c
> @@ -250,7 +250,6 @@ static void untag_chunk(struct node *p)
>         spin_unlock(&hash_lock);
>         spin_unlock(&entry->lock);
>         fsnotify_destroy_mark(entry);
> -       fsnotify_put_mark(entry);
>         goto out;
>     }
>  
> @@ -293,7 +292,6 @@ static void untag_chunk(struct node *p)
>     spin_unlock(&hash_lock);
>     spin_unlock(&entry->lock);
>     fsnotify_destroy_mark(entry);
> -   fsnotify_put_mark(entry);
>     goto out;
>  
>  Fallback:

This looks right to me.  The old audit logic before the switch to
fsnotify was:
-       inotify_evict_watch(&chunk->watch);
-       mutex_unlock(&chunk->watch.inode->inotify_mutex);
-       put_inotify_watch(&chunk->watch);

Which I changed to:
+       spin_unlock(&entry->lock);
+       fsnotify_destroy_mark_by_entry(entry);
+       fsnotify_put_mark(entry);

The difference is that inotify_evict_watch() took a reference on
chunk->watch, whereas fsnotify_destroy_mark_by_entry() does not.  So the
fsnotify_put_mark() was incorrect.

I'd love to hear testing results, and I'm going to try to figure out if
I screwed that up other places....

-Eric


* Re: Kernel oops+crash on repeated auditd restarts
  2012-04-24 18:31                                   ` Eric Paris
@ 2012-04-24 18:38                                     ` Peter Moody
  2012-04-24 19:06                                       ` Eric Paris
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Moody @ 2012-04-24 18:38 UTC (permalink / raw)
  To: Eric Paris; +Cc: linux-audit

On Tue, Apr 24, 2012 at 11:31 AM, Eric Paris <eparis@redhat.com> wrote:
> On Tue, 2012-04-24 at 02:12 -0300, Marcelo Cerri wrote:
>> On Mon, 23 Apr 2012 12:26:16 -0400, Eric Paris <eparis@redhat.com> wrote:
>
>> Considering that the issue is specific to audit and it seems to occur
>> only with watches on directories, I investigated the audit_tree.c file
>> and found a probable cause. The untag_chunk() holds a reference to a
>> mark at the beginning of the function and releases it at the end of it (on
>> the label out). However when it jumps to the "out" label, it calls
>> fsnotify_put_mark once more.
>>
>> Peter and Valentin, can you test this new patch to check if it
>> solves the oops problem?
>>
>> Eric, do you agree with this solution?
>>
>> Regards,
>> Marcelo
>>
>> ---
>>  kernel/audit_tree.c |    2 --
>>  1 file changed, 2 deletions(-)
>>
>> diff --git a/kernel/audit_tree.c b/kernel/audit_tree.c
>> index 5bf0790..b5bd9f9 100644
>> --- a/kernel/audit_tree.c
>> +++ b/kernel/audit_tree.c
>> @@ -250,7 +250,6 @@ static void untag_chunk(struct node *p)
>>         spin_unlock(&hash_lock);
>>         spin_unlock(&entry->lock);
>>         fsnotify_destroy_mark(entry);
>> -       fsnotify_put_mark(entry);
>>         goto out;
>>     }
>>
>> @@ -293,7 +292,6 @@ static void untag_chunk(struct node *p)
>>     spin_unlock(&hash_lock);
>>     spin_unlock(&entry->lock);
>>     fsnotify_destroy_mark(entry);
>> -   fsnotify_put_mark(entry);
>>     goto out;
>>
>>  Fallback:
>
> This looks right to me.  The old audit logic before the switch to
> fsnotify was:
> -       inotify_evict_watch(&chunk->watch);
> -       mutex_unlock(&chunk->watch.inode->inotify_mutex);
> -       put_inotify_watch(&chunk->watch);
>
> Which I changed to:
> +       spin_unlock(&entry->lock);
> +       fsnotify_destroy_mark_by_entry(entry);
> +       fsnotify_put_mark(entry);
>
> The difference is that inotify_evict_watch() took a reference on
> chunk->watch, whereas fsnotify_destroy_mark_by_entry() does not.  So the
> fsnotify_put_mark() was incorrect.
>
> I'd love to hear testing results, and I'm going to try to figure out if
> I screwed that up other places....

I'm testing this now. It looks good WRT the crash. I need to spend
some more time testing to be sure memory isn't leaking anywhere.


> -Eric
>



-- 
Peter Moody      Google    1.650.253.7306
Security Engineer  pgp:0xC3410038


* Re: Kernel oops+crash on repeated auditd restarts
  2012-04-24 18:38                                     ` Peter Moody
@ 2012-04-24 19:06                                       ` Eric Paris
  0 siblings, 0 replies; 24+ messages in thread
From: Eric Paris @ 2012-04-24 19:06 UTC (permalink / raw)
  To: Peter Moody; +Cc: linux-audit

On Tue, 2012-04-24 at 11:38 -0700, Peter Moody wrote:
> On Tue, Apr 24, 2012 at 11:31 AM, Eric Paris <eparis@redhat.com> wrote:
> > On Tue, 2012-04-24 at 02:12 -0300, Marcelo Cerri wrote:
> >> On Mon, 23 Apr 2012 12:26:16 -0400, Eric Paris <eparis@redhat.com> wrote:

> > I'd love to hear testing results, and I'm going to try to figure out if
> > I screwed that up other places....
> 
> I'm testing this now. It looks good WRT the crash. I need to spend
> some more time testing to be sure memory isn't leaking anywhere.

I just sent another version which fixes a couple of other places where
I believe I was doing ref counting wrong in the audit_tree code.
Hopefully everyone can give that one a whirl....

-Eric


Thread overview: 24+ messages
2012-01-25 16:45 Kernel oops+crash on repeated auditd restarts Valentin Avram
2012-01-25 16:53 ` Peter Moody
2012-01-25 19:20 ` Eric Paris
2012-01-26  7:13   ` Valentin Avram
2012-02-08 16:11     ` Valentin Avram
2012-03-05  8:35       ` Valentin Avram
2012-03-28 20:51         ` Peter Moody
2012-03-28 22:42           ` Peter Moody
2012-03-29  1:14             ` Eric Paris
2012-03-29  6:44               ` Valentin Avram
2012-04-03 16:15                 ` Peter Moody
2012-04-05 21:03                   ` Peter Moody
2012-04-05 21:07                     ` Eric Paris
2012-04-17 17:56                       ` Peter Moody
2012-04-17 18:24                         ` Peter Moody
2012-04-17 21:54                           ` Peter Moody
2012-04-21  2:14                             ` Marcelo Cerri
2012-04-23 16:05                               ` Peter Moody
2012-04-23 16:26                               ` Eric Paris
2012-04-24  1:27                                 ` Peter Moody
2012-04-24  5:12                                 ` Marcelo Cerri
2012-04-24 18:31                                   ` Eric Paris
2012-04-24 18:38                                     ` Peter Moody
2012-04-24 19:06                                       ` Eric Paris
