From: Peter Moody
Subject: Re: Kernel oops+crash on repeated auditd restarts
Date: Thu, 5 Apr 2012 14:03:57 -0700
References: <1327519203.4131.25.camel@localhost> <1332983643.384.8.camel@localhost>
To: Valentin Avram
Cc: linux-audit@redhat.com
List-Id: linux-audit@redhat.com

(please let me know if I should take this off-list)

One other thing (again, maybe already known): this seems to be
exacerbated by SMP. On my machine, I can't reproduce the crash if I
boot with maxcpus=1.

Still hunting.

Cheers,
peter

On Tue, Apr 3, 2012 at 9:15 AM, Peter Moody wrote:
> This may already be known, but the issue seems to be limited to watch
> rules. With any watch rules, I can reliably crash my machine while
> freeing a watch rule after only starting/stopping auditd a few times.
> With no watch rules, I have no issues.
>
> Cheers,
> peter
>
> On Wed, Mar 28, 2012 at 11:44 PM, Valentin Avram wrote:
>> Yes, I know that patch. It made it into kernel 3.2.2. I tested it
>> successfully (oops in 3.2.1, no oops in 3.2.9), but this oops I'm seeing is
>> also in 3.2.9.
>>
>> I monitored the changelogs from 3.2.1 to 3.2.12, but there were no fixes
>> either in the audit subsystem or in fsnotify. I'll try to reproduce in the
>> latest 3.2.13 and repost the oops, but I'm 99% confident it will be the same.
>>
>> Sadly nobody except you seems to pay attention to this problem, probably
>> because it requires special conditions to reproduce (really, who starts and
>> stops auditd every 5 seconds on a production server?). We only ran into it
>> because one of our servers would randomly oops and then freeze about once a
>> month after stopping and then starting auditd every morning (and the
>> stop-start sequence was needed to work around a bug somewhere that would
>> hang a gzip running on a file outside a watched folder).
>>
>> Anyway, as a last note, I have a feeling that the oops is not exactly
>> random; there is a pattern, I just haven't figured it out completely yet.
>>
>> Will keep you up to date with the things I find out.
>>
>> V.
>>
>> On Mar 29, 2012 4:14 AM, "Eric Paris" wrote:
>>>
>>> That patch fixes a BUG(). The report has a NULL ptr deref and some
>>> apparent list corruption.... Sadly they aren't the same....
>>>
>>> On Wed, 2012-03-28 at 15:42 -0700, Peter Moody wrote:
>>> > fyi: this patch [1] seems to fix the issue for me. The explanation in
>>> > the subject would reliably oops my machine.
>>> >
>>> > [1]
>>> > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=fed474857efbed79cd390d0aee224231ca718f63
>>> >
>>> > On Wed, Mar 28, 2012 at 1:51 PM, Peter Moody wrote:
>>> > > Are you still able to reliably reproduce this oops? I'm trying to
>>> > > track this down because this bug (or a very similar bug) is causing
>>> > > some significant headaches here at work, but I haven't had a lot of
>>> > > luck. I'm using usermode linux, though, so that might be interfering
>>> > > with things.
>>> > >
>>> > > On Mon, Mar 5, 2012 at 12:35 AM, Valentin Avram wrote:
>>> > >> Finally I found some time and a spare server to retest the oops and
>>> > >> list_add corruptions I was getting with the 3.x kernels and auditd 2.1.3.
>>> > >>
>>> > >> I have now tested with Gentoo's latest stable 3.2.1-gentoo-r2 and
>>> > >> kernel.org's 3.2.9.
>>> > >>
>>> > >> Both get the oops/BUG in the same way, and after that they keep pouring
>>> > >> out list_add corruptions with audit_prune_tre(truncated?) and auditctl as
>>> > >> comms.
>>> > >>
>>> > >> Since this is not about Gentoo's kernel only, I'll post here the oops in
>>> > >> 3.2.9 and also attach some list_add corruptions.
>>> > >>
>>> > >> 3.2.9 BUG:
>>> > >>
>>> > >> kernel: [  301.240011] BUG: unable to handle kernel NULL pointer dereference at   (null)
>>> > >> kernel: [  301.240305] IP: [] __list_del_entry+0x20/0xe0
>>> > >> kernel: [  301.240481] *pdpt = 0000000000000000 *pde = f000ddc8f000ddc8
>>> > >> kernel: [  301.240698] Oops: 0000 [#1] SMP
>>> > >> kernel: [  301.240910]
>>> > >> kernel: [  301.241030] Pid: 642, comm: fsnotify_mark Not tainted 3.2.9-drbd-version3 #1 Dell Inc. PowerEdge 2950/0CX396
>>> > >> kernel: [  301.241370] EIP: 0060:[] EFLAGS: 00010287 CPU: 6
>>> > >> kernel: [  301.241498] EIP is at __list_del_entry+0x20/0xe0
>>> > >> kernel: [  301.241623] EAX: f4fae544 EBX: f47cffa4 ECX: ffffffff EDX: 00000000
>>> > >> kernel: [  301.241751] ESI: f4fae544 EDI: f4fae508 EBP: f47cff7c ESP: f47cff64
>>> > >> kernel: [  301.241879]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
>>> > >> kernel: [  301.242005] Process fsnotify_mark (pid: 642, ti=f47ce000 task=f4f47c00 task.ti=f47ce000)
>>> > >> kernel: [  301.242207] Stack:
>>> > >> kernel: [  301.242327]  c10813c0 f47cffa4 f4f47c00 f4e70888 f47cff7c f47cffa4 f47cffb8 c10f6976
>>> > >> kernel: [  301.242882]  ffffffc3 f4f47c00 f4f47c00 00000000 f4f47c00 c10530c0 f47cff9c f47cff9c
>>> > >> kernel: [  301.243438]  f4fae544 f4fae544 f4c47f58 00000000 c10f68f0 f47cffe4 c1052834 00000000
>>> > >> kernel: [  301.243995] Call Trace:
>>> > >> kernel: [  301.244119]  [] ? rcu_check_callbacks+0x110/0x110
>>> > >> kernel: [  301.244248]  [] fsnotify_mark_destroy+0x86/0x120
>>> > >> kernel: [  301.244377]  [] ? abort_exclusive_wait+0x80/0x80
>>> > >> kernel: [  301.244504]  [] ? fsnotify_put_mark+0x30/0x30
>>> > >> kernel: [  301.244631]  [] kthread+0x74/0x80
>>> > >> kernel: [  301.244756]  [] ? kthread_flush_work_fn+0x10/0x10
>>> > >> kernel: [  301.244885]  [] kernel_thread_helper+0x6/0xd
>>> > >> kernel: [  301.245011] Code: 55 f4 8b 45 f8 e9 75 ff ff ff 90 55 89 e5 53 83 ec 14 8b 08 8b 50 04 81 f9 00 01 10 00 74 24 81 fa 00 02 20 00 0f 84 8e 00 00 00 <8b> 1a 39 d8 75 62 8b 59 04 39 d8 75 35 89 51 04 89 0a 83 c4 14
>>> > >> kernel: [  301.248195] EIP: [] __list_del_entry+0x20/0xe0 SS:ESP 0068:f47cff64
>>> > >> kernel: [  301.248414] CR2: 0000000000000000
>>> > >> kernel: [  301.248538] ---[ end trace 15082dbfb353f84c ]---
>>> > >>
>>> > >> The kernel was compiled with the following DEBUG support (the bolded
>>> > >> ones were requested by Gentoo's Dev):
>>> > >> CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
>>> > >> CONFIG_SLUB_DEBUG=y
>>> > >> CONFIG_HAVE_DMA_API_DEBUG=y
>>> > >> CONFIG_X86_DEBUGCTLMSR=y
>>> > >> CONFIG_PNP_DEBUG_MESSAGES=y
>>> > >> CONFIG_AIC94XX_DEBUG=y
>>> > >> CONFIG_USB_DEBUG=y
>>> > >> CONFIG_DEBUG_KERNEL=y
>>> > >> CONFIG_SCHED_DEBUG=y
>>> > >> CONFIG_DEBUG_RT_MUTEXES=y
>>> > >> CONFIG_DEBUG_PI_LIST=y
>>> > >> CONFIG_DEBUG_BUGVERBOSE=y
>>> > >> CONFIG_DEBUG_INFO=y
>>> > >> CONFIG_DEBUG_MEMORY_INIT=y
>>> > >> CONFIG_DEBUG_LIST=y
>>> > >> CONFIG_DEBUG_STACKOVERFLOW=y
>>> > >> CONFIG_DEBUG_RODATA=y
>>> > >> CONFIG_DEBUG_RODATA_TEST=y
>>> > >>
>>> > >> I attached the kernel config I used for 3.2.9 to generate this oops and
>>> > >> the warnings.
>>> > >>
>>> > >> From the list_add warnings that come after: out of the 805 warnings I
>>> > >> processed, after masking with XXXXX the PID and next= values that kept
>>> > >> changing in every one, I got 26 distinct MD5 sums. I also attached the
>>> > >> relevant files as an archive to this email.
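The masking and MD5 grouping described above could be sketched roughly as follows. This is only an illustration: the thread does not show the text of the list_add warnings, so the two regular expressions, the helper names `mask_warning` and `group_by_md5`, and the sample strings are all assumptions, not the script that was actually used.

```python
import hashlib
import re
from collections import Counter

# Hypothetical patterns: the warning text is not shown in this thread, so
# these regexes are assumptions about where the PID and next= values appear.
PID_RE = re.compile(r"(Pid: )\d+")
NEXT_RE = re.compile(r"(next=)[0-9a-fA-F]+")


def mask_warning(text):
    """Replace the PID and next= values with XXXXX so like warnings compare equal."""
    text = PID_RE.sub(r"\g<1>XXXXX", text)
    return NEXT_RE.sub(r"\g<1>XXXXX", text)


def group_by_md5(warnings):
    """Return a Counter mapping the MD5 of each masked warning to its count."""
    return Counter(
        hashlib.md5(mask_warning(w).encode("utf-8")).hexdigest() for w in warnings
    )


if __name__ == "__main__":
    # Made-up sample warnings, not taken from the attached archive.
    samples = [
        "list_add corruption. (next=f4fae544). Pid: 642, comm: auditctl",
        "list_add corruption. (next=f4c47f58). Pid: 901, comm: auditctl",
        "list_add corruption. (next=f4fae544). Pid: 642, comm: audit_prune_tre",
    ]
    groups = group_by_md5(samples)
    print(len(samples), "warnings,", len(groups), "distinct MD5 sums")
```

Any other field that changes between otherwise identical warnings (a timestamp, for instance) would need its own mask, or each warning would hash differently.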
>>> > >>
>>> > >> The Gentoo bug I opened is sleeping; it seems nobody has the time to at
>>> > >> least test and confirm or refute the problems I'm seeing (or everybody
>>> > >> thinks that nobody would restart auditd so often, so the bug is not
>>> > >> that serious).
>>> > >>
>>> > >>
>>> > >> Thank you for your time.
>>> > >>
>>> > >> On Wed, Feb 8, 2012 at 6:11 PM, Valentin Avram wrote:
>>> > >>
>>> > >>
>>> > >> --
>>> > >> Linux-audit mailing list
>>> > >> Linux-audit@redhat.com
>>> > >> https://www.redhat.com/mailman/listinfo/linux-audit
>>> > >
>>> > >
>>> > >
>>> > > --
>>> > > Peter Moody      Google    1.650.253.7306
>>> > > Security Engineer  pgp:0xC3410038
>>> >
>>> >
>>> >
>>>
>>>
>>
>
>
>
> --
> Peter Moody      Google    1.650.253.7306
> Security Engineer  pgp:0xC3410038

--
Peter Moody      Google    1.650.253.7306
Security Engineer  pgp:0xC3410038
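For anyone trying to reproduce the reports in this thread (repeated auditd stop/start with at least one watch rule loaded), a loop along these lines might be a starting point. It is a sketch under assumptions: the thread does not say which commands or which watch rule were used, so the `service auditd` invocations, the watched path, the key name `repro`, and the function name `restart_cycle` are all hypothetical. Only run it as root on a disposable machine, since the point is to provoke the oops.

```python
import subprocess
import time

# Assumed commands: the thread does not say how auditd was started and stopped
# or which watch rule was loaded; adjust these for the init system in use.
STOP_CMD = ["service", "auditd", "stop"]
START_CMD = ["service", "auditd", "start"]
WATCH_RULE = ["auditctl", "-w", "/etc/passwd", "-p", "wa", "-k", "repro"]


def restart_cycle(cycles, delay=5.0, run=subprocess.call, sleep=time.sleep):
    """Stop and start auditd `cycles` times, adding one watch rule after each start.

    `run` and `sleep` are injectable so the loop can be dry-run without root.
    subprocess.call is used so a non-zero exit (for example, the rule already
    being loaded) does not abort the loop.
    """
    for _ in range(cycles):
        run(STOP_CMD)
        sleep(delay)
        run(START_CMD)
        run(WATCH_RULE)
        sleep(delay)
```

Calling `restart_cycle(3, delay=0, run=print, sleep=lambda s: None)` prints the command sequence without executing anything. The 5 second default delay mirrors the "every 5 seconds" remark earlier in the thread, and per the note at the top, booting with `maxcpus=1` may keep the crash from showing up at all.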