From: "Jason A. Donenfeld"
Date: Mon, 27 Feb 2017 04:22:34 +0100
Subject: Re: kernel warning with 0.0.20170223: entered softirq 3 NET_RX net_rx_action+0x0/0x760 with preempt_count 00000101, exited with 00000100?
To: Pipacs
Cc: Brad Spengler, WireGuard mailing list
References: <5e4ad220-6009-7ec9-95eb-ddccb994bb9e@gmail.com>
List-Id: Development discussion of WireGuard

Hey Pipacs,

I've been receiving reports of strange bugs from grsec users with
WireGuard. The first set of bugs was a heisenbug crash, and I never
found the root cause, but it seemed to happen in the rx path. Then
today Timothée emailed another different bug from a grsec box, also
along the rx path. This time it was related to the preemption count
being wrong coming into and going out of the rx softirq. This kind of
preemption mismatch, I figure, might account for the earlier bug I
never solved. So armed with this new information, I went hunting.

I followed the path inward, surrounding the body of each function with:

    int i = preempt_count();
    function_body...
    if (i != preempt_count())
        pr_err("LORDHAVEMERCY\n");

Eventually I isolated the bug to an interesting situation like this:

    int i = preempt_count();
    other_function(...);
    if (i != preempt_count())
        pr_err("This will print out\n");

    void other_function(int a)
    {
        int vla[a];
        int i = preempt_count();
        function_body...
        if (i != preempt_count())
            pr_err("This will NOT print out\n");
    }

Since I only got the outer print, I thought this was strange, so I
rearranged:

    void other_function(int a)
    {
        int i = preempt_count();
        int vla[a];
        if (i != preempt_count())
            pr_err("This will print out\n");
        function_body...
    }

Yay, we found the bug. But wtf, what could possibly be changing the
preempt_count there? So I went disassembling, and lo and behold the
clever PaX stack leak plugin was adding calls to pax_check_alloca.
Very nice! But still, why the preemption bug situation? I went
hunting further:

    void __used pax_check_alloca(unsigned long size)
    {
        ...
        case STACK_TYPE_IRQ:
            stack_left = sp & (IRQ_STACK_SIZE - 1);
            put_cpu();
            break;
        ...
    }

Do you see the bug? Looks like somebody snuck in a "put_cpu()" there,
where it really does not belong. "put_cpu()" basically just jiggers
the preempt_count. I can confirm that removing the erroneous call to
"put_cpu()" fixes the bug.

So, either this is by design, and there's some odd subtlety I'm
missing, or this is a bug that should be fixed in grsec/PaX. In the
case of the latter, I believe this introduces a security
vulnerability, since it opens up a whole host of interesting race
conditions that can be exploited.

Thanks,
Jason
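
For readers without a grsec kernel at hand, the imbalance described above can be modelled in plain userspace C. This is only an illustrative sketch, not kernel or PaX code: `model_preempt_count`, `model_get_cpu()`, `model_put_cpu()`, `balanced_check()`, `unbalanced_check()` and `count_changed_by()` are hypothetical names invented here. The one kernel fact it relies on is that get_cpu() disables preemption (raising the count) and put_cpu() re-enables it (lowering the count). The starting value 0x101 is taken from the warning in the subject line.

```c
#include <stdio.h>

/* Hypothetical stand-in for the kernel's preemption counter.
 * 0x101 is the value the warning reported on softirq entry. */
static int model_preempt_count = 0x101;

/* Models get_cpu(): disabling preemption raises the count by one. */
static void model_get_cpu(void) { model_preempt_count++; }

/* Models put_cpu(): enabling preemption lowers the count by one. */
static void model_put_cpu(void) { model_preempt_count--; }

/* A correctly paired section: the count is restored on exit. */
static void balanced_check(void)
{
    model_get_cpu();
    /* per-CPU work would go here */
    model_put_cpu();
}

/* Mirrors the shape of the STACK_TYPE_IRQ case quoted above:
 * a put with no matching get on this path. */
static void unbalanced_check(void)
{
    model_put_cpu();
}

/* Same technique as the bracketing in the mail: sample the count
 * before and after the call and report the difference. */
static int count_changed_by(void (*fn)(void))
{
    int before = model_preempt_count;

    fn();
    if (before != model_preempt_count)
        fprintf(stderr, "count went from %08x to %08x\n",
                (unsigned)before, (unsigned)model_preempt_count);
    return model_preempt_count - before;
}
```

In this model, `count_changed_by(balanced_check)` returns 0, while `count_changed_by(unbalanced_check)` returns -1 and takes the count from 00000101 to 00000100, the same one-step drop the softirq warning reported.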