From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S270354AbTGRTvv (ORCPT ); Fri, 18 Jul 2003 15:51:51 -0400
Received: (majordomo@vger.kernel.org) by vger.kernel.org id S270360AbTGRTvv (ORCPT ); Fri, 18 Jul 2003 15:51:51 -0400
Received: from e35.co.us.ibm.com ([32.97.110.133]:41692 "EHLO e35.co.us.ibm.com") by vger.kernel.org with ESMTP id S270354AbTGRTvk (ORCPT ); Fri, 18 Jul 2003 15:51:40 -0400
Date: Fri, 18 Jul 2003 15:06:33 -0500
From: linas@austin.ibm.com
To: linux-kernel@vger.kernel.org
Subject: KDB in the mainstream 2.4.x kernels?
Message-ID: <20030718150633.A50102@forte.austin.ibm.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5.1i
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

Hi,

Will there be a day that I can expect to find KDB in the 2.4.x kernel? I know that the traditional answer has been 'never', but I would like the various influencers and decision makers to reconsider ...

I agree with Linus Torvalds that debuggers are 100% useless when you are working on code that you know intimately. I know, I've written a lot of code, I'm proud of it, and I sneer at people who use words like 'development environment'. Crap, if you can't figure out why your code crashed, you shouldn't be a programmer.

But these days, I am not debugging my code. I'm debugging code that I've never seen before. And for that, I use KDB.

Right now, I work in a job where the *only* thing that I do is to analyze and sometimes (when I'm lucky) fix kernel crashes. It's all I do. I don't write any new code, don't do any porting at all. I also don't debug any 2.5/2.6 'unstable' kernels, nor do I handle any new/unstable device drivers. I focus entirely on the 2.4.x kernels, and, with a small team here, there are more than enough kernel bugs to keep us all completely busy.

The crashes are generated by a test team of 8 people with dozens of machines. Ostensibly their mission is to test new hardware, but in fact, almost all the crashes that they find are kernel bugs. The *only* thing that the test team does is to run stress tests. Basic stuff. Kernel stress. File create/delete/copy. Reiser, jfs, ext3, swap, OOM, scsi. Network, nfs, samba. Some tests take hours to crash the kernel, some take days. But the kernel crashes. It's always crashing. Corruption, races, missing locks, typos, bad hardware, you name it. When I get it, it has a KDB prompt in front of it.

KDB is great. I can figure out where it crashed, I can look at the assembly, I can examine memory locations. I can chase pointers by hand. And I can do it all symbolically, with the symbol names in front of me. Now, KDB rarely points right at the bug, but it is invaluable for figuring out where to start looking. Sometimes I even find the bug, often I don't.

But anyway, this is all academic, because it's at work, in a controlled environment, where I have the time and resources I need. The real reason I write this note is that I want to have the same capability at home. It suddenly occurred to me that the servers I run at home sometimes (rarely) crash with the same symptoms as those at work. Sure, I can probably blame buggy PC hardware. But ... I dunno. I've been consistently ignoring these crashes because it's just too much of a hassle to try to debug them. It's not worth the effort. But hey ... if I had KDB at home ... maybe it would be worth looking into the hangs. I could see getting motivated to look into some of these. At least get some idea of where the machine got hung. Maybe no fix, but at least somewhere to lay the blame.

Yes, of course I could just apply the KDB patches myself, but frankly it's a hassle. I already play the patch game and I hate it. Every new kernel, I have to try to remember where to find patch x, how to apply it, fix up this and that ... it's just plain painful.

I know that this is not a forceful argument. But crashes are a fact of life, whatever the reason may be. And the crashes almost always happen in a piece of code I have *never* laid eyes on before, so it's unrealistic to try to puzzle it out with the small dollop of info from magic SysRq. Debuggers can be useless, or worse than useless, when you are a developer on a piece of code you know well. But when plunging into foreign territory, all the tools and firepower that you can muster are worth every bit. This is why KDB belongs in the mainstream kernel distros.

--linas