From: Roland McGrath
To: Alan Stern
Cc: Prasanna S Panchamukhi, Kernel development list
Subject: Re: [RFC] hwbkpt: Hardware breakpoints (was Kwatch)
In-Reply-To: Alan Stern's message of Friday, 9 March 2007 13:40:42 -0500
Date: Tue, 13 Mar 2007 01:00:50 -0700 (PDT)
Message-Id: <20070313080150.7C9061801C5@magilla.sf.frob.com>
X-Windows: a mistake carried out to perfection.

> Well, I can add in the test for 0, but finding the set of always-on bits
> in DR6 will have to be done separately.  Isn't it possible that different
> CPUs could have different bits?

I don't know, but it seems unlikely.  AFAIK all CPUs are presumed to have
the same CPUID results, for example.

> At that moment D, running on CPU 1, decides to unregister a breakpoint in
> T.  Clearing TIF_DEBUG now doesn't do any good -- it's too late; CPU 0 has
> already tested it.  CPU 1 goes in and alters the user breakpoint data,
> maybe even deallocating a breakpoint structure that CPU 0 is about to
> read.  Not a good situation.

You make it sound like a really good case for RCU. ;-)

> No way to tell when a task being debugged is started up by anything other
> than its debugger?  Hmmm, in that case maybe it would be better to use
> RCU.  It won't add much overhead to anything but the code for registering
> and unregistering user breakpoints.

Indeed, it is for this sort of thing.

Still, it feels like a bit too much is going on in
switch_to_thread_hw_breakpoint for the common case.  It seems to me it
ought to be something as simple as this:

	if (unlikely((thbi->want_dr7 &~ chbi->kdr7) != thbi->active_tdr7)) {
		/* Need to make some installed or uninstalled callbacks. */
		if (thbi->active_tdr7 & chbi->kdr7)
			/* ... uninstalled callbacks ... */;
		else
			/* ... installed callbacks ... */;
		/* recompute active_tdr7 and want_dr7 */
	}

	switch (thbi->active_bkpts) {
	case 4: set_debugreg(thbi->tdr[0], 0);
	case 3: set_debugreg(thbi->tdr[1], 1);
	case 2: set_debugreg(thbi->tdr[2], 2);
	case 1: set_debugreg(thbi->tdr[3], 3);
	}
	set_debugreg(chbi->kdr7 | thbi->active_tdr7, 7);

Only in the unlikely case do you need to worry about synchronization,
whether it's RCU or spin locks or whatever.

The idea is that breakpoint installation when doing the fancy stuff would
reduce it to "the four breakpoints I would like, in order" (tdr[3]
containing the highest-priority one), and dr7 masks describing what dr7
you were using last time you were running (active_tdr7), and that plus
the enable bits you would like to have set (want_dr7).  The unlikely case
is when the number of kernel debugregs consumed changed since the last
time you switched in, so you go recompute active_tdr7.  (Put the body of
that if in another function.)

For the masks to work as I described, you need to use the same enable bit
(or both) for kernel and user allocations.  It really doesn't matter
which one you use, since all of Linux is "local" for the sense of the dr7
enable bits (i.e. you should just use DR_GLOBAL_ENABLE).

It's perfectly safe to access all this stuff while it might be getting
overwritten, and worst case you switch in some user breakpoints you
didn't want.  That only happens in the SIGKILL case, when you never hit
user mode again and don't care.
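To spell out the mask idea (just a sketch, not code from the patch, and
dr7_bits is a name I'm making up on top of the asm/debugreg.h constants):
each breakpoint occupying debug register hwnum contributes a fixed set of
dr7 bits, and kdr7, want_dr7, and active_tdr7 are each just the OR of
those per-slot values.

	/*
	 * dr7 contribution of one breakpoint in debug register hwnum:
	 * its 4-bit control field (type and len bits) plus the global
	 * enable bit for that register.
	 */
	static inline unsigned long dr7_bits(int hwnum, unsigned int type,
					     unsigned int len)
	{
		unsigned long bits;

		bits = (unsigned long) (type | len)
		       << (DR_CONTROL_SHIFT + hwnum * DR_CONTROL_SIZE);
		bits |= DR_GLOBAL_ENABLE << (hwnum * DR_ENABLE_SIZE);
		return bits;
	}

That is what makes the simple mask comparisons in the fast path above
sufficient.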
> +void switch_to_thread_hw_breakpoint(struct task_struct *tsk)
[...]
> +	/* Keep the DR7 bits that refer to kernel breakpoints */
> +	get_debugreg(dr7, 7);
> +	dr7 &= kdr7_masks[chbi->num_kbps];

I don't understand what this part is for.  Why fetch dr7 from the CPU?
You already know what's there.  All you need is the current dr7 bits
belonging to kernel allocations, i.e. a chbi->kdr7 mask.

> +	if (tsk && test_tsk_thread_flag(tsk, TIF_DEBUG)) {

switch_to_thread_hw_breakpoint is on the context switch path.  On that
path, this test can never be false.  The context switch path should not
have any unnecessary conditionals.  If you want to share code with some
other places that now call switch_to_thread_hw_breakpoint, they can
share a common inline for part of the guts.

> +	set_debugreg(dr7, 7);	/* Disable user bps while switching */

What is this for?  The kernel's dr7 bits are already set.  Why does it
matter if bits enabling user breakpoints are set too?  No user breakpoint
can be hit on this CPU before this function returns.

> +	/* Clear any remaining stale bp pointers */
> +	while (--i >= chbi->num_kbps)
> +		chbi->bps[i] = NULL;

Why is this done here?  This can be done when the kernel allocations are
installed/uninstalled.

> @@ -15,6 +15,7 @@ struct die_args {
>  	long err;
>  	int trapnr;
>  	int signr;
> +	int ret;
>  };

I don't understand why you added this at all.

> fastcall void __kprobes do_debug(struct pt_regs * regs, long error_code)
[...]
> +	if ((args.err & (DR_STEP|DR_TRAP0|DR_TRAP1|DR_TRAP2|DR_TRAP3)) ||
> +	    args.ret)
> +		send_sigtrap(tsk, regs, error_code);

The args.err test is fine.  A notifier that wants the SIGTRAP sent should
just leave the appropriate DR_* bit set rather than clear it.  In
hw_breakpoint_handler, you could just change:

	if (i >= chbi->num_kbps)
		data->ret = 1;

to:

	if (i < chbi->num_kbps)
		data->err &= ~(DR_TRAP0 << i);

But really I think you might as well just have the triggered callback
call send_sigtrap itself.  That's fine for ptrace_triggered.  When I get
to a utrace-based layer on this, it can either send the signal itself in
the same way, or do something else better if I have another option by
then.

After all that fun x86 implementation detail, now I have some comments
about the interface.  I took a gander at what implementing hw_breakpoint
on powerpc would be like, and it looks pretty simple.  It gave me some
more detailed thoughts about making the source API more uniformly
convenient across machines.

Firstly, everything but a few #define's should be in a shared file.
First I was thinking linux/hw_breakpoint.h, but now I think it should be
asm-generic/hw_breakpoint.h, and asm-{i386,x86_64,...}/hw_breakpoint.h do:

	#define HW_BREAKPOINT_...
	#include <asm-generic/hw_breakpoint.h>

That way, asm/hw_breakpoint.h is what modules #include, and that file is
just absent on machines without the support (as opposed to a
linux/hw_breakpoint.h that's there but not always useful).

> +struct hw_breakpoint {
[...]
> +	void	*address;

You might actually want to write this:

	union {
		void *kernel;
		void __user *user;
		unsigned long va;
	} address;

Setting the address uses the appropriate pointer.  Turning it into a
debug register value uses va.  This helps maintain discipline in the use
of __user, so that kernel analysis tools can reliably cite uses of user
or kernel addresses as right or wrong.

> +	u8	len;
> +	u8	type;

I don't think we actually want to expose these as fields in this way at
all.  Instead, just have a single field of machine-format bits, and then
"encoding" for dr7 values is just:

	dr7 |= bp->bits << hwnum;

This field is not set by the user directly, but by the registration
call.  It takes type and len arguments, validates and combines them.
There is no need for real encoding, just validation; the
asm/hw_breakpoint.h constants use the hardware bits under standard names:

	#define HW_BREAKPOINT_LEN1	DR_LEN_1
	...

On powerpc, the address breakpoint is always for an 8-byte address
range.  Also, it has distinct bits for breaking on loads and breaking on
stores, but no hardware instruction breakpoint is supported.  So it
would define:

	#define HW_BREAKPOINT_LEN8		0xc001d00d
	#define HW_BREAKPOINT_TYPE_READ		1
	#define HW_BREAKPOINT_TYPE_WRITE	2
	#define HW_BREAKPOINT_TYPE_RW		3

Using two args in the registration calls lets it do validation simply
with:

	if (len != HW_BREAKPOINT_LEN8 || (type &~ 3))
		return -EINVAL;
	bp->bits = type;

while still verifying that an explicit length is used by the caller.
The validation is also simple on x86, and then:

	bp->bits = ((len | type) << DR_CONTROL_SHIFT) | 2;

This source interface lets a module use #ifdef HW_BREAKPOINT_* to figure
out what's available without needing any specific machine knowledge.
Also, HW_BREAKPOINT_TYPE_EXECUTE should have a HW_BREAKPOINT_LEN_EXECUTE
that is the required len argument for that type, so callers don't have
to encode the machine knowledge of using LEN1 for execute.

The definition of "breakpoint length" on all the machines (whether a
single length is available or more) seems to be that the breakpoint
address has to be aligned to the length, and what it catches is any
access to any byte within that range.  I.e., the length means a mask of
low bits cleared from an address before it's compared to the breakpoint
address.  (On powerpc, the low three bits of the register are used as
flags, so only the high bits of the address are even stored.)  The
hw_breakpoint documentation should make this definition more explicit,
and it probably ought to enforce the alignment of the address specified.
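Enforcing it could be as simple as this in the registration call (again
just a sketch; len_bytes is a hypothetical byte count derived from the
validated len argument, since the HW_BREAKPOINT_LEN* values themselves
are machine-format bits, not byte counts):

	/*
	 * The breakpoint covers len_bytes bytes (a power of two)
	 * starting at an address aligned to len_bytes, so reject
	 * misaligned requests up front.
	 */
	if (bp->address.va & (len_bytes - 1))
		return -EINVAL;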
> +#define HW_BREAKPOINT_IO	0x2	/* trigger on I/O space access */

Let's not define a name for this while we are not really supporting it.

> +/* HW breakpoint status values */
> +#define HW_BREAKPOINT_REGISTERED	1
> +#define HW_BREAKPOINT_INSTALLED	2

This doesn't really need to be public outside of hw_breakpoint.c, does it?


Thanks,
Roland