* [patch 1/4] x86: remove bad comment @ 2007-12-10 11:38 Markus Metzger 2007-12-10 20:20 ` x86, ptrace: support for branch trace store(BTS) Ingo Molnar 0 siblings, 1 reply; 11+ messages in thread From: Markus Metzger @ 2007-12-10 11:38 UTC (permalink / raw) To: ak, hpa, linux-kernel, mingo, tglx Cc: markut.t.metzger, markus.t.metzger, suresh.b.siddha, roland, akpm, mtk.manpages Remove a comment that is no longer correct. The reconfiguration is done directly in __switch_to_xtra. Signed-off-by: Markus Metzger <markus.t.metzger@intel.com> --- Index: linux-2.6-x86/arch/x86/kernel/process_64.c =================================================================== --- linux-2.6-x86.orig/arch/x86/kernel/process_64.c 2007-12-10 09:35:26.%N +0100 +++ linux-2.6-x86/arch/x86/kernel/process_64.c 2007-12-10 09:36:38.%N +0100 @@ -598,10 +598,6 @@ memset(tss->io_bitmap, 0xff, prev->io_bitmap_max); } - /* - * Last branch recording recofiguration of trace hardware and - * disentangling of trace data per task. - */ if (test_tsk_thread_flag(prev_p, TIF_BTS_TRACE_TS)) ptrace_bts_take_timestamp(prev_p, BTS_TASK_DEPARTS); --------------------------------------------------------------------- Intel GmbH Dornacher Strasse 1 85622 Feldkirchen/Muenchen Germany Sitz der Gesellschaft: Feldkirchen bei Muenchen Geschaeftsfuehrer: Douglas Lusk, Peter Gleissner, Hannes Schwaderer Registergericht: Muenchen HRB 47456 Ust.-IdNr. VAT Registration No.: DE129385895 Citibank Frankfurt (BLZ 502 109 00) 600119052 This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: x86, ptrace: support for branch trace store(BTS) 2007-12-10 11:38 [patch 1/4] x86: remove bad comment Markus Metzger @ 2007-12-10 20:20 ` Ingo Molnar 2007-12-11 10:34 ` Metzger, Markus T 0 siblings, 1 reply; 11+ messages in thread From: Ingo Molnar @ 2007-12-10 20:20 UTC (permalink / raw) To: Markus Metzger Cc: ak, hpa, linux-kernel, tglx, markut.t.metzger, markus.t.metzger, suresh.b.siddha, roland, akpm, mtk.manpages, Alan Stern [ Cc:-ed Alan Stern a'ka kwatch fame - Alan might be interested in this too. ] hi Markus, finally had time to take a closer look at the design of your Branch-Trace-Support-via-ptrace-on-x86 patchset. I think we'll need to work some more on general API issues. here's the current proposed API: + case PTRACE_BTS_MAX_BUFFER_SIZE: + case PTRACE_BTS_ALLOCATE_BUFFER: + case PTRACE_BTS_GET_BUFFER_SIZE: + case PTRACE_BTS_READ_RECORD: + case PTRACE_BTS_CONFIG: + case PTRACE_BTS_STATUS: i can see a couple of open questions: 1) PTRACE_BTS_MAX_BUFFER_SIZE why is a trace buffer size limit visible to user-space? It's 4000 entries right now: #define PTRACE_BTS_BUFFER_MAX 4000 it would be more flexible if user-space could offer arbitrary sized buffers, which the kernel would attempt to mlock(). Then those pages could be fed to the BTS MSR. I.e. there should be no hard limit in the API, other than a natural resource limit. 2) struct bts_struct the structure of it is hardwired: the basic unit is an array of bts_struct: +struct bts_struct { + enum bts_qualifier qualifier; + union { + /* BTS_BRANCH */ + struct { + long from_ip; + long to_ip; + } lbr; + /* BTS_TASK_ARRIVES or + BTS_TASK_DEPARTS */ + unsigned long long timestamp; + } variant; +}; while other CPUs (on other architectures, etc.) might have a different raw format for the trace entries. 
So it would be better to implement BTS support in kernel/ptrace.c with a general trace format, and to provide a cross-arch API (only implemented by arch/x86 in the beginning) to translate the 'raw' trace entries into the general format. 3) it would certainly be useful for some applications to have a large "virtual" trace buffer and a small "real" trace buffer. The kernel would use BTS high-watermark interrupts to feed the real trace buffer into the larger "virtual" trace buffer. This way we wouldn't even have to use mlock() to pin down the trace buffer. Ingo ^ permalink raw reply [flat|nested] 11+ messages in thread
* RE: x86, ptrace: support for branch trace store(BTS) 2007-12-10 20:20 ` x86, ptrace: support for branch trace store(BTS) Ingo Molnar @ 2007-12-11 10:34 ` Metzger, Markus T 2007-12-11 14:53 ` Ingo Molnar 0 siblings, 1 reply; 11+ messages in thread From: Metzger, Markus T @ 2007-12-11 10:34 UTC (permalink / raw) To: Ingo Molnar Cc: ak, hpa, linux-kernel, tglx, markut.t.metzger, markus.t.metzger, Siddha, Suresh B, roland, akpm, mtk.manpages, Alan Stern >-----Original Message----- >From: Ingo Molnar [mailto:mingo@elte.hu] >Sent: Monday, December 10, 2007 21:21 >here's the current proposed API: > >+ case PTRACE_BTS_MAX_BUFFER_SIZE: >+ case PTRACE_BTS_ALLOCATE_BUFFER: >+ case PTRACE_BTS_GET_BUFFER_SIZE: >+ case PTRACE_BTS_READ_RECORD: >+ case PTRACE_BTS_CONFIG: >+ case PTRACE_BTS_STATUS: > >i can see a couple of open questions: > >1) PTRACE_BTS_MAX_BUFFER_SIZE > >why is a trace buffer size limit visible to user-space? It's 4000 >entries right now: > > #define PTRACE_BTS_BUFFER_MAX 4000 > >it would be more flexible if user-space could offer arbitrary sized >buffers, which the kernel would attempt to mlock(). Then those pages >could be fed to the BTS MSR. I.e. there should be no hard limit in the >API, other than a natural resource limit. That would be a variation on Andi's zero-copy proposal, wouldn't it? The user supplies the BTS buffer and the kernel manages DS. Andi further suggested a vDSO to interpret the data and translate the hardware format into a higher level user format. I take it that you would leave that inside ptrace. I need to look more into mlock. So far, I found a system call in /usr/include/sys/mman.h and two functions sys_mlock() and user_shm_lock() in the kernel. Is there a memory expert around who could point me to some interesting places to look at? Can we distinguish kernel-locked memory from user-locked memory? I could imagine a malicious user munlock()ing the buffer he provided to ptrace.
Is there a real difference between mlock()ing user memory and allocating kernel memory? There would be if we could page out mlock()ed memory when the user thread is not running. We would need to disable DS before paging out, and page in before enabling it. If we cannot, then kernel allocated memory would require less space in physical memory. >2) struct bts_struct > >the structure of it is hardwired: > >the basic unit is an array of bts_struct: > > +struct bts_struct { > + enum bts_qualifier qualifier; > + union { > + /* BTS_BRANCH */ > + struct { > + long from_ip; > + long to_ip; > + } lbr; > + /* BTS_TASK_ARRIVES or > + BTS_TASK_DEPARTS */ > + unsigned long long timestamp; > + } variant; > +}; > >while other CPUs (on other architectures, etc.) might have a different >raw format for the trace entries. So it would be better to >implement BTS >support in kernel/ptrace.c with a general trace format, and to >provide a >cross-arch API (only implemented by arch/x86 in the beginning) to >translate the 'raw' trace entries into the general format. The bts_struct is already an abstraction from the raw hardware format. The hardware uses: struct bts_netburst { void* from; void* to; long :4; long predicted :1; }; struct bts_core2 { u64 from; u64 to; u64 :64; }; I encode timestamps in the raw hardware format. This puts the timestamps at the correct places and frees me from any efforts to correlate separate timestamp data and branch record data (which is getting ugly when the BTS buffer overflows). Now, arch/x86/kernel/ds.c already needs to abstract from the raw hardware format. I chose to add timestamps to that abstraction. This makes the DS interface general enough to be directly used by ptrace (causing some misunderstanding, I'm afraid). If new architectures would eliminate the extra word, I would have to merge the qualifier and the escape address, i.e. have some (probably the least significant) bits of the escape address encode the qualifier. 
The interface would remain stable. Other architectures would at least need to encode the source and destination address. Thus, we should have enough room to support timestamps for other architectures, as well. If they choose a more compact encoding, the interface might no longer be usable. I could instead use struct ds_bts { void* from; void* to; }; for the DS interface and move timestamps and their encoding in the DS BTS format to ptrace. This would move the DS interface closer to the hardware (yet still abstract from details) and move anything extra to its users - currently only ptrace. On the down side, other users of DS (kernel tracing, for example) would have to invent their own method to encode additional information into the BTS buffer. I saw benefit in making the DS interface more powerful. >3) > >it would certainly be useful for some applications to have a large >"virtual" trace buffer and a small "real" trace buffer. The >kernel would >use BTS high-watermark interrupts to feed the real trace >buffer into the >larger "virtual" trace buffer. This way we wouldn't even have to use >mlock() to pin down the trace buffer. I see two ways this could be done wrt disentangling trace:
1. keep a small per-thread buffer
2. use a per-cpu buffer
The first alternative would allow us to disentangle the per-thread trace easily by simply reconfiguring the DS on a context switch (similar to how it is currently done). The second alternative would require us to memcpy() the BTS buffer at each context switch. This would be a huge overhead at a performance critical place. I would therefore keep the per-thread buffer. This costs one mlock()ed or kmalloc()ed page per traced thread. In addition to that, we need the (pageable) memory to hold the virtual buffer. Thus, it only pays off if the virtual buffer is significantly bigger than the real buffer (we're paying one page extra). Let's look at the interrupt routine.
It would need to disable tracing, page in the memory, memcpy() the BTS buffer, and reenable tracing. This is again a big overhead and might take some time. AFAIK, the interrupt routine would schedule the work and return immediately. A kernel thread would later pick up the work and do the copying. The traced thread would need to be suspended until the copying is done (unless we have a table of buffers that we cycle through and thus can provide a fresh buffer to the thread immediately). Thus, its timeslice would effectively be reduced to the amount of work it gets done in a single 'real' BTS buffer. I am not familiar with the new scheduler and how this would impact (system) performance. I could imagine that either the traced task virtually starves or that it receives an unfairly big amount of system resources. But then, I'm only guessing. I'm sure you can give a precise answer. Overall, this looks like a lot more effort and a lot more things that could go wrong and bring down the entire system. Compared to an almost trivial and reasonably safe implementation. And it only starts paying off when the virtual buffer is getting reasonably big; up to a few pages, we should be better off simply mlock()ing the entire buffer. For use in debuggers to show the debuggee's execution trace to the user, I think that this is unnecessarily complicated. For all the use cases that I can think of today, a single page of memory should suffice. From my own experience (debugging compilers on XScale), by far the most interesting information is the last handful of branch records. I cannot say how far users would want to look back in the trace history or how far they can look back and still make some sense out of what they see. Besides, all this would be hidden behind the ptrace interface. We might add this functionality when there is demand. thanks and regards, markus. 
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: x86, ptrace: support for branch trace store(BTS) 2007-12-11 10:34 ` Metzger, Markus T @ 2007-12-11 14:53 ` Ingo Molnar 2007-12-12 9:18 ` Metzger, Markus T 0 siblings, 1 reply; 11+ messages in thread From: Ingo Molnar @ 2007-12-11 14:53 UTC (permalink / raw) To: Metzger, Markus T Cc: ak, hpa, linux-kernel, tglx, markut.t.metzger, markus.t.metzger, Siddha, Suresh B, roland, akpm, mtk.manpages, Alan Stern * Metzger, Markus T <markus.t.metzger@intel.com> wrote: > That would be a variation on Andi's zero-copy proposal, wouldn't it? > > The user supplies the BTS buffer and the kernel manages DS. > > Andi further suggested a vDSO to interpret the data and translate the > hardware format into a higher level user format. > > I take it that you would leave that inside ptrace. yeah - i think both zero-copy and vdso are probably overkill for this. On the highest level, there are two main use cases of BTS that i can think of: debugging [a user-space task crashes and developer would like to see the last few branches taken - possibly extended to kernel space crashes as well], and instrumentation. In the first use case (debugging) zero-copy is just an unnecessary complication. In the second use case (tracing, profiling, call coverage metrics), we could live without zero-copy, as long as the buffer could be made "large enough". The current 4000 records limit seems rather low (and arbitrary) and probably makes the mechanism unsuitable for say call coverage profiling purposes. There's also no real mechanism that i can see to create a guaranteed flow of this information between the debugger and debuggee (unless i missed something), the code appears to overflow the array, and destroy earlier entries, right? That's "by design" for debugging, but quite a limitation for instrumentation which might want to have a reliable stream of the data (and would like the originating task to block until the debugger had an opportunity to siphon out the data). > I need to look more into mlock. 
So far, I found a system call in > /usr/include/sys/mman.h and two functions sys_mlock() and > user_shm_lock() in the kernel. Is there a memory expert around who > could point me to some interesting places to look at? sys_mlock() is what i meant - you could just call it internally from ptrace and fail the call if sys_mlock() returns -EPERM. This keeps all the "there's too much memory pinned down" details out of the ptrace code. > Can we distinguish kernel-locked memory from user-locked memory? I > could imagine a malicious user munlock()ing the buffer he provided to > ptrace. yeah. Once mlock()-ed, you need to "pin it" via get_user_pages(). That gives a permanent reference count to those pages. > Is there a real difference between mlock()ing user memory and > allocating kernel memory? There would be if we could page out > mlock()ed memory when the user thread is not running. We would need to > disable DS before paging out, and page in before enabling it. If we > cannot, then kernel allocated memory would require less space in > physical memory. mlock() would in essence just give you an easy "does this user have enough privilege to lock this many pages" API. The real pinning would be done by get_user_pages(). Once you have those pages, they won't be swapped out. Ingo ^ permalink raw reply [flat|nested] 11+ messages in thread
* RE: x86, ptrace: support for branch trace store(BTS) 2007-12-11 14:53 ` Ingo Molnar @ 2007-12-12 9:18 ` Metzger, Markus T 2007-12-12 11:03 ` Ingo Molnar 0 siblings, 1 reply; 11+ messages in thread From: Metzger, Markus T @ 2007-12-12 9:18 UTC (permalink / raw) To: Ingo Molnar Cc: ak, hpa, linux-kernel, tglx, markus.t.metzger, markus.t.metzger, Siddha, Suresh B, roland, akpm, mtk.manpages, Alan Stern >-----Original Message----- >From: Ingo Molnar [mailto:mingo@elte.hu] >Sent: Tuesday, December 11, 2007 15:53 >In the second use case (tracing, profiling, call coverage metrics), we >could live without zero-copy, as long as the buffer could be >made "large >enough". The current 4000 records limit seems rather low (and >arbitrary) >and probably makes the mechanism unsuitable for say call coverage >profiling purposes. The limit is indeed arbitrary. It was requested by one of my internal reviewers. The rationale was that we do not want to allow ptrace users to request that an unlimited amount of memory be allocated in the kernel. Andi suggested making this a sysctl. Would it be safe to drop the artificial limit and let the limit be the available memory? >There's also no real mechanism that i can see to >create a guaranteed flow of this information between the debugger and >debuggee (unless i missed something), the code appears to overflow the >array, and destroy earlier entries, right? That's "by design" for >debugging, but quite a limitation for instrumentation which might want >to have a reliable stream of the data (and would like the originating >task to block until the debugger had an opportunity to siphon out the >data). That's correct as well. My focus is on debugging. And that's actually the most useful behavior in that case. I'm not sure what you mean by 'instrumentation'. 
Regarding streaming data (the only real use case I can think of would be path profiling), I think I would rather allocate a bigger buffer and send a signal when the buffer gets full instead of having the kernel provide an interrupt service routine and copy the data from a smaller buffer into a bigger one. This would leave it to the tracing task which likely wants to write the data into a file if it really got too big to hold it in physical memory. We would need another command to reset the DS index, and probably one to provide the entire buffer at once, and add some configuration bits to say whether we want the signal or a cyclic buffer. For debugging, I think this extension is not relevant. Since it could be implemented as an extension of the current interface, I would leave it, for now. >> I need to look more into mlock. So far, I found a system call in >> /usr/include/sys/mman.h and two functions sys_mlock() and >> user_shm_lock() in the kernel. Is there a memory expert around who >> could point me to some interesting places to look at? > >sys_mlock() is what i meant - you could just call it internally from >ptrace and fail the call if sys_mlock() returns -EPERM. This keeps all >the "there's too much memory pinned down" details out of the ptrace >code. > >> Can we distinguish kernel-locked memory from user-locked memory? I >> could imagine a malicious user to munlock() the buffer he >provided to >> ptrace. > >yeah. Once mlock()-ed, you need to "pin it" via get_user_pages(). That >gives a permanent reference count to those pages. sys_mlock() eventually results in a call to get_user_pages(). I have not found the corresponding put_page() call for munlock(), though. So you're suggesting to do an additional get_user_pages to prevent the user from munlock()ing the memory? So, if the pages are unlocked, they will still be pinned down and cannot further be unlocked? 
Something like: ptrace_allocate(struct task_struct *child, void *base, size_t size) { if (sys_mlock(base, size) < 0) return -EPERM; get_user_pages(child, ...., base, size, ....); } >> Is there a real difference between mlock()ing user memory and >> allocating kernel memory? There would be if we could page out >> mlock()ed memory when the user thread is not running. We >would need to >> disable DS before paging out, and page in before enabling it. If we >> cannot, then kernel allocated memory would require less space in >> physical memory. > >mlock() would in essence just give you an easy "does this user have >enough privilege to lock this many pages" API. The real >pinning would be >done by get_user_pages(). Once you have those pages, they won't be >swapped out. The actual physical memory consumption will be worse (or at best equal) compared to kmalloc()ed memory, since we need to pin down entire pages, whereas kmalloc() would allocate just the memory that is actually needed. Unless we would put_page() and get_user_pages() the pinned down pages on a context switch. But then, get_user_pages() calls cond_resched(), which might not be a good thing to do during a context switch. Overall, I think that kmalloc() is actually more memory efficient than mlock(). The only benefit of mlock() would be that we save some copy_to_user() in the debugger task. This should not be a big deal for debuggers, but it might become interesting for profilers. On the other hand, profilers would typically expect a stream of data and they would either want a terribly big buffer or store the data into a file. In both cases, they need to do a lot of copying, already. I would stay with kmalloc() for both use cases. Regarding your other open question, are you OK with the DS interface? Andi suggested in one of his first reviews that the configuration commands should be combined. Something like: struct bts_config { size_t buffer_size; unsigned int flags; } The buffer would be reallocated, if its size changed. 
The flags would be treated like the current CONFIG command to control last branch recording and timestamp recording. They would eventually be extended to control overflow behavior (signal vs. cyclic buffer). I would expect that the buffer_size does not change too often and that the flags might change more frequently. On the down side, this interface would be harder to extend if the struct ever needs to grow. On the up side, it rids us of two commands. thanks and regards, markus. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: x86, ptrace: support for branch trace store(BTS) 2007-12-12 9:18 ` Metzger, Markus T @ 2007-12-12 11:03 ` Ingo Molnar 2007-12-12 12:23 ` Metzger, Markus T 0 siblings, 1 reply; 11+ messages in thread From: Ingo Molnar @ 2007-12-12 11:03 UTC (permalink / raw) To: Metzger, Markus T Cc: ak, hpa, linux-kernel, tglx, markus.t.metzger, Siddha, Suresh B, roland, akpm, mtk.manpages, Alan Stern * Metzger, Markus T <markus.t.metzger@intel.com> wrote: > Andi suggested making this a sysctl. that's just as arbitrary ... > Would it be safe to drop the artificial limit and let the limit be the > available memory? no, that would be a DoS :-/ mlock() is rlimit controlled and is available to unprivileged users - up to a small amount of memory can be locked down. But i agree that mlock() can be problematic - see below. > > There's also no real mechanism that i can see to create a guaranteed > > flow of this information between the debugger and debuggee (unless i > > missed something), the code appears to overflow the array, and > > destroy earlier entries, right? That's "by design" for debugging, > > but quite a limitation for instrumentation which might want to have > > a reliable stream of the data (and would like the originating task > > to block until the debugger had an opportunity to siphon out the > > data). > > That's correct as well. My focus is on debugging. And that's actually > the most useful behavior in that case. I'm not sure what you mean by > 'instrumentation'. the branch trace can be used to generate a very fine-grained profile/histogram of code execution - even of rarely executed codepaths which cannot be captured via timer/perf-counter based profiling. another potential use would be for call graph coverage testing. (which currently requires compiler-inserted calls - would be much nicer if we could do this via the hardware.) etc. Branch tracing isn't just for debugging i think - as long as the framework is flexible enough. 
> The actual physical memory consumption will be worse (or at best > equal) compared to kmalloc()ed memory, since we need to pin down entire > pages, whereas kmalloc() would allocate just the memory that is > actually needed. i agree that mlock() has problems. A different model would be: no mlock and no get_user_pages() - something quite close to what you have already. Data is streamed out of the internal (hardware-filled, kmalloc()-ed, user-inaccessible) buffer, we stop task execution until it is copied to the larger, user-provided buffer. The debugging feature you are interested in could be enabled as a special-case of this mechanism: if the user-space buffer is not larger than the hardware buffer then no streaming is needed, you can just capture into the kernel buffer. User-space would have to do a PTRACE_BTS_DRAIN_BUFFER call (or something like that) to get the "final portion" of the branch trace out into the user-space buffer. [which, in your debugging special-case, would be the full internal buffer.] that way the kmalloc()-ed buffer becomes an internal detail of buffering that you rarely have to be aware of. (it could still be queried - like your patch does it now.) or something else that is intelligent. Basically, what we'd like to have is a future-proof, extensible approach that does not necessarily stop at debugging and integrates this hardware capability into Linux intelligently. Ingo ^ permalink raw reply [flat|nested] 11+ messages in thread
* RE: x86, ptrace: support for branch trace store(BTS) 2007-12-12 11:03 ` Ingo Molnar @ 2007-12-12 12:23 ` Metzger, Markus T 2007-12-13 10:29 ` Ingo Molnar 0 siblings, 1 reply; 11+ messages in thread From: Metzger, Markus T @ 2007-12-12 12:23 UTC (permalink / raw) To: Ingo Molnar Cc: ak, hpa, linux-kernel, tglx, markus.t.metzger, Siddha, Suresh B, roland, akpm, mtk.manpages, Alan Stern >-----Original Message----- >From: Ingo Molnar [mailto:mingo@elte.hu] >Sent: Wednesday, December 12, 2007 12:04 >> Would it be safe to drop the artificial limit and let the >limit be the >> available memory? > >no, that would be a DoS :-/ How about: for (;;) { int pid = fork(); if (pid) ptrace_allocate(pid, <small_buffer_size>); } I guess what we really want is a per-task kmalloc() limit. That limit can be quite big. Should this rather be within kmalloc() itself (with the data stored in task_struct) or should this be local to ptrace_bts_alloc? Is there already a per-task memory limit? We could deduct the kmalloc()ed memory from that. >> the most useful behavior in that case. I'm not sure what you >mean by >> 'instrumentation'. >the branch trace can be used to generate a very fine-grained >profile/histogram of code execution - even of rarely executed >codepaths >which cannot be captured via timer/perf-counter based profiling. > >another potential use would be for call graph coverage testing. (which >currently requires compiler-inserted calls - would be much nicer if we >could do this via the hardware.) > >etc. Branch tracing isn't just for debugging i think - as long as the >framework is flexible enough. Agreed. >> The actual physical memory consumption will be worse (or at best >> equal) compared to kmalloc()ed memory, since we need to pin >down entire >> pages, whereas kmalloc() would allocate just the memory that is >> actually needed. > >i agree that mlock() has problems. A different model would be: I have my doubts regarding any form of two-buffer approach. 
Let's assume we find a way to get a reasonably big limit. Users who want to process that huge amount of data would be better off using a file-based approach (well, if it cannot be held in physical memory, they will spend most of their time swapping, anyway). Those users would typically wait for the 'buffer full' event and drain the buffer into a file - whether this is the real buffer or a bigger virtual buffer. The two-buffer approach would only benefit users who want to hold the full profile in memory - or who want to stall the debuggee until they processed or somehow compressed the data collected so far. Those approaches would not scale for very big profiles. The small profile cases would already be covered with a reasonably big real buffer. The two-buffer approach is moving out the limit of what can be profiled in (virtual) memory. Independent of how many buffers we have, we need to offer overflow handling. This could be:
1. cyclic buffer
2. buffer full event
Robust users would either:
a. select 2 and ignore the overflow signal to collect an initial profile
b. select 1 to collect a profile tail
c. select 2 and drain the real or virtual buffer into a file to collect the full profile
d. select 2 and bail out on overflow (robust but limited usefulness)
Debuggers would typically choose b. Path profilers would typically choose c. Student projects would typically choose d. I see benefit in providing a second overflow handling method. I only see benefit in a bigger second buffer if we cannot get the limit to a reasonable size. Am I missing something? thanks and regards, markus. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: x86, ptrace: support for branch trace store(BTS) 2007-12-12 12:23 ` Metzger, Markus T @ 2007-12-13 10:29 ` Ingo Molnar 2007-12-13 12:51 ` Metzger, Markus T 0 siblings, 1 reply; 11+ messages in thread From: Ingo Molnar @ 2007-12-13 10:29 UTC (permalink / raw) To: Metzger, Markus T Cc: ak, hpa, linux-kernel, tglx, markus.t.metzger, Siddha, Suresh B, roland, akpm, mtk.manpages, Alan Stern * Metzger, Markus T <markus.t.metzger@intel.com> wrote: > Users who want to process that huge amount of data would be better off > using a file-based approach (well, if it cannot be held in physical > memory, they will spend most of their time swapping, anyway). Those > users would typically wait for the 'buffer full' event and drain the > buffer into a file - whether this is the real buffer or a bigger > virtual buffer. > > The two-buffer approach would only benefit users who want to hold the > full profile in memory - or who want to stall the debuggee until they > processed or somehow compressed the data collected so far. Those > approaches would not scale for very big profiles. The small profile > cases would already be covered with a reasonably big real buffer. well, the two-buffer approach would just be a general API with no limitations. It would make the internal buffer mostly a pure performance detail. Ingo ^ permalink raw reply [flat|nested] 11+ messages in thread
* RE: x86, ptrace: support for branch trace store(BTS) 2007-12-13 10:29 ` Ingo Molnar @ 2007-12-13 12:51 ` Metzger, Markus T 2007-12-13 13:08 ` Ingo Molnar 0 siblings, 1 reply; 11+ messages in thread From: Metzger, Markus T @ 2007-12-13 12:51 UTC (permalink / raw) To: Ingo Molnar Cc: ak, hpa, linux-kernel, tglx, markus.t.metzger, Siddha, Suresh B, roland, akpm, mtk.manpages, Alan Stern >-----Original Message----- >From: Ingo Molnar [mailto:mingo@elte.hu] >Sent: Thursday, December 13, 2007 11:30 >> Users who want to process that huge amount of data would be >better off >> using a file-based approach (well, if it cannot be held in physical >> memory, they will spend most of their time swapping, anyway). Those >> users would typically wait for the 'buffer full' event and drain the >> buffer into a file - whether this is the real buffer or a bigger >> virtual buffer. >> >> The two-buffer approach would only benefit users who want to >hold the >> full profile in memory - or who want to stall the debuggee >until they >> processed or somehow compressed the data collected so far. Those >> approaches would not scale for very big profiles. The small profile >> cases would already be covered with a reasonably big real buffer. > >well, the two-buffer approach would just be a general API with no >limitations. It would make the internal buffer mostly a pure >performance >detail. Agreed. Somewhat. A user-provided second buffer would need to be up-to-date when we switch to the user's task. We would either need to drain the real buffer when switching from the traced task; or we would need to drain the real buffers of all traced tasks when switching to the tracing task. Both would require a get_user_pages() during context switching. Alternatively, we could schedule a kernel task to drain the real buffer when switching from a traced task. The tracing task would then need to wait for all those kernel tasks. I'm not sure how that affects scheduling fairness. 
And it's getting quite complicated. A kernel-provided second buffer could be entirely hidden behind the ptrace (or, rather, ds) interface. It would not even have to be drained before switching to the tracing task, since ds would just look into the real buffer and then move on to the second buffer - transparent to the user. Its size could be deducted from the user's memory limit and it could be in pageable memory. We would not be able to give precise overflow signals that way (the not-yet-drained real buffer might actually cause an overflow of the second buffer, when drained). By allowing the user to query for the number of BTS records to drain, we would not need to. A user drain would drain both buffers. The second buffer would be a pure performance/convenience detail of ds, just like you suggested.

The ptrace API would allow the user to:
- define (and query) the overflow mechanism (wrap-around or event)
- define (and query) the size of the buffer within certain limits (we could either give an error or cut off)
- define (and query) events to be monitored (last branch trace, scheduling timestamps)
- get a single BTS record
- query the number of BTS records (to find out how big your drain buffer needs to be; it may be bigger than you requested)
- drain all BTS records (copy, then clear)
- clear all BTS records

Draining would require the user to allocate a buffer to hold the data, which might not be feasible when he is near his memory limit. He could fall back to looping over the single-entry get. It is questionable how useful the drain ptrace command would actually be; we might want to replace it with a get range command. Are you OK with this? thanks and regards, markus. --------------------------------------------------------------------- Intel GmbH Dornacher Strasse 1 85622 Feldkirchen/Muenchen Germany Sitz der Gesellschaft: Feldkirchen bei Muenchen Geschaeftsfuehrer: Douglas Lusk, Peter Gleissner, Hannes Schwaderer Registergericht: Muenchen HRB 47456 Ust.-IdNr.
VAT Registration No.: DE129385895 Citibank Frankfurt (BLZ 502 109 00) 600119052 This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies. ^ permalink raw reply [flat|nested] 11+ messages in thread
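[Editor's note: the two-buffer scheme proposed in the message above - a small "real" BTS buffer transparently backed by a larger kernel-side second buffer, with "query count" and "drain" draining both - can be illustrated with a minimal user-space simulation. All names, sizes, and policies below are illustrative stand-ins, not the actual ds layer or the hardware-defined DS area.]

```c
/*
 * Sketch of the two-buffer scheme: the trace hardware fills a small
 * "real" buffer, which is spilled into a larger second buffer; the
 * user-visible count and drain operations cover both, so the second
 * buffer stays a pure performance detail.
 */
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct bts_record { long from_ip, to_ip; };   /* BTS_BRANCH payload */

#define REAL_CAP   4    /* tiny "real" buffer, stands in for the DS area */
#define SECOND_CAP 16   /* larger second buffer, could live in pageable memory */

static struct bts_record real_buf[REAL_CAP];
static size_t real_n;
static struct bts_record second_buf[SECOND_CAP];
static size_t second_n;

/* Move the not-yet-drained real buffer into the second buffer.  Records
 * that do not fit are silently lost - the "no precise overflow signal"
 * caveat from the message above. */
static void ds_spill(void)
{
    size_t room = SECOND_CAP - second_n;
    size_t move = real_n < room ? real_n : room;
    memcpy(second_buf + second_n, real_buf, move * sizeof(real_buf[0]));
    second_n += move;
    real_n = 0;
}

/* Hardware side: append a record, spilling when the real buffer is full. */
static void bts_append(struct bts_record r)
{
    if (real_n == REAL_CAP)
        ds_spill();
    real_buf[real_n++] = r;
}

/* "query the number of BTS records": records in both buffers count. */
static size_t bts_count(void) { return second_n + real_n; }

/* "drain all BTS records (copy, then clear)": second buffer first, then
 * the real buffer, so records come out in arrival order. */
static size_t bts_drain(struct bts_record *out, size_t cap)
{
    size_t n = 0;
    for (size_t i = 0; i < second_n && n < cap; i++) out[n++] = second_buf[i];
    for (size_t i = 0; i < real_n && n < cap; i++)   out[n++] = real_buf[i];
    second_n = real_n = 0;
    return n;
}
```

A caller would first use bts_count() to size its drain buffer, then call bts_drain(); draining the second buffer before the real one is what keeps the scheme transparent to the user.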
* Re: x86, ptrace: support for branch trace store(BTS) 2007-12-13 12:51 ` Metzger, Markus T @ 2007-12-13 13:08 ` Ingo Molnar 2007-12-13 16:06 ` Metzger, Markus T 0 siblings, 1 reply; 11+ messages in thread From: Ingo Molnar @ 2007-12-13 13:08 UTC (permalink / raw) To: Metzger, Markus T Cc: ak, hpa, linux-kernel, tglx, markus.t.metzger, Siddha, Suresh B, roland, akpm, mtk.manpages, Alan Stern * Metzger, Markus T <markus.t.metzger@intel.com> wrote: > The ptrace API would allow the user to: > - define (and query) the overflow mechanism > (wrap-around or event) > - define (and query) the size of the buffer within certain limits > (we could either give an error or cut off) > - define (and query) events to be monitored > (last branch trace, scheduling timestamps) > - get a single BTS record > - query the number of BTS records > (to find out how big your drain buffer needs to be; it may be bigger > than you requested) > - drain all BTS records (copy, then clear) > - clear all BTS records > > Draining would require the user to allocate a buffer to hold the data, > which might not be feasible when he is near his memory limit. He could > fall back to looping over the single-entry get. It is questionable, > how useful the drain ptrace command would actually be; we might want > to replace it with a get range command. > > Are you OK with this? this sounds a lot more flexible to me. Please, once it looks good to all of us also extend LTP's ptrace bits with unit tests for these API additions. (Cc: such LTP bits to subrata@linux.vnet.ibm.com) Ingo ^ permalink raw reply [flat|nested] 11+ messages in thread
* RE: x86, ptrace: support for branch trace store(BTS) 2007-12-13 13:08 ` Ingo Molnar @ 2007-12-13 16:06 ` Metzger, Markus T 0 siblings, 0 replies; 11+ messages in thread From: Metzger, Markus T @ 2007-12-13 16:06 UTC (permalink / raw) To: Ingo Molnar Cc: ak, hpa, linux-kernel, tglx, markus.t.metzger, Siddha, Suresh B, roland, akpm, mtk.manpages, Alan Stern
>-----Original Message-----
>From: Ingo Molnar [mailto:mingo@elte.hu]
>Sent: Donnerstag, 13. Dezember 2007 14:08
>> Are you OK with this?
>this sounds a lot more flexible to me. Please, once it looks good to all
I will implement the ptrace ABI and the man pages and send them out for review (I'll return -EOPNOTSUPP a few times) tomorrow. I'll add support for the overflow signal in a later patch, once I have figured out how to do it.
>of us also extend LTP's ptrace bits with unit tests for these API additions. (Cc: such LTP bits to subrata@linux.vnet.ibm.com)
I need to look into this. This will take some time, as well. regards, markus. ^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2007-12-13 16:12 UTC | newest] Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2007-12-10 11:38 [patch 1/4] x86: remove bad comment Markus Metzger 2007-12-10 20:20 ` x86, ptrace: support for branch trace store(BTS) Ingo Molnar 2007-12-11 10:34 ` Metzger, Markus T 2007-12-11 14:53 ` Ingo Molnar 2007-12-12 9:18 ` Metzger, Markus T 2007-12-12 11:03 ` Ingo Molnar 2007-12-12 12:23 ` Metzger, Markus T 2007-12-13 10:29 ` Ingo Molnar 2007-12-13 12:51 ` Metzger, Markus T 2007-12-13 13:08 ` Ingo Molnar 2007-12-13 16:06 ` Metzger, Markus T