From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1753472AbXLKOxg (ORCPT ); Tue, 11 Dec 2007 09:53:36 -0500
Received: (majordomo@vger.kernel.org) by vger.kernel.org
        id S1751785AbXLKOx2 (ORCPT ); Tue, 11 Dec 2007 09:53:28 -0500
Received: from mx3.mail.elte.hu ([157.181.1.138]:35519 "EHLO mx3.mail.elte.hu"
        rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
        id S1751285AbXLKOx1 (ORCPT ); Tue, 11 Dec 2007 09:53:27 -0500
Date: Tue, 11 Dec 2007 15:53:01 +0100
From: Ingo Molnar
To: "Metzger, Markus T"
Cc: ak@suse.de, hpa@zytor.com, linux-kernel@vger.kernel.org,
        tglx@linutronix.de, markut.t.metzger@intel.com,
        markus.t.metzger@gmail.com, "Siddha, Suresh B" , roland@redhat.com,
        akpm@linux-foundation.org, mtk.manpages@gmail.com, Alan Stern
Subject: Re: x86, ptrace: support for branch trace store(BTS)
Message-ID: <20071211145301.GA19427@elte.hu>
References: <20071210123809.A14251@sedona.ch.intel.com>
        <20071210202052.GA26002@elte.hu>
        <029E5BE7F699594398CA44E3DDF5544401130A1E@swsmsx413.ger.corp.intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <029E5BE7F699594398CA44E3DDF5544401130A1E@swsmsx413.ger.corp.intel.com>
User-Agent: Mutt/1.5.17 (2007-11-01)
X-ELTE-VirusStatus: clean
X-ELTE-SpamScore: -1.5
X-ELTE-SpamLevel:
X-ELTE-SpamCheck: no
X-ELTE-SpamVersion: ELTE 2.0
X-ELTE-SpamCheck-Details: score=-1.5 required=5.9 tests=BAYES_00 autolearn=no
        SpamAssassin version=3.2.3
        -1.5 BAYES_00 BODY: Bayesian spam probability is 0 to 1%
        [score: 0.0000]
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

* Metzger, Markus T wrote:

> That would be a variation on Andi's zero-copy proposal, wouldn't it?
>
> The user supplies the BTS buffer and the kernel manages DS.
>
> Andi further suggested a vDSO to interpret the data and translate the
> hardware format into a higher level user format.
>
> I take it that you would leave that inside ptrace.

yeah - i think both zero-copy and vdso are probably overkill for this.

On the highest level, there are two main use-cases of BTS that i can
think of: debugging [a user-space task crashes and the developer would
like to see the last few branches taken - possibly extended to kernel
space crashes as well], and instrumentation.

In the first use-case (debugging) zero-copy is just an unnecessary
complication.

In the second use-case (tracing, profiling, call coverage metrics), we
could live without zero-copy, as long as the buffer could be made "large
enough". The current 4000 records limit seems rather low (and arbitrary)
and probably makes the mechanism unsuitable for, say, call coverage
profiling purposes. There's also no real mechanism that i can see to
create a guaranteed flow of this information between the debugger and
debuggee (unless i missed something): the code appears to overflow the
array and destroy earlier entries, right? That's "by design" for
debugging, but quite a limitation for instrumentation, which might want
to have a reliable stream of the data (and would like the originating
task to block until the debugger has had an opportunity to siphon out
the data).

> I need to look more into mlock. So far, I found a system call in
> /usr/include/sys/mman.h and two functions sys_mlock() and
> user_shm_lock() in the kernel. Is there a memory expert around who
> could point me to some interesting places to look at?

sys_mlock() is what i meant - you could just call it internally from
ptrace and fail the call if sys_mlock() returns -EPERM. This keeps all
the "there's too much memory pinned down" details out of the ptrace
code.

> Can we distinguish kernel-locked memory from user-locked memory? I
> could imagine a malicious user to munlock() the buffer he provided to
> ptrace.

yeah. Once mlock()-ed, you need to "pin it" via get_user_pages(). That
gives a permanent reference count to those pages.
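
To make the earlier overflow point concrete, here is a minimal
userspace sketch (not the BTS code itself) of the two buffer policies
discussed above: one that overwrites the oldest record when full, which
suits debugging, and one that refuses the write so the producer can
block until the consumer drains, which is what reliable instrumentation
would want. The record layout, the tiny capacity and all ring_* names
are assumptions made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: a branch source and destination address. */
struct branch_rec {
	uint64_t from;
	uint64_t to;
};

#define RING_CAP 4	/* tiny on purpose, so wrap-around is easy to see */

struct ring {
	struct branch_rec rec[RING_CAP];
	size_t head;	/* index of the next slot to write */
	size_t count;	/* number of valid records, at most RING_CAP */
};

/* Debugging policy: always succeeds, silently destroys the oldest record. */
static void ring_push_overwrite(struct ring *r, struct branch_rec v)
{
	r->rec[r->head] = v;
	r->head = (r->head + 1) % RING_CAP;
	if (r->count < RING_CAP)
		r->count++;
}

/*
 * Instrumentation policy: returns false when full instead of overwriting.
 * A real producer would block here until the consumer has drained records.
 */
static bool ring_push_reliable(struct ring *r, struct branch_rec v)
{
	if (r->count == RING_CAP)
		return false;
	r->rec[r->head] = v;
	r->head = (r->head + 1) % RING_CAP;
	r->count++;
	return true;
}

/* Consumer side: remove the oldest surviving record, if there is one. */
static bool ring_pop(struct ring *r, struct branch_rec *out)
{
	size_t tail;

	if (r->count == 0)
		return false;
	tail = (r->head + RING_CAP - r->count) % RING_CAP;
	*out = r->rec[tail];
	r->count--;
	return true;
}
```

With the overwrite policy, pushing six records into the four-slot ring
leaves only the last four; with the reliable policy the fifth push is
rejected until ring_pop() frees a slot.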
> Is there a real difference between mlock()ing user memory and
> allocating kernel memory? There would be if we could page out
> mlock()ed memory when the user thread is not running. We would need to
> disable DS before paging out, and page in before enabling it. If we
> cannot, then kernel allocated memory would require less space in
> physical memory.

mlock() would in essence just give you an easy "does this user have
enough privilege to lock this many pages" API. The real pinning would be
done by get_user_pages(). Once you have those pages, they won't be
swapped out.

        Ingo
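
For reference, a minimal userspace sketch of the "mlock() as a
privilege check" idea. It only shows the userspace mlock(2) call; the
in-kernel path (sys_mlock() followed by get_user_pages()) is not shown
here. The helper names try_lock_buffer(), unlock_buffer() and
alloc_pages_buf() are hypothetical. Note that, going by the mlock(2)
man page, userspace may see ENOMEM rather than EPERM when the request
exceeds a nonzero RLIMIT_MEMLOCK, so a caller would probably want to
treat both as "not allowed to lock this much".

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to lock a user-supplied buffer in memory.
 * Returns 0 on success or a negative errno value, so the caller can
 * simply fail its own request when the user may not lock that much.
 */
static int try_lock_buffer(void *buf, size_t len)
{
	assert(buf != NULL && len > 0);
	if (mlock(buf, len) != 0)
		return -errno;
	return 0;
}

/* Undo try_lock_buffer(); errors are ignored in this sketch. */
static void unlock_buffer(void *buf, size_t len)
{
	(void)munlock(buf, len);
}

/*
 * Allocate npages page-aligned, zeroed pages. Stores the byte length in
 * *len_out and returns NULL on failure. Free the result with free().
 */
static void *alloc_pages_buf(size_t npages, size_t *len_out)
{
	long psz = sysconf(_SC_PAGESIZE);
	void *buf = NULL;

	if (psz <= 0 || npages == 0)
		return NULL;
	*len_out = npages * (size_t)psz;
	if (posix_memalign(&buf, (size_t)psz, *len_out) != 0)
		return NULL;
	memset(buf, 0, *len_out);
	return buf;
}
```

A caller would allocate with alloc_pages_buf(), call try_lock_buffer(),
and refuse to set up tracing if the result is negative. Whether a
given request succeeds depends on RLIMIT_MEMLOCK and CAP_IPC_LOCK on
the machine it runs on.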