Date: Mon, 19 Dec 2011 12:40:23 +0100
From: Ingo Molnar
To: Avi Kivity
Cc: Robert Richter, Benjamin Block, Hans Rosenfeld, hpa@zytor.com,
	tglx@linutronix.de, suresh.b.siddha@intel.com, eranian@google.com,
	brgerst@gmail.com, Andreas.Herrmann3@amd.com, x86@kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC 4/5] x86, perf: implements lwp-perf-integration (rc1)
Message-ID: <20111219114023.GB29855@elte.hu>
In-Reply-To: <4EEF1C3B.3010307@redhat.com>
References: <20111216160757.GL665@escobedo.osrc.amd.com>
	<1324051943-21112-1-git-send-email-hans.rosenfeld@amd.com>
	<1324051943-21112-4-git-send-email-hans.rosenfeld@amd.com>
	<20111218080443.GB4144@elte.hu> <20111218234309.GA12958@elte.hu>
	<20111219090923.GB16765@erda.amd.com> <20111219105429.GC19861@elte.hu>
	<4EEF1C3B.3010307@redhat.com>

* Avi Kivity wrote:

> On 12/19/2011 12:54 PM, Ingo Molnar wrote:
> > * Robert Richter wrote:
> >
> > > On 19.12.11 00:43:10, Ingo Molnar wrote:
> > >
> > > > So the question becomes, how well is it integrated: can perf
> > > > 'record -a + perf report', or 'perf top' use LWP, to do
> > > > system-wide precise [user-space] profiling and such?
> > >
> > > There is only self-monitoring of a process possible, no
> > > kernel and system-wide profiling. This is because we can
> > > not allocate memory regions in the kernel for a thread
> > > other than the current. This would require a complete
> > > rework of mm code.
> >
> > Hm, i don't think a rework is needed: check the
> > vmalloc_to_page() code in kernel/events/ring_buffer.c. Right
> > now CONFIG_PERF_USE_VMALLOC is an ARM, MIPS, SH and Sparc
> > specific feature, on x86 it turns on if
> > CONFIG_DEBUG_PERF_USE_VMALLOC=y.
> >
> > That should be good enough for prototyping the kernel/user
> > shared buffering approach.
>
> LWP wants user memory, vmalloc is insufficient. You need
> do_mmap() with a different mm.

Take a look at PERF_USE_VMALLOC: it allows in-kernel allocated
memory to be mmap()ed to user-space. It is basically a shared/dual
user/kernel mode vmalloc implementation. (A rough sketch of that
fault-path plumbing is appended below the quoted bits.)

So all the conceptual pieces are there.

> You could let a workqueue call use_mm() and then do_mmap().
> Even then it is subject to disruption by the monitored thread
> (and may disrupt the monitored thread by playing with its
> address space). [...]

Injecting this into another thread's context is indeed advanced
stuff:

> [...] This is for thread monitoring only, I don't think
> system-wide monitoring is possible with LWP.
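[ Illustration only: a minimal, untested sketch of how a vmalloc()ed
  kernel buffer can be handed out to a user-space mmap() through a
  ->fault() handler, i.e. the same trick the PERF_USE_VMALLOC path in
  kernel/events/ring_buffer.c plays via vmalloc_to_page(). The
  lwp_buf_* names are made up for this sketch, it only shows the
  shape of the plumbing: ]

#include <linux/mm.h>
#include <linux/vmalloc.h>

struct lwp_buf {
	void		*data;		/* vmalloc_user()ed backing store */
	unsigned long	nr_pages;	/* size of ->data in pages */
};

/*
 * ->fault() handler for the user mmap() of the buffer: translate the
 * vmalloc address of the faulting page to a struct page and hand it out.
 */
static int lwp_buf_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	struct lwp_buf *buf = vma->vm_private_data;

	if (vmf->pgoff >= buf->nr_pages)
		return VM_FAULT_SIGBUS;

	vmf->page = vmalloc_to_page(buf->data + (vmf->pgoff << PAGE_SHIFT));
	get_page(vmf->page);

	return 0;
}

static const struct vm_operations_struct lwp_buf_vm_ops = {
	.fault	= lwp_buf_fault,
};

The mmap() side would then just install lwp_buf_vm_ops on the vma,
the way the perf mmap code does for its ring-buffer pages.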
System-wide monitoring should be possible too, via two methods:

 1) the easy hack: a (per cpu) vmalloc()ed buffer is made ring 3
    accessible (by clearing the system bit in the ptes) - and thus
    accessible to all user-space. This is obviously globally
    writable/readable memory so only a debugging/prototyping hack -
    but it would be a great first step to prove the concept and see
    some nice perf top and perf record results ... (a rough sketch
    is appended at the end of this mail)

 2) the proper solution: creating a 'user-space vmalloc()' that is
    per mm, that gets inherited transparently across fork() and
    exec(), and that lies outside the regular vma spaces. On 64-bit
    this should be straightforward.

    These vmas are not normally 'known' to user-space - the kernel
    PMU code knows about them and does what we do with PEBS: flushes
    the buffer when necessary and feeds it into the regular perf
    event channels.

    This solves the inherited perf record workflow immediately: the
    parent task just creates the buffer, which gets inherited across
    exec() and fork(), into every portion of the workload.

    System-wide profiling is a small additional variant of this:
    such a user-vmalloc() area is created for all tasks in the
    system, so that the PMU code has the buffers ready in the
    context-switch code.

Solution #2 has the additional advantage that we could migrate PEBS
to it and could allow interested user-space access to the 'raw' PEBS
buffer as well. (Currently the PEBS buffer is only visible to
kernel-space.)

I'd suggest the easy hack first, to get things going - we can then
help out with the proper solution.

Thanks,

	Ingo
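[ Appended for illustration: a rough, untested, x86-only sketch of
  the #1 hack above. All lwp_* names are made up. It vmalloc()s a
  buffer and sets _PAGE_USER on its leaf ptes; a real version would
  also need the user bit in the upper-level paging entries for ring 3
  to actually reach the pages, and the buffer becomes visible to
  every process - strictly a debugging/prototyping aid: ]

#include <linux/mm.h>
#include <linux/vmalloc.h>
#include <asm/pgtable.h>
#include <asm/tlbflush.h>

/* Set the U/S bit ("clear the system bit") on one pte of the buffer. */
static int lwp_make_pte_user(pte_t *pte, pgtable_t token,
			     unsigned long addr, void *data)
{
	set_pte(pte, pte_set_flags(*pte, _PAGE_USER));
	return 0;
}

/* Allocate a buffer and make its leaf ptes ring 3 accessible. */
static void *lwp_alloc_ring3_buffer(unsigned long size)
{
	void *buf = vmalloc_user(size);		/* zeroed, page aligned */

	if (!buf)
		return NULL;

	size = PAGE_ALIGN(size);
	apply_to_page_range(&init_mm, (unsigned long)buf, size,
			    lwp_make_pte_user, NULL);
	flush_tlb_kernel_range((unsigned long)buf, (unsigned long)buf + size);

	return buf;
}

A real version would also want to undo the pte changes before the
buffer is vfree()d.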