From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758171AbcAUE4H (ORCPT ); Wed, 20 Jan 2016 23:56:07 -0500 Received: from mail-pf0-f172.google.com ([209.85.192.172]:34302 "EHLO mail-pf0-f172.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753481AbcAUE4F (ORCPT ); Wed, 20 Jan 2016 23:56:05 -0500 Date: Wed, 20 Jan 2016 20:55:59 -0800 From: Alexei Starovoitov To: Peter Zijlstra Cc: Alexander Shishkin , Ingo Molnar , linux-kernel@vger.kernel.org, vince@deater.net, eranian@google.com, Arnaldo Carvalho de Melo , Jiri Olsa Subject: Re: [PATCH v2] perf: Synchronously cleanup child events Message-ID: <20160121045556.GA99187@ast-mbp.thefacebook.com> References: <20160118144410.GS6357@twins.programming.kicks-ass.net> <1453216354-9282-1-git-send-email-alexander.shishkin@linux.intel.com> <20160119200558.GC6357@twins.programming.kicks-ass.net> <20160119215818.GA87036@ast-mbp.thefacebook.com> <20160120083222.GF6357@twins.programming.kicks-ass.net> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160120083222.GF6357@twins.programming.kicks-ass.net> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Jan 20, 2016 at 09:32:22AM +0100, Peter Zijlstra wrote: > On Tue, Jan 19, 2016 at 01:58:19PM -0800, Alexei Starovoitov wrote: > > On Tue, Jan 19, 2016 at 09:05:58PM +0100, Peter Zijlstra wrote: > > > > The most obvious place that generates such magical references would be > > > the bpf arraymap doing perf_event_get() on things. There are a few other > > > places that take temp references (perf_mmap_close), but those are > > > 'short' lived and while ugly will not cause massive grief. The BPF one > > > OTOH is a real problem here. > > > > > > And looking at the BPF stuff, that code seems to assume > > > perf_event_kernel_release() := put_event(), so this patch breaks that > > > too. > > > > > > > > > Alexei, is there a reason the arraymap stuff needs a perf event ref as > > > opposed to a file ref? I'm forever a little confused on how perf<->bpf > > > works. > > > > A file ref will not work, since user space could have closed that > > perf_event file to avoid unnecessary FDs. > > So I'm (possibly again) confused on how BPF works. > > I thought the reason you handed in perf events from userspace; as > opposed to creating your own with perf_event_create_kernel_counter(); > was because userspace was interested in the output. yes. There are two use cases of perf_events from bpf: 1. sw_bpf_output event is used by bpf to push samples into it and user spaces reads it as normal via mmap 2. PERF_TYPE_HARDWARE event is used by bpf program to read counters to measure things like number of cycles or tlb misses in a given function. In this case user space typically leaves FDs around, but it doesn't use them for anything. > Also, BPF should not be a way to get around the filedesc resource limit. all bpf tracing stuff is root only and maps are charged for every element. > > Program only need the stable pointer to 'struct perf_event' which > > it will use while running. > > At the end it will call perf_event_kernel_release() which > > is == put_event(). > > It was the case that 'perf_events' were normal refcnt-ed structures > > and the last guy frees it. > > Sort-of, but user events are (or should be, rather) tied to the filedesc > to account the resources used. > > There is also the event->owner field, we track the task that created the > event, with your current scheme that is left dangling once userspace > closes the last filedesc and you still have a ref open. > > > This put_event_last() logic definitely looks problematic. > > There are no ordering guarantees. > > User space may close FD, while struct perf_event is still alive. > > The loop around perf_event_last() looks buggy. > > I'm obviously missing the main goal of this patch. > > Right, so the patch in question tries to synchronously clean up > everything related to the counter when we close the file. Such that the > file better reflects the actual resource usage. > > Currently we do this async (and with holes). > > In short, user created event really should be filedesc based, yes we > have event references, but those 'should' be short lived. I'm still missing why it's the problem. Which counter do you want to bump as part of perf_event_get() ? still event->refcount, right? but the same perf_event can be stored in multiple bpf maps and many bpf programs can be using it, while nothing can possibly prevent the user space to do close(perf_event_fd) while programs are still running and collecting tlb miss data from the counters. So what do you propose?