From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1758171AbcAUE4H (ORCPT <rfc822;w@1wt.eu>);
	Wed, 20 Jan 2016 23:56:07 -0500
Received: from mail-pf0-f172.google.com ([209.85.192.172]:34302 "EHLO
	mail-pf0-f172.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1753481AbcAUE4F (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Wed, 20 Jan 2016 23:56:05 -0500
Date: Wed, 20 Jan 2016 20:55:59 -0800
From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>,
        Ingo Molnar <mingo@redhat.com>, linux-kernel@vger.kernel.org,
        vince@deater.net, eranian@google.com,
        Arnaldo Carvalho de Melo <acme@infradead.org>,
        Jiri Olsa <jolsa@kernel.org>
Subject: Re: [PATCH v2] perf: Synchronously cleanup child events
Message-ID: <20160121045556.GA99187@ast-mbp.thefacebook.com>
References: <20160118144410.GS6357@twins.programming.kicks-ass.net>
 <1453216354-9282-1-git-send-email-alexander.shishkin@linux.intel.com>
 <20160119200558.GC6357@twins.programming.kicks-ass.net>
 <20160119215818.GA87036@ast-mbp.thefacebook.com>
 <20160120083222.GF6357@twins.programming.kicks-ass.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20160120083222.GF6357@twins.programming.kicks-ass.net>
User-Agent: Mutt/1.5.24 (2015-08-30)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Jan 20, 2016 at 09:32:22AM +0100, Peter Zijlstra wrote:
> On Tue, Jan 19, 2016 at 01:58:19PM -0800, Alexei Starovoitov wrote:
> > On Tue, Jan 19, 2016 at 09:05:58PM +0100, Peter Zijlstra wrote:
> 
> > > The most obvious place that generates such magical references would be
> > > the bpf arraymap doing perf_event_get() on things. There are a few other
> > > places that take temp references (perf_mmap_close), but those are
> > > 'short' lived and while ugly will not cause massive grief. The BPF one
> > > OTOH is a real problem here.
> > > 
> > > And looking at the BPF stuff, that code seems to assume
> > > perf_event_kernel_release() := put_event(), so this patch breaks that
> > > too.
> > > 
> > > 
> > > Alexei, is there a reason the arraymap stuff needs a perf event ref as
> > > opposed to a file ref? I'm forever a little confused on how perf<->bpf
> > > works.
> > 
> > A file ref will not work, since user space could have closed that
> > perf_event file to avoid unnecessary FDs.
> 
> So I'm (possibly again) confused on how BPF works.
> 
> I thought the reason you handed in perf events from userspace; as
> opposed to creating your own with perf_event_create_kernel_counter();
> was because userspace was interested in the output.

yes. There are two use cases of perf_events from bpf:
1. sw_bpf_output event is used by bpf to push samples into it and
   user spaces reads it as normal via mmap
2. PERF_TYPE_HARDWARE event is used by bpf program to read
   counters to measure things like number of cycles or tlb misses
   in a given function.
   In this case user space typically leaves FDs around, but it doesn't
   use them for anything.

> Also, BPF should not be a way to get around the filedesc resource limit.

all bpf tracing stuff is root only and maps are charged for every element.

> > Program only need the stable pointer to 'struct perf_event' which
> > it will use while running.
> > At the end it will call perf_event_kernel_release() which
> > is == put_event().
> > It was the case that 'perf_events' were normal refcnt-ed structures
> > and the last guy frees it.
> 
> Sort-of, but user events are (or should be, rather) tied to the filedesc
> to account the resources used.
> 
> There is also the event->owner field, we track the task that created the
> event, with your current scheme that is left dangling once userspace
> closes the last filedesc and you still have a ref open.
> 
> > This put_event_last() logic definitely looks problematic.
> > There are no ordering guarantees.
> > User space may close FD, while struct perf_event is still alive.
> > The loop around perf_event_last() looks buggy.
> > I'm obviously missing the main goal of this patch.
> 
> Right, so the patch in question tries to synchronously clean up
> everything related to the counter when we close the file. Such that the
> file better reflects the actual resource usage.
> 
> Currently we do this async (and with holes).
> 
> In short, user created event really should be filedesc based, yes we
> have event references, but those 'should' be short lived.

I'm still missing why it's the problem.
Which counter do you want to bump as part of perf_event_get() ?
still event->refcount, right?
but the same perf_event can be stored in multiple bpf maps
and many bpf programs can be using it, while nothing can
possibly prevent the user space to do close(perf_event_fd)
while programs are still running and collecting tlb miss data
from the counters.
So what do you propose?