Date: Mon, 8 Jul 2019 12:48:10 -0400
From: Kris Van Hees
To: Peter Zijlstra
Cc: Kris Van Hees, netdev@vger.kernel.org, bpf@vger.kernel.org,
    dtrace-devel@oss.oracle.com, linux-kernel@vger.kernel.org,
    rostedt@goodmis.org, mhiramat@kernel.org, acme@kernel.org,
    ast@kernel.org, daniel@iogearbox.net, Chris Mason
Subject: Re: [PATCH 1/1] tools/dtrace: initial implementation of DTrace
Message-ID: <20190708164810.GC20847@oracle.com>
In-Reply-To: <20190704130336.GN3402@hirez.programming.kicks-ass.net>

On Thu, Jul 04, 2019 at 03:03:36PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 03, 2019 at 08:14:30PM -0700, Kris Van Hees wrote:
> > +/*
> > + * Read the data_head offset from the header page of the ring buffer.  The
> > + * argument is declared 'volatile' because it references a memory mapped page
> > + * that the kernel may be writing to while we access it here.
> > + */
> > +static u64 read_rb_head(volatile struct perf_event_mmap_page *rb_page)
> > +{
> > +	u64	head = rb_page->data_head;
> > +
> > +	asm volatile("" ::: "memory");
> > +
> > +	return head;
> > +}
> > +
> > +/*
> > + * Write the data_tail offset in the header page of the ring buffer.  The
> > + * argument is declared 'volatile' because it references a memory mapped page
> > + * that the kernel may be writing to while we access it here.
>
> s/writing/reading/

Thanks!

> > + */
> > +static void write_rb_tail(volatile struct perf_event_mmap_page *rb_page,
> > +			  u64 tail)
> > +{
> > +	asm volatile("" ::: "memory");
> > +
> > +	rb_page->data_tail = tail;
> > +}
>
> That volatile usage is atrocious (kernel style would have you use
> {READ,WRITE}_ONCE()). Also your comments fail to mark these as
> load_acquire and store_release. And by only using a compiler barrier
> you're hard assuming TSO, which is somewhat fragile at best.
>
> Alternatively, you can use the C11 bits and write:
>
>	return __atomic_load_n(&rb_page->data_head, __ATOMIC_ACQUIRE);
>
>	__atomic_store_n(&rb_page->data_tail, tail, __ATOMIC_RELEASE);

Perhaps I should just use ring_buffer_read_head() and ring_buffer_write_tail(),
since they are provided in tools/include/linux/ring_buffer.h?  I expect those
would be even preferable to __atomic_load_n() and __atomic_store_n()?
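I.e. the helpers would reduce to something like this (just a sketch of what
I have in mind, assuming the tools/include headers are on the include path):

	#include <linux/perf_event.h>
	#include <linux/ring_buffer.h>		/* from tools/include */

	/*
	 * Read the data_head offset from the header page of the ring
	 * buffer, with load-acquire semantics.
	 */
	static u64 read_rb_head(struct perf_event_mmap_page *rb_page)
	{
		return ring_buffer_read_head(rb_page);
	}

	/*
	 * Write the data_tail offset in the header page of the ring
	 * buffer, with store-release semantics.
	 */
	static void write_rb_tail(struct perf_event_mmap_page *rb_page,
				  u64 tail)
	{
		ring_buffer_write_tail(rb_page, tail);
	}

That also gets rid of the 'volatile' qualifier, since the helpers take a
plain pointer and handle the ordering internally.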
> > +/*
> > + * Process and output the probe data at the supplied address.
> > + */
> > +static int output_event(int cpu, u64 *buf)
> > +{
> > +	u8	*data = (u8 *)buf;
> > +	struct perf_event_header	*hdr;
> > +
> > +	hdr = (struct perf_event_header *)data;
> > +	data += sizeof(struct perf_event_header);
> > +
> > +	if (hdr->type == PERF_RECORD_SAMPLE) {
> > +		u8	*ptr = data;
> > +		u32	i, size, probe_id;
> > +
> > +		/*
> > +		 * struct {
> > +		 *	struct perf_event_header	header;
> > +		 *	u32				size;
> > +		 *	u32				probe_id;
> > +		 *	u32				gap;
> > +		 *	u64				data[n];
> > +		 * }
> > +		 * and data points to the 'size' member at this point.
> > +		 */
> > +		if (ptr > (u8 *)buf + hdr->size) {
> > +			fprintf(stderr, "BAD: corrupted sample header\n");
> > +			goto out;
> > +		}
> > +
> > +		size = *(u32 *)data;
> > +		data += sizeof(size);
> > +		ptr += sizeof(size) + size;
> > +		if (ptr != (u8 *)buf + hdr->size) {
> > +			fprintf(stderr, "BAD: invalid sample size\n");
> > +			goto out;
> > +		}
> > +
> > +		probe_id = *(u32 *)data;
> > +		data += sizeof(probe_id);
> > +		size -= sizeof(probe_id);
> > +		data += sizeof(u32);		/* skip 32-bit gap */
> > +		size -= sizeof(u32);
> > +		buf = (u64 *)data;
> > +
> > +		printf("%3d %6d ", cpu, probe_id);
> > +		for (i = 0, size /= sizeof(u64); i < size; i++)
> > +			printf("%#016lx ", buf[i]);
> > +		printf("\n");
> > +	} else if (hdr->type == PERF_RECORD_LOST) {
> > +		u64	lost;
> > +
> > +		/*
> > +		 * struct {
> > +		 *	struct perf_event_header	header;
> > +		 *	u64				id;
> > +		 *	u64				lost;
> > +		 * }
> > +		 * and data points to the 'id' member at this point.
> > +		 */
> > +		lost = *(u64 *)(data + sizeof(u64));
> > +
> > +		printf("[%ld probes dropped]\n", lost);
> > +	} else
> > +		fprintf(stderr, "UNKNOWN: record type %d\n", hdr->type);
> > +
> > +out:
> > +	return hdr->size;
> > +}
>
> I see a distinct lack of wrapping support. AFAICT when buf+hdr->size
> wraps you're doing out-of-bounds accesses.

Yes, that is correct.  I'm still trying to figure out why it didn't cause a
SEGV when I tested this, because I'm clearly reading past the end of the
mmap'd memory.  Thank you for noticing this - I was trying to keep the
posted code minimal and did not pay enough attention here.  This is fixed
in the V2 I am preparing.

> > +/*
> > + * Process the available probe data in the given buffer.
> > + */
> > +static void process_data(struct dtrace_buffer *buf)
> > +{
> > +	/* This is volatile because the kernel may be updating the content. */
> > +	volatile struct perf_event_mmap_page	*rb_page = buf->base;
> > +	u8	*base = (u8 *)buf->base + buf->page_size;
> > +	u64	head = read_rb_head(rb_page);
> > +
> > +	while (rb_page->data_tail != head) {
> > +		u64	tail = rb_page->data_tail;
> > +		u64	*ptr = (u64 *)(base + tail % buf->data_size);
> > +		int	len;
> > +
> > +		len = output_event(buf->cpu, ptr);
> > +
> > +		write_rb_tail(rb_page, tail + len);
> > +		head = read_rb_head(rb_page);
> > +	}
> > +}
>
> more volatile yuck.
>
> Also:
>
>	for (;;) {
>		head = __atomic_load_n(&rb_page->data_head, __ATOMIC_ACQUIRE);
>		tail = __atomic_load_n(&rb_page->data_tail, __ATOMIC_RELAXED);
>
>		if (head == tail)
>			break;
>
>		do {
>			hdr = buf->base + (tail & ((1UL << buf->data_shift) - 1));
>
>			if ((tail >> buf->data_shift) !=
>			    ((tail + hdr->size) >> buf->data_shift))
>				/* handle wrap case */
>			else
>				/* normal case */
>
>			tail += hdr->size;
>		} while (tail != head);
>
>		__atomic_store_n(&rb_page->data_tail, tail, __ATOMIC_RELEASE);
>	}
>
> Or something.

Thank you for this suggestion.  As mentioned above, I lean towards using the
provided ring_buffer_{read_head,write_tail}() implementations, since they are
the 'other end' of the ring buffer head/tail mechanism and will be kept in
sync with any changes that might happen on the kernel side, right?
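Combining your loop with those helpers, this is roughly what I am planning
for V2 (untested sketch; 'buf->event' would be a new scratch buffer in
struct dtrace_buffer, sized to hold at least one full record, and
buf->data_size is assumed to be a power of two):

	#include <string.h>

	static void process_data(struct dtrace_buffer *buf)
	{
		struct perf_event_mmap_page	*rb_page = buf->base;
		u8	*base = (u8 *)buf->base + buf->page_size;
		u64	head, tail;

		for (;;) {
			head = ring_buffer_read_head(rb_page);
			tail = rb_page->data_tail;  /* we are the only writer */

			if (tail == head)
				break;

			do {
				u64				off;
				struct perf_event_header	*hdr;

				off = tail % buf->data_size;
				hdr = (struct perf_event_header *)(base + off);

				if (off + hdr->size > buf->data_size) {
					/*
					 * The record wraps around the end of
					 * the data area: linearize it into
					 * the scratch buffer first.
					 */
					u32	len = buf->data_size - off;

					memcpy(buf->event, base + off, len);
					memcpy(buf->event + len, base,
					       hdr->size - len);
					output_event(buf->cpu,
						     (u64 *)buf->event);
				} else
					output_event(buf->cpu,
						     (u64 *)(base + off));

				tail += hdr->size;
			} while (tail != head);

			ring_buffer_write_tail(rb_page, tail);
		}
	}

(The header itself cannot wrap because records are 8-byte aligned and the
data area is a multiple of the page size, so only the payload needs the
copy.)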
> > +/*
> > + * Wait for data to become available in any of the buffers.
> > + */
> > +int dt_buffer_poll(int epoll_fd, int timeout)
> > +{
> > +	struct epoll_event	events[dt_numcpus];
> > +	int	i, cnt;
> > +
> > +	cnt = epoll_wait(epoll_fd, events, dt_numcpus, timeout);
> > +	if (cnt < 0)
> > +		return -errno;
> > +
> > +	for (i = 0; i < cnt; i++)
> > +		process_data((struct dtrace_buffer *)events[i].data.ptr);
> > +
> > +	return cnt;
> > +}
>
> Or make sure to read on the CPU by having a poll thread per CPU, then
> you can do away with the memory barriers.

That is definitely on the todo list as a future optimization.
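If/when I get to it, I imagine one consumer thread per buffer, pinned to
its CPU, along these lines (hypothetical sketch; it assumes struct
dtrace_buffer carries the perf event fd as buf->fd):

	#define _GNU_SOURCE
	#include <poll.h>
	#include <pthread.h>
	#include <sched.h>

	static void *dt_buffer_thread(void *arg)
	{
		struct dtrace_buffer	*buf = arg;
		struct pollfd		pfd = { .fd = buf->fd,
						.events = POLLIN };
		cpu_set_t		cpus;

		/* Pin the consumer to the CPU that fills this buffer. */
		CPU_ZERO(&cpus);
		CPU_SET(buf->cpu, &cpus);
		pthread_setaffinity_np(pthread_self(), sizeof(cpus), &cpus);

		for (;;) {
			/* Block until this CPU's buffer has data. */
			if (poll(&pfd, 1, -1) < 0)
				break;

			process_data(buf);
		}

		return NULL;
	}

Thanks for your review and code suggestions.

	Kris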