From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from smtp1.linuxfoundation.org (smtp1.linux-foundation.org [172.17.192.35])
	by mail.linuxfoundation.org (Postfix) with ESMTPS id 7CA594D3
	for ; Thu, 21 Jul 2016 15:45:36 +0000 (UTC)
Received: from mx2.suse.de (mx2.suse.de [195.135.220.15])
	by smtp1.linuxfoundation.org (Postfix) with ESMTPS id BFE87122
	for ; Thu, 21 Jul 2016 15:45:35 +0000 (UTC)
Date: Thu, 21 Jul 2016 17:45:32 +0200
From: Jan Kara
To: Chris Mason
Message-ID: <20160721154532.GC14146@quack2.suse.cz>
References: <578F36B9.802@huawei.com>
	<20160721100014.GB7901@quack2.suse.cz>
	<577236a8-2921-842a-2243-b8ecfe467381@fb.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <577236a8-2921-842a-2243-b8ecfe467381@fb.com>
Cc: ksummit-discuss@lists.linuxfoundation.org
Subject: Re: [Ksummit-discuss] [TECH TOPIC] Kernel tracing and end-to-end
	performance breakdown

On Thu 21-07-16 09:54:53, Chris Mason wrote:
> On 07/21/2016 06:00 AM, Jan Kara wrote:
> >
> > So I think improvements in performance analysis are always welcome but
> > current proposal seems to be somewhat handwavy so I'm not sure what outcome
> > you'd like to get from the discussion... If you have a more concrete
> > proposal how you'd like to achieve what you need, then it may be worth
> > discussion.
> >
> > As a side note I know that Google (and maybe Facebook, not sure here) have
> > out-of-tree patches which provide really neat performance analysis
> > capabilities. I have heard they are not really upstreamable because they
> > are horrible hacks but maybe they can be a good inspiration for this work.
> > If we could get someone from these companies to explain what capabilities
> > they have and how they achieve this (regardless how hacky the
> > implementation may be), that may be an interesting topic.
>
> At least for facebook, we're moving most things to bpf. The most
> interesting part of our analysis isn't so much from the tool used to record
> it, it's from being able to aggregate over the fleet and making comparisons
> at scale.
>
> For example, Josef setup the off-cpu flame graphs such that we can record
> stack traces for a latency higher than N, and then sum up the most expensive
> stack traces over a large number of machines. It makes it much easier to
> find those happens-once-a-day problems.

By latency higher than N, do you mean that e.g. a syscall took more than N,
or just that a process is sleeping for more than N in some place?

								Honza
-- 
Jan Kara
SUSE Labs, CR
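[Editor's note: the fleet-wide aggregation step Chris describes ("record stack traces for a latency higher than N, then sum up the most expensive stack traces over a large number of machines") can be sketched roughly as below. This is a toy illustration with made-up stack names and latencies, not Facebook's actual BPF tooling; the real pipeline records off-CPU stacks in the kernel via BPF and only the post-processing resembles this.]

```python
from collections import Counter

def aggregate_offcpu(samples_per_machine, min_latency_us):
    """Sum off-CPU time per stack trace across a fleet of machines,
    keeping only sleeps longer than min_latency_us (the "latency
    higher than N" threshold from the thread)."""
    totals = Counter()
    for samples in samples_per_machine:
        for stack, latency_us in samples:
            if latency_us > min_latency_us:
                totals[stack] += latency_us
    # Most expensive stacks across the whole fleet come out first.
    return totals.most_common()

# Hypothetical per-machine samples: (folded stack trace, off-CPU us).
fleet = [
    [("sys_read;io_wait", 50_000), ("futex_wait", 200)],
    [("sys_read;io_wait", 80_000), ("sys_fsync;jbd2_wait", 120_000)],
]

result = aggregate_offcpu(fleet, 1_000)
print(result)
# [('sys_read;io_wait', 130000), ('sys_fsync;jbd2_wait', 120000)]
```

The short futex sleep falls below the threshold and is dropped, while the I/O wait that is individually cheaper than the fsync on any one machine still ranks first once summed fleet-wide, which is the point of aggregating before ranking.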