From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1753657AbcLPRUF (ORCPT <rfc822;w@1wt.eu>);
        Fri, 16 Dec 2016 12:20:05 -0500
Received: from mx0b-001b2d01.pphosted.com ([148.163.158.5]:45270 "EHLO
        mx0a-001b2d01.pphosted.com" rhost-flags-OK-OK-OK-FAIL)
        by vger.kernel.org with ESMTP id S1751014AbcLPRT4 (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Fri, 16 Dec 2016 12:19:56 -0500
Date: Fri, 16 Dec 2016 22:49:16 +0530
From: "Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com>
To: Balbir Singh <bsingharora@gmail.com>
Cc: Anju T Sudhakar <anju@linux.vnet.ibm.com>,
        linux-kernel@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
        srikar@linux.vnet.ibm.com, mahesh@linux.vnet.ibm.com, paulus@samba.org,
        mhiramat@kernel.org, ananth@in.ibm.com
Subject: Re: [PATCH V2 0/4] OPTPROBES for powerpc
References: <1481732310-7779-1-git-send-email-anju@linux.vnet.ibm.com>
 <fd268944-5f8e-9815-6fef-e7d9d5191044@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <fd268944-5f8e-9815-6fef-e7d9d5191044@gmail.com>
User-Agent: Mutt/1.6.2 (2016-07-01)
Message-Id: <20161216171916.GH4109@naverao1-tp.localdomain>
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On 2016/12/17 01:46AM, Balbir Singh wrote:
> 
> 
> On 15/12/16 03:18, Anju T Sudhakar wrote:
> > This is the V2 patchset of the kprobes jump optimization
> > (a.k.a. OPTPROBES) for powerpc. Kprobes being an indispensable tool
> > for kernel developers, improving its performance is of great
> > importance.
> > 
> > Currently kprobes inserts a trap instruction to probe a running kernel.
> > Jump optimization allows kprobes to replace the trap with a branch,
> > reducing the probe overhead drastically.
> > 
> > In this series, conditional branch instructions are not considered for
> > optimization as they have to be assessed carefully in SMP systems.
> > 
> > The kprobe placed on kretprobe_trampoline at boot time is also
> > optimized in this series. Patch 4/4 implements this.
> > 
> > The first two patches can go independently of the series. The helper 
> > functions in these patches are invoked in patch 3/4.
> > 
> > Performance:
> > ============
> > An optimized kprobe on powerpc is 1.05 to 4.7 times faster than an
> > unoptimized kprobe.
> >  
> > Example:
> >  
> > A probe was placed at offset 0x50 in _do_fork().
> > "Time Diff" here is the difference between the time just before
> > hitting the probe and the time just after the probed instruction.
> > mftb() is used in kernel/fork.c to take these readings.
> >  
> > # echo 0 > /proc/sys/debug/kprobes-optimization
> > Kprobes globally unoptimized
> >  [  233.607120] Time Diff = 0x1f0
> >  [  233.608273] Time Diff = 0x1ee
> >  [  233.609228] Time Diff = 0x203
> >  [  233.610400] Time Diff = 0x1ec
> >  [  233.611335] Time Diff = 0x200
> >  [  233.612552] Time Diff = 0x1f0
> >  [  233.613386] Time Diff = 0x1ee
> >  [  233.614547] Time Diff = 0x212
> >  [  233.615570] Time Diff = 0x206
> >  [  233.616819] Time Diff = 0x1f3
> >  [  233.617773] Time Diff = 0x1ec
> >  [  233.618944] Time Diff = 0x1fb
> >  [  233.619879] Time Diff = 0x1f0
> >  [  233.621066] Time Diff = 0x1f9
> >  [  233.621999] Time Diff = 0x283
> >  [  233.623281] Time Diff = 0x24d
> >  [  233.624172] Time Diff = 0x1ea
> >  [  233.625381] Time Diff = 0x1f0
> >  [  233.626358] Time Diff = 0x200
> >  [  233.627572] Time Diff = 0x1ed
> >  
> > # echo 1 > /proc/sys/debug/kprobes-optimization
> > Kprobes globally optimized
> >  [   70.797075] Time Diff = 0x103
> >  [   70.799102] Time Diff = 0x181
> >  [   70.801861] Time Diff = 0x15e
> >  [   70.803466] Time Diff = 0xf0
> >  [   70.804348] Time Diff = 0xd0
> >  [   70.805653] Time Diff = 0xad
> >  [   70.806477] Time Diff = 0xe0
> >  [   70.807725] Time Diff = 0xbe
> >  [   70.808541] Time Diff = 0xc3
> >  [   70.810191] Time Diff = 0xc7
> >  [   70.811007] Time Diff = 0xc0
> >  [   70.812629] Time Diff = 0xc0
> >  [   70.813640] Time Diff = 0xda
> >  [   70.814915] Time Diff = 0xbb
> >  [   70.815726] Time Diff = 0xc4
> >  [   70.816955] Time Diff = 0xc0
> >  [   70.817778] Time Diff = 0xcd
> >  [   70.818999] Time Diff = 0xcd
> >  [   70.820099] Time Diff = 0xcb
> >  [   70.821333] Time Diff = 0xf0
> > 
> > Implementation:
> > ===============
> >  
> > The trap instruction is replaced by a branch to a detour buffer. To address
> > the limitation of branch instruction in power architecture, detour buffer
> > slot is allocated from a reserved area. This will ensure that the branch
> > is within ± 32 MB range. The current kprobes insn caches allocate memory
> > area for insn slots with module_alloc(). This will always be beyond
> > ± 32 MB range.
> >  
> 
> The paragraph is a little confusing. We need the detour buffer to be within
> +-32 MB, but then you say we always get memory from module_alloc() beyond
> 32MB.

Yes, I think it can be described better. What Anju is mentioning is that 
the existing generic approach for kprobes insn cache uses module_alloc() 
which is not suitable for us due to the 32MB range limit with relative 
branches on powerpc.

Instead, we reserve a 64k block within .text and allocate the detour 
buffer from that area. This puts the detour buffer in range for most of 
the symbols and should be a good start.
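
As a rough illustration of the range constraint (a userspace sketch,
not code from this series; the function names are made up here): an
unconditional relative branch on powerpc carries a 24-bit word offset,
so the byte offset has to be a multiple of 4 between -32MB and
+32MB - 4.

```c
#include <stdbool.h>
#include <stdint.h>

/* Reach of an I-form relative branch: signed 26-bit byte offset. */
#define BRANCH_MIN_OFFSET (-0x2000000LL) /* -32MB */
#define BRANCH_MAX_OFFSET (0x1fffffcLL)  /* +32MB - 4 */

/* True if 'offset' is word aligned and fits in the branch's LI field. */
static bool offset_in_branch_range(int64_t offset)
{
	return offset >= BRANCH_MIN_OFFSET &&
	       offset <= BRANCH_MAX_OFFSET &&
	       !(offset & 0x3);
}

/*
 * Hypothetical helper: can a branch placed at probe_addr reach a
 * detour buffer slot at slot_addr?
 */
static bool detour_reachable(uint64_t probe_addr, uint64_t slot_addr)
{
	return offset_in_branch_range((int64_t)(slot_addr - probe_addr));
}
```

A slot carved out of .text will typically pass this check for a probe
in the core kernel, while a slot that sits far away from the probed
address will not.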

> 
> > The detour buffer contains a call to optimized_callback() which in turn
> > calls the pre_handler(). Once the pre-handler is run, the original
> > instruction is emulated from the detour buffer itself. Also the detour
> > buffer is equipped with a branch back to the normal work flow after the
> > probed instruction is emulated.
> 
> Does the branch itself use registers that need to be saved? I presume

No, we use immediate values to encode the relative address.
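
To make that concrete, here is a minimal standalone sketch of how such
a branch can be encoded (illustrative only; encode_rel_branch() is a
made-up name, not a function from the series). The offset goes
straight into the instruction word, so no register is involved:

```c
#include <stdint.h>

#define PPC_OP_BRANCH 0x48000000u /* primary opcode 18, AA=0, LK=0 */
#define PPC_LI_MASK   0x03fffffcu /* 24-bit LI field, shifted left by 2 */

/*
 * Encode "b <to>" for an instruction located at 'from'.
 * Returns 0 if the target cannot be reached with a relative branch.
 */
static uint32_t encode_rel_branch(uint64_t from, uint64_t to)
{
	int64_t offset = (int64_t)(to - from);

	if (offset < -0x2000000LL || offset > 0x1fffffcLL || (offset & 0x3))
		return 0;

	/* The truncated offset is the immediate; nothing else is needed. */
	return PPC_OP_BRANCH | ((uint32_t)offset & PPC_LI_MASK);
}
```

For example, a branch 8 bytes forward encodes as 0x48000008 and a
branch 8 bytes backward as 0x4bfffff8.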

> we are going to rely on the +-32MB, what are the guarantees of success
> of such a mechanism?

We explicitly ensure that the return branch is within range as well 
during registration. In fact, this is one of the reasons why we can't 
optimize conditional branches - we can't know in advance where we need 
to jump back.
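
Roughly, the check at registration looks at both directions. The
sketch below is only an illustration under assumed names (struct
optprobe_sketch and can_optimize() are hypothetical, not taken from
the patches):

```c
#include <stdbool.h>
#include <stdint.h>

/* True if a relative branch at 'from' can reach 'to'. */
static bool branch_in_range(uint64_t from, uint64_t to)
{
	int64_t off = (int64_t)(to - from);

	return off >= -0x2000000LL && off <= 0x1fffffcLL && !(off & 0x3);
}

/* Hypothetical description of one candidate for optimization. */
struct optprobe_sketch {
	uint64_t probe_addr;      /* instruction being probed */
	uint64_t detour_addr;     /* start of the detour buffer slot */
	uint64_t ret_branch_addr; /* location of the branch back, in the slot */
	uint64_t resume_addr;     /* where execution resumes; must be known now */
};

/* Optimize only if both the outgoing and the return branch fit. */
static bool can_optimize(const struct optprobe_sketch *p)
{
	return branch_in_range(p->probe_addr, p->detour_addr) &&
	       branch_in_range(p->ret_branch_addr, p->resume_addr);
}
```

For a conditional branch, resume_addr depends on run-time state, so
there is nothing to validate up front.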

> 
> Balbir Singh.
> 

Thanks,
- Naveen