From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 5 May 2020 09:59:21 +0100
From: Will Deacon
To: Prathu Baronia
Cc: Vlastimil Babka, catalin.marinas@arm.com, alexander.duyck@gmail.com,
	chintan.pandya@oneplus.com, mhocko@suse.com, akpm@linux-foundation.org,
	linux-mm@kvack.org, gregkh@linuxfoundation.com, gthelen@google.com,
	jack@suse.cz, ken.lin@oneplus.com, gasine.xu@oneplus.com,
	ying.huang@intel.com, mark.rutland@arm.com
Subject: Re: [PATCH v2] mm: Optimized hugepage zeroing & copying from user
Message-ID: <20200505085919.GB16980@willie-the-truck>
References: <20200421093621.3fuptvf2qbyfzwfz@oneplus.com>
	<20200421100932.GC17256@willie-the-truck>
	<02d5daa8-ee7b-7d2d-6753-5191a7d761b9@suse.cz>
	<20200421133935.GC17875@willie-the-truck>
	<5e334947-22e9-e59d-f7bb-63e04cc8caf0@suse.cz>
	<20200422081852.GB29541@willie-the-truck>
	<20200422111928.GA32051@willie-the-truck>
	<20200422143841.ozuow4jkltzymvgs@oneplus.com>
	<20200501085855.c5dzk5hfrdzunqdl@oneplus.com>
In-Reply-To: <20200501085855.c5dzk5hfrdzunqdl@oneplus.com>
Sender:
 owner-linux-mm@kvack.org

On Fri, May 01, 2020 at 02:28:55PM +0530, Prathu Baronia wrote:
> Platform and setup conditions:
> Qualcomm's SM8150 platform under controlled conditions (i.e. only CPU0 and 6
> turned on and set to max frequency, and DDR set to performance governor).
> ---------------------------------------------------------------------------
>
> ---------------------------------------------------------------------------
> Summary:
> We observed a ~61% improvement in execution time of clearing a hugepage
> in the case of arm64 if we increase the granularity i.e. the chunk size
> to 64KB from 4KB for each chunk clearing subroutine call.
> ---------------------------------------------------------------------------
>
> For the base build:
>
> clear_huge_page() ftrace profile
> --------------------------------
> - CPU0:
> 	- Samples: 95
> 	- Mean: 242.099 us
> 	- Std dev: 45.0096 us

That's one hell of a deviation. Any idea what's going on there?

> - CPU6:
> 	- Samples: 61
> 	- Mean: 258.372 us
> 	- Std dev: 22.0754 us
>
> With patches [PATCH {1,2,3}/4] provided at the end where we just revert the
> forward-reverse traversal code we observed:
>
> clear_huge_page() ftrace profile
> --------------------------------
> - CPU0:
> 	- Samples: 77
> 	- Mean: 234.568
> 	- Std dev: 6.52
> - CPU6:
> 	- Samples: 81
> 	- Mean: 259.437
> 	- Std dev: 19.25
>
> We were expecting a bit of an improvement for arm64's case because of our
> hypothesis that reverse traversal is considerably slower in arm64 but after Will
> Deacon's test code which showed similar timings for forward and reverse
> traversals we dug a bit deeper into this.
>
> I found that in the case of arm64 a page is cleared using a special clear_page.S
> assembly routine instead of an explicit call to memset. With the below patch we
> bypassed the assembly routine and observed improvement in execution time of
> clear_huge_page on CPU0.
>
> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> index ea5cdbd8c2c3..a0a97a95aee8 100644
> --- a/include/linux/highmem.h
> +++ b/include/linux/highmem.h
> @@ -158,7 +158,7 @@ do {                                                        \
>  static inline void clear_user_highpage(struct page *page, unsigned long vaddr)
>  {
>         void *addr = kmap_atomic(page);
> -       clear_user_page(addr, vaddr, page);
> +       memset(addr, 0x0, PAGE_SIZE);
>         kunmap_atomic(addr);
>  }
>  #endif
>
> For reference I will call the above patch v-exp.
>
> When v-exp is applied on base we observed:
>
> clear_huge_page() ftrace profile
> --------------------------------
> - CPU0:
> 	- Samples: 71
> 	- Mean: 124.657 us
> 	- Std dev: 0.494165 us

This doesn't make any sense to me. memset() of zero is special-cased to use
the DC ZVA instruction in a loop:

3:	dc	zva, dst
	add	dst, dst, zva_len_x
	subs	count, count, zva_len_x
	b.ge	3b

which is basically the same as clear_page():

1:	dc	zva, x0
	add	x0, x0, x1
	tst	x0, #(PAGE_SIZE - 1)
	b.ne	1b

Are you able to reproduce this in userspace?

Will
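
A minimal sketch of what a userspace reproduction along these lines might
look like, not code from this thread. It times zeroing a buffer with
memset() in fixed-size chunks, so the 4KB and 64KB granularities quoted
above can be compared outside the kernel. The helper names (now_ns,
time_clear_ns, alloc_buf) and the 2MB HUGE_SZ are assumptions made for
illustration. It exercises the C library's memset(), which may or may not
take the same DC ZVA path as the kernel's memset() or clear_page().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define HUGE_SZ	(2UL * 1024 * 1024)	/* assumed huge page size: 2MB */

/* Monotonic timestamp in nanoseconds. */
static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/*
 * Zero 'len' bytes at 'buf' using one memset() call per 'chunk' bytes and
 * return the elapsed time in nanoseconds. 'chunk' must be non-zero and
 * must divide 'len'.
 */
static uint64_t time_clear_ns(void *buf, size_t len, size_t chunk)
{
	char *p = buf;
	uint64_t start;

	assert(chunk != 0 && len % chunk == 0);

	start = now_ns();
	for (size_t off = 0; off < len; off += chunk)
		memset(p + off, 0, chunk);

	/* Compiler barrier so the stores cannot be optimised away. */
	__asm__ __volatile__("" : : "r"(p) : "memory");
	return now_ns() - start;
}

/*
 * Allocate a HUGE_SZ-aligned buffer and write to every byte so that the
 * pages are faulted in before any timing is done. Returns NULL on failure.
 */
static void *alloc_buf(void)
{
	void *buf = NULL;

	if (posix_memalign(&buf, HUGE_SZ, HUGE_SZ) != 0)
		return NULL;
	memset(buf, 0xff, HUGE_SZ);
	return buf;
}
```

To use it, call time_clear_ns(buf, HUGE_SZ, 4096) and
time_clear_ns(buf, HUGE_SZ, 65536) many times each from a small main(),
pinned to one CPU (for example with taskset), and compare the mean and
standard deviation of each set of samples. The per-chunk call overhead is
all this measures; it does not model kmap_atomic() or the traversal order
used by clear_huge_page().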