From: "Huang, Ying"
To: Prathu Baronia
Subject: Re: [PATCH v2] mm: Optimized hugepage zeroing & copying from user
References: <20200414153829.GA15230@oneplus.com>
Date: Wed, 15 Apr 2020 11:27:39 +0800
In-Reply-To: <20200414153829.GA15230@oneplus.com> (Prathu Baronia's message of "Tue, 14 Apr 2020 21:08:32 +0530")
Message-ID: <87r1wpzavo.fsf@yhuang-dev.intel.com>

Prathu Baronia writes:

> In !HIGHMEM cases, specially in 64-bit architectures, we don't need temp mapping
> of pages. Hence, k(map|unmap)_atomic() acts as nothing more than multiple
> barrier() calls, for example for a 2MB hugepage in clear_huge_page() these are
> called 512 times i.e. to map and unmap each subpage that means in total 2048
> barrier calls. This called for optimization. Simply getting VADDR from page does
> the job for us. This also applies to the copy_user_huge_page() function.
>
> With kmap_atomic() out of the picture we can use memset and memcpy for sizes
> larger than 4K.
> Instead of a left-right approach to access the target subpage,
> getting the VADDR from the page and using memset directly in a simple experiment
> we observed a 64% improvement in time over the current approach.
>
> With this(v2) patch we observe 65.85%(under controlled conditions) improvement
> over the current approach.

Can you describe your test?

> Currently process_huge_page iterates over subpages in a left-right manner
> targeting the subpage that was accessed to be processed at last to keep the
> cache hot around the faulting address. This caused a latency issue because as we
> observed in the case of ARM64 the reverse access is much slower than forward
> access and much much slower than oneshot access because of the pre-fetcher
> behaviour. The following simple userspace experiment to allocate
> 100MB(total_size) of pages and writing to it using memset(oneshot), forward
> order loop and a reverse order loop gave us a good insight:-
>
> --------------------------------------------------------------------------------
> Test code snippet:
> --------------------------------------------------------------------------------
>   /* One shot memset */
>   memset (r, 0xd, total_size);
>
>   /* traverse in forward order */
>   for (j = 0; j < total_pages; j++)
>     {
>       memset (q + (j * SZ_4K), 0xc, SZ_4K);
>     }
>
>   /* traverse in reverse order */
>   for (i = 0; i < total_pages; i++)
>     {
>       memset (p + total_size - (i + 1) * SZ_4K, 0xb, SZ_4K);
>     }

You have tested chunk sizes of 4KB and 2MB. Can you test some values in
between, for example 32KB or 64KB? There may be a sweet spot that
combines a smaller granularity with good performance.

> ----------------------------------------------------------------------
> Results:
> ----------------------------------------------------------------------
> Results for ARM64 target (SM8150 , CPU0 & 6 are online, running at max
> frequency)
> All numbers are mean of 100 iterations. Variation is ignorable.
> ----------------------------------------------------------------------
> - Oneshot : 3389.26 us
> - Forward : 8876.16 us
> - Reverse : 18157.6 us
> ----------------------------------------------------------------------
>
> ----------------------------------------------------------------------
> Results for x86-64 (Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz, only CPU 0 in max
> frequency, DDR also running at max frequency.)
> All numbers are mean of 100 iterations. Variation is ignorable.
> ----------------------------------------------------------------------
> - Oneshot : 3203.49 us
> - Forward : 5766.46 us
> - Reverse : 5187.86 us
> ----------------------------------------------------------------------
>
> Hence refactor the function process_huge_page() to process the hugepage
> in oneshot manner using oneshot version of routines clear_huge_page() and
> copy_user_huge_page() for !HIGHMEM cases.
>
> These oneshot routines do zeroing using memset and copying using memcpy since we
> observed after extensive testing on ARM64 and some local testing on x86 memset
> and memcpy routines are highly optimized and with the above data points in hand
> it made sense to utilize them directly instead of looping over all subpages.
> These oneshot routines do zero and copy with a small offset(default kept as 32KB for
> now) to keep the cache hot around the faulting address. This offset is dependent
> on the cache size and hence can be kept as a tunable configuration option.
>
> The below profiles are for ARM64(SM8150, CPU0 & 6 are online, running at max
> frequency, DDR also running at max frequency.)
>
> ----------------------------------------------------------------------
> Ftrace Results(clear_huge_page_profile()):
> ----------------------------------------------------------------------
> All timing values are in microseconds(us)
> ----------------------------------------------------------------------
> Base:
>   - CPU0:
>     - Samples: 95
>     - Mean: 242.099 us
>     - Std dev: 45.0096 us
>   - CPU6:
>     - Samples: 61
>     - Mean: 258.372 us
>     - Std dev: 22.0754 us
> ----------------------------------------------------------------------
> v2:
>   - CPU0:
>     - Samples: 63
>     - Mean: 112.297 us
>     - Std dev: 0.310989 us
>   - CPU6:
>     - Samples: 99
>     - Mean: 67.359 us
>     - Std dev: 1.15997 us
> ----------------------------------------------------------------------

In addition to clearing the huge page itself, we need to consider how
this impacts the application that accesses the huge page. For example,
the huge page may be accessed twice: once in the kernel (zeroing) and
once in user space (initializing). So please find a way to test that
as well.

As pointed out by Alexander, please consider the cache contention among
the logical CPUs too.

Best Regards,
Huang, Ying