Date: Wed, 22 Apr 2020 12:19:29 +0100
From: Will Deacon
To: Vlastimil Babka
Cc: Prathu Baronia, catalin.marinas@arm.com, alexander.duyck@gmail.com,
 chintan.pandya@oneplus.com, mhocko@suse.com, akpm@linux-foundation.org,
 linux-mm@kvack.org, gregkh@linuxfoundation.com, gthelen@google.com,
 jack@suse.cz, ken.lin@oneplus.com, gasine.xu@oneplus.com,
 ying.huang@intel.com, mark.rutland@arm.com
Subject: Re: [PATCH v2] mm: Optimized hugepage zeroing & copying from user
Message-ID: <20200422111928.GA32051@willie-the-truck>
References: <87r1wpzavo.fsf@yhuang-dev.intel.com>
 <20200419155856.dtwxomdkyujljdfi@oneplus.com>
 <87k12bt3ff.fsf@yhuang-dev.intel.com>
 <20200421093621.3fuptvf2qbyfzwfz@oneplus.com>
 <20200421100932.GC17256@willie-the-truck>
 <02d5daa8-ee7b-7d2d-6753-5191a7d761b9@suse.cz>
 <20200421133935.GC17875@willie-the-truck>
 <5e334947-22e9-e59d-f7bb-63e04cc8caf0@suse.cz>
 <20200422081852.GB29541@willie-the-truck>
In-Reply-To: <20200422081852.GB29541@willie-the-truck>

On Wed, Apr 22, 2020 at 09:18:52AM +0100, Will Deacon wrote:
> On Tue, Apr 21, 2020 at 03:48:04PM +0200, Vlastimil Babka wrote:
> > On 4/21/20 3:39 PM, Will Deacon wrote:
> > > On Tue, Apr 21, 2020 at 02:48:04PM +0200, Vlastimil Babka wrote:
> > >> On 4/21/20 2:47 PM, Vlastimil Babka wrote:
> > >> >
> > >> > It was suspected that current Intel can prefetch forward and backwards, and the
> > >> > tested ARM64 microarchitecture only backwards, can it be true? The current code
> > >>
> > >> Oops, tested ARM64 microarchitecture I meant "only forwards".
> > >
> > > I'd be surprised if that's the case, but it could be that there's an erratum
> > > workaround in play which hampers the prefetch behaviour. We generally try
> > > not to assume too much about the prefetcher on arm64 because they're not
> > > well documented and vary wildly between different micro-architectures.
> >
> > Yeah it's probably not as simple as I thought, as the test code [1] shows the
> > page iteration goes backwards, but per-page memsets are not special. So maybe
> > it's not hardware specifics, but x86 memtest implementation is also done
> > backwards, so it fits the backwards outer loop, but arm64 memset is forward, so
> > the resulting pattern is non-linear?
>
> A straightforward linear prefetcher would probably be defeated by that sort
> of thing, yes, but I'd have thought that the recent CPUs (e.g. A76 which I
> think is the "big" CPU in the SoC mentioned at the start of the thread)
> would still have a fighting chance at prefetching based on non-linear
> histories.
>
> However, to my earlier point, we're making this more difficult than it needs
> to be for the hardware and we shouldn't assume that all prefetchers will
> handle it gracefully, so keeping the core code relatively straightforward
> does seem to be the best bet. Alarm bells just rang initially when it
> appeared that we were optimising code under arch/arm64 rather than improving
> the core code, but I now have a better picture of what's going on (thanks).
>
> Alternatively, we could switch our memset() around, but I'm worried that
> we could end up hurting something else by doing that. I guess we could add a
> memset_backwards() version if we *had* to...
>
> > In that case it's also a question if the measurement was done in kernel or
> > userspace, and if userspace memset have any implications for kernel memset...
>
> Sounds like it was done in userspace. If I get a chance later on, I'll try
> to give it a spin here on some of the boards I have kicking around.

I wrote the silly harness below for the snippets given in [1] but I can't
see any difference between the forwards and backwards versions on any
arm64 systems I have access to.

Will

[1] https://lore.kernel.org/linux-mm/20200414153829.GA15230@oneplus.com/

--->8

#if 0
	/* One shot memset */
	memset (r, 0xd, total_size);

	/* traverse in forward order */
	for (j = 0; j < total_pages; j++)
	{
		memset (q + (j * SZ_4K), 0xc, SZ_4K);
	}

	/* traverse in reverse order */
	for (i = 0; i < total_pages; i++)
	{
		memset (p + total_size - (i + 1) * SZ_4K, 0xb, SZ_4K);
	}
#endif

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_SZ		(1UL*1024*1024*1024)
#define PAGE_SIZE	0x1000
#define BUF_SZ_PAGES	(BUF_SZ / PAGE_SIZE)

#define NSECS_PER_SEC	1000000000ULL

static void do_the_thing_forwards(void *buf)
{
	unsigned long i;

	for (i = 0; i < BUF_SZ_PAGES; i++)
		memset(buf + (i * PAGE_SIZE), 0xc, PAGE_SIZE);
}

static void do_the_thing_backwards(void *buf)
{
	unsigned long i;

	for (i = 0; i < BUF_SZ_PAGES; i++)
		memset(buf + BUF_SZ - (i + 1) * PAGE_SIZE, 0xc, PAGE_SIZE);
}

int main(void)
{
	void *buf;
	unsigned long long delta;
	struct timespec ts_start, ts_end;

	if (posix_memalign(&buf, PAGE_SIZE, BUF_SZ)) {
		perror("posix_memalign()");
		return -1;
	}

	/* Touch the whole buffer once so the timed passes don't take page faults */
	memset(buf, 0xd, BUF_SZ);

	if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts_start)) {
		perror("clock_gettime()");
		return -1;
	}

	do_the_thing_forwards(buf);
	do_the_thing_forwards(buf);
	do_the_thing_forwards(buf);
	do_the_thing_forwards(buf);
	do_the_thing_forwards(buf);

	if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts_end)) {
		perror("clock_gettime()");
		return -1;
	}

	delta = NSECS_PER_SEC * (ts_end.tv_sec - ts_start.tv_sec);
	delta += (ts_end.tv_nsec - ts_start.tv_nsec);

	printf("Forwards: took %f seconds\n",
	       (double)(delta / (double)NSECS_PER_SEC));

	if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts_start)) {
		perror("clock_gettime()");
		return -1;
	}

	do_the_thing_backwards(buf);
	do_the_thing_backwards(buf);
	do_the_thing_backwards(buf);
	do_the_thing_backwards(buf);
	do_the_thing_backwards(buf);

	if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts_end)) {
		perror("clock_gettime()");
		return -1;
	}

	delta = NSECS_PER_SEC * (ts_end.tv_sec - ts_start.tv_sec);
	delta += (ts_end.tv_nsec - ts_start.tv_nsec);

	printf("Backwards: took %f seconds\n",
	       (double)(delta / (double)NSECS_PER_SEC));

	return 0;
}
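
The harness above reports one aggregate time for five passes in each direction. For anyone trying to reproduce the comparison, a possible variation is to time each pass on its own and keep the fastest, which tends to be less sensitive to scheduling noise. This is only a sketch under stated assumptions, not what was run for this mail: the names SKETCH_PAGE_SIZE, timespec_delta_ns(), fill_pages() and best_pass_ns() are hypothetical, and CLOCK_MONOTONIC is used in place of CLOCK_MONOTONIC_RAW purely for portability.

```c
#define _POSIX_C_SOURCE 200809L	/* for clock_gettime() under strict -std= modes */

#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical name; matches the 4K page size assumed by the harness. */
#define SKETCH_PAGE_SIZE	0x1000UL

/* Elapsed nanoseconds from *start to *end; end must not precede start. */
static uint64_t timespec_delta_ns(const struct timespec *start,
				  const struct timespec *end)
{
	int64_t sec = (int64_t)end->tv_sec - (int64_t)start->tv_sec;
	int64_t nsec = (int64_t)end->tv_nsec - (int64_t)start->tv_nsec;

	assert(sec > 0 || (sec == 0 && nsec >= 0));
	return (uint64_t)(sec * 1000000000LL + nsec);
}

/*
 * memset() the buffer one page at a time. The outer loop visits pages in
 * ascending order, or in descending order when 'backwards' is non-zero.
 * Each per-page memset() still runs in whatever direction libc chooses.
 */
static void fill_pages(unsigned char *buf, size_t npages, int val,
		       int backwards)
{
	size_t i;

	for (i = 0; i < npages; i++) {
		size_t page = backwards ? npages - 1 - i : i;

		memset(buf + page * SKETCH_PAGE_SIZE, val, SKETCH_PAGE_SIZE);
	}
}

/*
 * Time 'runs' separate passes and return the fastest one in nanoseconds.
 * Returns UINT64_MAX if every clock_gettime() call failed.
 */
static uint64_t best_pass_ns(unsigned char *buf, size_t npages,
			     int backwards, unsigned int runs)
{
	uint64_t best = UINT64_MAX;
	unsigned int r;

	assert(runs > 0);

	for (r = 0; r < runs; r++) {
		struct timespec ts_start, ts_end;
		uint64_t ns;

		if (clock_gettime(CLOCK_MONOTONIC, &ts_start))
			continue;
		fill_pages(buf, npages, 0xc, backwards);
		if (clock_gettime(CLOCK_MONOTONIC, &ts_end))
			continue;

		ns = timespec_delta_ns(&ts_start, &ts_end);
		if (ns < best)
			best = ns;
	}

	return best;
}
```

With a buffer allocated and pre-touched as in the harness, best_pass_ns(buf, BUF_SZ_PAGES, 0, 5) and best_pass_ns(buf, BUF_SZ_PAGES, 1, 5) would give the per-direction figures to compare. As in the original, this only measures a userspace memset(), so it says nothing directly about the kernel's implementation.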