From: "Huang, Ying"
To: Alexander Duyck
Cc: Prathu Baronia, Michal Hocko, Chintan Pandya, linux-mm, Greg Thelen, Ken Lin, Gasine Xu
Subject: Re: [PATCH v2] mm: Optimized hugepage zeroing & copying from user
References: <20200414153829.GA15230@oneplus.com> <20200414170312.GR4629@dhcp22.suse.cz> <20200414184743.GB2097@oneplus.com>
Date: Wed, 15 Apr 2020 11:40:42 +0800
In-Reply-To: (Alexander Duyck's message of "Tue, 14 Apr 2020 12:32:57 -0700")
Message-ID: <87mu7dza9x.fsf@yhuang-dev.intel.com>

Alexander Duyck writes:

> On Tue, Apr 14, 2020 at 11:47 AM Prathu Baronia wrote:
>>
>> On 04/14/2020 19:03, Michal Hocko wrote:
>> > I still have a hard time seeing why the kmap machinery should
>> > introduce any slowdown here. Previous data posted while discussing
>> > v1 didn't really show anything outside of the noise.
>> >
>> You are right: the multiple barriers are not responsible for the
>> slowdown, but removing kmap_atomic() allows us to call memset and
>> memcpy for larger sizes.
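The effect described above can be shown with a userspace sketch (not the kernel code; HPAGE_SIZE, CHUNK_SIZE, and the function names are illustrative assumptions): clearing a 2 MiB region in per-4 KiB memset calls, as a kmap_atomic()-per-subpage loop forces, versus a single memset over the whole region:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define HPAGE_SIZE (2UL << 20)  /* assume a 2 MiB huge page */
#define CHUNK_SIZE 4096UL       /* one base page per kmap_atomic() window */

/* Chunked clearing: with kmap_atomic() only one 4 KiB subpage is mapped
 * at a time, so the huge page must be zeroed in per-page memset calls. */
static void clear_chunked(unsigned char *buf)
{
    for (unsigned long off = 0; off < HPAGE_SIZE; off += CHUNK_SIZE)
        memset(buf + off, 0, CHUNK_SIZE);
}

/* One-shot clearing: with the whole huge page mapped, a single memset
 * covers it, letting the C library use the widest stores available. */
static void clear_oneshot(unsigned char *buf)
{
    memset(buf, 0, HPAGE_SIZE);
}
```

The one-shot form hands the implementation one large region, so it can choose wide vector or cache-bypassing stores once instead of restarting that decision every 4 KiB.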
>> I will rephrase this part of the commit text when we proceed towards
>> v3, to present it more cleanly.
>> >
>> > It would be really nice to provide the standard deviation.
>> >
>> Here is the data with the standard deviation:
>>
>> ----------------------------------------------------------------------
>> Results for the ARM64 target (SM8150, CPUs 0 and 6 online, running at
>> max frequency). All numbers are the mean of 100 iterations; variation
>> is negligible.
>> ----------------------------------------------------------------------
>> - Oneshot : 3389.26 us   std: 79.1377 us
>> - Forward : 8876.16 us   std: 172.699 us
>> - Reverse : 18157.6 us   std: 111.713 us
>> ----------------------------------------------------------------------
>>
>> ----------------------------------------------------------------------
>> Results for x86-64 (Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz, only
>> CPU 0 at max frequency, DDR also at max frequency). All numbers are
>> the mean of 100 iterations; variation is negligible.
>> ----------------------------------------------------------------------
>> - Oneshot : 3203.49 us   std: 115.4086 us
>> - Forward : 5766.46 us   std: 328.6299 us
>> - Reverse : 5187.86 us   std: 341.1918 us
>> ----------------------------------------------------------------------
>>
>> > No. There is absolutely zero reason to add a config option for this.
>> > The kernel should have all the information to make an educated
>> > guess.
>> >
>> I will try to incorporate this in v3, but currently I don't have any
>> idea how to implement the guessing logic. I would really appreciate it
>> if you could suggest a way to go about it.
>>
>> > Also, before going any further: the patch which introduced the
>> > optimization was c79b57e462b5 ("mm: hugetlb: clear target sub-page
>> > last when clearing huge page"). It is based on an artificial
>> > benchmark which, to my knowledge, doesn't represent any real
>> > workload.
>> > Your measurements are based on a different benchmark, and your
>> > numbers clearly show that some assumptions used for the optimization
>> > are not architecture neutral.
>> >
>> But the oneshot numbers are significantly better on both
>> architectures, and I think the oneshot approach should theoretically
>> give better results than the serial approach on all architectures.
>> Isn't it a fair assumption to go ahead with the oneshot approach?
>
> I think the point Michal is getting at is that there are other tests
> that need to be run. You are running the test on just one core. What
> happens as we start fanning this out and having multiple instances
> running per socket? We would be flooding the LLC in addition to
> overwriting all the other caches.
>
> If you take a look at commit c6ddfb6c58903 ("mm, clear_huge_page: move
> order algorithm into a separate function"), they ran the tests on
> multiple threads simultaneously because their concern was flooding the
> LLC. I wonder if we couldn't bypass the cache entirely using something
> like __copy_user_nocache for some portion of the copy, and then copy
> in only the last pieces that we think will be immediately accessed.

The problem is how to determine the size of the pieces that will be
immediately accessed.

Best Regards,
Huang, Ying
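The cache-bypass idea discussed above can be sketched in userspace with SSE2 streaming stores (x86-only; __copy_user_nocache itself is kernel-internal, so the function name and the `hot` tail size here are illustrative assumptions, not the kernel's implementation):

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_stream_si128, _mm_sfence (x86 only) */
#include <stdlib.h>
#include <string.h>

/* Copy `len` bytes (dst 16-byte aligned, len a multiple of 16) with
 * non-temporal stores so the bulk of the copy bypasses the cache, then
 * redo the last `hot` bytes with ordinary cached stores, on the theory
 * that only that tail will be accessed immediately after the copy. */
static void copy_bulk_nocache(void *dst, const void *src,
                              size_t len, size_t hot)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    size_t i;

    for (i = 0; i < len / 16; i++)
        _mm_stream_si128(d + i, _mm_loadu_si128(s + i));
    _mm_sfence();  /* make the streaming stores globally visible */

    if (hot > len)
        hot = len;
    /* Pull the presumed-hot tail back into the cache. */
    memcpy((char *)dst + len - hot, (const char *)src + len - hot, hot);
}
```

The open question above maps directly to choosing `hot`: too small and the faulting thread misses on its first touches; too large and the copy floods the LLC again.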