From: "Huang, Ying"
To: Alexander Duyck
Cc: Prathu Baronia, Michal Hocko, Chintan Pandya, linux-mm, Greg Thelen, Ken Lin, Gasine Xu
Subject: Re: [PATCH v2] mm: Optimized hugepage zeroing & copying from user
References: <20200414153829.GA15230@oneplus.com> <20200414170312.GR4629@dhcp22.suse.cz> <20200414184743.GB2097@oneplus.com>
Date: Wed, 15 Apr 2020 11:40:42 +0800
In-Reply-To: (Alexander Duyck's message of "Tue, 14 Apr 2020 12:32:57 -0700")
Message-ID: <87mu7dza9x.fsf@yhuang-dev.intel.com>

Alexander Duyck writes:

> On Tue, Apr 14, 2020 at 11:47 AM Prathu Baronia wrote:
>>
>> On 04/14/2020 19:03, Michal Hocko wrote:
>> > I still have a hard time seeing why the kmap machinery should
>> > introduce any slowdown here. Previous data posted while discussing
>> > v1 didn't really show anything outside of the noise.
>> >
>> You are right: the multiple barriers are not responsible for the
>> slowdown, but removing kmap_atomic() allows us to call memset and
>> memcpy for larger sizes.
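The effect described above can be shown with a userspace sketch (not the kernel code; HPAGE_SIZE, CHUNK_SIZE, and the function names are illustrative assumptions): clearing a 2 MiB region in per-4 KiB memset calls, as a kmap_atomic()-per-subpage loop forces, versus a single memset over the whole region:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define HPAGE_SIZE (2UL << 20)  /* assume a 2 MiB huge page */
#define CHUNK_SIZE 4096UL       /* one base page per kmap_atomic() window */

/* Chunked clearing: with kmap_atomic() only one 4 KiB subpage is mapped
 * at a time, so the huge page must be zeroed in per-page memset calls. */
static void clear_chunked(unsigned char *buf)
{
    for (unsigned long off = 0; off < HPAGE_SIZE; off += CHUNK_SIZE)
        memset(buf + off, 0, CHUNK_SIZE);
}

/* One-shot clearing: with the whole huge page mapped, a single memset
 * covers it, letting the C library use the widest stores available. */
static void clear_oneshot(unsigned char *buf)
{
    memset(buf, 0, HPAGE_SIZE);
}
```

The one-shot form hands the implementation one large region, so it can choose wide vector or cache-bypassing stores once instead of restarting that decision every 4 KiB.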
>> I will rephrase this part of the commit text when we proceed towards
>> v3, to present it more cleanly.
>> >
>> > It would be really nice to provide the standard deviation.
>> >
>> Here is the data with the standard deviation:
>>
>> ----------------------------------------------------------------------
>> Results for the ARM64 target (SM8150, CPUs 0 and 6 online, running at
>> max frequency). All numbers are the mean of 100 iterations; variation
>> is negligible.
>> ----------------------------------------------------------------------
>> - Oneshot : 3389.26 us   std: 79.1377 us
>> - Forward : 8876.16 us   std: 172.699 us
>> - Reverse : 18157.6 us   std: 111.713 us
>> ----------------------------------------------------------------------
>>
>> ----------------------------------------------------------------------
>> Results for x86-64 (Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz, only
>> CPU 0 at max frequency, DDR also at max frequency). All numbers are
>> the mean of 100 iterations; variation is negligible.
>> ----------------------------------------------------------------------
>> - Oneshot : 3203.49 us   std: 115.4086 us
>> - Forward : 5766.46 us   std: 328.6299 us
>> - Reverse : 5187.86 us   std: 341.1918 us
>> ----------------------------------------------------------------------
>>
>> > No. There is absolutely zero reason to add a config option for this.
>> > The kernel should have all the information to make an educated
>> > guess.
>> >
>> I will try to incorporate this in v3, but currently I don't have any
>> idea how to implement the guessing logic. I would really appreciate it
>> if you could suggest a way to go about it.
>>
>> > Also, before going any further: the patch which introduced the
>> > optimization was c79b57e462b5 ("mm: hugetlb: clear target sub-page
>> > last when clearing huge page"). It is based on an artificial
>> > benchmark which, to my knowledge, doesn't represent any real
>> > workload.
>> > Your measurements are based on a different benchmark, and your
>> > numbers clearly show that some assumptions used for the optimization
>> > are not architecture neutral.
>> >
>> But the oneshot numbers are significantly better on both
>> architectures, and I think the oneshot approach should theoretically
>> give better results than the serial approach on all architectures.
>> Isn't it a fair assumption to go ahead with the oneshot approach?
>
> I think the point Michal is getting at is that there are other tests
> that need to be run. You are running the test on just one core. What
> happens as we start fanning this out and having multiple instances
> running per socket? We would be flooding the LLC in addition to
> overwriting all the other caches.
>
> If you take a look at commit c6ddfb6c58903 ("mm, clear_huge_page: move
> order algorithm into a separate function"), they ran the tests on
> multiple threads simultaneously because their concern was flooding the
> LLC. I wonder if we couldn't bypass the cache entirely using something
> like __copy_user_nocache for some portion of the copy, and then copy
> in only the last pieces that we think will be immediately accessed.

The problem is how to determine the size of the pieces that will be
immediately accessed.

Best Regards,
Huang, Ying
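The cache-bypass idea discussed above can be sketched in userspace with SSE2 streaming stores (x86-only; __copy_user_nocache itself is kernel-internal, so the function name and the `hot` tail size here are illustrative assumptions, not the kernel's implementation):

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_stream_si128, _mm_sfence (x86 only) */
#include <stdlib.h>
#include <string.h>

/* Copy `len` bytes (dst 16-byte aligned, len a multiple of 16) with
 * non-temporal stores so the bulk of the copy bypasses the cache, then
 * redo the last `hot` bytes with ordinary cached stores, on the theory
 * that only that tail will be accessed immediately after the copy. */
static void copy_bulk_nocache(void *dst, const void *src,
                              size_t len, size_t hot)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    size_t i;

    for (i = 0; i < len / 16; i++)
        _mm_stream_si128(d + i, _mm_loadu_si128(s + i));
    _mm_sfence();  /* make the streaming stores globally visible */

    if (hot > len)
        hot = len;
    /* Pull the presumed-hot tail back into the cache. */
    memcpy((char *)dst + len - hot, (const char *)src + len - hot, hot);
}
```

The open question above maps directly to choosing `hot`: too small and the faulting thread misses on its first touches; too large and the copy floods the LLC again.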