From mboxrd@z Thu Jan  1 00:00:00 1970
From: "Huang, Ying"
To: Prathu Baronia
Subject: Re: [PATCH v2] mm: Optimized hugepage zeroing & copying from user
References: <20200414153829.GA15230@oneplus.com> <87r1wpzavo.fsf@yhuang-dev.intel.com>
Date: Thu, 16 Apr 2020 09:21:45 +0800
In-Reply-To: <87r1wpzavo.fsf@yhuang-dev.intel.com> (Ying Huang's message of "Wed, 15 Apr 2020 11:27:39 +0800")
Message-ID: <87sgh4xm1i.fsf@yhuang-dev.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.1 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=ascii

"Huang, Ying" writes:

> Prathu Baronia writes:
>
>> In !HIGHMEM cases, especially on 64-bit architectures, we don't need a
>> temporary mapping of pages.
>> Hence, k(map|unmap)_atomic() acts as nothing more than multiple
>> barrier() calls; for example, for a 2MB hugepage in clear_huge_page()
>> these are called 512 times each, i.e. to map and unmap every subpage,
>> which amounts to 2048 barrier() calls in total. This calls for
>> optimization. Simply getting the VADDR from the page does the job for
>> us. The same applies to the copy_user_huge_page() function.
>>
>> With kmap_atomic() out of the picture we can use memset and memcpy for
>> sizes larger than 4K. In a simple experiment, getting the VADDR from
>> the page and using memset directly, instead of the left-right approach
>> of accessing the target subpage, we observed a 64% improvement in time
>> over the current approach.
>>
>> With this (v2) patch we observe a 65.85% improvement (under controlled
>> conditions) over the current approach.
>
> Can you describe your test?
>
>> Currently process_huge_page() iterates over subpages in a left-right
>> manner, targeting the subpage that was accessed so that it is processed
>> last, to keep the cache hot around the faulting address. This causes a
>> latency issue because, as we observed on ARM64, reverse access is much
>> slower than forward access, and much slower still than one-shot access,
>> due to prefetcher behaviour. The following simple userspace experiment,
>> which allocates 100MB (total_size) of pages and writes to it using
>> memset (one shot), a forward-order loop, and a reverse-order loop, gave
>> us good insight:
>>
>> --------------------------------------------------------------------------------
>> Test code snippet:
>> --------------------------------------------------------------------------------
>> /* One shot memset */
>> memset (r, 0xd, total_size);
>>
>> /* traverse in forward order */
>> for (j = 0; j < total_pages; j++)
>>   {
>>     memset (q + (j * SZ_4K), 0xc, SZ_4K);
>>   }
>>
>> /* traverse in reverse order */
>> for (i = 0; i < total_pages; i++)
>>   {
>>     memset (p + total_size - (i + 1) * SZ_4K, 0xb, SZ_4K);
>>   }
>
> You have tested the chunk sizes 4KB and 2MB; can you test some values in
> between, for example 32KB or 64KB? Maybe there's a sweet spot with some
> smaller granularity and good performance.

And if you test in user space, please make sure you copy the memset
implementation from the kernel, because the libc memset implementation
may be quite different; for example, it may use AVX instructions on x86,
while the kernel memset does not.

Best Regards,
Huang, Ying
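
--------------------------------------------------------------------------------
A minimal sketch of the idea being discussed, assuming a !HIGHMEM
configuration with CONFIG_TRANSPARENT_HUGEPAGE; clear_huge_page_oneshot()
is an illustrative helper name, not the actual patch or a mainline
function:
--------------------------------------------------------------------------------
#include <linux/mm.h>
#include <linux/huge_mm.h>
#include <linux/string.h>

/*
 * Illustrative only: on !HIGHMEM configurations every page lives in the
 * linear map, so page_address() already gives a usable VADDR and the
 * whole hugepage can be cleared with one memset() instead of a
 * kmap_atomic()/clear/kunmap_atomic() round trip per 4K subpage.
 */
static void clear_huge_page_oneshot(struct page *page)
{
        void *addr = page_address(page);   /* linear-map address, no kmap needed */

        memset(addr, 0, HPAGE_PMD_SIZE);   /* 2MB with 4K base pages */
}

Note that a sketch like this drops the cache-hot ordering around the
faulting subpage and the cond_resched() between subpages that the current
process_huge_page() loop provides, which is exactly the trade-off the
numbers above are trying to quantify.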
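
--------------------------------------------------------------------------------
A self-contained userspace harness along the lines of the snippet quoted
above, extended to sweep the chunk size (4K/32K/64K/2M) and to use a plain
byte-wise loop instead of libc's memset. memset_plain(), now_sec(), and the
sizes are illustrative; the naive byte loop is only meant to take libc's
vectorized memset out of the picture, not to reproduce the kernel's memset,
so absolute numbers will differ and only the relative forward/reverse/
one-shot comparison is meaningful:
--------------------------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define TOTAL_SIZE      (100UL * 1024 * 1024)   /* 100MB, as in the thread */

/* Naive, non-vectorized byte loop; volatile keeps the compiler from
 * replacing it with a call to libc memset. */
static void *memset_plain(void *s, int c, size_t n)
{
        volatile unsigned char *p = s;

        while (n--)
                *p++ = (unsigned char)c;
        return s;
}

static double now_sec(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
        size_t chunks[] = { 4096, 32768, 65536, 2UL * 1024 * 1024 };
        char *buf = aligned_alloc(2UL * 1024 * 1024, TOTAL_SIZE);
        size_t i, j, chunk;
        double t;

        if (!buf)
                return 1;

        /* Fault in all pages up front so the timed loops measure cache and
         * prefetch behaviour, not page-fault cost. */
        memset(buf, 0, TOTAL_SIZE);

        for (i = 0; i < sizeof(chunks) / sizeof(chunks[0]); i++) {
                chunk = chunks[i];

                t = now_sec();
                for (j = 0; j < TOTAL_SIZE / chunk; j++)        /* forward order */
                        memset_plain(buf + j * chunk, 0xc, chunk);
                printf("chunk %7zu forward: %.3f s\n", chunk, now_sec() - t);

                t = now_sec();
                for (j = 0; j < TOTAL_SIZE / chunk; j++)        /* reverse order */
                        memset_plain(buf + TOTAL_SIZE - (j + 1) * chunk, 0xb, chunk);
                printf("chunk %7zu reverse: %.3f s\n", chunk, now_sec() - t);
        }

        t = now_sec();
        memset_plain(buf, 0xd, TOTAL_SIZE);                     /* one shot */
        printf("one-shot: %.3f s\n", now_sec() - t);

        free(buf);
        return 0;
}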