Date: Tue, 14 Apr 2020 19:03:12 +0200
From: Michal Hocko
To: Prathu Baronia
Cc: alexander.duyck@gmail.com, chintan.pandya@oneplus.com, ying.huang@intel.com,
    akpm@linux-foundation.com, linux-mm@kvack.org, gregkh@linuxfoundation.com,
    gthelen@google.com, jack@suse.cz, ken.lin@oneplus.com, gasine.xu@oneplus.com
Subject: Re: [PATCH v2] mm: Optimized hugepage zeroing & copying from user
Message-ID: <20200414170312.GR4629@dhcp22.suse.cz>
In-Reply-To: <20200414153829.GA15230@oneplus.com>

On Tue 14-04-20 21:08:32, Prathu Baronia wrote:
> In !HIGHMEM cases, especially on 64-bit architectures, we don't need a
> temporary mapping of pages. Hence, k(map|unmap)_atomic() acts as nothing more
> than multiple barrier() calls; for example, for a 2MB hugepage in
> clear_huge_page() these are called 512 times, i.e. to map and unmap each
> subpage, which amounts to 2048 barrier calls in total. This calls for
> optimization. Simply getting the VADDR from the page does the job for us.
> The same applies to the copy_user_huge_page() function.

I still have a hard time seeing why the kmap machinery should introduce any
slowdown here. Previous data posted while discussing v1 didn't really show
anything outside of the noise.
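
For context, a minimal sketch of the two code paths being compared above. The
helper names are illustrative and this is not the actual mm/memory.c code; it
assumes !HIGHMEM and physically contiguous struct pages:

--------------------------------------------------------------------------------
#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/string.h>

/* Current style: map, clear and unmap each 4K subpage individually. */
static void clear_hugepage_per_subpage(struct page *page, unsigned int nr_subpages)
{
	unsigned int i;

	for (i = 0; i < nr_subpages; i++) {
		/* On !HIGHMEM this boils down to page_address() plus barriers. */
		void *kaddr = kmap_atomic(page + i);

		clear_page(kaddr);
		kunmap_atomic(kaddr);
		cond_resched();
	}
}

/* Proposed style: the hugepage is already mapped, so memset it in one shot. */
static void clear_hugepage_oneshot(struct page *page, unsigned int nr_subpages)
{
	memset(page_address(page), 0, nr_subpages * PAGE_SIZE);
}
--------------------------------------------------------------------------------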
> With kmap_atomic() out of the picture we can use memset and memcpy for sizes
> larger than 4K. Instead of the left-right approach to access the target
> subpage, by getting the VADDR from the page and using memset directly, in a
> simple experiment we observed a 64% improvement in time over the current
> approach.
>
> With this (v2) patch we observe a 65.85% improvement (under controlled
> conditions) over the current approach.
>
> Currently process_huge_page() iterates over subpages in a left-right manner,
> so that the subpage that was actually accessed is processed last in order to
> keep the cache hot around the faulting address. This causes a latency issue
> because, as we observed on ARM64, reverse access is much slower than forward
> access and much, much slower than oneshot access because of the prefetcher
> behaviour. The following simple userspace experiment, allocating 100MB
> (total_size) of pages and writing to them using memset (oneshot), a
> forward-order loop and a reverse-order loop, gave us a good insight:
>
> --------------------------------------------------------------------------------
> Test code snippet:
> --------------------------------------------------------------------------------
> /* One shot memset */
> memset(r, 0xd, total_size);
>
> /* traverse in forward order */
> for (j = 0; j < total_pages; j++)
> {
>         memset(q + (j * SZ_4K), 0xc, SZ_4K);
> }
>
> /* traverse in reverse order */
> for (i = 0; i < total_pages; i++)
> {
>         memset(p + total_size - (i + 1) * SZ_4K, 0xb, SZ_4K);
> }
> ----------------------------------------------------------------------
> Results:
> ----------------------------------------------------------------------
> Results for ARM64 target (SM8150, CPU0 & 6 are online, running at max
> frequency)
> All numbers are the mean of 100 iterations. Variation is negligible.

It would be really nice to provide the std.

> ----------------------------------------------------------------------
> - Oneshot : 3389.26 us
> - Forward : 8876.16 us
> - Reverse : 18157.6 us
> ----------------------------------------------------------------------
>
> ----------------------------------------------------------------------
> Results for x86-64 (Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz, only CPU 0 at max
> frequency, DDR also running at max frequency.)
> All numbers are the mean of 100 iterations. Variation is negligible.
> ----------------------------------------------------------------------
> - Oneshot : 3203.49 us
> - Forward : 5766.46 us
> - Reverse : 5187.86 us
> ----------------------------------------------------------------------
>
> Hence refactor process_huge_page() to process the hugepage in a oneshot
> manner, using oneshot versions of the routines clear_huge_page() and
> copy_user_huge_page() for !HIGHMEM cases. These oneshot routines do the
> zeroing with memset and the copying with memcpy, since we observed, after
> extensive testing on ARM64 and some local testing on x86, that memset and
> memcpy are highly optimized, and with the above data points in hand it made
> sense to use them directly instead of looping over all subpages. These
> oneshot routines zero and copy with a small offset (currently defaulting to
> 32KB) to keep the cache hot around the faulting address. This offset depends
> on the cache size and hence can be kept as a tunable configuration option.

No. There is absolutely zero reason to add a config option for this. The
kernel should have all the information needed to make an educated guess.
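
For reference, a rough sketch of the oneshot zeroing with a cache-hot window
around the faulting address as described in the quoted text. This is a
hypothetical function, not the actual v2 patch; it assumes !HIGHMEM, a
power-of-two hugepage size, and uses the 32KB default mentioned above as a
fixed value:

--------------------------------------------------------------------------------
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/sizes.h>
#include <linux/string.h>

#define CACHE_HOT_WINDOW	SZ_32K

static void clear_huge_page_oneshot(struct page *page, unsigned long addr_hint,
				    unsigned int pages_per_huge_page)
{
	void *base = page_address(page);
	unsigned long huge_size = (unsigned long)pages_per_huge_page * PAGE_SIZE;
	/* Offset of the faulting address within the hugepage. */
	unsigned long fault_off = addr_hint & (huge_size - 1);
	/* Window around the faulting address, clamped to the hugepage. */
	unsigned long win_start = fault_off > CACHE_HOT_WINDOW / 2 ?
				  fault_off - CACHE_HOT_WINDOW / 2 : 0;
	unsigned long win_end = min(win_start + CACHE_HOT_WINDOW, huge_size);

	/* Zero everything outside the window first... */
	memset(base, 0, win_start);
	memset(base + win_end, 0, huge_size - win_end);
	/* ...and the window around the faulting address last, so it stays hot. */
	memset(base + win_start, 0, win_end - win_start);
}
--------------------------------------------------------------------------------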
Also, before going any further: the patch which introduced the optimization
was c79b57e462b5 ("mm: hugetlb: clear target sub-page last when clearing huge
page"). It is based on an artificial benchmark which, to my knowledge, doesn't
represent any real workload. Your measurements are based on a different
benchmark. Your numbers clearly show that some of the assumptions behind the
optimization are not architecture neutral. In such a case I would much rather
revert the optimization and only build additional complexity on top of it
based on real workloads.

-- 
Michal Hocko
SUSE Labs