From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Elliott, Robert (Server Storage)"
To: Daniel J Blueman, nzimmer, Mel Gorman
CC: Pekka Enberg, Andrew Morton, Dave Hansen, "Long, Wai Man",
	"Norton, Scott J", Linux-MM, LKML, 'Steffen Persvold',
	"Boaz Harrosh (boaz@plexistor.com)", dan.j.williams@intel.com,
	linux-nvdimm@lists.01.org
Subject: RE: [PATCH 0/13] Parallel struct page initialisation v4
Date: Sat, 2 May 2015 11:52:18 +0000
Message-ID: <94D0CD8314A33A4D9D801C0FE68B40295A8CE70F@G9W0745.americas.hpqcorp.net>
References: <553FD39C.2070503@sgi.com> <1430410227.8193.0@cpanel21.proisp.no>
In-Reply-To: <1430410227.8193.0@cpanel21.proisp.no>
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-
> owner@vger.kernel.org] On Behalf Of Daniel J Blueman
> Sent: Thursday, April 30, 2015 11:10 AM
> Subject: Re: [PATCH 0/13] Parallel struct page initialisation v4
...
> On a 7TB, 1728-core NumaConnect system with 108 NUMA nodes, we're
> seeing stock 4.0 boot in 7136s. This drops to 2159s, or a 70% reduction
> with this patchset.
> Non-temporal PMD init [1] drops this to 1045s.
>
> Nathan, what do you guys see with the non-temporal PMD patch [1]? Do
> add a sfence at the ende label if you manually patch.
...
> [1] https://lkml.org/lkml/2015/4/23/350

From that post:
> +loop_64:
> +	decq	%rcx
> +	movnti	%rax,(%rdi)
> +	movnti	%rax,8(%rdi)
> +	movnti	%rax,16(%rdi)
> +	movnti	%rax,24(%rdi)
> +	movnti	%rax,32(%rdi)
> +	movnti	%rax,40(%rdi)
> +	movnti	%rax,48(%rdi)
> +	movnti	%rax,56(%rdi)
> +	leaq	64(%rdi),%rdi
> +	jnz	loop_64

There are even more efficient instructions available in x86, depending
on the CPU features:
*  movnti          8 bytes
*  movntdq  %xmm  16 bytes  (SSE)
* vmovntdq  %ymm  32 bytes  (AVX)
* vmovntdq  %zmm  64 bytes  (AVX-512, forthcoming)

The last transfers a full 64-byte cache line per instruction.

For NVDIMMs, the nd pmem driver is also in need of memcpy functions
that use these non-temporal instructions, both for performance and
reliability. We also need to speed up __clear_page and
copy_user_enhanced_fast_string so userspace accesses through the page
cache can keep up. https://lkml.org/lkml/2015/4/2/453 is one of the
threads on that topic.

Some results I've gotten there under different cache attributes
(in terms of 4 KiB IOPS):

16-byte movntdq:
  UC write iops=697872   (697.872 K)  (0.697872 M)
  WB write iops=9745800  (9745.8 K)   (9.7458 M)
  WC write iops=9801800  (9801.8 K)   (9.8018 M)
  WT write iops=9812400  (9812.4 K)   (9.8124 M)

32-byte vmovntdq:
  UC write iops=1274400  (1274.4 K)   (1.2744 M)
  WB write iops=10259000 (10259 K)    (10.259 M)
  WC write iops=10286000 (10286 K)    (10.286 M)
  WT write iops=10294000 (10294 K)    (10.294 M)

---
Robert Elliott, HP Server Storage