From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753146AbbCaWMv (ORCPT ); Tue, 31 Mar 2015 18:12:51 -0400
Received: from g4t3427.houston.hp.com ([15.201.208.55]:57945 "EHLO
	g4t3427.houston.hp.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751638AbbCaWMu convert rfc822-to-8bit (ORCPT );
	Tue, 31 Mar 2015 18:12:50 -0400
From: "Elliott, Robert (Server Storage)"
To: Christoph Hellwig , "linux-nvdimm@ml01.01.org" ,
	"linux-fsdevel@vger.kernel.org" , "linux-kernel@vger.kernel.org" ,
	"x86@kernel.org"
CC: "ross.zwisler@linux.intel.com" , "axboe@kernel.dk" ,
	"boaz@plexistor.com" , "Kani, Toshimitsu"
Subject: RE: another pmem variant V2
Thread-Topic: another pmem variant V2
Thread-Index: AQHQZ5+JacS38fCB50eh5aSIiWDyPZ03JjBg
Date: Tue, 31 Mar 2015 22:11:29 +0000
Message-ID: <94D0CD8314A33A4D9D801C0FE68B40295A853392@G9W0745.americas.hpqcorp.net>
References: <1427358764-6126-1-git-send-email-hch@lst.de>
In-Reply-To: <1427358764-6126-1-git-send-email-hch@lst.de>
Accept-Language: en-US
Content-Language: en-US
X-MS-Has-Attach:
X-MS-TNEF-Correlator:
x-originating-ip: [16.210.48.26]
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 8BIT
MIME-Version: 1.0
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org
> [mailto:linux-kernel-owner@vger.kernel.org] On Behalf Of Christoph Hellwig
> Sent: Thursday, March 26, 2015 3:33 AM
> To: linux-nvdimm@ml01.01.org; linux-fsdevel@vger.kernel.org;
> linux-kernel@vger.kernel.org; x86@kernel.org
> Cc: ross.zwisler@linux.intel.com; axboe@kernel.dk; boaz@plexistor.com
> Subject: another pmem variant V2
>
> Here is another version of the same trivial pmem driver, because two
> obviously aren't enough.
> The first patch is the same pmem driver
> that Ross posted a short time ago, just modified to use platform_devices
> to find the persistent memory region instead of hardcoding it in the
> Kconfig.  This allows to keep pmem.c separate from any discovery mechanism,
> but still allow auto-discovery.
>
...
> This has been tested both with a real NVDIMM on a system with a type 12
> capable bios, as well as with "fake persistent" memory using the memmap=
> option.
>
> Changes since V1:
>   - s/E820_PROTECTED_KERN/E820_PMEM/g
>   - map the persistent memory as uncached
>   - better kernel parameter description
>   - various typo fixes
>   - MODULE_LICENSE fix

I used fio to test 4 KiB random read and write IOPS
on a 2-socket x86 DDR4 system.  With various cache attributes:

attr	read	write	notes
----	----	-----	-----
UC	37 K	21 K	ioremap_nocache
WB	3.6 M	2.5 M	ioremap
WC	764 K	3.7 M	ioremap_wc
WT			ioremap_wt

So, although UC and WT are the only modes certain to be safe,
the V2 default of UC provides abysmal performance (37 K read
and 21 K write IOPS) - worse than a consumer-class SATA SSD.

A solution for x86 is to use the MOVNTI instruction in WB
mode.  Its non-temporal hint sends stores through a buffer
like the write combining buffer, so they neither fill the
cache nor stop everything in the CPU.  The kernel function
__copy_from_user() uses that instruction (with SFENCE at
the end) - see arch/x86/lib/copy_user_nocache_64.S.

If I made the change from memcpy() to __copy_from_user()
correctly, that results in:

attr		read	write	notes
----		----	-----	-----
WB w/NTI	2.4 M	2.6 M	__copy_from_user()
WC w/NTI	3.2 M	2.1 M	__copy_from_user()

There is also a non-temporal streaming load hint instruction
called MOVNTDQA that might be helpful for reads for both
WB and WC.  I don't see any existing kernel memcpy-like
function that utilizes this instruction, so I haven't tried it
yet.
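The MOVNTI-plus-SFENCE pattern described above can be sketched in
user space with compiler intrinsics.  This is only an illustration
under stated assumptions, not the kernel code in
copy_user_nocache_64.S: nt_store_copy() is a hypothetical helper
name, the destination is assumed to be 8-byte aligned, the length a
multiple of 8, and non-x86-64 builds simply fall back to memcpy().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__)
#include <emmintrin.h>	/* _mm_stream_si64 (MOVNTI), _mm_sfence (SFENCE) */
#endif

/*
 * Hypothetical helper, not a kernel API: copy len bytes to dst with
 * non-temporal 8-byte stores, then fence.
 * Assumes dst is 8-byte aligned and len is a multiple of 8.
 */
static void nt_store_copy(void *dst, const void *src, size_t len)
{
	assert(((uintptr_t)dst % 8) == 0);
	assert(len % 8 == 0);
#if defined(__x86_64__)
	long long *d = dst;
	const unsigned char *s = src;
	size_t i;

	for (i = 0; i < len / 8; i++) {
		long long v;

		memcpy(&v, s + i * 8, sizeof(v));	/* ordinary load, any alignment */
		_mm_stream_si64(&d[i], v);		/* MOVNTI: store bypassing the cache */
	}
	_mm_sfence();	/* order the non-temporal stores before later stores */
#else
	memcpy(dst, src, len);	/* no MOVNTI available: plain copy */
#endif
}
```

The SFENCE after the loop mirrors the "with SFENCE at the end"
behaviour noted above; the fence orders the stores but says nothing
about whether the data has reached persistent media.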

Intel 64 and IA-32 Architectures Software Developer's Manual excerpts
(Jan 2015)
===================================
"The non-temporal move instructions (MOVNTI, MOVNTQ, MOVNTDQ,
MOVNTPS, and MOVNTPD) allow data to be moved from the
processor's registers directly into system memory without
being also written into the L1, L2, and/or L3 caches. These
instructions can be used to prevent cache pollution when
operating on data that is going to be modified only once
before being stored back into system memory.
...
MOVNTI
...
The non-temporal hint is implemented by using a write
combining (WC) memory type protocol when writing the data
to memory. Using this protocol, the processor does not
write the data into the cache hierarchy, nor does it fetch
the corresponding cache line from memory into the cache
hierarchy.
...
MOVNTDQA
Provides a non-temporal hint that can cause adjacent
16-byte items within an aligned 64-byte region (a streaming
line) to be fetched and held in a small set of temporary
buffers ("streaming load buffers"). Subsequent streaming
loads to other aligned 16-byte items in the same streaming
line may be supplied from the streaming load buffer and can
improve throughput.
...
A processor implementation may make use of the non-temporal
hint associated with this instruction if the memory source
is WC (write combining) memory type. An implementation may
also make use of the non-temporal hint associated with this
instruction if the memory source is WB (writeback) memory
type."
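
For the read side, a MOVNTDQA-based copy loop might look like the
following user-space sketch.  It is an untested idea rather than a
measured result: stream_load_copy() and stream_load_copy_sse41() are
hypothetical names, both buffers are assumed to be 16-byte aligned
with a length that is a multiple of 16, and GCC or clang on x86-64
is assumed for the target attribute and __builtin_cpu_supports().
Other builds, and CPUs without SSE4.1, fall back to memcpy().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>	/* _mm_stream_load_si128 (MOVNTDQA, SSE4.1) */
#define HAVE_STREAM_LOAD 1

/* Hypothetical helper: 16 bytes per iteration via streaming loads. */
__attribute__((target("sse4.1")))
static void stream_load_copy_sse41(void *dst, const void *src, size_t len)
{
	const __m128i *s = src;
	__m128i *d = dst;
	size_t i;

	for (i = 0; i < len / 16; i++) {
		/* MOVNTDQA: load with a non-temporal hint */
		__m128i v = _mm_stream_load_si128((__m128i *)&s[i]);

		_mm_store_si128(&d[i], v);	/* ordinary aligned store */
	}
}
#endif

/*
 * Hypothetical wrapper, not a kernel API.
 * Assumes src and dst are 16-byte aligned and len is a multiple of 16.
 */
static void stream_load_copy(void *dst, const void *src, size_t len)
{
	assert(((uintptr_t)src % 16) == 0);
	assert(((uintptr_t)dst % 16) == 0);
	assert(len % 16 == 0);
#ifdef HAVE_STREAM_LOAD
	if (__builtin_cpu_supports("sse4.1")) {
		stream_load_copy_sse41(dst, src, len);
		return;
	}
#endif
	memcpy(dst, src, len);	/* fallback when MOVNTDQA is unavailable */
}
```

Per the excerpt above, an implementation "may" honor the hint for
WC or WB sources, so on ordinary WB memory this loop may behave
like a regular aligned load loop.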