All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: another pmem variant V2
@ 2015-03-26 18:38 Christoph Hellwig
  0 siblings, 0 replies; 23+ messages in thread
From: Christoph Hellwig @ 2015-03-26 18:38 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Christoph Hellwig, linux-nvdimm, linux-fsdevel, linux-kernel,
	x86, ross.zwisler, axboe

And here's the patch, sorry:


diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 4bd525a..e7bf89e 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -346,7 +346,7 @@ int __init sanitize_e820_map(struct e820entry *biosmap, int max_nr_map,
 		 * continue building up new bios map based on this
 		 * information
 		 */
-		if (current_type != last_type)	{
+		if (current_type != last_type || current_type == E820_PRAM) {
 			if (last_type != 0)	 {
 				new_bios[new_bios_entry].size =
 					change_point[chgidx]->addr - last_addr;

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: another pmem variant V2
  2015-04-02 16:41       ` Christoph Hellwig
@ 2015-04-02 18:03           ` Ingo Molnar
  0 siblings, 0 replies; 23+ messages in thread
From: Ingo Molnar @ 2015-04-02 18:03 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Elliott, Robert (Server Storage),
	linux-nvdimm, linux-fsdevel, linux-kernel, x86, ross.zwisler,
	axboe, boaz, Kani, Toshimitsu


* Christoph Hellwig <hch@lst.de> wrote:

> On Thu, Apr 02, 2015 at 03:11:36PM +0000, Elliott, Robert (Server Storage) wrote:
> > Attr	Copy		Read IOPS		Write IOPS
> > ====	====		=========		==========
> > UC	memcpy		36 K			22 K
> > UC	NT rd,wr	513 K			326 K
> > 
> > WB	memcpy		3.4 M			2.5 M
> > WB	NT rd,wr	3.3 M			3.5 M
> > 
> > WC	memcpy		776 K			3.5 M
> > WC	NT rd,wr	3.0 M			3.9 M
> > 
> > WT	memcpy		2.1 M			22 K
> > WT	NT rd,wr	3.3 M			2.1 M
> > 
> > a few other variations yielded the peak numbers:
> > WC	NT rd only	3.2 M			4.1 M
> > WC	NT wr only	712 K			4.6 M
> > WT	NT wr only	2.6 M			4.0 M
> > 
> > There are lots of tuning considerations for those memcpy 
> > functions - how far to unroll the loop, whether to
> > include PRFETCHNTA instructions, etc.
> 
> Looks like we should aіm for WC + NT would be a good start.
> 
> Can you prepare a patch to add your NT memcpy variants and
> a second one to use them in the pmem driver?

So we already have NT memcpy variants, see the copy_*_user_*_nocache() 
primitives in arch/x86/. They could be used almost straight away for 
kernel memory as well, as kernel buffers will not fault.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: another pmem variant V2
@ 2015-04-02 18:03           ` Ingo Molnar
  0 siblings, 0 replies; 23+ messages in thread
From: Ingo Molnar @ 2015-04-02 18:03 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Elliott, Robert (Server Storage),
	linux-nvdimm, linux-fsdevel, linux-kernel, x86, ross.zwisler,
	axboe, boaz, Kani, Toshimitsu


* Christoph Hellwig <hch@lst.de> wrote:

> On Thu, Apr 02, 2015 at 03:11:36PM +0000, Elliott, Robert (Server Storage) wrote:
> > Attr	Copy		Read IOPS		Write IOPS
> > ====	====		=========		==========
> > UC	memcpy		36 K			22 K
> > UC	NT rd,wr	513 K			326 K
> > 
> > WB	memcpy		3.4 M			2.5 M
> > WB	NT rd,wr	3.3 M			3.5 M
> > 
> > WC	memcpy		776 K			3.5 M
> > WC	NT rd,wr	3.0 M			3.9 M
> > 
> > WT	memcpy		2.1 M			22 K
> > WT	NT rd,wr	3.3 M			2.1 M
> > 
> > a few other variations yielded the peak numbers:
> > WC	NT rd only	3.2 M			4.1 M
> > WC	NT wr only	712 K			4.6 M
> > WT	NT wr only	2.6 M			4.0 M
> > 
> > There are lots of tuning considerations for those memcpy 
> > functions - how far to unroll the loop, whether to
> > include PRFETCHNTA instructions, etc.
> 
> Looks like we should aіm for WC + NT would be a good start.
> 
> Can you prepare a patch to add your NT memcpy variants and
> a second one to use them in the pmem driver?

So we already have NT memcpy variants, see the copy_*_user_*_nocache() 
primitives in arch/x86/. They could be used almost straight away for 
kernel memory as well, as kernel buffers will not fault.

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: another pmem variant V2
  2015-04-02 15:11     ` Elliott, Robert (Server Storage)
@ 2015-04-02 16:41       ` Christoph Hellwig
  2015-04-02 18:03           ` Ingo Molnar
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2015-04-02 16:41 UTC (permalink / raw)
  To: Elliott, Robert (Server Storage)
  Cc: Christoph Hellwig, linux-nvdimm, linux-fsdevel, linux-kernel,
	x86, ross.zwisler, axboe, boaz, Kani, Toshimitsu

On Thu, Apr 02, 2015 at 03:11:36PM +0000, Elliott, Robert (Server Storage) wrote:
> Attr	Copy		Read IOPS		Write IOPS
> ====	====		=========		==========
> UC	memcpy		36 K			22 K
> UC	NT rd,wr	513 K			326 K
> 
> WB	memcpy		3.4 M			2.5 M
> WB	NT rd,wr	3.3 M			3.5 M
> 
> WC	memcpy		776 K			3.5 M
> WC	NT rd,wr	3.0 M			3.9 M
> 
> WT	memcpy		2.1 M			22 K
> WT	NT rd,wr	3.3 M			2.1 M
> 
> a few other variations yielded the peak numbers:
> WC	NT rd only	3.2 M			4.1 M
> WC	NT wr only	712 K			4.6 M
> WT	NT wr only	2.6 M			4.0 M
> 
> There are lots of tuning considerations for those memcpy 
> functions - how far to unroll the loop, whether to
> include PRFETCHNTA instructions, etc.

Looks like we should aіm for WC + NT would be a good start.

Can you prepare a patch to add your NT memcpy variants and
a second one to use them in the pmem driver?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: another pmem variant V2
  2015-04-01  7:26   ` Christoph Hellwig
@ 2015-04-02 15:11     ` Elliott, Robert (Server Storage)
  2015-04-02 16:41       ` Christoph Hellwig
  0 siblings, 1 reply; 23+ messages in thread
From: Elliott, Robert (Server Storage) @ 2015-04-02 15:11 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvdimm, linux-fsdevel, linux-kernel, x86, ross.zwisler,
	axboe, boaz, Kani, Toshimitsu



> -----Original Message-----
> From: Christoph Hellwig [mailto:hch@lst.de]
> Sent: Wednesday, April 1, 2015 2:26 AM
> To: Elliott, Robert (Server Storage)
> Cc: Christoph Hellwig; linux-nvdimm@ml01.01.org; linux-
> fsdevel@vger.kernel.org; linux-kernel@vger.kernel.org; x86@kernel.org;
> ross.zwisler@linux.intel.com; axboe@kernel.dk; boaz@plexistor.com; Kani,
> Toshimitsu
> Subject: Re: another pmem variant V2
> 
> On Tue, Mar 31, 2015 at 10:11:29PM +0000, Elliott, Robert (Server Storage)
> wrote:
> > I used fio to test 4 KiB random read and write IOPS
> > on a 2-socket x86 DDR4 system.  With various cache attributes:
> >
> > attr	read		write		notes
> > ----	----		-----		-----
> > UC	37 K		21 K		ioremap_nocache
> > WB	3.6 M		2.5 M		ioremap
> > WC	764 K		3.7 M		ioremap_wc
> > WT	<not tested yet>		ioremap_wt
> >
> > So, although UC and WT are the only modes certain to be safe,
> > the V1 default of UC provides abysmal performance - worse than
> > a consumer-class SATA SSD.
> 
> It doesn't look quite as bad on my setup, but performance is fairly
> bad here as well.
> 
> > A solution for x86 is to use the MOVNTI instruction in WB
> > mode. This non-temporal hint uses a buffer like the write
> > combining buffer, not filling the cache and not stopping
> > everything in the CPU.  The kernel function __copy_from_user()
> > uses that instruction (with SFENCE at the end) - see
> > arch/x86/lib/copy_user_nocache_64.S.
> >
> > If I made the change from memcpy() to __copy_from_user()
> > correctly, that results in:
> >
> > attr		read		write		notes
> > ----		----		-----		-----
> > WB w/NTI	2.4 M		2.6 M		__copy_from_user()
> > WC w/NTI	3.2 M		2.1 M		__copy_from_user()
> 
> That looks a lot better.  It doesn't help us with a pmem device
> mapped directly into userspace using mmap with the DAX infrastructure,
> though.
> 
> Note when we want to move to non-temporal copies we'll need to add
> a new prototype, as __copy_from_user isn't guaranteed to use these,
> and it is defined to only work on user addresses.  That doesn't matter
> on x86 but would blow up on say sparc or s390.

Here are some updated numbers including:
* WT (writethrough) cache attribute
* memcpy that uses non-temporal stores (MOVNTDQ) to the 
  persistent memory for block writes (rather than MOVNTI)
* memcpy that uses non-temporal loads (MOVNTDQA) from the 
  persistent memory for block reads

Attr	Copy		Read IOPS		Write IOPS
====	====		=========		==========
UC	memcpy		36 K			22 K
UC	NT rd,wr	513 K			326 K

WB	memcpy		3.4 M			2.5 M
WB	NT rd,wr	3.3 M			3.5 M

WC	memcpy		776 K			3.5 M
WC	NT rd,wr	3.0 M			3.9 M

WT	memcpy		2.1 M			22 K
WT	NT rd,wr	3.3 M			2.1 M

a few other variations yielded the peak numbers:
WC	NT rd only	3.2 M			4.1 M
WC	NT wr only	712 K			4.6 M
WT	NT wr only	2.6 M			4.0 M

There are lots of tuning considerations for those memcpy 
functions - how far to unroll the loop, whether to
include PRFETCHNTA instructions, etc.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: another pmem variant V2
  2015-04-01 19:33 ` Elliott, Robert (Server Storage)
@ 2015-04-02  9:37   ` Christoph Hellwig
  0 siblings, 0 replies; 23+ messages in thread
From: Christoph Hellwig @ 2015-04-02  9:37 UTC (permalink / raw)
  To: Elliott, Robert (Server Storage)
  Cc: Christoph Hellwig, linux-nvdimm, linux-fsdevel, linux-kernel,
	x86, ross.zwisler, axboe, boaz, Kani, Toshimitsu

On Wed, Apr 01, 2015 at 07:33:38PM +0000, Elliott, Robert (Server Storage) wrote:
> I triggered a paging error in the memcpy call for a block read
> from system-udevd (actually in a modified memcpy() for the cache
> attribute experiments).  
> 
> 1. This triggered an illegal schedule() call from an atomic context.
> The call trace is shown below.
> 
> 2. memcpy() doesn't provide exception handling or error reporting.
> Some functions like do so, like __copy_user_nocache in
> arch/x85/lib/copy_user_nocache_64.S.  
> 
> Should pmem only use functions that do so, if available on the
> architecture?

We'll need to define an interface for the function to use if it isn't
plain memcpy, which would include that detail.

But I can't see how that memcpy should ever have to handle a page fault,
I'd be curious how your reproduces this issue.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: another pmem variant V2
  2015-03-26  8:32 Christoph Hellwig
  2015-03-26 16:57 ` Boaz Harrosh
  2015-03-31 22:11 ` Elliott, Robert (Server Storage)
@ 2015-04-01 19:33 ` Elliott, Robert (Server Storage)
  2015-04-02  9:37   ` Christoph Hellwig
  2 siblings, 1 reply; 23+ messages in thread
From: Elliott, Robert (Server Storage) @ 2015-04-01 19:33 UTC (permalink / raw)
  To: Christoph Hellwig, linux-nvdimm, linux-fsdevel, linux-kernel, x86
  Cc: ross.zwisler, axboe, boaz, Kani, Toshimitsu

> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-
> owner@vger.kernel.org] On Behalf Of Christoph Hellwig
> Sent: Thursday, March 26, 2015 3:33 AM
> To: linux-nvdimm@ml01.01.org; linux-fsdevel@vger.kernel.org; linux-
> kernel@vger.kernel.org; x86@kernel.org
> Cc: ross.zwisler@linux.intel.com; axboe@kernel.dk; boaz@plexistor.com
> Subject: another pmem variant V2
> 

I triggered a paging error in the memcpy call for a block read
from system-udevd (actually in a modified memcpy() for the cache
attribute experiments).  

1. This triggered an illegal schedule() call from an atomic context.
The call trace is shown below.

2. memcpy() doesn't provide exception handling or error reporting.
Some functions like do so, like __copy_user_nocache in
arch/x85/lib/copy_user_nocache_64.S.  

Should pmem only use functions that do so, if available on the
architecture?

pmem_rw_page can pass along the return value from the copy function.
pmem_make_request can report the error, if any, via bio_endio.


Call trace
==========
[62117.317216] BUG: scheduling while atomic: systemd-udevd/22135/0x00000001
[62117.317232] Modules linked in: pmem ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables sg vfat fat x86_pkg_temp_thermal coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd xhci_pci hpilo xhci_hcd sb_edac edac_core microcode iTCO_wdt iTCO_vendor_support hpwdt ioatdma shpchp pcspkr lpc_ich mfd_core i2c_i801 wmi pcc_cpufreq dca acpi_cpufreq uinput nfsd auth_rpcgss nfs_acl lockd grace sunrpc xfs exportfs sr_mod cdrom sd_mod bnx2x tg3 ahci mdio libahci ptp pps_core hpsa libcrc32c dm_mirror dm_region_hash dm_log dm_mod ipv6 autofs4 [last unloaded: pmem]
[62117.317233] CPU: 31 PID: 22135 Comm: systemd-udevd Tainted: G      D         4.0.0-rc6+ #7
[62117.317234] Hardware name: HP ProLiant DL380 Gen9
[62117.317235]  ffff88047f3f3ac0 ffff8804241db2e8 ffffffff815a8866 00000000ff86ff86
[62117.317236]  ffff8804241dbfd8 ffff8804241db2f8 ffffffff815a4b45 ffff8804241db348
[62117.317237]  ffffffff815ab893 ffff880457091050 ffff88047f3fbb20 0000000000000000
[62117.317237] Call Trace:
[62117.317240]  [<ffffffff815a8866>] dump_stack+0x45/0x57
[62117.317245]  [<ffffffff815a4b45>] __schedule_bug+0x46/0x54
[62117.317247]  [<ffffffff815ab893>] __schedule+0x793/0x870
[62117.317251]  [<ffffffff815ac0f0>] ? bit_wait+0x50/0x50
[62117.317252]  [<ffffffff815ab9a7>] schedule+0x37/0x90
[62117.317253]  [<ffffffff815ae0cc>] schedule_timeout+0x1dc/0x260
[62117.317258]  [<ffffffff810bab5e>] ? ktime_get+0x3e/0xa0
[62117.317259]  [<ffffffff815ab06c>] io_schedule_timeout+0xac/0x140
[62117.317261]  [<ffffffff815ac126>] bit_wait_io+0x36/0x50
[62117.317262]  [<ffffffff815abeeb>] __wait_on_bit_lock+0x4b/0xb0
[62117.317263]  [<ffffffff81136f62>] ? find_get_entries+0xe2/0x130
[62117.317265]  [<ffffffff811342ec>] __lock_page+0xac/0xb0
[62117.317269]  [<ffffffff81090830>] ? autoremove_wake_function+0x40/0x40
[62117.317276]  [<ffffffff8114322f>] truncate_inode_pages_range+0x3af/0x620
[62117.317278]  [<ffffffff8128e837>] ? cpumask_next_and+0x37/0x50
[62117.317279]  [<ffffffff811c6d80>] ? __brelse+0x40/0x40
[62117.317283]  [<ffffffff810c8add>] ? smp_call_function_many+0x5d/0x280
[62117.317284]  [<ffffffff8128e929>] ? free_cpumask_var+0x9/0x10
[62117.317285]  [<ffffffff810c8ebd>] ? on_each_cpu_cond+0xbd/0x160
[62117.317286]  [<ffffffff811c6d80>] ? __brelse+0x40/0x40
[62117.317288]  [<ffffffff811434b5>] truncate_inode_pages+0x15/0x20
[62117.317289]  [<ffffffff811cab13>] kill_bdev+0x33/0x40
[62117.317291]  [<ffffffff811cbfa8>] __blkdev_put+0x68/0x210
[62117.317293]  [<ffffffff811cca20>] blkdev_put+0x50/0x130
[62117.317294]  [<ffffffff811ccbb5>] blkdev_close+0x25/0x30
[62117.317296]  [<ffffffff811969f7>] __fput+0xe7/0x220
[62117.317298]  [<ffffffff81196b7e>] ____fput+0xe/0x10
[62117.317302]  [<ffffffff8106c554>] task_work_run+0xc4/0xe0
[62117.317306]  [<ffffffff810533f8>] do_exit+0x2d8/0xb10
[62117.317308]  [<ffffffff810a4b6c>] ? kmsg_dump+0x9c/0xc0
[62117.317312]  [<ffffffff8100634e>] oops_end+0x8e/0xd0
[62117.317313]  [<ffffffff815a428f>] no_context+0x2d4/0x334
[62117.317314]  [<ffffffff815a435c>] __bad_area_nosemaphore+0x6d/0x1c6
[62117.317317]  [<ffffffff811506c0>] ? zone_statistics+0x80/0xa0
[62117.317319]  [<ffffffff815a44c8>] bad_area_nosemaphore+0x13/0x15
[62117.317321]  [<ffffffff81043ea1>] __do_page_fault+0x91/0x430
[62117.317322]  [<ffffffff8104424c>] do_page_fault+0xc/0x10
[62117.317324]  [<ffffffff815b0a62>] page_fault+0x22/0x30
[62117.317325]  [<ffffffffa0078302>] ? pmem_do_bvec.isra.6+0x212/0x3f0 [pmem]
[62117.317326]  [<ffffffffa0078523>] pmem_rw_page+0x43/0x60 [pmem]
[62117.317328]  [<ffffffff81293148>] ? __radix_tree_preload+0x38/0xa0
[62117.317329]  [<ffffffff811ca9de>] bdev_read_page+0x2e/0x40
[62117.317330]  [<ffffffff811d131f>] do_mpage_readpage+0x51f/0x6c0
[62117.317331]  [<ffffffff8114211e>] ? lru_cache_add+0xe/0x10
[62117.317332]  [<ffffffff811d159b>] mpage_readpages+0xdb/0x130
[62117.317333]  [<ffffffff811ca990>] ? I_BDEV+0x10/0x10
[62117.317334]  [<ffffffff811ca990>] ? I_BDEV+0x10/0x10
[62117.317336]  [<ffffffff811cb14d>] blkdev_readpages+0x1d/0x20
[62117.317336]  [<ffffffff811404d4>] __do_page_cache_readahead+0x194/0x210
[62117.317337]  [<ffffffff811408e5>] force_page_cache_readahead+0x75/0xb0
[62117.317338]  [<ffffffff81140963>] page_cache_sync_readahead+0x43/0x50
[62117.317339]  [<ffffffff81136161>] generic_file_read_iter+0x431/0x630
[62117.317341]  [<ffffffff811cb4e7>] blkdev_read_iter+0x37/0x40
[62117.317342]  [<ffffffff8119466e>] new_sync_read+0x7e/0xb0
[62117.317343]  [<ffffffff81195838>] __vfs_read+0x18/0x50
[62117.317344]  [<ffffffff811958f6>] vfs_read+0x86/0x140
[62117.317345]  [<ffffffff811959f6>] SyS_read+0x46/0xb0
[62117.317346]  [<ffffffff810eac94>] ? __audit_syscall_entry+0xb4/0x110
[62117.317348]  [<ffffffff815aeff2>] system_call_fastpath+0x12/0x17
[62121.618505] note: systemd-udevd[22133] exited with preempt_count 1



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: another pmem variant V2
  2015-03-31 16:16       ` Christoph Hellwig
  2015-03-31 16:44         ` Ingo Molnar
@ 2015-04-01 12:49         ` Boaz Harrosh
  1 sibling, 0 replies; 23+ messages in thread
From: Boaz Harrosh @ 2015-04-01 12:49 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvdimm, linux-fsdevel, linux-kernel, x86, ross.zwisler, axboe

On 03/31/2015 07:16 PM, Christoph Hellwig wrote:
> On Tue, Mar 31, 2015 at 06:14:15PM +0300, Boaz Harrosh wrote:
>> We can not accept it as is right now.
> 
> Who is we?
> 
>> We have conducted farther tests. And it messes up NUMA.
> 
> Only you if you use the memmap option in weird ways.
> 

No weird ways just the single range. I would be very happy if
you can teach me the proper way, believe me I'm trying for 2 days.

> Sounds like I should simply remove the memmap= option so people don't
> abuse it.  The main point is to parse the e820 tables, which works fine.
> 

What abuse? the single range and the problem shows up.

So you are just pasting over the problem sir. The patch that you
are submitting has grave problems and covering it up with not
allowing memmap=! will not fix it.

The Kernel as is after your patch does not like this half baked
beast as we defined it, the defined pmem-memory-range messes
things up.

> And those people having fake pmem, or pcie cards that they are too lazy
> to submit proper drivers for can stick to their out of tree hacks?
> 

I have now conducted more tests on real type-12 DDR3 system and
the exact same problem as I reported exists with real type-12
chips!
And who are we kidding? the "memmap=!" yes or no, makes no
difference at all. All it does is edit the table as if it was
the table the BIOS gave us. There is no extra processing done
on memmap=.

Your e820 patch trashes NUMA.

But I fix it for good this time here is the fix below.
After I apply below patch every thing boots and work just as
expected. All the problems I reported disappear. Any configuration,
any number of ranges, cross NUMA or not, just works, exactly as
before with my patches.

The fix is over your V2, I will post one later that fixes
your V3 and adds back the memmap=!

---
If you inspect my fix below You will see that what happened is
that the original patch was too aggressive in making pmem
look like ram. In fact it started the ARCH side memory
initialization and was only skipping the generic initialization
of memory. This messed up internal real-memory structures.

With this fix below block/drivers/pmem.ko loads just fine
finds its resources and maps them, and everything just
works. including any "abuse" to memmap=! and any NUMA
configurations. Both real HW, type-12 HW and NUMA Vms.
(Preliminary testing we are conducting the full test
 rig as we speak)

(I'm still going over it, I might send some more cleaning)

Cheers

---
git diff --stat -p -M HEAD
 arch/x86/kernel/e820.c  |  7 ++++---
 arch/x86/kernel/pmem.c  | 17 -----------------
 arch/x86/kernel/setup.c |  2 --
 3 files changed, 4 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 07246e5..dce0e84 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -963,9 +963,10 @@ void __init e820_reserve_resources(void)
 		 * pci device BAR resource and insert them later in
 		 * pcibios_resource_survey()
 		 */
-		if (e820.map[i].type != E820_RESERVED || res->start < (1ULL<<20)) {
-			if (e820.map[i].type != E820_PRAM)
-				res->flags |= IORESOURCE_BUSY;
+		if (((e820.map[i].type != E820_RESERVED) &&
+		     (e820.map[i].type != E820_PRAM)) ||
+		     res->start < (1ULL<<20)) {
+			res->flags |= IORESOURCE_BUSY;
 			insert_resource(&iomem_resource, res);
 		}
 		res++;
diff --git a/arch/x86/kernel/pmem.c b/arch/x86/kernel/pmem.c
index f970048..fcdbc20 100644
--- a/arch/x86/kernel/pmem.c
+++ b/arch/x86/kernel/pmem.c
@@ -9,23 +9,6 @@
 #include <asm/page_types.h>
 #include <asm/setup.h>
 
-void __init reserve_pmem(void)
-{
-	int i;
-
-	for (i = 0; i < e820.nr_map; i++) {
-		struct e820entry *ei = &e820.map[i];
-
-		if (ei->type != E820_PRAM)
-			continue;
-
-		memblock_reserve(ei->addr, ei->addr + ei->size);
-		max_pfn_mapped = init_memory_mapping(
-				ei->addr < 1UL << 32 ? 1UL << 32 : ei->addr,
-				ei->addr + ei->size);
-	}
-}
-
 static __init void register_pmem_device(struct resource *res)
 {
 	struct platform_device *pdev;
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index f2bed2b..0a2421c 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1158,8 +1158,6 @@ void __init setup_arch(char **cmdline_p)
 
 	early_acpi_boot_init();
 
-	reserve_pmem();
-
 	initmem_init();
 	dma_contiguous_reserve(max_pfn_mapped << PAGE_SHIFT);
 


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: another pmem variant V2
  2015-03-31 22:11 ` Elliott, Robert (Server Storage)
@ 2015-04-01  7:26   ` Christoph Hellwig
  2015-04-02 15:11     ` Elliott, Robert (Server Storage)
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2015-04-01  7:26 UTC (permalink / raw)
  To: Elliott, Robert (Server Storage)
  Cc: Christoph Hellwig, linux-nvdimm, linux-fsdevel, linux-kernel,
	x86, ross.zwisler, axboe, boaz, Kani, Toshimitsu

On Tue, Mar 31, 2015 at 10:11:29PM +0000, Elliott, Robert (Server Storage) wrote:
> I used fio to test 4 KiB random read and write IOPS 
> on a 2-socket x86 DDR4 system.  With various cache attributes:
> 
> attr	read		write		notes
> ----	----		-----		-----
> UC	37 K		21 K		ioremap_nocache
> WB	3.6 M		2.5 M		ioremap
> WC	764 K		3.7 M		ioremap_wc
> WT	<not tested yet>		ioremap_wt
> 
> So, although UC and WT are the only modes certain to be safe,
> the V1 default of UC provides abysmal performance - worse than
> a consumer-class SATA SSD.

It doesn't look quite as bad on my setup, but performance is fairly
bad here as well.

> A solution for x86 is to use the MOVNTI instruction in WB
> mode. This non-temporal hint uses a buffer like the write
> combining buffer, not filling the cache and not stopping
> everything in the CPU.  The kernel function __copy_from_user() 
> uses that instruction (with SFENCE at the end) - see
> arch/x86/lib/copy_user_nocache_64.S.
> 
> If I made the change from memcpy() to __copy_from_user()
> correctly, that results in:
> 
> attr		read		write		notes
> ----		----		-----		-----
> WB w/NTI	2.4 M		2.6 M		__copy_from_user()
> WC w/NTI	3.2 M		2.1 M		__copy_from_user()

That looks a lot better.  It doesn't help us with a pmem device
mapped directly into userspace using mmap with the DAX infrastructure,
though.

Note when we want to move to non-temporal copies we'll need to add
a new prototype, as __copy_from_user isn't guaranteed to use these,
and it is defined to only work on user addresses.  That doesn't matter
on x86 but would blow up on say sparc or s390.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: another pmem variant V2
  2015-03-26  8:32 Christoph Hellwig
  2015-03-26 16:57 ` Boaz Harrosh
@ 2015-03-31 22:11 ` Elliott, Robert (Server Storage)
  2015-04-01  7:26   ` Christoph Hellwig
  2015-04-01 19:33 ` Elliott, Robert (Server Storage)
  2 siblings, 1 reply; 23+ messages in thread
From: Elliott, Robert (Server Storage) @ 2015-03-31 22:11 UTC (permalink / raw)
  To: Christoph Hellwig, linux-nvdimm, linux-fsdevel, linux-kernel, x86
  Cc: ross.zwisler, axboe, boaz, Kani, Toshimitsu



> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-
> owner@vger.kernel.org] On Behalf Of Christoph Hellwig
> Sent: Thursday, March 26, 2015 3:33 AM
> To: linux-nvdimm@ml01.01.org; linux-fsdevel@vger.kernel.org; linux-
> kernel@vger.kernel.org; x86@kernel.org
> Cc: ross.zwisler@linux.intel.com; axboe@kernel.dk; boaz@plexistor.com
> Subject: another pmem variant V2
> 
> Here is another version of the same trivial pmem driver, because two
> obviously aren't enough.  The first patch is the same pmem driver
> that Ross posted a short time ago, just modified to use platform_devices
> to find the persistant memory region instead of hardconding it in the
> Kconfig.  This allows to keep pmem.c separate from any discovery mechanism,
> but still allow auto-discovery.
> 
...
> This has been tested both with a real NVDIMM on a system with a type 12
> capable bios, as well as with "fake persistent" memory using the memmap=
> option.
> 
> Changes since V1:
>   - s/E820_PROTECTED_KERN/E820_PMEM/g
>   - map the persistent memory as uncached
>   - better kernel parameter description
>   - various typo fixes
>   - MODULE_LICENSE fix

I used fio to test 4 KiB random read and write IOPS 
on a 2-socket x86 DDR4 system.  With various cache attributes:

attr	read		write		notes
----	----		-----		-----
UC	37 K		21 K		ioremap_nocache
WB	3.6 M		2.5 M		ioremap
WC	764 K		3.7 M		ioremap_wc
WT	<not tested yet>		ioremap_wt

So, although UC and WT are the only modes certain to be safe,
the V1 default of UC provides abysmal performance - worse than
a consumer-class SATA SSD.

A solution for x86 is to use the MOVNTI instruction in WB
mode. This non-temporal hint uses a buffer like the write
combining buffer, not filling the cache and not stopping
everything in the CPU.  The kernel function __copy_from_user() 
uses that instruction (with SFENCE at the end) - see
arch/x86/lib/copy_user_nocache_64.S.

If I made the change from memcpy() to __copy_from_user()
correctly, that results in:

attr		read		write		notes
----		----		-----		-----
WB w/NTI	2.4 M		2.6 M		__copy_from_user()
WC w/NTI	3.2 M		2.1 M		__copy_from_user()

There is also a non-temporal streaming load hint instruction
called MOVNTDQA that might be helpful for reads for both WB
and WC. I don't see any existing kernel memcpy-like function 
that utilizes this instruction, so haven't tried it yet.


Intel64 and IA-32 Architectures 
Software Developers Manual excerpts (Jan 2015)
===================================
"The non-temporal move instructions (MOVNTI, MOVNTQ, MOVNTDQ,
MOVNTPS, and MOVNTPD) allow data to be moved from the
processor's registers directly into system memory without
being also written into the L1, L2, and/or L3 caches. These
instructions can be used to prevent cache pollution when
operating on data that is going to be modified only once
before being stored back into system memory. ...

MOVNTI
...
The non-temporal hint is implemented by using a write
combining (WC) memory type protocol when writing the
data to memory. Using this protocol, the processor
does not write the data into the cache hierarchy,
nor does it fetch the corresponding cache line from
memory into the cache hierarchy.
...

MOVNTDQA Provides a non-temporal hint that can cause
adjacent 16-byte items within an aligned 64-byte region
(a streaming line) to be fetched and held in a small
set of temporary buffers ("streaming load buffers"). 
Subsequent streaming loads to other aligned 16-byte 
items in the same streaming line may be supplied from
the streaming load buffer and can improve throughput.
...
A processor implementation may make use of the 
non-temporal hint associated with this instruction if
the memory source is WC (write combining) memory type. 
An implementation may also make use of the non-temporal
hint associated with this instruction if the memory
source is WB (writeback) memory type."



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: another pmem variant V2
  2015-03-31 16:44         ` Ingo Molnar
@ 2015-03-31 17:24           ` Christoph Hellwig
  0 siblings, 0 replies; 23+ messages in thread
From: Christoph Hellwig @ 2015-03-31 17:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Christoph Hellwig, Boaz Harrosh, linux-nvdimm, linux-fsdevel,
	linux-kernel, x86, ross.zwisler, axboe

On Tue, Mar 31, 2015 at 06:44:56PM +0200, Ingo Molnar wrote:
> I'd be fine with that too - mind sending an updated series?

I will send an updated one tonight or early tomorrow.

Btw, do you want to keep the E820_PRAM name instead of E820_PMEM?
Seems like most people either don't care or prefer E820_PMEM. I'm
fine either way.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: another pmem variant V2
  2015-03-31 16:16       ` Christoph Hellwig
@ 2015-03-31 16:44         ` Ingo Molnar
  2015-03-31 17:24           ` Christoph Hellwig
  2015-04-01 12:49         ` Boaz Harrosh
  1 sibling, 1 reply; 23+ messages in thread
From: Ingo Molnar @ 2015-03-31 16:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Boaz Harrosh, linux-nvdimm, linux-fsdevel, linux-kernel, x86,
	ross.zwisler, axboe


* Christoph Hellwig <hch@lst.de> wrote:

> On Tue, Mar 31, 2015 at 06:14:15PM +0300, Boaz Harrosh wrote:
> > We can not accept it as is right now.
> 
> Who is we?
> 
> > We have conducted farther tests. And it messes up NUMA.
> 
> Only you if you use the memmap option in weird ways.
> 
> Sounds like I should simply remove the memmap= option so people don't
> abuse it.  The main point is to parse the e820 tables, which works fine.

I'd be fine with that too - mind sending an updated series?

the memmap= option can be added on top of that.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: another pmem variant V2
  2015-03-31 15:14     ` Boaz Harrosh
@ 2015-03-31 16:16       ` Christoph Hellwig
  2015-03-31 16:44         ` Ingo Molnar
  2015-04-01 12:49         ` Boaz Harrosh
  0 siblings, 2 replies; 23+ messages in thread
From: Christoph Hellwig @ 2015-03-31 16:16 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Christoph Hellwig, linux-nvdimm, linux-fsdevel, linux-kernel,
	x86, ross.zwisler, axboe

On Tue, Mar 31, 2015 at 06:14:15PM +0300, Boaz Harrosh wrote:
> We can not accept it as is right now.

Who is we?

> We have conducted farther tests. And it messes up NUMA.

Only you if you use the memmap option in weird ways.

Sounds like I should simply remove the memmap= option so people don't
abuse it.  The main point is to parse the e820 tables, which works fine.

And those people having fake pmem, or pcie cards that they are too lazy
to submit proper drivers for can stick to their out of tree hacks?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: another pmem variant V2
  2015-03-31 10:25     ` Boaz Harrosh
  2015-03-31 10:31       ` Boaz Harrosh
@ 2015-03-31 16:08       ` Christoph Hellwig
  1 sibling, 0 replies; 23+ messages in thread
From: Christoph Hellwig @ 2015-03-31 16:08 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Christoph Hellwig, linux-nvdimm, linux-fsdevel, linux-kernel,
	x86, ross.zwisler, axboe

On Tue, Mar 31, 2015 at 01:25:46PM +0300, Boaz Harrosh wrote:
> The problem I see is that if I state a memmap=nn!aa that crosses a NUMA
> boundary then the machine will not boot.
> So BTW for sure I need that "don't merge E820_PMEM ranges" patch because
> otherwise I will not be able to boot if I have pmem on both NUMA nodes
> and they happen to be contiguous.

Ok.

> Regarding the SQUASHMEs to PMEM. Originally I had them as 3-4 patches.
> But I thought since you are squashing them into a single submitted patch
> I can just send just the one patch. Tell me what you prefer and I'll
> resend (The one vs the three)

The slpit is mostly to get a good description for each change.

> And one last issue. I have some configuration "hardness" with the
> memmap=nn!aa Kernel command line API, it was better for me with the
> pmem map= module param. Will you be OK if I split pmem_probe() into
> calling pmem_alloc(addr, length), so I can keep an out-of-tree patch
> that adds the map= parameter to pmem?

Can't your out of tree patch do that as well?  In fact you might want to
rewrite it to a module that allows your to create/destroy the platform_devices
you use.  And for your PCIe case I'd really prefer a proper in-tree PCI
driver for it.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: another pmem variant V2
  2015-03-31  9:25   ` Christoph Hellwig
  2015-03-31 10:25     ` Boaz Harrosh
@ 2015-03-31 15:14     ` Boaz Harrosh
  2015-03-31 16:16       ` Christoph Hellwig
  1 sibling, 1 reply; 23+ messages in thread
From: Boaz Harrosh @ 2015-03-31 15:14 UTC (permalink / raw)
  To: Christoph Hellwig, Boaz Harrosh
  Cc: linux-nvdimm, linux-fsdevel, linux-kernel, x86, ross.zwisler, axboe

On 03/31/2015 12:25 PM, Christoph Hellwig wrote:
> On Thu, Mar 26, 2015 at 06:57:47PM +0200, Boaz Harrosh wrote:
<>
> 
> Any news?  I'd really like to resend this ASAP to get it into 4.1..


Hi Christoph

I hate to be bearer of bad news but we have a problem with the
e820 patch:
	x86: add support for the non-standard protected e820 type

We can not accept it as is right now.
We have conducted farther tests. And it messes up NUMA.

All the below is based on your latest patches on top of 4.0-rc5

Before any modprobe of pmem.ko, just a clean boot.

In the same exact Kernel, if you use memmap=nn!aa ie add "type-12"
section we have below problems, but if we do memmap=nn\$aa ie add
"reserved" section then everything is fine.
[With my old e820 patches it all works fine because it is closer
 to the memmap=nn\$aa "reserved" section way]

The problems we see in a NUMA machine.
* On some machines we cannot boot if a single memmap=nn!aa crosses
  a NUMA boundary.
  Some VMs sometime boot sometimes do not. Some VMs never boot.
  Some machines boot just fine.
  Doing memmap=nn1!aa1,nn2!aa2 where the split is at the NUMA
  boundary will enable all machines to boot

* Regardless if we use memmap=nn!aa crossing a NUMA boundary or
  if we do memmap=nn1!aa1,nn2!aa2 the output of 
  cat /sys/devices/system/node/node1/meminfo
  Is all ZEROs. Yes very scary everything is ZERO. Even though
  the dmseg prints show the correct numbers the above
  node1/meminfo is broken.
  If with the same Kernel we do memmap=nn\$aa then everything
  is clean.
  So something in the way we defined our new type-12 region
  upsets the NUMA code. And we need farther investigation.

Perhaps you would like to start with my e820 much more conservative
fix, that makes type-12 memory behave exactly as reserved memory.
And go from there, step by step. Until we fix the problem above.
So we can submit something like that for 4.1.

Talk to me, tell me what you need me to experiment with. Should I try
your platform device way but based on my e820 fix? how can I please
help push this through?

Thanks
Boaz


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: another pmem variant V2
  2015-03-31 10:25     ` Boaz Harrosh
@ 2015-03-31 10:31       ` Boaz Harrosh
  2015-03-31 16:08       ` Christoph Hellwig
  1 sibling, 0 replies; 23+ messages in thread
From: Boaz Harrosh @ 2015-03-31 10:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvdimm, linux-fsdevel, linux-kernel, x86, ross.zwisler, axboe

On 03/31/2015 01:25 PM, Boaz Harrosh wrote:
> On 03/31/2015 12:25 PM, Christoph Hellwig wrote:
<>
> The problem I see is that if I state a memmap=nn!aa that crosses a NUMA
> boundary then the machine will not boot.
> So BTW for sure I need that "don't merge E820_PMEM ranges" patch because
> otherwise I will not be able to boot if I have pmem on both NUMA nodes
> and they happen to be contiguous.
> 

BTW if you need NUMA emulated in your KVM add the below to your virsh edit

  <cpu>
    ....
    <numa>
      <cell cpus='0-3' memory='5242880'/>
      <cell cpus='4-7' memory='3145728'/>
    </numa>
  </cpu>


memory= is in KiB only
The sum of these must exactly match what you have for 
  <memory unit='KiB'>8388608</memory>
  <currentMemory unit='KiB'>8388608</currentMemory>

And you probably already have say:
    <topology sockets='2' cores='4' threads='1'/>

Cheers
Boaz


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: another pmem variant V2
  2015-03-31  9:25   ` Christoph Hellwig
@ 2015-03-31 10:25     ` Boaz Harrosh
  2015-03-31 10:31       ` Boaz Harrosh
  2015-03-31 16:08       ` Christoph Hellwig
  2015-03-31 15:14     ` Boaz Harrosh
  1 sibling, 2 replies; 23+ messages in thread
From: Boaz Harrosh @ 2015-03-31 10:25 UTC (permalink / raw)
  To: Christoph Hellwig, Boaz Harrosh
  Cc: linux-nvdimm, linux-fsdevel, linux-kernel, x86, ross.zwisler, axboe

On 03/31/2015 12:25 PM, Christoph Hellwig wrote:
> On Thu, Mar 26, 2015 at 06:57:47PM +0200, Boaz Harrosh wrote:
>> On 03/26/2015 10:32 AM, Christoph Hellwig wrote:
>>> Here is another version of the same trivial pmem driver, because two
>>> obviously aren't enough.  The first patch is the same pmem driver
>>> that Ross posted a short time ago, just modified to use platform_devices
>>> to find the persistant memory region instead of hardconding it in the
>>> Kconfig.  This allows to keep pmem.c separate from any discovery mechanism,
>>> but still allow auto-discovery.
>>>
>>
>> Hi Christoph
>>
>> So I've been trying to test your version, and play around with it.
>> I currently have some problems, but this is the end of the week for me
>> so I will debug and fix it after the weekend on Sunday.
> 
> Any news?  I'd really like to resend this ASAP to get it into 4.1..

Yes sorry I got stuck with the NUMA thing. I will finally finish today.

For some reason your patch with the memmap=nn!aa behaves differently than
with  memmap=nn\$aa

And also compared to my old e820.c fix + my old pmem. My fixes are effectively
the same as with your Kernel and the memmap=nn\$aa which sets the range "reserved"
like. And less "ram" like as in your patch.

The problem I see is that if I state a memmap=nn!aa that crosses a NUMA
boundary then the machine will not boot.
So BTW for sure I need that "don't merge E820_PMEM ranges" patch because
otherwise I will not be able to boot if I have pmem on both NUMA nodes
and they happen to be contiguous.

I do not understand why this happens, because a contiguous range of
RAM is fine with cross NUMA and pmem not. (Also the pmem defined as
memmap=nn!aa behaves the same)

Also we have another problem with NUMA that I'm researching for a solution
long term. Is that if the second NUMA node has only pmem and no RAM than
the Kernel will not define a second NUMA node. And we are NUMA screwed with
pmem. So one must put v-ram and nv-ram equally spread across his nodes.

Regarding the SQUASHMEs to PMEM. Originally I had them as 3-4 patches.
But I thought since you are squashing them into a single submitted patch
I can just send just the one patch. Tell me what you prefer and I'll
resend (The one vs the three)

And one last issue. I have some configuration "hardness" with the
memmap=nn!aa Kernel command line API, it was better for me with the
pmem map= module param. Will you be OK if I split pmem_probe() into
calling pmem_alloc(addr, length), so I can keep an out-of-tree patch
that adds the map= parameter to pmem?

Will send the patch in one hour. Just tell me if you need just the one
or three.

Thanks, Christoph
Boaz


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: another pmem variant V2
  2015-03-26 16:57 ` Boaz Harrosh
  2015-03-26 17:18   ` Christoph Hellwig
@ 2015-03-31  9:25   ` Christoph Hellwig
  2015-03-31 10:25     ` Boaz Harrosh
  2015-03-31 15:14     ` Boaz Harrosh
  1 sibling, 2 replies; 23+ messages in thread
From: Christoph Hellwig @ 2015-03-31  9:25 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Christoph Hellwig, linux-nvdimm, linux-fsdevel, linux-kernel,
	x86, ross.zwisler, axboe

On Thu, Mar 26, 2015 at 06:57:47PM +0200, Boaz Harrosh wrote:
> On 03/26/2015 10:32 AM, Christoph Hellwig wrote:
> > Here is another version of the same trivial pmem driver, because two
> > obviously aren't enough.  The first patch is the same pmem driver
> > that Ross posted a short time ago, just modified to use platform_devices
> > to find the persistant memory region instead of hardconding it in the
> > Kconfig.  This allows to keep pmem.c separate from any discovery mechanism,
> > but still allow auto-discovery.
> > 
> 
> Hi Christoph
> 
> So I've been trying to test your version, and play around with it.
> I currently have some problems, but this is the end of the week for me
> so I will debug and fix it after the weekend on Sunday.

Any news?  I'd really like to resend this ASAP to get it into 4.1..

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: another pmem variant V2
  2015-03-26 17:31     ` Boaz Harrosh
@ 2015-03-26 18:38       ` Christoph Hellwig
  0 siblings, 0 replies; 23+ messages in thread
From: Christoph Hellwig @ 2015-03-26 18:38 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Christoph Hellwig, linux-nvdimm, linux-fsdevel, linux-kernel,
	x86, ross.zwisler, axboe

On Thu, Mar 26, 2015 at 07:31:03PM +0200, Boaz Harrosh wrote:
> But I hope you are not ignoring my real problem. any two memmap= ranges
> will halt the boot. Specially if they are dis-contiguous.

not the case here, verified it in various cofigurations.

> Also I need the contiguous variant split into two devices because
> they might belong to two NUMA nodes. It is very hard to manage
> if a NUMA crossing is in a middle of a single pmemX device
> The way we like to configure it is that each /dev/pmem belongs to
> a single NUMA node. And in a multy device setup each CPU node
> allocates from "his" pmem device If there is space.
> (And it lets me set application affinity if need to)

The hack below ensures two separate type 12 entries stay separate,
but I'm not sure I really want this.  so far it seems like very special
hacks for your very specialized fake-pmem config.

> BTW: Will device mapper let me call ->direct_access()

Right now it doesn't, but it's not hard to add..

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: another pmem variant V2
  2015-03-26 17:18   ` Christoph Hellwig
@ 2015-03-26 17:31     ` Boaz Harrosh
  2015-03-26 18:38       ` Christoph Hellwig
  0 siblings, 1 reply; 23+ messages in thread
From: Boaz Harrosh @ 2015-03-26 17:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvdimm, linux-fsdevel, linux-kernel, x86, ross.zwisler, axboe

On 03/26/2015 07:18 PM, Christoph Hellwig wrote:
> On Thu, Mar 26, 2015 at 06:57:47PM +0200, Boaz Harrosh wrote:
>> For one this auto discovery of yours is very (very) nice but is a bit
>> inconvenience. Before I would reserve a big chuck on each NUMA range
>> on Kernel's memmap=  And then at pmem map= would slice and dice it
>> as I want hot style on modprobe with no need for reboot. Now I need
>> to do it on reboot theoretically. (You know xfstest needs lots of devices
>> some big some small ;-))
> 
> Slicing up a block device based on kernel options is not exactly a smart
> idea.  We have partitions that are perfectly fine for that.  If you
> really don't are about persistance of your partitioning you can just
> set up a device mapper table.  No need to reinvent the wheel.
> 

I know! fdisk is my friend, I know.

But I hope you are not ignoring my real problem. any two memmap= ranges
will halt the boot. Specially if they are dis-contiguous.

Also I need the contiguous variant split into two devices because
they might belong to two NUMA nodes. It is very hard to manage
if a NUMA crossing is in a middle of a single pmemX device
The way we like to configure it is that each /dev/pmem belongs to
a single NUMA node. And in a multy device setup each CPU node
allocates from "his" pmem device If there is space.
(And it lets me set application affinity if need to)

BTW: Will device mapper let me call ->direct_access()

Thanks
Boaz


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: another pmem variant V2
  2015-03-26 16:57 ` Boaz Harrosh
@ 2015-03-26 17:18   ` Christoph Hellwig
  2015-03-26 17:31     ` Boaz Harrosh
  2015-03-31  9:25   ` Christoph Hellwig
  1 sibling, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2015-03-26 17:18 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: linux-nvdimm, linux-fsdevel, linux-kernel, x86, ross.zwisler, axboe

On Thu, Mar 26, 2015 at 06:57:47PM +0200, Boaz Harrosh wrote:
> For one this auto discovery of yours is very (very) nice but is a bit
> inconvenience. Before I would reserve a big chuck on each NUMA range
> on Kernel's memmap=  And then at pmem map= would slice and dice it
> as I want hot style on modprobe with no need for reboot. Now I need
> to do it on reboot theoretically. (You know xfstest needs lots of devices
> some big some small ;-))

Slicing up a block device based on kernel options is not exactly a smart
idea.  We have partitions that are perfectly fine for that.  If you
really don't are about persistance of your partitioning you can just
set up a device mapper table.  No need to reinvent the wheel.

> Also with the modprob pmem map= I was supporting a PCIE memory card but
> I guess I need to throw this one out the door.

Send it my way to support it properly...  Just like any other PCIe device it
should have a proper driver.  For now it can register a pmem platform device,
but when you submit it I'd rather add slightly more low-level versions
of pmem_probe and pmem_remove that take a struct device * and possibly
a resource structure so that you can directly call into it.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: another pmem variant V2
  2015-03-26  8:32 Christoph Hellwig
@ 2015-03-26 16:57 ` Boaz Harrosh
  2015-03-26 17:18   ` Christoph Hellwig
  2015-03-31  9:25   ` Christoph Hellwig
  2015-03-31 22:11 ` Elliott, Robert (Server Storage)
  2015-04-01 19:33 ` Elliott, Robert (Server Storage)
  2 siblings, 2 replies; 23+ messages in thread
From: Boaz Harrosh @ 2015-03-26 16:57 UTC (permalink / raw)
  To: Christoph Hellwig, linux-nvdimm, linux-fsdevel, linux-kernel, x86
  Cc: ross.zwisler, axboe, boaz

On 03/26/2015 10:32 AM, Christoph Hellwig wrote:
> Here is another version of the same trivial pmem driver, because two
> obviously aren't enough.  The first patch is the same pmem driver
> that Ross posted a short time ago, just modified to use platform_devices
> to find the persistant memory region instead of hardconding it in the
> Kconfig.  This allows to keep pmem.c separate from any discovery mechanism,
> but still allow auto-discovery.
> 

Hi Christoph

So I've been trying to test your version, and play around with it.
I currently have some problems, but this is the end of the week for me
so I will debug and fix it after the weekend on Sunday.

But just a heads up for what I stumbled on.

For one this auto discovery of yours is very (very) nice but is a bit
inconvenience. Before I would reserve a big chuck on each NUMA range
on Kernel's memmap=  And then at pmem map= would slice and dice it
as I want hot style on modprobe with no need for reboot. Now I need
to do it on reboot theoretically. (You know xfstest needs lots of devices
some big some small ;-))

theoretically because if I try:
	memmap=1G!4G,1G!6G

It will not boot at all. (will debug on Sunday) I must have 2 devices
because each one is on another NUMA node.

If I do memmap=2G!4G,1G!6G , which is a contiguous range it will give
me only one device. Just an inconvenience. (And also this one sometimes
does not boot)

{All this in a VM for now did not even get to real HW yet}

Also with the modprob pmem map= I was supporting a PCIE memory card but
I guess I need to throw this one out the door.

Am also posting an RFC cleanup to your pmem driver. RFC because I was not
yet able to boot the all thing in our lab for heavy testing.

Thanks
Boaz

> The other two patches are a heavily rewritten version of the code that
> Intel gave to various storage vendors to discover the type 12 (and earlier
> type 6) nvdimms, which I massaged into a form that is hopefully suitable
> for mainline.
> 
> Note that pmem.c really is the minimal version as I think we need something
> included ASAP.  We'll eventually need to be able to do other I/O from and
> to it, and as most people know everyone has their own preferre method to
> do it, which I'd like to discuss once we have the basic driver in.
> 
> This has been tested both with a real NVDIMM on a system with a type 12
> capable bios, as well as with "fake persistent" memory using the memmap=
> option.
> 
> Changes since V1:
>   - s/E820_PROTECTED_KERN/E820_PMEM/g
>   - map the persistent memory as uncached
>   - better kernel parameter description
>   - various typo fixes
>   - MODULE_LICENSE fix
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* another pmem variant V2
@ 2015-03-26  8:32 Christoph Hellwig
  2015-03-26 16:57 ` Boaz Harrosh
                   ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Christoph Hellwig @ 2015-03-26  8:32 UTC (permalink / raw)
  To: linux-nvdimm, linux-fsdevel, linux-kernel, x86; +Cc: ross.zwisler, axboe, boaz

Here is another version of the same trivial pmem driver, because two
obviously aren't enough.  The first patch is the same pmem driver
that Ross posted a short time ago, just modified to use platform_devices
to find the persistant memory region instead of hardconding it in the
Kconfig.  This allows to keep pmem.c separate from any discovery mechanism,
but still allow auto-discovery.

The other two patches are a heavily rewritten version of the code that
Intel gave to various storage vendors to discover the type 12 (and earlier
type 6) nvdimms, which I massaged into a form that is hopefully suitable
for mainline.

Note that pmem.c really is the minimal version as I think we need something
included ASAP.  We'll eventually need to be able to do other I/O from and
to it, and as most people know everyone has their own preferre method to
do it, which I'd like to discuss once we have the basic driver in.

This has been tested both with a real NVDIMM on a system with a type 12
capable bios, as well as with "fake persistent" memory using the memmap=
option.

Changes since V1:
  - s/E820_PROTECTED_KERN/E820_PMEM/g
  - map the persistent memory as uncached
  - better kernel parameter description
  - various typo fixes
  - MODULE_LICENSE fix


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2015-04-02 18:03 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-26 18:38 another pmem variant V2 Christoph Hellwig
  -- strict thread matches above, loose matches on Subject: below --
2015-03-26  8:32 Christoph Hellwig
2015-03-26 16:57 ` Boaz Harrosh
2015-03-26 17:18   ` Christoph Hellwig
2015-03-26 17:31     ` Boaz Harrosh
2015-03-26 18:38       ` Christoph Hellwig
2015-03-31  9:25   ` Christoph Hellwig
2015-03-31 10:25     ` Boaz Harrosh
2015-03-31 10:31       ` Boaz Harrosh
2015-03-31 16:08       ` Christoph Hellwig
2015-03-31 15:14     ` Boaz Harrosh
2015-03-31 16:16       ` Christoph Hellwig
2015-03-31 16:44         ` Ingo Molnar
2015-03-31 17:24           ` Christoph Hellwig
2015-04-01 12:49         ` Boaz Harrosh
2015-03-31 22:11 ` Elliott, Robert (Server Storage)
2015-04-01  7:26   ` Christoph Hellwig
2015-04-02 15:11     ` Elliott, Robert (Server Storage)
2015-04-02 16:41       ` Christoph Hellwig
2015-04-02 18:03         ` Ingo Molnar
2015-04-02 18:03           ` Ingo Molnar
2015-04-01 19:33 ` Elliott, Robert (Server Storage)
2015-04-02  9:37   ` Christoph Hellwig

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.