From: Boaz Harrosh <boaz@plexistor.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@linux.intel.com>,
	linux-arch@vger.kernel.org, axboe@kernel.dk, riel@redhat.com,
	hch@infradead.org, linux-nvdimm@ml01.01.org,
	Dave Hansen <dave.hansen@linux.intel.com>,
	linux-kernel@vger.kernel.org, linux-raid@vger.kernel.org,
	mgorman@suse.de, linux-fsdevel@vger.kernel.org
Subject: Re: [Linux-nvdimm] [RFC PATCH 0/7] evacuate struct page from the block layer
Date: Sun, 22 Mar 2015 12:30:34 +0200
Message-ID: <550E99CA.5090004@plexistor.com>
In-Reply-To: <20150319125917.6cc2bf02687aab542027d8ac@linux-foundation.org>

On 03/19/2015 09:59 PM, Andrew Morton wrote:
> On Thu, 19 Mar 2015 17:54:15 +0200 Boaz Harrosh <boaz@plexistor.com> wrote:
> 
>> On 03/19/2015 03:43 PM, Matthew Wilcox wrote:
>> <>
>>>
>>> Dan missed "Support O_DIRECT to a mapped DAX file".  More generally, if we
>>> want to be able to do any kind of I/O directly to persistent memory,
>>> and I think we do, we need to do one of:
>>>
>>> 1. Construct struct pages for persistent memory
>>> 1a. Permanently
>>> 1b. While the pages are under I/O
>>> 2. Teach the I/O layers to deal in PFNs instead of struct pages
>>> 3. Replace struct page with some other structure that can represent both
>>>    DRAM and PMEM
>>>
>>> I'm personally a fan of #3, and I was looking at the scatterlist as
>>> my preferred data structure.  I now believe the scatterlist as it is
>>> currently defined isn't sufficient, so we probably end up needing a new
>>> data structure.  I think Dan's preferred method of replacing struct
>>> pages with PFNs is actually less intrusive, but doesn't give us as
>>> much advantage (an entirely new data structure would let us move to an
>>> extent based system at the same time, instead of sticking with an array
>>> of pages).  Clearly Boaz prefers 1a, which works well enough for the
>>> 8GB NV-DIMMs, but not well enough for the 400GB NV-DIMMs.
>>>
>>> What's your preference?  I guess option 0 is "force all I/O to go
>>> through the page cache and then get copied", but that feels like a nasty
>>> performance hit.
>>
>> Thanks Matthew, you have summarized it perfectly.
>>
>> I think #1b might have merit, as well.
> 
> It would be interesting to see what a 1b implementation looks like and
> how it performs.  We already allocate a bunch of temporary things to
> support in-flight IO (bio, request) and allocating pageframes on the
> same basis seems a fairly logical fit.

There are a couple of ways we can do this. They are all kind of
"hacks" to me, along the lines of how transparent huge pages is a
hack, a very nice one at that, and everyone who knows me knows I
love hacks; but a hack it is nevertheless.

So it is all about designating the page to mean something else
when a certain flag is set.

And actually transparent huge pages is the core of this, because
core page operations already switch on its presence (for example
get_page/put_page).
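
Just to show the kind of switch I mean, a minimal sketch (nothing
authoritative; PG_pmem and pmem_get_page() are invented names, and
the real get_page() path is more involved than this):

/*
 * Sketch only: a flag-based branch on the page-reference path,
 * analogous to the branch compound (huge) pages already take.
 * PG_pmem and pmem_get_page() are hypothetical.
 */
static inline void sketch_get_page(struct page *page)
{
	if (unlikely(test_bit(PG_pmem, &page->flags))) {
		pmem_get_page(page);	/* hypothetical pmem path */
		return;
	}
	get_page(page);			/* the normal path */
}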

And because we do not want to allocate these pages inline, as part
of a section, we also need a new define or two in memory_model.h.
(Maybe this can be avoided; I need to stare harder at this.)
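
Something along these lines, say (all the pmem_* names here are
invented, this only shows the shape of the extra case):

/*
 * Sketch: if the pmem page structs live in a separately allocated
 * array instead of the section memmap, pfn_to_page() grows one more
 * case.  pfn_is_pmem(), pmem_page_map and pmem_start_pfn are all
 * invented names.
 */
#define __pfn_to_page_with_pmem(pfn)				\
	(pfn_is_pmem(pfn) ?					\
		&pmem_page_map[(pfn) - pmem_start_pfn] :	\
		__pfn_to_page(pfn))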

> 
> It is all a bit of a stopgap, designed to shoehorn
> direct-io-to-dax-mapped-memory into the existing world.  Longer term
> I'd expect us to move to something more powerful, but it's unclear what
> that will be at this time, so a stopgap isn't too bad?
> 

I'd bet real huge pages are the long term. The one stumbling block
for huge pages is that no one wants to dirty a full 2M for two
changed bytes; 4K is the I/O granularity we all calculate
performance for. This can be solved in a couple of ways, all very
invasive to lots of kernel areas.
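
For example, one (invasive) option: keep a 4K-granularity dirty
bitmap per 2M huge page, so writeback only writes back the 4K
sub-pages that actually changed. A sketch of the bookkeeping, just
to show the cost, 512 bits (64 bytes) per 2M (none of this exists
anywhere):

/*
 * Sketch only: per-2M dirty bitmap at 4K granularity.  Without it,
 * two changed bytes dirty the whole 2M, a 512x write amplification
 * versus a single 4K write.
 */
#define HPAGE_SIZE		(2UL << 20)
#define SUBPAGE_SHIFT		12		/* 4K sub-pages */
#define SUBPAGES_PER_HPAGE	(HPAGE_SIZE >> SUBPAGE_SHIFT) /* 512 */

struct hpage_dirty {
	DECLARE_BITMAP(bits, SUBPAGES_PER_HPAGE);	/* 64 bytes */
};

static inline void hpage_mark_dirty(struct hpage_dirty *d,
				    unsigned long byte_off)
{
	set_bit(byte_off >> SUBPAGE_SHIFT, d->bits);
}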

Lots of times the problem is "where do you start?"

> 
> This is all contingent upon the prevalence of machines which have vast
> amounts of nv memory and relatively small amounts of regular memory. 
> How confident are we that this really is the future?
> 

One thing you guys are ignoring is that the 1.5% "waste" can come
from nv-memory itself (a 64-byte struct page per 4K page is
64/4096, about 1.5%). If real RAM is scarce and nv-ram is dirt
cheap, then just allocate the pages from nv-ram.

And do not forget what comes very soon after the availability of
real nv-ram, I mean not the battery-backed kind but the real thing,
like MRAM or ReRAM: lots of machines will be 100% nv-ram plus SRAM
caches. This has nothing to do with storage speed; it is about
power consumption. The machine shuts off and picks up exactly where
it was. (Even while powered on they consume much less, since there
are no refreshes.) In those machines a partition of storage, say
the swap partition, will serve as the volatile memory section of
the machine, zeroed out on boot and used as RAM.

So the problematic future above does not exist: pages can just be
allocated from the cheapest memory you have, and be done with it.

(BTW, all of this can already be done today; I have demonstrated it
 in the lab: a reserved NVDIMM memory region is memory-hot-plugged
 and is thereafter used as regular RAM.)
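
In rough code the experiment was little more than this (the base
address and size below are made up, and the hot-added blocks still
get onlined afterwards through the usual sysfs interface):

/*
 * Sketch of the lab experiment: hot-add a reserved NVDIMM region so
 * the page allocator can hand it out as regular RAM.  add_memory()
 * is the real memory-hotplug entry point; start/size are made up.
 */
#include <linux/init.h>
#include <linux/memory_hotplug.h>

static int __init nvdimm_as_ram_init(void)
{
	u64 start = 0x100000000ULL;	/* made-up reserved region base */
	u64 size  = 8ULL << 30;		/* made-up size, 8G */

	return add_memory(0, start, size);	/* node 0 assumed */
}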

Thanks
Boaz


