From: Xishi Qiu <qiuxishi@gmail.com>
To: Dave Hansen <dave.hansen@linux.intel.com>, linux-kernel@vger.kernel.org
Cc: thomas.lendacky@amd.com, mhocko@suse.com,
	Xishi Qiu <qiuxishi@linux.alibaba.com>,
	linux-nvdimm@lists.01.org, ying.huang@intel.com,
	linux-mm@kvack.org, zy107165@alibaba-inc.com, zwisler@kernel.org,
	fengguang.wu@intel.com, akpm@linux-foundation.org
Subject: Re: [PATCH 0/9] Allow persistent memory to be used like normal RAM
Date: Fri, 26 Oct 2018 13:42:43 +0800
Message-ID: <debe98dd-39f3-18d5-aeb4-fe94519aa0c9@gmail.com>
In-Reply-To: <20181022201317.8558C1D8@viggo.jf.intel.com>

Hi Dave,

This patchset hot-adds a pmem device and uses it like normal
DRAM. I have some questions here, and I think our production
environment may be concerned with them as well.

1) How do we set the AEP (Apache Pass) usage percentage for one
process (or a vma)?
e.g. there are two VMs from two customers who pay different
amounts for them. If we allocate and convert AEP/DRAM globally,
the high-load VM may get 100% DRAM and the low-load VM may get
100% AEP, which is unfair. The load is only "low" relative to
the other VM; taken on its own, it may actually be quite high.

2) I find that page idle tracking only checks the accessed bit,
_PAGE_BIT_ACCESSED. As we know, AEP read performance is much
higher than write performance, so I think we should also check
the dirty bit, _PAGE_BIT_DIRTY. Testing and clearing the dirty
bit is safe for an anon page, but unsafe for a file page; e.g.
we should call clear_page_dirty_for_io() first, right?
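
For illustration, here is a tiny userspace sketch against the
existing page_idle interface (needs CONFIG_IDLE_PAGE_TRACKING and
root; the PFN below is made up). It can only tell us about the
accessed bit, not the dirty bit:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	uint64_t bits;
	unsigned long pfn = 0x80000;	/* made-up PFN, illustration only */
	int fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDONLY);

	if (fd < 0)
		return 1;
	/* each 8-byte word covers 64 PFNs, reads must be 8-byte aligned */
	if (pread(fd, &bits, sizeof(bits), (pfn / 64) * 8) == sizeof(bits))
		printf("pfn 0x%lx idle: %d\n", pfn,
		       (int)!!(bits & (1ULL << (pfn % 64))));
	close(fd);
	return 0;
}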

3) I think we should manage the AEP memory separately instead
of together with the DRAM. Managing them together may require
fewer code changes, but it causes problems for high-priority
DRAM allocations: if no DRAM is left, we have to convert (steal
DRAM) from another process, which takes a long time.
How about creating a new zone, e.g. ZONE_AEP, plus a new vma
flag, VM_AEP, set via madvise, which allows the vma to allocate
AEP memory at page fault time, and then using a vma_rss_stat
(like mm_rss_stat) to control the AEP usage percentage per vma?
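
On the userspace side it could look roughly like this (just a
sketch; MADV_AEP and its value are made up here, nothing like it
exists today):

#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_AEP
#define MADV_AEP 70	/* made-up advice value, illustration only */
#endif

/* Ask the kernel to set VM_AEP on this range, so later page faults
 * may be satisfied from ZONE_AEP and accounted in vma_rss_stat. */
static int allow_aep(void *addr, size_t len)
{
	return madvise(addr, len, MADV_AEP);
}

int main(void)
{
	size_t len = 1UL << 21;		/* 2MB, arbitrary */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	/* expected to fail with EINVAL today, since MADV_AEP is made up */
	return allow_aep(p, len) ? 1 : 0;
}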

4) I am interested in the conversion mechanism between AEP and
DRAM. I think NUMA balancing will cause page faults, which is
unacceptable for some apps because it causes performance jitter,
and kswapd is not precise enough. So a daemon kernel thread
(like khugepaged) may be a good solution: add the processes
using AEP to a list, then scan their VM_AEP marked vmas, get the
access state, and do the conversion.
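
The scanning loop I imagine has roughly this shape (pseudo-kernel
code only; kaepd, aep_mm_list, aep_node, aep_scan_mm and
scan_sleep_jiffies are names I just made up for illustration):

#include <linux/kthread.h>
#include <linux/list.h>
#include <linux/sched.h>

static int kaepd(void *unused)
{
	while (!kthread_should_stop()) {
		struct mm_struct *mm;

		/* walk the mms that registered VM_AEP vmas */
		list_for_each_entry(mm, &aep_mm_list, aep_node) {
			/* check accessed/dirty bits on each vma, then
			 * migrate hot pages to DRAM, cold pages to AEP */
			aep_scan_mm(mm);
		}

		schedule_timeout_interruptible(scan_sleep_jiffies);
	}
	return 0;
}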

Thanks,
Xishi Qiu

On 2018/10/23 04:13, Dave Hansen wrote:
> Persistent memory is cool.  But, currently, you have to rewrite
> your applications to use it.  Wouldn't it be cool if you could
> just have it show up in your system like normal RAM and get to
> it like a slow blob of memory?  Well... have I got the patch
> series for you!
> 
> This series adds a new "driver" to which pmem devices can be
> attached.  Once attached, the memory "owned" by the device is
> hot-added to the kernel and managed like any other memory.  On
> systems with an HMAT (a new ACPI table), each socket (roughly)
> will have a separate NUMA node for its persistent memory so
> this newly-added memory can be selected by its unique NUMA
> node.
> 
> This is highly RFC, and I really want the feedback from the
> nvdimm/pmem folks about whether this is a viable long-term
> perversion of their code and device mode.  It's insufficiently
> documented and probably not bisectable either.
> 
> Todo:
> 1. The device re-binding hacks are ham-fisted at best.  We
>    need a better way of doing this, especially so the kmem
>    driver does not get in the way of normal pmem devices.
> 2. When the device has no proper node, we default it to
>    NUMA node 0.  Is that OK?
> 3. We muck with the 'struct resource' code quite a bit. It
>    definitely needs a once-over from folks more familiar
>    with it than I.
> 4. Is there a better way to do this than starting with a
>    copy of pmem.c?
> 
> Here's how I set up a system to test this thing:
> 
> 1. Boot qemu with lots of memory: "-m 4096", for instance
> 2. Reserve 512MB of physical memory.  Reserving a spot at 2GB
>    physical seems to work: memmap=512M!0x0000000080000000
>    This will end up looking like a pmem device at boot.
> 3. When booted, convert fsdax device to "device dax":
> 	ndctl create-namespace -fe namespace0.0 -m dax
> 4. In the background, the kmem driver will probably bind to the
>    new device.
> 5. Now, online the new memory sections.  Perhaps:
> 
> grep ^MemTotal /proc/meminfo
> for f in `grep -vl online /sys/devices/system/memory/*/state`; do
> 	echo $f: `cat $f`
> 	echo online > $f
> 	grep ^MemTotal /proc/meminfo
> done
> 
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Ross Zwisler <zwisler@kernel.org>
> Cc: Vishal Verma <vishal.l.verma@intel.com>
> Cc: Tom Lendacky <thomas.lendacky@amd.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: linux-nvdimm@lists.01.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Fengguang Wu <fengguang.wu@intel.com>
> 