From: Boaz Harrosh <boaz@plexistor.com>
To: Dan Williams <dan.j.williams@intel.com>, linux-kernel@vger.kernel.org
Cc: axboe@kernel.dk, riel@redhat.com, linux-nvdimm@ml01.01.org,
linux-mm@kvack.org, mgorman@suse.de,
torvalds@linux-foundation.org, hch@lst.de
Subject: Re: [PATCH v5 2/5] allow mapping page-less memremaped areas into KVA
Date: Thu, 13 Aug 2015 08:58:58 +0300 [thread overview]
Message-ID: <55CC3222.5090503@plexistor.com> (raw)
In-Reply-To: <20150813030109.36703.21738.stgit@otcpl-skl-sds-2.jf.intel.com>
On 08/13/2015 06:01 AM, Dan Williams wrote:
> Introduce a type that encapsulates a page-frame-number that can also be
> used to encode other information. This other information is the
> traditional "page_link" encoding in a scatterlist, but can also denote
> "device memory". Where "device memory" is a set of pfns that are not
> part of the kernel's linear mapping, but are accessed via the same
> memory controller as ram. The motivation for this conversion is large
> capacity persistent memory that does not enjoy struct page coverage,
> entries in 'memmap' by default.
>
> This type will be used in replace usage of 'struct page *' in cases
> where only a pfn is required, i.e. scatterlists for drivers, dma mapping
> api, and potentially biovecs for the block layer. The operations in
> those i/o paths that formerly required a 'struct page *' are converted
> to use __pfn_t aware equivalent helpers.
>
> It turns out that while 'struct page' references are used broadly in the
> kernel I/O stacks the usage of 'struct page' based capabilities is very
> shallow for block-i/o. It is only used for populating bio_vecs and
> scatterlists for the retrieval of dma addresses, and for temporary
> kernel mappings (kmap). Aside from kmap, these usages can be trivially
> converted to operate on a pfn.
>
> Indeed, kmap_atomic() is more problematic as it uses mm infrastructure,
> via struct page, to setup and track temporary kernel mappings. It would
> be unfortunate if the kmap infrastructure escaped its 32-bit/HIGHMEM
> bonds and leaked into 64-bit code. Thankfully, it seems all that is
> needed here is to convert kmap_atomic() callers, that want to opt-in to
> supporting persistent memory, to use a new kmap_atomic_pfn_t(). Where
> kmap_atomic_pfn_t() is enabled to re-use the existing ioremap() mapping
> established by the driver for persistent memory.
>
> Note, that as far as conceptually understanding __pfn_t is concerned,
> 'persistent memory' is really any address range in host memory not
> covered by memmap. Contrast this with pure iomem that is on an mmio
> mapped bus like PCI and cannot be converted to a dma_addr_t by "pfn <<
> PAGE_SHIFT".
>
> It would be unfortunate if the kmap infrastructure escaped its current
> 32-bit/HIGHMEM bonds and leaked into 64-bit code. Instead, if the user
> has enabled CONFIG_DEV_PFN we direct the kmap_atomic_pfn_t()
> implementation to scan a list of pre-mapped persistent memory address
> ranges inserted by the pmem driver.
>
> The __pfn_t to resource lookup is indeed inefficient walking of a linked list,
> but there are two mitigating factors:
>
> 1/ The number of persistent memory ranges is bounded by the number of
> DIMMs which is on the order of 10s of DIMMs, not hundreds.
>
> 2/ The lookup yields the entire range, if it becomes inefficient to do a
> kmap_atomic_pfn_t() a PAGE_SIZE at a time the caller can take
> advantage of the fact that the lookup can be amortized for all kmap
> operations it needs to perform in a given range.
>
> [hch: various changes]
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
> include/linux/kmap_pfn.h | 31 ++++++++++++
> include/linux/mm.h | 57 ++++++++++++++++++++++
> mm/Kconfig | 3 +
> mm/Makefile | 1
> mm/kmap_pfn.c | 117 ++++++++++++++++++++++++++++++++++++++++++++++
> 5 files changed, 209 insertions(+)
> create mode 100644 include/linux/kmap_pfn.h
> create mode 100644 mm/kmap_pfn.c
>
> diff --git a/include/linux/kmap_pfn.h b/include/linux/kmap_pfn.h
> new file mode 100644
> index 000000000000..fa44971d8e95
> --- /dev/null
> +++ b/include/linux/kmap_pfn.h
> @@ -0,0 +1,31 @@
> +#ifndef _LINUX_KMAP_PFN_H
> +#define _LINUX_KMAP_PFN_H 1
> +
> +#include <linux/highmem.h>
> +
> +struct device;
> +struct resource;
> +#ifdef CONFIG_KMAP_PFN
> +extern void *kmap_atomic_pfn_t(__pfn_t pfn);
> +extern void kunmap_atomic_pfn_t(void *addr);
> +extern int devm_register_kmap_pfn_range(struct device *dev,
> + struct resource *res, void *base);
> +#else
> +static inline void *kmap_atomic_pfn_t(__pfn_t pfn)
> +{
> + return kmap_atomic(__pfn_t_to_page(pfn));
> +}
> +
> +static inline void kunmap_atomic_pfn_t(void *addr)
> +{
> + __kunmap_atomic(addr);
> +}
> +
> +static inline int devm_register_kmap_pfn_range(struct device *dev,
> + struct resource *res, void *base)
> +{
> + return 0;
> +}
> +#endif /* CONFIG_KMAP_PFN */
> +
> +#endif /* _LINUX_KMAP_PFN_H */
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 84b05ebedb2d..57ba5ca6be72 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -924,6 +924,63 @@ static inline void set_page_links(struct page *page, enum zone_type zone,
> }
>
> /*
> + * __pfn_t: encapsulates a page-frame number that is optionally backed
> + * by memmap (struct page). This type will be used in place of a
> + * 'struct page *' instance in contexts where unmapped memory (usually
> + * persistent memory) is being referenced (scatterlists for drivers,
> + * biovecs for the block layer, etc). Whether a __pfn_t has a struct
> + * page backing is indicated by flags in the low bits of the value;
> + */
> +typedef struct {
> + unsigned long val;
> +} __pfn_t;
> +
> +/*
> + * PFN_SG_CHAIN - pfn is a pointer to the next scatterlist entry
> + * PFN_SG_LAST - pfn references a page and is the last scatterlist entry
> + * PFN_DEV - pfn is not covered by system memmap
> + */
> +enum {
> + PFN_MASK = (1UL << PAGE_SHIFT) - 1,
> + PFN_SG_CHAIN = (1UL << 0),
> + PFN_SG_LAST = (1UL << 1),
> +#ifdef CONFIG_KMAP_PFN
> + PFN_DEV = (1UL << 2),
> +#else
> + PFN_DEV = 0,
> +#endif
> +};
> +
> +static inline bool __pfn_t_has_page(__pfn_t pfn)
> +{
> + return (pfn.val & PFN_DEV) == 0;
> +}
> +
> +static inline unsigned long __pfn_t_to_pfn(__pfn_t pfn)
> +{
> + return pfn.val >> PAGE_SHIFT;
> +}
> +
> +static inline struct page *__pfn_t_to_page(__pfn_t pfn)
> +{
> + if (!__pfn_t_has_page(pfn))
> + return NULL;
> + return pfn_to_page(__pfn_t_to_pfn(pfn));
> +}
> +
> +static inline dma_addr_t __pfn_t_to_phys(__pfn_t pfn)
> +{
> + return __pfn_to_phys(__pfn_t_to_pfn(pfn));
> +}
> +
> +static inline __pfn_t page_to_pfn_t(struct page *page)
> +{
> + __pfn_t pfn = { .val = page_to_pfn(page) << PAGE_SHIFT, };
> +
> + return pfn;
> +}
> +
> +/*
> * Some inline functions in vmstat.h depend on page_zone()
> */
> #include <linux/vmstat.h>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index e79de2bd12cd..ed1be8ff982e 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -654,3 +654,6 @@ config DEFERRED_STRUCT_PAGE_INIT
> when kswapd starts. This has a potential performance impact on
> processes running early in the lifetime of the systemm until kswapd
> finishes the initialisation.
> +
> +config KMAP_PFN
> + bool
> diff --git a/mm/Makefile b/mm/Makefile
> index 98c4eaeabdcb..f7b27958ea69 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -78,3 +78,4 @@ obj-$(CONFIG_CMA) += cma.o
> obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
> obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
> obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
> +obj-$(CONFIG_KMAP_PFN) += kmap_pfn.o
> diff --git a/mm/kmap_pfn.c b/mm/kmap_pfn.c
> new file mode 100644
> index 000000000000..2d58e167dfbc
> --- /dev/null
> +++ b/mm/kmap_pfn.c
> @@ -0,0 +1,117 @@
> +/*
> + * Copyright(c) 2015 Intel Corporation. All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of version 2 of the GNU General Public License as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + * General Public License for more details.
> + */
> +#include <linux/rcupdate.h>
> +#include <linux/rculist.h>
> +#include <linux/highmem.h>
> +#include <linux/device.h>
> +#include <linux/mutex.h>
> +#include <linux/slab.h>
> +#include <linux/mm.h>
> +
> +static LIST_HEAD(ranges);
> +static DEFINE_MUTEX(register_lock);
> +
> +struct kmap {
> + struct list_head list;
> + struct resource *res;
> + struct device *dev;
> + void *base;
> +};
> +
> +static void teardown_kmap(void *data)
> +{
> + struct kmap *kmap = data;
> +
> + dev_dbg(kmap->dev, "kmap unregister %pr\n", kmap->res);
> + mutex_lock(®ister_lock);
> + list_del_rcu(&kmap->list);
> + mutex_unlock(®ister_lock);
> + synchronize_rcu();
> + kfree(kmap);
> +}
> +
> +int devm_register_kmap_pfn_range(struct device *dev, struct resource *res,
> + void *base)
> +{
> + struct kmap *kmap = kzalloc(sizeof(*kmap), GFP_KERNEL);
> + int rc;
> +
> + if (!kmap)
> + return -ENOMEM;
> +
> + INIT_LIST_HEAD(&kmap->list);
> + kmap->res = res;
> + kmap->base = base;
> + kmap->dev = dev;
> + rc = devm_add_action(dev, teardown_kmap, kmap);
> + if (rc) {
> + kfree(kmap);
> + return rc;
> + }
> + dev_dbg(kmap->dev, "kmap register %pr\n", kmap->res);
> +
> + mutex_lock(®ister_lock);
> + list_add_rcu(&kmap->list, &ranges);
> + mutex_unlock(®ister_lock);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(devm_register_kmap_pfn_range);
> +
> +void *kmap_atomic_pfn_t(__pfn_t pfn)
> +{
> + struct page *page = __pfn_t_to_page(pfn);
> + resource_size_t addr;
> + struct kmap *kmap;
> +
> + rcu_read_lock();
> + if (page)
> + return kmap_atomic(page);
Right even with pages I pay rcu_read_lock(); for every access?
> + addr = __pfn_t_to_phys(pfn);
> + list_for_each_entry_rcu(kmap, &ranges, list)
> + if (addr >= kmap->res->start && addr <= kmap->res->end)
> + return kmap->base + addr - kmap->res->start;
> +
Good god! This loop is a real *joke*. You have just dropped memory access
performance by 10 fold.
The all point of pages and memory_model.h was to have a one to one
relation-ships between Kernel-virtual vs physical vs page *
There is already an object that holds a relationship of physical
to Kernel-virtual. It is called a memory-section. Why not just
widen its definition?
If you are willing to accept this loop. In current Linux 2015 Kernel
Then I have nothing farther to say.
Boaz - go mourning for the death of the Linux Kernel alone in the corner ;-(
> + /* only unlock in the error case */
> + rcu_read_unlock();
> + return NULL;
> +}
> +EXPORT_SYMBOL(kmap_atomic_pfn_t);
> +
> +void kunmap_atomic_pfn_t(void *addr)
> +{
> + struct kmap *kmap;
> + bool dev_pfn = false;
> +
> + if (!addr)
> + return;
> +
> + /*
> + * If the original __pfn_t had an entry in the memmap (i.e.
> + * !PFN_DEV) then 'addr' will be outside of the registered
> + * ranges and we'll need to kunmap_atomic() it.
> + */
> + list_for_each_entry_rcu(kmap, &ranges, list)
> + if (addr < kmap->base + resource_size(kmap->res)
> + && addr >= kmap->base) {
> + dev_pfn = true;
> + break;
> + }
> +
> + if (!dev_pfn)
> + kunmap_atomic(addr);
> +
> + /* signal that we are done with the range */
> + rcu_read_unlock();
> +}
> +EXPORT_SYMBOL(kunmap_atomic_pfn_t);
>
> _______________________________________________
> Linux-nvdimm mailing list
> Linux-nvdimm@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm
>
next prev parent reply other threads:[~2015-08-13 5:59 UTC|newest]
Thread overview: 21+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-08-13 3:00 [PATCH v5 0/5] introduce __pfn_t for unmapped pfn I/O and DAX lifetime Dan Williams
2015-08-13 3:01 ` [PATCH v5 1/5] mm: move __phys_to_pfn and __pfn_to_phys to asm/generic/memory_model.h Dan Williams
2015-08-13 3:01 ` [PATCH v5 2/5] allow mapping page-less memremaped areas into KVA Dan Williams
2015-08-13 5:58 ` Boaz Harrosh [this message]
2015-08-13 12:57 ` Dan Williams
2015-08-13 13:23 ` Boaz Harrosh
2015-08-13 14:41 ` Christoph Hellwig
2015-08-13 15:01 ` Boaz Harrosh
2015-08-13 14:37 ` Christoph Hellwig
2015-08-13 14:48 ` Boaz Harrosh
2015-08-13 15:29 ` Boaz Harrosh
2015-08-13 17:37 ` Dave Hansen
2015-08-13 17:35 ` Matthew Wilcox
2015-08-13 18:15 ` Dan Williams
2015-08-13 3:01 ` [PATCH v5 3/5] dax: drop size parameter to ->direct_access() Dan Williams
2015-08-13 3:01 ` [PATCH v5 4/5] dax: fix mapping lifetime handling, convert to __pfn_t + kmap_atomic_pfn_t() Dan Williams
2015-08-13 6:26 ` Boaz Harrosh
2015-08-13 15:21 ` Dan Williams
2015-08-13 16:34 ` Boaz Harrosh
2015-08-13 18:51 ` Dan Williams
2015-08-13 3:01 ` [PATCH v5 5/5] scatterlist: convert to __pfn_t Dan Williams
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=55CC3222.5090503@plexistor.com \
--to=boaz@plexistor.com \
--cc=axboe@kernel.dk \
--cc=dan.j.williams@intel.com \
--cc=hch@lst.de \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-nvdimm@ml01.01.org \
--cc=mgorman@suse.de \
--cc=riel@redhat.com \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).