linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/3] Make pfn_t suitable for placing in the radix tree
@ 2016-03-11 21:13 Matthew Wilcox
  2016-03-11 21:13 ` [PATCH 1/3] pfn_t: Change the encoding Matthew Wilcox
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Matthew Wilcox @ 2016-03-11 21:13 UTC (permalink / raw)
  To: Dan Williams; +Cc: Matthew Wilcox, linux-nvdimm, linux-kernel, linux-mm, willy

I did some experimenting with converting the DAX radix tree from storing
sector numbers to storing pfn_t.  While we're not ready to do that
conversion yet, these pieces make sense to at least get reviewed now,
and maybe get upstream.

I think the first patch is worthwhile all by itself as a stepping stone to
making SG lists contain PFNs instead of pages.

Matthew Wilcox (3):
  pfn_t: Change the encoding
  pfn_t: Support for huge PFNs
  pfn_t: New functions pfn_t_add and pfn_t_cmp

 include/linux/pfn_t.h | 72 +++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 61 insertions(+), 11 deletions(-)

-- 
2.7.0

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 1/3] pfn_t: Change the encoding
  2016-03-11 21:13 [PATCH 0/3] Make pfn_t suitable for placing in the radix tree Matthew Wilcox
@ 2016-03-11 21:13 ` Matthew Wilcox
  2016-03-11 21:40   ` Dan Williams
  2016-03-11 21:13 ` [PATCH 2/3] pfn_t: Support for huge PFNs Matthew Wilcox
  2016-03-11 21:13 ` [PATCH 3/3] pfn_t: New functions pfn_t_add and pfn_t_cmp Matthew Wilcox
  2 siblings, 1 reply; 8+ messages in thread
From: Matthew Wilcox @ 2016-03-11 21:13 UTC (permalink / raw)
  To: Dan Williams; +Cc: Matthew Wilcox, linux-nvdimm, linux-kernel, linux-mm, willy

By moving the flag bits to the bottom, we encourage commonality
between SGs with pages and those using pfn_t.  We can also then insert
a pfn_t into a radix tree, as it uses the same two bits for indirect &
exceptional indicators.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 include/linux/pfn_t.h | 41 ++++++++++++++++++++++++++++++-----------
 1 file changed, 30 insertions(+), 11 deletions(-)

diff --git a/include/linux/pfn_t.h b/include/linux/pfn_t.h
index 9499481..37c596c 100644
--- a/include/linux/pfn_t.h
+++ b/include/linux/pfn_t.h
@@ -8,16 +8,34 @@
  * PFN_SG_LAST - pfn references a page and is the last scatterlist entry
  * PFN_DEV - pfn is not covered by system memmap by default
  * PFN_MAP - pfn has a dynamic page mapping established by a device driver
+ *
+ * Note that the bottom two bits in the pfn_t match the bottom two bits in the
+ * scatterlist so sg_is_chain() and sg_is_last() work.  These bits are also
+ * used by the radix tree for its own purposes, but a PFN cannot be in both a
+ * radix tree and a scatterlist simultaneously.  If a PFN is moved between the
+ * two usages, care should be taken to clear/set these bits appropriately.
  */
-#define PFN_FLAGS_MASK (((u64) ~PAGE_MASK) << (BITS_PER_LONG_LONG - PAGE_SHIFT))
-#define PFN_SG_CHAIN (1ULL << (BITS_PER_LONG_LONG - 1))
-#define PFN_SG_LAST (1ULL << (BITS_PER_LONG_LONG - 2))
-#define PFN_DEV (1ULL << (BITS_PER_LONG_LONG - 3))
-#define PFN_MAP (1ULL << (BITS_PER_LONG_LONG - 4))
+#define PFN_FLAG_BITS	4
+#define PFN_FLAGS_MASK	((1 << PFN_FLAG_BITS) - 1)
+#define __PFN_MAX	((1 << (BITS_PER_LONG - PFN_FLAG_BITS)) - 1)
+#define PFN_SG_CHAIN	0x01UL
+#define PFN_SG_LAST	0x02UL
+#define PFN_SG_MASK	(PFN_SG_CHAIN | PFN_SG_LAST)
+#define PFN_DEV		0x04UL
+#define PFN_MAP		0x08UL
+
+#if 0
+#define PFN_T_BUG_ON(x)	BUG_ON(x)
+#else
+#define PFN_T_BUG_ON(x)	BUILD_BUG_ON_INVALID(x)
+#endif
 
 static inline pfn_t __pfn_to_pfn_t(unsigned long pfn, u64 flags)
 {
-	pfn_t pfn_t = { .val = pfn | (flags & PFN_FLAGS_MASK), };
+	pfn_t pfn_t = { .val = (pfn << PFN_FLAG_BITS) | flags };
+
+	PFN_T_BUG_ON(pfn & ~__PFN_MAX);
+	PFN_T_BUG_ON(flags & ~PFN_FLAGS_MASK);
 
 	return pfn_t;
 }
@@ -28,6 +46,12 @@ static inline pfn_t pfn_to_pfn_t(unsigned long pfn)
 	return __pfn_to_pfn_t(pfn, 0);
 }
 
+static inline unsigned long pfn_t_to_pfn(pfn_t pfn)
+{
+	unsigned long v = pfn.val;
+	return v >> PFN_FLAG_BITS;
+}
+
 extern pfn_t phys_to_pfn_t(phys_addr_t addr, u64 flags);
 
 static inline bool pfn_t_has_page(pfn_t pfn)
@@ -35,11 +59,6 @@ static inline bool pfn_t_has_page(pfn_t pfn)
 	return (pfn.val & PFN_MAP) == PFN_MAP || (pfn.val & PFN_DEV) == 0;
 }
 
-static inline unsigned long pfn_t_to_pfn(pfn_t pfn)
-{
-	return pfn.val & ~PFN_FLAGS_MASK;
-}
-
 static inline struct page *pfn_t_to_page(pfn_t pfn)
 {
 	if (pfn_t_has_page(pfn))
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 2/3] pfn_t: Support for huge PFNs
  2016-03-11 21:13 [PATCH 0/3] Make pfn_t suitable for placing in the radix tree Matthew Wilcox
  2016-03-11 21:13 ` [PATCH 1/3] pfn_t: Change the encoding Matthew Wilcox
@ 2016-03-11 21:13 ` Matthew Wilcox
  2016-03-11 21:13 ` [PATCH 3/3] pfn_t: New functions pfn_t_add and pfn_t_cmp Matthew Wilcox
  2 siblings, 0 replies; 8+ messages in thread
From: Matthew Wilcox @ 2016-03-11 21:13 UTC (permalink / raw)
  To: Dan Williams; +Cc: Matthew Wilcox, linux-nvdimm, linux-kernel, linux-mm, willy

We need to put entries in the radix tree that represent PMDs and PUDs.
Use another bit to determine if this PFN entry represents a huge page
or not.  If it does, we know the bottom few bits of the PFN are zero,
so we can reuse them to distinguish between PMDs and PUDs.  Thanks to
Neil Brown for suggesting this more compact encoding.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 include/linux/pfn_t.h | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/include/linux/pfn_t.h b/include/linux/pfn_t.h
index 37c596c..0ff4c4e 100644
--- a/include/linux/pfn_t.h
+++ b/include/linux/pfn_t.h
@@ -8,14 +8,21 @@
  * PFN_SG_LAST - pfn references a page and is the last scatterlist entry
  * PFN_DEV - pfn is not covered by system memmap by default
  * PFN_MAP - pfn has a dynamic page mapping established by a device driver
+ * PFN_HUGE - pfn represents a huge page
  *
  * Note that the bottom two bits in the pfn_t match the bottom two bits in the
  * scatterlist so sg_is_chain() and sg_is_last() work.  These bits are also
  * used by the radix tree for its own purposes, but a PFN cannot be in both a
  * radix tree and a scatterlist simultaneously.  If a PFN is moved between the
  * two usages, care should be taken to clear/set these bits appropriately.
+ *
+ * If PFN_HUGE is set, the bottom PMD_SHIFT-PAGE_SHIFT bits are known to be
+ * zero.  We reuse them to indicate whether the PFN represents a PMD or a PUD.
+ *
+ * This scheme supports 27 bits on a 32-bit system (512GB of physical address
+ * space) and 59 bits on a 64-bit system (2048EB of physical address space).
  */
-#define PFN_FLAG_BITS	4
+#define PFN_FLAG_BITS	5
 #define PFN_FLAGS_MASK	((1 << PFN_FLAG_BITS) - 1)
 #define __PFN_MAX	((1 << (BITS_PER_LONG - PFN_FLAG_BITS)) - 1)
 #define PFN_SG_CHAIN	0x01UL
@@ -23,6 +30,11 @@
 #define PFN_SG_MASK	(PFN_SG_CHAIN | PFN_SG_LAST)
 #define PFN_DEV		0x04UL
 #define PFN_MAP		0x08UL
+#define PFN_HUGE	0x10UL
+#define PFN_SIZE_MASK	0x30UL
+#define PFN_SIZE_PTE	0x00UL
+#define PFN_SIZE_PMD	0x10UL
+#define PFN_SIZE_PUD	0x30UL
 
 #if 0
 #define PFN_T_BUG_ON(x)	BUG_ON(x)
@@ -35,7 +47,7 @@ static inline pfn_t __pfn_to_pfn_t(unsigned long pfn, u64 flags)
 	pfn_t pfn_t = { .val = (pfn << PFN_FLAG_BITS) | flags };
 
 	PFN_T_BUG_ON(pfn & ~__PFN_MAX);
-	PFN_T_BUG_ON(flags & ~PFN_FLAGS_MASK);
+	PFN_T_BUG_ON(flags & ~(PFN_FLAGS_MASK | PFN_SIZE_MASK));
 
 	return pfn_t;
 }
@@ -46,9 +58,12 @@ static inline pfn_t pfn_to_pfn_t(unsigned long pfn)
 	return __pfn_to_pfn_t(pfn, 0);
 }
 
+/* The bottom bits of the PFN may be set if PFN_HUGE is set */
 static inline unsigned long pfn_t_to_pfn(pfn_t pfn)
 {
 	unsigned long v = pfn.val;
+	if (v & PFN_HUGE)
+		v &= ~PFN_SIZE_MASK;
 	return v >> PFN_FLAG_BITS;
 }
 
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 3/3] pfn_t: New functions pfn_t_add and pfn_t_cmp
  2016-03-11 21:13 [PATCH 0/3] Make pfn_t suitable for placing in the radix tree Matthew Wilcox
  2016-03-11 21:13 ` [PATCH 1/3] pfn_t: Change the encoding Matthew Wilcox
  2016-03-11 21:13 ` [PATCH 2/3] pfn_t: Support for huge PFNs Matthew Wilcox
@ 2016-03-11 21:13 ` Matthew Wilcox
  2 siblings, 0 replies; 8+ messages in thread
From: Matthew Wilcox @ 2016-03-11 21:13 UTC (permalink / raw)
  To: Dan Williams; +Cc: Matthew Wilcox, linux-nvdimm, linux-kernel, linux-mm, willy

When we find a huge PFN in the radix tree, we need to add the low
bits of the index to it in order to find the PFN we are looking for.
Since we want the result to stay in pfn_t form, create pfn_t_add().

We also need to compare PFNs, for example to determine if the PFN
represents a zero page.  At the moment, we only have use for comparing
equality, but a general compare operation is no more code and may prove
useful in the future.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 include/linux/pfn_t.h | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/include/linux/pfn_t.h b/include/linux/pfn_t.h
index 0ff4c4e..c3bb5fe 100644
--- a/include/linux/pfn_t.h
+++ b/include/linux/pfn_t.h
@@ -67,6 +67,22 @@ static inline unsigned long pfn_t_to_pfn(pfn_t pfn)
 	return v >> PFN_FLAG_BITS;
 }
 
+static inline __must_check pfn_t pfn_t_add(const pfn_t pfn, int val)
+{
+	pfn_t tmp = pfn;
+	if (tmp.val & PFN_HUGE)
+		tmp.val &= ~PFN_SIZE_MASK;
+	tmp.val &= ~PFN_SG_MASK;
+	tmp.val += val << PFN_FLAG_BITS;
+	return tmp;
+}
+
+/* Like memcmp, returns <0 if a<b, 0 if a=b and >0 if b>a */
+static inline int pfn_t_cmp(pfn_t a, pfn_t b)
+{
+	return pfn_t_to_pfn(b) - pfn_t_to_pfn(a);
+}
+
 extern pfn_t phys_to_pfn_t(phys_addr_t addr, u64 flags);
 
 static inline bool pfn_t_has_page(pfn_t pfn)
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/3] pfn_t: Change the encoding
  2016-03-11 21:13 ` [PATCH 1/3] pfn_t: Change the encoding Matthew Wilcox
@ 2016-03-11 21:40   ` Dan Williams
  2016-03-12 18:30     ` Matthew Wilcox
  0 siblings, 1 reply; 8+ messages in thread
From: Dan Williams @ 2016-03-11 21:40 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-nvdimm@lists.01.org, linux-kernel, Linux MM,
	Matthew Wilcox, Dave Hansen

On Fri, Mar 11, 2016 at 1:13 PM, Matthew Wilcox
<matthew.r.wilcox@intel.com> wrote:
> By moving the flag bits to the bottom, we encourage commonality
> between SGs with pages and those using pfn_t.  We can also then insert
> a pfn_t into a radix tree, as it uses the same two bits for indirect &
> exceptional indicators.

It's not immediately clear to me what we gain with SG entry
commonality.  The down side is that we lose the property that
pfn_to_pfn_t() is a nop.  This was Dave's suggestion so that the
nominal case did not change the binary layout of a typical pfn.

Can we just bit swizzle a pfn_t on insertion/retrieval from the radix?

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/3] pfn_t: Change the encoding
  2016-03-11 21:40   ` Dan Williams
@ 2016-03-12 18:30     ` Matthew Wilcox
  2016-03-13 23:09       ` Dan Williams
  0 siblings, 1 reply; 8+ messages in thread
From: Matthew Wilcox @ 2016-03-12 18:30 UTC (permalink / raw)
  To: Dan Williams
  Cc: Matthew Wilcox, linux-nvdimm@lists.01.org, linux-kernel,
	Linux MM, Dave Hansen

On Fri, Mar 11, 2016 at 01:40:20PM -0800, Dan Williams wrote:
> On Fri, Mar 11, 2016 at 1:13 PM, Matthew Wilcox
> <matthew.r.wilcox@intel.com> wrote:
> > By moving the flag bits to the bottom, we encourage commonality
> > between SGs with pages and those using pfn_t.  We can also then insert
> > a pfn_t into a radix tree, as it uses the same two bits for indirect &
> > exceptional indicators.
> 
> It's not immediately clear to me what we gain with SG entry
> commonality.  The down side is that we lose the property that
> pfn_to_pfn_t() is a nop.  This was Dave's suggestion so that the
> nominal case did not change the binary layout of a typical pfn.

I understand that motivation!

> Can we just bit swizzle a pfn_t on insertion/retrieval from the radix?

Of course we *can*, but we end up doing more swizzling that way than we
do this way.  In the Brave New Future where we're storing pfn_t in the
radix tree, on a page fault we find the pfn_t in the radix tree then
we want to insert it into the page tables.  So DAX would first have to
convert the radix tree entry to a pfn_t, then the page table code has to
convert the pfn_t into a pte/pmd/pud (which we currently do by converting
a pfn_t to a pfn, then converting the pfn to a pte/pmd/pud, but I assume
that either the compiler optimises that into a single conversion, or we'll
add pfn_t_pte to each architecture in future if it's actually a problem).

Much easier to look up a pfn_t in the radix tree and pass it directly
to vm_insert_mixed().

If there's any part of the kernel that is doing a *lot* of conversion
between pfn_t and pfn, that surely indicates a place in the kernel where
we need to convert an interface from pfn to pfn_t.

(It occurs to me we can make the code simpler on architectures that
don't support PUDs.  The PFN_HUGE bit is still available to distinguish
between PMDs and PTEs, but we won't need to clear the bottom bit of the
PFN if PFN_HUGE is set, since nobody can add a PUD pfn to the radix tree).

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/3] pfn_t: Change the encoding
  2016-03-12 18:30     ` Matthew Wilcox
@ 2016-03-13 23:09       ` Dan Williams
  2016-03-14 15:00         ` Matthew Wilcox
  0 siblings, 1 reply; 8+ messages in thread
From: Dan Williams @ 2016-03-13 23:09 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Matthew Wilcox, linux-nvdimm@lists.01.org, linux-kernel,
	Linux MM, Dave Hansen

On Sat, Mar 12, 2016 at 10:30 AM, Matthew Wilcox <willy@linux.intel.com> wrote:
> On Fri, Mar 11, 2016 at 01:40:20PM -0800, Dan Williams wrote:
>> On Fri, Mar 11, 2016 at 1:13 PM, Matthew Wilcox
>> <matthew.r.wilcox@intel.com> wrote:
>> > By moving the flag bits to the bottom, we encourage commonality
>> > between SGs with pages and those using pfn_t.  We can also then insert
>> > a pfn_t into a radix tree, as it uses the same two bits for indirect &
>> > exceptional indicators.
>>
>> It's not immediately clear to me what we gain with SG entry
>> commonality.  The down side is that we lose the property that
>> pfn_to_pfn_t() is a nop.  This was Dave's suggestion so that the
>> nominal case did not change the binary layout of a typical pfn.
>
> I understand that motivation!
>
>> Can we just bit swizzle a pfn_t on insertion/retrieval from the radix?
>
> Of course we *can*, but we end up doing more swizzling that way than we
> do this way.  In the Brave New Future where we're storing pfn_t in the
> radix tree, on a page fault we find the pfn_t in the radix tree then
> we want to insert it into the page tables.  So DAX would first have to
> convert the radix tree entry to a pfn_t, then the page table code has to
> convert the pfn_t into a pte/pmd/pud (which we currently do by converting
> a pfn_t to a pfn, then converting the pfn to a pte/pmd/pud, but I assume
> that either the compiler optimises that into a single conversion, or we'll
> add pfn_t_pte to each architecture in future if it's actually a problem).
>
> Much easier to look up a pfn_t in the radix tree and pass it directly
> to vm_insert_mixed().
>
> If there's any part of the kernel that is doing a *lot* of conversion
> between pfn_t and pfn, that surely indicates a place in the kernel where
> we need to convert an interface from pfn to pfn_t.

So this is dependent on where pfn_t gets pushed in the future.  For
example, if we revive using a pfn_t in a bio then I think the
pfn_to_pfn_t() conversions will be more prevalent than the fs/dax.c
radix usages.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/3] pfn_t: Change the encoding
  2016-03-13 23:09       ` Dan Williams
@ 2016-03-14 15:00         ` Matthew Wilcox
  0 siblings, 0 replies; 8+ messages in thread
From: Matthew Wilcox @ 2016-03-14 15:00 UTC (permalink / raw)
  To: Dan Williams
  Cc: Matthew Wilcox, linux-nvdimm@lists.01.org, linux-kernel,
	Linux MM, Dave Hansen

On Sun, Mar 13, 2016 at 04:09:38PM -0700, Dan Williams wrote:
> On Sat, Mar 12, 2016 at 10:30 AM, Matthew Wilcox <willy@linux.intel.com> wrote:
> > On Fri, Mar 11, 2016 at 01:40:20PM -0800, Dan Williams wrote:
> >> Can we just bit swizzle a pfn_t on insertion/retrieval from the radix?
> >
> > Of course we *can*, but we end up doing more swizzling that way than we
> > do this way.  In the Brave New Future where we're storing pfn_t in the
> > radix tree, on a page fault we find the pfn_t in the radix tree then
> > we want to insert it into the page tables.  So DAX would first have to
> > convert the radix tree entry to a pfn_t, then the page table code has to
> > convert the pfn_t into a pte/pmd/pud (which we currently do by converting
> > a pfn_t to a pfn, then converting the pfn to a pte/pmd/pud, but I assume
> > that either the compiler optimises that into a single conversion, or we'll
> > add pfn_t_pte to each architecture in future if it's actually a problem).
> >
> > Much easier to look up a pfn_t in the radix tree and pass it directly
> > to vm_insert_mixed().
> >
> > If there's any part of the kernel that is doing a *lot* of conversion
> > between pfn_t and pfn, that surely indicates a place in the kernel where
> > we need to convert an interface from pfn to pfn_t.
> 
> So this is dependent on where pfn_t gets pushed in the future.  For
> example, if we revive using a pfn_t in a bio then I think the
> pfn_to_pfn_t() conversions will be more prevalent than the fs/dax.c
> radix usages.

Yes, we'll be converting to a pfn_t in more places than we are now
...  but what do we do with that pfn_t once we've got it into a bio?
Except for some rare cases (brd, maybe pmem), it gets converted into an
sg list which then gets DMA mapped, then the DMA addresses are converted
into whatever format the hardware wants.  As long as we convert the sg
list before we convert the bio, there aren't going to be any additional
conversions from pfn_t to pfn.  So I don't see this showing up as an
additional per-I/O cost.

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2016-03-14 15:00 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-03-11 21:13 [PATCH 0/3] Make pfn_t suitable for placing in the radix tree Matthew Wilcox
2016-03-11 21:13 ` [PATCH 1/3] pfn_t: Change the encoding Matthew Wilcox
2016-03-11 21:40   ` Dan Williams
2016-03-12 18:30     ` Matthew Wilcox
2016-03-13 23:09       ` Dan Williams
2016-03-14 15:00         ` Matthew Wilcox
2016-03-11 21:13 ` [PATCH 2/3] pfn_t: Support for huge PFNs Matthew Wilcox
2016-03-11 21:13 ` [PATCH 3/3] pfn_t: New functions pfn_t_add and pfn_t_cmp Matthew Wilcox

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).