From: Jane Chu <jane.chu@oracle.com>
To: Joao Martins <joao.m.martins@oracle.com>,
	Dan Williams <dan.j.williams@intel.com>
Cc: Linux MM <linux-mm@kvack.org>, Ira Weiny <ira.weiny@intel.com>,
	linux-nvdimm <linux-nvdimm@lists.01.org>,
	Matthew Wilcox <willy@infradead.org>,
	Jason Gunthorpe <jgg@ziepe.ca>,
	Muchun Song <songmuchun@bytedance.com>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Jane Chu <jane.chu@oracle.com>
Subject: Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages
Date: Wed, 19 May 2021 11:36:02 -0700
Message-ID: <dc64c5ae-883c-e35f-165c-c34197bc4d89@oracle.com>
In-Reply-To: <fa8ad5f1-f923-4393-8771-b2e74abe0f0c@oracle.com>


On 5/19/2021 4:29 AM, Joao Martins wrote:
>
> On 5/18/21 8:56 PM, Jane Chu wrote:
>> On 5/18/2021 10:27 AM, Joao Martins wrote:
>>
>>> On 5/5/21 11:36 PM, Joao Martins wrote:
>>>> On 5/5/21 11:20 PM, Dan Williams wrote:
>>>>> On Wed, May 5, 2021 at 12:50 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>>>> On 5/5/21 7:44 PM, Dan Williams wrote:
>>>>>>> On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>>>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>>>>>>> index b46f63dcaed3..bb28d82dda5e 100644
>>>>>>>> --- a/include/linux/memremap.h
>>>>>>>> +++ b/include/linux/memremap.h
>>>>>>>> @@ -114,6 +114,7 @@ struct dev_pagemap {
>>>>>>>>           struct completion done;
>>>>>>>>           enum memory_type type;
>>>>>>>>           unsigned int flags;
>>>>>>>> +       unsigned long align;
>>>>>>> I think this wants some kernel-doc above to indicate that non-zero
>>>>>>> means "use compound pages with tail-page dedup" and zero / PAGE_SIZE
>>>>>>> means "use non-compound base pages".
>>> [...]
>>>
>>>>>>> The non-zero value must be
>>>>>>> PAGE_SIZE, PMD_PAGE_SIZE or PUD_PAGE_SIZE.
>>>>>>> Hmm, maybe it should be an
>>>>>>> enum:
>>>>>>>
>>>>>>> enum devmap_geometry {
>>>>>>>       DEVMAP_PTE,
>>>>>>>       DEVMAP_PMD,
>>>>>>>       DEVMAP_PUD,
>>>>>>> }
>>>>>>>
>>>>>> I suppose a converter between devmap_geometry and page_size would be needed too? And maybe
>>>>>> the whole dax/nvdimm align values change meanwhile (as a followup improvement)?
>>>>> I think it is ok for dax/nvdimm to continue to maintain their align
>>>>> value because it should be ok to have 4MB align if the device really
>>>>> wanted. However, when it goes to map that alignment with
>>>>> memremap_pages() it can pick a mode. For example, it's already the
>>>>> case that dax->align == 1GB is mapped with DEVMAP_PTE today, so
>>>>> they're already separate concepts that can stay separate.
>>>>>
>>>> Gotcha.
>>> I am reconsidering part of the above. In general, yes, the meaning of devmap @align
>>> represents a slightly different variation of the device @align i.e. how the metadata is
>>> laid out **but** regardless of what kind of page table entries we use for the vmemmap.
>>>
>>> By using DEVMAP_PTE/PMD/PUD we might end up 1) duplicating what nvdimm/dax already
>>> validates in terms of allowed device @align values (i.e. PAGE_SIZE, PMD_SIZE and PUD_SIZE)
>>> 2) the geometry of metadata is very much tied to the value we pick to @align at namespace
>>> provisioning -- not the "align" we might use at mmap() perhaps that's what you referred
>>> above? -- and 3) the value of geometry actually derives from dax device @align because we
>>> will need to create compound pages representing a page size of @align value.
>>>
>>> Using your example above: you're saying that dax->align == 1G is mapped with DEVMAP_PTEs,
>>> in reality the vmemmap is populated with PMDs/PUDs page tables (depending on what archs
>>> decide to do at vmemmap_populate()) and uses base pages as its metadata regardless of the
>>> device @align. In reality what we want to convey in @geometry is not page table sizes, but
>>> just the page size used for the vmemmap of the dax device. Additionally, limiting its
>>> value might not be desirable... if tomorrow Linux for some arch supports dax/nvdimm
>>> devices with 4M align or 64K align, the value of @geometry will have to reflect the 4M to
>>> create compound pages of order 10 for the said vmemmap.
>>>
>>> I am going to wait until you finish reviewing the remaining four patches of this series,
>>> but maybe this is a simple misnomer (s/align/geometry/) with a comment but without
>>> DEVMAP_{PTE,PMD,PUD} enum part? Or perhaps its own struct with a value and enum a
>>> setter/getter to audit its value? Thoughts?
>> Good points there.
>>
>> My understanding is that dax->align conveys the granularity of size while
>> carving out a namespace; it's a geometry attribute loosely akin to the sector size of a spindle
>> disk.  I tend to think that device pagesize has almost no relation to "align", in that it's
>> possible to have 1G "align" and 4K pagesize, or vice versa.  That is, with the advent of compound page
>> support, it is possible to totally separate the two concepts.
>>
>> How about adding a new option to "ndctl create-namespace" that describes the
>> device creator's desired pagesize, and another parameter to describe whether the pagesize shall
>> be fixed or allowed to be split up, such that, if the intention is to never split up 2M pagesize, then it
>> would be possible to save a lot of metadata space on the device?
> Maybe that can be selected by the driver too, but it's an interesting point you raise
> should we settle with the geometry (e.g. like a geometry sysfs entry IIUC your
> suggestion?). device-dax for example would use geometry == align and therefore save space
> (like what I propose in patch 10). But fsdax would retain the default that is geometry =
> PAGE_SIZE and align = PMD_SIZE should it want to split pages.

Let's see, I think this is what we have today:

       | align   hpagesize  geometry  hpage-splittable
=======================================================
devdax | 4K..1G  2M,1G      4K        artificially no
fsdax  | 4K..1G  2M         4K        yes

So a hard no-split means (hpagesize == geometry), and that does not apply
to fsdax for now. But is it not possible in the future? Some customers prefer
an optional guarantee that their DAX hpages are never split up, for the
sake of RDMA efficiency.
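
To make the table above concrete, here is a minimal userspace sketch of the
no-split condition and of the compound page order implied by a given geometry.
Everything in it is an assumption for illustration: struct dax_layout,
hard_no_split() and geometry_order() are hypothetical names, not kernel
structures or APIs, and a 4K base page is assumed.

```c
#include <assert.h>
#include <stdbool.h>

#define SZ_4K (1UL << 12)
#define SZ_2M (1UL << 21)
#define SZ_1G (1UL << 30)

/* Hypothetical summary of one row of the table; not a kernel structure. */
struct dax_layout {
	unsigned long align;     /* granularity used when carving out the namespace */
	unsigned long hpagesize; /* huge page size used for user mappings */
	unsigned long geometry;  /* page size the struct page metadata is laid out in */
};

/* A hard no-split guarantee holds only when hpagesize == geometry. */
static bool hard_no_split(const struct dax_layout *l)
{
	return l->hpagesize == l->geometry;
}

/* Compound page order for a geometry, relative to an assumed 4K base page. */
static unsigned int geometry_order(unsigned long geometry)
{
	unsigned int order = 0;

	/* Only power-of-two sizes of at least one base page make sense here. */
	assert(geometry >= SZ_4K && (geometry & (geometry - 1)) == 0);
	while ((SZ_4K << order) < geometry)
		order++;
	return order;
}
```

With these definitions, the two rows of the table (geometry == 4K) both fail
hard_no_split(), while a layout with hpagesize == geometry == 2M passes it.
geometry_order() gives 9 for 2M and 10 for the 4M case mentioned earlier in
the thread.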

>
> Interestingly, devmap poisoning always occurs at @align level regardless of @geometry.
Yeah, it's a simplification that's not ideal, because after all, 
error-blast-radius != UserMapping-pagesize.
>
> What I am not sure of is what value (vs added complexity) it brings to allow the geometry *value*
> to be selectable by the user given that so far we seem to only ever initialize metadata as
> either sets of base pages [*] or sets of compound pages (of a size). And the difference
> between both can possibly be summarized to split-ability like you say.
>
> [*] that optionally can be morphed into compound pages by the driver

Agreed.  For this series, it's simpler not to make the 
compound-page-size selectable.

thanks,

-jane



