From: Dan Williams <dan.j.williams@intel.com>
To: Jason Gunthorpe <jgg@nvidia.com>,
	Dan Williams <dan.j.williams@intel.com>
Cc: nvdimm@lists.linux.dev, "Pan, Xinhui" <Xinhui.Pan@amd.com>,
	"Jan Kara" <jack@suse.cz>,
	dri-devel@lists.freedesktop.org,
	"Karol Herbst" <kherbst@redhat.com>,
	"David Airlie" <airlied@linux.ie>,
	"Darrick J. Wong" <djwong@kernel.org>,
	"Felix Kuehling" <Felix.Kuehling@amd.com>,
	"Alistair Popple" <apopple@nvidia.com>,
	"Matthew Wilcox" <willy@infradead.org>,
	"Christian König" <christian.koenig@amd.com>,
	linux-mm@kvack.org, "Jérôme Glisse" <jglisse@redhat.com>,
	"Ben Skeggs" <bskeggs@redhat.com>,
	"John Hubbard" <jhubbard@nvidia.com>,
	"Alex Deucher" <alexander.deucher@amd.com>,
	akpm@linux-foundation.org, "Christoph Hellwig" <hch@lst.de>
Subject: Re: [PATCH] mm/memremap: Introduce pgmap_request_folio() using pgmap offsets
Date: Wed, 30 Nov 2022 16:22:50 -0800	[thread overview]
Message-ID: <6387f3dabd16e_c95729461@dwillia2-mobl3.amr.corp.intel.com.notmuch> (raw)
In-Reply-To: <Y1wgdp/uaoF70bmk@nvidia.com>

Jason Gunthorpe wrote:
> On Thu, Oct 20, 2022 at 02:56:39PM -0700, Dan Williams wrote:
> > A 'struct dev_pagemap' (pgmap) represents a collection of ZONE_DEVICE
> > pages. The pgmap is a reference counted object that serves a similar
> > role as a 'struct request_queue'. Live references are obtained for each
> > in flight request / page, and once a page's reference count drops to
> > zero the associated pin of the pgmap is dropped as well. While a page is
> > idle nothing should be accessing it because that is effectively a
> > use-after-free situation. Unfortunately, all current ZONE_DEVICE
> > implementations deploy a layering violation to manage requests to
> > activate pages owned by a pgmap. Specifically, they take steps like walking
> > the pfns that were previously assigned at memremap_pages() time and using
> > pfn_to_page() to recall metadata like page->pgmap, or making use of other
> > data like page->zone_device_data.
> > 
> > The first step towards correcting that situation is to provide an
> > API to get access to a pgmap page that does not require the caller to
> > know the pfn, nor access any fields of an idle page. Ideally this API
> > would be able to support dynamic page creation instead of the current
> > status quo of pre-allocating and initializing pages.
> > 
> > On a prompt from Jason, introduce pgmap_request_folio() that operates on
> > an offset into a pgmap. It replaces the short-lived
> > pgmap_request_folios() that was continuing the layering violation of
> > assuming pages are available to be consulted before asking the pgmap to
> > make them available.
> > 
> > For now this only converts the callers to look up the pgmap and generate
> > the pgmap offset, but it does not do the deeper cleanup of teaching
> > those call sites to generate those arguments without walking the page
> > metadata. As next steps, it appears the DEVICE_PRIVATE implementations
> > could plumb the pgmap into the necessary call sites and switch to using
> > gen_pool_alloc() to track which offsets of a pgmap are allocated. For
> > DAX, dax_direct_access() could switch from returning pfns to returning
> > the associated @pgmap and @pgmap_offset. Those changes are saved for
> > follow-on work.
> 
> I like it, though it would be nice to see drivers converted away from
> pfn_to_pgmap_offset()..

Since there is no urgent need for this series to move forward in v6.2, I
can take the time to kill the need for pfn_to_pgmap_offset() and circle
back to this in v6.3.

The urgency in my mind is not there because:

1/ Physical memory-device hotplug is not common; it does not arrive
   until CXL 2.0 hosts are shipping in volume. Note that is distinct from
   ACPI hotplug, which is platform-firmware coordinated.

2/ Beyond the initial growing pains with folios and DAX pages, there are
   no additional collisions on the v6.2 horizon.

3/ I have not seen any MEMORY_DEVICE_PRIVATE users attempt the
   pfn_to_pgmap_offset() conversion, so no patches are dependent on this
   moving forward in v6.2.

If someone sees a reason for urgency that I missed, let me know.
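
For anyone following along, the calling convention this patch is driving
towards looks roughly like the sketch below. The error value and the
source of @pgmap_offset are illustrative only; gen_pool_alloc(), as
mentioned in the commit message, is one option for the latter:

    struct folio *folio;

    /* @pgmap_offset comes from the driver's own allocator, not a pfn walk */
    folio = pgmap_request_folio(pgmap, pgmap_offset, order);
    if (!folio)
        return -EBUSY;

    /*
     * Hand the folio out; when the last reference is dropped the pgmap's
     * ->page_free() op runs and the driver can recycle @pgmap_offset.
     */
    folio_put(folio);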

> >  /**
> > - * pgmap_request_folios - activate an contiguous span of folios in @pgmap
> > - * @pgmap: host page map for the folio array
> > - * @folio: start of the folio list, all subsequent folios have same folio_size()
> > + * pgmap_request_folio - activate a folio of a given order in @pgmap
> > + * @pgmap: host page map of the folio to activate
> > + * @pgmap_offset: page-offset into the pgmap to request
> > + * @order: expected folio_order() of the folio
> >   *
> >   * Caller is responsible for @pgmap remaining live for the duration of
> > - * this call. Caller is also responsible for not racing requests for the
> > - * same folios.
> > + * this call. The order (size) of the folios in the pgmap are assumed
> > + * stable before this call.
> >   */
> 
> I would probably add some discussion here that this enables
> refcounting on the folio and that the pgmap_ops ->page_free() will be called
> once the folio is no longer being used.
> 
> And explain that the pgmap user is responsible for tracking which
> pgmap_offsets are requested and which have been returned by free. It
> would be nice to say that this can only be called on free'd folios.

Ok.
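
Something like this for the expanded kernel-doc (wording not final, just
folding in your points):

    /**
     * pgmap_request_folio - activate a folio of a given order in @pgmap
     * @pgmap: host page map of the folio to activate
     * @pgmap_offset: page-offset into the pgmap to request
     * @order: expected folio_order() of the folio
     *
     * The returned folio is refcounted; when the last reference is
     * dropped the pgmap's ->page_free() op is invoked and the folio goes
     * idle again. The pgmap user is responsible for tracking which
     * pgmap_offsets are outstanding, and may only request folios that are
     * currently free (never handed out, or already returned via
     * ->page_free()). Caller is responsible for @pgmap remaining live for
     * the duration of this call, and folio order within the pgmap is
     * assumed stable.
     */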

> 
> > -bool pgmap_request_folios(struct dev_pagemap *pgmap, struct folio *folio,
> > -			  int nr_folios)
> > +struct folio *pgmap_request_folio(struct dev_pagemap *pgmap,
> > +				  pgoff_t pgmap_offset, int order)
> 
> unsigned int order?

Sure.

> 
> >  {
> > -	struct folio *iter;
> > -	int i;
> > +	unsigned long pfn = pgmap_offset_to_pfn(pgmap, pgmap_offset);
> > +	struct page *page = pfn_to_page(pfn);
> > +	struct folio *folio;
> > +	int v;
> >  
> > -	/*
> > -	 * All of the WARNs below are for catching bugs in future
> > -	 * development that changes the assumptions of:
> > -	 * 1/ uniform folios in @pgmap
> > -	 * 2/ @pgmap death does not race this routine.
> > -	 */
> > -	VM_WARN_ON_ONCE(!folio_span_valid(pgmap, folio, nr_folios));
> > +	if (WARN_ON_ONCE(page->pgmap != pgmap))
> > +		return NULL;
> 
> Checking that pgmap_offset is not bigger than the pgmap length is also a
> good assertion. At that point, if the pgmap is not right, then the struct
> page has been corrupted.

Ok.
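
i.e. something like the below before touching the page at all, where
pgmap_nr_pages() is just a stand-in for however the total span of the
pgmap's ranges ends up being computed:

    if (WARN_ON_ONCE(pgmap_offset >= pgmap_nr_pages(pgmap)))
        return NULL;

    pfn = pgmap_offset_to_pfn(pgmap, pgmap_offset);
    page = pfn_to_page(pfn);
    if (WARN_ON_ONCE(page->pgmap != pgmap))
        return NULL;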

> >  	if (WARN_ON_ONCE(percpu_ref_is_dying(&pgmap->ref)))
> > -		return false;
> > +		return NULL;
> >  
> > -	for (iter = folio_next(folio), i = 1; i < nr_folios;
> > -	     iter = folio_next(folio), i++)
> > -		if (WARN_ON_ONCE(folio_order(iter) != folio_order(folio)))
> > -			return false;
> > +	folio = page_folio(page);
> > +	if (WARN_ON_ONCE(folio_order(folio) != order))
> > +		return NULL;
> 
> Do you see a blocker to simply restructuring the pages into head/tail
> here? If the refcounts are all zero it should be safe?

I do not think all callers are good about avoiding a new request if they
are already holding a reference.

> 
> > +	v = folio_ref_inc_return(folio);
> > +	if (v > 1)
> > +		return folio;
> 
> IMHO, ideally, this should require the folio to have a 0 refcount and
> this should set it to 1.

That would certainly be a nice property; let me take some time to see
how that could be made a requirement.
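
A rough sketch of what that could look like, assuming the audit confirms
that idle ZONE_DEVICE folios reliably sit at refcount zero:

    /* only idle folios may be requested */
    if (WARN_ON_ONCE(folio_ref_count(folio) != 0))
        return NULL;
    if (!percpu_ref_tryget(&pgmap->ref))
        return NULL;
    /* take the folio from 0 -> 1 references on behalf of the caller */
    set_page_count(&folio->page, 1);
    return folio;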

> > +	if (WARN_ON_ONCE(!percpu_ref_tryget(&pgmap->ref))) {
> 
> This should not be a WARN_ON; there can be races where the dying
> check misses but the refcounts all reached zero anyhow.

Ok, makes sense.
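
So the check stays but unwinds quietly, something like the below (the
exact unwind depends on context the diff does not show; folio_ref_dec()
here is illustrative):

    if (!percpu_ref_tryget(&pgmap->ref)) {
        /* raced with pgmap teardown, not a bug */
        folio_ref_dec(folio);
        return NULL;
    }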

Thread overview:
2022-10-20 21:56 [PATCH] mm/memremap: Introduce pgmap_request_folio() using pgmap offsets Dan Williams
2022-10-20 22:32 ` Felix Kuehling
2022-10-20 23:17   ` Dan Williams
2022-10-21 18:31     ` Felix Kuehling
2022-10-24  1:44       ` Alistair Popple
2022-10-21 20:36 ` Lyude Paul
2022-10-28 18:33 ` Jason Gunthorpe
2022-12-01  0:22   ` Dan Williams [this message]
2022-12-01 23:12     ` Andrew Morton