linux-nvdimm.lists.01.org archive mirror
 help / color / mirror / Atom feed
From: Liu Bo <obuil.liubo@gmail.com>
To: Dan Williams <dan.j.williams@intel.com>
Cc: linux-nvdimm <linux-nvdimm@lists.01.org>
Subject: Re: [PATCH v11 4/7] mm, fs, dax: handle layout changes to pinned dax mappings
Date: Wed, 31 Jul 2019 16:02:47 -0700	[thread overview]
Message-ID: <CANQeFDBAhjdsdjjKz_Qv6bgOQzquy9h+sAuxUunn2-Hs64bs-g@mail.gmail.com> (raw)
In-Reply-To: <CAPcyv4hfVhYVFszxJO3vDeDAtz_THZROgv8orFXfddEOaGjUBA@mail.gmail.com>

On Wed, Jul 31, 2019 at 12:16 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Tue, Jul 30, 2019 at 10:07 PM Liu Bo <obuil.liubo@gmail.com> wrote:
> > On Tue, Jul 30, 2019 at 8:58 PM Dan Williams <dan.j.williams@intel.com> wrote:
> [..]
> > > > > +/**
> > > > > + * dax_layout_busy_page - find first pinned page in @mapping
> > > > > + * @mapping: address space to scan for a page with ref count > 1
> > > > > + *
> > > > > + * DAX requires ZONE_DEVICE mapped pages. These pages are never
> > > > > + * 'onlined' to the page allocator so they are considered idle when
> > > > > + * page->count == 1. A filesystem uses this interface to determine if
> > > > > + * any page in the mapping is busy, i.e. for DMA, or other
> > > > > + * get_user_pages() usages.
> > > > > + *
> > > > > + * It is expected that the filesystem is holding locks to block the
> > > > > + * establishment of new mappings in this address_space. I.e. it expects
> > > > > + * to be able to run unmap_mapping_range() and subsequently not race
> > > > > + * mapping_mapped() becoming true.
> > > > > + */
> > > > > +struct page *dax_layout_busy_page(struct address_space *mapping)
> > > > > +{
> > > > > +       pgoff_t indices[PAGEVEC_SIZE];
> > > > > +       struct page *page = NULL;
> > > > > +       struct pagevec pvec;
> > > > > +       pgoff_t index, end;
> > > > > +       unsigned i;
> > > > > +
> > > > > +       /*
> > > > > +        * In the 'limited' case get_user_pages() for dax is disabled.
> > > > > +        */
> > > > > +       if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
> > > > > +               return NULL;
> > > > > +
> > > > > +       if (!dax_mapping(mapping) || !mapping_mapped(mapping))
> > > > > +               return NULL;
> > > > > +
> > > > > +       pagevec_init(&pvec);
> > > > > +       index = 0;
> > > > > +       end = -1;
> > > > > +
> > > > > +       /*
> > > > > +        * If we race get_user_pages_fast() here either we'll see the
> > > > > +        * elevated page count in the pagevec_lookup and wait, or
> > > > > +        * get_user_pages_fast() will see that the page it took a reference
> > > > > +        * against is no longer mapped in the page tables and bail to the
> > > > > +        * get_user_pages() slow path.  The slow path is protected by
> > > > > +        * pte_lock() and pmd_lock(). New references are not taken without
> > > > > +        * holding those locks, and unmap_mapping_range() will not zero the
> > > > > +        * pte or pmd without holding the respective lock, so we are
> > > > > +        * guaranteed to either see new references or prevent new
> > > > > +        * references from being established.
> > > > > +        */
> > > > > +       unmap_mapping_range(mapping, 0, 0, 1);
> > > >
> > > > Why do we have to unmap the whole address space prior to check busy pages?
> > > > Can we have a variate of dax_layout_busy_page() to only unmap a sub
> > > > set  of the whole address space?
> > > >
> > >
> > > This is due to the location in xfs where layouts are broken vs where
> > > the file range is mapped to physical blocks for the truncate
> > > operation. I ultimately decided the reworks needed for that
> > > optimization were large and that the relative performance gain was
> > > small. Do you have performance numbers to the contrary? Feel free to
> > > copy the linux-nvdimm list on future mails, no need for this to be a
> > > private discussion.
> >
> > Thanks a lot for the prompt reply.
> >
> > For virtiofs[1]'s dax mode, it also suffers the same race problem
> > between dax-DMA(mmap+directIO) and fs truncate/punch_hole, besides, it
> > maintains a kind of resource named dax mapping range for IO
> > operations, which is similar to the block concept in filesystem and
> > sometimes we need to reclaim some dax mapping ranges in background.
> > So it might end up the same race problem when this reclaim process and
> > dax-dma(mmap+directIO) run concurrently, however, since reclaim is not
> > a user-triggered operations as truncate, it might be triggered
> > frequently on the fly by virtiofs itself, now if that happened, mmap
> > workloads would be impacted significantly by the reclaim because of
> > reclaim unmapping  the whole address space of inode.
> >
> > As every dax mapping range is 2M for now, a ideal solution is to have
> > layout_checking unmap only that specific 2M range so that other areas
> > in mmap ranges are good to go.
>
> There are larger problems with DAX-dma into a guest mapping. There is
> no mechanism to coordinate a host-fs truncate with the completion of
> guest-dma like what we do with the "layout break" implementation when
> fs and dma are coordinated in the same kernel. The only way,
> presently, to safely assign a dma-initiator device to a guest with DAX
> mapped memory is to use device-dax on the host side where truncate /
> hole punch just isn't supported. Maybe virtio-fs could invent some
> paravirtualized side channel for this coordination, but it does not
> exist today.

I might have mistaken you, this 'dax mapping range' concept is
maintained by virtiofs _inside_ guest, it is corresponding to the
qemu's mmio window that are used for virtiofs.

It seems fine to me if we just unmap a range of the inode's address
space and check busy dax entries against this range, but I'm not sure,
could you please suggest?

thanks,
liubo
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

  reply	other threads:[~2019-07-31 23:05 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-05-19  1:34 [PATCH v11 0/7] dax: fix dma vs truncate/hole-punch Dan Williams
2018-05-19  1:34 ` [PATCH v11 1/7] memremap: split devm_memremap_pages() and memremap() infrastructure Dan Williams
2018-05-22  6:24   ` Christoph Hellwig
2018-05-19  1:35 ` [PATCH v11 2/7] mm: introduce MEMORY_DEVICE_FS_DAX and CONFIG_DEV_PAGEMAP_OPS Dan Williams
2018-05-22  6:25   ` Christoph Hellwig
2018-05-19  1:35 ` [PATCH v11 3/7] mm: fix __gup_device_huge vs unmap Dan Williams
2018-06-11 21:58   ` Andrew Morton
2018-06-11 23:35     ` Dan Williams
2018-05-19  1:35 ` [PATCH v11 4/7] mm, fs, dax: handle layout changes to pinned dax mappings Dan Williams
2018-06-12 21:05   ` Ross Zwisler
2018-06-13 10:41     ` Jan Kara
     [not found]   ` <CANQeFDCHUMP5su8ckoekeOWjEVBb2kN4VfiHuq8xnz8o8hWXvw@mail.gmail.com>
     [not found]     ` <CAPcyv4h6Wgursr6rMV42EFzH-7DJscrBrCFPqhiJp6ocYS9qmw@mail.gmail.com>
2019-07-31  5:07       ` Liu Bo
2019-07-31 19:16         ` Dan Williams
2019-07-31 23:02           ` Liu Bo [this message]
2018-05-19  1:35 ` [PATCH v11 5/7] xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL Dan Williams
2018-05-19  1:35 ` [PATCH v11 6/7] xfs: prepare xfs_break_layouts() for another layout type Dan Williams
2018-05-19  1:35 ` [PATCH v11 7/7] xfs, dax: introduce xfs_break_dax_layouts() Dan Williams
2018-05-22  6:27   ` Christoph Hellwig

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CANQeFDBAhjdsdjjKz_Qv6bgOQzquy9h+sAuxUunn2-Hs64bs-g@mail.gmail.com \
    --to=obuil.liubo@gmail.com \
    --cc=dan.j.williams@intel.com \
    --cc=linux-nvdimm@lists.01.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).