From: "Zeng, Oak" <oak.zeng@intel.com>
To: "Hellstrom, Thomas" <thomas.hellstrom@intel.com>,
	"intel-xe@lists.freedesktop.org" <intel-xe@lists.freedesktop.org>
Cc: "Brost, Matthew" <matthew.brost@intel.com>,
	"Welty, Brian" <brian.welty@intel.com>,
	"airlied@gmail.com" <airlied@gmail.com>,
	"Ghimiray, Himal Prasad" <himal.prasad.ghimiray@intel.com>
Subject: RE: [PATCH 4/5] drm/xe: Helper to populate a userptr or hmmptr
Date: Mon, 18 Mar 2024 16:09:45 +0000
Message-ID: <SA1PR11MB6991675265662BBCB7C63138922D2@SA1PR11MB6991.namprd11.prod.outlook.com>
In-Reply-To: <fc90dc67db2accbf5c7ecae892f1cae1783a77a0.camel@intel.com>



> -----Original Message-----
> From: Hellstrom, Thomas <thomas.hellstrom@intel.com>
> Sent: Monday, March 18, 2024 11:40 AM
> To: intel-xe@lists.freedesktop.org; Zeng, Oak <oak.zeng@intel.com>
> Cc: Brost, Matthew <matthew.brost@intel.com>; Welty, Brian
> <brian.welty@intel.com>; airlied@gmail.com; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>
> Subject: Re: [PATCH 4/5] drm/xe: Helper to populate a userptr or hmmptr
> 
> Hi,
> 
> On Mon, 2024-03-18 at 14:49 +0000, Zeng, Oak wrote:
> >
> >
> > > -----Original Message-----
> > > From: Hellstrom, Thomas <thomas.hellstrom@intel.com>
> > > Sent: Monday, March 18, 2024 9:13 AM
> > > To: intel-xe@lists.freedesktop.org; Zeng, Oak <oak.zeng@intel.com>
> > > Cc: Brost, Matthew <matthew.brost@intel.com>; Welty, Brian
> > > <brian.welty@intel.com>; airlied@gmail.com; Ghimiray, Himal Prasad
> > > <himal.prasad.ghimiray@intel.com>
> > > Subject: Re: [PATCH 4/5] drm/xe: Helper to populate a userptr or
> > > hmmptr
> > >
> > > Hi, Oak,
> > >
> > > Found another thing, see below:
> > >
> > > On Wed, 2024-03-13 at 23:35 -0400, Oak Zeng wrote:
> > > > Add a helper function xe_hmm_populate_range to populate
> > > > a userptr or hmmptr range. This function calls hmm_range_fault
> > > > to read CPU page tables and populate all pfns/pages of this
> > > > virtual address range.
> > > >
> > > > If the populated page is a system memory page, dma-mapping is
> > > > performed to get a dma-address which can later be used by the GPU
> > > > to access the pages.
> > > >
> > > > If the populated page is a device private page, we calculate the
> > > > dpa (device physical address) of the page.
> > > >
> > > > The dma-address or dpa is then saved in userptr's sg table. This
> > > > is preparation work to replace the get_user_pages_fast code in
> > > > the userptr code path. The helper function will also be used to
> > > > populate hmmptr later.
> > > >
> > > > Signed-off-by: Oak Zeng <oak.zeng@intel.com>
> > > > Co-developed-by: Niranjana Vishwanathapura
> > > > <niranjana.vishwanathapura@intel.com>
> > > > Signed-off-by: Niranjana Vishwanathapura
> > > > <niranjana.vishwanathapura@intel.com>
> > > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > > Cc: Thomas Hellström <thomas.hellstrom@intel.com>
> > > > Cc: Brian Welty <brian.welty@intel.com>
> > > > ---
> > > >  drivers/gpu/drm/xe/Makefile |   3 +-
> > > >  drivers/gpu/drm/xe/xe_hmm.c | 213 ++++++++++++++++++++++++++++++++++++
> > > >  drivers/gpu/drm/xe/xe_hmm.h |  12 ++
> > > >  3 files changed, 227 insertions(+), 1 deletion(-)
> > > >  create mode 100644 drivers/gpu/drm/xe/xe_hmm.c
> > > >  create mode 100644 drivers/gpu/drm/xe/xe_hmm.h
> > > >
> > > > diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> > > > index 840467080e59..29dcbc938b01 100644
> > > > --- a/drivers/gpu/drm/xe/Makefile
> > > > +++ b/drivers/gpu/drm/xe/Makefile
> > > > @@ -143,7 +143,8 @@ xe-y += xe_bb.o \
> > > >  	xe_wait_user_fence.o \
> > > >  	xe_wa.o \
> > > >  	xe_wopcm.o \
> > > > -	xe_svm_devmem.o
> > > > +	xe_svm_devmem.o \
> > > > +	xe_hmm.o
> > > >
> > > >  # graphics hardware monitoring (HWMON) support
> > > >  xe-$(CONFIG_HWMON) += xe_hwmon.o
> > > > diff --git a/drivers/gpu/drm/xe/xe_hmm.c b/drivers/gpu/drm/xe/xe_hmm.c
> > > > new file mode 100644
> > > > index 000000000000..c45c2447d386
> > > > --- /dev/null
> > > > +++ b/drivers/gpu/drm/xe/xe_hmm.c
> > > > @@ -0,0 +1,213 @@
> > > > +// SPDX-License-Identifier: MIT
> > > > +/*
> > > > + * Copyright © 2024 Intel Corporation
> > > > + */
> > > > +
> > > > +#include <linux/mmu_notifier.h>
> > > > +#include <linux/dma-mapping.h>
> > > > +#include <linux/memremap.h>
> > > > +#include <linux/swap.h>
> > > > +#include <linux/mm.h>
> > > > +#include "xe_hmm.h"
> > > > +#include "xe_vm.h"
> > > > +
> > > > +/**
> > > > + * mark_range_accessed() - mark a range as accessed, so the core
> > > > + * mm has such information for memory eviction or writeback to
> > > > + * disk
> > > > + *
> > > > + * @range: the range to mark
> > > > + * @write: if we write to this range, mark pages in this range
> > > > + * as dirty
> > > > + */
> > > > +static void mark_range_accessed(struct hmm_range *range, bool write)
> > > > +{
> > > > +	struct page *page;
> > > > +	u64 i, npages;
> > > > +
> > > > +	npages = ((range->end - 1) >> PAGE_SHIFT) - (range->start >> PAGE_SHIFT) + 1;
> > > > +	for (i = 0; i < npages; i++) {
> > > > +		page = hmm_pfn_to_page(range->hmm_pfns[i]);
> > > > +		if (write) {
> > > > +			lock_page(page);
> > > > +			set_page_dirty(page);
> > > > +			unlock_page(page);
> > > > +		}
> > > > +		mark_page_accessed(page);
> > > > +	}
> > > > +}
> > > > +
> > > > +/**
> > > > + * build_sg() - build a scatter gather table for all the physical
> > > > + * pages/pfn in a hmm_range. The dma-address is saved in the sg
> > > > + * table and will be used to program the GPU page table later.
> > > > + *
> > > > + * @xe: the xe device who will access the dma-address in the sg table
> > > > + * @range: the hmm range that we build the sg table from.
> > > > + * range->hmm_pfns[] has the pfn numbers of pages that back up this
> > > > + * hmm address range.
> > > > + * @st: pointer to the sg table.
> > > > + * @write: whether we write to this range. This decides the dma map
> > > > + * direction for system pages. If write, we map it bidirectional;
> > > > + * otherwise DMA_TO_DEVICE
> > > > + *
> > > > + * All the contiguous pfns will be collapsed into one entry in
> > > > + * the scatter gather table. This is for the convenience of later
> > > > + * operations that bind the address range to the GPU page table.
> > > > + *
> > > > + * The dma_address in the sg table will later be used by the GPU to
> > > > + * access memory. So if the memory is system memory, we need to
> > > > + * do a dma-mapping so it can be accessed by the GPU/DMA. If the
> > > > + * memory is GPU local memory (of the GPU who is going to access
> > > > + * the memory), we need the gpu dpa (device physical address), and
> > > > + * there is no need of dma-mapping.
> > > > + *
> > > > + * FIXME: dma-mapping for a peer gpu device to access a remote gpu's
> > > > + * memory. Add this when p2p is supported.
> > > > + *
> > > > + * This function allocates the storage of the sg table. It is the
> > > > + * caller's responsibility to free it by calling sg_free_table.
> > > > + *
> > > > + * Returns 0 if successful; -ENOMEM if it fails to allocate memory
> > > > + */
> > > > +static int build_sg(struct xe_device *xe, struct hmm_range *range,
> > > > +			     struct sg_table *st, bool write)
> > > > +{
> > > > +	struct device *dev = xe->drm.dev;
> > > > +	struct scatterlist *sg;
> > > > +	u64 i, npages;
> > > > +
> > > > +	sg = NULL;
> > > > +	st->nents = 0;
> > > > +	npages = ((range->end - 1) >> PAGE_SHIFT) - (range->start >> PAGE_SHIFT) + 1;
> > > > +
> > > > +	if (unlikely(sg_alloc_table(st, npages, GFP_KERNEL)))
> > > > +		return -ENOMEM;
> > > > +
> > > > +	for (i = 0; i < npages; i++) {
> > > > +		struct page *page;
> > > > +		unsigned long addr;
> > > > +		struct xe_mem_region *mr;
> > > > +
> > > > +		page = hmm_pfn_to_page(range->hmm_pfns[i]);
> > > > +		if (is_device_private_page(page)) {
> > > > +			mr = page_to_mem_region(page);
> > > > +			addr = vram_pfn_to_dpa(mr, range->hmm_pfns[i]);
> > > > +		} else {
> > > > +			addr = dma_map_page(dev, page, 0, PAGE_SIZE,
> > > > +					write ? DMA_BIDIRECTIONAL : DMA_TO_DEVICE);
> > > > +		}
> > > > +
> > > > +		if (sg && (addr == (sg_dma_address(sg) + sg->length))) {
> > > > +			sg->length += PAGE_SIZE;
> > > > +			sg_dma_len(sg) += PAGE_SIZE;
> > > > +			continue;
> > > > +		}
> > > > +
> > > > +		sg = sg ? sg_next(sg) : st->sgl;
> > > > +		sg_dma_address(sg) = addr;
> > > > +		sg_dma_len(sg) = PAGE_SIZE;
> > > > +		sg->length = PAGE_SIZE;
> > > > +		st->nents++;
> > > > +	}
> > > > +
> > > > +	sg_mark_end(sg);
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +/**
> > > > + * xe_hmm_populate_range() - Populate physical pages of a virtual
> > > > + * address range
> > > > + *
> > > > + * @vma: vma has information of the range to populate. Only vma
> > > > + * of userptr and hmmptr type can be populated.
> > > > + * @hmm_range: pointer to hmm_range struct. hmm_range->hmm_pfns
> > > > + * will hold the populated pfns.
> > > > + * @write: populate pages with write permission
> > > > + *
> > > > + * This function populates the physical pages of a virtual
> > > > + * address range. The populated physical pages are saved in
> > > > + * userptr's sg table. It is similar to get_user_pages but calls
> > > > + * hmm_range_fault.
> > > > + *
> > > > + * This function also reads the mmu notifier sequence number
> > > > + * (mmu_interval_read_begin), for the purpose of later
> > > > + * comparison (through mmu_interval_read_retry).
> > > > + *
> > > > + * This must be called with mmap read or write lock held.
> > > > + *
> > > > + * This function allocates the storage of the userptr sg table.
> > > > + * It is the caller's responsibility to free it by calling
> > > > + * sg_free_table.
> > > > + *
> > > > + * Returns: 0 for success; negative errno on failure
> > > > + */
> > > > +int xe_hmm_populate_range(struct xe_vma *vma, struct hmm_range *hmm_range,
> > > > +						bool write)
> > > > +{
> > > > +	unsigned long timeout =
> > > > +		jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> > > > +	unsigned long *pfns, flags = HMM_PFN_REQ_FAULT;
> > > > +	struct xe_userptr_vma *userptr_vma;
> > > > +	struct xe_userptr *userptr;
> > > > +	u64 start = vma->gpuva.va.addr;
> > > > +	u64 end = start + vma->gpuva.va.range;
> > > > +	struct xe_vm *vm = xe_vma_vm(vma);
> > > > +	u64 npages;
> > > > +	int ret;
> > > > +
> > > > +	userptr_vma = to_userptr_vma(vma);
> > > > +	userptr = &userptr_vma->userptr;
> > > > +	mmap_assert_locked(userptr->notifier.mm);
> > > > +
> > > > +	npages = ((end - 1) >> PAGE_SHIFT) - (start >> PAGE_SHIFT) + 1;
> > > > +	pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
> > > > +	if (unlikely(!pfns))
> > > > +		return -ENOMEM;
> > > > +
> > > > +	if (write)
> > > > +		flags |= HMM_PFN_REQ_WRITE;
> > > > +
> > > > +	memset64((u64 *)pfns, (u64)flags, npages);
> > > > +	hmm_range->hmm_pfns = pfns;
> > > > +	hmm_range->notifier_seq = mmu_interval_read_begin(&userptr->notifier);
> > > > +	hmm_range->notifier = &userptr->notifier;
> > > > +	hmm_range->start = start;
> > > > +	hmm_range->end = end;
> > > > +	hmm_range->pfn_flags_mask = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE;
> > > > +	/**
> > > > +	 * FIXME:
> > > > +	 * Setting dev_private_owner can prevent hmm_range_fault
> > > > +	 * from faulting in the device private pages owned by the
> > > > +	 * caller. See function hmm_vma_handle_pte. In the multi-GPU
> > > > +	 * case, this should be set to the device owner of the best
> > > > +	 * migration destination. E.g., if device0/vm0 has a page
> > > > +	 * fault, but we have determined the best placement of the
> > > > +	 * fault address should be on device1, we should set below
> > > > +	 * to device1 instead of device0.
> > > > +	 */
> > > > +	hmm_range->dev_private_owner = vm->xe->drm.dev;
> > > > +
> > > > +	while (true) {
> > > > +		ret = hmm_range_fault(hmm_range);
> > > > +		if (time_after(jiffies, timeout))
> > > > +			break;
> > > > +
> > > > +		if (ret == -EBUSY)
> > > > +			continue;
> > >
> > > If (ret == -EBUSY), it looks from the hmm_range_fault() implementation
> > > like hmm_range->notifier_seq has become invalid, and without calling
> > > mmu_interval_read_begin() again we will end up in an infinite loop?
> > >
> >
> > I noticed this before and had a read_begin in the while loop.
> > But on second thought, function xe_hmm_populate_range is called
> > inside the mmap lock, so after the read_begin is called above, there
> > can't be an invalidation before mmap unlock. So theoretically EBUSY
> > can't happen?
> >
> > Oak
> 
> Invalidations can happen due to many different things that don't need
> the mmap lock. File truncation and I think page reclaim are typical
> examples.

I see. I will move the read_begin into the loop then. Thanks!
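
Something like this (untested sketch, reusing the names from this
patch; it follows the retry pattern documented in
Documentation/mm/hmm.rst):

	while (true) {
		/*
		 * Re-read the notifier sequence on every retry, so a
		 * concurrent invalidation (truncation, reclaim, ...)
		 * can't leave us spinning on a stale notifier_seq.
		 */
		hmm_range->notifier_seq =
			mmu_interval_read_begin(&userptr->notifier);
		ret = hmm_range_fault(hmm_range);
		if (ret == -EBUSY) {
			if (time_after(jiffies, timeout))
				break;
			continue;
		}
		break;
	}

The later mmu_interval_read_retry() check against the saved
userptr->notifier_seq then decides whether the pfns collected here are
still valid at bind time.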

Oak

> 
> /Thomas
> 
> 
> >
> > > /Thomas
> > >
> > >
> > >
> > > > +		break;
> > > > +	}
> > > > +
> > > > +	if (ret)
> > > > +		goto free_pfns;
> > > > +
> > > > +	ret = build_sg(vm->xe, hmm_range, &userptr->sgt, write);
> > > > +	if (ret)
> > > > +		goto free_pfns;
> > > > +
> > > > +	mark_range_accessed(hmm_range, write);
> > > > +	userptr->sg = &userptr->sgt;
> > > > +	userptr->notifier_seq = hmm_range->notifier_seq;
> > > > +
> > > > +free_pfns:
> > > > +	kvfree(pfns);
> > > > +	return ret;
> > > > +}
> > > > +
> > > > diff --git a/drivers/gpu/drm/xe/xe_hmm.h b/drivers/gpu/drm/xe/xe_hmm.h
> > > > new file mode 100644
> > > > index 000000000000..960f3f6d36ae
> > > > --- /dev/null
> > > > +++ b/drivers/gpu/drm/xe/xe_hmm.h
> > > > @@ -0,0 +1,12 @@
> > > > +// SPDX-License-Identifier: MIT
> > > > +/*
> > > > + * Copyright © 2024 Intel Corporation
> > > > + */
> > > > +
> > > > +#include <linux/types.h>
> > > > +#include <linux/hmm.h>
> > > > +#include "xe_vm_types.h"
> > > > +#include "xe_svm.h"
> > > > +
> > > > +int xe_hmm_populate_range(struct xe_vma *vma, struct hmm_range *hmm_range,
> > > > +						bool write);
> >


