kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v5 00/15] follow_pfn and other iomap races
@ 2020-10-30 10:08 Daniel Vetter
  2020-10-30 10:08 ` [PATCH v5 01/15] drm/exynos: Stop using frame_vector helpers Daniel Vetter
                   ` (14 more replies)
  0 siblings, 15 replies; 62+ messages in thread
From: Daniel Vetter @ 2020-10-30 10:08 UTC (permalink / raw)
  To: DRI Development, LKML
  Cc: kvm, linux-mm, linux-arm-kernel, linux-samsung-soc, linux-media,
	Daniel Vetter

Hi all

Another update of my patch series to clamp down a bunch of races and gaps
around follow_pfn and other access to iomem mmaps. Previous version:

v1: https://lore.kernel.org/dri-devel/20201007164426.1812530-1-daniel.vetter@ffwll.ch/
v2: https://lore.kernel.org/dri-devel/20201009075934.3509076-1-daniel.vetter@ffwll.ch
v3: https://lore.kernel.org/dri-devel/20201021085655.1192025-1-daniel.vetter@ffwll.ch/
v4: https://lore.kernel.org/dri-devel/20201026105818.2585306-1-daniel.vetter@ffwll.ch/

And the discussion that sparked this journey:

https://lore.kernel.org/dri-devel/20201007164426.1812530-1-daniel.vetter@ffwll.ch/

Changes in v5:
- Tomasz found some issues in the media patches
- Polish suggested by Christoph for the unsafe_follow_pfn patch

Changes in v4:
- Drop the s390 patch, that was very stand-alone and now queued up to land
  through s390 trees.
- Comment polish per Dan's review.

Changes in v3:
- Bunch of polish all over, no functional changes aside from one barrier
  in the resource code, for consistency.
- A few more r-b tags.

Changes in v2:
- tons of small polish&fixes all over, thanks to all the reviewers who
  spotted issues
- I managed to test at least the generic_access_phys and pci mmap revoke
  stuff with a few gdb sessions using our i915 debug tools (hence now also
  the drm/i915 patch to properly request all the pci bar regions)
- reworked approach for the pci mmap revoke: Infrastructure moved into
  kernel/resource.c, address_space mapping is now set up at open time for
  everyone (which required some sysfs changes). Does indeed look a lot
  cleaner and a lot less invasive than I feared at first.

I feel like this is ready for some wider soaking. Since the remaining bits
are all kinda connnected probably simplest if it all goes through -mm.

Cheers, Daniel

Daniel Vetter (15):
  drm/exynos: Stop using frame_vector helpers
  drm/exynos: Use FOLL_LONGTERM for g2d cmdlists
  misc/habana: Stop using frame_vector helpers
  misc/habana: Use FOLL_LONGTERM for userptr
  mm/frame-vector: Use FOLL_LONGTERM
  media: videobuf2: Move frame_vector into media subsystem
  mm: Close race in generic_access_phys
  mm: Add unsafe_follow_pfn
  media/videbuf1|2: Mark follow_pfn usage as unsafe
  vfio/type1: Mark follow_pfn as unsafe
  PCI: Obey iomem restrictions for procfs mmap
  /dev/mem: Only set filp->f_mapping
  resource: Move devmem revoke code to resource framework
  sysfs: Support zapping of binary attr mmaps
  PCI: Revoke mappings like devmem

 drivers/char/mem.c                            |  86 +--------------
 drivers/gpu/drm/exynos/Kconfig                |   1 -
 drivers/gpu/drm/exynos/exynos_drm_g2d.c       |  48 ++++-----
 drivers/media/common/videobuf2/Kconfig        |   1 -
 drivers/media/common/videobuf2/Makefile       |   1 +
 .../media/common/videobuf2}/frame_vector.c    |  57 ++++------
 .../media/common/videobuf2/videobuf2-memops.c |   3 +-
 drivers/media/platform/omap/Kconfig           |   1 -
 drivers/media/v4l2-core/videobuf-dma-contig.c |   2 +-
 drivers/misc/habanalabs/Kconfig               |   1 -
 drivers/misc/habanalabs/common/habanalabs.h   |   6 +-
 drivers/misc/habanalabs/common/memory.c       |  50 ++++-----
 drivers/pci/pci-sysfs.c                       |   4 +
 drivers/pci/proc.c                            |   6 ++
 drivers/vfio/vfio_iommu_type1.c               |   4 +-
 fs/sysfs/file.c                               |  11 ++
 include/linux/ioport.h                        |   6 +-
 include/linux/mm.h                            |  47 +-------
 include/linux/sysfs.h                         |   2 +
 include/media/frame_vector.h                  |  47 ++++++++
 include/media/videobuf2-core.h                |   1 +
 kernel/resource.c                             | 101 +++++++++++++++++-
 mm/Kconfig                                    |   3 -
 mm/Makefile                                   |   1 -
 mm/memory.c                                   |  80 +++++++++++++-
 mm/nommu.c                                    |  27 ++++-
 security/Kconfig                              |  13 +++
 27 files changed, 360 insertions(+), 250 deletions(-)
 rename {mm => drivers/media/common/videobuf2}/frame_vector.c (84%)
 create mode 100644 include/media/frame_vector.h

-- 
2.28.0


^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH v5 01/15] drm/exynos: Stop using frame_vector helpers
  2020-10-30 10:08 [PATCH v5 00/15] follow_pfn and other iomap races Daniel Vetter
@ 2020-10-30 10:08 ` Daniel Vetter
  2020-10-30 10:08 ` [PATCH v5 02/15] drm/exynos: Use FOLL_LONGTERM for g2d cmdlists Daniel Vetter
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 62+ messages in thread
From: Daniel Vetter @ 2020-10-30 10:08 UTC (permalink / raw)
  To: DRI Development, LKML
  Cc: kvm, linux-mm, linux-arm-kernel, linux-samsung-soc, linux-media,
	Daniel Vetter, John Hubbard, Daniel Vetter, Jason Gunthorpe,
	Inki Dae, Joonyoung Shim, Seung-Woo Kim, Kyungmin Park,
	Kukjin Kim, Krzysztof Kozlowski, Andrew Morton,
	Jérôme Glisse, Jan Kara, Dan Williams

All we need are a pages array, pin_user_pages_fast can give us that
directly. Plus this avoids the entire raw pfn side of get_vaddr_frames.

Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Inki Dae <inki.dae@samsung.com>
Cc: Joonyoung Shim <jy0922.shim@samsung.com>
Cc: Seung-Woo Kim <sw0312.kim@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Kukjin Kim <kgene@kernel.org>
Cc: Krzysztof Kozlowski <krzk@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: linux-mm@kvack.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-samsung-soc@vger.kernel.org
Cc: linux-media@vger.kernel.org
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
--
v2: Use unpin_user_pages_dirty_lock (John)
---
 drivers/gpu/drm/exynos/Kconfig          |  1 -
 drivers/gpu/drm/exynos/exynos_drm_g2d.c | 47 +++++++++++--------------
 2 files changed, 20 insertions(+), 28 deletions(-)

diff --git a/drivers/gpu/drm/exynos/Kconfig b/drivers/gpu/drm/exynos/Kconfig
index 6417f374b923..43257ef3c09d 100644
--- a/drivers/gpu/drm/exynos/Kconfig
+++ b/drivers/gpu/drm/exynos/Kconfig
@@ -88,7 +88,6 @@ comment "Sub-drivers"
 config DRM_EXYNOS_G2D
 	bool "G2D"
 	depends on VIDEO_SAMSUNG_S5P_G2D=n || COMPILE_TEST
-	select FRAME_VECTOR
 	help
 	  Choose this option if you want to use Exynos G2D for DRM.
 
diff --git a/drivers/gpu/drm/exynos/exynos_drm_g2d.c b/drivers/gpu/drm/exynos/exynos_drm_g2d.c
index 967a5cdc120e..ecede41af9b9 100644
--- a/drivers/gpu/drm/exynos/exynos_drm_g2d.c
+++ b/drivers/gpu/drm/exynos/exynos_drm_g2d.c
@@ -205,7 +205,8 @@ struct g2d_cmdlist_userptr {
 	dma_addr_t		dma_addr;
 	unsigned long		userptr;
 	unsigned long		size;
-	struct frame_vector	*vec;
+	struct page		**pages;
+	unsigned int		npages;
 	struct sg_table		*sgt;
 	atomic_t		refcount;
 	bool			in_pool;
@@ -378,7 +379,6 @@ static void g2d_userptr_put_dma_addr(struct g2d_data *g2d,
 					bool force)
 {
 	struct g2d_cmdlist_userptr *g2d_userptr = obj;
-	struct page **pages;
 
 	if (!obj)
 		return;
@@ -398,15 +398,9 @@ static void g2d_userptr_put_dma_addr(struct g2d_data *g2d,
 	dma_unmap_sgtable(to_dma_dev(g2d->drm_dev), g2d_userptr->sgt,
 			  DMA_BIDIRECTIONAL, 0);
 
-	pages = frame_vector_pages(g2d_userptr->vec);
-	if (!IS_ERR(pages)) {
-		int i;
-
-		for (i = 0; i < frame_vector_count(g2d_userptr->vec); i++)
-			set_page_dirty_lock(pages[i]);
-	}
-	put_vaddr_frames(g2d_userptr->vec);
-	frame_vector_destroy(g2d_userptr->vec);
+	unpin_user_pages_dirty_lock(g2d_userptr->pages, g2d_userptr->npages,
+				    true);
+	kvfree(g2d_userptr->pages);
 
 	if (!g2d_userptr->out_of_list)
 		list_del_init(&g2d_userptr->list);
@@ -474,35 +468,34 @@ static dma_addr_t *g2d_userptr_get_dma_addr(struct g2d_data *g2d,
 	offset = userptr & ~PAGE_MASK;
 	end = PAGE_ALIGN(userptr + size);
 	npages = (end - start) >> PAGE_SHIFT;
-	g2d_userptr->vec = frame_vector_create(npages);
-	if (!g2d_userptr->vec) {
+	g2d_userptr->pages = kvmalloc_array(npages, sizeof(*g2d_userptr->pages),
+					    GFP_KERNEL);
+	if (!g2d_userptr->pages) {
 		ret = -ENOMEM;
 		goto err_free;
 	}
 
-	ret = get_vaddr_frames(start, npages, FOLL_FORCE | FOLL_WRITE,
-		g2d_userptr->vec);
+	ret = pin_user_pages_fast(start, npages, FOLL_FORCE | FOLL_WRITE,
+				  g2d_userptr->pages);
 	if (ret != npages) {
 		DRM_DEV_ERROR(g2d->dev,
 			      "failed to get user pages from userptr.\n");
 		if (ret < 0)
-			goto err_destroy_framevec;
-		ret = -EFAULT;
-		goto err_put_framevec;
-	}
-	if (frame_vector_to_pages(g2d_userptr->vec) < 0) {
+			goto err_destroy_pages;
+		npages = ret;
 		ret = -EFAULT;
-		goto err_put_framevec;
+		goto err_unpin_pages;
 	}
+	g2d_userptr->npages = npages;
 
 	sgt = kzalloc(sizeof(*sgt), GFP_KERNEL);
 	if (!sgt) {
 		ret = -ENOMEM;
-		goto err_put_framevec;
+		goto err_unpin_pages;
 	}
 
 	ret = sg_alloc_table_from_pages(sgt,
-					frame_vector_pages(g2d_userptr->vec),
+					g2d_userptr->pages,
 					npages, offset, size, GFP_KERNEL);
 	if (ret < 0) {
 		DRM_DEV_ERROR(g2d->dev, "failed to get sgt from pages.\n");
@@ -538,11 +531,11 @@ static dma_addr_t *g2d_userptr_get_dma_addr(struct g2d_data *g2d,
 err_free_sgt:
 	kfree(sgt);
 
-err_put_framevec:
-	put_vaddr_frames(g2d_userptr->vec);
+err_unpin_pages:
+	unpin_user_pages(g2d_userptr->pages, npages);
 
-err_destroy_framevec:
-	frame_vector_destroy(g2d_userptr->vec);
+err_destroy_pages:
+	kvfree(g2d_userptr->pages);
 
 err_free:
 	kfree(g2d_userptr);
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v5 02/15] drm/exynos: Use FOLL_LONGTERM for g2d cmdlists
  2020-10-30 10:08 [PATCH v5 00/15] follow_pfn and other iomap races Daniel Vetter
  2020-10-30 10:08 ` [PATCH v5 01/15] drm/exynos: Stop using frame_vector helpers Daniel Vetter
@ 2020-10-30 10:08 ` Daniel Vetter
  2020-10-30 10:08 ` [PATCH v5 03/15] misc/habana: Stop using frame_vector helpers Daniel Vetter
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 62+ messages in thread
From: Daniel Vetter @ 2020-10-30 10:08 UTC (permalink / raw)
  To: DRI Development, LKML
  Cc: kvm, linux-mm, linux-arm-kernel, linux-samsung-soc, linux-media,
	Daniel Vetter, Daniel Vetter, Jason Gunthorpe, Inki Dae,
	Joonyoung Shim, Seung-Woo Kim, Kyungmin Park, Kukjin Kim,
	Krzysztof Kozlowski, Andrew Morton, John Hubbard,
	Jérôme Glisse, Jan Kara, Dan Williams

The exynos g2d interface is very unusual, but it looks like the
userptr objects are persistent. Hence they need FOLL_LONGTERM.

Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Inki Dae <inki.dae@samsung.com>
Cc: Joonyoung Shim <jy0922.shim@samsung.com>
Cc: Seung-Woo Kim <sw0312.kim@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Kukjin Kim <kgene@kernel.org>
Cc: Krzysztof Kozlowski <krzk@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: linux-mm@kvack.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-samsung-soc@vger.kernel.org
Cc: linux-media@vger.kernel.org
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 drivers/gpu/drm/exynos/exynos_drm_g2d.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/exynos/exynos_drm_g2d.c b/drivers/gpu/drm/exynos/exynos_drm_g2d.c
index ecede41af9b9..1e0c5a7f206e 100644
--- a/drivers/gpu/drm/exynos/exynos_drm_g2d.c
+++ b/drivers/gpu/drm/exynos/exynos_drm_g2d.c
@@ -475,7 +475,8 @@ static dma_addr_t *g2d_userptr_get_dma_addr(struct g2d_data *g2d,
 		goto err_free;
 	}
 
-	ret = pin_user_pages_fast(start, npages, FOLL_FORCE | FOLL_WRITE,
+	ret = pin_user_pages_fast(start, npages,
+				  FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM,
 				  g2d_userptr->pages);
 	if (ret != npages) {
 		DRM_DEV_ERROR(g2d->dev,
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v5 03/15] misc/habana: Stop using frame_vector helpers
  2020-10-30 10:08 [PATCH v5 00/15] follow_pfn and other iomap races Daniel Vetter
  2020-10-30 10:08 ` [PATCH v5 01/15] drm/exynos: Stop using frame_vector helpers Daniel Vetter
  2020-10-30 10:08 ` [PATCH v5 02/15] drm/exynos: Use FOLL_LONGTERM for g2d cmdlists Daniel Vetter
@ 2020-10-30 10:08 ` Daniel Vetter
  2020-10-30 10:08 ` [PATCH v5 04/15] misc/habana: Use FOLL_LONGTERM for userptr Daniel Vetter
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 62+ messages in thread
From: Daniel Vetter @ 2020-10-30 10:08 UTC (permalink / raw)
  To: DRI Development, LKML
  Cc: kvm, linux-mm, linux-arm-kernel, linux-samsung-soc, linux-media,
	Daniel Vetter, John Hubbard, Daniel Vetter, Jason Gunthorpe,
	Andrew Morton, Jérôme Glisse, Jan Kara, Dan Williams,
	Oded Gabbay, Omer Shpigelman, Ofir Bitton, Tomer Tayar,
	Moti Haimovski, Greg Kroah-Hartman, Pawel Piskorski

All we need are a pages array, pin_user_pages_fast can give us that
directly. Plus this avoids the entire raw pfn side of get_vaddr_frames.

Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: linux-mm@kvack.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-samsung-soc@vger.kernel.org
Cc: linux-media@vger.kernel.org
Cc: Oded Gabbay <oded.gabbay@gmail.com>
Cc: Omer Shpigelman <oshpigelman@habana.ai>
Cc: Ofir Bitton <obitton@habana.ai>
Cc: Tomer Tayar <ttayar@habana.ai>
Cc: Moti Haimovski <mhaimovski@habana.ai>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Pawel Piskorski <ppiskorski@habana.ai>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
--
v2: Use unpin_user_pages_dirty_lock (John)
v3: Update kerneldoc (Oded)
---
 drivers/misc/habanalabs/Kconfig             |  1 -
 drivers/misc/habanalabs/common/habanalabs.h |  6 ++-
 drivers/misc/habanalabs/common/memory.c     | 49 ++++++++-------------
 3 files changed, 22 insertions(+), 34 deletions(-)

diff --git a/drivers/misc/habanalabs/Kconfig b/drivers/misc/habanalabs/Kconfig
index 1640340d3e62..293d79811372 100644
--- a/drivers/misc/habanalabs/Kconfig
+++ b/drivers/misc/habanalabs/Kconfig
@@ -6,7 +6,6 @@
 config HABANA_AI
 	tristate "HabanaAI accelerators (habanalabs)"
 	depends on PCI && HAS_IOMEM
-	select FRAME_VECTOR
 	select GENERIC_ALLOCATOR
 	select HWMON
 	help
diff --git a/drivers/misc/habanalabs/common/habanalabs.h b/drivers/misc/habanalabs/common/habanalabs.h
index 80d4d7385ffe..272aa3f29200 100644
--- a/drivers/misc/habanalabs/common/habanalabs.h
+++ b/drivers/misc/habanalabs/common/habanalabs.h
@@ -921,7 +921,8 @@ struct hl_ctx_mgr {
  * struct hl_userptr - memory mapping chunk information
  * @vm_type: type of the VM.
  * @job_node: linked-list node for hanging the object on the Job's list.
- * @vec: pointer to the frame vector.
+ * @pages: pointer to struct page array
+ * @npages: size of @pages array
  * @sgt: pointer to the scatter-gather table that holds the pages.
  * @dir: for DMA unmapping, the direction must be supplied, so save it.
  * @debugfs_list: node in debugfs list of command submissions.
@@ -932,7 +933,8 @@ struct hl_ctx_mgr {
 struct hl_userptr {
 	enum vm_type_t		vm_type; /* must be first */
 	struct list_head	job_node;
-	struct frame_vector	*vec;
+	struct page		**pages;
+	unsigned int		npages;
 	struct sg_table		*sgt;
 	enum dma_data_direction dir;
 	struct list_head	debugfs_list;
diff --git a/drivers/misc/habanalabs/common/memory.c b/drivers/misc/habanalabs/common/memory.c
index 84227819e4d1..0b220221873d 100644
--- a/drivers/misc/habanalabs/common/memory.c
+++ b/drivers/misc/habanalabs/common/memory.c
@@ -1291,45 +1291,41 @@ static int get_user_memory(struct hl_device *hdev, u64 addr, u64 size,
 		return -EFAULT;
 	}
 
-	userptr->vec = frame_vector_create(npages);
-	if (!userptr->vec) {
+	userptr->pages = kvmalloc_array(npages, sizeof(*userptr->pages),
+					GFP_KERNEL);
+	if (!userptr->pages) {
 		dev_err(hdev->dev, "Failed to create frame vector\n");
 		return -ENOMEM;
 	}
 
-	rc = get_vaddr_frames(start, npages, FOLL_FORCE | FOLL_WRITE,
-				userptr->vec);
+	rc = pin_user_pages_fast(start, npages, FOLL_FORCE | FOLL_WRITE,
+				 userptr->pages);
 
 	if (rc != npages) {
 		dev_err(hdev->dev,
 			"Failed to map host memory, user ptr probably wrong\n");
 		if (rc < 0)
-			goto destroy_framevec;
+			goto destroy_pages;
+		npages = rc;
 		rc = -EFAULT;
-		goto put_framevec;
-	}
-
-	if (frame_vector_to_pages(userptr->vec) < 0) {
-		dev_err(hdev->dev,
-			"Failed to translate frame vector to pages\n");
-		rc = -EFAULT;
-		goto put_framevec;
+		goto put_pages;
 	}
+	userptr->npages = npages;
 
 	rc = sg_alloc_table_from_pages(userptr->sgt,
-					frame_vector_pages(userptr->vec),
-					npages, offset, size, GFP_ATOMIC);
+				       userptr->pages,
+				       npages, offset, size, GFP_ATOMIC);
 	if (rc < 0) {
 		dev_err(hdev->dev, "failed to create SG table from pages\n");
-		goto put_framevec;
+		goto put_pages;
 	}
 
 	return 0;
 
-put_framevec:
-	put_vaddr_frames(userptr->vec);
-destroy_framevec:
-	frame_vector_destroy(userptr->vec);
+put_pages:
+	unpin_user_pages(userptr->pages, npages);
+destroy_pages:
+	kvfree(userptr->pages);
 	return rc;
 }
 
@@ -1415,8 +1411,6 @@ int hl_pin_host_memory(struct hl_device *hdev, u64 addr, u64 size,
  */
 void hl_unpin_host_memory(struct hl_device *hdev, struct hl_userptr *userptr)
 {
-	struct page **pages;
-
 	hl_debugfs_remove_userptr(hdev, userptr);
 
 	if (userptr->dma_mapped)
@@ -1424,15 +1418,8 @@ void hl_unpin_host_memory(struct hl_device *hdev, struct hl_userptr *userptr)
 							userptr->sgt->nents,
 							userptr->dir);
 
-	pages = frame_vector_pages(userptr->vec);
-	if (!IS_ERR(pages)) {
-		int i;
-
-		for (i = 0; i < frame_vector_count(userptr->vec); i++)
-			set_page_dirty_lock(pages[i]);
-	}
-	put_vaddr_frames(userptr->vec);
-	frame_vector_destroy(userptr->vec);
+	unpin_user_pages_dirty_lock(userptr->pages, userptr->npages, true);
+	kvfree(userptr->pages);
 
 	list_del(&userptr->job_node);
 
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v5 04/15] misc/habana: Use FOLL_LONGTERM for userptr
  2020-10-30 10:08 [PATCH v5 00/15] follow_pfn and other iomap races Daniel Vetter
                   ` (2 preceding siblings ...)
  2020-10-30 10:08 ` [PATCH v5 03/15] misc/habana: Stop using frame_vector helpers Daniel Vetter
@ 2020-10-30 10:08 ` Daniel Vetter
  2020-10-30 10:08 ` [PATCH v5 05/15] mm/frame-vector: Use FOLL_LONGTERM Daniel Vetter
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 62+ messages in thread
From: Daniel Vetter @ 2020-10-30 10:08 UTC (permalink / raw)
  To: DRI Development, LKML
  Cc: kvm, linux-mm, linux-arm-kernel, linux-samsung-soc, linux-media,
	Daniel Vetter, Daniel Vetter, Jason Gunthorpe, Andrew Morton,
	John Hubbard, Jérôme Glisse, Jan Kara, Dan Williams,
	Oded Gabbay, Omer Shpigelman, Ofir Bitton, Tomer Tayar,
	Moti Haimovski, Greg Kroah-Hartman, Pawel Piskorski

These are persistent, not just for the duration of a dma operation.

Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: linux-mm@kvack.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-samsung-soc@vger.kernel.org
Cc: linux-media@vger.kernel.org
Cc: Oded Gabbay <oded.gabbay@gmail.com>
Cc: Omer Shpigelman <oshpigelman@habana.ai>
Cc: Ofir Bitton <obitton@habana.ai>
Cc: Tomer Tayar <ttayar@habana.ai>
Cc: Moti Haimovski <mhaimovski@habana.ai>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Pawel Piskorski <ppiskorski@habana.ai>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 drivers/misc/habanalabs/common/memory.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/misc/habanalabs/common/memory.c b/drivers/misc/habanalabs/common/memory.c
index 0b220221873d..d08c41b90fec 100644
--- a/drivers/misc/habanalabs/common/memory.c
+++ b/drivers/misc/habanalabs/common/memory.c
@@ -1298,7 +1298,8 @@ static int get_user_memory(struct hl_device *hdev, u64 addr, u64 size,
 		return -ENOMEM;
 	}
 
-	rc = pin_user_pages_fast(start, npages, FOLL_FORCE | FOLL_WRITE,
+	rc = pin_user_pages_fast(start, npages,
+				 FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM,
 				 userptr->pages);
 
 	if (rc != npages) {
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v5 05/15] mm/frame-vector: Use FOLL_LONGTERM
  2020-10-30 10:08 [PATCH v5 00/15] follow_pfn and other iomap races Daniel Vetter
                   ` (3 preceding siblings ...)
  2020-10-30 10:08 ` [PATCH v5 04/15] misc/habana: Use FOLL_LONGTERM for userptr Daniel Vetter
@ 2020-10-30 10:08 ` Daniel Vetter
  2020-10-30 14:11   ` Tomasz Figa
  2020-10-31  2:55   ` John Hubbard
  2020-10-30 10:08 ` [PATCH v5 06/15] media: videobuf2: Move frame_vector into media subsystem Daniel Vetter
                   ` (9 subsequent siblings)
  14 siblings, 2 replies; 62+ messages in thread
From: Daniel Vetter @ 2020-10-30 10:08 UTC (permalink / raw)
  To: DRI Development, LKML
  Cc: kvm, linux-mm, linux-arm-kernel, linux-samsung-soc, linux-media,
	Daniel Vetter, Daniel Vetter, Jason Gunthorpe, Pawel Osciak,
	Marek Szyprowski, Kyungmin Park, Tomasz Figa,
	Mauro Carvalho Chehab, Andrew Morton, John Hubbard,
	Jérôme Glisse, Jan Kara, Dan Williams

This is used by media/videbuf2 for persistent dma mappings, not just
for a single dma operation and then freed again, so needs
FOLL_LONGTERM.

Unfortunately current pup_locked doesn't support FOLL_LONGTERM due to
locking issues. Rework the code to pull the pup path out from the
mmap_sem critical section as suggested by Jason.

By relying entirely on the vma checks in pin_user_pages and follow_pfn
(for vm_flags and vma_is_fsdax) we can also streamline the code a lot.

Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Pawel Osciak <pawel@osciak.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Tomasz Figa <tfiga@chromium.org>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: linux-mm@kvack.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-samsung-soc@vger.kernel.org
Cc: linux-media@vger.kernel.org
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
--
v2: Streamline the code and further simplify the loop checks (Jason)

v5: Review from Tomasz:
- fix page counting for the follow_pfn case by resetting ret
- drop gup_flags paramater, now unused
---
 .../media/common/videobuf2/videobuf2-memops.c |  3 +-
 include/linux/mm.h                            |  2 +-
 mm/frame_vector.c                             | 53 ++++++-------------
 3 files changed, 19 insertions(+), 39 deletions(-)

diff --git a/drivers/media/common/videobuf2/videobuf2-memops.c b/drivers/media/common/videobuf2/videobuf2-memops.c
index 6e9e05153f4e..9dd6c27162f4 100644
--- a/drivers/media/common/videobuf2/videobuf2-memops.c
+++ b/drivers/media/common/videobuf2/videobuf2-memops.c
@@ -40,7 +40,6 @@ struct frame_vector *vb2_create_framevec(unsigned long start,
 	unsigned long first, last;
 	unsigned long nr;
 	struct frame_vector *vec;
-	unsigned int flags = FOLL_FORCE | FOLL_WRITE;
 
 	first = start >> PAGE_SHIFT;
 	last = (start + length - 1) >> PAGE_SHIFT;
@@ -48,7 +47,7 @@ struct frame_vector *vb2_create_framevec(unsigned long start,
 	vec = frame_vector_create(nr);
 	if (!vec)
 		return ERR_PTR(-ENOMEM);
-	ret = get_vaddr_frames(start & PAGE_MASK, nr, flags, vec);
+	ret = get_vaddr_frames(start & PAGE_MASK, nr, vec);
 	if (ret < 0)
 		goto out_destroy;
 	/* We accept only complete set of PFNs */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ef360fe70aaf..d6b8e30dce2e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1765,7 +1765,7 @@ struct frame_vector {
 struct frame_vector *frame_vector_create(unsigned int nr_frames);
 void frame_vector_destroy(struct frame_vector *vec);
 int get_vaddr_frames(unsigned long start, unsigned int nr_pfns,
-		     unsigned int gup_flags, struct frame_vector *vec);
+		     struct frame_vector *vec);
 void put_vaddr_frames(struct frame_vector *vec);
 int frame_vector_to_pages(struct frame_vector *vec);
 void frame_vector_to_pfns(struct frame_vector *vec);
diff --git a/mm/frame_vector.c b/mm/frame_vector.c
index 10f82d5643b6..f8c34b895c76 100644
--- a/mm/frame_vector.c
+++ b/mm/frame_vector.c
@@ -32,13 +32,12 @@
  * This function takes care of grabbing mmap_lock as necessary.
  */
 int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
-		     unsigned int gup_flags, struct frame_vector *vec)
+		     struct frame_vector *vec)
 {
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
 	int ret = 0;
 	int err;
-	int locked;
 
 	if (nr_frames == 0)
 		return 0;
@@ -48,40 +47,26 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
 
 	start = untagged_addr(start);
 
-	mmap_read_lock(mm);
-	locked = 1;
-	vma = find_vma_intersection(mm, start, start + 1);
-	if (!vma) {
-		ret = -EFAULT;
-		goto out;
-	}
-
-	/*
-	 * While get_vaddr_frames() could be used for transient (kernel
-	 * controlled lifetime) pinning of memory pages all current
-	 * users establish long term (userspace controlled lifetime)
-	 * page pinning. Treat get_vaddr_frames() like
-	 * get_user_pages_longterm() and disallow it for filesystem-dax
-	 * mappings.
-	 */
-	if (vma_is_fsdax(vma)) {
-		ret = -EOPNOTSUPP;
-		goto out;
-	}
-
-	if (!(vma->vm_flags & (VM_IO | VM_PFNMAP))) {
+	ret = pin_user_pages_fast(start, nr_frames,
+				  FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM,
+				  (struct page **)(vec->ptrs));
+	if (ret > 0) {
 		vec->got_ref = true;
 		vec->is_pfns = false;
-		ret = pin_user_pages_locked(start, nr_frames,
-			gup_flags, (struct page **)(vec->ptrs), &locked);
-		goto out;
+		goto out_unlocked;
 	}
 
+	mmap_read_lock(mm);
 	vec->got_ref = false;
 	vec->is_pfns = true;
+	ret = 0;
 	do {
 		unsigned long *nums = frame_vector_pfns(vec);
 
+		vma = find_vma_intersection(mm, start, start + 1);
+		if (!vma)
+			break;
+
 		while (ret < nr_frames && start + PAGE_SIZE <= vma->vm_end) {
 			err = follow_pfn(vma, start, &nums[ret]);
 			if (err) {
@@ -92,17 +77,13 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
 			start += PAGE_SIZE;
 			ret++;
 		}
-		/*
-		 * We stop if we have enough pages or if VMA doesn't completely
-		 * cover the tail page.
-		 */
-		if (ret >= nr_frames || start < vma->vm_end)
+		/* Bail out if VMA doesn't completely cover the tail page. */
+		if (start < vma->vm_end)
 			break;
-		vma = find_vma_intersection(mm, start, start + 1);
-	} while (vma && vma->vm_flags & (VM_IO | VM_PFNMAP));
+	} while (ret < nr_frames);
 out:
-	if (locked)
-		mmap_read_unlock(mm);
+	mmap_read_unlock(mm);
+out_unlocked:
 	if (!ret)
 		ret = -EFAULT;
 	if (ret > 0)
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v5 06/15] media: videobuf2: Move frame_vector into media subsystem
  2020-10-30 10:08 [PATCH v5 00/15] follow_pfn and other iomap races Daniel Vetter
                   ` (4 preceding siblings ...)
  2020-10-30 10:08 ` [PATCH v5 05/15] mm/frame-vector: Use FOLL_LONGTERM Daniel Vetter
@ 2020-10-30 10:08 ` Daniel Vetter
  2020-10-30 10:08 ` [PATCH v5 07/15] mm: Close race in generic_access_phys Daniel Vetter
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 62+ messages in thread
From: Daniel Vetter @ 2020-10-30 10:08 UTC (permalink / raw)
  To: DRI Development, LKML
  Cc: kvm, linux-mm, linux-arm-kernel, linux-samsung-soc, linux-media,
	Daniel Vetter, John Hubbard, Mauro Carvalho Chehab, Tomasz Figa,
	Daniel Vetter, Jason Gunthorpe, Pawel Osciak, Marek Szyprowski,
	Kyungmin Park, Mauro Carvalho Chehab, Andrew Morton,
	Jérôme Glisse, Jan Kara, Dan Williams

It's the only user. This also garbage collects the CONFIG_FRAME_VECTOR
symbol from all over the tree (well just one place, somehow omap media
driver still had this in its Kconfig, despite not using it).

Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Acked-by: Tomasz Figa <tfiga@chromium.org>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Pawel Osciak <pawel@osciak.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Tomasz Figa <tfiga@chromium.org>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: linux-mm@kvack.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-samsung-soc@vger.kernel.org
Cc: linux-media@vger.kernel.org
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
--
v3:
- Create a new frame_vector.h header for this (Mauro)
v5:
- Rebase over changes in frame-vector.c from Tomasz review.
---
 drivers/media/common/videobuf2/Kconfig        |  1 -
 drivers/media/common/videobuf2/Makefile       |  1 +
 .../media/common/videobuf2}/frame_vector.c    |  2 +
 drivers/media/platform/omap/Kconfig           |  1 -
 include/linux/mm.h                            | 42 -----------------
 include/media/frame_vector.h                  | 47 +++++++++++++++++++
 include/media/videobuf2-core.h                |  1 +
 mm/Kconfig                                    |  3 --
 mm/Makefile                                   |  1 -
 9 files changed, 51 insertions(+), 48 deletions(-)
 rename {mm => drivers/media/common/videobuf2}/frame_vector.c (99%)
 create mode 100644 include/media/frame_vector.h

diff --git a/drivers/media/common/videobuf2/Kconfig b/drivers/media/common/videobuf2/Kconfig
index edbc99ebba87..d2223a12c95f 100644
--- a/drivers/media/common/videobuf2/Kconfig
+++ b/drivers/media/common/videobuf2/Kconfig
@@ -9,7 +9,6 @@ config VIDEOBUF2_V4L2
 
 config VIDEOBUF2_MEMOPS
 	tristate
-	select FRAME_VECTOR
 
 config VIDEOBUF2_DMA_CONTIG
 	tristate
diff --git a/drivers/media/common/videobuf2/Makefile b/drivers/media/common/videobuf2/Makefile
index 77bebe8b202f..54306f8d096c 100644
--- a/drivers/media/common/videobuf2/Makefile
+++ b/drivers/media/common/videobuf2/Makefile
@@ -1,5 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0
 videobuf2-common-objs := videobuf2-core.o
+videobuf2-common-objs += frame_vector.o
 
 ifeq ($(CONFIG_TRACEPOINTS),y)
   videobuf2-common-objs += vb2-trace.o
diff --git a/mm/frame_vector.c b/drivers/media/common/videobuf2/frame_vector.c
similarity index 99%
rename from mm/frame_vector.c
rename to drivers/media/common/videobuf2/frame_vector.c
index f8c34b895c76..a0e65481a201 100644
--- a/mm/frame_vector.c
+++ b/drivers/media/common/videobuf2/frame_vector.c
@@ -8,6 +8,8 @@
 #include <linux/pagemap.h>
 #include <linux/sched.h>
 
+#include <media/frame_vector.h>
+
 /**
  * get_vaddr_frames() - map virtual addresses to pfns
  * @start:	starting user address
diff --git a/drivers/media/platform/omap/Kconfig b/drivers/media/platform/omap/Kconfig
index f73b5893220d..de16de46c0f4 100644
--- a/drivers/media/platform/omap/Kconfig
+++ b/drivers/media/platform/omap/Kconfig
@@ -12,6 +12,5 @@ config VIDEO_OMAP2_VOUT
 	depends on VIDEO_V4L2
 	select VIDEOBUF2_DMA_CONTIG
 	select OMAP2_VRFB if ARCH_OMAP2 || ARCH_OMAP3
-	select FRAME_VECTOR
 	help
 	  V4L2 Display driver support for OMAP2/3 based boards.
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d6b8e30dce2e..179dbb78d08d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1751,48 +1751,6 @@ int account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc);
 int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
 			struct task_struct *task, bool bypass_rlim);
 
-/* Container for pinned pfns / pages */
-struct frame_vector {
-	unsigned int nr_allocated;	/* Number of frames we have space for */
-	unsigned int nr_frames;	/* Number of frames stored in ptrs array */
-	bool got_ref;		/* Did we pin pages by getting page ref? */
-	bool is_pfns;		/* Does array contain pages or pfns? */
-	void *ptrs[];		/* Array of pinned pfns / pages. Use
-				 * pfns_vector_pages() or pfns_vector_pfns()
-				 * for access */
-};
-
-struct frame_vector *frame_vector_create(unsigned int nr_frames);
-void frame_vector_destroy(struct frame_vector *vec);
-int get_vaddr_frames(unsigned long start, unsigned int nr_pfns,
-		     struct frame_vector *vec);
-void put_vaddr_frames(struct frame_vector *vec);
-int frame_vector_to_pages(struct frame_vector *vec);
-void frame_vector_to_pfns(struct frame_vector *vec);
-
-static inline unsigned int frame_vector_count(struct frame_vector *vec)
-{
-	return vec->nr_frames;
-}
-
-static inline struct page **frame_vector_pages(struct frame_vector *vec)
-{
-	if (vec->is_pfns) {
-		int err = frame_vector_to_pages(vec);
-
-		if (err)
-			return ERR_PTR(err);
-	}
-	return (struct page **)(vec->ptrs);
-}
-
-static inline unsigned long *frame_vector_pfns(struct frame_vector *vec)
-{
-	if (!vec->is_pfns)
-		frame_vector_to_pfns(vec);
-	return (unsigned long *)(vec->ptrs);
-}
-
 struct kvec;
 int get_kernel_pages(const struct kvec *iov, int nr_pages, int write,
 			struct page **pages);
diff --git a/include/media/frame_vector.h b/include/media/frame_vector.h
new file mode 100644
index 000000000000..bfed1710dc24
--- /dev/null
+++ b/include/media/frame_vector.h
@@ -0,0 +1,47 @@
+// SPDX-License-Identifier: GPL-2.0
+#ifndef _MEDIA_FRAME_VECTOR_H
+#define _MEDIA_FRAME_VECTOR_H
+
+/* Container for pinned pfns / pages in frame_vector.c */
+struct frame_vector {
+	unsigned int nr_allocated;	/* Number of frames we have space for */
+	unsigned int nr_frames;	/* Number of frames stored in ptrs array */
+	bool got_ref;		/* Did we pin pages by getting page ref? */
+	bool is_pfns;		/* Does array contain pages or pfns? */
+	void *ptrs[];		/* Array of pinned pfns / pages. Use
+				 * pfns_vector_pages() or pfns_vector_pfns()
+				 * for access */
+};
+
+struct frame_vector *frame_vector_create(unsigned int nr_frames);
+void frame_vector_destroy(struct frame_vector *vec);
+int get_vaddr_frames(unsigned long start, unsigned int nr_pfns,
+		     struct frame_vector *vec);
+void put_vaddr_frames(struct frame_vector *vec);
+int frame_vector_to_pages(struct frame_vector *vec);
+void frame_vector_to_pfns(struct frame_vector *vec);
+
+static inline unsigned int frame_vector_count(struct frame_vector *vec)
+{
+	return vec->nr_frames;
+}
+
+static inline struct page **frame_vector_pages(struct frame_vector *vec)
+{
+	if (vec->is_pfns) {
+		int err = frame_vector_to_pages(vec);
+
+		if (err)
+			return ERR_PTR(err);
+	}
+	return (struct page **)(vec->ptrs);
+}
+
+static inline unsigned long *frame_vector_pfns(struct frame_vector *vec)
+{
+	if (!vec->is_pfns)
+		frame_vector_to_pfns(vec);
+	return (unsigned long *)(vec->ptrs);
+}
+
+#endif /* _MEDIA_FRAME_VECTOR_H */
diff --git a/include/media/videobuf2-core.h b/include/media/videobuf2-core.h
index bbb3f26fbde9..d045e3a5a1d8 100644
--- a/include/media/videobuf2-core.h
+++ b/include/media/videobuf2-core.h
@@ -18,6 +18,7 @@
 #include <linux/dma-buf.h>
 #include <linux/bitops.h>
 #include <media/media-request.h>
+#include <media/frame_vector.h>
 
 #define VB2_MAX_FRAME	(32)
 #define VB2_MAX_PLANES	(8)
diff --git a/mm/Kconfig b/mm/Kconfig
index d42423f884a7..0dcff24cba53 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -819,9 +819,6 @@ config DEVICE_PRIVATE
 config VMAP_PFN
 	bool
 
-config FRAME_VECTOR
-	bool
-
 config ARCH_USES_HIGH_VMA_FLAGS
 	bool
 config ARCH_HAS_PKEYS
diff --git a/mm/Makefile b/mm/Makefile
index d73aed0fc99c..db41fff05038 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -110,7 +110,6 @@ obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
 obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
 obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
 obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
-obj-$(CONFIG_FRAME_VECTOR) += frame_vector.o
 obj-$(CONFIG_DEBUG_PAGE_REF) += debug_page_ref.o
 obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
 obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v5 07/15] mm: Close race in generic_access_phys
  2020-10-30 10:08 [PATCH v5 00/15] follow_pfn and other iomap races Daniel Vetter
                   ` (5 preceding siblings ...)
  2020-10-30 10:08 ` [PATCH v5 06/15] media: videobuf2: Move frame_vector into media subsystem Daniel Vetter
@ 2020-10-30 10:08 ` Daniel Vetter
  2020-10-30 10:08 ` [PATCH v5 08/15] mm: Add unsafe_follow_pfn Daniel Vetter
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 62+ messages in thread
From: Daniel Vetter @ 2020-10-30 10:08 UTC (permalink / raw)
  To: DRI Development, LKML
  Cc: kvm, linux-mm, linux-arm-kernel, linux-samsung-soc, linux-media,
	Daniel Vetter, Daniel Vetter, Jason Gunthorpe, Dan Williams,
	Kees Cook, Benjamin Herrensmidt, Dave Airlie, Andrew Morton,
	John Hubbard, Jérôme Glisse, Jan Kara, Chris Wilson

Way back it was a reasonable assumptions that iomem mappings never
change the pfn range they point at. But this has changed:

- gpu drivers dynamically manage their memory nowadays, invalidating
  ptes with unmap_mapping_range when buffers get moved

- contiguous dma allocations have moved from dedicated carvetouts to
  cma regions. This means if we miss the unmap the pfn might contain
  pagecache or anon memory (well anything allocated with GFP_MOVEABLE)

- even /dev/mem now invalidates mappings when the kernel requests that
  iomem region when CONFIG_IO_STRICT_DEVMEM is set, see 3234ac664a87
  ("/dev/mem: Revoke mappings when a driver claims the region")

Accessing pfns obtained from ptes without holding all the locks is
therefore no longer a good idea. Fix this.

Since ioremap might need to manipulate pagetables too we need to drop
the pt lock and have a retry loop if we raced.

While at it, also add kerneldoc and improve the comment for the
vma_ops->access function. It's for accessing, not for moving the
memory from iomem to system memory, as the old comment seemed to
suggest.

References: 28b2ee20c7cb ("access_process_vm device memory infrastructure")
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Benjamin Herrensmidt <benh@kernel.crashing.org>
Cc: Dave Airlie <airlied@linux.ie>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: linux-mm@kvack.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-samsung-soc@vger.kernel.org
Cc: linux-media@vger.kernel.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
--
v2: Fix inversion in the retry check (John).

v4: While at it, use offset_in_page (Chris Wilson)
---
 include/linux/mm.h |  3 ++-
 mm/memory.c        | 46 +++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 179dbb78d08d..83d0be101a38 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -574,7 +574,8 @@ struct vm_operations_struct {
 	vm_fault_t (*pfn_mkwrite)(struct vm_fault *vmf);
 
 	/* called by access_process_vm when get_user_pages() fails, typically
-	 * for use by special VMAs that can switch between memory and hardware
+	 * for use by special VMAs. See also generic_access_phys() for a generic
+	 * implementation useful for any iomem mapping.
 	 */
 	int (*access)(struct vm_area_struct *vma, unsigned long addr,
 		      void *buf, int len, int write);
diff --git a/mm/memory.c b/mm/memory.c
index c48f8df6e502..ac32039ce941 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4847,28 +4847,68 @@ int follow_phys(struct vm_area_struct *vma,
 	return ret;
 }
 
+/**
+ * generic_access_phys - generic implementation for iomem mmap access
+ * @vma: the vma to access
+ * @addr: userspace addres, not relative offset within @vma
+ * @buf: buffer to read/write
+ * @len: length of transfer
+ * @write: set to FOLL_WRITE when writing, otherwise reading
+ *
+ * This is a generic implementation for &vm_operations_struct.access for an
+ * iomem mapping. This callback is used by access_process_vm() when the @vma is
+ * not page based.
+ */
 int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
 			void *buf, int len, int write)
 {
 	resource_size_t phys_addr;
 	unsigned long prot = 0;
 	void __iomem *maddr;
-	int offset = addr & (PAGE_SIZE-1);
+	pte_t *ptep, pte;
+	spinlock_t *ptl;
+	int offset = offset_in_page(addr);
+	int ret = -EINVAL;
+
+	if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
+		return -EINVAL;
+
+retry:
+	if (follow_pte(vma->vm_mm, addr, &ptep, &ptl))
+		return -EINVAL;
+	pte = *ptep;
+	pte_unmap_unlock(ptep, ptl);
 
-	if (follow_phys(vma, addr, write, &prot, &phys_addr))
+	prot = pgprot_val(pte_pgprot(pte));
+	phys_addr = (resource_size_t)pte_pfn(pte) << PAGE_SHIFT;
+
+	if ((write & FOLL_WRITE) && !pte_write(pte))
 		return -EINVAL;
 
 	maddr = ioremap_prot(phys_addr, PAGE_ALIGN(len + offset), prot);
 	if (!maddr)
 		return -ENOMEM;
 
+	if (follow_pte(vma->vm_mm, addr, &ptep, &ptl))
+		goto out_unmap;
+
+	if (!pte_same(pte, *ptep)) {
+		pte_unmap_unlock(ptep, ptl);
+		iounmap(maddr);
+
+		goto retry;
+	}
+
 	if (write)
 		memcpy_toio(maddr + offset, buf, len);
 	else
 		memcpy_fromio(buf, maddr + offset, len);
+	ret = len;
+	pte_unmap_unlock(ptep, ptl);
+out_unmap:
 	iounmap(maddr);
 
-	return len;
+	return ret;
 }
 EXPORT_SYMBOL_GPL(generic_access_phys);
 #endif
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v5 08/15] mm: Add unsafe_follow_pfn
  2020-10-30 10:08 [PATCH v5 00/15] follow_pfn and other iomap races Daniel Vetter
                   ` (6 preceding siblings ...)
  2020-10-30 10:08 ` [PATCH v5 07/15] mm: Close race in generic_access_phys Daniel Vetter
@ 2020-10-30 10:08 ` Daniel Vetter
  2020-11-02  7:29   ` Christoph Hellwig
  2020-10-30 10:08 ` [PATCH v5 09/15] media/videbuf1|2: Mark follow_pfn usage as unsafe Daniel Vetter
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 62+ messages in thread
From: Daniel Vetter @ 2020-10-30 10:08 UTC (permalink / raw)
  To: DRI Development, LKML
  Cc: kvm, linux-mm, linux-arm-kernel, linux-samsung-soc, linux-media,
	Daniel Vetter, Daniel Vetter, Christoph Hellwig, Jason Gunthorpe,
	Kees Cook, Dan Williams, Andrew Morton, John Hubbard,
	Jérôme Glisse, Jan Kara

Way back it was a reasonable assumptions that iomem mappings never
change the pfn range they point at. But this has changed:

- gpu drivers dynamically manage their memory nowadays, invalidating
ptes with unmap_mapping_range when buffers get moved

- contiguous dma allocations have moved from dedicated carvetouts to
cma regions. This means if we miss the unmap the pfn might contain
pagecache or anon memory (well anything allocated with GFP_MOVEABLE)

- even /dev/mem now invalidates mappings when the kernel requests that
iomem region when CONFIG_IO_STRICT_DEVMEM is set, see 3234ac664a87
("/dev/mem: Revoke mappings when a driver claims the region")

Accessing pfns obtained from ptes without holding all the locks is
therefore no longer a good idea.

Unfortunately there's some users where this is not fixable (like v4l
userptr of iomem mappings) or involves a pile of work (vfio type1
iommu). For now annotate these as unsafe and splat appropriately.

This patch adds an unsafe_follow_pfn, which later patches will then
roll out to all appropriate places.

Also mark up follow_pfn as EXPORT_SYMBOL_GPL. The only safe way to use
that by drivers/modules is together with an mmu_notifier, and that's
all _GPL stuff.

Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kees Cook <keescook@chromium.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: linux-mm@kvack.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-samsung-soc@vger.kernel.org
Cc: linux-media@vger.kernel.org
Cc: kvm@vger.kernel.org
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
--
v5: Suggestions from Christoph
- reindent for less weirdness
- use IS_ENABLED instead of #ifdef
- same checks for nommu, for consistency
- EXPORT_SYMBOL_GPL for follow_pfn.
- kerneldoc was already updated in previous versions to explain when
  follow_pfn can be used safely
---
 include/linux/mm.h |  2 ++
 mm/memory.c        | 34 ++++++++++++++++++++++++++++++++--
 mm/nommu.c         | 27 ++++++++++++++++++++++++++-
 security/Kconfig   | 13 +++++++++++++
 4 files changed, 73 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 83d0be101a38..d0fe8bf46a9d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1661,6 +1661,8 @@ int follow_pte_pmd(struct mm_struct *mm, unsigned long address,
 		   pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp);
 int follow_pfn(struct vm_area_struct *vma, unsigned long address,
 	unsigned long *pfn);
+int unsafe_follow_pfn(struct vm_area_struct *vma, unsigned long address,
+		      unsigned long *pfn);
 int follow_phys(struct vm_area_struct *vma, unsigned long address,
 		unsigned int flags, unsigned long *prot, resource_size_t *phys);
 int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/memory.c b/mm/memory.c
index ac32039ce941..0db0c5e233fd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4795,7 +4795,12 @@ EXPORT_SYMBOL(follow_pte_pmd);
  * @address: user virtual address
  * @pfn: location to store found PFN
  *
- * Only IO mappings and raw PFN mappings are allowed.
+ * Only IO mappings and raw PFN mappings are allowed. Note that callers must
+ * ensure coherency with pte updates by using a &mmu_notifier to follow updates.
+ * If this is not feasible, or the access to the @pfn is only very short term,
+ * use follow_pte_pmd() instead and hold the pagetable lock for the duration of
+ * the access instead. Any caller not following these requirements must use
+ * unsafe_follow_pfn() instead.
  *
  * Return: zero and the pfn at @pfn on success, -ve otherwise.
  */
@@ -4816,7 +4821,32 @@ int follow_pfn(struct vm_area_struct *vma, unsigned long address,
 	pte_unmap_unlock(ptep, ptl);
 	return 0;
 }
-EXPORT_SYMBOL(follow_pfn);
+EXPORT_SYMBOL_GPL(follow_pfn);
+
+/**
+ * unsafe_follow_pfn - look up PFN at a user virtual address
+ * @vma: memory mapping
+ * @address: user virtual address
+ * @pfn: location to store found PFN
+ *
+ * Only IO mappings and raw PFN mappings are allowed.
+ *
+ * Returns zero and the pfn at @pfn on success, -ve otherwise.
+ */
+int unsafe_follow_pfn(struct vm_area_struct *vma, unsigned long address,
+		      unsigned long *pfn)
+{
+	if (IS_ENABLED(CONFIG_STRICT_FOLLOW_PFN)) {
+		pr_info("unsafe follow_pfn usage rejected, see CONFIG_STRICT_FOLLOW_PFN\n");
+		return -EINVAL;
+	}
+
+	WARN_ONCE(1, "unsafe follow_pfn usage\n");
+	add_taint(TAINT_USER, LOCKDEP_STILL_OK);
+
+	return follow_pfn(vma, address, pfn);
+}
+EXPORT_SYMBOL(unsafe_follow_pfn);
 
 #ifdef CONFIG_HAVE_IOREMAP_PROT
 int follow_phys(struct vm_area_struct *vma,
diff --git a/mm/nommu.c b/mm/nommu.c
index 0faf39b32cdb..79fc98a6c94a 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -130,7 +130,32 @@ int follow_pfn(struct vm_area_struct *vma, unsigned long address,
 	*pfn = address >> PAGE_SHIFT;
 	return 0;
 }
-EXPORT_SYMBOL(follow_pfn);
+EXPORT_SYMBOL_GPL(follow_pfn);
+
+/**
+ * unsafe_follow_pfn - look up PFN at a user virtual address
+ * @vma: memory mapping
+ * @address: user virtual address
+ * @pfn: location to store found PFN
+ *
+ * Only IO mappings and raw PFN mappings are allowed.
+ *
+ * Returns zero and the pfn at @pfn on success, -ve otherwise.
+ */
+int unsafe_follow_pfn(struct vm_area_struct *vma, unsigned long address,
+		      unsigned long *pfn)
+{
+	if (IS_ENABLED(CONFIG_STRICT_FOLLOW_PFN)) {
+		pr_info("unsafe follow_pfn usage rejected, see CONFIG_STRICT_FOLLOW_PFN\n");
+		return -EINVAL;
+	}
+
+	WARN_ONCE(1, "unsafe follow_pfn usage\n");
+	add_taint(TAINT_USER, LOCKDEP_STILL_OK);
+
+	return follow_pfn(vma, address, pfn);
+}
+EXPORT_SYMBOL(unsafe_follow_pfn);
 
 LIST_HEAD(vmap_area_list);
 
diff --git a/security/Kconfig b/security/Kconfig
index 7561f6f99f1d..48945402e103 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -230,6 +230,19 @@ config STATIC_USERMODEHELPER_PATH
 	  If you wish for all usermode helper programs to be disabled,
 	  specify an empty string here (i.e. "").
 
+config STRICT_FOLLOW_PFN
+	bool "Disable unsafe use of follow_pfn"
+	depends on MMU
+	help
+	  Some functionality in the kernel follows userspace mappings to iomem
+	  ranges in an unsafe matter. Examples include v4l userptr for zero-copy
+	  buffers sharing.
+
+	  If this option is switched on, such access is rejected. Only enable
+	  this option when you must run userspace which requires this.
+
+	  If in doubt, say Y.
+
 source "security/selinux/Kconfig"
 source "security/smack/Kconfig"
 source "security/tomoyo/Kconfig"
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v5 09/15] media/videbuf1|2: Mark follow_pfn usage as unsafe
  2020-10-30 10:08 [PATCH v5 00/15] follow_pfn and other iomap races Daniel Vetter
                   ` (7 preceding siblings ...)
  2020-10-30 10:08 ` [PATCH v5 08/15] mm: Add unsafe_follow_pfn Daniel Vetter
@ 2020-10-30 10:08 ` Daniel Vetter
  2020-10-30 10:08 ` [PATCH v5 10/15] vfio/type1: Mark follow_pfn " Daniel Vetter
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 62+ messages in thread
From: Daniel Vetter @ 2020-10-30 10:08 UTC (permalink / raw)
  To: DRI Development, LKML
  Cc: kvm, linux-mm, linux-arm-kernel, linux-samsung-soc, linux-media,
	Daniel Vetter, Tomasz Figa, Daniel Vetter, Jason Gunthorpe,
	Kees Cook, Dan Williams, Andrew Morton, John Hubbard,
	Jérôme Glisse, Jan Kara, Pawel Osciak,
	Marek Szyprowski, Kyungmin Park, Laurent Dufour, Vlastimil Babka,
	Daniel Jordan, Michel Lespinasse

The media model assumes that buffers are all preallocated, so that
when a media pipeline is running we never miss a deadline because the
buffers aren't allocated or available.

This means we cannot fix the v4l follow_pfn usage through
mmu_notifier, without breaking how this all works. The only real fix
is to deprecate userptr support for VM_IO | VM_PFNMAP mappings and
tell everyone to cut over to dma-buf memory sharing for zerocopy.

userptr for normal memory will keep working as-is, this only affects
the zerocopy userptr usage enabled in 50ac952d2263 ("[media]
videobuf2-dma-sg: Support io userptr operations on io memory").

Acked-by: Tomasz Figa <tfiga@chromium.org>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kees Cook <keescook@chromium.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: linux-mm@kvack.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-samsung-soc@vger.kernel.org
Cc: linux-media@vger.kernel.org
Cc: Pawel Osciak <pawel@osciak.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Tomasz Figa <tfiga@chromium.org>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Michel Lespinasse <walken@google.com>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
--
v3:
- Reference the commit that enabled the zerocopy userptr use case to
  make it abundandtly clear that this patch only affects that, and not
  normal memory userptr. The old commit message already explained that
  normal memory userptr is unaffected, but I guess that was not clear
  enough.
---
 drivers/media/common/videobuf2/frame_vector.c | 2 +-
 drivers/media/v4l2-core/videobuf-dma-contig.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/media/common/videobuf2/frame_vector.c b/drivers/media/common/videobuf2/frame_vector.c
index a0e65481a201..1a82ec13ea00 100644
--- a/drivers/media/common/videobuf2/frame_vector.c
+++ b/drivers/media/common/videobuf2/frame_vector.c
@@ -70,7 +70,7 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
 			break;
 
 		while (ret < nr_frames && start + PAGE_SIZE <= vma->vm_end) {
-			err = follow_pfn(vma, start, &nums[ret]);
+			err = unsafe_follow_pfn(vma, start, &nums[ret]);
 			if (err) {
 				if (ret == 0)
 					ret = err;
diff --git a/drivers/media/v4l2-core/videobuf-dma-contig.c b/drivers/media/v4l2-core/videobuf-dma-contig.c
index 52312ce2ba05..821c4a76ab96 100644
--- a/drivers/media/v4l2-core/videobuf-dma-contig.c
+++ b/drivers/media/v4l2-core/videobuf-dma-contig.c
@@ -183,7 +183,7 @@ static int videobuf_dma_contig_user_get(struct videobuf_dma_contig_memory *mem,
 	user_address = untagged_baddr;
 
 	while (pages_done < (mem->size >> PAGE_SHIFT)) {
-		ret = follow_pfn(vma, user_address, &this_pfn);
+		ret = unsafe_follow_pfn(vma, user_address, &this_pfn);
 		if (ret)
 			break;
 
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v5 10/15] vfio/type1: Mark follow_pfn as unsafe
  2020-10-30 10:08 [PATCH v5 00/15] follow_pfn and other iomap races Daniel Vetter
                   ` (8 preceding siblings ...)
  2020-10-30 10:08 ` [PATCH v5 09/15] media/videbuf1|2: Mark follow_pfn usage as unsafe Daniel Vetter
@ 2020-10-30 10:08 ` Daniel Vetter
  2020-10-30 10:08 ` [PATCH v5 11/15] PCI: Obey iomem restrictions for procfs mmap Daniel Vetter
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 62+ messages in thread
From: Daniel Vetter @ 2020-10-30 10:08 UTC (permalink / raw)
  To: DRI Development, LKML
  Cc: kvm, linux-mm, linux-arm-kernel, linux-samsung-soc, linux-media,
	Daniel Vetter, Daniel Vetter, Jason Gunthorpe, Kees Cook,
	Dan Williams, Andrew Morton, John Hubbard,
	Jérôme Glisse, Jan Kara, Alex Williamson,
	Cornelia Huck

The code seems to stuff these pfns into iommu pts (or something like
that, I didn't follow), but there's no mmu_notifier to ensure that
access is synchronized with pte updates.

Hence mark these as unsafe. This means that with
CONFIG_STRICT_FOLLOW_PFN, these will be rejected.

Real fix is to wire up an mmu_notifier ... somehow. Probably means any
invalidate is a fatal fault for this vfio device, but then this
shouldn't ever happen if userspace is reasonable.

Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kees Cook <keescook@chromium.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: linux-mm@kvack.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-samsung-soc@vger.kernel.org
Cc: linux-media@vger.kernel.org
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Cornelia Huck <cohuck@redhat.com>
Cc: kvm@vger.kernel.org
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 drivers/vfio/vfio_iommu_type1.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index bb2684cc245e..eccfee900033 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -421,7 +421,7 @@ static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
 {
 	int ret;
 
-	ret = follow_pfn(vma, vaddr, pfn);
+	ret = unsafe_follow_pfn(vma, vaddr, pfn);
 	if (ret) {
 		bool unlocked = false;
 
@@ -435,7 +435,7 @@ static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
 		if (ret)
 			return ret;
 
-		ret = follow_pfn(vma, vaddr, pfn);
+		ret = unsafe_follow_pfn(vma, vaddr, pfn);
 	}
 
 	return ret;
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v5 11/15] PCI: Obey iomem restrictions for procfs mmap
  2020-10-30 10:08 [PATCH v5 00/15] follow_pfn and other iomap races Daniel Vetter
                   ` (9 preceding siblings ...)
  2020-10-30 10:08 ` [PATCH v5 10/15] vfio/type1: Mark follow_pfn " Daniel Vetter
@ 2020-10-30 10:08 ` Daniel Vetter
  2020-11-03 21:28   ` Bjorn Helgaas
  2020-10-30 10:08 ` [PATCH v5 12/15] /dev/mem: Only set filp->f_mapping Daniel Vetter
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 62+ messages in thread
From: Daniel Vetter @ 2020-10-30 10:08 UTC (permalink / raw)
  To: DRI Development, LKML
  Cc: kvm, linux-mm, linux-arm-kernel, linux-samsung-soc, linux-media,
	Daniel Vetter, Daniel Vetter, Jason Gunthorpe, Kees Cook,
	Dan Williams, Andrew Morton, John Hubbard,
	Jérôme Glisse, Jan Kara, Bjorn Helgaas, linux-pci

There's three ways to access PCI BARs from userspace: /dev/mem, sysfs
files, and the old proc interface. Two check against
iomem_is_exclusive, proc never did. And with CONFIG_IO_STRICT_DEVMEM,
this starts to matter, since we don't want random userspace having
access to PCI BARs while a driver is loaded and using it.

Fix this by adding the same iomem_is_exclusive() check we already have
on the sysfs side in pci_mmap_resource().

References: 90a545e98126 ("restrict /dev/mem to idle io memory ranges")
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kees Cook <keescook@chromium.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: linux-mm@kvack.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-samsung-soc@vger.kernel.org
Cc: linux-media@vger.kernel.org
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-pci@vger.kernel.org
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
--
v2: Improve commit message (Bjorn)
---
 drivers/pci/proc.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/pci/proc.c b/drivers/pci/proc.c
index d35186b01d98..3a2f90beb4cb 100644
--- a/drivers/pci/proc.c
+++ b/drivers/pci/proc.c
@@ -274,6 +274,11 @@ static int proc_bus_pci_mmap(struct file *file, struct vm_area_struct *vma)
 		else
 			return -EINVAL;
 	}
+
+	if (dev->resource[i].flags & IORESOURCE_MEM &&
+	    iomem_is_exclusive(dev->resource[i].start))
+		return -EINVAL;
+
 	ret = pci_mmap_page_range(dev, i, vma,
 				  fpriv->mmap_state, write_combine);
 	if (ret < 0)
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v5 12/15] /dev/mem: Only set filp->f_mapping
  2020-10-30 10:08 [PATCH v5 00/15] follow_pfn and other iomap races Daniel Vetter
                   ` (10 preceding siblings ...)
  2020-10-30 10:08 ` [PATCH v5 11/15] PCI: Obey iomem restrictions for procfs mmap Daniel Vetter
@ 2020-10-30 10:08 ` Daniel Vetter
  2020-10-30 10:08 ` [PATCH v5 13/15] resource: Move devmem revoke code to resource framework Daniel Vetter
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 62+ messages in thread
From: Daniel Vetter @ 2020-10-30 10:08 UTC (permalink / raw)
  To: DRI Development, LKML
  Cc: kvm, linux-mm, linux-arm-kernel, linux-samsung-soc, linux-media,
	Daniel Vetter, Daniel Vetter, Jason Gunthorpe, Kees Cook,
	Dan Williams, Andrew Morton, John Hubbard,
	Jérôme Glisse, Jan Kara

When we care about pagecache maintenance, we need to make sure that
both f_mapping and i_mapping point at the right mapping.

But for iomem mappings we only care about the virtual/pte side of
things, so f_mapping is enough. Also setting inode->i_mapping was
confusing me as a driver maintainer, since in e.g. drivers/gpu we
don't do that. Per Dan this seems to be copypasta from places which do
care about pagecache consistency, but not needed. Hence remove it for
slightly less confusion.

Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kees Cook <keescook@chromium.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: linux-mm@kvack.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-samsung-soc@vger.kernel.org
Cc: linux-media@vger.kernel.org
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 drivers/char/mem.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index 94c2b556cf97..7dcf9e4ea79d 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -891,7 +891,6 @@ static int open_port(struct inode *inode, struct file *filp)
 	 * revocations when drivers want to take over a /dev/mem mapped
 	 * range.
 	 */
-	inode->i_mapping = devmem_inode->i_mapping;
 	filp->f_mapping = inode->i_mapping;
 
 	return 0;
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v5 13/15] resource: Move devmem revoke code to resource framework
  2020-10-30 10:08 [PATCH v5 00/15] follow_pfn and other iomap races Daniel Vetter
                   ` (11 preceding siblings ...)
  2020-10-30 10:08 ` [PATCH v5 12/15] /dev/mem: Only set filp->f_mapping Daniel Vetter
@ 2020-10-30 10:08 ` Daniel Vetter
  2020-10-31  6:36   ` John Hubbard
  2020-10-30 10:08 ` [PATCH v5 14/15] sysfs: Support zapping of binary attr mmaps Daniel Vetter
  2020-10-30 10:08 ` [PATCH v5 15/15] PCI: Revoke mappings like devmem Daniel Vetter
  14 siblings, 1 reply; 62+ messages in thread
From: Daniel Vetter @ 2020-10-30 10:08 UTC (permalink / raw)
  To: DRI Development, LKML
  Cc: kvm, linux-mm, linux-arm-kernel, linux-samsung-soc, linux-media,
	Daniel Vetter, Greg Kroah-Hartman, Daniel Vetter,
	Jason Gunthorpe, Kees Cook, Dan Williams, Andrew Morton,
	John Hubbard, Jérôme Glisse, Jan Kara, Arnd Bergmann,
	David Hildenbrand, Rafael J. Wysocki

We want all iomem mmaps to consistently revoke ptes when the kernel
takes over and CONFIG_IO_STRICT_DEVMEM is enabled. This includes the
pci bar mmaps available through procfs and sysfs, which currently do
not revoke mappings.

To prepare for this, move the code from the /dev/kmem driver to
kernel/resource.c.

Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kees Cook <keescook@chromium.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: linux-mm@kvack.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-samsung-soc@vger.kernel.org
Cc: linux-media@vger.kernel.org
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
--
v3:
- add barrier for consistency and document why we don't have to check
  for NULL (Jason)
v4
- Adjust comments to reflect the general nature of this iomem revoke
  code now (Dan)
---
 drivers/char/mem.c     |  85 +---------------------------------
 include/linux/ioport.h |   6 +--
 kernel/resource.c      | 101 ++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 102 insertions(+), 90 deletions(-)

diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index 7dcf9e4ea79d..43c871dc7477 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -31,9 +31,6 @@
 #include <linux/uio.h>
 #include <linux/uaccess.h>
 #include <linux/security.h>
-#include <linux/pseudo_fs.h>
-#include <uapi/linux/magic.h>
-#include <linux/mount.h>
 
 #ifdef CONFIG_IA64
 # include <linux/efi.h>
@@ -836,42 +833,6 @@ static loff_t memory_lseek(struct file *file, loff_t offset, int orig)
 	return ret;
 }
 
-static struct inode *devmem_inode;
-
-#ifdef CONFIG_IO_STRICT_DEVMEM
-void revoke_devmem(struct resource *res)
-{
-	/* pairs with smp_store_release() in devmem_init_inode() */
-	struct inode *inode = smp_load_acquire(&devmem_inode);
-
-	/*
-	 * Check that the initialization has completed. Losing the race
-	 * is ok because it means drivers are claiming resources before
-	 * the fs_initcall level of init and prevent /dev/mem from
-	 * establishing mappings.
-	 */
-	if (!inode)
-		return;
-
-	/*
-	 * The expectation is that the driver has successfully marked
-	 * the resource busy by this point, so devmem_is_allowed()
-	 * should start returning false, however for performance this
-	 * does not iterate the entire resource range.
-	 */
-	if (devmem_is_allowed(PHYS_PFN(res->start)) &&
-	    devmem_is_allowed(PHYS_PFN(res->end))) {
-		/*
-		 * *cringe* iomem=relaxed says "go ahead, what's the
-		 * worst that can happen?"
-		 */
-		return;
-	}
-
-	unmap_mapping_range(inode->i_mapping, res->start, resource_size(res), 1);
-}
-#endif
-
 static int open_port(struct inode *inode, struct file *filp)
 {
 	int rc;
@@ -891,7 +852,7 @@ static int open_port(struct inode *inode, struct file *filp)
 	 * revocations when drivers want to take over a /dev/mem mapped
 	 * range.
 	 */
-	filp->f_mapping = inode->i_mapping;
+	filp->f_mapping = iomem_get_mapping();
 
 	return 0;
 }
@@ -1023,48 +984,6 @@ static char *mem_devnode(struct device *dev, umode_t *mode)
 
 static struct class *mem_class;
 
-static int devmem_fs_init_fs_context(struct fs_context *fc)
-{
-	return init_pseudo(fc, DEVMEM_MAGIC) ? 0 : -ENOMEM;
-}
-
-static struct file_system_type devmem_fs_type = {
-	.name		= "devmem",
-	.owner		= THIS_MODULE,
-	.init_fs_context = devmem_fs_init_fs_context,
-	.kill_sb	= kill_anon_super,
-};
-
-static int devmem_init_inode(void)
-{
-	static struct vfsmount *devmem_vfs_mount;
-	static int devmem_fs_cnt;
-	struct inode *inode;
-	int rc;
-
-	rc = simple_pin_fs(&devmem_fs_type, &devmem_vfs_mount, &devmem_fs_cnt);
-	if (rc < 0) {
-		pr_err("Cannot mount /dev/mem pseudo filesystem: %d\n", rc);
-		return rc;
-	}
-
-	inode = alloc_anon_inode(devmem_vfs_mount->mnt_sb);
-	if (IS_ERR(inode)) {
-		rc = PTR_ERR(inode);
-		pr_err("Cannot allocate inode for /dev/mem: %d\n", rc);
-		simple_release_fs(&devmem_vfs_mount, &devmem_fs_cnt);
-		return rc;
-	}
-
-	/*
-	 * Publish /dev/mem initialized.
-	 * Pairs with smp_load_acquire() in revoke_devmem().
-	 */
-	smp_store_release(&devmem_inode, inode);
-
-	return 0;
-}
-
 static int __init chr_dev_init(void)
 {
 	int minor;
@@ -1086,8 +1005,6 @@ static int __init chr_dev_init(void)
 		 */
 		if ((minor == DEVPORT_MINOR) && !arch_has_dev_port())
 			continue;
-		if ((minor == DEVMEM_MINOR) && devmem_init_inode() != 0)
-			continue;
 
 		device_create(mem_class, NULL, MKDEV(MEM_MAJOR, minor),
 			      NULL, devlist[minor].name);
diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index 5135d4b86cd6..02a5466245c0 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -307,11 +307,7 @@ struct resource *devm_request_free_mem_region(struct device *dev,
 struct resource *request_free_mem_region(struct resource *base,
 		unsigned long size, const char *name);
 
-#ifdef CONFIG_IO_STRICT_DEVMEM
-void revoke_devmem(struct resource *res);
-#else
-static inline void revoke_devmem(struct resource *res) { };
-#endif
+extern struct address_space *iomem_get_mapping(void);
 
 #endif /* __ASSEMBLY__ */
 #endif	/* _LINUX_IOPORT_H */
diff --git a/kernel/resource.c b/kernel/resource.c
index 3ae2f56cc79d..5ecc3187fe2d 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -18,12 +18,15 @@
 #include <linux/spinlock.h>
 #include <linux/fs.h>
 #include <linux/proc_fs.h>
+#include <linux/pseudo_fs.h>
 #include <linux/sched.h>
 #include <linux/seq_file.h>
 #include <linux/device.h>
 #include <linux/pfn.h>
 #include <linux/mm.h>
+#include <linux/mount.h>
 #include <linux/resource_ext.h>
+#include <uapi/linux/magic.h>
 #include <asm/io.h>
 
 
@@ -1115,6 +1118,58 @@ resource_size_t resource_alignment(struct resource *res)
 
 static DECLARE_WAIT_QUEUE_HEAD(muxed_resource_wait);
 
+static struct inode *iomem_inode;
+
+#ifdef CONFIG_IO_STRICT_DEVMEM
+static void revoke_iomem(struct resource *res)
+{
+	/* pairs with smp_store_release() in iomem_init_inode() */
+	struct inode *inode = smp_load_acquire(&iomem_inode);
+
+	/*
+	 * Check that the initialization has completed. Losing the race
+	 * is ok because it means drivers are claiming resources before
+	 * the fs_initcall level of init and prevent iomem_get_mapping users
+	 * from establishing mappings.
+	 */
+	if (!inode)
+		return;
+
+	/*
+	 * The expectation is that the driver has successfully marked
+	 * the resource busy by this point, so devmem_is_allowed()
+	 * should start returning false, however for performance this
+	 * does not iterate the entire resource range.
+	 */
+	if (devmem_is_allowed(PHYS_PFN(res->start)) &&
+	    devmem_is_allowed(PHYS_PFN(res->end))) {
+		/*
+		 * *cringe* iomem=relaxed says "go ahead, what's the
+		 * worst that can happen?"
+		 */
+		return;
+	}
+
+	unmap_mapping_range(inode->i_mapping, res->start, resource_size(res), 1);
+}
+struct address_space *iomem_get_mapping(void)
+{
+	/*
+	 * This function is only called from file open paths, hence guaranteed
+	 * that fs_initcalls have completed and no need to check for NULL. But
+	 * since revoke_iomem can be called before the initcall we still need
+	 * the barrier to appease checkers.
+	 */
+	return smp_load_acquire(&iomem_inode)->i_mapping;
+}
+#else
+static void revoke_iomem(struct resource *res) {}
+struct address_space *iomem_get_mapping(void)
+{
+	return NULL;
+}
+#endif
+
 /**
  * __request_region - create a new busy resource region
  * @parent: parent resource descriptor
@@ -1182,7 +1237,7 @@ struct resource * __request_region(struct resource *parent,
 	write_unlock(&resource_lock);
 
 	if (res && orig_parent == &iomem_resource)
-		revoke_devmem(res);
+		revoke_iomem(res);
 
 	return res;
 }
@@ -1782,4 +1837,48 @@ static int __init strict_iomem(char *str)
 	return 1;
 }
 
+static int iomem_fs_init_fs_context(struct fs_context *fc)
+{
+	return init_pseudo(fc, DEVMEM_MAGIC) ? 0 : -ENOMEM;
+}
+
+static struct file_system_type iomem_fs_type = {
+	.name		= "iomem",
+	.owner		= THIS_MODULE,
+	.init_fs_context = iomem_fs_init_fs_context,
+	.kill_sb	= kill_anon_super,
+};
+
+static int __init iomem_init_inode(void)
+{
+	static struct vfsmount *iomem_vfs_mount;
+	static int iomem_fs_cnt;
+	struct inode *inode;
+	int rc;
+
+	rc = simple_pin_fs(&iomem_fs_type, &iomem_vfs_mount, &iomem_fs_cnt);
+	if (rc < 0) {
+		pr_err("Cannot mount iomem pseudo filesystem: %d\n", rc);
+		return rc;
+	}
+
+	inode = alloc_anon_inode(iomem_vfs_mount->mnt_sb);
+	if (IS_ERR(inode)) {
+		rc = PTR_ERR(inode);
+		pr_err("Cannot allocate inode for iomem: %d\n", rc);
+		simple_release_fs(&iomem_vfs_mount, &iomem_fs_cnt);
+		return rc;
+	}
+
+	/*
+	 * Publish iomem revocation inode initialized.
+	 * Pairs with smp_load_acquire() in revoke_iomem().
+	 */
+	smp_store_release(&iomem_inode, inode);
+
+	return 0;
+}
+
+fs_initcall(iomem_init_inode);
+
 __setup("iomem=", strict_iomem);
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v5 14/15] sysfs: Support zapping of binary attr mmaps
  2020-10-30 10:08 [PATCH v5 00/15] follow_pfn and other iomap races Daniel Vetter
                   ` (12 preceding siblings ...)
  2020-10-30 10:08 ` [PATCH v5 13/15] resource: Move devmem revoke code to resource framework Daniel Vetter
@ 2020-10-30 10:08 ` Daniel Vetter
  2020-10-30 10:08 ` [PATCH v5 15/15] PCI: Revoke mappings like devmem Daniel Vetter
  14 siblings, 0 replies; 62+ messages in thread
From: Daniel Vetter @ 2020-10-30 10:08 UTC (permalink / raw)
  To: DRI Development, LKML
  Cc: kvm, linux-mm, linux-arm-kernel, linux-samsung-soc, linux-media,
	Daniel Vetter, Greg Kroah-Hartman, Daniel Vetter,
	Jason Gunthorpe, Kees Cook, Dan Williams, Andrew Morton,
	John Hubbard, Jérôme Glisse, Jan Kara, Bjorn Helgaas,
	linux-pci, Rafael J. Wysocki, Christian Brauner, David S. Miller,
	Michael Ellerman, Sourabh Jain, Mauro Carvalho Chehab,
	Nayna Jain

We want to be able to revoke pci mmaps so that the same access rules
applies as for /dev/kmem. Revoke support for devmem was added in
3234ac664a87 ("/dev/mem: Revoke mappings when a driver claims the
region").

The simplest way to achieve this is by having the same filp->f_mapping
for all mappings, so that unmap_mapping_range can find them all, no
matter through which file they've been created. Since this must be set
at open time we need sysfs support for this.

Add an optional mapping parameter bin_attr, which is only consulted
when there's also an mmap callback, since without mmap support
allowing to adjust the ->f_mapping makes no sense.

Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kees Cook <keescook@chromium.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: linux-mm@kvack.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-samsung-soc@vger.kernel.org
Cc: linux-media@vger.kernel.org
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-pci@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Sourabh Jain <sourabhjain@linux.ibm.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Cc: Nayna Jain <nayna@linux.ibm.com>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 fs/sysfs/file.c       | 11 +++++++++++
 include/linux/sysfs.h |  2 ++
 2 files changed, 13 insertions(+)

diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c
index 96d0da65e088..9aefa7779b29 100644
--- a/fs/sysfs/file.c
+++ b/fs/sysfs/file.c
@@ -170,6 +170,16 @@ static int sysfs_kf_bin_mmap(struct kernfs_open_file *of,
 	return battr->mmap(of->file, kobj, battr, vma);
 }
 
+static int sysfs_kf_bin_open(struct kernfs_open_file *of)
+{
+	struct bin_attribute *battr = of->kn->priv;
+
+	if (battr->mapping)
+		of->file->f_mapping = battr->mapping;
+
+	return 0;
+}
+
 void sysfs_notify(struct kobject *kobj, const char *dir, const char *attr)
 {
 	struct kernfs_node *kn = kobj->sd, *tmp;
@@ -241,6 +251,7 @@ static const struct kernfs_ops sysfs_bin_kfops_mmap = {
 	.read		= sysfs_kf_bin_read,
 	.write		= sysfs_kf_bin_write,
 	.mmap		= sysfs_kf_bin_mmap,
+	.open		= sysfs_kf_bin_open,
 };
 
 int sysfs_add_file_mode_ns(struct kernfs_node *parent,
diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
index 2caa34c1ca1a..d76a1ddf83a3 100644
--- a/include/linux/sysfs.h
+++ b/include/linux/sysfs.h
@@ -164,11 +164,13 @@ __ATTRIBUTE_GROUPS(_name)
 
 struct file;
 struct vm_area_struct;
+struct address_space;
 
 struct bin_attribute {
 	struct attribute	attr;
 	size_t			size;
 	void			*private;
+	struct address_space	*mapping;
 	ssize_t (*read)(struct file *, struct kobject *, struct bin_attribute *,
 			char *, loff_t, size_t);
 	ssize_t (*write)(struct file *, struct kobject *, struct bin_attribute *,
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v5 15/15] PCI: Revoke mappings like devmem
  2020-10-30 10:08 [PATCH v5 00/15] follow_pfn and other iomap races Daniel Vetter
                   ` (13 preceding siblings ...)
  2020-10-30 10:08 ` [PATCH v5 14/15] sysfs: Support zapping of binary attr mmaps Daniel Vetter
@ 2020-10-30 10:08 ` Daniel Vetter
  2020-10-30 19:22   ` Dan Williams
  2020-11-03 21:30   ` Bjorn Helgaas
  14 siblings, 2 replies; 62+ messages in thread
From: Daniel Vetter @ 2020-10-30 10:08 UTC (permalink / raw)
  To: DRI Development, LKML
  Cc: kvm, linux-mm, linux-arm-kernel, linux-samsung-soc, linux-media,
	Daniel Vetter, Daniel Vetter, Jason Gunthorpe, Kees Cook,
	Dan Williams, Andrew Morton, John Hubbard,
	Jérôme Glisse, Jan Kara, Greg Kroah-Hartman,
	Bjorn Helgaas, linux-pci

Since 3234ac664a87 ("/dev/mem: Revoke mappings when a driver claims
the region") /dev/kmem zaps ptes when the kernel requests exclusive
acccess to an iomem region. And with CONFIG_IO_STRICT_DEVMEM, this is
the default for all driver uses.

Except there's two more ways to access PCI BARs: sysfs and proc mmap
support. Let's plug that hole.

For revoke_devmem() to work we need to link our vma into the same
address_space, with consistent vma->vm_pgoff. ->pgoff is already
adjusted, because that's how (io_)remap_pfn_range works, but for the
mapping we need to adjust vma->vm_file->f_mapping. The cleanest way is
to adjust this at at ->open time:

- for sysfs this is easy, now that binary attributes support this. We
  just set bin_attr->mapping when mmap is supported
- for procfs it's a bit more tricky, since procfs pci access has only
  one file per device, and access to a specific resources first needs
  to be set up with some ioctl calls. But mmap is only supported for
  the same resources as sysfs exposes with mmap support, and otherwise
  rejected, so we can set the mapping unconditionally at open time
  without harm.

A special consideration is for arch_can_pci_mmap_io() - we need to
make sure that the ->f_mapping doesn't alias between ioport and iomem
space. There's only 2 ways in-tree to support mmap of ioports: generic
pci mmap (ARCH_GENERIC_PCI_MMAP_RESOURCE), and sparc as the single
architecture hand-rolling. Both approach support ioport mmap through a
special pfn range and not through magic pte attributes. Aliasing is
therefore not a problem.

The only difference in access checks left is that sysfs PCI mmap does
not check for CAP_RAWIO. I'm not really sure whether that should be
added or not.

Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kees Cook <keescook@chromium.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: linux-mm@kvack.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-samsung-soc@vger.kernel.org
Cc: linux-media@vger.kernel.org
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-pci@vger.kernel.org
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
--
v2:
- Totally new approach: Adjust filp->f_mapping at open time. Note that
  this now works on all architectures, not just those support
  ARCH_GENERIC_PCI_MMAP_RESOURCE
---
 drivers/pci/pci-sysfs.c | 4 ++++
 drivers/pci/proc.c      | 1 +
 2 files changed, 5 insertions(+)

diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index d15c881e2e7e..3f1c31bc0b7c 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -929,6 +929,7 @@ void pci_create_legacy_files(struct pci_bus *b)
 	b->legacy_io->read = pci_read_legacy_io;
 	b->legacy_io->write = pci_write_legacy_io;
 	b->legacy_io->mmap = pci_mmap_legacy_io;
+	b->legacy_io->mapping = iomem_get_mapping();
 	pci_adjust_legacy_attr(b, pci_mmap_io);
 	error = device_create_bin_file(&b->dev, b->legacy_io);
 	if (error)
@@ -941,6 +942,7 @@ void pci_create_legacy_files(struct pci_bus *b)
 	b->legacy_mem->size = 1024*1024;
 	b->legacy_mem->attr.mode = 0600;
 	b->legacy_mem->mmap = pci_mmap_legacy_mem;
+	b->legacy_io->mapping = iomem_get_mapping();
 	pci_adjust_legacy_attr(b, pci_mmap_mem);
 	error = device_create_bin_file(&b->dev, b->legacy_mem);
 	if (error)
@@ -1156,6 +1158,8 @@ static int pci_create_attr(struct pci_dev *pdev, int num, int write_combine)
 			res_attr->mmap = pci_mmap_resource_uc;
 		}
 	}
+	if (res_attr->mmap)
+		res_attr->mapping = iomem_get_mapping();
 	res_attr->attr.name = res_attr_name;
 	res_attr->attr.mode = 0600;
 	res_attr->size = pci_resource_len(pdev, num);
diff --git a/drivers/pci/proc.c b/drivers/pci/proc.c
index 3a2f90beb4cb..9bab07302bbf 100644
--- a/drivers/pci/proc.c
+++ b/drivers/pci/proc.c
@@ -298,6 +298,7 @@ static int proc_bus_pci_open(struct inode *inode, struct file *file)
 	fpriv->write_combine = 0;
 
 	file->private_data = fpriv;
+	file->f_mapping = iomem_get_mapping();
 
 	return 0;
 }
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 05/15] mm/frame-vector: Use FOLL_LONGTERM
  2020-10-30 10:08 ` [PATCH v5 05/15] mm/frame-vector: Use FOLL_LONGTERM Daniel Vetter
@ 2020-10-30 14:11   ` Tomasz Figa
  2020-10-30 14:37     ` Daniel Vetter
  2020-10-31  2:55   ` John Hubbard
  1 sibling, 1 reply; 62+ messages in thread
From: Tomasz Figa @ 2020-10-30 14:11 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: DRI Development, LKML, kvm, Linux MM,
	list@263.net:IOMMU DRIVERS
	<iommu@lists.linux-foundation.org>,
	Joerg Roedel <joro@8bytes.org>,,
	linux-samsung-soc, Linux Media Mailing List, Daniel Vetter,
	Jason Gunthorpe, Pawel Osciak, Marek Szyprowski, Kyungmin Park,
	Mauro Carvalho Chehab, Andrew Morton, John Hubbard,
	Jérôme Glisse, Jan Kara, Dan Williams

On Fri, Oct 30, 2020 at 11:08 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>
> This is used by media/videbuf2 for persistent dma mappings, not just
> for a single dma operation and then freed again, so needs
> FOLL_LONGTERM.
>
> Unfortunately current pup_locked doesn't support FOLL_LONGTERM due to
> locking issues. Rework the code to pull the pup path out from the
> mmap_sem critical section as suggested by Jason.
>
> By relying entirely on the vma checks in pin_user_pages and follow_pfn
> (for vm_flags and vma_is_fsdax) we can also streamline the code a lot.
>
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Pawel Osciak <pawel@osciak.com>
> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> Cc: Kyungmin Park <kyungmin.park@samsung.com>
> Cc: Tomasz Figa <tfiga@chromium.org>
> Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Jérôme Glisse <jglisse@redhat.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: linux-mm@kvack.org
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-samsung-soc@vger.kernel.org
> Cc: linux-media@vger.kernel.org
> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> --
> v2: Streamline the code and further simplify the loop checks (Jason)
>
> v5: Review from Tomasz:
> - fix page counting for the follow_pfn case by resetting ret
> - drop gup_flags paramater, now unused
> ---
>  .../media/common/videobuf2/videobuf2-memops.c |  3 +-
>  include/linux/mm.h                            |  2 +-
>  mm/frame_vector.c                             | 53 ++++++-------------
>  3 files changed, 19 insertions(+), 39 deletions(-)
>

Thanks, looks good to me now.

Acked-by: Tomasz Figa <tfiga@chromium.org>

From reading the code, this is quite unlikely to introduce any
behavior changes, but just to be safe, did you have a chance to test
this with some V4L2 driver?

Best regards,
Tomasz

> diff --git a/drivers/media/common/videobuf2/videobuf2-memops.c b/drivers/media/common/videobuf2/videobuf2-memops.c
> index 6e9e05153f4e..9dd6c27162f4 100644
> --- a/drivers/media/common/videobuf2/videobuf2-memops.c
> +++ b/drivers/media/common/videobuf2/videobuf2-memops.c
> @@ -40,7 +40,6 @@ struct frame_vector *vb2_create_framevec(unsigned long start,
>         unsigned long first, last;
>         unsigned long nr;
>         struct frame_vector *vec;
> -       unsigned int flags = FOLL_FORCE | FOLL_WRITE;
>
>         first = start >> PAGE_SHIFT;
>         last = (start + length - 1) >> PAGE_SHIFT;
> @@ -48,7 +47,7 @@ struct frame_vector *vb2_create_framevec(unsigned long start,
>         vec = frame_vector_create(nr);
>         if (!vec)
>                 return ERR_PTR(-ENOMEM);
> -       ret = get_vaddr_frames(start & PAGE_MASK, nr, flags, vec);
> +       ret = get_vaddr_frames(start & PAGE_MASK, nr, vec);
>         if (ret < 0)
>                 goto out_destroy;
>         /* We accept only complete set of PFNs */
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ef360fe70aaf..d6b8e30dce2e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1765,7 +1765,7 @@ struct frame_vector {
>  struct frame_vector *frame_vector_create(unsigned int nr_frames);
>  void frame_vector_destroy(struct frame_vector *vec);
>  int get_vaddr_frames(unsigned long start, unsigned int nr_pfns,
> -                    unsigned int gup_flags, struct frame_vector *vec);
> +                    struct frame_vector *vec);
>  void put_vaddr_frames(struct frame_vector *vec);
>  int frame_vector_to_pages(struct frame_vector *vec);
>  void frame_vector_to_pfns(struct frame_vector *vec);
> diff --git a/mm/frame_vector.c b/mm/frame_vector.c
> index 10f82d5643b6..f8c34b895c76 100644
> --- a/mm/frame_vector.c
> +++ b/mm/frame_vector.c
> @@ -32,13 +32,12 @@
>   * This function takes care of grabbing mmap_lock as necessary.
>   */
>  int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
> -                    unsigned int gup_flags, struct frame_vector *vec)
> +                    struct frame_vector *vec)
>  {
>         struct mm_struct *mm = current->mm;
>         struct vm_area_struct *vma;
>         int ret = 0;
>         int err;
> -       int locked;
>
>         if (nr_frames == 0)
>                 return 0;
> @@ -48,40 +47,26 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
>
>         start = untagged_addr(start);
>
> -       mmap_read_lock(mm);
> -       locked = 1;
> -       vma = find_vma_intersection(mm, start, start + 1);
> -       if (!vma) {
> -               ret = -EFAULT;
> -               goto out;
> -       }
> -
> -       /*
> -        * While get_vaddr_frames() could be used for transient (kernel
> -        * controlled lifetime) pinning of memory pages all current
> -        * users establish long term (userspace controlled lifetime)
> -        * page pinning. Treat get_vaddr_frames() like
> -        * get_user_pages_longterm() and disallow it for filesystem-dax
> -        * mappings.
> -        */
> -       if (vma_is_fsdax(vma)) {
> -               ret = -EOPNOTSUPP;
> -               goto out;
> -       }
> -
> -       if (!(vma->vm_flags & (VM_IO | VM_PFNMAP))) {
> +       ret = pin_user_pages_fast(start, nr_frames,
> +                                 FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM,
> +                                 (struct page **)(vec->ptrs));
> +       if (ret > 0) {
>                 vec->got_ref = true;
>                 vec->is_pfns = false;
> -               ret = pin_user_pages_locked(start, nr_frames,
> -                       gup_flags, (struct page **)(vec->ptrs), &locked);
> -               goto out;
> +               goto out_unlocked;
>         }
>
> +       mmap_read_lock(mm);
>         vec->got_ref = false;
>         vec->is_pfns = true;
> +       ret = 0;
>         do {
>                 unsigned long *nums = frame_vector_pfns(vec);
>
> +               vma = find_vma_intersection(mm, start, start + 1);
> +               if (!vma)
> +                       break;
> +
>                 while (ret < nr_frames && start + PAGE_SIZE <= vma->vm_end) {
>                         err = follow_pfn(vma, start, &nums[ret]);
>                         if (err) {
> @@ -92,17 +77,13 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
>                         start += PAGE_SIZE;
>                         ret++;
>                 }
> -               /*
> -                * We stop if we have enough pages or if VMA doesn't completely
> -                * cover the tail page.
> -                */
> -               if (ret >= nr_frames || start < vma->vm_end)
> +               /* Bail out if VMA doesn't completely cover the tail page. */
> +               if (start < vma->vm_end)
>                         break;
> -               vma = find_vma_intersection(mm, start, start + 1);
> -       } while (vma && vma->vm_flags & (VM_IO | VM_PFNMAP));
> +       } while (ret < nr_frames);
>  out:
> -       if (locked)
> -               mmap_read_unlock(mm);
> +       mmap_read_unlock(mm);
> +out_unlocked:
>         if (!ret)
>                 ret = -EFAULT;
>         if (ret > 0)
> --
> 2.28.0
>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 05/15] mm/frame-vector: Use FOLL_LONGTERM
  2020-10-30 14:11   ` Tomasz Figa
@ 2020-10-30 14:37     ` Daniel Vetter
  2020-11-02 18:19       ` Tomasz Figa
  0 siblings, 1 reply; 62+ messages in thread
From: Daniel Vetter @ 2020-10-30 14:37 UTC (permalink / raw)
  To: Tomasz Figa
  Cc: Jérôme Glisse, linux-samsung-soc, Jan Kara,
	Pawel Osciak, kvm, Jason Gunthorpe, John Hubbard,
	Mauro Carvalho Chehab, LKML, DRI Development, Linux MM,
	Kyungmin Park, Daniel Vetter, Andrew Morton, Marek Szyprowski,
	Dan Williams,
	list@263.net:IOMMU DRIVERS
	<iommu@lists.linux-foundation.org>,
	Joerg Roedel <joro@8bytes.org>,,
	Linux Media Mailing List

On Fri, Oct 30, 2020 at 3:11 PM Tomasz Figa <tfiga@chromium.org> wrote:
>
> On Fri, Oct 30, 2020 at 11:08 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> >
> > This is used by media/videbuf2 for persistent dma mappings, not just
> > for a single dma operation and then freed again, so needs
> > FOLL_LONGTERM.
> >
> > Unfortunately current pup_locked doesn't support FOLL_LONGTERM due to
> > locking issues. Rework the code to pull the pup path out from the
> > mmap_sem critical section as suggested by Jason.
> >
> > By relying entirely on the vma checks in pin_user_pages and follow_pfn
> > (for vm_flags and vma_is_fsdax) we can also streamline the code a lot.
> >
> > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > Cc: Jason Gunthorpe <jgg@ziepe.ca>
> > Cc: Pawel Osciak <pawel@osciak.com>
> > Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> > Cc: Kyungmin Park <kyungmin.park@samsung.com>
> > Cc: Tomasz Figa <tfiga@chromium.org>
> > Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: John Hubbard <jhubbard@nvidia.com>
> > Cc: Jérôme Glisse <jglisse@redhat.com>
> > Cc: Jan Kara <jack@suse.cz>
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > Cc: linux-mm@kvack.org
> > Cc: linux-arm-kernel@lists.infradead.org
> > Cc: linux-samsung-soc@vger.kernel.org
> > Cc: linux-media@vger.kernel.org
> > Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> > --
> > v2: Streamline the code and further simplify the loop checks (Jason)
> >
> > v5: Review from Tomasz:
> > - fix page counting for the follow_pfn case by resetting ret
> > - drop gup_flags paramater, now unused
> > ---
> >  .../media/common/videobuf2/videobuf2-memops.c |  3 +-
> >  include/linux/mm.h                            |  2 +-
> >  mm/frame_vector.c                             | 53 ++++++-------------
> >  3 files changed, 19 insertions(+), 39 deletions(-)
> >
>
> Thanks, looks good to me now.
>
> Acked-by: Tomasz Figa <tfiga@chromium.org>
>
> From reading the code, this is quite unlikely to introduce any
> behavior changes, but just to be safe, did you have a chance to test
> this with some V4L2 driver?

Nah, unfortunately not.
-Daniel

>
> Best regards,
> Tomasz
>
> > diff --git a/drivers/media/common/videobuf2/videobuf2-memops.c b/drivers/media/common/videobuf2/videobuf2-memops.c
> > index 6e9e05153f4e..9dd6c27162f4 100644
> > --- a/drivers/media/common/videobuf2/videobuf2-memops.c
> > +++ b/drivers/media/common/videobuf2/videobuf2-memops.c
> > @@ -40,7 +40,6 @@ struct frame_vector *vb2_create_framevec(unsigned long start,
> >         unsigned long first, last;
> >         unsigned long nr;
> >         struct frame_vector *vec;
> > -       unsigned int flags = FOLL_FORCE | FOLL_WRITE;
> >
> >         first = start >> PAGE_SHIFT;
> >         last = (start + length - 1) >> PAGE_SHIFT;
> > @@ -48,7 +47,7 @@ struct frame_vector *vb2_create_framevec(unsigned long start,
> >         vec = frame_vector_create(nr);
> >         if (!vec)
> >                 return ERR_PTR(-ENOMEM);
> > -       ret = get_vaddr_frames(start & PAGE_MASK, nr, flags, vec);
> > +       ret = get_vaddr_frames(start & PAGE_MASK, nr, vec);
> >         if (ret < 0)
> >                 goto out_destroy;
> >         /* We accept only complete set of PFNs */
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index ef360fe70aaf..d6b8e30dce2e 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1765,7 +1765,7 @@ struct frame_vector {
> >  struct frame_vector *frame_vector_create(unsigned int nr_frames);
> >  void frame_vector_destroy(struct frame_vector *vec);
> >  int get_vaddr_frames(unsigned long start, unsigned int nr_pfns,
> > -                    unsigned int gup_flags, struct frame_vector *vec);
> > +                    struct frame_vector *vec);
> >  void put_vaddr_frames(struct frame_vector *vec);
> >  int frame_vector_to_pages(struct frame_vector *vec);
> >  void frame_vector_to_pfns(struct frame_vector *vec);
> > diff --git a/mm/frame_vector.c b/mm/frame_vector.c
> > index 10f82d5643b6..f8c34b895c76 100644
> > --- a/mm/frame_vector.c
> > +++ b/mm/frame_vector.c
> > @@ -32,13 +32,12 @@
> >   * This function takes care of grabbing mmap_lock as necessary.
> >   */
> >  int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
> > -                    unsigned int gup_flags, struct frame_vector *vec)
> > +                    struct frame_vector *vec)
> >  {
> >         struct mm_struct *mm = current->mm;
> >         struct vm_area_struct *vma;
> >         int ret = 0;
> >         int err;
> > -       int locked;
> >
> >         if (nr_frames == 0)
> >                 return 0;
> > @@ -48,40 +47,26 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
> >
> >         start = untagged_addr(start);
> >
> > -       mmap_read_lock(mm);
> > -       locked = 1;
> > -       vma = find_vma_intersection(mm, start, start + 1);
> > -       if (!vma) {
> > -               ret = -EFAULT;
> > -               goto out;
> > -       }
> > -
> > -       /*
> > -        * While get_vaddr_frames() could be used for transient (kernel
> > -        * controlled lifetime) pinning of memory pages all current
> > -        * users establish long term (userspace controlled lifetime)
> > -        * page pinning. Treat get_vaddr_frames() like
> > -        * get_user_pages_longterm() and disallow it for filesystem-dax
> > -        * mappings.
> > -        */
> > -       if (vma_is_fsdax(vma)) {
> > -               ret = -EOPNOTSUPP;
> > -               goto out;
> > -       }
> > -
> > -       if (!(vma->vm_flags & (VM_IO | VM_PFNMAP))) {
> > +       ret = pin_user_pages_fast(start, nr_frames,
> > +                                 FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM,
> > +                                 (struct page **)(vec->ptrs));
> > +       if (ret > 0) {
> >                 vec->got_ref = true;
> >                 vec->is_pfns = false;
> > -               ret = pin_user_pages_locked(start, nr_frames,
> > -                       gup_flags, (struct page **)(vec->ptrs), &locked);
> > -               goto out;
> > +               goto out_unlocked;
> >         }
> >
> > +       mmap_read_lock(mm);
> >         vec->got_ref = false;
> >         vec->is_pfns = true;
> > +       ret = 0;
> >         do {
> >                 unsigned long *nums = frame_vector_pfns(vec);
> >
> > +               vma = find_vma_intersection(mm, start, start + 1);
> > +               if (!vma)
> > +                       break;
> > +
> >                 while (ret < nr_frames && start + PAGE_SIZE <= vma->vm_end) {
> >                         err = follow_pfn(vma, start, &nums[ret]);
> >                         if (err) {
> > @@ -92,17 +77,13 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
> >                         start += PAGE_SIZE;
> >                         ret++;
> >                 }
> > -               /*
> > -                * We stop if we have enough pages or if VMA doesn't completely
> > -                * cover the tail page.
> > -                */
> > -               if (ret >= nr_frames || start < vma->vm_end)
> > +               /* Bail out if VMA doesn't completely cover the tail page. */
> > +               if (start < vma->vm_end)
> >                         break;
> > -               vma = find_vma_intersection(mm, start, start + 1);
> > -       } while (vma && vma->vm_flags & (VM_IO | VM_PFNMAP));
> > +       } while (ret < nr_frames);
> >  out:
> > -       if (locked)
> > -               mmap_read_unlock(mm);
> > +       mmap_read_unlock(mm);
> > +out_unlocked:
> >         if (!ret)
> >                 ret = -EFAULT;
> >         if (ret > 0)
> > --
> > 2.28.0
> >
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 15/15] PCI: Revoke mappings like devmem
  2020-10-30 10:08 ` [PATCH v5 15/15] PCI: Revoke mappings like devmem Daniel Vetter
@ 2020-10-30 19:22   ` Dan Williams
  2020-11-03 21:30   ` Bjorn Helgaas
  1 sibling, 0 replies; 62+ messages in thread
From: Dan Williams @ 2020-10-30 19:22 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: DRI Development, LKML, KVM list, Linux MM, Linux ARM,
	linux-samsung-soc, Linux-media@vger.kernel.org, Daniel Vetter,
	Jason Gunthorpe, Kees Cook, Andrew Morton, John Hubbard,
	Jérôme Glisse, Jan Kara, Greg Kroah-Hartman,
	Bjorn Helgaas, Linux PCI

On Fri, Oct 30, 2020 at 3:09 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>
> Since 3234ac664a87 ("/dev/mem: Revoke mappings when a driver claims
> the region") /dev/kmem zaps ptes when the kernel requests exclusive
> acccess to an iomem region. And with CONFIG_IO_STRICT_DEVMEM, this is
> the default for all driver uses.
>
> Except there's two more ways to access PCI BARs: sysfs and proc mmap
> support. Let's plug that hole.
>
> For revoke_devmem() to work we need to link our vma into the same
> address_space, with consistent vma->vm_pgoff. ->pgoff is already
> adjusted, because that's how (io_)remap_pfn_range works, but for the
> mapping we need to adjust vma->vm_file->f_mapping. The cleanest way is
> to adjust this at at ->open time:
>
> - for sysfs this is easy, now that binary attributes support this. We
>   just set bin_attr->mapping when mmap is supported
> - for procfs it's a bit more tricky, since procfs pci access has only
>   one file per device, and access to a specific resources first needs
>   to be set up with some ioctl calls. But mmap is only supported for
>   the same resources as sysfs exposes with mmap support, and otherwise
>   rejected, so we can set the mapping unconditionally at open time
>   without harm.
>
> A special consideration is for arch_can_pci_mmap_io() - we need to
> make sure that the ->f_mapping doesn't alias between ioport and iomem
> space. There's only 2 ways in-tree to support mmap of ioports: generic
> pci mmap (ARCH_GENERIC_PCI_MMAP_RESOURCE), and sparc as the single
> architecture hand-rolling. Both approach support ioport mmap through a
> special pfn range and not through magic pte attributes. Aliasing is
> therefore not a problem.
>
> The only difference in access checks left is that sysfs PCI mmap does
> not check for CAP_RAWIO. I'm not really sure whether that should be
> added or not.
>
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Jérôme Glisse <jglisse@redhat.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Dan Williams <dan.j.williams@intel.com>

Looks good to me:

Reviewed-by: Dan Williams <dan.j.williams@intel.com>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 05/15] mm/frame-vector: Use FOLL_LONGTERM
  2020-10-30 10:08 ` [PATCH v5 05/15] mm/frame-vector: Use FOLL_LONGTERM Daniel Vetter
  2020-10-30 14:11   ` Tomasz Figa
@ 2020-10-31  2:55   ` John Hubbard
  2020-10-31 14:45     ` Daniel Vetter
  1 sibling, 1 reply; 62+ messages in thread
From: John Hubbard @ 2020-10-31  2:55 UTC (permalink / raw)
  To: Daniel Vetter, DRI Development, LKML
  Cc: kvm, linux-mm, linux-arm-kernel, linux-samsung-soc, linux-media,
	Daniel Vetter, Jason Gunthorpe, Pawel Osciak, Marek Szyprowski,
	Kyungmin Park, Tomasz Figa, Mauro Carvalho Chehab, Andrew Morton,
	Jérôme Glisse, Jan Kara, Dan Williams

On 10/30/20 3:08 AM, Daniel Vetter wrote:
> This is used by media/videbuf2 for persistent dma mappings, not just
> for a single dma operation and then freed again, so needs
> FOLL_LONGTERM.
> 
> Unfortunately current pup_locked doesn't support FOLL_LONGTERM due to
> locking issues. Rework the code to pull the pup path out from the
> mmap_sem critical section as suggested by Jason.
> 
> By relying entirely on the vma checks in pin_user_pages and follow_pfn

There are vma checks in pin_user_pages(), but this patch changes things
to call pin_user_pages_fast(). And that does not have the vma checks.
More below about this:

> (for vm_flags and vma_is_fsdax) we can also streamline the code a lot.
> 
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Pawel Osciak <pawel@osciak.com>
> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> Cc: Kyungmin Park <kyungmin.park@samsung.com>
> Cc: Tomasz Figa <tfiga@chromium.org>
> Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Jérôme Glisse <jglisse@redhat.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: linux-mm@kvack.org
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-samsung-soc@vger.kernel.org
> Cc: linux-media@vger.kernel.org
> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> --
> v2: Streamline the code and further simplify the loop checks (Jason)
> 
> v5: Review from Tomasz:
> - fix page counting for the follow_pfn case by resetting ret
> - drop gup_flags paramater, now unused
> ---
>   .../media/common/videobuf2/videobuf2-memops.c |  3 +-
>   include/linux/mm.h                            |  2 +-
>   mm/frame_vector.c                             | 53 ++++++-------------
>   3 files changed, 19 insertions(+), 39 deletions(-)
> 
> diff --git a/drivers/media/common/videobuf2/videobuf2-memops.c b/drivers/media/common/videobuf2/videobuf2-memops.c
> index 6e9e05153f4e..9dd6c27162f4 100644
> --- a/drivers/media/common/videobuf2/videobuf2-memops.c
> +++ b/drivers/media/common/videobuf2/videobuf2-memops.c
> @@ -40,7 +40,6 @@ struct frame_vector *vb2_create_framevec(unsigned long start,
>   	unsigned long first, last;
>   	unsigned long nr;
>   	struct frame_vector *vec;
> -	unsigned int flags = FOLL_FORCE | FOLL_WRITE;
>   
>   	first = start >> PAGE_SHIFT;
>   	last = (start + length - 1) >> PAGE_SHIFT;
> @@ -48,7 +47,7 @@ struct frame_vector *vb2_create_framevec(unsigned long start,
>   	vec = frame_vector_create(nr);
>   	if (!vec)
>   		return ERR_PTR(-ENOMEM);
> -	ret = get_vaddr_frames(start & PAGE_MASK, nr, flags, vec);
> +	ret = get_vaddr_frames(start & PAGE_MASK, nr, vec);
>   	if (ret < 0)
>   		goto out_destroy;
>   	/* We accept only complete set of PFNs */
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ef360fe70aaf..d6b8e30dce2e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1765,7 +1765,7 @@ struct frame_vector {
>   struct frame_vector *frame_vector_create(unsigned int nr_frames);
>   void frame_vector_destroy(struct frame_vector *vec);
>   int get_vaddr_frames(unsigned long start, unsigned int nr_pfns,
> -		     unsigned int gup_flags, struct frame_vector *vec);
> +		     struct frame_vector *vec);
>   void put_vaddr_frames(struct frame_vector *vec);
>   int frame_vector_to_pages(struct frame_vector *vec);
>   void frame_vector_to_pfns(struct frame_vector *vec);
> diff --git a/mm/frame_vector.c b/mm/frame_vector.c
> index 10f82d5643b6..f8c34b895c76 100644
> --- a/mm/frame_vector.c
> +++ b/mm/frame_vector.c
> @@ -32,13 +32,12 @@
>    * This function takes care of grabbing mmap_lock as necessary.
>    */
>   int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
> -		     unsigned int gup_flags, struct frame_vector *vec)
> +		     struct frame_vector *vec)
>   {
>   	struct mm_struct *mm = current->mm;
>   	struct vm_area_struct *vma;
>   	int ret = 0;
>   	int err;
> -	int locked;
>   
>   	if (nr_frames == 0)
>   		return 0;
> @@ -48,40 +47,26 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
>   
>   	start = untagged_addr(start);
>   
> -	mmap_read_lock(mm);
> -	locked = 1;
> -	vma = find_vma_intersection(mm, start, start + 1);
> -	if (!vma) {
> -		ret = -EFAULT;
> -		goto out;
> -	}
> -
> -	/*
> -	 * While get_vaddr_frames() could be used for transient (kernel
> -	 * controlled lifetime) pinning of memory pages all current
> -	 * users establish long term (userspace controlled lifetime)
> -	 * page pinning. Treat get_vaddr_frames() like
> -	 * get_user_pages_longterm() and disallow it for filesystem-dax
> -	 * mappings.
> -	 */
> -	if (vma_is_fsdax(vma)) {
> -		ret = -EOPNOTSUPP;
> -		goto out;
> -	}
> -
> -	if (!(vma->vm_flags & (VM_IO | VM_PFNMAP))) {

By removing this check from this location, and changing from
pin_user_pages_locked() to pin_user_pages_fast(), I *think* we end up
losing the check entirely. Is that intended? If so it could use a comment
somewhere to explain why.

thanks,
-- 
John Hubbard
NVIDIA

> +	ret = pin_user_pages_fast(start, nr_frames,
> +				  FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM,
> +				  (struct page **)(vec->ptrs));
> +	if (ret > 0) {
>   		vec->got_ref = true;
>   		vec->is_pfns = false;
> -		ret = pin_user_pages_locked(start, nr_frames,
> -			gup_flags, (struct page **)(vec->ptrs), &locked);
> -		goto out;
> +		goto out_unlocked;
>   	}
>   
> +	mmap_read_lock(mm);
>   	vec->got_ref = false;
>   	vec->is_pfns = true;
> +	ret = 0;
>   	do {
>   		unsigned long *nums = frame_vector_pfns(vec);
>   
> +		vma = find_vma_intersection(mm, start, start + 1);
> +		if (!vma)
> +			break;
> +
>   		while (ret < nr_frames && start + PAGE_SIZE <= vma->vm_end) {
>   			err = follow_pfn(vma, start, &nums[ret]);
>   			if (err) {
> @@ -92,17 +77,13 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
>   			start += PAGE_SIZE;
>   			ret++;
>   		}
> -		/*
> -		 * We stop if we have enough pages or if VMA doesn't completely
> -		 * cover the tail page.
> -		 */
> -		if (ret >= nr_frames || start < vma->vm_end)
> +		/* Bail out if VMA doesn't completely cover the tail page. */
> +		if (start < vma->vm_end)
>   			break;
> -		vma = find_vma_intersection(mm, start, start + 1);
> -	} while (vma && vma->vm_flags & (VM_IO | VM_PFNMAP));
> +	} while (ret < nr_frames);
>   out:
> -	if (locked)
> -		mmap_read_unlock(mm);
> +	mmap_read_unlock(mm);
> +out_unlocked:
>   	if (!ret)
>   		ret = -EFAULT;
>   	if (ret > 0)
> 



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 13/15] resource: Move devmem revoke code to resource framework
  2020-10-30 10:08 ` [PATCH v5 13/15] resource: Move devmem revoke code to resource framework Daniel Vetter
@ 2020-10-31  6:36   ` John Hubbard
  2020-10-31 14:40     ` Daniel Vetter
  0 siblings, 1 reply; 62+ messages in thread
From: John Hubbard @ 2020-10-31  6:36 UTC (permalink / raw)
  To: Daniel Vetter, DRI Development, LKML
  Cc: kvm, linux-mm, linux-arm-kernel, linux-samsung-soc, linux-media,
	Greg Kroah-Hartman, Daniel Vetter, Jason Gunthorpe, Kees Cook,
	Dan Williams, Andrew Morton, Jérôme Glisse, Jan Kara,
	Arnd Bergmann, David Hildenbrand, Rafael J. Wysocki

On 10/30/20 3:08 AM, Daniel Vetter wrote:
> We want all iomem mmaps to consistently revoke ptes when the kernel
> takes over and CONFIG_IO_STRICT_DEVMEM is enabled. This includes the
> pci bar mmaps available through procfs and sysfs, which currently do
> not revoke mappings.
> 
> To prepare for this, move the code from the /dev/kmem driver to
> kernel/resource.c.

This seems like it's doing a lot more than just code movement, right?
Should we list some of that here?

Also, I'm seeing a crash due to this commit. More below:

> 
> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Jérôme Glisse <jglisse@redhat.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: linux-mm@kvack.org
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-samsung-soc@vger.kernel.org
> Cc: linux-media@vger.kernel.org
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> --
> v3:
> - add barrier for consistency and document why we don't have to check
>    for NULL (Jason)
> v4
> - Adjust comments to reflect the general nature of this iomem revoke
>    code now (Dan)
> ---
>   drivers/char/mem.c     |  85 +---------------------------------
>   include/linux/ioport.h |   6 +--
>   kernel/resource.c      | 101 ++++++++++++++++++++++++++++++++++++++++-
>   3 files changed, 102 insertions(+), 90 deletions(-)
> 
> diff --git a/drivers/char/mem.c b/drivers/char/mem.c
> index 7dcf9e4ea79d..43c871dc7477 100644
> --- a/drivers/char/mem.c
> +++ b/drivers/char/mem.c
> @@ -31,9 +31,6 @@
>   #include <linux/uio.h>
>   #include <linux/uaccess.h>
>   #include <linux/security.h>
> -#include <linux/pseudo_fs.h>
> -#include <uapi/linux/magic.h>
> -#include <linux/mount.h>
>   
>   #ifdef CONFIG_IA64
>   # include <linux/efi.h>
> @@ -836,42 +833,6 @@ static loff_t memory_lseek(struct file *file, loff_t offset, int orig)
>   	return ret;
>   }
>   
> -static struct inode *devmem_inode;
> -
> -#ifdef CONFIG_IO_STRICT_DEVMEM
> -void revoke_devmem(struct resource *res)
> -{
> -	/* pairs with smp_store_release() in devmem_init_inode() */
> -	struct inode *inode = smp_load_acquire(&devmem_inode);
> -
> -	/*
> -	 * Check that the initialization has completed. Losing the race
> -	 * is ok because it means drivers are claiming resources before
> -	 * the fs_initcall level of init and prevent /dev/mem from
> -	 * establishing mappings.
> -	 */
> -	if (!inode)
> -		return;
> -
> -	/*
> -	 * The expectation is that the driver has successfully marked
> -	 * the resource busy by this point, so devmem_is_allowed()
> -	 * should start returning false, however for performance this
> -	 * does not iterate the entire resource range.
> -	 */
> -	if (devmem_is_allowed(PHYS_PFN(res->start)) &&
> -	    devmem_is_allowed(PHYS_PFN(res->end))) {
> -		/*
> -		 * *cringe* iomem=relaxed says "go ahead, what's the
> -		 * worst that can happen?"
> -		 */
> -		return;
> -	}
> -
> -	unmap_mapping_range(inode->i_mapping, res->start, resource_size(res), 1);
> -}
> -#endif
> -
>   static int open_port(struct inode *inode, struct file *filp)
>   {
>   	int rc;
> @@ -891,7 +852,7 @@ static int open_port(struct inode *inode, struct file *filp)
>   	 * revocations when drivers want to take over a /dev/mem mapped
>   	 * range.
>   	 */
> -	filp->f_mapping = inode->i_mapping;
> +	filp->f_mapping = iomem_get_mapping();


The problem is that iomem_get_mapping() returns NULL for the !CONFIG_IO_STRICT_DEVMEM
case. And then we have pre-existing fs code that expects to go "up and over", like this:


static int do_dentry_open(struct file *f,
			  struct inode *inode,
			  int (*open)(struct inode *, struct file *))
{
...

	file_ra_state_init(&f->f_ra, f->f_mapping->host->i_mapping);

...and it crashes on that line fairly early in bootup.

Not sure what to suggest for this patch, but wanted to get this report out at least.

thanks,
-- 
John Hubbard
NVIDIA

>   
>   	return 0;
>   }
> @@ -1023,48 +984,6 @@ static char *mem_devnode(struct device *dev, umode_t *mode)
>   
>   static struct class *mem_class;
>   
> -static int devmem_fs_init_fs_context(struct fs_context *fc)
> -{
> -	return init_pseudo(fc, DEVMEM_MAGIC) ? 0 : -ENOMEM;
> -}
> -
> -static struct file_system_type devmem_fs_type = {
> -	.name		= "devmem",
> -	.owner		= THIS_MODULE,
> -	.init_fs_context = devmem_fs_init_fs_context,
> -	.kill_sb	= kill_anon_super,
> -};
> -
> -static int devmem_init_inode(void)
> -{
> -	static struct vfsmount *devmem_vfs_mount;
> -	static int devmem_fs_cnt;
> -	struct inode *inode;
> -	int rc;
> -
> -	rc = simple_pin_fs(&devmem_fs_type, &devmem_vfs_mount, &devmem_fs_cnt);
> -	if (rc < 0) {
> -		pr_err("Cannot mount /dev/mem pseudo filesystem: %d\n", rc);
> -		return rc;
> -	}
> -
> -	inode = alloc_anon_inode(devmem_vfs_mount->mnt_sb);
> -	if (IS_ERR(inode)) {
> -		rc = PTR_ERR(inode);
> -		pr_err("Cannot allocate inode for /dev/mem: %d\n", rc);
> -		simple_release_fs(&devmem_vfs_mount, &devmem_fs_cnt);
> -		return rc;
> -	}
> -
> -	/*
> -	 * Publish /dev/mem initialized.
> -	 * Pairs with smp_load_acquire() in revoke_devmem().
> -	 */
> -	smp_store_release(&devmem_inode, inode);
> -
> -	return 0;
> -}
> -
>   static int __init chr_dev_init(void)
>   {
>   	int minor;
> @@ -1086,8 +1005,6 @@ static int __init chr_dev_init(void)
>   		 */
>   		if ((minor == DEVPORT_MINOR) && !arch_has_dev_port())
>   			continue;
> -		if ((minor == DEVMEM_MINOR) && devmem_init_inode() != 0)
> -			continue;
>   
>   		device_create(mem_class, NULL, MKDEV(MEM_MAJOR, minor),
>   			      NULL, devlist[minor].name);
> diff --git a/include/linux/ioport.h b/include/linux/ioport.h
> index 5135d4b86cd6..02a5466245c0 100644
> --- a/include/linux/ioport.h
> +++ b/include/linux/ioport.h
> @@ -307,11 +307,7 @@ struct resource *devm_request_free_mem_region(struct device *dev,
>   struct resource *request_free_mem_region(struct resource *base,
>   		unsigned long size, const char *name);
>   
> -#ifdef CONFIG_IO_STRICT_DEVMEM
> -void revoke_devmem(struct resource *res);
> -#else
> -static inline void revoke_devmem(struct resource *res) { };
> -#endif
> +extern struct address_space *iomem_get_mapping(void);
>   
>   #endif /* __ASSEMBLY__ */
>   #endif	/* _LINUX_IOPORT_H */
> diff --git a/kernel/resource.c b/kernel/resource.c
> index 3ae2f56cc79d..5ecc3187fe2d 100644
> --- a/kernel/resource.c
> +++ b/kernel/resource.c
> @@ -18,12 +18,15 @@
>   #include <linux/spinlock.h>
>   #include <linux/fs.h>
>   #include <linux/proc_fs.h>
> +#include <linux/pseudo_fs.h>
>   #include <linux/sched.h>
>   #include <linux/seq_file.h>
>   #include <linux/device.h>
>   #include <linux/pfn.h>
>   #include <linux/mm.h>
> +#include <linux/mount.h>
>   #include <linux/resource_ext.h>
> +#include <uapi/linux/magic.h>
>   #include <asm/io.h>
>   
>   
> @@ -1115,6 +1118,58 @@ resource_size_t resource_alignment(struct resource *res)
>   
>   static DECLARE_WAIT_QUEUE_HEAD(muxed_resource_wait);
>   
> +static struct inode *iomem_inode;
> +
> +#ifdef CONFIG_IO_STRICT_DEVMEM
> +static void revoke_iomem(struct resource *res)
> +{
> +	/* pairs with smp_store_release() in iomem_init_inode() */
> +	struct inode *inode = smp_load_acquire(&iomem_inode);
> +
> +	/*
> +	 * Check that the initialization has completed. Losing the race
> +	 * is ok because it means drivers are claiming resources before
> +	 * the fs_initcall level of init and prevent iomem_get_mapping users
> +	 * from establishing mappings.
> +	 */
> +	if (!inode)
> +		return;
> +
> +	/*
> +	 * The expectation is that the driver has successfully marked
> +	 * the resource busy by this point, so devmem_is_allowed()
> +	 * should start returning false, however for performance this
> +	 * does not iterate the entire resource range.
> +	 */
> +	if (devmem_is_allowed(PHYS_PFN(res->start)) &&
> +	    devmem_is_allowed(PHYS_PFN(res->end))) {
> +		/*
> +		 * *cringe* iomem=relaxed says "go ahead, what's the
> +		 * worst that can happen?"
> +		 */
> +		return;
> +	}
> +
> +	unmap_mapping_range(inode->i_mapping, res->start, resource_size(res), 1);
> +}
> +struct address_space *iomem_get_mapping(void)
> +{
> +	/*
> +	 * This function is only called from file open paths, hence guaranteed
> +	 * that fs_initcalls have completed and no need to check for NULL. But
> +	 * since revoke_iomem can be called before the initcall we still need
> +	 * the barrier to appease checkers.
> +	 */
> +	return smp_load_acquire(&iomem_inode)->i_mapping;
> +}
> +#else
> +static void revoke_iomem(struct resource *res) {}
> +struct address_space *iomem_get_mapping(void)
> +{
> +	return NULL;
> +}
> +#endif
> +
>   /**
>    * __request_region - create a new busy resource region
>    * @parent: parent resource descriptor
> @@ -1182,7 +1237,7 @@ struct resource * __request_region(struct resource *parent,
>   	write_unlock(&resource_lock);
>   
>   	if (res && orig_parent == &iomem_resource)
> -		revoke_devmem(res);
> +		revoke_iomem(res);
>   
>   	return res;
>   }
> @@ -1782,4 +1837,48 @@ static int __init strict_iomem(char *str)
>   	return 1;
>   }
>   
> +static int iomem_fs_init_fs_context(struct fs_context *fc)
> +{
> +	return init_pseudo(fc, DEVMEM_MAGIC) ? 0 : -ENOMEM;
> +}
> +
> +static struct file_system_type iomem_fs_type = {
> +	.name		= "iomem",
> +	.owner		= THIS_MODULE,
> +	.init_fs_context = iomem_fs_init_fs_context,
> +	.kill_sb	= kill_anon_super,
> +};
> +
> +static int __init iomem_init_inode(void)
> +{
> +	static struct vfsmount *iomem_vfs_mount;
> +	static int iomem_fs_cnt;
> +	struct inode *inode;
> +	int rc;
> +
> +	rc = simple_pin_fs(&iomem_fs_type, &iomem_vfs_mount, &iomem_fs_cnt);
> +	if (rc < 0) {
> +		pr_err("Cannot mount iomem pseudo filesystem: %d\n", rc);
> +		return rc;
> +	}
> +
> +	inode = alloc_anon_inode(iomem_vfs_mount->mnt_sb);
> +	if (IS_ERR(inode)) {
> +		rc = PTR_ERR(inode);
> +		pr_err("Cannot allocate inode for iomem: %d\n", rc);
> +		simple_release_fs(&iomem_vfs_mount, &iomem_fs_cnt);
> +		return rc;
> +	}
> +
> +	/*
> +	 * Publish iomem revocation inode initialized.
> +	 * Pairs with smp_load_acquire() in revoke_iomem().
> +	 */
> +	smp_store_release(&iomem_inode, inode);
> +
> +	return 0;
> +}
> +
> +fs_initcall(iomem_init_inode);
> +
>   __setup("iomem=", strict_iomem);
> 



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 13/15] resource: Move devmem revoke code to resource framework
  2020-10-31  6:36   ` John Hubbard
@ 2020-10-31 14:40     ` Daniel Vetter
  0 siblings, 0 replies; 62+ messages in thread
From: Daniel Vetter @ 2020-10-31 14:40 UTC (permalink / raw)
  To: John Hubbard
  Cc: DRI Development, LKML, KVM list, Linux MM, Linux ARM,
	linux-samsung-soc, open list:DMA BUFFER SHARING FRAMEWORK,
	Greg Kroah-Hartman, Daniel Vetter, Jason Gunthorpe, Kees Cook,
	Dan Williams, Andrew Morton, Jérôme Glisse, Jan Kara,
	Arnd Bergmann, David Hildenbrand, Rafael J. Wysocki

On Sat, Oct 31, 2020 at 7:36 AM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 10/30/20 3:08 AM, Daniel Vetter wrote:
> > We want all iomem mmaps to consistently revoke ptes when the kernel
> > takes over and CONFIG_IO_STRICT_DEVMEM is enabled. This includes the
> > pci bar mmaps available through procfs and sysfs, which currently do
> > not revoke mappings.
> >
> > To prepare for this, move the code from the /dev/kmem driver to
> > kernel/resource.c.
>
> This seems like it's doing a lot more than just code movement, right?
> Should we list some of that here?

It was meant to be just moving code, but then the inevitable bikeshed
showed up and I forgot to update the commit message properly. Will fix
that.

> Also, I'm seeing a crash due to this commit. More below:

Uh that's not good.
>
> >
> > Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > Cc: Jason Gunthorpe <jgg@ziepe.ca>
> > Cc: Kees Cook <keescook@chromium.org>
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: John Hubbard <jhubbard@nvidia.com>
> > Cc: Jérôme Glisse <jglisse@redhat.com>
> > Cc: Jan Kara <jack@suse.cz>
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > Cc: linux-mm@kvack.org
> > Cc: linux-arm-kernel@lists.infradead.org
> > Cc: linux-samsung-soc@vger.kernel.org
> > Cc: linux-media@vger.kernel.org
> > Cc: Arnd Bergmann <arnd@arndb.de>
> > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
> > Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> > --
> > v3:
> > - add barrier for consistency and document why we don't have to check
> >    for NULL (Jason)
> > v4
> > - Adjust comments to reflect the general nature of this iomem revoke
> >    code now (Dan)
> > ---
> >   drivers/char/mem.c     |  85 +---------------------------------
> >   include/linux/ioport.h |   6 +--
> >   kernel/resource.c      | 101 ++++++++++++++++++++++++++++++++++++++++-
> >   3 files changed, 102 insertions(+), 90 deletions(-)
> >
> > diff --git a/drivers/char/mem.c b/drivers/char/mem.c
> > index 7dcf9e4ea79d..43c871dc7477 100644
> > --- a/drivers/char/mem.c
> > +++ b/drivers/char/mem.c
> > @@ -31,9 +31,6 @@
> >   #include <linux/uio.h>
> >   #include <linux/uaccess.h>
> >   #include <linux/security.h>
> > -#include <linux/pseudo_fs.h>
> > -#include <uapi/linux/magic.h>
> > -#include <linux/mount.h>
> >
> >   #ifdef CONFIG_IA64
> >   # include <linux/efi.h>
> > @@ -836,42 +833,6 @@ static loff_t memory_lseek(struct file *file, loff_t offset, int orig)
> >       return ret;
> >   }
> >
> > -static struct inode *devmem_inode;
> > -
> > -#ifdef CONFIG_IO_STRICT_DEVMEM
> > -void revoke_devmem(struct resource *res)
> > -{
> > -     /* pairs with smp_store_release() in devmem_init_inode() */
> > -     struct inode *inode = smp_load_acquire(&devmem_inode);
> > -
> > -     /*
> > -      * Check that the initialization has completed. Losing the race
> > -      * is ok because it means drivers are claiming resources before
> > -      * the fs_initcall level of init and prevent /dev/mem from
> > -      * establishing mappings.
> > -      */
> > -     if (!inode)
> > -             return;
> > -
> > -     /*
> > -      * The expectation is that the driver has successfully marked
> > -      * the resource busy by this point, so devmem_is_allowed()
> > -      * should start returning false, however for performance this
> > -      * does not iterate the entire resource range.
> > -      */
> > -     if (devmem_is_allowed(PHYS_PFN(res->start)) &&
> > -         devmem_is_allowed(PHYS_PFN(res->end))) {
> > -             /*
> > -              * *cringe* iomem=relaxed says "go ahead, what's the
> > -              * worst that can happen?"
> > -              */
> > -             return;
> > -     }
> > -
> > -     unmap_mapping_range(inode->i_mapping, res->start, resource_size(res), 1);
> > -}
> > -#endif
> > -
> >   static int open_port(struct inode *inode, struct file *filp)
> >   {
> >       int rc;
> > @@ -891,7 +852,7 @@ static int open_port(struct inode *inode, struct file *filp)
> >        * revocations when drivers want to take over a /dev/mem mapped
> >        * range.
> >        */
> > -     filp->f_mapping = inode->i_mapping;
> > +     filp->f_mapping = iomem_get_mapping();
>
>
> The problem is that iomem_get_mapping() returns NULL for the !CONFIG_IO_STRICT_DEVMEM
> case. And then we have pre-existing fs code that expects to go "up and over", like this:
>
>
> static int do_dentry_open(struct file *f,
>                           struct inode *inode,
>                           int (*open)(struct inode *, struct file *))
> {
> ...
>
>         file_ra_state_init(&f->f_ra, f->f_mapping->host->i_mapping);
>
> ...and it crashes on that line fairly early in bootup.
>
> Not sure what to suggest for this patch, but wanted to get this report out at least.

Old code seems to have worked by always setting up the inode (we still
do that) and always setting it (we don't do that anymore), just not
revoking the ptes when the Kconfig is not set. I'll fix that up and
remove the behaviour change here.
-Daniel

> thanks,
> --
> John Hubbard
> NVIDIA
>
> >
> >       return 0;
> >   }
> > @@ -1023,48 +984,6 @@ static char *mem_devnode(struct device *dev, umode_t *mode)
> >
> >   static struct class *mem_class;
> >
> > -static int devmem_fs_init_fs_context(struct fs_context *fc)
> > -{
> > -     return init_pseudo(fc, DEVMEM_MAGIC) ? 0 : -ENOMEM;
> > -}
> > -
> > -static struct file_system_type devmem_fs_type = {
> > -     .name           = "devmem",
> > -     .owner          = THIS_MODULE,
> > -     .init_fs_context = devmem_fs_init_fs_context,
> > -     .kill_sb        = kill_anon_super,
> > -};
> > -
> > -static int devmem_init_inode(void)
> > -{
> > -     static struct vfsmount *devmem_vfs_mount;
> > -     static int devmem_fs_cnt;
> > -     struct inode *inode;
> > -     int rc;
> > -
> > -     rc = simple_pin_fs(&devmem_fs_type, &devmem_vfs_mount, &devmem_fs_cnt);
> > -     if (rc < 0) {
> > -             pr_err("Cannot mount /dev/mem pseudo filesystem: %d\n", rc);
> > -             return rc;
> > -     }
> > -
> > -     inode = alloc_anon_inode(devmem_vfs_mount->mnt_sb);
> > -     if (IS_ERR(inode)) {
> > -             rc = PTR_ERR(inode);
> > -             pr_err("Cannot allocate inode for /dev/mem: %d\n", rc);
> > -             simple_release_fs(&devmem_vfs_mount, &devmem_fs_cnt);
> > -             return rc;
> > -     }
> > -
> > -     /*
> > -      * Publish /dev/mem initialized.
> > -      * Pairs with smp_load_acquire() in revoke_devmem().
> > -      */
> > -     smp_store_release(&devmem_inode, inode);
> > -
> > -     return 0;
> > -}
> > -
> >   static int __init chr_dev_init(void)
> >   {
> >       int minor;
> > @@ -1086,8 +1005,6 @@ static int __init chr_dev_init(void)
> >                */
> >               if ((minor == DEVPORT_MINOR) && !arch_has_dev_port())
> >                       continue;
> > -             if ((minor == DEVMEM_MINOR) && devmem_init_inode() != 0)
> > -                     continue;
> >
> >               device_create(mem_class, NULL, MKDEV(MEM_MAJOR, minor),
> >                             NULL, devlist[minor].name);
> > diff --git a/include/linux/ioport.h b/include/linux/ioport.h
> > index 5135d4b86cd6..02a5466245c0 100644
> > --- a/include/linux/ioport.h
> > +++ b/include/linux/ioport.h
> > @@ -307,11 +307,7 @@ struct resource *devm_request_free_mem_region(struct device *dev,
> >   struct resource *request_free_mem_region(struct resource *base,
> >               unsigned long size, const char *name);
> >
> > -#ifdef CONFIG_IO_STRICT_DEVMEM
> > -void revoke_devmem(struct resource *res);
> > -#else
> > -static inline void revoke_devmem(struct resource *res) { };
> > -#endif
> > +extern struct address_space *iomem_get_mapping(void);
> >
> >   #endif /* __ASSEMBLY__ */
> >   #endif      /* _LINUX_IOPORT_H */
> > diff --git a/kernel/resource.c b/kernel/resource.c
> > index 3ae2f56cc79d..5ecc3187fe2d 100644
> > --- a/kernel/resource.c
> > +++ b/kernel/resource.c
> > @@ -18,12 +18,15 @@
> >   #include <linux/spinlock.h>
> >   #include <linux/fs.h>
> >   #include <linux/proc_fs.h>
> > +#include <linux/pseudo_fs.h>
> >   #include <linux/sched.h>
> >   #include <linux/seq_file.h>
> >   #include <linux/device.h>
> >   #include <linux/pfn.h>
> >   #include <linux/mm.h>
> > +#include <linux/mount.h>
> >   #include <linux/resource_ext.h>
> > +#include <uapi/linux/magic.h>
> >   #include <asm/io.h>
> >
> >
> > @@ -1115,6 +1118,58 @@ resource_size_t resource_alignment(struct resource *res)
> >
> >   static DECLARE_WAIT_QUEUE_HEAD(muxed_resource_wait);
> >
> > +static struct inode *iomem_inode;
> > +
> > +#ifdef CONFIG_IO_STRICT_DEVMEM
> > +static void revoke_iomem(struct resource *res)
> > +{
> > +     /* pairs with smp_store_release() in iomem_init_inode() */
> > +     struct inode *inode = smp_load_acquire(&iomem_inode);
> > +
> > +     /*
> > +      * Check that the initialization has completed. Losing the race
> > +      * is ok because it means drivers are claiming resources before
> > +      * the fs_initcall level of init and prevent iomem_get_mapping users
> > +      * from establishing mappings.
> > +      */
> > +     if (!inode)
> > +             return;
> > +
> > +     /*
> > +      * The expectation is that the driver has successfully marked
> > +      * the resource busy by this point, so devmem_is_allowed()
> > +      * should start returning false, however for performance this
> > +      * does not iterate the entire resource range.
> > +      */
> > +     if (devmem_is_allowed(PHYS_PFN(res->start)) &&
> > +         devmem_is_allowed(PHYS_PFN(res->end))) {
> > +             /*
> > +              * *cringe* iomem=relaxed says "go ahead, what's the
> > +              * worst that can happen?"
> > +              */
> > +             return;
> > +     }
> > +
> > +     unmap_mapping_range(inode->i_mapping, res->start, resource_size(res), 1);
> > +}
> > +struct address_space *iomem_get_mapping(void)
> > +{
> > +     /*
> > +      * This function is only called from file open paths, hence guaranteed
> > +      * that fs_initcalls have completed and no need to check for NULL. But
> > +      * since revoke_iomem can be called before the initcall we still need
> > +      * the barrier to appease checkers.
> > +      */
> > +     return smp_load_acquire(&iomem_inode)->i_mapping;
> > +}
> > +#else
> > +static void revoke_iomem(struct resource *res) {}
> > +struct address_space *iomem_get_mapping(void)
> > +{
> > +     return NULL;
> > +}
> > +#endif
> > +
> >   /**
> >    * __request_region - create a new busy resource region
> >    * @parent: parent resource descriptor
> > @@ -1182,7 +1237,7 @@ struct resource * __request_region(struct resource *parent,
> >       write_unlock(&resource_lock);
> >
> >       if (res && orig_parent == &iomem_resource)
> > -             revoke_devmem(res);
> > +             revoke_iomem(res);
> >
> >       return res;
> >   }
> > @@ -1782,4 +1837,48 @@ static int __init strict_iomem(char *str)
> >       return 1;
> >   }
> >
> > +static int iomem_fs_init_fs_context(struct fs_context *fc)
> > +{
> > +     return init_pseudo(fc, DEVMEM_MAGIC) ? 0 : -ENOMEM;
> > +}
> > +
> > +static struct file_system_type iomem_fs_type = {
> > +     .name           = "iomem",
> > +     .owner          = THIS_MODULE,
> > +     .init_fs_context = iomem_fs_init_fs_context,
> > +     .kill_sb        = kill_anon_super,
> > +};
> > +
> > +static int __init iomem_init_inode(void)
> > +{
> > +     static struct vfsmount *iomem_vfs_mount;
> > +     static int iomem_fs_cnt;
> > +     struct inode *inode;
> > +     int rc;
> > +
> > +     rc = simple_pin_fs(&iomem_fs_type, &iomem_vfs_mount, &iomem_fs_cnt);
> > +     if (rc < 0) {
> > +             pr_err("Cannot mount iomem pseudo filesystem: %d\n", rc);
> > +             return rc;
> > +     }
> > +
> > +     inode = alloc_anon_inode(iomem_vfs_mount->mnt_sb);
> > +     if (IS_ERR(inode)) {
> > +             rc = PTR_ERR(inode);
> > +             pr_err("Cannot allocate inode for iomem: %d\n", rc);
> > +             simple_release_fs(&iomem_vfs_mount, &iomem_fs_cnt);
> > +             return rc;
> > +     }
> > +
> > +     /*
> > +      * Publish iomem revocation inode initialized.
> > +      * Pairs with smp_load_acquire() in revoke_iomem().
> > +      */
> > +     smp_store_release(&iomem_inode, inode);
> > +
> > +     return 0;
> > +}
> > +
> > +fs_initcall(iomem_init_inode);
> > +
> >   __setup("iomem=", strict_iomem);
> >
>
>


-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 05/15] mm/frame-vector: Use FOLL_LONGTERM
  2020-10-31  2:55   ` John Hubbard
@ 2020-10-31 14:45     ` Daniel Vetter
  2020-11-01  5:22       ` John Hubbard
  0 siblings, 1 reply; 62+ messages in thread
From: Daniel Vetter @ 2020-10-31 14:45 UTC (permalink / raw)
  To: John Hubbard
  Cc: DRI Development, LKML, KVM list, Linux MM, Linux ARM,
	linux-samsung-soc, open list:DMA BUFFER SHARING FRAMEWORK,
	Daniel Vetter, Jason Gunthorpe, Pawel Osciak, Marek Szyprowski,
	Kyungmin Park, Tomasz Figa, Mauro Carvalho Chehab, Andrew Morton,
	Jérôme Glisse, Jan Kara, Dan Williams

On Sat, Oct 31, 2020 at 3:55 AM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 10/30/20 3:08 AM, Daniel Vetter wrote:
> > This is used by media/videbuf2 for persistent dma mappings, not just
> > for a single dma operation and then freed again, so needs
> > FOLL_LONGTERM.
> >
> > Unfortunately current pup_locked doesn't support FOLL_LONGTERM due to
> > locking issues. Rework the code to pull the pup path out from the
> > mmap_sem critical section as suggested by Jason.
> >
> > By relying entirely on the vma checks in pin_user_pages and follow_pfn
>
> There are vma checks in pin_user_pages(), but this patch changes things
> to call pin_user_pages_fast(). And that does not have the vma checks.
> More below about this:
>
> > (for vm_flags and vma_is_fsdax) we can also streamline the code a lot.
> >
> > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > Cc: Jason Gunthorpe <jgg@ziepe.ca>
> > Cc: Pawel Osciak <pawel@osciak.com>
> > Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> > Cc: Kyungmin Park <kyungmin.park@samsung.com>
> > Cc: Tomasz Figa <tfiga@chromium.org>
> > Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: John Hubbard <jhubbard@nvidia.com>
> > Cc: Jérôme Glisse <jglisse@redhat.com>
> > Cc: Jan Kara <jack@suse.cz>
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > Cc: linux-mm@kvack.org
> > Cc: linux-arm-kernel@lists.infradead.org
> > Cc: linux-samsung-soc@vger.kernel.org
> > Cc: linux-media@vger.kernel.org
> > Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> > --
> > v2: Streamline the code and further simplify the loop checks (Jason)
> >
> > v5: Review from Tomasz:
> > - fix page counting for the follow_pfn case by resetting ret
> > - drop gup_flags paramater, now unused
> > ---
> >   .../media/common/videobuf2/videobuf2-memops.c |  3 +-
> >   include/linux/mm.h                            |  2 +-
> >   mm/frame_vector.c                             | 53 ++++++-------------
> >   3 files changed, 19 insertions(+), 39 deletions(-)
> >
> > diff --git a/drivers/media/common/videobuf2/videobuf2-memops.c b/drivers/media/common/videobuf2/videobuf2-memops.c
> > index 6e9e05153f4e..9dd6c27162f4 100644
> > --- a/drivers/media/common/videobuf2/videobuf2-memops.c
> > +++ b/drivers/media/common/videobuf2/videobuf2-memops.c
> > @@ -40,7 +40,6 @@ struct frame_vector *vb2_create_framevec(unsigned long start,
> >       unsigned long first, last;
> >       unsigned long nr;
> >       struct frame_vector *vec;
> > -     unsigned int flags = FOLL_FORCE | FOLL_WRITE;
> >
> >       first = start >> PAGE_SHIFT;
> >       last = (start + length - 1) >> PAGE_SHIFT;
> > @@ -48,7 +47,7 @@ struct frame_vector *vb2_create_framevec(unsigned long start,
> >       vec = frame_vector_create(nr);
> >       if (!vec)
> >               return ERR_PTR(-ENOMEM);
> > -     ret = get_vaddr_frames(start & PAGE_MASK, nr, flags, vec);
> > +     ret = get_vaddr_frames(start & PAGE_MASK, nr, vec);
> >       if (ret < 0)
> >               goto out_destroy;
> >       /* We accept only complete set of PFNs */
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index ef360fe70aaf..d6b8e30dce2e 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1765,7 +1765,7 @@ struct frame_vector {
> >   struct frame_vector *frame_vector_create(unsigned int nr_frames);
> >   void frame_vector_destroy(struct frame_vector *vec);
> >   int get_vaddr_frames(unsigned long start, unsigned int nr_pfns,
> > -                  unsigned int gup_flags, struct frame_vector *vec);
> > +                  struct frame_vector *vec);
> >   void put_vaddr_frames(struct frame_vector *vec);
> >   int frame_vector_to_pages(struct frame_vector *vec);
> >   void frame_vector_to_pfns(struct frame_vector *vec);
> > diff --git a/mm/frame_vector.c b/mm/frame_vector.c
> > index 10f82d5643b6..f8c34b895c76 100644
> > --- a/mm/frame_vector.c
> > +++ b/mm/frame_vector.c
> > @@ -32,13 +32,12 @@
> >    * This function takes care of grabbing mmap_lock as necessary.
> >    */
> >   int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
> > -                  unsigned int gup_flags, struct frame_vector *vec)
> > +                  struct frame_vector *vec)
> >   {
> >       struct mm_struct *mm = current->mm;
> >       struct vm_area_struct *vma;
> >       int ret = 0;
> >       int err;
> > -     int locked;
> >
> >       if (nr_frames == 0)
> >               return 0;
> > @@ -48,40 +47,26 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
> >
> >       start = untagged_addr(start);
> >
> > -     mmap_read_lock(mm);
> > -     locked = 1;
> > -     vma = find_vma_intersection(mm, start, start + 1);
> > -     if (!vma) {
> > -             ret = -EFAULT;
> > -             goto out;
> > -     }
> > -
> > -     /*
> > -      * While get_vaddr_frames() could be used for transient (kernel
> > -      * controlled lifetime) pinning of memory pages all current
> > -      * users establish long term (userspace controlled lifetime)
> > -      * page pinning. Treat get_vaddr_frames() like
> > -      * get_user_pages_longterm() and disallow it for filesystem-dax
> > -      * mappings.
> > -      */
> > -     if (vma_is_fsdax(vma)) {
> > -             ret = -EOPNOTSUPP;
> > -             goto out;
> > -     }
> > -
> > -     if (!(vma->vm_flags & (VM_IO | VM_PFNMAP))) {
>
> By removing this check from this location, and changing from
> pin_user_pages_locked() to pin_user_pages_fast(), I *think* we end up
> losing the check entirely. Is that intended? If so it could use a comment
> somewhere to explain why.

Yeah this wasn't intentional. I think I needed to drop the _locked
version to prep for FOLL_LONGTERM, and figured _fast is always better.
But I didn't realize that _fast doesn't have the vma checks, gup.c got
me a bit confused.

I'll remedy this in all the patches where this applies (because a
VM_IO | VM_PFNMAP can point at struct page backed memory, and that
exact use-case is what we want to stop with the unsafe_follow_pfn work
since it wreaks things like cma or security).

Aside: I do wonder whether the lack for that check isn't a problem.
VM_IO | VM_PFNMAP generally means driver managed, which means the
driver isn't going to consult the page pin count or anything like that
(at least not necessarily) when revoking or moving that memory, since
we're assuming it's totally under driver control. So if pup_fast can
get into such a mapping, we might have a problem.
-Daniel

> thanks,
> --
> John Hubbard
> NVIDIA
>
> > +     ret = pin_user_pages_fast(start, nr_frames,
> > +                               FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM,
> > +                               (struct page **)(vec->ptrs));
> > +     if (ret > 0) {
> >               vec->got_ref = true;
> >               vec->is_pfns = false;
> > -             ret = pin_user_pages_locked(start, nr_frames,
> > -                     gup_flags, (struct page **)(vec->ptrs), &locked);
> > -             goto out;
> > +             goto out_unlocked;
> >       }
> >
> > +     mmap_read_lock(mm);
> >       vec->got_ref = false;
> >       vec->is_pfns = true;
> > +     ret = 0;
> >       do {
> >               unsigned long *nums = frame_vector_pfns(vec);
> >
> > +             vma = find_vma_intersection(mm, start, start + 1);
> > +             if (!vma)
> > +                     break;
> > +
> >               while (ret < nr_frames && start + PAGE_SIZE <= vma->vm_end) {
> >                       err = follow_pfn(vma, start, &nums[ret]);
> >                       if (err) {
> > @@ -92,17 +77,13 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
> >                       start += PAGE_SIZE;
> >                       ret++;
> >               }
> > -             /*
> > -              * We stop if we have enough pages or if VMA doesn't completely
> > -              * cover the tail page.
> > -              */
> > -             if (ret >= nr_frames || start < vma->vm_end)
> > +             /* Bail out if VMA doesn't completely cover the tail page. */
> > +             if (start < vma->vm_end)
> >                       break;
> > -             vma = find_vma_intersection(mm, start, start + 1);
> > -     } while (vma && vma->vm_flags & (VM_IO | VM_PFNMAP));
> > +     } while (ret < nr_frames);
> >   out:
> > -     if (locked)
> > -             mmap_read_unlock(mm);
> > +     mmap_read_unlock(mm);
> > +out_unlocked:
> >       if (!ret)
> >               ret = -EFAULT;
> >       if (ret > 0)
> >
>
>


-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 05/15] mm/frame-vector: Use FOLL_LONGTERM
  2020-10-31 14:45     ` Daniel Vetter
@ 2020-11-01  5:22       ` John Hubbard
  2020-11-01 10:30         ` Daniel Vetter
  0 siblings, 1 reply; 62+ messages in thread
From: John Hubbard @ 2020-11-01  5:22 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: DRI Development, LKML, KVM list, Linux MM, Linux ARM,
	linux-samsung-soc, open list:DMA BUFFER SHARING FRAMEWORK,
	Daniel Vetter, Jason Gunthorpe, Pawel Osciak, Marek Szyprowski,
	Kyungmin Park, Tomasz Figa, Mauro Carvalho Chehab, Andrew Morton,
	Jérôme Glisse, Jan Kara, Dan Williams

On 10/31/20 7:45 AM, Daniel Vetter wrote:
> On Sat, Oct 31, 2020 at 3:55 AM John Hubbard <jhubbard@nvidia.com> wrote:
>> On 10/30/20 3:08 AM, Daniel Vetter wrote:
...
>> By removing this check from this location, and changing from
>> pin_user_pages_locked() to pin_user_pages_fast(), I *think* we end up
>> losing the check entirely. Is that intended? If so it could use a comment
>> somewhere to explain why.
> 
> Yeah this wasn't intentional. I think I needed to drop the _locked
> version to prep for FOLL_LONGTERM, and figured _fast is always better.
> But I didn't realize that _fast doesn't have the vma checks, gup.c got
> me a bit confused.

Actually, I thought that the change to _fast was a very nice touch, btw.

> 
> I'll remedy this in all the patches where this applies (because a
> VM_IO | VM_PFNMAP can point at struct page backed memory, and that
> exact use-case is what we want to stop with the unsafe_follow_pfn work
> since it wreaks things like cma or security).
> 
> Aside: I do wonder whether the lack for that check isn't a problem.
> VM_IO | VM_PFNMAP generally means driver managed, which means the
> driver isn't going to consult the page pin count or anything like that
> (at least not necessarily) when revoking or moving that memory, since
> we're assuming it's totally under driver control. So if pup_fast can
> get into such a mapping, we might have a problem.
> -Daniel
>

Yes. I don't know why that check is missing from the _fast path.
Probably just an oversight, seeing as how it's in the slow path. Maybe
the appropriate response here is to add a separate patch that adds the
check.

I wonder if I'm overlooking something, but it certainly seems correct to
do that.

  thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 05/15] mm/frame-vector: Use FOLL_LONGTERM
  2020-11-01  5:22       ` John Hubbard
@ 2020-11-01 10:30         ` Daniel Vetter
  2020-11-01 21:13           ` John Hubbard
  0 siblings, 1 reply; 62+ messages in thread
From: Daniel Vetter @ 2020-11-01 10:30 UTC (permalink / raw)
  To: John Hubbard
  Cc: DRI Development, LKML, KVM list, Linux MM, Linux ARM,
	linux-samsung-soc, open list:DMA BUFFER SHARING FRAMEWORK,
	Daniel Vetter, Jason Gunthorpe, Pawel Osciak, Marek Szyprowski,
	Kyungmin Park, Tomasz Figa, Mauro Carvalho Chehab, Andrew Morton,
	Jérôme Glisse, Jan Kara, Dan Williams

On Sun, Nov 1, 2020 at 6:22 AM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 10/31/20 7:45 AM, Daniel Vetter wrote:
> > On Sat, Oct 31, 2020 at 3:55 AM John Hubbard <jhubbard@nvidia.com> wrote:
> >> On 10/30/20 3:08 AM, Daniel Vetter wrote:
> ...
> >> By removing this check from this location, and changing from
> >> pin_user_pages_locked() to pin_user_pages_fast(), I *think* we end up
> >> losing the check entirely. Is that intended? If so it could use a comment
> >> somewhere to explain why.
> >
> > Yeah this wasn't intentional. I think I needed to drop the _locked
> > version to prep for FOLL_LONGTERM, and figured _fast is always better.
> > But I didn't realize that _fast doesn't have the vma checks, gup.c got
> > me a bit confused.
>
> Actually, I thought that the change to _fast was a very nice touch, btw.
>
> >
> > I'll remedy this in all the patches where this applies (because a
> > VM_IO | VM_PFNMAP can point at struct page backed memory, and that
> > exact use-case is what we want to stop with the unsafe_follow_pfn work
> > since it wreaks things like cma or security).
> >
> > Aside: I do wonder whether the lack for that check isn't a problem.
> > VM_IO | VM_PFNMAP generally means driver managed, which means the
> > driver isn't going to consult the page pin count or anything like that
> > (at least not necessarily) when revoking or moving that memory, since
> > we're assuming it's totally under driver control. So if pup_fast can
> > get into such a mapping, we might have a problem.
> > -Daniel
> >
>
> Yes. I don't know why that check is missing from the _fast path.
> Probably just an oversight, seeing as how it's in the slow path. Maybe
> the appropriate response here is to add a separate patch that adds the
> check.
>
> I wonder if I'm overlooking something, but it certainly seems correct to
> do that.

You'll need the mmap_sem to get at the vma to be able to do this
check. If you add that to _fast, you made it as fast as the slow one.
Plus there's _fast_only due to locking recurion issues in fast-paths
(I assume, I didn't check all the callers).

I'm just wondering whether we have a bug somewhere with device
drivers. For CMA regions we always check in try_grab_page, but for dax
I'm not seeing where the checks in the _fast fastpaths are, and that
all still leaves random device driver mappings behind which aren't
backed by CMA but still point to something with a struct page behind
it. I'm probably just missing something, but no idea what.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 05/15] mm/frame-vector: Use FOLL_LONGTERM
  2020-11-01 10:30         ` Daniel Vetter
@ 2020-11-01 21:13           ` John Hubbard
  2020-11-01 22:50             ` Daniel Vetter
  0 siblings, 1 reply; 62+ messages in thread
From: John Hubbard @ 2020-11-01 21:13 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: DRI Development, LKML, KVM list, Linux MM, Linux ARM,
	linux-samsung-soc, open list:DMA BUFFER SHARING FRAMEWORK,
	Daniel Vetter, Jason Gunthorpe, Pawel Osciak, Marek Szyprowski,
	Kyungmin Park, Tomasz Figa, Mauro Carvalho Chehab, Andrew Morton,
	Jérôme Glisse, Jan Kara, Dan Williams

On 11/1/20 2:30 AM, Daniel Vetter wrote:
> On Sun, Nov 1, 2020 at 6:22 AM John Hubbard <jhubbard@nvidia.com> wrote:
>>
>> On 10/31/20 7:45 AM, Daniel Vetter wrote:
>>> On Sat, Oct 31, 2020 at 3:55 AM John Hubbard <jhubbard@nvidia.com> wrote:
>>>> On 10/30/20 3:08 AM, Daniel Vetter wrote:
>> ...
>>>> By removing this check from this location, and changing from
>>>> pin_user_pages_locked() to pin_user_pages_fast(), I *think* we end up
>>>> losing the check entirely. Is that intended? If so it could use a comment
>>>> somewhere to explain why.
>>>
>>> Yeah this wasn't intentional. I think I needed to drop the _locked
>>> version to prep for FOLL_LONGTERM, and figured _fast is always better.
>>> But I didn't realize that _fast doesn't have the vma checks, gup.c got
>>> me a bit confused.
>>
>> Actually, I thought that the change to _fast was a very nice touch, btw.
>>
>>>
>>> I'll remedy this in all the patches where this applies (because a
>>> VM_IO | VM_PFNMAP can point at struct page backed memory, and that
>>> exact use-case is what we want to stop with the unsafe_follow_pfn work
>>> since it wreaks things like cma or security).
>>>
>>> Aside: I do wonder whether the lack for that check isn't a problem.
>>> VM_IO | VM_PFNMAP generally means driver managed, which means the
>>> driver isn't going to consult the page pin count or anything like that
>>> (at least not necessarily) when revoking or moving that memory, since
>>> we're assuming it's totally under driver control. So if pup_fast can
>>> get into such a mapping, we might have a problem.
>>> -Daniel
>>>
>>
>> Yes. I don't know why that check is missing from the _fast path.
>> Probably just an oversight, seeing as how it's in the slow path. Maybe
>> the appropriate response here is to add a separate patch that adds the
>> check.
>>
>> I wonder if I'm overlooking something, but it certainly seems correct to
>> do that.
> 
> You'll need the mmap_sem to get at the vma to be able to do this
> check. If you add that to _fast, you made it as fast as the slow one.

Arggh, yes of course. Strike that, please. :)

> Plus there's _fast_only due to locking recurion issues in fast-paths
> (I assume, I didn't check all the callers).
> 
> I'm just wondering whether we have a bug somewhere with device
> drivers. For CMA regions we always check in try_grab_page, but for dax

OK, so here you're talking about a different bug than the VM_IO | VM_PFNMAP
pages, I think. This is about the "FOLL_LONGTERM + CMA + gup/pup _fast"
combination that is not allowed, right?

For that: try_grab_page() doesn't check anything, but try_grab_compound_head()
does, but only for pup_fast, not gup_fast. That was added by commit
df3a0a21b698d ("mm/gup: fix omission of check on FOLL_LONGTERM in gup fast
path") in April.

I recall that the patch was just plugging a very specific hole, as opposed
to locking down the API against mistakes or confused callers. And it does
seem that there are some holes.

> I'm not seeing where the checks in the _fast fastpaths are, and that
> all still leaves random device driver mappings behind which aren't
> backed by CMA but still point to something with a struct page behind
> it. I'm probably just missing something, but no idea what.
> -Daniel
> 

Certainly we've established that we can't check VMA flags by that time,
so I'm not sure that there is much we can check by the time we get to
gup/pup _fast. Seems like the device drivers have to avoid calling _fast
with pages that live in VM_IO | VM_PFNMAP, by design, right? Or maybe
you're talking about CMA checks only?


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 05/15] mm/frame-vector: Use FOLL_LONGTERM
  2020-11-01 21:13           ` John Hubbard
@ 2020-11-01 22:50             ` Daniel Vetter
  2020-11-04 14:00               ` Jason Gunthorpe
  0 siblings, 1 reply; 62+ messages in thread
From: Daniel Vetter @ 2020-11-01 22:50 UTC (permalink / raw)
  To: John Hubbard
  Cc: DRI Development, LKML, KVM list, Linux MM, Linux ARM,
	linux-samsung-soc, open list:DMA BUFFER SHARING FRAMEWORK,
	Daniel Vetter, Jason Gunthorpe, Pawel Osciak, Marek Szyprowski,
	Kyungmin Park, Tomasz Figa, Mauro Carvalho Chehab, Andrew Morton,
	Jérôme Glisse, Jan Kara, Dan Williams

On Sun, Nov 1, 2020 at 10:13 PM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 11/1/20 2:30 AM, Daniel Vetter wrote:
> > On Sun, Nov 1, 2020 at 6:22 AM John Hubbard <jhubbard@nvidia.com> wrote:
> >>
> >> On 10/31/20 7:45 AM, Daniel Vetter wrote:
> >>> On Sat, Oct 31, 2020 at 3:55 AM John Hubbard <jhubbard@nvidia.com> wrote:
> >>>> On 10/30/20 3:08 AM, Daniel Vetter wrote:
> >> ...
> >>>> By removing this check from this location, and changing from
> >>>> pin_user_pages_locked() to pin_user_pages_fast(), I *think* we end up
> >>>> losing the check entirely. Is that intended? If so it could use a comment
> >>>> somewhere to explain why.
> >>>
> >>> Yeah this wasn't intentional. I think I needed to drop the _locked
> >>> version to prep for FOLL_LONGTERM, and figured _fast is always better.
> >>> But I didn't realize that _fast doesn't have the vma checks, gup.c got
> >>> me a bit confused.
> >>
> >> Actually, I thought that the change to _fast was a very nice touch, btw.
> >>
> >>>
> >>> I'll remedy this in all the patches where this applies (because a
> >>> VM_IO | VM_PFNMAP can point at struct page backed memory, and that
> >>> exact use-case is what we want to stop with the unsafe_follow_pfn work
> >>> since it wreaks things like cma or security).
> >>>
> >>> Aside: I do wonder whether the lack for that check isn't a problem.
> >>> VM_IO | VM_PFNMAP generally means driver managed, which means the
> >>> driver isn't going to consult the page pin count or anything like that
> >>> (at least not necessarily) when revoking or moving that memory, since
> >>> we're assuming it's totally under driver control. So if pup_fast can
> >>> get into such a mapping, we might have a problem.
> >>> -Daniel
> >>>
> >>
> >> Yes. I don't know why that check is missing from the _fast path.
> >> Probably just an oversight, seeing as how it's in the slow path. Maybe
> >> the appropriate response here is to add a separate patch that adds the
> >> check.
> >>
> >> I wonder if I'm overlooking something, but it certainly seems correct to
> >> do that.
> >
> > You'll need the mmap_sem to get at the vma to be able to do this
> > check. If you add that to _fast, you made it as fast as the slow one.
>
> Arggh, yes of course. Strike that, please. :)
>
> > Plus there's _fast_only due to locking recurion issues in fast-paths
> > (I assume, I didn't check all the callers).
> >
> > I'm just wondering whether we have a bug somewhere with device
> > drivers. For CMA regions we always check in try_grab_page, but for dax
>
> OK, so here you're talking about a different bug than the VM_IO | VM_PFNMAP
> pages, I think. This is about the "FOLL_LONGTERM + CMA + gup/pup _fast"
> combination that is not allowed, right?

Yeah sorry, I got distracted reading code and noticed we might have
another issue.

> For that: try_grab_page() doesn't check anything, but try_grab_compound_head()
> does, but only for pup_fast, not gup_fast. That was added by commit
> df3a0a21b698d ("mm/gup: fix omission of check on FOLL_LONGTERM in gup fast
> path") in April.
>
> I recall that the patch was just plugging a very specific hole, as opposed
> to locking down the API against mistakes or confused callers. And it does
> seem that there are some holes.

Yup that's the one I've found.

> > I'm not seeing where the checks in the _fast fastpaths are, and that
> > all still leaves random device driver mappings behind which aren't
> > backed by CMA but still point to something with a struct page behind
> > it. I'm probably just missing something, but no idea what.
> > -Daniel
> >
>
> Certainly we've established that we can't check VMA flags by that time,
> so I'm not sure that there is much we can check by the time we get to
> gup/pup _fast. Seems like the device drivers have to avoid calling _fast
> with pages that live in VM_IO | VM_PFNMAP, by design, right? Or maybe
> you're talking about CMA checks only?

It's not device drivers, but everyone else. At least my understanding
is that VM_IO | VM_PFNMAP means "even if it happens to be backed by a
struct page, do not treat it like normal memory". And gup/pup_fast
happily break that. I tried to chase the history of that test, didn't
turn up anything I understood much:

commit 1ff8038988adecfde71d82c0597727fc239d4e8c
Author: Linus Torvalds <torvalds@g5.osdl.org>
Date:   Mon Dec 12 16:24:33 2005 -0800

   get_user_pages: don't try to follow PFNMAP pages

   Nick Piggin points out that a few drivers play games with VM_IO (why?
   who knows..) and thus a pfn-remapped area may not have that bit set even
   if remap_pfn_range() set it originally.

   So make it explicit in get_user_pages() that we don't follow VM_PFNMAP
   pages, since pretty much by definition they do not have a "struct page"
   associated with them.

   Signed-off-by: Linus Torvalds <torvalds@osdl.org>

diff --git a/mm/memory.c b/mm/memory.c
index 47c533eaa072..d22f78c8a381 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1009,7 +1009,7 @@ int get_user_pages(struct task_struct *tsk,
struct mm_struct *mm,
                       continue;
               }

-               if (!vma || (vma->vm_flags & VM_IO)
+               if (!vma || (vma->vm_flags & (VM_IO | VM_PFNMAP))
                               || !(vm_flags & vma->vm_flags))
                       return i ? : -EFAULT;


The VM_IO check is kinda lost in pre-history.

tbh I have no idea what the various variants of pup/gup are supposed
to be doing vs. these VMA flags in the various cases. Just smells a
bit like potential trouble due to randomly pinning stuff without the
owner of that memory having an idea what's going on.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 08/15] mm: Add unsafe_follow_pfn
  2020-10-30 10:08 ` [PATCH v5 08/15] mm: Add unsafe_follow_pfn Daniel Vetter
@ 2020-11-02  7:29   ` Christoph Hellwig
  2020-11-02 12:56     ` Daniel Vetter
  0 siblings, 1 reply; 62+ messages in thread
From: Christoph Hellwig @ 2020-11-02  7:29 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: DRI Development, LKML, kvm, linux-mm, linux-arm-kernel,
	linux-samsung-soc, linux-media, Daniel Vetter, Christoph Hellwig,
	Jason Gunthorpe, Kees Cook, Dan Williams, Andrew Morton,
	John Hubbard, J??r??me Glisse, Jan Kara

On Fri, Oct 30, 2020 at 11:08:08AM +0100, Daniel Vetter wrote:
> Also mark up follow_pfn as EXPORT_SYMBOL_GPL. The only safe way to use
> that by drivers/modules is together with an mmu_notifier, and that's
> all _GPL stuff.

I also think it also needs to be renamed to explicitly break any existing
users out of tree or int the submission queue.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 08/15] mm: Add unsafe_follow_pfn
  2020-11-02  7:29   ` Christoph Hellwig
@ 2020-11-02 12:56     ` Daniel Vetter
  2020-11-02 13:01       ` Jason Gunthorpe
  0 siblings, 1 reply; 62+ messages in thread
From: Daniel Vetter @ 2020-11-02 12:56 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: DRI Development, LKML, KVM list, Linux MM, Linux ARM,
	linux-samsung-soc, open list:DMA BUFFER SHARING FRAMEWORK,
	Daniel Vetter, Jason Gunthorpe, Kees Cook, Dan Williams,
	Andrew Morton, John Hubbard, J??r??me Glisse, Jan Kara

On Mon, Nov 2, 2020 at 8:29 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Fri, Oct 30, 2020 at 11:08:08AM +0100, Daniel Vetter wrote:
> > Also mark up follow_pfn as EXPORT_SYMBOL_GPL. The only safe way to use
> > that by drivers/modules is together with an mmu_notifier, and that's
> > all _GPL stuff.
>
> I also think it also needs to be renamed to explicitly break any existing
> users out of tree or int the submission queue.

Ok I looked at the mmu notifier locking again and noticed that
mm->subscriptions has its own spinlock. Since there usually shouldn't
be a huge pile of these I think it's feasible to check for the mmu
notifier in follow_pfn. And that would stuff this gap for good. I'll
throw that on top as a final patch and see what people think.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 08/15] mm: Add unsafe_follow_pfn
  2020-11-02 12:56     ` Daniel Vetter
@ 2020-11-02 13:01       ` Jason Gunthorpe
  2020-11-02 13:23         ` Daniel Vetter
  0 siblings, 1 reply; 62+ messages in thread
From: Jason Gunthorpe @ 2020-11-02 13:01 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Christoph Hellwig, DRI Development, LKML, KVM list, Linux MM,
	Linux ARM, linux-samsung-soc,
	open list:DMA BUFFER SHARING FRAMEWORK, Daniel Vetter, Kees Cook,
	Dan Williams, Andrew Morton, John Hubbard, J??r??me Glisse,
	Jan Kara

On Mon, Nov 02, 2020 at 01:56:10PM +0100, Daniel Vetter wrote:
> On Mon, Nov 2, 2020 at 8:29 AM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > On Fri, Oct 30, 2020 at 11:08:08AM +0100, Daniel Vetter wrote:
> > > Also mark up follow_pfn as EXPORT_SYMBOL_GPL. The only safe way to use
> > > that by drivers/modules is together with an mmu_notifier, and that's
> > > all _GPL stuff.
> >
> > I also think it also needs to be renamed to explicitly break any existing
> > users out of tree or int the submission queue.
> 
> Ok I looked at the mmu notifier locking again and noticed that
> mm->subscriptions has its own spinlock. Since there usually shouldn't
> be a huge pile of these I think it's feasible to check for the mmu
> notifier in follow_pfn. And that would stuff this gap for good. I'll
> throw that on top as a final patch and see what people think.

Probably the simplest is to just check mm_has_notifiers() when in
lockdep or something very simple like that

Jason

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 08/15] mm: Add unsafe_follow_pfn
  2020-11-02 13:01       ` Jason Gunthorpe
@ 2020-11-02 13:23         ` Daniel Vetter
  2020-11-02 15:52           ` Jason Gunthorpe
  0 siblings, 1 reply; 62+ messages in thread
From: Daniel Vetter @ 2020-11-02 13:23 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, DRI Development, LKML, KVM list, Linux MM,
	Linux ARM, linux-samsung-soc,
	open list:DMA BUFFER SHARING FRAMEWORK, Daniel Vetter, Kees Cook,
	Dan Williams, Andrew Morton, John Hubbard, J??r??me Glisse,
	Jan Kara

On Mon, Nov 2, 2020 at 2:01 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Mon, Nov 02, 2020 at 01:56:10PM +0100, Daniel Vetter wrote:
> > On Mon, Nov 2, 2020 at 8:29 AM Christoph Hellwig <hch@infradead.org> wrote:
> > >
> > > On Fri, Oct 30, 2020 at 11:08:08AM +0100, Daniel Vetter wrote:
> > > > Also mark up follow_pfn as EXPORT_SYMBOL_GPL. The only safe way to use
> > > > that by drivers/modules is together with an mmu_notifier, and that's
> > > > all _GPL stuff.
> > >
> > > I also think it also needs to be renamed to explicitly break any existing
> > > users out of tree or int the submission queue.
> >
> > Ok I looked at the mmu notifier locking again and noticed that
> > mm->subscriptions has its own spinlock. Since there usually shouldn't
> > be a huge pile of these I think it's feasible to check for the mmu
> > notifier in follow_pfn. And that would stuff this gap for good. I'll
> > throw that on top as a final patch and see what people think.
>
> Probably the simplest is to just check mm_has_notifiers() when in
> lockdep or something very simple like that

lockdep feels wrong, was locking more at CONFIG_DEBUG_VM. And since
generally you only have 1 mmu notifier (especially for kvm) I think we
can also pay the 2nd cacheline miss and actually check the right mmu
notifier is registered.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 08/15] mm: Add unsafe_follow_pfn
  2020-11-02 13:23         ` Daniel Vetter
@ 2020-11-02 15:52           ` Jason Gunthorpe
  2020-11-02 16:41             ` Christoph Hellwig
  2020-11-02 16:42             ` Daniel Vetter
  0 siblings, 2 replies; 62+ messages in thread
From: Jason Gunthorpe @ 2020-11-02 15:52 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Christoph Hellwig, DRI Development, LKML, KVM list, Linux MM,
	Linux ARM, linux-samsung-soc,
	open list:DMA BUFFER SHARING FRAMEWORK, Daniel Vetter, Kees Cook,
	Dan Williams, Andrew Morton, John Hubbard, J??r??me Glisse,
	Jan Kara

On Mon, Nov 02, 2020 at 02:23:58PM +0100, Daniel Vetter wrote:
> On Mon, Nov 2, 2020 at 2:01 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Mon, Nov 02, 2020 at 01:56:10PM +0100, Daniel Vetter wrote:
> > > On Mon, Nov 2, 2020 at 8:29 AM Christoph Hellwig <hch@infradead.org> wrote:
> > > >
> > > > On Fri, Oct 30, 2020 at 11:08:08AM +0100, Daniel Vetter wrote:
> > > > > Also mark up follow_pfn as EXPORT_SYMBOL_GPL. The only safe way to use
> > > > > that by drivers/modules is together with an mmu_notifier, and that's
> > > > > all _GPL stuff.
> > > >
> > > > I also think it also needs to be renamed to explicitly break any existing
> > > > users out of tree or int the submission queue.
> > >
> > > Ok I looked at the mmu notifier locking again and noticed that
> > > mm->subscriptions has its own spinlock. Since there usually shouldn't
> > > be a huge pile of these I think it's feasible to check for the mmu
> > > notifier in follow_pfn. And that would stuff this gap for good. I'll
> > > throw that on top as a final patch and see what people think.
> >
> > Probably the simplest is to just check mm_has_notifiers() when in
> > lockdep or something very simple like that
> 
> lockdep feels wrong, was locking more at CONFIG_DEBUG_VM. And since
> generally you only have 1 mmu notifier (especially for kvm) I think we
> can also pay the 2nd cacheline miss and actually check the right mmu
> notifier is registered.

Need to hold the lock to check that and there are two ways to register
notifiers these days, so it feels to expensive to me.

CH's 'export symbol only for kvm' really does seem the most robust way
to handle this though.

Jason

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 08/15] mm: Add unsafe_follow_pfn
  2020-11-02 15:52           ` Jason Gunthorpe
@ 2020-11-02 16:41             ` Christoph Hellwig
  2020-11-02 16:42             ` Daniel Vetter
  1 sibling, 0 replies; 62+ messages in thread
From: Christoph Hellwig @ 2020-11-02 16:41 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Daniel Vetter, Christoph Hellwig, DRI Development, LKML,
	KVM list, Linux MM, Linux ARM, linux-samsung-soc,
	open list:DMA BUFFER SHARING FRAMEWORK, Daniel Vetter, Kees Cook,
	Dan Williams, Andrew Morton, John Hubbard, J??r??me Glisse,
	Jan Kara

On Mon, Nov 02, 2020 at 11:52:56AM -0400, Jason Gunthorpe wrote:
> Need to hold the lock to check that and there are two ways to register
> notifiers these days, so it feels to expensive to me.
> 
> CH's 'export symbol only for kvm' really does seem the most robust way
> to handle this though.

I hope I can get that done for this merge window, but I'm not sure.

I still think we should at least have a new name for the old follow_pfn
that no one should use.  And it should sound more scary than
unsafe_follow_pfn :)

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 08/15] mm: Add unsafe_follow_pfn
  2020-11-02 15:52           ` Jason Gunthorpe
  2020-11-02 16:41             ` Christoph Hellwig
@ 2020-11-02 16:42             ` Daniel Vetter
  2020-11-02 17:16               ` Jason Gunthorpe
  1 sibling, 1 reply; 62+ messages in thread
From: Daniel Vetter @ 2020-11-02 16:42 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, DRI Development, LKML, KVM list, Linux MM,
	Linux ARM, linux-samsung-soc,
	open list:DMA BUFFER SHARING FRAMEWORK, Daniel Vetter, Kees Cook,
	Dan Williams, Andrew Morton, John Hubbard, J??r??me Glisse,
	Jan Kara

On Mon, Nov 2, 2020 at 4:52 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Mon, Nov 02, 2020 at 02:23:58PM +0100, Daniel Vetter wrote:
> > On Mon, Nov 2, 2020 at 2:01 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > >
> > > On Mon, Nov 02, 2020 at 01:56:10PM +0100, Daniel Vetter wrote:
> > > > On Mon, Nov 2, 2020 at 8:29 AM Christoph Hellwig <hch@infradead.org> wrote:
> > > > >
> > > > > On Fri, Oct 30, 2020 at 11:08:08AM +0100, Daniel Vetter wrote:
> > > > > > Also mark up follow_pfn as EXPORT_SYMBOL_GPL. The only safe way to use
> > > > > > that by drivers/modules is together with an mmu_notifier, and that's
> > > > > > all _GPL stuff.
> > > > >
> > > > > I also think it also needs to be renamed to explicitly break any existing
> > > > > users out of tree or int the submission queue.
> > > >
> > > > Ok I looked at the mmu notifier locking again and noticed that
> > > > mm->subscriptions has its own spinlock. Since there usually shouldn't
> > > > be a huge pile of these I think it's feasible to check for the mmu
> > > > notifier in follow_pfn. And that would stuff this gap for good. I'll
> > > > throw that on top as a final patch and see what people think.
> > >
> > > Probably the simplest is to just check mm_has_notifiers() when in
> > > lockdep or something very simple like that
> >
> > lockdep feels wrong, was locking more at CONFIG_DEBUG_VM. And since
> > generally you only have 1 mmu notifier (especially for kvm) I think we
> > can also pay the 2nd cacheline miss and actually check the right mmu
> > notifier is registered.
>
> Need to hold the lock to check that and there are two ways to register
> notifiers these days, so it feels to expensive to me.

Uh I mixed stuff up all along, struct mmu_notifier *subcription that
all the mmu notifier users use has the ->mm pointer we want right
there. That's good enough I think.

Now I'm kinda lost in kvm code trying to wire it through, but it's
looking ok-ish thus far :-)
-Daniel

> CH's 'export symbol only for kvm' really does seem the most robust way
> to handle this though.
>
> Jason



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 08/15] mm: Add unsafe_follow_pfn
  2020-11-02 16:42             ` Daniel Vetter
@ 2020-11-02 17:16               ` Jason Gunthorpe
  0 siblings, 0 replies; 62+ messages in thread
From: Jason Gunthorpe @ 2020-11-02 17:16 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Christoph Hellwig, DRI Development, LKML, KVM list, Linux MM,
	Linux ARM, linux-samsung-soc,
	open list:DMA BUFFER SHARING FRAMEWORK, Daniel Vetter, Kees Cook,
	Dan Williams, Andrew Morton, John Hubbard, J??r??me Glisse,
	Jan Kara

On Mon, Nov 02, 2020 at 05:42:20PM +0100, Daniel Vetter wrote:
> > Need to hold the lock to check that and there are two ways to register
> > notifiers these days, so it feels to expensive to me.
> 
> Uh I mixed stuff up all along, struct mmu_notifier *subcription that
> all the mmu notifier users use has the ->mm pointer we want right
> there. That's good enough I think.

Yah, if you can pass in one of those instead of the raw mm it would be
fine

Jason

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 05/15] mm/frame-vector: Use FOLL_LONGTERM
  2020-10-30 14:37     ` Daniel Vetter
@ 2020-11-02 18:19       ` Tomasz Figa
  0 siblings, 0 replies; 62+ messages in thread
From: Tomasz Figa @ 2020-11-02 18:19 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Jérôme Glisse, linux-samsung-soc, Jan Kara,
	Pawel Osciak, kvm, Jason Gunthorpe, John Hubbard,
	Mauro Carvalho Chehab, LKML, DRI Development, Linux MM,
	Kyungmin Park, Daniel Vetter, Andrew Morton, Marek Szyprowski,
	Dan Williams,
	list@263.net:IOMMU DRIVERS
	<iommu@lists.linux-foundation.org>,
	Joerg Roedel <joro@8bytes.org>,,
	Linux Media Mailing List

On Fri, Oct 30, 2020 at 3:38 PM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>
> On Fri, Oct 30, 2020 at 3:11 PM Tomasz Figa <tfiga@chromium.org> wrote:
> >
> > On Fri, Oct 30, 2020 at 11:08 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > >
> > > This is used by media/videbuf2 for persistent dma mappings, not just
> > > for a single dma operation and then freed again, so needs
> > > FOLL_LONGTERM.
> > >
> > > Unfortunately current pup_locked doesn't support FOLL_LONGTERM due to
> > > locking issues. Rework the code to pull the pup path out from the
> > > mmap_sem critical section as suggested by Jason.
> > >
> > > By relying entirely on the vma checks in pin_user_pages and follow_pfn
> > > (for vm_flags and vma_is_fsdax) we can also streamline the code a lot.
> > >
> > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > > Cc: Jason Gunthorpe <jgg@ziepe.ca>
> > > Cc: Pawel Osciak <pawel@osciak.com>
> > > Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> > > Cc: Kyungmin Park <kyungmin.park@samsung.com>
> > > Cc: Tomasz Figa <tfiga@chromium.org>
> > > Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
> > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > Cc: John Hubbard <jhubbard@nvidia.com>
> > > Cc: Jérôme Glisse <jglisse@redhat.com>
> > > Cc: Jan Kara <jack@suse.cz>
> > > Cc: Dan Williams <dan.j.williams@intel.com>
> > > Cc: linux-mm@kvack.org
> > > Cc: linux-arm-kernel@lists.infradead.org
> > > Cc: linux-samsung-soc@vger.kernel.org
> > > Cc: linux-media@vger.kernel.org
> > > Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> > > --
> > > v2: Streamline the code and further simplify the loop checks (Jason)
> > >
> > > v5: Review from Tomasz:
> > > - fix page counting for the follow_pfn case by resetting ret
> > > - drop gup_flags paramater, now unused
> > > ---
> > >  .../media/common/videobuf2/videobuf2-memops.c |  3 +-
> > >  include/linux/mm.h                            |  2 +-
> > >  mm/frame_vector.c                             | 53 ++++++-------------
> > >  3 files changed, 19 insertions(+), 39 deletions(-)
> > >
> >
> > Thanks, looks good to me now.
> >
> > Acked-by: Tomasz Figa <tfiga@chromium.org>
> >
> > From reading the code, this is quite unlikely to introduce any
> > behavior changes, but just to be safe, did you have a chance to test
> > this with some V4L2 driver?
>
> Nah, unfortunately not.

I believe we don't have any setup that could exercise the IO/PFNMAP
user pointers, but it should be possible to exercise the basic userptr
path by enabling the virtual (fake) video driver, vivid or
CONFIG_VIDEO_VIVID, in your kernel and then using yavta [1] with
--userptr and --capture=<number of frames> (and possibly some more
options) to grab a couple of frames from the test pattern generator.

Does it sound like something that you could give a try? Feel free to
ping me on IRC (tfiga on #v4l or #dri-devel) if you need any help.

[1] https://git.ideasonboard.org/yavta.git

Best regards,
Tomasz

> -Daniel
>
> >
> > Best regards,
> > Tomasz
> >
> > > diff --git a/drivers/media/common/videobuf2/videobuf2-memops.c b/drivers/media/common/videobuf2/videobuf2-memops.c
> > > index 6e9e05153f4e..9dd6c27162f4 100644
> > > --- a/drivers/media/common/videobuf2/videobuf2-memops.c
> > > +++ b/drivers/media/common/videobuf2/videobuf2-memops.c
> > > @@ -40,7 +40,6 @@ struct frame_vector *vb2_create_framevec(unsigned long start,
> > >         unsigned long first, last;
> > >         unsigned long nr;
> > >         struct frame_vector *vec;
> > > -       unsigned int flags = FOLL_FORCE | FOLL_WRITE;
> > >
> > >         first = start >> PAGE_SHIFT;
> > >         last = (start + length - 1) >> PAGE_SHIFT;
> > > @@ -48,7 +47,7 @@ struct frame_vector *vb2_create_framevec(unsigned long start,
> > >         vec = frame_vector_create(nr);
> > >         if (!vec)
> > >                 return ERR_PTR(-ENOMEM);
> > > -       ret = get_vaddr_frames(start & PAGE_MASK, nr, flags, vec);
> > > +       ret = get_vaddr_frames(start & PAGE_MASK, nr, vec);
> > >         if (ret < 0)
> > >                 goto out_destroy;
> > >         /* We accept only complete set of PFNs */
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index ef360fe70aaf..d6b8e30dce2e 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -1765,7 +1765,7 @@ struct frame_vector {
> > >  struct frame_vector *frame_vector_create(unsigned int nr_frames);
> > >  void frame_vector_destroy(struct frame_vector *vec);
> > >  int get_vaddr_frames(unsigned long start, unsigned int nr_pfns,
> > > -                    unsigned int gup_flags, struct frame_vector *vec);
> > > +                    struct frame_vector *vec);
> > >  void put_vaddr_frames(struct frame_vector *vec);
> > >  int frame_vector_to_pages(struct frame_vector *vec);
> > >  void frame_vector_to_pfns(struct frame_vector *vec);
> > > diff --git a/mm/frame_vector.c b/mm/frame_vector.c
> > > index 10f82d5643b6..f8c34b895c76 100644
> > > --- a/mm/frame_vector.c
> > > +++ b/mm/frame_vector.c
> > > @@ -32,13 +32,12 @@
> > >   * This function takes care of grabbing mmap_lock as necessary.
> > >   */
> > >  int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
> > > -                    unsigned int gup_flags, struct frame_vector *vec)
> > > +                    struct frame_vector *vec)
> > >  {
> > >         struct mm_struct *mm = current->mm;
> > >         struct vm_area_struct *vma;
> > >         int ret = 0;
> > >         int err;
> > > -       int locked;
> > >
> > >         if (nr_frames == 0)
> > >                 return 0;
> > > @@ -48,40 +47,26 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
> > >
> > >         start = untagged_addr(start);
> > >
> > > -       mmap_read_lock(mm);
> > > -       locked = 1;
> > > -       vma = find_vma_intersection(mm, start, start + 1);
> > > -       if (!vma) {
> > > -               ret = -EFAULT;
> > > -               goto out;
> > > -       }
> > > -
> > > -       /*
> > > -        * While get_vaddr_frames() could be used for transient (kernel
> > > -        * controlled lifetime) pinning of memory pages all current
> > > -        * users establish long term (userspace controlled lifetime)
> > > -        * page pinning. Treat get_vaddr_frames() like
> > > -        * get_user_pages_longterm() and disallow it for filesystem-dax
> > > -        * mappings.
> > > -        */
> > > -       if (vma_is_fsdax(vma)) {
> > > -               ret = -EOPNOTSUPP;
> > > -               goto out;
> > > -       }
> > > -
> > > -       if (!(vma->vm_flags & (VM_IO | VM_PFNMAP))) {
> > > +       ret = pin_user_pages_fast(start, nr_frames,
> > > +                                 FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM,
> > > +                                 (struct page **)(vec->ptrs));
> > > +       if (ret > 0) {
> > >                 vec->got_ref = true;
> > >                 vec->is_pfns = false;
> > > -               ret = pin_user_pages_locked(start, nr_frames,
> > > -                       gup_flags, (struct page **)(vec->ptrs), &locked);
> > > -               goto out;
> > > +               goto out_unlocked;
> > >         }
> > >
> > > +       mmap_read_lock(mm);
> > >         vec->got_ref = false;
> > >         vec->is_pfns = true;
> > > +       ret = 0;
> > >         do {
> > >                 unsigned long *nums = frame_vector_pfns(vec);
> > >
> > > +               vma = find_vma_intersection(mm, start, start + 1);
> > > +               if (!vma)
> > > +                       break;
> > > +
> > >                 while (ret < nr_frames && start + PAGE_SIZE <= vma->vm_end) {
> > >                         err = follow_pfn(vma, start, &nums[ret]);
> > >                         if (err) {
> > > @@ -92,17 +77,13 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
> > >                         start += PAGE_SIZE;
> > >                         ret++;
> > >                 }
> > > -               /*
> > > -                * We stop if we have enough pages or if VMA doesn't completely
> > > -                * cover the tail page.
> > > -                */
> > > -               if (ret >= nr_frames || start < vma->vm_end)
> > > +               /* Bail out if VMA doesn't completely cover the tail page. */
> > > +               if (start < vma->vm_end)
> > >                         break;
> > > -               vma = find_vma_intersection(mm, start, start + 1);
> > > -       } while (vma && vma->vm_flags & (VM_IO | VM_PFNMAP));
> > > +       } while (ret < nr_frames);
> > >  out:
> > > -       if (locked)
> > > -               mmap_read_unlock(mm);
> > > +       mmap_read_unlock(mm);
> > > +out_unlocked:
> > >         if (!ret)
> > >                 ret = -EFAULT;
> > >         if (ret > 0)
> > > --
> > > 2.28.0
> > >
> > _______________________________________________
> > dri-devel mailing list
> > dri-devel@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/dri-devel
>
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 11/15] PCI: Obey iomem restrictions for procfs mmap
  2020-10-30 10:08 ` [PATCH v5 11/15] PCI: Obey iomem restrictions for procfs mmap Daniel Vetter
@ 2020-11-03 21:28   ` Bjorn Helgaas
  2020-11-03 22:09     ` Dan Williams
  0 siblings, 1 reply; 62+ messages in thread
From: Bjorn Helgaas @ 2020-11-03 21:28 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: DRI Development, LKML, kvm, linux-mm, linux-arm-kernel,
	linux-samsung-soc, linux-media, Daniel Vetter, Jason Gunthorpe,
	Kees Cook, Dan Williams, Andrew Morton, John Hubbard,
	Jérôme Glisse, Jan Kara, Bjorn Helgaas, linux-pci

On Fri, Oct 30, 2020 at 11:08:11AM +0100, Daniel Vetter wrote:
> There's three ways to access PCI BARs from userspace: /dev/mem, sysfs
> files, and the old proc interface. Two check against
> iomem_is_exclusive, proc never did. And with CONFIG_IO_STRICT_DEVMEM,
> this starts to matter, since we don't want random userspace having
> access to PCI BARs while a driver is loaded and using it.
> 
> Fix this by adding the same iomem_is_exclusive() check we already have
> on the sysfs side in pci_mmap_resource().
> 
> References: 90a545e98126 ("restrict /dev/mem to idle io memory ranges")
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>

This is OK with me but it looks like IORESOURCE_EXCLUSIVE is currently
only used in a few places:

  e1000_probe() calls pci_request_selected_regions_exclusive(),
  ne_pci_probe() calls pci_request_regions_exclusive(),
  vmbus_allocate_mmio() calls request_mem_region_exclusive()

which raises the question of whether it's worth keeping
IORESOURCE_EXCLUSIVE at all.  I'm totally fine with removing it
completely.

But if you want it,

Acked-by: Bjorn Helgaas <bhelgaas@google.com>

> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Jérôme Glisse <jglisse@redhat.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: linux-mm@kvack.org
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-samsung-soc@vger.kernel.org
> Cc: linux-media@vger.kernel.org
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: linux-pci@vger.kernel.org
> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> --
> v2: Improve commit message (Bjorn)
> ---
>  drivers/pci/proc.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/drivers/pci/proc.c b/drivers/pci/proc.c
> index d35186b01d98..3a2f90beb4cb 100644
> --- a/drivers/pci/proc.c
> +++ b/drivers/pci/proc.c
> @@ -274,6 +274,11 @@ static int proc_bus_pci_mmap(struct file *file, struct vm_area_struct *vma)
>  		else
>  			return -EINVAL;
>  	}
> +
> +	if (dev->resource[i].flags & IORESOURCE_MEM &&
> +	    iomem_is_exclusive(dev->resource[i].start))
> +		return -EINVAL;
> +
>  	ret = pci_mmap_page_range(dev, i, vma,
>  				  fpriv->mmap_state, write_combine);
>  	if (ret < 0)
> -- 
> 2.28.0
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 15/15] PCI: Revoke mappings like devmem
  2020-10-30 10:08 ` [PATCH v5 15/15] PCI: Revoke mappings like devmem Daniel Vetter
  2020-10-30 19:22   ` Dan Williams
@ 2020-11-03 21:30   ` Bjorn Helgaas
  1 sibling, 0 replies; 62+ messages in thread
From: Bjorn Helgaas @ 2020-11-03 21:30 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: DRI Development, LKML, kvm, linux-mm, linux-arm-kernel,
	linux-samsung-soc, linux-media, Daniel Vetter, Jason Gunthorpe,
	Kees Cook, Dan Williams, Andrew Morton, John Hubbard,
	Jérôme Glisse, Jan Kara, Greg Kroah-Hartman,
	Bjorn Helgaas, linux-pci

On Fri, Oct 30, 2020 at 11:08:15AM +0100, Daniel Vetter wrote:
> Since 3234ac664a87 ("/dev/mem: Revoke mappings when a driver claims
> the region") /dev/kmem zaps ptes when the kernel requests exclusive
> acccess to an iomem region. And with CONFIG_IO_STRICT_DEVMEM, this is
> the default for all driver uses.
> 
> Except there's two more ways to access PCI BARs: sysfs and proc mmap
> support. Let's plug that hole.
> 
> For revoke_devmem() to work we need to link our vma into the same
> address_space, with consistent vma->vm_pgoff. ->pgoff is already
> adjusted, because that's how (io_)remap_pfn_range works, but for the
> mapping we need to adjust vma->vm_file->f_mapping. The cleanest way is
> to adjust this at at ->open time:
> 
> - for sysfs this is easy, now that binary attributes support this. We
>   just set bin_attr->mapping when mmap is supported
> - for procfs it's a bit more tricky, since procfs pci access has only
>   one file per device, and access to a specific resources first needs
>   to be set up with some ioctl calls. But mmap is only supported for
>   the same resources as sysfs exposes with mmap support, and otherwise
>   rejected, so we can set the mapping unconditionally at open time
>   without harm.
> 
> A special consideration is for arch_can_pci_mmap_io() - we need to
> make sure that the ->f_mapping doesn't alias between ioport and iomem
> space. There's only 2 ways in-tree to support mmap of ioports: generic
> pci mmap (ARCH_GENERIC_PCI_MMAP_RESOURCE), and sparc as the single
> architecture hand-rolling. Both approach support ioport mmap through a
> special pfn range and not through magic pte attributes. Aliasing is
> therefore not a problem.
> 
> The only difference in access checks left is that sysfs PCI mmap does
> not check for CAP_RAWIO. I'm not really sure whether that should be
> added or not.
> 
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Jérôme Glisse <jglisse@redhat.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: linux-mm@kvack.org
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-samsung-soc@vger.kernel.org
> Cc: linux-media@vger.kernel.org
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: linux-pci@vger.kernel.org
> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>

Acked-by: Bjorn Helgaas <bhelgaas@google.com>

> --
> v2:
> - Totally new approach: Adjust filp->f_mapping at open time. Note that
>   this now works on all architectures, not just those support
>   ARCH_GENERIC_PCI_MMAP_RESOURCE
> ---
>  drivers/pci/pci-sysfs.c | 4 ++++
>  drivers/pci/proc.c      | 1 +
>  2 files changed, 5 insertions(+)
> 
> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> index d15c881e2e7e..3f1c31bc0b7c 100644
> --- a/drivers/pci/pci-sysfs.c
> +++ b/drivers/pci/pci-sysfs.c
> @@ -929,6 +929,7 @@ void pci_create_legacy_files(struct pci_bus *b)
>  	b->legacy_io->read = pci_read_legacy_io;
>  	b->legacy_io->write = pci_write_legacy_io;
>  	b->legacy_io->mmap = pci_mmap_legacy_io;
> +	b->legacy_io->mapping = iomem_get_mapping();
>  	pci_adjust_legacy_attr(b, pci_mmap_io);
>  	error = device_create_bin_file(&b->dev, b->legacy_io);
>  	if (error)
> @@ -941,6 +942,7 @@ void pci_create_legacy_files(struct pci_bus *b)
>  	b->legacy_mem->size = 1024*1024;
>  	b->legacy_mem->attr.mode = 0600;
>  	b->legacy_mem->mmap = pci_mmap_legacy_mem;
> +	b->legacy_io->mapping = iomem_get_mapping();
>  	pci_adjust_legacy_attr(b, pci_mmap_mem);
>  	error = device_create_bin_file(&b->dev, b->legacy_mem);
>  	if (error)
> @@ -1156,6 +1158,8 @@ static int pci_create_attr(struct pci_dev *pdev, int num, int write_combine)
>  			res_attr->mmap = pci_mmap_resource_uc;
>  		}
>  	}
> +	if (res_attr->mmap)
> +		res_attr->mapping = iomem_get_mapping();
>  	res_attr->attr.name = res_attr_name;
>  	res_attr->attr.mode = 0600;
>  	res_attr->size = pci_resource_len(pdev, num);
> diff --git a/drivers/pci/proc.c b/drivers/pci/proc.c
> index 3a2f90beb4cb..9bab07302bbf 100644
> --- a/drivers/pci/proc.c
> +++ b/drivers/pci/proc.c
> @@ -298,6 +298,7 @@ static int proc_bus_pci_open(struct inode *inode, struct file *file)
>  	fpriv->write_combine = 0;
>  
>  	file->private_data = fpriv;
> +	file->f_mapping = iomem_get_mapping();
>  
>  	return 0;
>  }
> -- 
> 2.28.0
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 11/15] PCI: Obey iomem restrictions for procfs mmap
  2020-11-03 21:28   ` Bjorn Helgaas
@ 2020-11-03 22:09     ` Dan Williams
  2020-11-04  8:44       ` Daniel Vetter
  0 siblings, 1 reply; 62+ messages in thread
From: Dan Williams @ 2020-11-03 22:09 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Daniel Vetter, DRI Development, LKML, KVM list, Linux MM,
	Linux ARM, linux-samsung-soc, Linux-media@vger.kernel.org,
	Daniel Vetter, Jason Gunthorpe, Kees Cook, Andrew Morton,
	John Hubbard, Jérôme Glisse, Jan Kara, Bjorn Helgaas,
	Linux PCI

On Tue, Nov 3, 2020 at 1:28 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Fri, Oct 30, 2020 at 11:08:11AM +0100, Daniel Vetter wrote:
> > There's three ways to access PCI BARs from userspace: /dev/mem, sysfs
> > files, and the old proc interface. Two check against
> > iomem_is_exclusive, proc never did. And with CONFIG_IO_STRICT_DEVMEM,
> > this starts to matter, since we don't want random userspace having
> > access to PCI BARs while a driver is loaded and using it.
> >
> > Fix this by adding the same iomem_is_exclusive() check we already have
> > on the sysfs side in pci_mmap_resource().
> >
> > References: 90a545e98126 ("restrict /dev/mem to idle io memory ranges")
> > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
>
> This is OK with me but it looks like IORESOURCE_EXCLUSIVE is currently
> only used in a few places:
>
>   e1000_probe() calls pci_request_selected_regions_exclusive(),
>   ne_pci_probe() calls pci_request_regions_exclusive(),
>   vmbus_allocate_mmio() calls request_mem_region_exclusive()
>
> which raises the question of whether it's worth keeping
> IORESOURCE_EXCLUSIVE at all.  I'm totally fine with removing it
> completely.

Now that CONFIG_IO_STRICT_DEVMEM upgrades IORESOURCE_BUSY to
IORESOURCE_EXCLUSIVE semantics the latter has lost its meaning so I'd
be in favor of removing it as well.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 11/15] PCI: Obey iomem restrictions for procfs mmap
  2020-11-03 22:09     ` Dan Williams
@ 2020-11-04  8:44       ` Daniel Vetter
  2020-11-04 16:50         ` Bjorn Helgaas
  0 siblings, 1 reply; 62+ messages in thread
From: Daniel Vetter @ 2020-11-04  8:44 UTC (permalink / raw)
  To: Dan Williams
  Cc: Bjorn Helgaas, DRI Development, LKML, KVM list, Linux MM,
	Linux ARM, linux-samsung-soc, Linux-media@vger.kernel.org,
	Daniel Vetter, Jason Gunthorpe, Kees Cook, Andrew Morton,
	John Hubbard, Jérôme Glisse, Jan Kara, Bjorn Helgaas,
	Linux PCI

On Tue, Nov 3, 2020 at 11:09 PM Dan Williams <dan.j.williams@intel.com> wrote:
> On Tue, Nov 3, 2020 at 1:28 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Fri, Oct 30, 2020 at 11:08:11AM +0100, Daniel Vetter wrote:
> > > There's three ways to access PCI BARs from userspace: /dev/mem, sysfs
> > > files, and the old proc interface. Two check against
> > > iomem_is_exclusive, proc never did. And with CONFIG_IO_STRICT_DEVMEM,
> > > this starts to matter, since we don't want random userspace having
> > > access to PCI BARs while a driver is loaded and using it.
> > >
> > > Fix this by adding the same iomem_is_exclusive() check we already have
> > > on the sysfs side in pci_mmap_resource().
> > >
> > > References: 90a545e98126 ("restrict /dev/mem to idle io memory ranges")
> > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> >
> > This is OK with me but it looks like IORESOURCE_EXCLUSIVE is currently
> > only used in a few places:
> >
> >   e1000_probe() calls pci_request_selected_regions_exclusive(),
> >   ne_pci_probe() calls pci_request_regions_exclusive(),
> >   vmbus_allocate_mmio() calls request_mem_region_exclusive()
> >
> > which raises the question of whether it's worth keeping
> > IORESOURCE_EXCLUSIVE at all.  I'm totally fine with removing it
> > completely.
>
> Now that CONFIG_IO_STRICT_DEVMEM upgrades IORESOURCE_BUSY to
> IORESOURCE_EXCLUSIVE semantics the latter has lost its meaning so I'd
> be in favor of removing it as well.

Still has some value since it enforces exclusive access even if the
config isn't enabled, and iirc e1000 had some fun with userspace tools
clobbering the firmware and bricking the chip.

Another thing I kinda wondered, since pci maintainer is here: At least
in drivers/gpu I see very few drivers explicitly requestion regions
(this might be a historical artifact due to the shadow attach stuff
before we had real modesetting drivers). And pci core doesn't do that
either, even when a driver is bound. Is this intentional, or
should/could we do better? Since drivers work happily without
reserving regions I don't think "the drivers need to remember to do
this" will ever really work out well.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 05/15] mm/frame-vector: Use FOLL_LONGTERM
  2020-11-01 22:50             ` Daniel Vetter
@ 2020-11-04 14:00               ` Jason Gunthorpe
  2020-11-04 15:54                 ` Daniel Vetter
  0 siblings, 1 reply; 62+ messages in thread
From: Jason Gunthorpe @ 2020-11-04 14:00 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: John Hubbard, DRI Development, LKML, KVM list, Linux MM,
	Linux ARM, linux-samsung-soc,
	open list:DMA BUFFER SHARING FRAMEWORK, Daniel Vetter,
	Pawel Osciak, Marek Szyprowski, Kyungmin Park, Tomasz Figa,
	Mauro Carvalho Chehab, Andrew Morton, Jérôme Glisse,
	Jan Kara, Dan Williams

On Sun, Nov 01, 2020 at 11:50:39PM +0100, Daniel Vetter wrote:

> It's not device drivers, but everyone else. At least my understanding
> is that VM_IO | VM_PFNMAP means "even if it happens to be backed by a
> struct page, do not treat it like normal memory". And gup/pup_fast
> happily break that. I tried to chase the history of that test, didn't
> turn up anything I understood much:

VM_IO isn't suppose do thave struct pages, so how can gup_fast return
them?

I thought some magic in the PTE flags excluded this?

Jason

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 05/15] mm/frame-vector: Use FOLL_LONGTERM
  2020-11-04 14:00               ` Jason Gunthorpe
@ 2020-11-04 15:54                 ` Daniel Vetter
  2020-11-04 16:21                   ` Christoph Hellwig
  0 siblings, 1 reply; 62+ messages in thread
From: Daniel Vetter @ 2020-11-04 15:54 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: John Hubbard, DRI Development, LKML, KVM list, Linux MM,
	Linux ARM, linux-samsung-soc,
	open list:DMA BUFFER SHARING FRAMEWORK, Daniel Vetter,
	Pawel Osciak, Marek Szyprowski, Kyungmin Park, Tomasz Figa,
	Mauro Carvalho Chehab, Andrew Morton, Jérôme Glisse,
	Jan Kara, Dan Williams

On Wed, Nov 4, 2020 at 3:00 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Sun, Nov 01, 2020 at 11:50:39PM +0100, Daniel Vetter wrote:
>
> > It's not device drivers, but everyone else. At least my understanding
> > is that VM_IO | VM_PFNMAP means "even if it happens to be backed by a
> > struct page, do not treat it like normal memory". And gup/pup_fast
> > happily break that. I tried to chase the history of that test, didn't
> > turn up anything I understood much:
>
> VM_IO isn't suppose do thave struct pages, so how can gup_fast return
> them?
>
> I thought some magic in the PTE flags excluded this?

I don't really have a box here, but dma_mmap_attrs() and friends to
mmap dma_alloc_coherent memory is set up as VM_IO | VM_PFNMAP (it's
actually enforced since underneath it uses remap_pfn_range), and
usually (except if it's pre-cma carveout) that's just normal struct
page backed memory. Sometimes from a cma region (so will be caught by
the cma page check), but if you have an iommu to make it
device-contiguous, that's not needed.

I think only some architectures have a special io pte flag, and those
are only used for real mmio access. And I think the popular ones all
don't. But that stuff is really not my expertise, just some drive-by
reading I've done to understand how the pci mmap stuff works (which is
special in yet other ways I think).

So probably I'm missing something, but I'm not seeing anything that
prevents this from coming out of a  pup/gup_fast.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 05/15] mm/frame-vector: Use FOLL_LONGTERM
  2020-11-04 15:54                 ` Daniel Vetter
@ 2020-11-04 16:21                   ` Christoph Hellwig
  2020-11-04 16:26                     ` Daniel Vetter
  0 siblings, 1 reply; 62+ messages in thread
From: Christoph Hellwig @ 2020-11-04 16:21 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Jason Gunthorpe, J??r??me Glisse, linux-samsung-soc, Jan Kara,
	Pawel Osciak, KVM list, John Hubbard, Mauro Carvalho Chehab,
	LKML, DRI Development, Tomasz Figa, Linux MM, Kyungmin Park,
	Daniel Vetter, Andrew Morton, Marek Szyprowski, Dan Williams,
	Linux ARM, open list:DMA BUFFER SHARING FRAMEWORK

On Wed, Nov 04, 2020 at 04:54:19PM +0100, Daniel Vetter wrote:
> I don't really have a box here, but dma_mmap_attrs() and friends to
> mmap dma_alloc_coherent memory is set up as VM_IO | VM_PFNMAP (it's
> actually enforced since underneath it uses remap_pfn_range), and
> usually (except if it's pre-cma carveout) that's just normal struct
> page backed memory. Sometimes from a cma region (so will be caught by
> the cma page check), but if you have an iommu to make it
> device-contiguous, that's not needed.

dma_mmap_* memory may or may not be page backed, but it absolutely
must not be resolved by get_user_pages and friends as it is special.
So yes, not being able to get a struct page back from such an mmap is
a feature.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 05/15] mm/frame-vector: Use FOLL_LONGTERM
  2020-11-04 16:21                   ` Christoph Hellwig
@ 2020-11-04 16:26                     ` Daniel Vetter
  2020-11-04 16:37                       ` Christoph Hellwig
  0 siblings, 1 reply; 62+ messages in thread
From: Daniel Vetter @ 2020-11-04 16:26 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jason Gunthorpe, J??r??me Glisse, linux-samsung-soc, Jan Kara,
	Pawel Osciak, KVM list, John Hubbard, Mauro Carvalho Chehab,
	LKML, DRI Development, Tomasz Figa, Linux MM, Kyungmin Park,
	Daniel Vetter, Andrew Morton, Marek Szyprowski, Dan Williams,
	Linux ARM, open list:DMA BUFFER SHARING FRAMEWORK

On Wed, Nov 4, 2020 at 5:21 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Wed, Nov 04, 2020 at 04:54:19PM +0100, Daniel Vetter wrote:
> > I don't really have a box here, but dma_mmap_attrs() and friends to
> > mmap dma_alloc_coherent memory is set up as VM_IO | VM_PFNMAP (it's
> > actually enforced since underneath it uses remap_pfn_range), and
> > usually (except if it's pre-cma carveout) that's just normal struct
> > page backed memory. Sometimes from a cma region (so will be caught by
> > the cma page check), but if you have an iommu to make it
> > device-contiguous, that's not needed.
>
> dma_mmap_* memory may or may not be page backed, but it absolutely
> must not be resolved by get_user_pages and friends as it is special.
> So yes, not being able to get a struct page back from such an mmap is
> a feature.

Yes, that's clear.

What we're discussing is whether gup_fast and pup_fast also obey this,
or fall over and can give you the struct page that's backing the
dma_mmap_* memory. Since the _fast variant doesn't check for
vma->vm_flags, and afaict that's the only thing which closes this gap.
And like you restate, that would be a bit a problem. So where's that
check which Jason&me aren't spotting?
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 05/15] mm/frame-vector: Use FOLL_LONGTERM
  2020-11-04 16:26                     ` Daniel Vetter
@ 2020-11-04 16:37                       ` Christoph Hellwig
  2020-11-04 16:41                         ` Christoph Hellwig
  0 siblings, 1 reply; 62+ messages in thread
From: Christoph Hellwig @ 2020-11-04 16:37 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Christoph Hellwig, Jason Gunthorpe, J??r??me Glisse,
	linux-samsung-soc, Jan Kara, Pawel Osciak, KVM list,
	John Hubbard, Mauro Carvalho Chehab, LKML, DRI Development,
	Tomasz Figa, Linux MM, Kyungmin Park, Daniel Vetter,
	Andrew Morton, Marek Szyprowski, Dan Williams, Linux ARM,
	open list:DMA BUFFER SHARING FRAMEWORK

On Wed, Nov 04, 2020 at 05:26:58PM +0100, Daniel Vetter wrote:
> What we're discussing is whether gup_fast and pup_fast also obey this,
> or fall over and can give you the struct page that's backing the
> dma_mmap_* memory. Since the _fast variant doesn't check for
> vma->vm_flags, and afaict that's the only thing which closes this gap.
> And like you restate, that would be a bit a problem. So where's that
> check which Jason&me aren't spotting?

remap_pte_range uses pte_mkspecial to set up the PTEs, and gup_pte_range
errors out on pte_special.  Of course this only works for the
CONFIG_ARCH_HAS_PTE_SPECIAL case, for other architectures we do have
a real problem.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 05/15] mm/frame-vector: Use FOLL_LONGTERM
  2020-11-04 16:37                       ` Christoph Hellwig
@ 2020-11-04 16:41                         ` Christoph Hellwig
  2020-11-04 18:17                           ` Jason Gunthorpe
  0 siblings, 1 reply; 62+ messages in thread
From: Christoph Hellwig @ 2020-11-04 16:41 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Christoph Hellwig, Jason Gunthorpe, J??r??me Glisse,
	linux-samsung-soc, Jan Kara, Pawel Osciak, KVM list,
	John Hubbard, Mauro Carvalho Chehab, LKML, DRI Development,
	Tomasz Figa, Linux MM, Kyungmin Park, Daniel Vetter,
	Andrew Morton, Marek Szyprowski, Dan Williams, Linux ARM,
	open list:DMA BUFFER SHARING FRAMEWORK

On Wed, Nov 04, 2020 at 04:37:58PM +0000, Christoph Hellwig wrote:
> On Wed, Nov 04, 2020 at 05:26:58PM +0100, Daniel Vetter wrote:
> > What we're discussing is whether gup_fast and pup_fast also obey this,
> > or fall over and can give you the struct page that's backing the
> > dma_mmap_* memory. Since the _fast variant doesn't check for
> > vma->vm_flags, and afaict that's the only thing which closes this gap.
> > And like you restate, that would be a bit a problem. So where's that
> > check which Jason&me aren't spotting?
> 
> remap_pte_range uses pte_mkspecial to set up the PTEs, and gup_pte_range
> errors out on pte_special.  Of course this only works for the
> CONFIG_ARCH_HAS_PTE_SPECIAL case, for other architectures we do have
> a real problem.

Except that we don't really support pte-level gup-fast without
CONFIG_ARCH_HAS_PTE_SPECIAL, and in fact all architectures selecting
HAVE_FAST_GUP also select ARCH_HAS_PTE_SPECIAL, so we should be fine.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 11/15] PCI: Obey iomem restrictions for procfs mmap
  2020-11-04  8:44       ` Daniel Vetter
@ 2020-11-04 16:50         ` Bjorn Helgaas
  2020-11-04 20:12           ` Dan Williams
  0 siblings, 1 reply; 62+ messages in thread
From: Bjorn Helgaas @ 2020-11-04 16:50 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Dan Williams, DRI Development, LKML, KVM list, Linux MM,
	Linux ARM, linux-samsung-soc, Linux-media@vger.kernel.org,
	Daniel Vetter, Jason Gunthorpe, Kees Cook, Andrew Morton,
	John Hubbard, Jérôme Glisse, Jan Kara, Bjorn Helgaas,
	Linux PCI

On Wed, Nov 04, 2020 at 09:44:04AM +0100, Daniel Vetter wrote:
> On Tue, Nov 3, 2020 at 11:09 PM Dan Williams <dan.j.williams@intel.com> wrote:
> > On Tue, Nov 3, 2020 at 1:28 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > On Fri, Oct 30, 2020 at 11:08:11AM +0100, Daniel Vetter wrote:
> > > > There's three ways to access PCI BARs from userspace: /dev/mem, sysfs
> > > > files, and the old proc interface. Two check against
> > > > iomem_is_exclusive, proc never did. And with CONFIG_IO_STRICT_DEVMEM,
> > > > this starts to matter, since we don't want random userspace having
> > > > access to PCI BARs while a driver is loaded and using it.
> > > >
> > > > Fix this by adding the same iomem_is_exclusive() check we already have
> > > > on the sysfs side in pci_mmap_resource().
> > > >
> > > > References: 90a545e98126 ("restrict /dev/mem to idle io memory ranges")
> > > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > >
> > > This is OK with me but it looks like IORESOURCE_EXCLUSIVE is currently
> > > only used in a few places:
> > >
> > >   e1000_probe() calls pci_request_selected_regions_exclusive(),
> > >   ne_pci_probe() calls pci_request_regions_exclusive(),
> > >   vmbus_allocate_mmio() calls request_mem_region_exclusive()
> > >
> > > which raises the question of whether it's worth keeping
> > > IORESOURCE_EXCLUSIVE at all.  I'm totally fine with removing it
> > > completely.
> >
> > Now that CONFIG_IO_STRICT_DEVMEM upgrades IORESOURCE_BUSY to
> > IORESOURCE_EXCLUSIVE semantics the latter has lost its meaning so I'd
> > be in favor of removing it as well.
> 
> Still has some value since it enforces exclusive access even if the
> config isn't enabled, and iirc e1000 had some fun with userspace tools
> clobbering the firmware and bricking the chip.

There's *some* value; I'm just skeptical since only three drivers use
it.

IORESOURCE_EXCLUSIVE is from e8de1481fd71 ("resource: allow MMIO
exclusivity for device drivers"), and the commit message says this is
only active when CONFIG_STRICT_DEVMEM is set.  I didn't check to see
whether that's still true.

That commit adds a bunch of wrappers and "__"-prefixed functions to
pass the IORESOURCE_EXCLUSIVE flag around.  That's a fair bit of
uglification for three drivers.

> Another thing I kinda wondered, since pci maintainer is here: At least
> in drivers/gpu I see very few drivers explicitly requestion regions
> (this might be a historical artifact due to the shadow attach stuff
> before we had real modesetting drivers). And pci core doesn't do that
> either, even when a driver is bound. Is this intentional, or
> should/could we do better? Since drivers work happily without
> reserving regions I don't think "the drivers need to remember to do
> this" will ever really work out well.

You're right, many drivers don't call pci_request_regions().  Maybe we
could do better, but I haven't looked into that recently.  There is a
related note in Documentation/PCI/pci.rst that's been there for a long
time (it refers to "pci_request_resources()", which has never existed
AFAICT).  I'm certainly open to proposals.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 05/15] mm/frame-vector: Use FOLL_LONGTERM
  2020-11-04 16:41                         ` Christoph Hellwig
@ 2020-11-04 18:17                           ` Jason Gunthorpe
  2020-11-04 18:44                             ` John Hubbard
  0 siblings, 1 reply; 62+ messages in thread
From: Jason Gunthorpe @ 2020-11-04 18:17 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Daniel Vetter, J??r??me Glisse, linux-samsung-soc, Jan Kara,
	Pawel Osciak, KVM list, John Hubbard, Mauro Carvalho Chehab,
	LKML, DRI Development, Tomasz Figa, Linux MM, Kyungmin Park,
	Daniel Vetter, Andrew Morton, Marek Szyprowski, Dan Williams,
	Linux ARM, open list:DMA BUFFER SHARING FRAMEWORK

On Wed, Nov 04, 2020 at 04:41:19PM +0000, Christoph Hellwig wrote:
> On Wed, Nov 04, 2020 at 04:37:58PM +0000, Christoph Hellwig wrote:
> > On Wed, Nov 04, 2020 at 05:26:58PM +0100, Daniel Vetter wrote:
> > > What we're discussing is whether gup_fast and pup_fast also obey this,
> > > or fall over and can give you the struct page that's backing the
> > > dma_mmap_* memory. Since the _fast variant doesn't check for
> > > vma->vm_flags, and afaict that's the only thing which closes this gap.
> > > And like you restate, that would be a bit a problem. So where's that
> > > check which Jason&me aren't spotting?
> > 
> > remap_pte_range uses pte_mkspecial to set up the PTEs, and gup_pte_range
> > errors out on pte_special.  Of course this only works for the
> > CONFIG_ARCH_HAS_PTE_SPECIAL case, for other architectures we do have
> > a real problem.
> 
> Except that we don't really support pte-level gup-fast without
> CONFIG_ARCH_HAS_PTE_SPECIAL, and in fact all architectures selecting
> HAVE_FAST_GUP also select ARCH_HAS_PTE_SPECIAL, so we should be fine.

Mm, I thought it was probably the special flag..

Knowing that CONFIG_HAVE_FAST_GUP can't be set without
CONFIG_ARCH_HAS_PTE_SPECIAL is pretty insightful, can we put that in
the Kconfig?

config HAVE_FAST_GUP
        depends on MMU
        depends on ARCH_HAS_PTE_SPECIAL
        bool

?

Jason

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 05/15] mm/frame-vector: Use FOLL_LONGTERM
  2020-11-04 18:17                           ` Jason Gunthorpe
@ 2020-11-04 18:44                             ` John Hubbard
  2020-11-04 19:02                               ` Jason Gunthorpe
  2020-11-05  9:25                               ` Daniel Vetter
  0 siblings, 2 replies; 62+ messages in thread
From: John Hubbard @ 2020-11-04 18:44 UTC (permalink / raw)
  To: Jason Gunthorpe, Christoph Hellwig
  Cc: Daniel Vetter, J??r??me Glisse, linux-samsung-soc, Jan Kara,
	Pawel Osciak, KVM list, Mauro Carvalho Chehab, LKML,
	DRI Development, Tomasz Figa, Linux MM, Kyungmin Park,
	Daniel Vetter, Andrew Morton, Marek Szyprowski, Dan Williams,
	Linux ARM, open list:DMA BUFFER SHARING FRAMEWORK

On 11/4/20 10:17 AM, Jason Gunthorpe wrote:
> On Wed, Nov 04, 2020 at 04:41:19PM +0000, Christoph Hellwig wrote:
>> On Wed, Nov 04, 2020 at 04:37:58PM +0000, Christoph Hellwig wrote:
>>> On Wed, Nov 04, 2020 at 05:26:58PM +0100, Daniel Vetter wrote:
>>>> What we're discussing is whether gup_fast and pup_fast also obey this,
>>>> or fall over and can give you the struct page that's backing the
>>>> dma_mmap_* memory. Since the _fast variant doesn't check for
>>>> vma->vm_flags, and afaict that's the only thing which closes this gap.
>>>> And like you restate, that would be a bit a problem. So where's that
>>>> check which Jason&me aren't spotting?
>>>
>>> remap_pte_range uses pte_mkspecial to set up the PTEs, and gup_pte_range
>>> errors out on pte_special.  Of course this only works for the
>>> CONFIG_ARCH_HAS_PTE_SPECIAL case, for other architectures we do have
>>> a real problem.
>>
>> Except that we don't really support pte-level gup-fast without
>> CONFIG_ARCH_HAS_PTE_SPECIAL, and in fact all architectures selecting
>> HAVE_FAST_GUP also select ARCH_HAS_PTE_SPECIAL, so we should be fine.
> 
> Mm, I thought it was probably the special flag..
> 
> Knowing that CONFIG_HAVE_FAST_GUP can't be set without
> CONFIG_ARCH_HAS_PTE_SPECIAL is pretty insightful, can we put that in
> the Kconfig?
> 
> config HAVE_FAST_GUP
>          depends on MMU
>          depends on ARCH_HAS_PTE_SPECIAL
>          bool
> 
Well, the !CONFIG_ARCH_HAS_PTE_SPECIAL case points out in a comment that
gup-fast is not *completely* unavailable there, so I don't think you want
to shut it off like that:

/*
  * If we can't determine whether or not a pte is special, then fail immediately
  * for ptes. Note, we can still pin HugeTLB and THP as these are guaranteed not
  * to be special.
  *
  * For a futex to be placed on a THP tail page, get_futex_key requires a
  * get_user_pages_fast_only implementation that can pin pages. Thus it's still
  * useful to have gup_huge_pmd even if we can't operate on ptes.
  */


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 05/15] mm/frame-vector: Use FOLL_LONGTERM
  2020-11-04 18:44                             ` John Hubbard
@ 2020-11-04 19:02                               ` Jason Gunthorpe
  2020-11-04 19:11                                 ` Christoph Hellwig
  2020-11-05  9:25                               ` Daniel Vetter
  1 sibling, 1 reply; 62+ messages in thread
From: Jason Gunthorpe @ 2020-11-04 19:02 UTC (permalink / raw)
  To: John Hubbard
  Cc: Christoph Hellwig, Daniel Vetter, J??r??me Glisse,
	linux-samsung-soc, Jan Kara, Pawel Osciak, KVM list,
	Mauro Carvalho Chehab, LKML, DRI Development, Tomasz Figa,
	Linux MM, Kyungmin Park, Daniel Vetter, Andrew Morton,
	Marek Szyprowski, Dan Williams, Linux ARM,
	open list:DMA BUFFER SHARING FRAMEWORK

On Wed, Nov 04, 2020 at 10:44:56AM -0800, John Hubbard wrote:
> On 11/4/20 10:17 AM, Jason Gunthorpe wrote:
> > On Wed, Nov 04, 2020 at 04:41:19PM +0000, Christoph Hellwig wrote:
> > > On Wed, Nov 04, 2020 at 04:37:58PM +0000, Christoph Hellwig wrote:
> > > > On Wed, Nov 04, 2020 at 05:26:58PM +0100, Daniel Vetter wrote:
> > > > > What we're discussing is whether gup_fast and pup_fast also obey this,
> > > > > or fall over and can give you the struct page that's backing the
> > > > > dma_mmap_* memory. Since the _fast variant doesn't check for
> > > > > vma->vm_flags, and afaict that's the only thing which closes this gap.
> > > > > And like you restate, that would be a bit a problem. So where's that
> > > > > check which Jason&me aren't spotting?
> > > > 
> > > > remap_pte_range uses pte_mkspecial to set up the PTEs, and gup_pte_range
> > > > errors out on pte_special.  Of course this only works for the
> > > > CONFIG_ARCH_HAS_PTE_SPECIAL case, for other architectures we do have
> > > > a real problem.
> > > 
> > > Except that we don't really support pte-level gup-fast without
> > > CONFIG_ARCH_HAS_PTE_SPECIAL, and in fact all architectures selecting
> > > HAVE_FAST_GUP also select ARCH_HAS_PTE_SPECIAL, so we should be fine.
> > 
> > Mm, I thought it was probably the special flag..
> > 
> > Knowing that CONFIG_HAVE_FAST_GUP can't be set without
> > CONFIG_ARCH_HAS_PTE_SPECIAL is pretty insightful, can we put that in
> > the Kconfig?
> > 
> > config HAVE_FAST_GUP
> >          depends on MMU
> >          depends on ARCH_HAS_PTE_SPECIAL
> >          bool
> > 
> Well, the !CONFIG_ARCH_HAS_PTE_SPECIAL case points out in a comment that
> gup-fast is not *completely* unavailable there, so I don't think you want
> to shut it off like that:
> 
> /*
>  * If we can't determine whether or not a pte is special, then fail immediately
>  * for ptes. Note, we can still pin HugeTLB and THP as these are guaranteed not
>  * to be special.
>  *
>  * For a futex to be placed on a THP tail page, get_futex_key requires a
>  * get_user_pages_fast_only implementation that can pin pages. Thus it's still
>  * useful to have gup_huge_pmd even if we can't operate on ptes.
>  */

I saw that once and I really couldn't make sense of it..
What use is having futex's that only work on THP pages? Confused

CH said there was no case of HAVE_FAST_GUP !ARCH_HAS_PTE_SPECIAL, is
one hidden someplace then?

Jason

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 05/15] mm/frame-vector: Use FOLL_LONGTERM
  2020-11-04 19:02                               ` Jason Gunthorpe
@ 2020-11-04 19:11                                 ` Christoph Hellwig
  0 siblings, 0 replies; 62+ messages in thread
From: Christoph Hellwig @ 2020-11-04 19:11 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: John Hubbard, Christoph Hellwig, Daniel Vetter, J??r??me Glisse,
	linux-samsung-soc, Jan Kara, Pawel Osciak, KVM list,
	Mauro Carvalho Chehab, LKML, DRI Development, Tomasz Figa,
	Linux MM, Kyungmin Park, Daniel Vetter, Andrew Morton,
	Marek Szyprowski, Dan Williams, Linux ARM,
	open list:DMA BUFFER SHARING FRAMEWORK, Thomas Bogendoerfer

On Wed, Nov 04, 2020 at 03:02:14PM -0400, Jason Gunthorpe wrote:
> I saw that once and I really couldn't make sense of it..
> What use is having futex's that only work on THP pages? Confused
> 
> CH said there was no case of HAVE_FAST_GUP !ARCH_HAS_PTE_SPECIAL, is
> one hidden someplace then?

ARCH_HAS_PTE_SPECIAL is selected by:

arc
arm		(if LPAE)
arm64
mips		(if !(32BIT && CPU_HAS_RIXI))
powerpc
riscv
s390
sh
sparc		(if SPARC64)
x86

HAVE_FAST_GUP is supported by:

arm		(if LPAE)
arm64
mips
powerpc
s390
sh
x86

so mips would be a candidate.  But I think only in theory, as
the options selecting CPU_HAS_RIXI do not select
CPU_SUPPORTS_HUGEPAGES.  Adding the mips maintainer to double check
my logic.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 11/15] PCI: Obey iomem restrictions for procfs mmap
  2020-11-04 16:50         ` Bjorn Helgaas
@ 2020-11-04 20:12           ` Dan Williams
  2020-11-05  9:34             ` Daniel Vetter
  0 siblings, 1 reply; 62+ messages in thread
From: Dan Williams @ 2020-11-04 20:12 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Daniel Vetter, DRI Development, LKML, KVM list, Linux MM,
	Linux ARM, linux-samsung-soc, Linux-media@vger.kernel.org,
	Daniel Vetter, Jason Gunthorpe, Kees Cook, Andrew Morton,
	John Hubbard, Jérôme Glisse, Jan Kara, Bjorn Helgaas,
	Linux PCI

On Wed, Nov 4, 2020 at 8:50 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Wed, Nov 04, 2020 at 09:44:04AM +0100, Daniel Vetter wrote:
> > On Tue, Nov 3, 2020 at 11:09 PM Dan Williams <dan.j.williams@intel.com> wrote:
> > > On Tue, Nov 3, 2020 at 1:28 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > On Fri, Oct 30, 2020 at 11:08:11AM +0100, Daniel Vetter wrote:
> > > > > There's three ways to access PCI BARs from userspace: /dev/mem, sysfs
> > > > > files, and the old proc interface. Two check against
> > > > > iomem_is_exclusive, proc never did. And with CONFIG_IO_STRICT_DEVMEM,
> > > > > this starts to matter, since we don't want random userspace having
> > > > > access to PCI BARs while a driver is loaded and using it.
> > > > >
> > > > > Fix this by adding the same iomem_is_exclusive() check we already have
> > > > > on the sysfs side in pci_mmap_resource().
> > > > >
> > > > > References: 90a545e98126 ("restrict /dev/mem to idle io memory ranges")
> > > > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > > >
> > > > This is OK with me but it looks like IORESOURCE_EXCLUSIVE is currently
> > > > only used in a few places:
> > > >
> > > >   e1000_probe() calls pci_request_selected_regions_exclusive(),
> > > >   ne_pci_probe() calls pci_request_regions_exclusive(),
> > > >   vmbus_allocate_mmio() calls request_mem_region_exclusive()
> > > >
> > > > which raises the question of whether it's worth keeping
> > > > IORESOURCE_EXCLUSIVE at all.  I'm totally fine with removing it
> > > > completely.
> > >
> > > Now that CONFIG_IO_STRICT_DEVMEM upgrades IORESOURCE_BUSY to
> > > IORESOURCE_EXCLUSIVE semantics the latter has lost its meaning so I'd
> > > be in favor of removing it as well.
> >
> > Still has some value since it enforces exclusive access even if the
> > config isn't enabled, and iirc e1000 had some fun with userspace tools
> > clobbering the firmware and bricking the chip.
>
> There's *some* value; I'm just skeptical since only three drivers use
> it.
>
> IORESOURCE_EXCLUSIVE is from e8de1481fd71 ("resource: allow MMIO
> exclusivity for device drivers"), and the commit message says this is
> only active when CONFIG_STRICT_DEVMEM is set.  I didn't check to see
> whether that's still true.
>
> That commit adds a bunch of wrappers and "__"-prefixed functions to
> pass the IORESOURCE_EXCLUSIVE flag around.  That's a fair bit of
> uglification for three drivers.
>
> > Another thing I kinda wondered, since pci maintainer is here: At least
> > in drivers/gpu I see very few drivers explicitly requestion regions
> > (this might be a historical artifact due to the shadow attach stuff
> > before we had real modesetting drivers). And pci core doesn't do that
> > either, even when a driver is bound. Is this intentional, or
> > should/could we do better? Since drivers work happily without
> > reserving regions I don't think "the drivers need to remember to do
> > this" will ever really work out well.
>
> You're right, many drivers don't call pci_request_regions().  Maybe we
> could do better, but I haven't looked into that recently.  There is a
> related note in Documentation/PCI/pci.rst that's been there for a long
> time (it refers to "pci_request_resources()", which has never existed
> AFAICT).  I'm certainly open to proposals.

It seems a bug that the kernel permits MMIO regions with side effects
to be ioremap()'ed without request_mem_region() on the resource. I
wonder how much log spam would happen if ioremap() reported whenever a
non-IORESOURE_BUSY range was passed to it? The current state of
affairs to trust *remap users to have claimed their remap target seems
too ingrained to unwind now.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 05/15] mm/frame-vector: Use FOLL_LONGTERM
  2020-11-04 18:44                             ` John Hubbard
  2020-11-04 19:02                               ` Jason Gunthorpe
@ 2020-11-05  9:25                               ` Daniel Vetter
  2020-11-05 12:49                                 ` Jason Gunthorpe
  1 sibling, 1 reply; 62+ messages in thread
From: Daniel Vetter @ 2020-11-05  9:25 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jason Gunthorpe, Christoph Hellwig, Daniel Vetter,
	J??r??me Glisse, linux-samsung-soc, Jan Kara, Pawel Osciak,
	KVM list, Mauro Carvalho Chehab, LKML, DRI Development,
	Tomasz Figa, Linux MM, Kyungmin Park, Daniel Vetter,
	Andrew Morton, Marek Szyprowski, Dan Williams, Linux ARM,
	open list:DMA BUFFER SHARING FRAMEWORK

On Wed, Nov 04, 2020 at 10:44:56AM -0800, John Hubbard wrote:
> On 11/4/20 10:17 AM, Jason Gunthorpe wrote:
> > On Wed, Nov 04, 2020 at 04:41:19PM +0000, Christoph Hellwig wrote:
> > > On Wed, Nov 04, 2020 at 04:37:58PM +0000, Christoph Hellwig wrote:
> > > > On Wed, Nov 04, 2020 at 05:26:58PM +0100, Daniel Vetter wrote:
> > > > > What we're discussing is whether gup_fast and pup_fast also obey this,
> > > > > or fall over and can give you the struct page that's backing the
> > > > > dma_mmap_* memory. Since the _fast variant doesn't check for
> > > > > vma->vm_flags, and afaict that's the only thing which closes this gap.
> > > > > And like you restate, that would be a bit a problem. So where's that
> > > > > check which Jason&me aren't spotting?
> > > > 
> > > > remap_pte_range uses pte_mkspecial to set up the PTEs, and gup_pte_range
> > > > errors out on pte_special.  Of course this only works for the
> > > > CONFIG_ARCH_HAS_PTE_SPECIAL case, for other architectures we do have
> > > > a real problem.
> > > 
> > > Except that we don't really support pte-level gup-fast without
> > > CONFIG_ARCH_HAS_PTE_SPECIAL, and in fact all architectures selecting
> > > HAVE_FAST_GUP also select ARCH_HAS_PTE_SPECIAL, so we should be fine.

Thanks for the explainer. I guess I can go back to _fast and instead
adjust the commit message to explain why that's all fine.

> > Mm, I thought it was probably the special flag..
> > 
> > Knowing that CONFIG_HAVE_FAST_GUP can't be set without
> > CONFIG_ARCH_HAS_PTE_SPECIAL is pretty insightful, can we put that in
> > the Kconfig?
> > 
> > config HAVE_FAST_GUP
> >          depends on MMU
> >          depends on ARCH_HAS_PTE_SPECIAL
> >          bool
> > 
> Well, the !CONFIG_ARCH_HAS_PTE_SPECIAL case points out in a comment that
> gup-fast is not *completely* unavailable there, so I don't think you want
> to shut it off like that:
> 
> /*
>  * If we can't determine whether or not a pte is special, then fail immediately
>  * for ptes. Note, we can still pin HugeTLB and THP as these are guaranteed not
>  * to be special.
>  *
>  * For a futex to be placed on a THP tail page, get_futex_key requires a
>  * get_user_pages_fast_only implementation that can pin pages. Thus it's still
>  * useful to have gup_huge_pmd even if we can't operate on ptes.
>  */

We support hugepage faults in gpu drivers since recently, and I'm not
seeing a pud_mkhugespecial anywhere. So not sure this works, but probably
just me missing something again.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 11/15] PCI: Obey iomem restrictions for procfs mmap
  2020-11-04 20:12           ` Dan Williams
@ 2020-11-05  9:34             ` Daniel Vetter
  0 siblings, 0 replies; 62+ messages in thread
From: Daniel Vetter @ 2020-11-05  9:34 UTC (permalink / raw)
  To: Dan Williams
  Cc: Bjorn Helgaas, Daniel Vetter, DRI Development, LKML, KVM list,
	Linux MM, Linux ARM, linux-samsung-soc,
	Linux-media@vger.kernel.org, Daniel Vetter, Jason Gunthorpe,
	Kees Cook, Andrew Morton, John Hubbard, Jérôme Glisse,
	Jan Kara, Bjorn Helgaas, Linux PCI

On Wed, Nov 04, 2020 at 12:12:15PM -0800, Dan Williams wrote:
> On Wed, Nov 4, 2020 at 8:50 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> >
> > On Wed, Nov 04, 2020 at 09:44:04AM +0100, Daniel Vetter wrote:
> > > On Tue, Nov 3, 2020 at 11:09 PM Dan Williams <dan.j.williams@intel.com> wrote:
> > > > On Tue, Nov 3, 2020 at 1:28 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > > On Fri, Oct 30, 2020 at 11:08:11AM +0100, Daniel Vetter wrote:
> > > > > > There's three ways to access PCI BARs from userspace: /dev/mem, sysfs
> > > > > > files, and the old proc interface. Two check against
> > > > > > iomem_is_exclusive, proc never did. And with CONFIG_IO_STRICT_DEVMEM,
> > > > > > this starts to matter, since we don't want random userspace having
> > > > > > access to PCI BARs while a driver is loaded and using it.
> > > > > >
> > > > > > Fix this by adding the same iomem_is_exclusive() check we already have
> > > > > > on the sysfs side in pci_mmap_resource().
> > > > > >
> > > > > > References: 90a545e98126 ("restrict /dev/mem to idle io memory ranges")
> > > > > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > > > >
> > > > > This is OK with me but it looks like IORESOURCE_EXCLUSIVE is currently
> > > > > only used in a few places:
> > > > >
> > > > >   e1000_probe() calls pci_request_selected_regions_exclusive(),
> > > > >   ne_pci_probe() calls pci_request_regions_exclusive(),
> > > > >   vmbus_allocate_mmio() calls request_mem_region_exclusive()
> > > > >
> > > > > which raises the question of whether it's worth keeping
> > > > > IORESOURCE_EXCLUSIVE at all.  I'm totally fine with removing it
> > > > > completely.
> > > >
> > > > Now that CONFIG_IO_STRICT_DEVMEM upgrades IORESOURCE_BUSY to
> > > > IORESOURCE_EXCLUSIVE semantics the latter has lost its meaning so I'd
> > > > be in favor of removing it as well.
> > >
> > > Still has some value since it enforces exclusive access even if the
> > > config isn't enabled, and iirc e1000 had some fun with userspace tools
> > > clobbering the firmware and bricking the chip.
> >
> > There's *some* value; I'm just skeptical since only three drivers use
> > it.
> >
> > IORESOURCE_EXCLUSIVE is from e8de1481fd71 ("resource: allow MMIO
> > exclusivity for device drivers"), and the commit message says this is
> > only active when CONFIG_STRICT_DEVMEM is set.  I didn't check to see
> > whether that's still true.
> >
> > That commit adds a bunch of wrappers and "__"-prefixed functions to
> > pass the IORESOURCE_EXCLUSIVE flag around.  That's a fair bit of
> > uglification for three drivers.
> >
> > > Another thing I kinda wondered, since pci maintainer is here: At least
> > > in drivers/gpu I see very few drivers explicitly requestion regions
> > > (this might be a historical artifact due to the shadow attach stuff
> > > before we had real modesetting drivers). And pci core doesn't do that
> > > either, even when a driver is bound. Is this intentional, or
> > > should/could we do better? Since drivers work happily without
> > > reserving regions I don't think "the drivers need to remember to do
> > > this" will ever really work out well.
> >
> > You're right, many drivers don't call pci_request_regions().  Maybe we
> > could do better, but I haven't looked into that recently.  There is a
> > related note in Documentation/PCI/pci.rst that's been there for a long
> > time (it refers to "pci_request_resources()", which has never existed
> > AFAICT).  I'm certainly open to proposals.
> 
> It seems a bug that the kernel permits MMIO regions with side effects
> to be ioremap()'ed without request_mem_region() on the resource. I
> wonder how much log spam would happen if ioremap() reported whenever a
> non-IORESOURE_BUSY range was passed to it? The current state of
> affairs to trust *remap users to have claimed their remap target seems
> too ingrained to unwind now.

Yeah I think that's hopeless. I think the only feasible approach is if bus
drivers claim resources by default when a driver is bound (it should nest,
so if the driver claims again, I think that should all keep working), just
using the driver name. Probably with some special casing for legacy io
(vgaarb.c should claim these I guess). Still probably tons of fallout.

Once that's rolled out to all bus drivers we could perhaps add the ioremap
check without drowning in log spam. Still a multi-year project I think :-/
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 05/15] mm/frame-vector: Use FOLL_LONGTERM
  2020-11-05  9:25                               ` Daniel Vetter
@ 2020-11-05 12:49                                 ` Jason Gunthorpe
  2020-11-06  4:08                                   ` John Hubbard
  0 siblings, 1 reply; 62+ messages in thread
From: Jason Gunthorpe @ 2020-11-05 12:49 UTC (permalink / raw)
  To: John Hubbard, Christoph Hellwig, J??r??me Glisse,
	linux-samsung-soc, Jan Kara, Pawel Osciak, KVM list,
	Mauro Carvalho Chehab, LKML, DRI Development, Tomasz Figa,
	Linux MM, Kyungmin Park, Daniel Vetter, Andrew Morton,
	Marek Szyprowski, Dan Williams, Linux ARM,
	open list:DMA BUFFER SHARING FRAMEWORK

On Thu, Nov 05, 2020 at 10:25:24AM +0100, Daniel Vetter wrote:
> > /*
> >  * If we can't determine whether or not a pte is special, then fail immediately
> >  * for ptes. Note, we can still pin HugeTLB and THP as these are guaranteed not
> >  * to be special.
> >  *
> >  * For a futex to be placed on a THP tail page, get_futex_key requires a
> >  * get_user_pages_fast_only implementation that can pin pages. Thus it's still
> >  * useful to have gup_huge_pmd even if we can't operate on ptes.
> >  */
> 
> We support hugepage faults in gpu drivers since recently, and I'm not
> seeing a pud_mkhugespecial anywhere. So not sure this works, but probably
> just me missing something again.

It means ioremap can't create an IO page PUD, it has to be broken up.

Does ioremap even create anything larger than PTEs?

Jason

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 05/15] mm/frame-vector: Use FOLL_LONGTERM
  2020-11-05 12:49                                 ` Jason Gunthorpe
@ 2020-11-06  4:08                                   ` John Hubbard
  2020-11-06 10:01                                     ` Daniel Vetter
  0 siblings, 1 reply; 62+ messages in thread
From: John Hubbard @ 2020-11-06  4:08 UTC (permalink / raw)
  To: Jason Gunthorpe, Christoph Hellwig, J??r??me Glisse,
	linux-samsung-soc, Jan Kara, Pawel Osciak, KVM list,
	Mauro Carvalho Chehab, LKML, DRI Development, Tomasz Figa,
	Linux MM, Kyungmin Park, Daniel Vetter, Andrew Morton,
	Marek Szyprowski, Dan Williams, Linux ARM,
	open list:DMA BUFFER SHARING FRAMEWORK

On 11/5/20 4:49 AM, Jason Gunthorpe wrote:
> On Thu, Nov 05, 2020 at 10:25:24AM +0100, Daniel Vetter wrote:
>>> /*
>>>   * If we can't determine whether or not a pte is special, then fail immediately
>>>   * for ptes. Note, we can still pin HugeTLB and THP as these are guaranteed not
>>>   * to be special.
>>>   *
>>>   * For a futex to be placed on a THP tail page, get_futex_key requires a
>>>   * get_user_pages_fast_only implementation that can pin pages. Thus it's still
>>>   * useful to have gup_huge_pmd even if we can't operate on ptes.
>>>   */
>>
>> We support hugepage faults in gpu drivers since recently, and I'm not
>> seeing a pud_mkhugespecial anywhere. So not sure this works, but probably
>> just me missing something again.
> 
> It means ioremap can't create an IO page PUD, it has to be broken up.
> 
> Does ioremap even create anything larger than PTEs?
> 

 From my reading, yes. See ioremap_try_huge_pmd().

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 05/15] mm/frame-vector: Use FOLL_LONGTERM
  2020-11-06  4:08                                   ` John Hubbard
@ 2020-11-06 10:01                                     ` Daniel Vetter
  2020-11-06 10:27                                       ` Daniel Vetter
  2020-11-06 12:58                                       ` Jason Gunthorpe
  0 siblings, 2 replies; 62+ messages in thread
From: Daniel Vetter @ 2020-11-06 10:01 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jason Gunthorpe, Christoph Hellwig, J??r??me Glisse,
	linux-samsung-soc, Jan Kara, Pawel Osciak, KVM list,
	Mauro Carvalho Chehab, LKML, DRI Development, Tomasz Figa,
	Linux MM, Kyungmin Park, Daniel Vetter, Andrew Morton,
	Marek Szyprowski, Dan Williams, Linux ARM,
	open list:DMA BUFFER SHARING FRAMEWORK

On Fri, Nov 6, 2020 at 5:08 AM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 11/5/20 4:49 AM, Jason Gunthorpe wrote:
> > On Thu, Nov 05, 2020 at 10:25:24AM +0100, Daniel Vetter wrote:
> >>> /*
> >>>   * If we can't determine whether or not a pte is special, then fail immediately
> >>>   * for ptes. Note, we can still pin HugeTLB and THP as these are guaranteed not
> >>>   * to be special.
> >>>   *
> >>>   * For a futex to be placed on a THP tail page, get_futex_key requires a
> >>>   * get_user_pages_fast_only implementation that can pin pages. Thus it's still
> >>>   * useful to have gup_huge_pmd even if we can't operate on ptes.
> >>>   */
> >>
> >> We support hugepage faults in gpu drivers since recently, and I'm not
> >> seeing a pud_mkhugespecial anywhere. So not sure this works, but probably
> >> just me missing something again.
> >
> > It means ioremap can't create an IO page PUD, it has to be broken up.
> >
> > Does ioremap even create anything larger than PTEs?

gpu drivers also tend to use vmf_insert_pfn* directly, so we can do
on-demand paging and move buffers around. From what I glanced for
lowest level we to the pte_mkspecial correctly (I think I convinced
myself that vm_insert_pfn does that), but for pud/pmd levels it seems
just yolo.

remap_pfn_range seems to indeed split down to pte level always.

>  From my reading, yes. See ioremap_try_huge_pmd().

The ioremap here shouldn't matter, since this is for kernel-internal
mappings. So that's all fine I think.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 05/15] mm/frame-vector: Use FOLL_LONGTERM
  2020-11-06 10:01                                     ` Daniel Vetter
@ 2020-11-06 10:27                                       ` Daniel Vetter
  2020-11-06 12:55                                         ` Jason Gunthorpe
  2020-11-06 12:58                                       ` Jason Gunthorpe
  1 sibling, 1 reply; 62+ messages in thread
From: Daniel Vetter @ 2020-11-06 10:27 UTC (permalink / raw)
  To: John Hubbard, Thomas Hellstrom
  Cc: Jason Gunthorpe, Christoph Hellwig, J??r??me Glisse,
	linux-samsung-soc, Jan Kara, Pawel Osciak, KVM list,
	Mauro Carvalho Chehab, LKML, DRI Development, Tomasz Figa,
	Linux MM, Kyungmin Park, Daniel Vetter, Andrew Morton,
	Marek Szyprowski, Dan Williams, Linux ARM,
	open list:DMA BUFFER SHARING FRAMEWORK

On Fri, Nov 6, 2020 at 11:01 AM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Fri, Nov 6, 2020 at 5:08 AM John Hubbard <jhubbard@nvidia.com> wrote:
> >
> > On 11/5/20 4:49 AM, Jason Gunthorpe wrote:
> > > On Thu, Nov 05, 2020 at 10:25:24AM +0100, Daniel Vetter wrote:
> > >>> /*
> > >>>   * If we can't determine whether or not a pte is special, then fail immediately
> > >>>   * for ptes. Note, we can still pin HugeTLB and THP as these are guaranteed not
> > >>>   * to be special.
> > >>>   *
> > >>>   * For a futex to be placed on a THP tail page, get_futex_key requires a
> > >>>   * get_user_pages_fast_only implementation that can pin pages. Thus it's still
> > >>>   * useful to have gup_huge_pmd even if we can't operate on ptes.
> > >>>   */
> > >>
> > >> We support hugepage faults in gpu drivers since recently, and I'm not
> > >> seeing a pud_mkhugespecial anywhere. So not sure this works, but probably
> > >> just me missing something again.
> > >
> > > It means ioremap can't create an IO page PUD, it has to be broken up.
> > >
> > > Does ioremap even create anything larger than PTEs?
>
> gpu drivers also tend to use vmf_insert_pfn* directly, so we can do
> on-demand paging and move buffers around. From what I glanced for
> lowest level we to the pte_mkspecial correctly (I think I convinced
> myself that vm_insert_pfn does that), but for pud/pmd levels it seems
> just yolo.

So I dug around a bit more and ttm sets PFN_DEV | PFN_MAP to get past
the various pft_t_devmap checks (see e.g. vmf_insert_pfn_pmd_prot()).
x86-64 has ARCH_HAS_PTE_DEVMAP, and gup.c seems to handle these
specially, but frankly I got totally lost in what this does.

The comment above the pfn_t_devmap check makes me wonder whether doing
this is correct or not.

Also adding Thomas Hellstrom, who implemented the huge map support in ttm.
-Daniel

> remap_pfn_range seems to indeed split down to pte level always.
>
> >  From my reading, yes. See ioremap_try_huge_pmd().
>
> The ioremap here shouldn't matter, since this is for kernel-internal
> mappings. So that's all fine I think.
> -Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 05/15] mm/frame-vector: Use FOLL_LONGTERM
  2020-11-06 10:27                                       ` Daniel Vetter
@ 2020-11-06 12:55                                         ` Jason Gunthorpe
  2020-11-09  8:44                                           ` Thomas Hellström
  0 siblings, 1 reply; 62+ messages in thread
From: Jason Gunthorpe @ 2020-11-06 12:55 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: John Hubbard, Thomas Hellstrom, Christoph Hellwig,
	J??r??me Glisse, linux-samsung-soc, Jan Kara, Pawel Osciak,
	KVM list, Mauro Carvalho Chehab, LKML, DRI Development,
	Tomasz Figa, Linux MM, Kyungmin Park, Daniel Vetter,
	Andrew Morton, Marek Szyprowski, Dan Williams, Linux ARM,
	open list:DMA BUFFER SHARING FRAMEWORK

On Fri, Nov 06, 2020 at 11:27:59AM +0100, Daniel Vetter wrote:
> On Fri, Nov 6, 2020 at 11:01 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> >
> > On Fri, Nov 6, 2020 at 5:08 AM John Hubbard <jhubbard@nvidia.com> wrote:
> > >
> > > On 11/5/20 4:49 AM, Jason Gunthorpe wrote:
> > > > On Thu, Nov 05, 2020 at 10:25:24AM +0100, Daniel Vetter wrote:
> > > >>> /*
> > > >>>   * If we can't determine whether or not a pte is special, then fail immediately
> > > >>>   * for ptes. Note, we can still pin HugeTLB and THP as these are guaranteed not
> > > >>>   * to be special.
> > > >>>   *
> > > >>>   * For a futex to be placed on a THP tail page, get_futex_key requires a
> > > >>>   * get_user_pages_fast_only implementation that can pin pages. Thus it's still
> > > >>>   * useful to have gup_huge_pmd even if we can't operate on ptes.
> > > >>>   */
> > > >>
> > > >> We support hugepage faults in gpu drivers since recently, and I'm not
> > > >> seeing a pud_mkhugespecial anywhere. So not sure this works, but probably
> > > >> just me missing something again.
> > > >
> > > > It means ioremap can't create an IO page PUD, it has to be broken up.
> > > >
> > > > Does ioremap even create anything larger than PTEs?
> >
> > gpu drivers also tend to use vmf_insert_pfn* directly, so we can do
> > on-demand paging and move buffers around. From what I glanced for
> > lowest level we to the pte_mkspecial correctly (I think I convinced
> > myself that vm_insert_pfn does that), but for pud/pmd levels it seems
> > just yolo.
> 
> So I dug around a bit more and ttm sets PFN_DEV | PFN_MAP to get past
> the various pft_t_devmap checks (see e.g. vmf_insert_pfn_pmd_prot()).
> x86-64 has ARCH_HAS_PTE_DEVMAP, and gup.c seems to handle these
> specially, but frankly I got totally lost in what this does.

The fact vmf_insert_pfn_pmd_prot() has all those BUG_ON's to prevent
putting VM_PFNMAP pages into the page tables seems like a big red
flag.

The comment seems to confirm what we are talking about here:

	/*
	 * If we had pmd_special, we could avoid all these restrictions,
	 * but we need to be consistent with PTEs and architectures that
	 * can't support a 'special' bit.
	 */

ie without the ability to mark special we can't block fast gup and
anyone who does O_DIRECT on these ranges will crash the kernel when it
tries to convert a IO page into a struct page.

Should be easy enough to directly test?

Putting non-struct page PTEs into a VMA without setting VM_PFNMAP just
seems horribly wrong to me.

Jason

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 05/15] mm/frame-vector: Use FOLL_LONGTERM
  2020-11-06 10:01                                     ` Daniel Vetter
  2020-11-06 10:27                                       ` Daniel Vetter
@ 2020-11-06 12:58                                       ` Jason Gunthorpe
  1 sibling, 0 replies; 62+ messages in thread
From: Jason Gunthorpe @ 2020-11-06 12:58 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: John Hubbard, Christoph Hellwig, J??r??me Glisse,
	linux-samsung-soc, Jan Kara, Pawel Osciak, KVM list,
	Mauro Carvalho Chehab, LKML, DRI Development, Tomasz Figa,
	Linux MM, Kyungmin Park, Daniel Vetter, Andrew Morton,
	Marek Szyprowski, Dan Williams, Linux ARM,
	open list:DMA BUFFER SHARING FRAMEWORK

On Fri, Nov 06, 2020 at 11:01:57AM +0100, Daniel Vetter wrote:

> gpu drivers also tend to use vmf_insert_pfn* directly, so we can do
> on-demand paging and move buffers around. From what I glanced for
> lowest level we to the pte_mkspecial correctly (I think I convinced
> myself that vm_insert_pfn does that), but for pud/pmd levels it seems
> just yolo.
> 
> remap_pfn_range seems to indeed split down to pte level always.

Thats what it looked like to me too.
 
> >  From my reading, yes. See ioremap_try_huge_pmd().
> 
> The ioremap here shouldn't matter, since this is for kernel-internal
> mappings. So that's all fine I think.

Right, sorry to be unclear, we are talking about io_remap_pfn_range()
which is for userspace mappings in VMAs

Jason

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 05/15] mm/frame-vector: Use FOLL_LONGTERM
  2020-11-06 12:55                                         ` Jason Gunthorpe
@ 2020-11-09  8:44                                           ` Thomas Hellström
  2020-11-09 20:19                                             ` Jason Gunthorpe
  0 siblings, 1 reply; 62+ messages in thread
From: Thomas Hellström @ 2020-11-09  8:44 UTC (permalink / raw)
  To: Jason Gunthorpe, Daniel Vetter
  Cc: John Hubbard, Christoph Hellwig, J??r??me Glisse,
	linux-samsung-soc, Jan Kara, Pawel Osciak, KVM list,
	Mauro Carvalho Chehab, LKML, DRI Development, Tomasz Figa,
	Linux MM, Kyungmin Park, Daniel Vetter, Andrew Morton,
	Marek Szyprowski, Dan Williams, Linux ARM,
	open list:DMA BUFFER SHARING FRAMEWORK

On Fri, 2020-11-06 at 08:55 -0400, Jason Gunthorpe wrote:
> On Fri, Nov 06, 2020 at 11:27:59AM +0100, Daniel Vetter wrote:
> > On Fri, Nov 6, 2020 at 11:01 AM Daniel Vetter <daniel@ffwll.ch>
> > wrote:
> > > On Fri, Nov 6, 2020 at 5:08 AM John Hubbard <jhubbard@nvidia.com>
> > > wrote:
> > > > On 11/5/20 4:49 AM, Jason Gunthorpe wrote:
> > > > > On Thu, Nov 05, 2020 at 10:25:24AM +0100, Daniel Vetter
> > > > > wrote:
> > > > > > > /*
> > > > > > >   * If we can't determine whether or not a pte is
> > > > > > > special, then fail immediately
> > > > > > >   * for ptes. Note, we can still pin HugeTLB and THP as
> > > > > > > these are guaranteed not
> > > > > > >   * to be special.
> > > > > > >   *
> > > > > > >   * For a futex to be placed on a THP tail page,
> > > > > > > get_futex_key requires a
> > > > > > >   * get_user_pages_fast_only implementation that can pin
> > > > > > > pages. Thus it's still
> > > > > > >   * useful to have gup_huge_pmd even if we can't operate
> > > > > > > on ptes.
> > > > > > >   */
> > > > > > 
> > > > > > We support hugepage faults in gpu drivers since recently,
> > > > > > and I'm not
> > > > > > seeing a pud_mkhugespecial anywhere. So not sure this
> > > > > > works, but probably
> > > > > > just me missing something again.
> > > > > 
> > > > > It means ioremap can't create an IO page PUD, it has to be
> > > > > broken up.
> > > > > 
> > > > > Does ioremap even create anything larger than PTEs?
> > > 
> > > gpu drivers also tend to use vmf_insert_pfn* directly, so we can
> > > do
> > > on-demand paging and move buffers around. From what I glanced for
> > > lowest level we to the pte_mkspecial correctly (I think I
> > > convinced
> > > myself that vm_insert_pfn does that), but for pud/pmd levels it
> > > seems
> > > just yolo.
> > 
> > So I dug around a bit more and ttm sets PFN_DEV | PFN_MAP to get
> > past
> > the various pft_t_devmap checks (see e.g.
> > vmf_insert_pfn_pmd_prot()).
> > x86-64 has ARCH_HAS_PTE_DEVMAP, and gup.c seems to handle these
> > specially, but frankly I got totally lost in what this does.
> 
> The fact vmf_insert_pfn_pmd_prot() has all those BUG_ON's to prevent
> putting VM_PFNMAP pages into the page tables seems like a big red
> flag.
> 
> The comment seems to confirm what we are talking about here:
> 
> 	/*
> 	 * If we had pmd_special, we could avoid all these
> restrictions,
> 	 * but we need to be consistent with PTEs and architectures
> that
> 	 * can't support a 'special' bit.
> 	 */
> 
> ie without the ability to mark special we can't block fast gup and
> anyone who does O_DIRECT on these ranges will crash the kernel when
> it
> tries to convert a IO page into a struct page.
> 
> Should be easy enough to directly test?
> 
> Putting non-struct page PTEs into a VMA without setting VM_PFNMAP
> just
> seems horribly wrong to me.

Although core mm special huge-page support is currently quite limited,
some time ago, I extended the pre-existing vma_is_dax() to
vma_is_special_huge():

/**
 * vma_is_special_huge - Are transhuge page-table entries considered
special?
 * @vma: Pointer to the struct vm_area_struct to consider
 *
 * Whether transhuge page-table entries are considered "special"
following
 * the definition in vm_normal_page().
 *
 * Return: true if transhuge page-table entries should be considered
special,
 * false otherwise.
 */
static inline bool vma_is_special_huge(const struct vm_area_struct
*vma)
{
	return vma_is_dax(vma) || (vma->vm_file &&
				   (vma->vm_flags & (VM_PFNMAP |
VM_MIXEDMAP)));
}

meaning that currently all transhuge page-table-entries in a PFNMAP or
MIXEDMAP vma are considered "special". The number of calls to this
function (mainly in the page-splitting code) is quite limited so
replacing it with a more elaborate per-page-table-entry scheme would, I
guess, definitely be possible. Although all functions using it would
need to require a fallback path for architectures not supporting it.

/Thomas



> 
> Jason


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 05/15] mm/frame-vector: Use FOLL_LONGTERM
  2020-11-09  8:44                                           ` Thomas Hellström
@ 2020-11-09 20:19                                             ` Jason Gunthorpe
  0 siblings, 0 replies; 62+ messages in thread
From: Jason Gunthorpe @ 2020-11-09 20:19 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: Daniel Vetter, John Hubbard, Christoph Hellwig, J??r??me Glisse,
	linux-samsung-soc, Jan Kara, Pawel Osciak, KVM list,
	Mauro Carvalho Chehab, LKML, DRI Development, Tomasz Figa,
	Linux MM, Kyungmin Park, Daniel Vetter, Andrew Morton,
	Marek Szyprowski, Dan Williams, Linux ARM,
	open list:DMA BUFFER SHARING FRAMEWORK

On Mon, Nov 09, 2020 at 09:44:02AM +0100, Thomas Hellström wrote:
> static inline bool vma_is_special_huge(const struct vm_area_struct
> *vma)
> {
> 	return vma_is_dax(vma) || (vma->vm_file &&
> 				   (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)));
> }

That is testing a VMA, not a PTE, which doesn't help protect
get_user_pages_fast.

Sounds like is has opened a big user crashy problem in DRM and the
huge page stuff needs to be revereted..

Dan?

Jason

^ permalink raw reply	[flat|nested] 62+ messages in thread

end of thread, other threads:[~2020-11-09 20:19 UTC | newest]

Thread overview: 62+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-30 10:08 [PATCH v5 00/15] follow_pfn and other iomap races Daniel Vetter
2020-10-30 10:08 ` [PATCH v5 01/15] drm/exynos: Stop using frame_vector helpers Daniel Vetter
2020-10-30 10:08 ` [PATCH v5 02/15] drm/exynos: Use FOLL_LONGTERM for g2d cmdlists Daniel Vetter
2020-10-30 10:08 ` [PATCH v5 03/15] misc/habana: Stop using frame_vector helpers Daniel Vetter
2020-10-30 10:08 ` [PATCH v5 04/15] misc/habana: Use FOLL_LONGTERM for userptr Daniel Vetter
2020-10-30 10:08 ` [PATCH v5 05/15] mm/frame-vector: Use FOLL_LONGTERM Daniel Vetter
2020-10-30 14:11   ` Tomasz Figa
2020-10-30 14:37     ` Daniel Vetter
2020-11-02 18:19       ` Tomasz Figa
2020-10-31  2:55   ` John Hubbard
2020-10-31 14:45     ` Daniel Vetter
2020-11-01  5:22       ` John Hubbard
2020-11-01 10:30         ` Daniel Vetter
2020-11-01 21:13           ` John Hubbard
2020-11-01 22:50             ` Daniel Vetter
2020-11-04 14:00               ` Jason Gunthorpe
2020-11-04 15:54                 ` Daniel Vetter
2020-11-04 16:21                   ` Christoph Hellwig
2020-11-04 16:26                     ` Daniel Vetter
2020-11-04 16:37                       ` Christoph Hellwig
2020-11-04 16:41                         ` Christoph Hellwig
2020-11-04 18:17                           ` Jason Gunthorpe
2020-11-04 18:44                             ` John Hubbard
2020-11-04 19:02                               ` Jason Gunthorpe
2020-11-04 19:11                                 ` Christoph Hellwig
2020-11-05  9:25                               ` Daniel Vetter
2020-11-05 12:49                                 ` Jason Gunthorpe
2020-11-06  4:08                                   ` John Hubbard
2020-11-06 10:01                                     ` Daniel Vetter
2020-11-06 10:27                                       ` Daniel Vetter
2020-11-06 12:55                                         ` Jason Gunthorpe
2020-11-09  8:44                                           ` Thomas Hellström
2020-11-09 20:19                                             ` Jason Gunthorpe
2020-11-06 12:58                                       ` Jason Gunthorpe
2020-10-30 10:08 ` [PATCH v5 06/15] media: videobuf2: Move frame_vector into media subsystem Daniel Vetter
2020-10-30 10:08 ` [PATCH v5 07/15] mm: Close race in generic_access_phys Daniel Vetter
2020-10-30 10:08 ` [PATCH v5 08/15] mm: Add unsafe_follow_pfn Daniel Vetter
2020-11-02  7:29   ` Christoph Hellwig
2020-11-02 12:56     ` Daniel Vetter
2020-11-02 13:01       ` Jason Gunthorpe
2020-11-02 13:23         ` Daniel Vetter
2020-11-02 15:52           ` Jason Gunthorpe
2020-11-02 16:41             ` Christoph Hellwig
2020-11-02 16:42             ` Daniel Vetter
2020-11-02 17:16               ` Jason Gunthorpe
2020-10-30 10:08 ` [PATCH v5 09/15] media/videbuf1|2: Mark follow_pfn usage as unsafe Daniel Vetter
2020-10-30 10:08 ` [PATCH v5 10/15] vfio/type1: Mark follow_pfn " Daniel Vetter
2020-10-30 10:08 ` [PATCH v5 11/15] PCI: Obey iomem restrictions for procfs mmap Daniel Vetter
2020-11-03 21:28   ` Bjorn Helgaas
2020-11-03 22:09     ` Dan Williams
2020-11-04  8:44       ` Daniel Vetter
2020-11-04 16:50         ` Bjorn Helgaas
2020-11-04 20:12           ` Dan Williams
2020-11-05  9:34             ` Daniel Vetter
2020-10-30 10:08 ` [PATCH v5 12/15] /dev/mem: Only set filp->f_mapping Daniel Vetter
2020-10-30 10:08 ` [PATCH v5 13/15] resource: Move devmem revoke code to resource framework Daniel Vetter
2020-10-31  6:36   ` John Hubbard
2020-10-31 14:40     ` Daniel Vetter
2020-10-30 10:08 ` [PATCH v5 14/15] sysfs: Support zapping of binary attr mmaps Daniel Vetter
2020-10-30 10:08 ` [PATCH v5 15/15] PCI: Revoke mappings like devmem Daniel Vetter
2020-10-30 19:22   ` Dan Williams
2020-11-03 21:30   ` Bjorn Helgaas

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).