All of lore.kernel.org
 help / color / mirror / Atom feed
* [Virtio-fs] [PATCH] Virtiofs: fix null pointer deference in directIO
@ 2019-07-11 22:35 Liu Bo
  2019-07-18  0:21 ` Liu Bo
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Liu Bo @ 2019-07-11 22:35 UTC (permalink / raw)
  To: virtio-fs

Virtiofs has used memremap to create page structs for its managed memory
and set pagemap's type to MEMORY_DEVICE_FS_DAX.  Thus, the page from
virtiofs's managed memory is a ZONE_DEVICE page, if %devmap_managed_key
was enabled by pmem drivers, put_page() will run into
__put_devmap_managed_page().  However, as page_free ops is not defined, it
ends up with a NULL pointer dereference.

As virtiofs presents its managed memory range as a dax device, this patch
makes virtiofs mimic how nvdimm/pmem.c setups pagemap fsdax.

=================

[ 1223.732964] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[ 1223.733043] PGD 800000026bd91067 P4D 800000026bd91067 PUD 27631e067 PMD 0
[ 1223.733085] Oops: 0010 [#1] SMP PTI
[ 1223.733114] CPU: 1 PID: 16505 Comm: build Not tainted 4.19.48-dff9509 #1
[ 1223.733157] RIP: 0010:          (null)
[ 1223.733179] Code: Bad RIP value.
..
[ 1223.733671] Call Trace:
[ 1223.733704]  fuse_release_user_pages+0xad/0xb0
[ 1223.733751]  fuse_direct_io+0x3ec/0x6a0
[ 1223.733823]  fuse_dax_direct_write+0x3c/0x70
[ 1223.733863]  fuse_file_write_iter+0x2ea/0x4c0
[ 1223.733978]  __vfs_write+0xde/0x160
[ 1223.734009]  vfs_write+0xab/0x190
[ 1223.734061]  ksys_write+0x45/0xc0
[ 1223.734094]  do_syscall_64+0x6b/0x330
[ 1223.734214]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1223.734261] RIP: 0033:0x15536bcddfd0

The reproducer could be as simply as,

addr = mmap(..., fileA, ...);
fdB = open(fileB, O_DIRECT);
write(fdB, addr);

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
---
 fs/fuse/virtio_fs.c | 26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 05ea28c..9b3f2e9 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -538,6 +538,28 @@ static void virtio_fs_cleanup_dax(void *data)
 	put_dax(fs->dax_dev);
 }
 
+static void virtio_fs_release_pgmap_ops(void *_pgmap)
+{
+	dev_pagemap_put_ops();
+}
+
+static void virtio_fs_fsdax_pagefree(struct page *page, void *data)
+{
+	wake_up_var(&page->_refcount);
+}
+
+static int virtio_fs_setup_pagemap_fsdax(struct device *dev,
+					 struct dev_pagemap *pgmap)
+{
+	dev_pagemap_get_ops();
+	if (devm_add_action_or_reset(dev, virtio_fs_release_pgmap_ops, pgmap))
+		return -ENOMEM;
+	pgmap->type = MEMORY_DEVICE_FS_DAX;
+	pgmap->page_free = virtio_fs_fsdax_pagefree;
+
+	return 0;
+}
+
 static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
 {
 	struct virtio_shm_region cache_reg, journal_reg, vertab_reg;
@@ -600,7 +622,9 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
 	pgmap->altmap_valid = false;
 	pgmap->ref = &mi->ref;
 	pgmap->kill = virtio_fs_percpu_kill;
-	pgmap->type = MEMORY_DEVICE_FS_DAX;
+
+	if (virtio_fs_setup_pagemap_fsdax(&vdev->dev, pgmap))
+		return -ENOMEM;
 
 	/* Ideally we would directly use the PCI BAR resource but
 	 * devm_memremap_pages() wants its own copy in pgmap.  So
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [Virtio-fs] [PATCH] Virtiofs: fix null pointer deference in directIO
  2019-07-11 22:35 [Virtio-fs] [PATCH] Virtiofs: fix null pointer deference in directIO Liu Bo
@ 2019-07-18  0:21 ` Liu Bo
  2019-07-18 18:18 ` Vivek Goyal
  2019-07-30 15:30 ` Vivek Goyal
  2 siblings, 0 replies; 12+ messages in thread
From: Liu Bo @ 2019-07-18  0:21 UTC (permalink / raw)
  To: virtio-fs

Ping?

thanks,
-liubo

On Fri, Jul 12, 2019 at 06:35:42AM +0800, Liu Bo wrote:
> Virtiofs has used memremap to create page structs for its managed memory
> and set pagemap's type to MEMORY_DEVICE_FS_DAX.  Thus, the page from
> virtiofs's managed memory is a ZONE_DEVICE page, if %devmap_managed_key
> was enabled by pmem drivers, put_page() will run into
> __put_devmap_managed_page().  However, as page_free ops is not defined, it
> ends up with a NULL pointer dereference.
> 
> As virtiofs presents its managed memory range as a dax device, this patch
> makes virtiofs mimic how nvdimm/pmem.c setups pagemap fsdax.
> 
> =================
> 
> [ 1223.732964] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
> [ 1223.733043] PGD 800000026bd91067 P4D 800000026bd91067 PUD 27631e067 PMD 0
> [ 1223.733085] Oops: 0010 [#1] SMP PTI
> [ 1223.733114] CPU: 1 PID: 16505 Comm: build Not tainted 4.19.48-dff9509 #1
> [ 1223.733157] RIP: 0010:          (null)
> [ 1223.733179] Code: Bad RIP value.
> ..
> [ 1223.733671] Call Trace:
> [ 1223.733704]  fuse_release_user_pages+0xad/0xb0
> [ 1223.733751]  fuse_direct_io+0x3ec/0x6a0
> [ 1223.733823]  fuse_dax_direct_write+0x3c/0x70
> [ 1223.733863]  fuse_file_write_iter+0x2ea/0x4c0
> [ 1223.733978]  __vfs_write+0xde/0x160
> [ 1223.734009]  vfs_write+0xab/0x190
> [ 1223.734061]  ksys_write+0x45/0xc0
> [ 1223.734094]  do_syscall_64+0x6b/0x330
> [ 1223.734214]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 1223.734261] RIP: 0033:0x15536bcddfd0
> 
> The reproducer could be as simply as,
> 
> addr = mmap(..., fileA, ...);
> fdB = open(fileB, O_DIRECT);
> write(fdB, addr);
> 
> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
> ---
>  fs/fuse/virtio_fs.c | 26 +++++++++++++++++++++++++-
>  1 file changed, 25 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> index 05ea28c..9b3f2e9 100644
> --- a/fs/fuse/virtio_fs.c
> +++ b/fs/fuse/virtio_fs.c
> @@ -538,6 +538,28 @@ static void virtio_fs_cleanup_dax(void *data)
>  	put_dax(fs->dax_dev);
>  }
>  
> +static void virtio_fs_release_pgmap_ops(void *_pgmap)
> +{
> +	dev_pagemap_put_ops();
> +}
> +
> +static void virtio_fs_fsdax_pagefree(struct page *page, void *data)
> +{
> +	wake_up_var(&page->_refcount);
> +}
> +
> +static int virtio_fs_setup_pagemap_fsdax(struct device *dev,
> +					 struct dev_pagemap *pgmap)
> +{
> +	dev_pagemap_get_ops();
> +	if (devm_add_action_or_reset(dev, virtio_fs_release_pgmap_ops, pgmap))
> +		return -ENOMEM;
> +	pgmap->type = MEMORY_DEVICE_FS_DAX;
> +	pgmap->page_free = virtio_fs_fsdax_pagefree;
> +
> +	return 0;
> +}
> +
>  static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
>  {
>  	struct virtio_shm_region cache_reg, journal_reg, vertab_reg;
> @@ -600,7 +622,9 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
>  	pgmap->altmap_valid = false;
>  	pgmap->ref = &mi->ref;
>  	pgmap->kill = virtio_fs_percpu_kill;
> -	pgmap->type = MEMORY_DEVICE_FS_DAX;
> +
> +	if (virtio_fs_setup_pagemap_fsdax(&vdev->dev, pgmap))
> +		return -ENOMEM;
>  
>  	/* Ideally we would directly use the PCI BAR resource but
>  	 * devm_memremap_pages() wants its own copy in pgmap.  So
> -- 
> 1.8.3.1
> 
> _______________________________________________
> Virtio-fs mailing list
> Virtio-fs@redhat.com
> https://www.redhat.com/mailman/listinfo/virtio-fs


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Virtio-fs] [PATCH] Virtiofs: fix null pointer deference in directIO
  2019-07-11 22:35 [Virtio-fs] [PATCH] Virtiofs: fix null pointer deference in directIO Liu Bo
  2019-07-18  0:21 ` Liu Bo
@ 2019-07-18 18:18 ` Vivek Goyal
  2019-07-18 18:58   ` Liu Bo
  2019-07-30 15:30 ` Vivek Goyal
  2 siblings, 1 reply; 12+ messages in thread
From: Vivek Goyal @ 2019-07-18 18:18 UTC (permalink / raw)
  To: Liu Bo; +Cc: virtio-fs

On Fri, Jul 12, 2019 at 06:35:42AM +0800, Liu Bo wrote:
> Virtiofs has used memremap to create page structs for its managed memory
> and set pagemap's type to MEMORY_DEVICE_FS_DAX.  Thus, the page from
> virtiofs's managed memory is a ZONE_DEVICE page, if %devmap_managed_key
> was enabled by pmem drivers, put_page() will run into
> __put_devmap_managed_page().  However, as page_free ops is not defined, it
> ends up with a NULL pointer dereference.
> 
> As virtiofs presents its managed memory range as a dax device, this patch
> makes virtiofs mimic how nvdimm/pmem.c setups pagemap fsdax.
> 
> =================
> 
> [ 1223.732964] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
> [ 1223.733043] PGD 800000026bd91067 P4D 800000026bd91067 PUD 27631e067 PMD 0
> [ 1223.733085] Oops: 0010 [#1] SMP PTI
> [ 1223.733114] CPU: 1 PID: 16505 Comm: build Not tainted 4.19.48-dff9509 #1
> [ 1223.733157] RIP: 0010:          (null)
> [ 1223.733179] Code: Bad RIP value.
> ..
> [ 1223.733671] Call Trace:
> [ 1223.733704]  fuse_release_user_pages+0xad/0xb0
> [ 1223.733751]  fuse_direct_io+0x3ec/0x6a0
> [ 1223.733823]  fuse_dax_direct_write+0x3c/0x70
> [ 1223.733863]  fuse_file_write_iter+0x2ea/0x4c0
> [ 1223.733978]  __vfs_write+0xde/0x160
> [ 1223.734009]  vfs_write+0xab/0x190
> [ 1223.734061]  ksys_write+0x45/0xc0
> [ 1223.734094]  do_syscall_64+0x6b/0x330
> [ 1223.734214]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 1223.734261] RIP: 0033:0x15536bcddfd0
> 
> The reproducer could be as simply as,
> 
> addr = mmap(..., fileA, ...);
> fdB = open(fileB, O_DIRECT);
> write(fdB, addr);
> 
> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>

I am looking at this now and trying to understand it better. One problem
I see is with page pinning. If a page is pinned, how do we make sure it
is not evicted by reclaim logic. IIUC, right now memory range relciam
logic is not aware about page pinning.

For instance, in above example, write(fdB, addr) will pin pages in fileA.
While write is going on we need to make sure dax memory ranges covering
those pages are not reclaimed.

/me needs to spend more time understanding this.

Vivek

> ---
>  fs/fuse/virtio_fs.c | 26 +++++++++++++++++++++++++-
>  1 file changed, 25 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> index 05ea28c..9b3f2e9 100644
> --- a/fs/fuse/virtio_fs.c
> +++ b/fs/fuse/virtio_fs.c
> @@ -538,6 +538,28 @@ static void virtio_fs_cleanup_dax(void *data)
>  	put_dax(fs->dax_dev);
>  }
>  
> +static void virtio_fs_release_pgmap_ops(void *_pgmap)
> +{
> +	dev_pagemap_put_ops();
> +}
> +
> +static void virtio_fs_fsdax_pagefree(struct page *page, void *data)
> +{
> +	wake_up_var(&page->_refcount);
> +}
> +
> +static int virtio_fs_setup_pagemap_fsdax(struct device *dev,
> +					 struct dev_pagemap *pgmap)
> +{
> +	dev_pagemap_get_ops();
> +	if (devm_add_action_or_reset(dev, virtio_fs_release_pgmap_ops, pgmap))
> +		return -ENOMEM;
> +	pgmap->type = MEMORY_DEVICE_FS_DAX;
> +	pgmap->page_free = virtio_fs_fsdax_pagefree;
> +
> +	return 0;
> +}
> +
>  static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
>  {
>  	struct virtio_shm_region cache_reg, journal_reg, vertab_reg;
> @@ -600,7 +622,9 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
>  	pgmap->altmap_valid = false;
>  	pgmap->ref = &mi->ref;
>  	pgmap->kill = virtio_fs_percpu_kill;
> -	pgmap->type = MEMORY_DEVICE_FS_DAX;
> +
> +	if (virtio_fs_setup_pagemap_fsdax(&vdev->dev, pgmap))
> +		return -ENOMEM;
>  
>  	/* Ideally we would directly use the PCI BAR resource but
>  	 * devm_memremap_pages() wants its own copy in pgmap.  So
> -- 
> 1.8.3.1
> 
> _______________________________________________
> Virtio-fs mailing list
> Virtio-fs@redhat.com
> https://www.redhat.com/mailman/listinfo/virtio-fs


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Virtio-fs] [PATCH] Virtiofs: fix null pointer deference in directIO
  2019-07-18 18:18 ` Vivek Goyal
@ 2019-07-18 18:58   ` Liu Bo
  2019-07-18 20:51     ` Vivek Goyal
  0 siblings, 1 reply; 12+ messages in thread
From: Liu Bo @ 2019-07-18 18:58 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs

On Thu, Jul 18, 2019 at 02:18:39PM -0400, Vivek Goyal wrote:
> On Fri, Jul 12, 2019 at 06:35:42AM +0800, Liu Bo wrote:
> > Virtiofs has used memremap to create page structs for its managed memory
> > and set pagemap's type to MEMORY_DEVICE_FS_DAX.  Thus, the page from
> > virtiofs's managed memory is a ZONE_DEVICE page, if %devmap_managed_key
> > was enabled by pmem drivers, put_page() will run into
> > __put_devmap_managed_page().  However, as page_free ops is not defined, it
> > ends up with a NULL pointer dereference.
> > 
> > As virtiofs presents its managed memory range as a dax device, this patch
> > makes virtiofs mimic how nvdimm/pmem.c setups pagemap fsdax.
> > 
> > =================
> > 
> > [ 1223.732964] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
> > [ 1223.733043] PGD 800000026bd91067 P4D 800000026bd91067 PUD 27631e067 PMD 0
> > [ 1223.733085] Oops: 0010 [#1] SMP PTI
> > [ 1223.733114] CPU: 1 PID: 16505 Comm: build Not tainted 4.19.48-dff9509 #1
> > [ 1223.733157] RIP: 0010:          (null)
> > [ 1223.733179] Code: Bad RIP value.
> > ..
> > [ 1223.733671] Call Trace:
> > [ 1223.733704]  fuse_release_user_pages+0xad/0xb0
> > [ 1223.733751]  fuse_direct_io+0x3ec/0x6a0
> > [ 1223.733823]  fuse_dax_direct_write+0x3c/0x70
> > [ 1223.733863]  fuse_file_write_iter+0x2ea/0x4c0
> > [ 1223.733978]  __vfs_write+0xde/0x160
> > [ 1223.734009]  vfs_write+0xab/0x190
> > [ 1223.734061]  ksys_write+0x45/0xc0
> > [ 1223.734094]  do_syscall_64+0x6b/0x330
> > [ 1223.734214]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > [ 1223.734261] RIP: 0033:0x15536bcddfd0
> > 
> > The reproducer could be as simply as,
> > 
> > addr = mmap(..., fileA, ...);
> > fdB = open(fileB, O_DIRECT);
> > write(fdB, addr);
> > 
> > Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
> 
> I am looking at this now and trying to understand it better. One problem
> I see is with page pinning. If a page is pinned, how do we make sure it
> is not evicted by reclaim logic. IIUC, right now memory range relciam
> logic is not aware about page pinning.
> 
> For instance, in above example, write(fdB, addr) will pin pages in fileA.
> While write is going on we need to make sure dax memory ranges covering
> those pages are not reclaimed.

Good question, it is indeed a problem, i_mmap_sem is not held by page
pinning so that dax range reclaim is free to go, nor page pinning in
within iomap_begin() and iomap_end() pair.

While that is a problem, this patch is fixing a different one because
of the lack of ->page_free definition.

thanks,
-liubo
> 
> /me needs to spend more time understanding this.
> 
> Vivek
> 
> > ---
> >  fs/fuse/virtio_fs.c | 26 +++++++++++++++++++++++++-
> >  1 file changed, 25 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> > index 05ea28c..9b3f2e9 100644
> > --- a/fs/fuse/virtio_fs.c
> > +++ b/fs/fuse/virtio_fs.c
> > @@ -538,6 +538,28 @@ static void virtio_fs_cleanup_dax(void *data)
> >  	put_dax(fs->dax_dev);
> >  }
> >  
> > +static void virtio_fs_release_pgmap_ops(void *_pgmap)
> > +{
> > +	dev_pagemap_put_ops();
> > +}
> > +
> > +static void virtio_fs_fsdax_pagefree(struct page *page, void *data)
> > +{
> > +	wake_up_var(&page->_refcount);
> > +}
> > +
> > +static int virtio_fs_setup_pagemap_fsdax(struct device *dev,
> > +					 struct dev_pagemap *pgmap)
> > +{
> > +	dev_pagemap_get_ops();
> > +	if (devm_add_action_or_reset(dev, virtio_fs_release_pgmap_ops, pgmap))
> > +		return -ENOMEM;
> > +	pgmap->type = MEMORY_DEVICE_FS_DAX;
> > +	pgmap->page_free = virtio_fs_fsdax_pagefree;
> > +
> > +	return 0;
> > +}
> > +
> >  static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
> >  {
> >  	struct virtio_shm_region cache_reg, journal_reg, vertab_reg;
> > @@ -600,7 +622,9 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
> >  	pgmap->altmap_valid = false;
> >  	pgmap->ref = &mi->ref;
> >  	pgmap->kill = virtio_fs_percpu_kill;
> > -	pgmap->type = MEMORY_DEVICE_FS_DAX;
> > +
> > +	if (virtio_fs_setup_pagemap_fsdax(&vdev->dev, pgmap))
> > +		return -ENOMEM;
> >  
> >  	/* Ideally we would directly use the PCI BAR resource but
> >  	 * devm_memremap_pages() wants its own copy in pgmap.  So
> > -- 
> > 1.8.3.1
> > 
> > _______________________________________________
> > Virtio-fs mailing list
> > Virtio-fs@redhat.com
> > https://www.redhat.com/mailman/listinfo/virtio-fs


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Virtio-fs] [PATCH] Virtiofs: fix null pointer deference in directIO
  2019-07-18 18:58   ` Liu Bo
@ 2019-07-18 20:51     ` Vivek Goyal
  2019-07-18 23:39       ` Liu Bo
  0 siblings, 1 reply; 12+ messages in thread
From: Vivek Goyal @ 2019-07-18 20:51 UTC (permalink / raw)
  To: Liu Bo; +Cc: virtio-fs

On Thu, Jul 18, 2019 at 11:58:01AM -0700, Liu Bo wrote:
> On Thu, Jul 18, 2019 at 02:18:39PM -0400, Vivek Goyal wrote:
> > On Fri, Jul 12, 2019 at 06:35:42AM +0800, Liu Bo wrote:
> > > Virtiofs has used memremap to create page structs for its managed memory
> > > and set pagemap's type to MEMORY_DEVICE_FS_DAX.  Thus, the page from
> > > virtiofs's managed memory is a ZONE_DEVICE page, if %devmap_managed_key
> > > was enabled by pmem drivers, put_page() will run into
> > > __put_devmap_managed_page().  However, as page_free ops is not defined, it
> > > ends up with a NULL pointer dereference.
> > > 
> > > As virtiofs presents its managed memory range as a dax device, this patch
> > > makes virtiofs mimic how nvdimm/pmem.c setups pagemap fsdax.
> > > 
> > > =================
> > > 
> > > [ 1223.732964] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
> > > [ 1223.733043] PGD 800000026bd91067 P4D 800000026bd91067 PUD 27631e067 PMD 0
> > > [ 1223.733085] Oops: 0010 [#1] SMP PTI
> > > [ 1223.733114] CPU: 1 PID: 16505 Comm: build Not tainted 4.19.48-dff9509 #1
> > > [ 1223.733157] RIP: 0010:          (null)
> > > [ 1223.733179] Code: Bad RIP value.
> > > ..
> > > [ 1223.733671] Call Trace:
> > > [ 1223.733704]  fuse_release_user_pages+0xad/0xb0
> > > [ 1223.733751]  fuse_direct_io+0x3ec/0x6a0
> > > [ 1223.733823]  fuse_dax_direct_write+0x3c/0x70
> > > [ 1223.733863]  fuse_file_write_iter+0x2ea/0x4c0
> > > [ 1223.733978]  __vfs_write+0xde/0x160
> > > [ 1223.734009]  vfs_write+0xab/0x190
> > > [ 1223.734061]  ksys_write+0x45/0xc0
> > > [ 1223.734094]  do_syscall_64+0x6b/0x330
> > > [ 1223.734214]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > [ 1223.734261] RIP: 0033:0x15536bcddfd0
> > > 
> > > The reproducer could be as simply as,
> > > 
> > > addr = mmap(..., fileA, ...);
> > > fdB = open(fileB, O_DIRECT);
> > > write(fdB, addr);
> > > 
> > > Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
> > 
> > I am looking at this now and trying to understand it better. One problem
> > I see is with page pinning. If a page is pinned, how do we make sure it
> > is not evicted by reclaim logic. IIUC, right now memory range relciam
> > logic is not aware about page pinning.
> > 
> > For instance, in above example, write(fdB, addr) will pin pages in fileA.
> > While write is going on we need to make sure dax memory ranges covering
> > those pages are not reclaimed.
> 
> Good question, it is indeed a problem, i_mmap_sem is not held by page
> pinning so that dax range reclaim is free to go, nor page pinning in
> within iomap_begin() and iomap_end() pair.
> 
> While that is a problem, this patch is fixing a different one because
> of the lack of ->page_free definition.

Sure, but in the definition of ->page_free, you are calling
"wake_up_var(page->_refcount)". This will make sense only if somebody
is waiting on this.

I am checking xfs code and it seems to be waiting for this. We probably
will have to modify virtio-fs to fix get_user_pages()/DMA races.

So give me some time to go through and understand what xfs/dax/mm stack
is doing for these races and try to fix it properly. 

Thanks
Vivek


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Virtio-fs] [PATCH] Virtiofs: fix null pointer deference in directIO
  2019-07-18 20:51     ` Vivek Goyal
@ 2019-07-18 23:39       ` Liu Bo
  2019-07-19 19:51         ` Vivek Goyal
  0 siblings, 1 reply; 12+ messages in thread
From: Liu Bo @ 2019-07-18 23:39 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs

On Thu, Jul 18, 2019 at 04:51:10PM -0400, Vivek Goyal wrote:
> On Thu, Jul 18, 2019 at 11:58:01AM -0700, Liu Bo wrote:
> > On Thu, Jul 18, 2019 at 02:18:39PM -0400, Vivek Goyal wrote:
> > > On Fri, Jul 12, 2019 at 06:35:42AM +0800, Liu Bo wrote:
> > > > Virtiofs has used memremap to create page structs for its managed memory
> > > > and set pagemap's type to MEMORY_DEVICE_FS_DAX.  Thus, the page from
> > > > virtiofs's managed memory is a ZONE_DEVICE page, if %devmap_managed_key
> > > > was enabled by pmem drivers, put_page() will run into
> > > > __put_devmap_managed_page().  However, as page_free ops is not defined, it
> > > > ends up with a NULL pointer dereference.
> > > > 
> > > > As virtiofs presents its managed memory range as a dax device, this patch
> > > > makes virtiofs mimic how nvdimm/pmem.c setups pagemap fsdax.
> > > > 
> > > > =================
> > > > 
> > > > [ 1223.732964] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
> > > > [ 1223.733043] PGD 800000026bd91067 P4D 800000026bd91067 PUD 27631e067 PMD 0
> > > > [ 1223.733085] Oops: 0010 [#1] SMP PTI
> > > > [ 1223.733114] CPU: 1 PID: 16505 Comm: build Not tainted 4.19.48-dff9509 #1
> > > > [ 1223.733157] RIP: 0010:          (null)
> > > > [ 1223.733179] Code: Bad RIP value.
> > > > ..
> > > > [ 1223.733671] Call Trace:
> > > > [ 1223.733704]  fuse_release_user_pages+0xad/0xb0
> > > > [ 1223.733751]  fuse_direct_io+0x3ec/0x6a0
> > > > [ 1223.733823]  fuse_dax_direct_write+0x3c/0x70
> > > > [ 1223.733863]  fuse_file_write_iter+0x2ea/0x4c0
> > > > [ 1223.733978]  __vfs_write+0xde/0x160
> > > > [ 1223.734009]  vfs_write+0xab/0x190
> > > > [ 1223.734061]  ksys_write+0x45/0xc0
> > > > [ 1223.734094]  do_syscall_64+0x6b/0x330
> > > > [ 1223.734214]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > [ 1223.734261] RIP: 0033:0x15536bcddfd0
> > > > 
> > > > The reproducer could be as simply as,
> > > > 
> > > > addr = mmap(..., fileA, ...);
> > > > fdB = open(fileB, O_DIRECT);
> > > > write(fdB, addr);
> > > > 
> > > > Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
> > > 
> > > I am looking at this now and trying to understand it better. One problem
> > > I see is with page pinning. If a page is pinned, how do we make sure it
> > > is not evicted by reclaim logic. IIUC, right now memory range relciam
> > > logic is not aware about page pinning.
> > > 
> > > For instance, in above example, write(fdB, addr) will pin pages in fileA.
> > > While write is going on we need to make sure dax memory ranges covering
> > > those pages are not reclaimed.
> > 
> > Good question, it is indeed a problem, i_mmap_sem is not held by page
> > pinning so that dax range reclaim is free to go, nor page pinning in
> > within iomap_begin() and iomap_end() pair.
> > 
> > While that is a problem, this patch is fixing a different one because
> > of the lack of ->page_free definition.
> 
> Sure, but in the definition of ->page_free, you are calling
> "wake_up_var(page->_refcount)". This will make sense only if somebody
> is waiting on this.
>

Right, looks like we also need to follow this patch[1], but looks like
the fix for the whole problem is still WIP[2].

> I am checking xfs code and it seems to be waiting for this. We probably
> will have to modify virtio-fs to fix get_user_pages()/DMA races.
> 
> So give me some time to go through and understand what xfs/dax/mm stack
> is doing for these races and try to fix it properly. 

Right, I didn't realize the race with dax mapping reclaim.

So there're two things to fix,
a) the race between pinned entries/pages and fs fallocate/truncate.
b) the race between pinned entries/pages and dax ranges background reclaim.

In terms of a), page->_refcount can be used to avoid the race between
dma and dax.

a "dmap->refcnt--" is actually what we need to add for b) at ->page_free.


[1]: "mm, fs, dax: handle layout changes to pinned dax mappings"
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5fac7408d828719db6d3fdba63e3c3726a6d1ee5

[2]: "mm: introduce put_user_page*(), placeholder versions"
https://patchwork.kernel.org/patch/10637929/

thanks,
-liubo


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Virtio-fs] [PATCH] Virtiofs: fix null pointer deference in directIO
  2019-07-18 23:39       ` Liu Bo
@ 2019-07-19 19:51         ` Vivek Goyal
  2019-07-20  2:34           ` Liu Bo
  0 siblings, 1 reply; 12+ messages in thread
From: Vivek Goyal @ 2019-07-19 19:51 UTC (permalink / raw)
  To: Liu Bo; +Cc: virtio-fs

On Thu, Jul 18, 2019 at 04:39:12PM -0700, Liu Bo wrote:
> On Thu, Jul 18, 2019 at 04:51:10PM -0400, Vivek Goyal wrote:
> > On Thu, Jul 18, 2019 at 11:58:01AM -0700, Liu Bo wrote:
> > > On Thu, Jul 18, 2019 at 02:18:39PM -0400, Vivek Goyal wrote:
> > > > On Fri, Jul 12, 2019 at 06:35:42AM +0800, Liu Bo wrote:
> > > > > Virtiofs has used memremap to create page structs for its managed memory
> > > > > and set pagemap's type to MEMORY_DEVICE_FS_DAX.  Thus, the page from
> > > > > virtiofs's managed memory is a ZONE_DEVICE page, if %devmap_managed_key
> > > > > was enabled by pmem drivers, put_page() will run into
> > > > > __put_devmap_managed_page().  However, as page_free ops is not defined, it
> > > > > ends up with a NULL pointer dereference.
> > > > > 
> > > > > As virtiofs presents its managed memory range as a dax device, this patch
> > > > > makes virtiofs mimic how nvdimm/pmem.c setups pagemap fsdax.
> > > > > 
> > > > > =================
> > > > > 
> > > > > [ 1223.732964] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
> > > > > [ 1223.733043] PGD 800000026bd91067 P4D 800000026bd91067 PUD 27631e067 PMD 0
> > > > > [ 1223.733085] Oops: 0010 [#1] SMP PTI
> > > > > [ 1223.733114] CPU: 1 PID: 16505 Comm: build Not tainted 4.19.48-dff9509 #1
> > > > > [ 1223.733157] RIP: 0010:          (null)
> > > > > [ 1223.733179] Code: Bad RIP value.
> > > > > ..
> > > > > [ 1223.733671] Call Trace:
> > > > > [ 1223.733704]  fuse_release_user_pages+0xad/0xb0
> > > > > [ 1223.733751]  fuse_direct_io+0x3ec/0x6a0
> > > > > [ 1223.733823]  fuse_dax_direct_write+0x3c/0x70
> > > > > [ 1223.733863]  fuse_file_write_iter+0x2ea/0x4c0
> > > > > [ 1223.733978]  __vfs_write+0xde/0x160
> > > > > [ 1223.734009]  vfs_write+0xab/0x190
> > > > > [ 1223.734061]  ksys_write+0x45/0xc0
> > > > > [ 1223.734094]  do_syscall_64+0x6b/0x330
> > > > > [ 1223.734214]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > > [ 1223.734261] RIP: 0033:0x15536bcddfd0
> > > > > 
> > > > > The reproducer could be as simply as,
> > > > > 
> > > > > addr = mmap(..., fileA, ...);
> > > > > fdB = open(fileB, O_DIRECT);
> > > > > write(fdB, addr);
> > > > > 
> > > > > Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
> > > > 
> > > > I am looking at this now and trying to understand it better. One problem
> > > > I see is with page pinning. If a page is pinned, how do we make sure it
> > > > is not evicted by reclaim logic. IIUC, right now memory range relciam
> > > > logic is not aware about page pinning.
> > > > 
> > > > For instance, in above example, write(fdB, addr) will pin pages in fileA.
> > > > While write is going on we need to make sure dax memory ranges covering
> > > > those pages are not reclaimed.
> > > 
> > > Good question, it is indeed a problem, i_mmap_sem is not held by page
> > > pinning so that dax range reclaim is free to go, nor page pinning in
> > > within iomap_begin() and iomap_end() pair.
> > > 
> > > While that is a problem, this patch is fixing a different one because
> > > of the lack of ->page_free definition.
> > 
> > Sure, but in the definition of ->page_free, you are calling
> > "wake_up_var(page->_refcount)". This will make sense only if somebody
> > is waiting on this.
> >
> 
> Right, looks like we also need to follow this patch[1], but looks like
> the fix for the whole problem is still WIP[2].
> 
> > I am checking xfs code and it seems to be waiting for this. We probably
> > will have to modify virtio-fs to fix get_user_pages()/DMA races.
> > 
> > So give me some time to go through and understand what xfs/dax/mm stack
> > is doing for these races and try to fix it properly. 
> 
> Right, I didn't realize the race with dax mapping reclaim.
> 
> So there're two things to fix,
> a) the race between pinned entries/pages and fs fallocate/truncate.
> b) the race between pinned entries/pages and dax ranges background reclaim.

Right.

I think we can take care of both the races by calling
dax_layout_busy_page() in path a) and path b)?

> 
> In terms of a), page->_refcount can be used to avoid the race between
> dma and dax.
> 
> a "dmap->refcnt--" is actually what we need to add for b) at ->page_free.

I am not seeing any callback where we can increment dmap->refcnt. That
probably means we will have to reply on calling dax_layout_busy_page()
in reclaim path. This is expected to be called with inode lock held.

I am thinking of following sequence.

1. inode_trylock(inode)
2. Take i_mmap_sem
3. call dax_layout_busy_page() to make sure no pages on this inode are idle.
4. Take i_dmap_sem
5. Reclaim range

This means that we can't drop inode lock in range reclaim path as your
dmap refcount patches are currently doing.

This probably means that range reclaim might get enen more slower. I 
am thinking that once we get hold of a inode lock and have been
able to do dax_layout_busy_page(), we should reclaim multiple ranges
from same inode (using one FUSE_REMOVEMAPPING()) command, instead
of freeing one range at a time.

And that might provide some speed up with range reclaim.

/me is not 100% sure yet if we need to keep inode lock held when we 
call dax_layout_busy_page(). As of now truncate/punch_hole seems
to hold it. Also we are waiting for pages to become idle, so it
kind of makes sense to not start more read/writes while we are
waiting for pages to become idle.

Thanks
Vivek

> 
> 
> [1]: "mm, fs, dax: handle layout changes to pinned dax mappings"
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5fac7408d828719db6d3fdba63e3c3726a6d1ee5
> 
> [2]: "mm: introduce put_user_page*(), placeholder versions"
> https://patchwork.kernel.org/patch/10637929/
> 
> thanks,
> -liubo


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Virtio-fs] [PATCH] Virtiofs: fix null pointer deference in directIO
  2019-07-19 19:51         ` Vivek Goyal
@ 2019-07-20  2:34           ` Liu Bo
  0 siblings, 0 replies; 12+ messages in thread
From: Liu Bo @ 2019-07-20  2:34 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs

On Fri, Jul 19, 2019 at 03:51:30PM -0400, Vivek Goyal wrote:
> On Thu, Jul 18, 2019 at 04:39:12PM -0700, Liu Bo wrote:
> > On Thu, Jul 18, 2019 at 04:51:10PM -0400, Vivek Goyal wrote:
> > > On Thu, Jul 18, 2019 at 11:58:01AM -0700, Liu Bo wrote:
> > > > On Thu, Jul 18, 2019 at 02:18:39PM -0400, Vivek Goyal wrote:
> > > > > On Fri, Jul 12, 2019 at 06:35:42AM +0800, Liu Bo wrote:
> > > > > > Virtiofs has used memremap to create page structs for its managed memory
> > > > > > and set pagemap's type to MEMORY_DEVICE_FS_DAX.  Thus, the page from
> > > > > > virtiofs's managed memory is a ZONE_DEVICE page, if %devmap_managed_key
> > > > > > was enabled by pmem drivers, put_page() will run into
> > > > > > __put_devmap_managed_page().  However, as page_free ops is not defined, it
> > > > > > ends up with a NULL pointer dereference.
> > > > > > 
> > > > > > As virtiofs presents its managed memory range as a dax device, this patch
> > > > > > makes virtiofs mimic how nvdimm/pmem.c setups pagemap fsdax.
> > > > > > 
> > > > > > =================
> > > > > > 
> > > > > > [ 1223.732964] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
> > > > > > [ 1223.733043] PGD 800000026bd91067 P4D 800000026bd91067 PUD 27631e067 PMD 0
> > > > > > [ 1223.733085] Oops: 0010 [#1] SMP PTI
> > > > > > [ 1223.733114] CPU: 1 PID: 16505 Comm: build Not tainted 4.19.48-dff9509 #1
> > > > > > [ 1223.733157] RIP: 0010:          (null)
> > > > > > [ 1223.733179] Code: Bad RIP value.
> > > > > > ..
> > > > > > [ 1223.733671] Call Trace:
> > > > > > [ 1223.733704]  fuse_release_user_pages+0xad/0xb0
> > > > > > [ 1223.733751]  fuse_direct_io+0x3ec/0x6a0
> > > > > > [ 1223.733823]  fuse_dax_direct_write+0x3c/0x70
> > > > > > [ 1223.733863]  fuse_file_write_iter+0x2ea/0x4c0
> > > > > > [ 1223.733978]  __vfs_write+0xde/0x160
> > > > > > [ 1223.734009]  vfs_write+0xab/0x190
> > > > > > [ 1223.734061]  ksys_write+0x45/0xc0
> > > > > > [ 1223.734094]  do_syscall_64+0x6b/0x330
> > > > > > [ 1223.734214]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > > > [ 1223.734261] RIP: 0033:0x15536bcddfd0
> > > > > > 
> > > > > > The reproducer could be as simply as,
> > > > > > 
> > > > > > addr = mmap(..., fileA, ...);
> > > > > > fdB = open(fileB, O_DIRECT);
> > > > > > write(fdB, addr);
> > > > > > 
> > > > > > Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
> > > > > 
> > > > > I am looking at this now and trying to understand it better. One problem
> > > > > I see is with page pinning. If a page is pinned, how do we make sure it
> > > > > is not evicted by reclaim logic. IIUC, right now memory range relciam
> > > > > logic is not aware about page pinning.
> > > > > 
> > > > > For instance, in above example, write(fdB, addr) will pin pages in fileA.
> > > > > While write is going on we need to make sure dax memory ranges covering
> > > > > those pages are not reclaimed.
> > > > 
> > > > Good question, it is indeed a problem, i_mmap_sem is not held by page
> > > > pinning so that dax range reclaim is free to go, nor page pinning in
> > > > within iomap_begin() and iomap_end() pair.
> > > > 
> > > > While that is a problem, this patch is fixing a different one because
> > > > of the lack of ->page_free definition.
> > > 
> > > Sure, but in the definition of ->page_free, you are calling
> > > "wake_up_var(page->_refcount)". This will make sense only if somebody
> > > is waiting on this.
> > >
> > 
> > Right, looks like we also need to follow this patch[1], but looks like
> > the fix for the whole problem is still WIP[2].
> > 
> > > I am checking xfs code and it seems to be waiting for this. We probably
> > > will have to modify virtio-fs to fix get_user_pages()/DMA races.
> > > 
> > > So give me some time to go through and understand what xfs/dax/mm stack
> > > is doing for these races and try to fix it properly. 
> > 
> > Right, I didn't realize the race with dax mapping reclaim.
> > 
> > So there're two things to fix,
> > a) the race between pinned entries/pages and fs fallocate/truncate.
> > b) the race between pinned entries/pages and dax ranges background reclaim.
> 
> Right.
> 
> I think we can take care of both the races by calling
> dax_layout_busy_page() in path a) and path b)?
>

I think so, that's what we have for now.

> > 
> > In terms of a), page->_refcount can be used to avoid the race between
> > dma and dax.
> > 
> > a "dmap->refcnt--" is actually what we need to add for b) at ->page_free.
> 
> I am not seeing any callback where we can increment dmap->refcnt. That
> probably means we will have to reply on calling dax_layout_busy_page()
> in reclaim path. This is expected to be called with inode lock held.
> 
> I am thinking of following sequence.
> 
> 1. inode_trylock(inode)
> 2. Take i_mmap_sem
> 3. call dax_layout_busy_page() to make sure no pages on this inode are idle.
> 4. Take i_dmap_sem
> 5. Reclaim range

I realized that callers of dax_layout_busy_page() only need to block
any new mapping creation, which means a down_read(&i_mmap_sem) is
enough for the protection.

If dax_layout_busy_page() returns any pinned entry, we can skip the
reclaim.

Although dax_layout_busy_page() does unmap_mapping_range() to
disassociate the dax pfn and its mapped pte, next time when the pte is
walked by mmu, it will refault in the dax pfn.

The whole assumption of using dax_layout_busy_page() is that pinned memory
will be released very soon, I think it's also matching our virtiofs case,
e.g. a direct IO.

thanks,
-liubo
> 
> This means that we can't drop inode lock in range reclaim path as your
> dmap refcount patches are currently doing.
> 
> This probably means that range reclaim might get enen more slower. I 
> am thinking that once we get hold of a inode lock and have been
> able to do dax_layout_busy_page(), we should reclaim multiple ranges
> from same inode (using one FUSE_REMOVEMAPPING()) command, instead
> of freeing one range at a time.
> 
> And that might provide some speed up with range reclaim.
> 
> /me is not 100% sure yet if we need to keep inode lock held when we 
> call dax_layout_busy_page(). As of now truncate/punch_hole seems
> to hold it. Also we are waiting for pages to become idle, so it
> kind of makes sense to not start more read/writes while we are
> waiting for pages to become idle.
> 
> Thanks
> Vivek
> 
> > 
> > 
> > [1]: "mm, fs, dax: handle layout changes to pinned dax mappings"
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5fac7408d828719db6d3fdba63e3c3726a6d1ee5
> > 
> > [2]: "mm: introduce put_user_page*(), placeholder versions"
> > https://patchwork.kernel.org/patch/10637929/
> > 
> > thanks,
> > -liubo


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Virtio-fs] [PATCH] Virtiofs: fix null pointer deference in directIO
  2019-07-11 22:35 [Virtio-fs] [PATCH] Virtiofs: fix null pointer deference in directIO Liu Bo
  2019-07-18  0:21 ` Liu Bo
  2019-07-18 18:18 ` Vivek Goyal
@ 2019-07-30 15:30 ` Vivek Goyal
  2019-07-30 22:29   ` Liu Bo
  2 siblings, 1 reply; 12+ messages in thread
From: Vivek Goyal @ 2019-07-30 15:30 UTC (permalink / raw)
  To: Liu Bo; +Cc: virtio-fs

Hi Liu Bo,

I have committed your patch for now here.

https://github.com/rhvgoyal/linux/commits/virtio-fs-dev-5.1

I also added a patch to wait for page reference count to drop to 1 before
range is reclaimed. It slows down mmap() workload very significantly
though.

Right now dmap reference patches are not there. This branch is still WIP
and once I have done more testing and possible dax optimization, I will
look into re-applying dmap reference changes on top of it and see if
it can be done reasonably without breaking other things.

Will be good if you give this branch a try.

Thanks
Vivek

On Fri, Jul 12, 2019 at 06:35:42AM +0800, Liu Bo wrote:
> Virtiofs has used memremap to create page structs for its managed memory
> and set pagemap's type to MEMORY_DEVICE_FS_DAX.  Thus, the page from
> virtiofs's managed memory is a ZONE_DEVICE page, if %devmap_managed_key
> was enabled by pmem drivers, put_page() will run into
> __put_devmap_managed_page().  However, as page_free ops is not defined, it
> ends up with a NULL pointer dereference.
> 
> As virtiofs presents its managed memory range as a dax device, this patch
> makes virtiofs mimic how nvdimm/pmem.c setups pagemap fsdax.
> 
> =================
> 
> [ 1223.732964] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
> [ 1223.733043] PGD 800000026bd91067 P4D 800000026bd91067 PUD 27631e067 PMD 0
> [ 1223.733085] Oops: 0010 [#1] SMP PTI
> [ 1223.733114] CPU: 1 PID: 16505 Comm: build Not tainted 4.19.48-dff9509 #1
> [ 1223.733157] RIP: 0010:          (null)
> [ 1223.733179] Code: Bad RIP value.
> ..
> [ 1223.733671] Call Trace:
> [ 1223.733704]  fuse_release_user_pages+0xad/0xb0
> [ 1223.733751]  fuse_direct_io+0x3ec/0x6a0
> [ 1223.733823]  fuse_dax_direct_write+0x3c/0x70
> [ 1223.733863]  fuse_file_write_iter+0x2ea/0x4c0
> [ 1223.733978]  __vfs_write+0xde/0x160
> [ 1223.734009]  vfs_write+0xab/0x190
> [ 1223.734061]  ksys_write+0x45/0xc0
> [ 1223.734094]  do_syscall_64+0x6b/0x330
> [ 1223.734214]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 1223.734261] RIP: 0033:0x15536bcddfd0
> 
> The reproducer could be as simply as,
> 
> addr = mmap(..., fileA, ...);
> fdB = open(fileB, O_DIRECT);
> write(fdB, addr);
> 
> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
> ---
>  fs/fuse/virtio_fs.c | 26 +++++++++++++++++++++++++-
>  1 file changed, 25 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> index 05ea28c..9b3f2e9 100644
> --- a/fs/fuse/virtio_fs.c
> +++ b/fs/fuse/virtio_fs.c
> @@ -538,6 +538,28 @@ static void virtio_fs_cleanup_dax(void *data)
>  	put_dax(fs->dax_dev);
>  }
>  
> +static void virtio_fs_release_pgmap_ops(void *_pgmap)
> +{
> +	dev_pagemap_put_ops();
> +}
> +
> +static void virtio_fs_fsdax_pagefree(struct page *page, void *data)
> +{
> +	wake_up_var(&page->_refcount);
> +}
> +
> +static int virtio_fs_setup_pagemap_fsdax(struct device *dev,
> +					 struct dev_pagemap *pgmap)
> +{
> +	dev_pagemap_get_ops();
> +	if (devm_add_action_or_reset(dev, virtio_fs_release_pgmap_ops, pgmap))
> +		return -ENOMEM;
> +	pgmap->type = MEMORY_DEVICE_FS_DAX;
> +	pgmap->page_free = virtio_fs_fsdax_pagefree;
> +
> +	return 0;
> +}
> +
>  static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
>  {
>  	struct virtio_shm_region cache_reg, journal_reg, vertab_reg;
> @@ -600,7 +622,9 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
>  	pgmap->altmap_valid = false;
>  	pgmap->ref = &mi->ref;
>  	pgmap->kill = virtio_fs_percpu_kill;
> -	pgmap->type = MEMORY_DEVICE_FS_DAX;
> +
> +	if (virtio_fs_setup_pagemap_fsdax(&vdev->dev, pgmap))
> +		return -ENOMEM;
>  
>  	/* Ideally we would directly use the PCI BAR resource but
>  	 * devm_memremap_pages() wants its own copy in pgmap.  So
> -- 
> 1.8.3.1
> 
> _______________________________________________
> Virtio-fs mailing list
> Virtio-fs@redhat.com
> https://www.redhat.com/mailman/listinfo/virtio-fs


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Virtio-fs] [PATCH] Virtiofs: fix null pointer deference in directIO
  2019-07-30 15:30 ` Vivek Goyal
@ 2019-07-30 22:29   ` Liu Bo
  2019-07-31 12:39     ` Vivek Goyal
  0 siblings, 1 reply; 12+ messages in thread
From: Liu Bo @ 2019-07-30 22:29 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs

On Tue, Jul 30, 2019 at 11:30:41AM -0400, Vivek Goyal wrote:
> Hi Liu Bo,
> 
> I have committed your patch for now here.
> 
> https://github.com/rhvgoyal/linux/commits/virtio-fs-dev-5.1
> 
> I also added a patch to wait for page reference count to drop to 1 before
> range is reclaimed. It slows down mmap() workload very significantly
> though.
>

Yes, the impact on mmap is expected because of the 'unmap' in layout checking,
it'd be helpful to check layout on the dmap range instead of the whole inode.

However, there is a race bug in the current code between mmap workloads and dmap
reclaim, since the reclaim does not unmap the dmap range at all, such that the
dmap range might still stay in some process's page table, if such a dmap is
reclaimed, the situation would be worse.  So with unmapping the whole inode in
layout checking, at least it is correct in term of that race.

> Right now dmap reference patches are not there. This branch is still WIP
> and once I have done more testing and possible dax optimization, I will
> look into re-applying dmap reference changes on top of it and see if
> it can be done reasonably without breaking other things.
> 
> Will be good if you give this branch a try.

Sure, I've already made similar changes internally.

thanks,
-liubo

> 
> Thanks
> Vivek
> 
> On Fri, Jul 12, 2019 at 06:35:42AM +0800, Liu Bo wrote:
> > Virtiofs has used memremap to create page structs for its managed memory
> > and set pagemap's type to MEMORY_DEVICE_FS_DAX.  Thus, the page from
> > virtiofs's managed memory is a ZONE_DEVICE page, if %devmap_managed_key
> > was enabled by pmem drivers, put_page() will run into
> > __put_devmap_managed_page().  However, as page_free ops is not defined, it
> > ends up with a NULL pointer dereference.
> > 
> > As virtiofs presents its managed memory range as a dax device, this patch
> > makes virtiofs mimic how nvdimm/pmem.c setups pagemap fsdax.
> > 
> > =================
> > 
> > [ 1223.732964] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
> > [ 1223.733043] PGD 800000026bd91067 P4D 800000026bd91067 PUD 27631e067 PMD 0
> > [ 1223.733085] Oops: 0010 [#1] SMP PTI
> > [ 1223.733114] CPU: 1 PID: 16505 Comm: build Not tainted 4.19.48-dff9509 #1
> > [ 1223.733157] RIP: 0010:          (null)
> > [ 1223.733179] Code: Bad RIP value.
> > ..
> > [ 1223.733671] Call Trace:
> > [ 1223.733704]  fuse_release_user_pages+0xad/0xb0
> > [ 1223.733751]  fuse_direct_io+0x3ec/0x6a0
> > [ 1223.733823]  fuse_dax_direct_write+0x3c/0x70
> > [ 1223.733863]  fuse_file_write_iter+0x2ea/0x4c0
> > [ 1223.733978]  __vfs_write+0xde/0x160
> > [ 1223.734009]  vfs_write+0xab/0x190
> > [ 1223.734061]  ksys_write+0x45/0xc0
> > [ 1223.734094]  do_syscall_64+0x6b/0x330
> > [ 1223.734214]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > [ 1223.734261] RIP: 0033:0x15536bcddfd0
> > 
> > The reproducer could be as simply as,
> > 
> > addr = mmap(..., fileA, ...);
> > fdB = open(fileB, O_DIRECT);
> > write(fdB, addr);
> > 
> > Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
> > ---
> >  fs/fuse/virtio_fs.c | 26 +++++++++++++++++++++++++-
> >  1 file changed, 25 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> > index 05ea28c..9b3f2e9 100644
> > --- a/fs/fuse/virtio_fs.c
> > +++ b/fs/fuse/virtio_fs.c
> > @@ -538,6 +538,28 @@ static void virtio_fs_cleanup_dax(void *data)
> >  	put_dax(fs->dax_dev);
> >  }
> >  
> > +static void virtio_fs_release_pgmap_ops(void *_pgmap)
> > +{
> > +	dev_pagemap_put_ops();
> > +}
> > +
> > +static void virtio_fs_fsdax_pagefree(struct page *page, void *data)
> > +{
> > +	wake_up_var(&page->_refcount);
> > +}
> > +
> > +static int virtio_fs_setup_pagemap_fsdax(struct device *dev,
> > +					 struct dev_pagemap *pgmap)
> > +{
> > +	dev_pagemap_get_ops();
> > +	if (devm_add_action_or_reset(dev, virtio_fs_release_pgmap_ops, pgmap))
> > +		return -ENOMEM;
> > +	pgmap->type = MEMORY_DEVICE_FS_DAX;
> > +	pgmap->page_free = virtio_fs_fsdax_pagefree;
> > +
> > +	return 0;
> > +}
> > +
> >  static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
> >  {
> >  	struct virtio_shm_region cache_reg, journal_reg, vertab_reg;
> > @@ -600,7 +622,9 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
> >  	pgmap->altmap_valid = false;
> >  	pgmap->ref = &mi->ref;
> >  	pgmap->kill = virtio_fs_percpu_kill;
> > -	pgmap->type = MEMORY_DEVICE_FS_DAX;
> > +
> > +	if (virtio_fs_setup_pagemap_fsdax(&vdev->dev, pgmap))
> > +		return -ENOMEM;
> >  
> >  	/* Ideally we would directly use the PCI BAR resource but
> >  	 * devm_memremap_pages() wants its own copy in pgmap.  So
> > -- 
> > 1.8.3.1
> > 
> > _______________________________________________
> > Virtio-fs mailing list
> > Virtio-fs@redhat.com
> > https://www.redhat.com/mailman/listinfo/virtio-fs


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Virtio-fs] [PATCH] Virtiofs: fix null pointer deference in directIO
  2019-07-30 22:29   ` Liu Bo
@ 2019-07-31 12:39     ` Vivek Goyal
  2019-07-31 17:53       ` Liu Bo
  0 siblings, 1 reply; 12+ messages in thread
From: Vivek Goyal @ 2019-07-31 12:39 UTC (permalink / raw)
  To: Liu Bo; +Cc: virtio-fs

On Tue, Jul 30, 2019 at 03:29:39PM -0700, Liu Bo wrote:
> On Tue, Jul 30, 2019 at 11:30:41AM -0400, Vivek Goyal wrote:
> > Hi Liu Bo,
> > 
> > I have committed your patch for now here.
> > 
> > https://github.com/rhvgoyal/linux/commits/virtio-fs-dev-5.1
> > 
> > I also added a patch to wait for page reference count to drop to 1 before
> > range is reclaimed. It slows down mmap() workload very significantly
> > though.
> >
> 
> Yes, the impact on mmap is expected because of the 'unmap' in layout checking,
> it'd be helpful to check layout on the dmap range instead of the whole inode.
> 
> However, there is a race bug in the current code between mmap workloads and dmap
> reclaim, since the reclaim does not unmap the dmap range at all, such that the
> dmap range might still stay in some process's page table, if such a dmap is
> reclaimed, the situation would be worse.  So with unmapping the whole inode in
> layout checking, at least it is correct in term of that race.

Can you elaborate more on this. I am not sure I am able to see the race
you are talking about. Assume we implemented logic to unmap only pages
which are in dmap (and not whole inode).

dmap is something internal to virtio-fs. If we unmap all pages which are
in a dmap, then dmap is free to be reclaimed. We are holding fi->i_mmap_sem,
and that should make sure that unmapped pages don't get mapped again (page
fault code will block on fi->i_mmap_sem).

Thanks
Vivek


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Virtio-fs] [PATCH] Virtiofs: fix null pointer deference in directIO
  2019-07-31 12:39     ` Vivek Goyal
@ 2019-07-31 17:53       ` Liu Bo
  0 siblings, 0 replies; 12+ messages in thread
From: Liu Bo @ 2019-07-31 17:53 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs

On Wed, Jul 31, 2019 at 08:39:38AM -0400, Vivek Goyal wrote:
> On Tue, Jul 30, 2019 at 03:29:39PM -0700, Liu Bo wrote:
> > On Tue, Jul 30, 2019 at 11:30:41AM -0400, Vivek Goyal wrote:
> > > Hi Liu Bo,
> > > 
> > > I have committed your patch for now here.
> > > 
> > > https://github.com/rhvgoyal/linux/commits/virtio-fs-dev-5.1
> > > 
> > > I also added a patch to wait for page reference count to drop to 1 before
> > > range is reclaimed. It slows down mmap() workload very significantly
> > > though.
> > >
> > 
> > Yes, the impact on mmap is expected because of the 'unmap' in layout checking,
> > it'd be helpful to check layout on the dmap range instead of the whole inode.
> > 
> > However, there is a race bug in the current code between mmap workloads and dmap
> > reclaim, since the reclaim does not unmap the dmap range at all, such that the
> > dmap range might still stay in some process's page table, if such a dmap is
> > reclaimed, the situation would be worse.  So with unmapping the whole inode in
> > layout checking, at least it is correct in term of that race.
> 
> Can you elaborate more on this. I am not sure I am able to see the race
> you are talking about. Assume we implemented logic to unmap only pages
> which are in dmap (and not whole inode).
>

Sorry for confusing you, what I was trying to say is that the above
problem exists without doing unmapping pages (no matter within the
whole inode or only the dmap range), IOW, before commit "fuse: Free
dax mapping range only if page refcount is 1", the race bug exists.

thanks,
-liubo

> dmap is something internal to virtio-fs. If we unmap all pages which are
> in a dmap, then dmap is free to be reclaimed. We are holding fi->i_mmap_sem,
> and that should make sure that unmapped pages don't get mapped again (page
> fault code will block on fi->i_mmap_sem).
> 
> Thanks
> Vivek


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2019-07-31 17:53 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-07-11 22:35 [Virtio-fs] [PATCH] Virtiofs: fix null pointer deference in directIO Liu Bo
2019-07-18  0:21 ` Liu Bo
2019-07-18 18:18 ` Vivek Goyal
2019-07-18 18:58   ` Liu Bo
2019-07-18 20:51     ` Vivek Goyal
2019-07-18 23:39       ` Liu Bo
2019-07-19 19:51         ` Vivek Goyal
2019-07-20  2:34           ` Liu Bo
2019-07-30 15:30 ` Vivek Goyal
2019-07-30 22:29   ` Liu Bo
2019-07-31 12:39     ` Vivek Goyal
2019-07-31 17:53       ` Liu Bo

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.