All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH RFC] introduce ioctl to completely invalidate page cache
       [not found] <1412266184-23776-1-git-send-email-thanos.makatos@citrix.com>
@ 2014-10-02 19:59 ` Jens Axboe
  2014-10-03  5:27   ` Dave Chinner
                     ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Jens Axboe @ 2014-10-02 19:59 UTC (permalink / raw)
  To: Thanos Makatos, linux-fsdevel
  Cc: linux-kernel, linux-api, jlayton, bfields, jack

On 10/02/2014 10:09 AM, Thanos Makatos wrote:
> This patch introduces a new ioctl called BLKFLUSHBUFS2, which is pretty
> similar to BLKFLUSHBUFS except that is also invalidates the page cache.
> This allows for a complete invalidation of the cached data of a
> particular block device, which might be useful for cases like
> synchronising the caches of an iSCSI block device used by multiple
> hosts.
> 
> Signed-off-by: Thanos Makatos <thanos.makatos@citrix.com>
> ---
>  block/compat_ioctl.c    |    1 +
>  block/ioctl.c           |   13 +++++++++++--
>  include/uapi/linux/fs.h |    1 +
>  3 files changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/block/compat_ioctl.c b/block/compat_ioctl.c
> index 18b282c..672388ab 100644
> --- a/block/compat_ioctl.c
> +++ b/block/compat_ioctl.c
> @@ -688,6 +688,7 @@ long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg)
>  	case BLKDISCARDZEROES:
>  		return compat_put_uint(arg, bdev_discard_zeroes_data(bdev));
>  	case BLKFLSBUF:
> +	case BLKFLSBUF2:
>  	case BLKROSET:
>  	case BLKDISCARD:
>  	case BLKSECDISCARD:
> diff --git a/block/ioctl.c b/block/ioctl.c
> index d6cda81..0c427a7 100644
> --- a/block/ioctl.c
> +++ b/block/ioctl.c
> @@ -268,6 +268,12 @@ static inline int is_unrecognized_ioctl(int ret)
>  		ret == -ENOIOCTLCMD;
>  }
>  
> +static void flush_buffer_cache(struct block_device *bdev)
> +{
> +	fsync_bdev(bdev);
> +	invalidate_bdev(bdev);
> +}
> +
>  /*
>   * always keep this in sync with compat_blkdev_ioctl()
>   */
> @@ -282,6 +288,7 @@ int blkdev_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd,
>  
>  	switch(cmd) {
>  	case BLKFLSBUF:
> +	case BLKFLSBUF2:
>  		if (!capable(CAP_SYS_ADMIN))
>  			return -EACCES;
>  
> @@ -289,8 +296,10 @@ int blkdev_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd,
>  		if (!is_unrecognized_ioctl(ret))
>  			return ret;
>  
> -		fsync_bdev(bdev);
> -		invalidate_bdev(bdev);
> +		flush_buffer_cache(bdev);
> +		if (BLKFLSBUF2 == cmd)
> +			return invalidate_inode_pages2(
> +					bdev->bd_inode->i_mapping);
>  		return 0;

We're currently ignoring the buffer cache sync and invalidation (which
is odd), but at least being consistent would be good.

Might also need a filemap_write_and_wait() to sync before invalidation.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC] introduce ioctl to completely invalidate page cache
  2014-10-02 19:59 ` [PATCH RFC] introduce ioctl to completely invalidate page cache Jens Axboe
@ 2014-10-03  5:27   ` Dave Chinner
  2014-10-03  9:00     ` Thanos Makatos
  2014-10-03  9:25     ` Thanos Makatos
  2014-10-06  8:06     ` Jan Kara
  2 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2014-10-03  5:27 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Thanos Makatos, linux-fsdevel, linux-kernel, linux-api, jlayton,
	bfields, jack

On Thu, Oct 02, 2014 at 01:59:40PM -0600, Jens Axboe wrote:
> On 10/02/2014 10:09 AM, Thanos Makatos wrote:
> > This patch introduces a new ioctl called BLKFLUSHBUFS2, which is pretty

What a horrible name. Whatever happened to naming ioctls interfaces
after their function? i.e. BLKFLUSHINVAL?

Cheers,

Dave?

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: [PATCH RFC] introduce ioctl to completely invalidate page cache
  2014-10-03  5:27   ` Dave Chinner
@ 2014-10-03  9:00     ` Thanos Makatos
  0 siblings, 0 replies; 16+ messages in thread
From: Thanos Makatos @ 2014-10-03  9:00 UTC (permalink / raw)
  To: 'Dave Chinner', Jens Axboe
  Cc: linux-fsdevel, linux-kernel, linux-api, jlayton, bfields, jack

> > > This patch introduces a new ioctl called BLKFLUSHBUFS2, which is
> > > pretty
> 
> What a horrible name. Whatever happened to naming ioctls interfaces after
> their function? i.e. BLKFLUSHINVAL?

Indeed it's not a good name, I'm open to suggestions!

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: [PATCH RFC] introduce ioctl to completely invalidate page cache
@ 2014-10-03  9:25     ` Thanos Makatos
  0 siblings, 0 replies; 16+ messages in thread
From: Thanos Makatos @ 2014-10-03  9:25 UTC (permalink / raw)
  To: 'Jens Axboe', linux-fsdevel
  Cc: linux-kernel, linux-api, jlayton, bfields, jack

> > -		fsync_bdev(bdev);
> > -		invalidate_bdev(bdev);
> > +		flush_buffer_cache(bdev);
> > +		if (BLKFLSBUF2 == cmd)
> > +			return invalidate_inode_pages2(
> > +					bdev->bd_inode->i_mapping);
> >  		return 0;
> 
> We're currently ignoring the buffer cache sync and invalidation (which is odd),
> but at least being consistent would be good.
> 
> Might also need a filemap_write_and_wait() to sync before invalidation.

(I've got zero knowledge in this area, so excuse my ignorance!)

Does filemap_write_and_wait() writes back modified, memory-mapped pages? If so, isn't there a race condition? Or have I got it completely wrong?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: [PATCH RFC] introduce ioctl to completely invalidate page cache
@ 2014-10-03  9:25     ` Thanos Makatos
  0 siblings, 0 replies; 16+ messages in thread
From: Thanos Makatos @ 2014-10-03  9:25 UTC (permalink / raw)
  To: 'Jens Axboe', linux-fsdevel-u79uwXL29TY76Z2rM5mHXA
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, jlayton-vpEMnDpepFuMZCB2o+C8xQ,
	bfields-uC3wQj2KruNg9hUCZPvPmw, jack-AlSwsSmVLrQ

> > -		fsync_bdev(bdev);
> > -		invalidate_bdev(bdev);
> > +		flush_buffer_cache(bdev);
> > +		if (BLKFLSBUF2 == cmd)
> > +			return invalidate_inode_pages2(
> > +					bdev->bd_inode->i_mapping);
> >  		return 0;
> 
> We're currently ignoring the buffer cache sync and invalidation (which is odd),
> but at least being consistent would be good.
> 
> Might also need a filemap_write_and_wait() to sync before invalidation.

(I've got zero knowledge in this area, so excuse my ignorance!)

Does filemap_write_and_wait() writes back modified, memory-mapped pages? If so, isn't there a race condition? Or have I got it completely wrong?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC] introduce ioctl to completely invalidate page cache
  2014-10-03  9:25     ` Thanos Makatos
  (?)
@ 2014-10-03 14:28     ` Jens Axboe
  -1 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2014-10-03 14:28 UTC (permalink / raw)
  To: Thanos Makatos, linux-fsdevel
  Cc: linux-kernel, linux-api, jlayton, bfields, jack

On 2014-10-03 03:25, Thanos Makatos wrote:
>>> -		fsync_bdev(bdev);
>>> -		invalidate_bdev(bdev);
>>> +		flush_buffer_cache(bdev);
>>> +		if (BLKFLSBUF2 == cmd)
>>> +			return invalidate_inode_pages2(
>>> +					bdev->bd_inode->i_mapping);
>>>   		return 0;
>>
>> We're currently ignoring the buffer cache sync and invalidation (which is odd),
>> but at least being consistent would be good.
>>
>> Might also need a filemap_write_and_wait() to sync before invalidation.
>
> (I've got zero knowledge in this area, so excuse my ignorance!)
>
> Does filemap_write_and_wait() writes back modified, memory-mapped pages? If so,
> isn't there a race condition? Or have I got it completely wrong?

There's no race to be concerned of for this ioctl. Any page dirtied 
before you make the call will be synced, any page dirtied after may not. 
This is no different than what would happen on the buffer cache side in 
the current ioctl.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC] introduce ioctl to completely invalidate page cache
@ 2014-10-06  8:06     ` Jan Kara
  0 siblings, 0 replies; 16+ messages in thread
From: Jan Kara @ 2014-10-06  8:06 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Thanos Makatos, linux-fsdevel, linux-kernel, linux-api, jlayton,
	bfields, jack

On Thu 02-10-14 13:59:40, Jens Axboe wrote:
> On 10/02/2014 10:09 AM, Thanos Makatos wrote:
> > This patch introduces a new ioctl called BLKFLUSHBUFS2, which is pretty
> > similar to BLKFLUSHBUFS except that is also invalidates the page cache.
> > This allows for a complete invalidation of the cached data of a
> > particular block device, which might be useful for cases like
> > synchronising the caches of an iSCSI block device used by multiple
> > hosts.
> > 
> > Signed-off-by: Thanos Makatos <thanos.makatos@citrix.com>
> > ---
> >  block/compat_ioctl.c    |    1 +
> >  block/ioctl.c           |   13 +++++++++++--
> >  include/uapi/linux/fs.h |    1 +
> >  3 files changed, 13 insertions(+), 2 deletions(-)
> > 
> > diff --git a/block/compat_ioctl.c b/block/compat_ioctl.c
> > index 18b282c..672388ab 100644
> > --- a/block/compat_ioctl.c
> > +++ b/block/compat_ioctl.c
> > @@ -688,6 +688,7 @@ long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg)
> >  	case BLKDISCARDZEROES:
> >  		return compat_put_uint(arg, bdev_discard_zeroes_data(bdev));
> >  	case BLKFLSBUF:
> > +	case BLKFLSBUF2:
> >  	case BLKROSET:
> >  	case BLKDISCARD:
> >  	case BLKSECDISCARD:
> > diff --git a/block/ioctl.c b/block/ioctl.c
> > index d6cda81..0c427a7 100644
> > --- a/block/ioctl.c
> > +++ b/block/ioctl.c
> > @@ -268,6 +268,12 @@ static inline int is_unrecognized_ioctl(int ret)
> >  		ret == -ENOIOCTLCMD;
> >  }
> >  
> > +static void flush_buffer_cache(struct block_device *bdev)
> > +{
> > +	fsync_bdev(bdev);
> > +	invalidate_bdev(bdev);
> > +}
> > +
> >  /*
> >   * always keep this in sync with compat_blkdev_ioctl()
> >   */
> > @@ -282,6 +288,7 @@ int blkdev_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd,
> >  
> >  	switch(cmd) {
> >  	case BLKFLSBUF:
> > +	case BLKFLSBUF2:
> >  		if (!capable(CAP_SYS_ADMIN))
> >  			return -EACCES;
> >  
> > @@ -289,8 +296,10 @@ int blkdev_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd,
> >  		if (!is_unrecognized_ioctl(ret))
> >  			return ret;
> >  
> > -		fsync_bdev(bdev);
> > -		invalidate_bdev(bdev);
> > +		flush_buffer_cache(bdev);
> > +		if (BLKFLSBUF2 == cmd)
> > +			return invalidate_inode_pages2(
> > +					bdev->bd_inode->i_mapping);
> >  		return 0;
> 
> We're currently ignoring the buffer cache sync and invalidation (which
> is odd), but at least being consistent would be good.
  Well, invalidate_bdev() doesn't return anything. And
invalidate_mapping_pages() inside invalidate_bdev() returns only number of
invalidated pages. I don't think there's any value in returning that.

OTOH invalidate_inode_pages2() returns 0 / -EBUSY / other error when
invalidation of some page fails so returning that seems useful.

> Might also need a filemap_write_and_wait() to sync before invalidation.
  That's what fsync_bdev() is doing under the hoods. Sometimes I'm not
sure whether all these wrappers are useful...

Trond also had a comment that if we extended the ioctl to work for all
inodes (not just blkdev) and allowed some additional flags of what needs to
be invalidated, the new ioctl would be also useful to NFS userspace - see
Trond's email at

http://www.spinics.net/lists/linux-fsdevel/msg78917.html

and the following thread. I would prefer to cover that usecase when we are
introducing new invalidation ioctl. Have you considered that Thanos?

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC] introduce ioctl to completely invalidate page cache
@ 2014-10-06  8:06     ` Jan Kara
  0 siblings, 0 replies; 16+ messages in thread
From: Jan Kara @ 2014-10-06  8:06 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Thanos Makatos, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, jlayton-vpEMnDpepFuMZCB2o+C8xQ,
	bfields-uC3wQj2KruNg9hUCZPvPmw, jack-AlSwsSmVLrQ

On Thu 02-10-14 13:59:40, Jens Axboe wrote:
> On 10/02/2014 10:09 AM, Thanos Makatos wrote:
> > This patch introduces a new ioctl called BLKFLUSHBUFS2, which is pretty
> > similar to BLKFLUSHBUFS except that is also invalidates the page cache.
> > This allows for a complete invalidation of the cached data of a
> > particular block device, which might be useful for cases like
> > synchronising the caches of an iSCSI block device used by multiple
> > hosts.
> > 
> > Signed-off-by: Thanos Makatos <thanos.makatos-Sxgqhf6Nn4DQT0dZR+AlfA@public.gmane.org>
> > ---
> >  block/compat_ioctl.c    |    1 +
> >  block/ioctl.c           |   13 +++++++++++--
> >  include/uapi/linux/fs.h |    1 +
> >  3 files changed, 13 insertions(+), 2 deletions(-)
> > 
> > diff --git a/block/compat_ioctl.c b/block/compat_ioctl.c
> > index 18b282c..672388ab 100644
> > --- a/block/compat_ioctl.c
> > +++ b/block/compat_ioctl.c
> > @@ -688,6 +688,7 @@ long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg)
> >  	case BLKDISCARDZEROES:
> >  		return compat_put_uint(arg, bdev_discard_zeroes_data(bdev));
> >  	case BLKFLSBUF:
> > +	case BLKFLSBUF2:
> >  	case BLKROSET:
> >  	case BLKDISCARD:
> >  	case BLKSECDISCARD:
> > diff --git a/block/ioctl.c b/block/ioctl.c
> > index d6cda81..0c427a7 100644
> > --- a/block/ioctl.c
> > +++ b/block/ioctl.c
> > @@ -268,6 +268,12 @@ static inline int is_unrecognized_ioctl(int ret)
> >  		ret == -ENOIOCTLCMD;
> >  }
> >  
> > +static void flush_buffer_cache(struct block_device *bdev)
> > +{
> > +	fsync_bdev(bdev);
> > +	invalidate_bdev(bdev);
> > +}
> > +
> >  /*
> >   * always keep this in sync with compat_blkdev_ioctl()
> >   */
> > @@ -282,6 +288,7 @@ int blkdev_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd,
> >  
> >  	switch(cmd) {
> >  	case BLKFLSBUF:
> > +	case BLKFLSBUF2:
> >  		if (!capable(CAP_SYS_ADMIN))
> >  			return -EACCES;
> >  
> > @@ -289,8 +296,10 @@ int blkdev_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd,
> >  		if (!is_unrecognized_ioctl(ret))
> >  			return ret;
> >  
> > -		fsync_bdev(bdev);
> > -		invalidate_bdev(bdev);
> > +		flush_buffer_cache(bdev);
> > +		if (BLKFLSBUF2 == cmd)
> > +			return invalidate_inode_pages2(
> > +					bdev->bd_inode->i_mapping);
> >  		return 0;
> 
> We're currently ignoring the buffer cache sync and invalidation (which
> is odd), but at least being consistent would be good.
  Well, invalidate_bdev() doesn't return anything. And
invalidate_mapping_pages() inside invalidate_bdev() returns only number of
invalidated pages. I don't think there's any value in returning that.

OTOH invalidate_inode_pages2() returns 0 / -EBUSY / other error when
invalidation of some page fails so returning that seems useful.

> Might also need a filemap_write_and_wait() to sync before invalidation.
  That's what fsync_bdev() is doing under the hoods. Sometimes I'm not
sure whether all these wrappers are useful...

Trond also had a comment that if we extended the ioctl to work for all
inodes (not just blkdev) and allowed some additional flags of what needs to
be invalidated, the new ioctl would be also useful to NFS userspace - see
Trond's email at

http://www.spinics.net/lists/linux-fsdevel/msg78917.html

and the following thread. I would prefer to cover that usecase when we are
introducing new invalidation ioctl. Have you considered that Thanos?

								Honza
-- 
Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: [PATCH RFC] introduce ioctl to completely invalidate page cache
  2014-10-06  8:06     ` Jan Kara
  (?)
@ 2014-10-06  9:21     ` Thanos Makatos
  2014-10-06 11:33       ` Thanos Makatos
  -1 siblings, 1 reply; 16+ messages in thread
From: Thanos Makatos @ 2014-10-06  9:21 UTC (permalink / raw)
  To: 'Jan Kara', Jens Axboe
  Cc: linux-fsdevel, linux-kernel, linux-api, jlayton, bfields

> > We're currently ignoring the buffer cache sync and invalidation (which
> > is odd), but at least being consistent would be good.
>   Well, invalidate_bdev() doesn't return anything. And
> invalidate_mapping_pages() inside invalidate_bdev() returns only number of
> invalidated pages. I don't think there's any value in returning that.
> 
> OTOH invalidate_inode_pages2() returns 0 / -EBUSY / other error when
> invalidation of some page fails so returning that seems useful.
> 
> > Might also need a filemap_write_and_wait() to sync before invalidation.
>   That's what fsync_bdev() is doing under the hoods. Sometimes I'm not sure
> whether all these wrappers are useful...

Indeed, fsync_bdev() does call filemap_write_and_wait() so I don't need to explicitly do that.

> 
> Trond also had a comment that if we extended the ioctl to work for all inodes
> (not just blkdev) and allowed some additional flags of what needs to be
> invalidated, the new ioctl would be also useful to NFS userspace - see Trond's
> email at
> 
> http://www.spinics.net/lists/linux-fsdevel/msg78917.html
> 
> and the following thread. I would prefer to cover that usecase when we are
> introducing new invalidation ioctl. Have you considered that Thanos?

Sure, though I don't really know how to do it. I'll start by looking at the code flow when someone does " echo 3 > /proc/sys/vm/drop_caches", unless you already have a rough idea how to do that.

Thanks

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: [PATCH RFC] introduce ioctl to completely invalidate page cache
  2014-10-06  9:21     ` Thanos Makatos
@ 2014-10-06 11:33       ` Thanos Makatos
  2014-10-06 14:30         ` Jan Kara
  0 siblings, 1 reply; 16+ messages in thread
From: Thanos Makatos @ 2014-10-06 11:33 UTC (permalink / raw)
  To: Thanos Makatos, 'Jan Kara', Jens Axboe
  Cc: linux-fsdevel, linux-kernel, linux-api, jlayton, bfields

> > Trond also had a comment that if we extended the ioctl to work for all
> > inodes (not just blkdev) and allowed some additional flags of what
> > needs to be invalidated, the new ioctl would be also useful to NFS
> > userspace - see Trond's email at
> >
> > http://www.spinics.net/lists/linux-fsdevel/msg78917.html
> >
> > and the following thread. I would prefer to cover that usecase when we
> > are introducing new invalidation ioctl. Have you considered that Thanos?
> 
> Sure, though I don't really know how to do it. I'll start by looking at the code
> flow when someone does " echo 3 > /proc/sys/vm/drop_caches", unless you
> already have a rough idea how to do that.

I realise I haven't clearly understood what the semantics of this new ioctl
should be.

My initial goal was to implement an ioctl that would _completely_ invalidate
the buffer cache of a block device when there is no file-system involved.
Unless I'm mistaken the patch I posted achieves this goal.

We now want to extend this patch to take care of cached metadata, which seems
to be of particular importance for NFS, and I suspect that this piece of
functionality will still be applicable to any kind of file-system, correct?

Do we want this new ioctl to do what "echo 3 > /proc/sys/vm/drop_caches" does
but on a more selective basis, which IIUC drops whatever can be dropped, but
may not drop everything? If so, then we should more precisely define this ioctl
as "drop *all* cached data and as much metadata you can, which may not be all
of them". If this is the case, would adding a call to drop_pagecache_sb() on
all super blocks whose .s_bdev equals the block device we're interested in
suffice?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC] introduce ioctl to completely invalidate page cache
  2014-10-06 11:33       ` Thanos Makatos
@ 2014-10-06 14:30         ` Jan Kara
  2014-10-06 15:21           ` Thanos Makatos
  2014-10-07  1:30           ` Dave Chinner
  0 siblings, 2 replies; 16+ messages in thread
From: Jan Kara @ 2014-10-06 14:30 UTC (permalink / raw)
  To: Thanos Makatos
  Cc: 'Jan Kara',
	Jens Axboe, linux-fsdevel, linux-kernel, linux-api, jlayton,
	bfields

On Mon 06-10-14 11:33:23, Thanos Makatos wrote:
> > > Trond also had a comment that if we extended the ioctl to work for all
> > > inodes (not just blkdev) and allowed some additional flags of what
> > > needs to be invalidated, the new ioctl would be also useful to NFS
> > > userspace - see Trond's email at
> > >
> > > http://www.spinics.net/lists/linux-fsdevel/msg78917.html
> > >
> > > and the following thread. I would prefer to cover that usecase when we
> > > are introducing new invalidation ioctl. Have you considered that Thanos?
> > 
> > Sure, though I don't really know how to do it. I'll start by looking at the code
> > flow when someone does " echo 3 > /proc/sys/vm/drop_caches", unless you
> > already have a rough idea how to do that.
> 
> I realise I haven't clearly understood what the semantics of this new ioctl
> should be.
> 
> My initial goal was to implement an ioctl that would _completely_ invalidate
> the buffer cache of a block device when there is no file-system involved.
> Unless I'm mistaken the patch I posted achieves this goal.
  Yes.

> We now want to extend this patch to take care of cached metadata, which seems
> to be of particular importance for NFS, and I suspect that this piece of
> functionality will still be applicable to any kind of file-system, correct?
  So most notably they want the ioctl to work not only for block devices
but also for any regular file. That's easily doable - you just call
filemap_write_and_wait() and invalidate_inode_pages2() in the ioctl handler
for regular files.

Also they wanted to be able to specify a range of a mapping to invalidate -
that's easily doable as well. Finally they wanted a 'flags' argument so you
can additionally ask fs to invalidate also some metadata. How invalidation
is done will be a fs specific thing and for now I guess we don't need to go
into details. NFS guys can sort that out when they decide to implement it.
So in the beginning we can just have u64 flags argument and in
it a single 'INVAL_DATA' flag meaning that invalidation of data in a given
range is requested. Later NFS guys can add further flags.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: [PATCH RFC] introduce ioctl to completely invalidate page cache
  2014-10-06 14:30         ` Jan Kara
@ 2014-10-06 15:21           ` Thanos Makatos
  2014-10-07  1:30           ` Dave Chinner
  1 sibling, 0 replies; 16+ messages in thread
From: Thanos Makatos @ 2014-10-06 15:21 UTC (permalink / raw)
  To: 'Jan Kara'
  Cc: Jens Axboe, linux-fsdevel, linux-kernel, linux-api, jlayton, bfields

Thanks, Jan, it's much clearer now what I need to do.

>   So most notably they want the ioctl to work not only for block devices but also for any regular file. That's easily doable - you just call
> filemap_write_and_wait() and invalidate_inode_pages2() in the ioctl handler for regular files.

All right, so I need to find out how I can direct the new ioctl to
file-systems as well. I thought that I could get away with it by simply looking
at which file-systems have the same block device as the one to which the ioctl
is directed, but IIUC this doesn't make sense as NFS doesn't use a block device
at all.

> Also they wanted to be able to specify a range of a mapping to invalidate -
> that's easily doable as well. Finally they wanted a 'flags' argument so you can
> additionally ask fs to invalidate also some metadata. How invalidation is done
> will be a fs specific thing and for now I guess we don't need to go into
> details. NFS guys can sort that out when they decide to implement it.

So after I've figured out how to direct this new ioctl to a file-system, I need
to understand out how to invalidate a specific range of data. I will gracefully
fail metadata invalidation operations with EOPNOTSUPP or ENOSYS.

> So in the beginning we can just have u64 flags argument and in it a single
> 'INVAL_DATA' flag meaning that invalidation of data in a given range is
> requested. Later NFS guys can add further flags.

OK that I can do.

I suppose we'll always fail metadata invalidation operations when the target
of the ioctl is a block device.

Thanks

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC] introduce ioctl to completely invalidate page cache
  2014-10-06 14:30         ` Jan Kara
  2014-10-06 15:21           ` Thanos Makatos
@ 2014-10-07  1:30           ` Dave Chinner
  2014-10-07 19:16               ` Jan Kara
  1 sibling, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2014-10-07  1:30 UTC (permalink / raw)
  To: Jan Kara
  Cc: Thanos Makatos, Jens Axboe, linux-fsdevel, linux-kernel,
	linux-api, jlayton, bfields

On Mon, Oct 06, 2014 at 04:30:19PM +0200, Jan Kara wrote:
> On Mon 06-10-14 11:33:23, Thanos Makatos wrote:
> > > > Trond also had a comment that if we extended the ioctl to work for all
> > > > inodes (not just blkdev) and allowed some additional flags of what
> > > > needs to be invalidated, the new ioctl would be also useful to NFS
> > > > userspace - see Trond's email at
> > > >
> > > > http://www.spinics.net/lists/linux-fsdevel/msg78917.html
> > > >
> > > > and the following thread. I would prefer to cover that usecase when we
> > > > are introducing new invalidation ioctl. Have you considered that Thanos?
> > > 
> > > Sure, though I don't really know how to do it. I'll start by looking at the code
> > > flow when someone does " echo 3 > /proc/sys/vm/drop_caches", unless you
> > > already have a rough idea how to do that.
> > 
> > I realise I haven't clearly understood what the semantics of this new ioctl
> > should be.
> > 
> > My initial goal was to implement an ioctl that would _completely_ invalidate
> > the buffer cache of a block device when there is no file-system involved.
> > Unless I'm mistaken the patch I posted achieves this goal.
>   Yes.
> 
> > We now want to extend this patch to take care of cached metadata, which seems
> > to be of particular importance for NFS, and I suspect that this piece of
> > functionality will still be applicable to any kind of file-system, correct?
>   So most notably they want the ioctl to work not only for block devices
> but also for any regular file. That's easily doable - you just call
> filemap_write_and_wait() and invalidate_inode_pages2() in the ioctl handler
> for regular files.
> 
> Also they wanted to be able to specify a range of a mapping to invalidate -
> that's easily doable as well. Finally they wanted a 'flags' argument so you
> can additionally ask fs to invalidate also some metadata. How invalidation
> is done will be a fs specific thing and for now I guess we don't need to go
> into details. NFS guys can sort that out when they decide to implement it.
> So in the beginning we can just have u64 flags argument and in
> it a single 'INVAL_DATA' flag meaning that invalidation of data in a given
> range is requested. Later NFS guys can add further flags.

Why do we need a new ioctl to do this? fadvise64() seems like it's
the exact fit for "FADV_INVALIDATE_[META]DATA" flags...

And before anyone shouts "posix_fadvise sucks!" note that I'm
talking about adding flags to the syscall that the kernel defines,
not the glibc posix wrapper....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC] introduce ioctl to completely invalidate page cache
  2014-10-07  1:30           ` Dave Chinner
@ 2014-10-07 19:16               ` Jan Kara
  0 siblings, 0 replies; 16+ messages in thread
From: Jan Kara @ 2014-10-07 19:16 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Thanos Makatos, Jens Axboe, linux-fsdevel,
	linux-kernel, linux-api, jlayton, bfields

On Tue 07-10-14 12:30:59, Dave Chinner wrote:
> On Mon, Oct 06, 2014 at 04:30:19PM +0200, Jan Kara wrote:
> > On Mon 06-10-14 11:33:23, Thanos Makatos wrote:
> > > > > Trond also had a comment that if we extended the ioctl to work for all
> > > > > inodes (not just blkdev) and allowed some additional flags of what
> > > > > needs to be invalidated, the new ioctl would be also useful to NFS
> > > > > userspace - see Trond's email at
> > > > >
> > > > > http://www.spinics.net/lists/linux-fsdevel/msg78917.html
> > > > >
> > > > > and the following thread. I would prefer to cover that usecase when we
> > > > > are introducing new invalidation ioctl. Have you considered that Thanos?
> > > > 
> > > > Sure, though I don't really know how to do it. I'll start by looking at the code
> > > > flow when someone does " echo 3 > /proc/sys/vm/drop_caches", unless you
> > > > already have a rough idea how to do that.
> > > 
> > > I realise I haven't clearly understood what the semantics of this new ioctl
> > > should be.
> > > 
> > > My initial goal was to implement an ioctl that would _completely_ invalidate
> > > the buffer cache of a block device when there is no file-system involved.
> > > Unless I'm mistaken the patch I posted achieves this goal.
> >   Yes.
> > 
> > > We now want to extend this patch to take care of cached metadata, which seems
> > > to be of particular importance for NFS, and I suspect that this piece of
> > > functionality will still be applicable to any kind of file-system, correct?
> >   So most notably they want the ioctl to work not only for block devices
> > but also for any regular file. That's easily doable - you just call
> > filemap_write_and_wait() and invalidate_inode_pages2() in the ioctl handler
> > for regular files.
> > 
> > Also they wanted to be able to specify a range of a mapping to invalidate -
> > that's easily doable as well. Finally they wanted a 'flags' argument so you
> > can additionally ask fs to invalidate also some metadata. How invalidation
> > is done will be a fs specific thing and for now I guess we don't need to go
> > into details. NFS guys can sort that out when they decide to implement it.
> > So in the beginning we can just have u64 flags argument and in
> > it a single 'INVAL_DATA' flag meaning that invalidation of data in a given
> > range is requested. Later NFS guys can add further flags.
> 
> Why do we need a new ioctl to do this? fadvise64() seems like it's
> the exact fit for "FADV_INVALIDATE_[META]DATA" flags...
  Well, fadvise() is currently a hint to kernel. In this case we would
really like the call to do the invalidation and return error if it fails
for some reason. So I'm not sure fadvise() is a perfect fit. But I wouldn't
be strongly opposed to it either.
								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC] introduce ioctl to completely invalidate page cache
@ 2014-10-07 19:16               ` Jan Kara
  0 siblings, 0 replies; 16+ messages in thread
From: Jan Kara @ 2014-10-07 19:16 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Thanos Makatos, Jens Axboe,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, jlayton-vpEMnDpepFuMZCB2o+C8xQ,
	bfields-uC3wQj2KruNg9hUCZPvPmw

On Tue 07-10-14 12:30:59, Dave Chinner wrote:
> On Mon, Oct 06, 2014 at 04:30:19PM +0200, Jan Kara wrote:
> > On Mon 06-10-14 11:33:23, Thanos Makatos wrote:
> > > > > Trond also had a comment that if we extended the ioctl to work for all
> > > > > inodes (not just blkdev) and allowed some additional flags of what
> > > > > needs to be invalidated, the new ioctl would be also useful to NFS
> > > > > userspace - see Trond's email at
> > > > >
> > > > > http://www.spinics.net/lists/linux-fsdevel/msg78917.html
> > > > >
> > > > > and the following thread. I would prefer to cover that usecase when we
> > > > > are introducing new invalidation ioctl. Have you considered that Thanos?
> > > > 
> > > > Sure, though I don't really know how to do it. I'll start by looking at the code
> > > > flow when someone does " echo 3 > /proc/sys/vm/drop_caches", unless you
> > > > already have a rough idea how to do that.
> > > 
> > > I realise I haven't clearly understood what the semantics of this new ioctl
> > > should be.
> > > 
> > > My initial goal was to implement an ioctl that would _completely_ invalidate
> > > the buffer cache of a block device when there is no file-system involved.
> > > Unless I'm mistaken the patch I posted achieves this goal.
> >   Yes.
> > 
> > > We now want to extend this patch to take care of cached metadata, which seems
> > > to be of particular importance for NFS, and I suspect that this piece of
> > > functionality will still be applicable to any kind of file-system, correct?
> >   So most notably they want the ioctl to work not only for block devices
> > but also for any regular file. That's easily doable - you just call
> > filemap_write_and_wait() and invalidate_inode_pages2() in the ioctl handler
> > for regular files.
> > 
> > Also they wanted to be able to specify a range of a mapping to invalidate -
> > that's easily doable as well. Finally they wanted a 'flags' argument so you
> > can additionally ask fs to invalidate also some metadata. How invalidation
> > is done will be a fs specific thing and for now I guess we don't need to go
> > into details. NFS guys can sort that out when they decide to implement it.
> > So in the beginning we can just have u64 flags argument and in
> > it a single 'INVAL_DATA' flag meaning that invalidation of data in a given
> > range is requested. Later NFS guys can add further flags.
> 
> Why do we need a new ioctl to do this? fadvise64() seems like it's
> the exact fit for "FADV_INVALIDATE_[META]DATA" flags...
  Well, fadvise() is currently a hint to kernel. In this case we would
really like the call to do the invalidation and return error if it fails
for some reason. So I'm not sure fadvise() is a perfect fit. But I wouldn't
be strongly opposed to it either.
								Honza
-- 
Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC] introduce ioctl to completely invalidate page cache
  2014-10-07 19:16               ` Jan Kara
  (?)
@ 2014-10-07 19:35               ` Trond Myklebust
  -1 siblings, 0 replies; 16+ messages in thread
From: Trond Myklebust @ 2014-10-07 19:35 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Chinner, Thanos Makatos, Jens Axboe, linux-fsdevel,
	linux-kernel, linux-api, jlayton, bfields

On Tue, Oct 7, 2014 at 3:16 PM, Jan Kara <jack@suse.cz> wrote:
> On Tue 07-10-14 12:30:59, Dave Chinner wrote:
>> On Mon, Oct 06, 2014 at 04:30:19PM +0200, Jan Kara wrote:
>> > On Mon 06-10-14 11:33:23, Thanos Makatos wrote:
>> > > > > Trond also had a comment that if we extended the ioctl to work for all
>> > > > > inodes (not just blkdev) and allowed some additional flags of what
>> > > > > needs to be invalidated, the new ioctl would be also useful to NFS
>> > > > > userspace - see Trond's email at
>> > > > >
>> > > > > http://www.spinics.net/lists/linux-fsdevel/msg78917.html
>> > > > >
>> > > > > and the following thread. I would prefer to cover that usecase when we
>> > > > > are introducing new invalidation ioctl. Have you considered that Thanos?
>> > > >
>> > > > Sure, though I don't really know how to do it. I'll start by looking at the code
>> > > > flow when someone does " echo 3 > /proc/sys/vm/drop_caches", unless you
>> > > > already have a rough idea how to do that.
>> > >
>> > > I realise I haven't clearly understood what the semantics of this new ioctl
>> > > should be.
>> > >
>> > > My initial goal was to implement an ioctl that would _completely_ invalidate
>> > > the buffer cache of a block device when there is no file-system involved.
>> > > Unless I'm mistaken the patch I posted achieves this goal.
>> >   Yes.
>> >
>> > > We now want to extend this patch to take care of cached metadata, which seems
>> > > to be of particular importance for NFS, and I suspect that this piece of
>> > > functionality will still be applicable to any kind of file-system, correct?
>> >   So most notably they want the ioctl to work not only for block devices
>> > but also for any regular file. That's easily doable - you just call
>> > filemap_write_and_wait() and invalidate_inode_pages2() in the ioctl handler
>> > for regular files.
>> >
>> > Also they wanted to be able to specify a range of a mapping to invalidate -
>> > that's easily doable as well. Finally they wanted a 'flags' argument so you
>> > can additionally ask fs to invalidate also some metadata. How invalidation
>> > is done will be a fs specific thing and for now I guess we don't need to go
>> > into details. NFS guys can sort that out when they decide to implement it.
>> > So in the beginning we can just have u64 flags argument and in
>> > it a single 'INVAL_DATA' flag meaning that invalidation of data in a given
>> > range is requested. Later NFS guys can add further flags.
>>
>> Why do we need a new ioctl to do this? fadvise64() seems like it's
>> the exact fit for "FADV_INVALIDATE_[META]DATA" flags...
>   Well, fadvise() is currently a hint to kernel. In this case we would
> really like the call to do the invalidation and return error if it fails
> for some reason. So I'm not sure fadvise() is a perfect fit. But I wouldn't
> be strongly opposed to it either.
>

fadvise is about giving programs the ability to "announce an intention
to access file data in a specific pattern in the future, thus allowing
the kernel to perform appropriate optimizations" according to the
manpage.

Cache invalidation and revalidation, OTOH, is about ensuring meta/data
consistency between the disk and inode/page cache.

I'm not seeing a perfect match. :-)

-- 
Trond Myklebust

Linux NFS client maintainer, PrimaryData

trond.myklebust@primarydata.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2014-10-07 19:35 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1412266184-23776-1-git-send-email-thanos.makatos@citrix.com>
2014-10-02 19:59 ` [PATCH RFC] introduce ioctl to completely invalidate page cache Jens Axboe
2014-10-03  5:27   ` Dave Chinner
2014-10-03  9:00     ` Thanos Makatos
2014-10-03  9:25   ` Thanos Makatos
2014-10-03  9:25     ` Thanos Makatos
2014-10-03 14:28     ` Jens Axboe
2014-10-06  8:06   ` Jan Kara
2014-10-06  8:06     ` Jan Kara
2014-10-06  9:21     ` Thanos Makatos
2014-10-06 11:33       ` Thanos Makatos
2014-10-06 14:30         ` Jan Kara
2014-10-06 15:21           ` Thanos Makatos
2014-10-07  1:30           ` Dave Chinner
2014-10-07 19:16             ` Jan Kara
2014-10-07 19:16               ` Jan Kara
2014-10-07 19:35               ` Trond Myklebust

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.