[PATCH] mm/filemap: do not allocate cache pages beyond end of file at read

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] mm/filemap: do not allocate cache pages beyond end of file at read
@ 2019-10-28  9:59 Konstantin Khlebnikov
  2019-10-28 12:39 ` Linus Torvalds
  2019-10-28 12:42 ` Kirill A. Shutemov
  0 siblings, 2 replies; 18+ messages in thread
From: Konstantin Khlebnikov @ 2019-10-28  9:59 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, linux-kernel
  Cc: linux-fsdevel, Linus Torvalds, Alexander Viro, Johannes Weiner

Page cache could contain pages beyond end of file during write or
if read races with truncate. But generic_file_buffered_read() always
allocates unneeded pages beyond eof if somebody reads here and one
extra page at the end if file size is page-aligned.

Function generic_file_buffered_read() calls page_cache_sync_readahead()
if page not found in cache and then do another lookup. Readahead checks
file size in __do_page_cache_readahead() before allocating pages.
After that generic_file_buffered_read() falls back to slow path and
allocates page for ->readpage() without checking file size.

This patch checks file size before allocating page for ->readpage().

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 mm/filemap.c |    4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 85b7d087eb45..92abf5f348a9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2225,6 +2225,10 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 		goto out;

 no_cached_page:
+		/* Do not allocate cache pages beyond end of file. */
+		if (((loff_t)index << PAGE_SHIFT) >= i_size_read(inode))
+			goto out;
+
 		/*
 		 * Ok, it wasn't cached, so we need to create a new
 		 * page..

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] mm/filemap: do not allocate cache pages beyond end of file at read
  2019-10-28  9:59 [PATCH] mm/filemap: do not allocate cache pages beyond end of file at read Konstantin Khlebnikov
@ 2019-10-28 12:39 ` Linus Torvalds
  2019-10-28 12:42 ` Kirill A. Shutemov
  1 sibling, 0 replies; 18+ messages in thread
From: Linus Torvalds @ 2019-10-28 12:39 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Linux-MM, Andrew Morton, Linux Kernel Mailing List,
	linux-fsdevel, Alexander Viro, Johannes Weiner

On Mon, Oct 28, 2019 at 10:59 AM Konstantin Khlebnikov
<khlebnikov@yandex-team.ru> wrote:
>
> Page cache could contain pages beyond end of file during write or
> if read races with truncate. But generic_file_buffered_read() always
> allocates unneeded pages beyond eof if somebody reads here and one
> extra page at the end if file size is page-aligned.

I wonder if we could just do something like this instead:

  diff --git a/mm/filemap.c b/mm/filemap.c
  index 85b7d087eb45..80b08433c93a 100644
  --- a/mm/filemap.c
  +++ b/mm/filemap.c
  @@ -2013,7 +2013,7 @@ static ssize_t generic_file_buffered_read(
        struct address_space *mapping = filp->f_mapping;
        struct inode *inode = mapping->host;
        struct file_ra_state *ra = &filp->f_ra;
  -     loff_t *ppos = &iocb->ki_pos;
  +     loff_t *ppos = &iocb->ki_pos, size;
        pgoff_t index;
        pgoff_t last_index;
        pgoff_t prev_index;
  @@ -2021,9 +2021,10 @@ static ssize_t generic_file_buffered_read(
        unsigned int prev_offset;
        int error = 0;

  -     if (unlikely(*ppos >= inode->i_sb->s_maxbytes))
  +     size = i_size_read(inode);
  +     if (unlikely(*ppos >= size))
                return 0;
  -     iov_iter_truncate(iter, inode->i_sb->s_maxbytes);
  +     iov_iter_truncate(iter, size);

        index = *ppos >> PAGE_SHIFT;
        prev_index = ra->prev_pos >> PAGE_SHIFT;

and yes, we still need to re-check the inode size after we've read the
page cache page (since it might have changed during the IO), but the
above seems fairly benign and simple.

Hmm?

              Linus

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] mm/filemap: do not allocate cache pages beyond end of file at read
  2019-10-28  9:59 [PATCH] mm/filemap: do not allocate cache pages beyond end of file at read Konstantin Khlebnikov
  2019-10-28 12:39 ` Linus Torvalds
@ 2019-10-28 12:42 ` Kirill A. Shutemov
  2019-10-28 12:47   ` Linus Torvalds
  1 sibling, 1 reply; 18+ messages in thread
From: Kirill A. Shutemov @ 2019-10-28 12:42 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-mm, Andrew Morton, linux-kernel, linux-fsdevel,
	Linus Torvalds, Alexander Viro, Johannes Weiner,
	Steven Whitehouse

On Mon, Oct 28, 2019 at 12:59:34PM +0300, Konstantin Khlebnikov wrote:
> Page cache could contain pages beyond end of file during write or
> if read races with truncate. But generic_file_buffered_read() always
> allocates unneeded pages beyond eof if somebody reads here and one
> extra page at the end if file size is page-aligned.
> 
> Function generic_file_buffered_read() calls page_cache_sync_readahead()
> if page not found in cache and then do another lookup. Readahead checks
> file size in __do_page_cache_readahead() before allocating pages.
> After that generic_file_buffered_read() falls back to slow path and
> allocates page for ->readpage() without checking file size.
> 
> This patch checks file size before allocating page for ->readpage().
> 
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> ---
>  mm/filemap.c |    4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 85b7d087eb45..92abf5f348a9 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2225,6 +2225,10 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
>  		goto out;
>  
>  no_cached_page:
> +		/* Do not allocate cache pages beyond end of file. */
> +		if (((loff_t)index << PAGE_SHIFT) >= i_size_read(inode))
> +			goto out;
> +
>  		/*
>  		 * Ok, it wasn't cached, so we need to create a new
>  		 * page..
> 
> 

CC Steven.

I've tried something of this sort back in 2013:

http://lore.kernel.org/r/1377099441-2224-1-git-send-email-kirill.shutemov@linux.intel.com

and I've got push back.

Apparently, some filesystems may not have valid i_size before >readpage().
Not sure if it's still the case...

Anyway I don't think it's valid reason for this inefficiency. These
filesystems have to have own implementation of >read_iter() to deal with
this.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] mm/filemap: do not allocate cache pages beyond end of file at read
  2019-10-28 12:42 ` Kirill A. Shutemov
@ 2019-10-28 12:47   ` Linus Torvalds
  2019-10-28 12:57     ` Kirill A. Shutemov
  0 siblings, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2019-10-28 12:47 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Konstantin Khlebnikov, Linux-MM, Andrew Morton,
	Linux Kernel Mailing List, linux-fsdevel, Alexander Viro,
	Johannes Weiner, Steven Whitehouse

On Mon, Oct 28, 2019 at 1:42 PM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> I've tried something of this sort back in 2013:
>
> http://lore.kernel.org/r/1377099441-2224-1-git-send-email-kirill.shutemov@linux.intel.com
>
> and I've got push back.
>
> Apparently, some filesystems may not have valid i_size before >readpage().
> Not sure if it's still the case...

Well, I agree that there might be some network filesystem that might
have inode sizes that are stale, but if that's the case then I don't
think your previous patch works either.

It too will avoid the readpage() if the read position is beyond i_size.

No?

               Linus

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] mm/filemap: do not allocate cache pages beyond end of file at read
  2019-10-28 12:47   ` Linus Torvalds
@ 2019-10-28 12:57     ` Kirill A. Shutemov
  2019-10-29 14:25       ` Konstantin Khlebnikov
  0 siblings, 1 reply; 18+ messages in thread
From: Kirill A. Shutemov @ 2019-10-28 12:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Konstantin Khlebnikov, Linux-MM, Andrew Morton,
	Linux Kernel Mailing List, linux-fsdevel, Alexander Viro,
	Johannes Weiner, Steven Whitehouse

On Mon, Oct 28, 2019 at 01:47:16PM +0100, Linus Torvalds wrote:
> On Mon, Oct 28, 2019 at 1:42 PM Kirill A. Shutemov <kirill@shutemov.name> wrote:
> >
> > I've tried something of this sort back in 2013:
> >
> > http://lore.kernel.org/r/1377099441-2224-1-git-send-email-kirill.shutemov@linux.intel.com
> >
> > and I've got push back.
> >
> > Apparently, some filesystems may not have valid i_size before >readpage().
> > Not sure if it's still the case...
> 
> Well, I agree that there might be some network filesystem that might
> have inode sizes that are stale, but if that's the case then I don't
> think your previous patch works either.
> 
> It too will avoid the readpage() if the read position is beyond i_size.
> 
> No?

Yes. That's the reason the patch was rejected back then.

My point is that we need to make sure that this patch not break anything.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] mm/filemap: do not allocate cache pages beyond end of file at read
  2019-10-28 12:57     ` Kirill A. Shutemov
@ 2019-10-29 14:25       ` Konstantin Khlebnikov
  2019-10-29 16:52         ` Linus Torvalds
  0 siblings, 1 reply; 18+ messages in thread
From: Konstantin Khlebnikov @ 2019-10-29 14:25 UTC (permalink / raw)
  To: Kirill A. Shutemov, Linus Torvalds
  Cc: Linux-MM, Andrew Morton, Linux Kernel Mailing List,
	linux-fsdevel, Alexander Viro, Johannes Weiner,
	Steven Whitehouse

On 28/10/2019 15.57, Kirill A. Shutemov wrote:
> On Mon, Oct 28, 2019 at 01:47:16PM +0100, Linus Torvalds wrote:
>> On Mon, Oct 28, 2019 at 1:42 PM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>>>
>>> I've tried something of this sort back in 2013:
>>>
>>> http://lore.kernel.org/r/1377099441-2224-1-git-send-email-kirill.shutemov@linux.intel.com
>>>
>>> and I've got push back.
>>>
>>> Apparently, some filesystems may not have valid i_size before >readpage().
>>> Not sure if it's still the case...
>>
>> Well, I agree that there might be some network filesystem that might
>> have inode sizes that are stale, but if that's the case then I don't
>> think your previous patch works either.
>>
>> It too will avoid the readpage() if the read position is beyond i_size.
>>
>> No?
> 
> Yes. That's the reason the patch was rejected back then.
> 
> My point is that we need to make sure that this patch not break anything.
> 

I think all network filesystems which synchronize metadata lazily should be
marked. For example as "SB_VOLATILE". And vfs could handle them specially.

For this case generic_file_buffered_read() could call for them readpages
for single page (rather than readpage) to let filesystem revalidate
metadata and drop unneeded page without inserting it into inode and lru.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] mm/filemap: do not allocate cache pages beyond end of file at read
  2019-10-29 14:25       ` Konstantin Khlebnikov
@ 2019-10-29 16:52         ` Linus Torvalds
  2019-10-30  6:50           ` Kirill A. Shutemov
  2019-10-30 10:34           ` Steven Whitehouse
  0 siblings, 2 replies; 18+ messages in thread
From: Linus Torvalds @ 2019-10-29 16:52 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Kirill A. Shutemov, Linux-MM, Andrew Morton,
	Linux Kernel Mailing List, linux-fsdevel, Alexander Viro,
	Johannes Weiner, Steven Whitehouse

On Tue, Oct 29, 2019 at 3:25 PM Konstantin Khlebnikov
<khlebnikov@yandex-team.ru> wrote:
>
> I think all network filesystems which synchronize metadata lazily should be
> marked. For example as "SB_VOLATILE". And vfs could handle them specially.

No need. The VFS layer doesn't call generic_file_buffered_read()
directly anyway. It's just a helper function for filesystems to use if
they want to.

They could (and should) make sure the inode size is sufficiently
up-to-date before calling it. And if they want something more
synchronous, they can do it themselves.

But NFS, for example, has open/close consistency, so the metadata
revalidation is at open() time, not at read time.

               Linus

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] mm/filemap: do not allocate cache pages beyond end of file at read
  2019-10-29 16:52         ` Linus Torvalds
@ 2019-10-30  6:50           ` Kirill A. Shutemov
  2019-10-30  7:02             ` Linus Torvalds
  2019-10-30 10:34           ` Steven Whitehouse
  1 sibling, 1 reply; 18+ messages in thread
From: Kirill A. Shutemov @ 2019-10-30  6:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Konstantin Khlebnikov, Linux-MM, Andrew Morton,
	Linux Kernel Mailing List, linux-fsdevel, Alexander Viro,
	Johannes Weiner, Steven Whitehouse

On Tue, Oct 29, 2019 at 05:52:05PM +0100, Linus Torvalds wrote:
> But NFS, for example, has open/close consistency, so the metadata
> revalidation is at open() time, not at read time.

I don't know much about filesystems, but can't size of file change after
the open() under network filesystem? Revlidation on read looks like an
requirement anyway, no?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] mm/filemap: do not allocate cache pages beyond end of file at read
  2019-10-30  6:50           ` Kirill A. Shutemov
@ 2019-10-30  7:02             ` Linus Torvalds
  0 siblings, 0 replies; 18+ messages in thread
From: Linus Torvalds @ 2019-10-30  7:02 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Konstantin Khlebnikov, Linux-MM, Andrew Morton,
	Linux Kernel Mailing List, linux-fsdevel, Alexander Viro,
	Johannes Weiner, Steven Whitehouse

On Wed, Oct 30, 2019 at 7:50 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> I don't know much about filesystems, but can't size of file change after
> the open() under network filesystem? Revlidation on read looks like an
> requirement anyway, no?

Requirement? No. But QoS issue, yes.

But note that NFS already does that. Look at nfs_file_read(), and
notice how it's not using generic_file_buffered_read() directly, it's
doing its own thing first with checking for direct-IO, but then doing
that nfs_revalidate_mapping() that checks whether caches should be
re-validated.

It's not just size of the file, the actual cached contents may need
invalidating too etc.

And note how the generic page cache reader doesn't need to care. If
what the generic code does isn't enough, or is the wrong thing, the
filesystem simply shouldn't use it, or, like NFS, do its own thing
first/last.

So I think the "some filesystems may have other rules" is irrelevant.
If they do have other rules, it's _their_ issue, not the issue of the
generic page cache read logic.

             Linus

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] mm/filemap: do not allocate cache pages beyond end of file at read
  2019-10-29 16:52         ` Linus Torvalds
  2019-10-30  6:50           ` Kirill A. Shutemov
@ 2019-10-30 10:34           ` Steven Whitehouse
  2019-10-30 10:54             ` Linus Torvalds
  1 sibling, 1 reply; 18+ messages in thread
From: Steven Whitehouse @ 2019-10-30 10:34 UTC (permalink / raw)
  To: Linus Torvalds, Konstantin Khlebnikov
  Cc: Kirill A. Shutemov, Linux-MM, Andrew Morton,
	Linux Kernel Mailing List, linux-fsdevel, Alexander Viro,
	Johannes Weiner, cluster-devel

Hi,

On 29/10/2019 16:52, Linus Torvalds wrote:
> On Tue, Oct 29, 2019 at 3:25 PM Konstantin Khlebnikov
> <khlebnikov@yandex-team.ru> wrote:
>> I think all network filesystems which synchronize metadata lazily should be
>> marked. For example as "SB_VOLATILE". And vfs could handle them specially.
> No need. The VFS layer doesn't call generic_file_buffered_read()
> directly anyway. It's just a helper function for filesystems to use if
> they want to.
>
> They could (and should) make sure the inode size is sufficiently
> up-to-date before calling it. And if they want something more
> synchronous, they can do it themselves.
>
> But NFS, for example, has open/close consistency, so the metadata
> revalidation is at open() time, not at read time.
>
>                 Linus

NFS may be ok here, but it will break GFS2. There may be others too... 
OCFS2 is likely one. Not sure about CIFS either. Does it really matter 
that we might occasionally allocate a page and then free it again?

Ramfs is a simple test case, but at the same time it doesn't represent 
the complexity of a real world filesystem. I'm just back from a few days 
holiday so apologies if I've missed something earlier on in the discussions,

Steve.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] mm/filemap: do not allocate cache pages beyond end of file at read
  2019-10-30 10:34           ` Steven Whitehouse
@ 2019-10-30 10:54             ` Linus Torvalds
  2019-10-31 11:40               ` Steven Whitehouse
  0 siblings, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2019-10-30 10:54 UTC (permalink / raw)
  To: Steven Whitehouse
  Cc: Konstantin Khlebnikov, Kirill A. Shutemov, Linux-MM,
	Andrew Morton, Linux Kernel Mailing List, linux-fsdevel,
	Alexander Viro, Johannes Weiner, cluster-devel

On Wed, Oct 30, 2019 at 11:35 AM Steven Whitehouse <swhiteho@redhat.com> wrote:
>
> NFS may be ok here, but it will break GFS2. There may be others too...
> OCFS2 is likely one. Not sure about CIFS either. Does it really matter
> that we might occasionally allocate a page and then free it again?

Why are gfs2 and cifs doing things wrong?

"readpage()" is not for synchrionizing metadata. Never has been. You
shouldn't treat it that way, and you shouldn't then make excuses for
filesystems that treat it that way.

Look at mmap, for example. It will do the SIGBUS handling before
calling readpage(). Same goes for the copyfile code. A filesystem that
thinks "I will update size at readpage" is already fundamentally
buggy.

We do _recheck_ the inode size under the page lock, but that's to
handle the races with truncate etc.

            Linus

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] mm/filemap: do not allocate cache pages beyond end of file at read
  2019-10-30 10:54             ` Linus Torvalds
@ 2019-10-31 11:40               ` Steven Whitehouse
  2019-11-22 23:59                 ` Andreas Grünbacher
  0 siblings, 1 reply; 18+ messages in thread
From: Steven Whitehouse @ 2019-10-31 11:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Konstantin Khlebnikov, Kirill A. Shutemov, Linux-MM,
	Andrew Morton, Linux Kernel Mailing List, linux-fsdevel,
	Alexander Viro, Johannes Weiner, cluster-devel, Ronnie Sahlberg,
	Steve French, Andreas Gruenbacher, Bob Peterson

Hi,

On 30/10/2019 10:54, Linus Torvalds wrote:
> On Wed, Oct 30, 2019 at 11:35 AM Steven Whitehouse<swhiteho@redhat.com>  wrote:
>> NFS may be ok here, but it will break GFS2. There may be others too...
>> OCFS2 is likely one. Not sure about CIFS either. Does it really matter
>> that we might occasionally allocate a page and then free it again?
> Why are gfs2 and cifs doing things wrong?
For CIFS I've added Ronnie and Steve to common on that.
> "readpage()" is not for synchrionizing metadata. Never has been. You
> shouldn't treat it that way, and you shouldn't then make excuses for
> filesystems that treat it that way.
>
> Look at mmap, for example. It will do the SIGBUS handling before
> calling readpage(). Same goes for the copyfile code. A filesystem that
> thinks "I will update size at readpage" is already fundamentally
> buggy.
>
> We do _recheck_ the inode size under the page lock, but that's to
> handle the races with truncate etc.
>
>              Linus

For the GFS2 side of things, the algorithm looks like this:

  - Is there an uptodate page in cache?

    Yes, return it

    No, call into the fs readpage() to get one

This is designed so that for pages that are available in the page cache, 
we don't even need to call into the filesystem at all. It is all dealt 
with at the page cache level, unless the page doesn't exist. At this 
point we don't know what the i_size might be, and prior to the proposed 
patch, it simply doesn't matter, since we will ask the filesystem via 
->readpage() for all pages which are not in the cache.

If the page doesn't exist, we have to take the cluster level locks 
(glocks in the case of GFS2) which are potentially expensive, certainly 
a lot more expensive than the page lock anyway. That is currently done 
at the ->readpage() level, although we do have to drop the page lock 
first and then get the locks in the correct order, since the lock 
ordering requires the glock to be taken in shared mode ahead of the page 
lock.

We've always in the past been able to just use the generic code, since 
it was written to not assume i_size was valid outside of the fs specific 
locks. The aim has always been to try and use generic code as much as 
possible, even though there are some cases where we've had to depart 
from that for various reasons.

It appears that the filemap_fault issue seems to have not been spotted 
before. I'm not quite sure how that was missed - seems to show that we 
have some missing tests, but I agree that it does need to be fixed. It 
is a while since I last looked at that particular bit of code in detail, 
so my memory may be a bit fuzzy.

Andreas, Bob, have I missed anything here?

Steve.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] mm/filemap: do not allocate cache pages beyond end of file at read
  2019-10-31 11:40               ` Steven Whitehouse
@ 2019-11-22 23:59                 ` Andreas Grünbacher
  2019-11-25 10:52                   ` Steven Whitehouse
  0 siblings, 1 reply; 18+ messages in thread
From: Andreas Grünbacher @ 2019-11-22 23:59 UTC (permalink / raw)
  To: Steven Whitehouse
  Cc: Linus Torvalds, Konstantin Khlebnikov, Kirill A. Shutemov,
	Linux-MM, Andrew Morton, Linux Kernel Mailing List,
	linux-fsdevel, Alexander Viro, Johannes Weiner, cluster-devel,
	Ronnie Sahlberg, Steve French, Andreas Gruenbacher, Bob Peterson

Hi,

Am Do., 31. Okt. 2019 um 12:43 Uhr schrieb Steven Whitehouse
<swhiteho@redhat.com>:
> Andreas, Bob, have I missed anything here?

I've looked into this a bit, and it seems that there's a reasonable
way to get rid of the lock taking in ->readpage and ->readpages
without a lot of code duplication. My proposal for that consists of
multiple patches, so I've posted it separately:

https://lore.kernel.org/linux-fsdevel/20191122235324.17245-1-agruenba@redhat.com/T/#t

Thanks,
Andreas

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] mm/filemap: do not allocate cache pages beyond end of file at read
  2019-11-22 23:59                 ` Andreas Grünbacher
@ 2019-11-25 10:52                   ` Steven Whitehouse
  2019-11-25 17:05                     ` Linus Torvalds
  0 siblings, 1 reply; 18+ messages in thread
From: Steven Whitehouse @ 2019-11-25 10:52 UTC (permalink / raw)
  To: Andreas Grünbacher
  Cc: Linus Torvalds, Konstantin Khlebnikov, Kirill A. Shutemov,
	Linux-MM, Andrew Morton, Linux Kernel Mailing List,
	linux-fsdevel, Alexander Viro, Johannes Weiner, cluster-devel,
	Ronnie Sahlberg, Steve French, Andreas Gruenbacher, Bob Peterson

Hi,

On 22/11/2019 23:59, Andreas Grünbacher wrote:
> Hi,
>
> Am Do., 31. Okt. 2019 um 12:43 Uhr schrieb Steven Whitehouse
> <swhiteho@redhat.com>:
>> Andreas, Bob, have I missed anything here?
> I've looked into this a bit, and it seems that there's a reasonable
> way to get rid of the lock taking in ->readpage and ->readpages
> without a lot of code duplication. My proposal for that consists of
> multiple patches, so I've posted it separately:
>
> https://lore.kernel.org/linux-fsdevel/20191122235324.17245-1-agruenba@redhat.com/T/#t
>
> Thanks,
> Andreas
>
Andreas, thanks for taking a look at this.

Linus, is that roughly what you were thinking of?

Ronnie, Steve, can the same approach perhaps work for CIFS?

Steve.





^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] mm/filemap: do not allocate cache pages beyond end of file at read
  2019-11-25 10:52                   ` Steven Whitehouse
@ 2019-11-25 17:05                     ` Linus Torvalds
  2019-11-27 15:41                       ` Steven Whitehouse
  0 siblings, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2019-11-25 17:05 UTC (permalink / raw)
  To: Steven Whitehouse
  Cc: Andreas Grünbacher, Konstantin Khlebnikov,
	Kirill A. Shutemov, Linux-MM, Andrew Morton,
	Linux Kernel Mailing List, linux-fsdevel, Alexander Viro,
	Johannes Weiner, cluster-devel, Ronnie Sahlberg, Steve French,
	Andreas Gruenbacher, Bob Peterson

On Mon, Nov 25, 2019 at 2:53 AM Steven Whitehouse <swhiteho@redhat.com> wrote:
>
> Linus, is that roughly what you were thinking of?

So the concept looks ok, but I don't really like the new flags as they
seem to be gfs2-specific rather than generic.

That said, I don't _gate_ them either, since they aren't in any
critical code sequence, and it's not like they do anything really odd.

I still think the whole gfs2 approach is broken. You're magically ok
with using stale data from the cache just because it's cached, even if
another client might have truncated the file or something.

So you're ok with saying "the file used to be X bytes in size, so
we'll just give you this data because we trust that the X is correct".

But you're not ok to say "oh, the file used to be X bytes in size, but
we don't want to give you a short read because it might not be correct
any more".

See the disconnect? You trust the size in one situation, but not in another one.

I also don't really see that you *need* the new flag at all. Since
you're doing to do a speculative read and then a real read anyway, and
since the only thing that you seem to care about is the file size
(because the *data* you will trust if it is cached), then why don't
you just use the *existing* generic read, and *IFF* you get a
truncated return value, then you go and double-check that the file
hasn't changed in size?

See what I'm saying? I think gfs2 is being very inconsistent in when
it trusts the file size, and I don't see that you even need the new
behavior that patch gives, because you might as well just use the
existing code (just move the i_size check earlier, and then teach gfs2
to double-check the "I didn't get as much as I expected" case).

                 Linus

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] mm/filemap: do not allocate cache pages beyond end of file at read
  2019-11-25 17:05                     ` Linus Torvalds
@ 2019-11-27 15:41                       ` Steven Whitehouse
  2019-11-27 16:29                         ` Andreas Gruenbacher
  2019-11-27 17:29                         ` Linus Torvalds
  0 siblings, 2 replies; 18+ messages in thread
From: Steven Whitehouse @ 2019-11-27 15:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andreas Grünbacher, Konstantin Khlebnikov,
	Kirill A. Shutemov, Linux-MM, Andrew Morton,
	Linux Kernel Mailing List, linux-fsdevel, Alexander Viro,
	Johannes Weiner, cluster-devel, Ronnie Sahlberg, Steve French,
	Andreas Gruenbacher, Bob Peterson

Hi,

On 25/11/2019 17:05, Linus Torvalds wrote:
> On Mon, Nov 25, 2019 at 2:53 AM Steven Whitehouse <swhiteho@redhat.com> wrote:
>> Linus, is that roughly what you were thinking of?
> So the concept looks ok, but I don't really like the new flags as they
> seem to be gfs2-specific rather than generic.
>
> That said, I don't _gate_ them either, since they aren't in any
> critical code sequence, and it's not like they do anything really odd.
>
> I still think the whole gfs2 approach is broken. You're magically ok
> with using stale data from the cache just because it's cached, even if
> another client might have truncated the file or something.

If another node tries to truncate the file, that will require an 
exclusive glock, and in turn that means the all the other nodes will 
have to drop their glock(s) shared or exclusive. That process 
invalidates the page cache on those nodes, such that any further 
requests on those nodes will find the cache empty and have to call into 
the filesystem.

If a page is truncated on another node, then when the local node gives 
up its glock, after any copying (e.g. for read) has completed then the 
truncate will take place. The local node will then have to reread any 
data relating to new pages or return an error in case the next page to 
be read has vanished due to the truncate. It is a pretty small window, 
and the advantage is that in cases where the page is in cache, we can 
directly use the cached page without having to call into the filesystem 
at all. So it is page atomic in that sense.

The overall aim here is to avoid taking (potentially slow) cluster locks 
when at all possible, yet at the same time deliver close to local fs 
semantics whenever we can. You can think of GFS2's glock concept (at 
least as far as the inodes we are discussing here) as providing a 
combination of (page) cache control and cluster (dlm) locking.

>
> So you're ok with saying "the file used to be X bytes in size, so
> we'll just give you this data because we trust that the X is correct".
>
> But you're not ok to say "oh, the file used to be X bytes in size, but
> we don't want to give you a short read because it might not be correct
> any more".
>
> See the disconnect? You trust the size in one situation, but not in another one.

Well we are not trusting the size at all... the original algorithm 
worked entirely off "is this page in cache and uptodate?" and for 
exactly the reason that we know the size in the inode might be out of 
date, if we are not currently holding a glock in either shared or 
exclusive mode. We also know that if there is a page in cache and 
uptodate then we must be holding the glock too.

>
> I also don't really see that you *need* the new flag at all. Since
> you're doing to do a speculative read and then a real read anyway, and
> since the only thing that you seem to care about is the file size
> (because the *data* you will trust if it is cached), then why don't
> you just use the *existing* generic read, and *IFF* you get a
> truncated return value, then you go and double-check that the file
> hasn't changed in size?
>
> See what I'm saying? I think gfs2 is being very inconsistent in when
> it trusts the file size, and I don't see that you even need the new
> behavior that patch gives, because you might as well just use the
> existing code (just move the i_size check earlier, and then teach gfs2
> to double-check the "I didn't get as much as I expected" case).
>
>                   Linus

I'll leave the finer details to Andreas here, since it is his patch, and 
hopefully we can figure out a good path forward. We are perhaps also a 
bit reluctant to go off and (nearly) duplicate code that is already in 
the core vfs library functions, since that often leads to things getting 
out of sync (our implementation of ->writepages is one case where that 
happened in the past) and missing important bug fixes/features in some 
cases. Hopefully though we can iterate on this a bit and come up with 
something which will resolve all the issues,

Steve.

>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] mm/filemap: do not allocate cache pages beyond end of file at read
  2019-11-27 15:41                       ` Steven Whitehouse
@ 2019-11-27 16:29                         ` Andreas Gruenbacher
  2019-11-27 17:29                         ` Linus Torvalds
  1 sibling, 0 replies; 18+ messages in thread
From: Andreas Gruenbacher @ 2019-11-27 16:29 UTC (permalink / raw)
  To: Steven Whitehouse
  Cc: Linus Torvalds, Andreas Grünbacher, Konstantin Khlebnikov,
	Kirill A. Shutemov, Linux-MM, Andrew Morton,
	Linux Kernel Mailing List, linux-fsdevel, Alexander Viro,
	Johannes Weiner, cluster-devel, Ronnie Sahlberg, Steve French,
	Bob Peterson

On Wed, Nov 27, 2019 at 4:42 PM Steven Whitehouse <swhiteho@redhat.com> wrote:
> Hi,
>
> On 25/11/2019 17:05, Linus Torvalds wrote:
> > On Mon, Nov 25, 2019 at 2:53 AM Steven Whitehouse <swhiteho@redhat.com> wrote:
> >> Linus, is that roughly what you were thinking of?
> > So the concept looks ok, but I don't really like the new flags as they
> > seem to be gfs2-specific rather than generic.
> >
> > That said, I don't _gate_ them either, since they aren't in any
> > critical code sequence, and it's not like they do anything really odd.
> >
> > I still think the whole gfs2 approach is broken. You're magically ok
> > with using stale data from the cache just because it's cached, even if
> > another client might have truncated the file or something.
>
> If another node tries to truncate the file, that will require an
> exclusive glock, and in turn that means the all the other nodes will
> have to drop their glock(s) shared or exclusive. That process
> invalidates the page cache on those nodes, such that any further
> requests on those nodes will find the cache empty and have to call into
> the filesystem.
>
> If a page is truncated on another node, then when the local node gives
> up its glock, after any copying (e.g. for read) has completed then the
> truncate will take place. The local node will then have to reread any
> data relating to new pages or return an error in case the next page to
> be read has vanished due to the truncate. It is a pretty small window,
> and the advantage is that in cases where the page is in cache, we can
> directly use the cached page without having to call into the filesystem
> at all. So it is page atomic in that sense.
>
> The overall aim here is to avoid taking (potentially slow) cluster locks
> when at all possible, yet at the same time deliver close to local fs
> semantics whenever we can. You can think of GFS2's glock concept (at
> least as far as the inodes we are discussing here) as providing a
> combination of (page) cache control and cluster (dlm) locking.
>
> >
> > So you're ok with saying "the file used to be X bytes in size, so
> > we'll just give you this data because we trust that the X is correct".
> >
> > But you're not ok to say "oh, the file used to be X bytes in size, but
> > we don't want to give you a short read because it might not be correct
> > any more".
> >
> > See the disconnect? You trust the size in one situation, but not in another one.
>
> Well we are not trusting the size at all... the original algorithm
> worked entirely off "is this page in cache and uptodate?" and for
> exactly the reason that we know the size in the inode might be out of
> date, if we are not currently holding a glock in either shared or
> exclusive mode. We also know that if there is a page in cache and
> uptodate then we must be holding the glock too.
>
>
> >
> > I also don't really see that you *need* the new flag at all. Since
> > you're doing to do a speculative read and then a real read anyway, and
> > since the only thing that you seem to care about is the file size
> > (because the *data* you will trust if it is cached), then why don't
> > you just use the *existing* generic read, and *IFF* you get a
> > truncated return value, then you go and double-check that the file
> > hasn't changed in size?
> >
> > See what I'm saying? I think gfs2 is being very inconsistent in when
> > it trusts the file size, and I don't see that you even need the new
> > behavior that patch gives, because you might as well just use the
> > existing code (just move the i_size check earlier, and then teach gfs2
> > to double-check the "I didn't get as much as I expected" case).

We can identify short reads, but we won't get information about
readahead back from generic_file_read_iter or filemap_fault. We could
try to work around this with filesystem specific flags for ->readpage
and ->readpages, but that would break down with multiple concurrent
readers in addition to being a real mess. I'm currently out of better
ideas that avoid duplicating the generic code.

> >                   Linus
>
> I'll leave the finer details to Andreas here, since it is his patch, and
> hopefully we can figure out a good path forward. We are perhaps also a
> bit reluctant to go off and (nearly) duplicate code that is already in
> the core vfs library functions, since that often leads to things getting
> out of sync (our implementation of ->writepages is one case where that
> happened in the past) and missing important bug fixes/features in some
> cases. Hopefully though we can iterate on this a bit and come up with
> something which will resolve all the issues,
>
> Steve.

Thanks,
Andreas


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] mm/filemap: do not allocate cache pages beyond end of file at read
  2019-11-27 15:41                       ` Steven Whitehouse
  2019-11-27 16:29                         ` Andreas Gruenbacher
@ 2019-11-27 17:29                         ` Linus Torvalds
  1 sibling, 0 replies; 18+ messages in thread
From: Linus Torvalds @ 2019-11-27 17:29 UTC (permalink / raw)
  To: Steven Whitehouse
  Cc: Andreas Grünbacher, Konstantin Khlebnikov,
	Kirill A. Shutemov, Linux-MM, Andrew Morton,
	Linux Kernel Mailing List, linux-fsdevel, Alexander Viro,
	Johannes Weiner, cluster-devel, Ronnie Sahlberg, Steve French,
	Andreas Gruenbacher, Bob Peterson

On Wed, Nov 27, 2019 at 7:42 AM Steven Whitehouse <swhiteho@redhat.com> wrote:
>
> I'll leave the finer details to Andreas here, since it is his patch, and
> hopefully we can figure out a good path forward.

As mentioned, I don't _hate_ that patch (ok, I seem to have typoed it
and said that I don't "gate" it ;), so if that's what you guys really
want to do, I'm ok with it. But..

I do think you already get the data with the current case, from the
"short read" thing. So just changing the current generic read function
to check against the size first:

  --- a/mm/filemap.c
  +++ b/mm/filemap.c
  @@ -2021,9 +2021,9 @@ static ssize_t
generic_file_buffered_read(struct kiocb *iocb,
        unsigned int prev_offset;
        int error = 0;

  -     if (unlikely(*ppos >= inode->i_sb->s_maxbytes))
  +     if (unlikely(*ppos >= inode->i_size))
                return 0;
  -     iov_iter_truncate(iter, inode->i_sb->s_maxbytes);
  +     iov_iter_truncate(iter, inode->i_size);

        index = *ppos >> PAGE_SHIFT;
        prev_index = ra->prev_pos >> PAGE_SHIFT;

and you're done. Nice and clean.

Then in gfs2 you just notice the short read, and check at that point.
Sure, you'll also cut read-ahead to the old size boundary, but does
anybody _seriously_ believe that read-ahead matters when you hit the
"some other node write more data, we're reading past the old end"
case? I don't think that's the case.

But I _can_ live with the patch that adds the extra "cached only" bit.
It just honestly feels pointless.

               Linus

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2019-11-27 17:30 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-28  9:59 [PATCH] mm/filemap: do not allocate cache pages beyond end of file at read Konstantin Khlebnikov
2019-10-28 12:39 ` Linus Torvalds
2019-10-28 12:42 ` Kirill A. Shutemov
2019-10-28 12:47   ` Linus Torvalds
2019-10-28 12:57     ` Kirill A. Shutemov
2019-10-29 14:25       ` Konstantin Khlebnikov
2019-10-29 16:52         ` Linus Torvalds
2019-10-30  6:50           ` Kirill A. Shutemov
2019-10-30  7:02             ` Linus Torvalds
2019-10-30 10:34           ` Steven Whitehouse
2019-10-30 10:54             ` Linus Torvalds
2019-10-31 11:40               ` Steven Whitehouse
2019-11-22 23:59                 ` Andreas Grünbacher
2019-11-25 10:52                   ` Steven Whitehouse
2019-11-25 17:05                     ` Linus Torvalds
2019-11-27 15:41                       ` Steven Whitehouse
2019-11-27 16:29                         ` Andreas Gruenbacher
2019-11-27 17:29                         ` Linus Torvalds

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).