From: Steven Whitehouse <swhiteho@redhat.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>,
	"Kirill A. Shutemov" <kirill@shutemov.name>,
	Linux-MM <linux-mm@kvack.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Johannes Weiner <hannes@cmpxchg.org>,
	"cluster-devel@redhat.com" <cluster-devel@redhat.com>,
	Ronnie Sahlberg <lsahlber@redhat.com>,
	Steve French <sfrench@samba.org>,
	Andreas Gruenbacher <agruenba@redhat.com>,
	Bob Peterson <rpeterso@redhat.com>
Subject: Re: [PATCH] mm/filemap: do not allocate cache pages beyond end of file at read
Date: Thu, 31 Oct 2019 11:40:56 +0000
Message-ID: <640bbe51-706b-8d9f-4abc-5f184de6a701@redhat.com>
In-Reply-To: <CAHk-=wh4SKRxKQf5LawRMSijtjRVQevaFioBK+tOZAVPt7ek0Q@mail.gmail.com>

Hi,

On 30/10/2019 10:54, Linus Torvalds wrote:
> On Wed, Oct 30, 2019 at 11:35 AM Steven Whitehouse <swhiteho@redhat.com> wrote:
>> NFS may be ok here, but it will break GFS2. There may be others too...
>> OCFS2 is likely one. Not sure about CIFS either. Does it really matter
>> that we might occasionally allocate a page and then free it again?
> Why are gfs2 and cifs doing things wrong?
For CIFS, I've added Ronnie and Steve to comment on that.
> "readpage()" is not for synchronizing metadata. Never has been. You
> shouldn't treat it that way, and you shouldn't then make excuses for
> filesystems that treat it that way.
>
> Look at mmap, for example. It will do the SIGBUS handling before
> calling readpage(). Same goes for the copyfile code. A filesystem that
> thinks "I will update size at readpage" is already fundamentally
> buggy.
>
> We do _recheck_ the inode size under the page lock, but that's to
> handle the races with truncate etc.
>
>              Linus

For the GFS2 side of things, the algorithm looks like this:

  - Is there an uptodate page in cache?

    Yes, return it

    No, call into the fs readpage() to get one

This is designed so that for pages that are available in the page cache, 
we don't even need to call into the filesystem at all. It is all dealt 
with at the page cache level, unless the page doesn't exist. At this 
point we don't know what the i_size might be, and prior to the proposed 
patch, it simply doesn't matter, since we will ask the filesystem via 
->readpage() for all pages which are not in the cache.
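
To make the flow above concrete, here is a rough userspace sketch in C.
It is not the kernel's code: toy_page, toy_mapping, toy_fs_readpage()
and toy_read_page() are made-up names for illustration, and the real
page cache is far more involved. The point it shows is that the generic
path never consults i_size, and only enters the filesystem on a miss.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_PAGES 8

/* Hypothetical stand-in for a page cache entry; not the kernel's struct page. */
struct toy_page {
	bool present;   /* page exists in the cache */
	bool uptodate;  /* contents are valid */
};

struct toy_mapping {
	struct toy_page pages[NR_PAGES];
	int readpage_calls;  /* how often we had to enter the filesystem */
};

/*
 * Stand-in for the filesystem's ->readpage(). In a cluster filesystem
 * this is where the cluster locks would be taken, and so the only
 * place where i_size could be trusted.
 */
static void toy_fs_readpage(struct toy_mapping *m, size_t index)
{
	m->readpage_calls++;
	m->pages[index].present = true;
	m->pages[index].uptodate = true;
}

/*
 * Generic read path: serve from the cache if possible, otherwise ask
 * the filesystem. Note that i_size is never looked at here.
 */
static struct toy_page *toy_read_page(struct toy_mapping *m, size_t index)
{
	struct toy_page *page;

	assert(index < NR_PAGES);
	page = &m->pages[index];

	if (page->present && page->uptodate)
		return page;            /* fast path: no call into the fs */

	toy_fs_readpage(m, index);      /* slow path: fs fills the page */
	return page;
}
```

Reading the same index twice in this sketch calls toy_fs_readpage()
only once; the second read is satisfied entirely from the toy cache.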

If the page doesn't exist, we have to take the cluster level locks 
(glocks in the case of GFS2) which are potentially expensive, certainly 
a lot more expensive than the page lock anyway. That is currently done 
at the ->readpage() level, although we do have to drop the page lock 
first and then get the locks in the correct order, since the lock 
ordering requires the glock to be taken in shared mode ahead of the page 
lock.
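
The lock dance described above can be sketched roughly as follows. This
is a userspace toy under stated assumptions, not the GFS2
implementation: struct toy_locks and all the toy_*() helpers are
hypothetical, the locks are plain flags, and the sketch ignores the
possibility that the page is truncated or removed while the page lock
is dropped.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lock state; real glocks and page locks are far richer. */
struct toy_locks {
	bool page_locked;
	bool glock_shared;
	bool uptodate;  /* page contents are valid */
	int reads;      /* number of times data was actually read */
};

static void toy_lock_page(struct toy_locks *l)
{
	assert(!l->page_locked);
	l->page_locked = true;
}

static void toy_unlock_page(struct toy_locks *l)
{
	assert(l->page_locked);
	l->page_locked = false;
}

static void toy_glock_acquire_shared(struct toy_locks *l)
{
	/* Taking the glock while holding the page lock would invert the order. */
	assert(!l->page_locked);
	l->glock_shared = true;
}

static void toy_glock_release(struct toy_locks *l)
{
	assert(l->glock_shared);
	l->glock_shared = false;
}

/*
 * Called with the page locked, as the generic code does for
 * ->readpage(). Returns with the page unlocked and uptodate.
 */
static void toy_readpage(struct toy_locks *l)
{
	assert(l->page_locked);

	toy_unlock_page(l);           /* 1. drop the page lock */
	toy_glock_acquire_shared(l);  /* 2. take the cluster lock first */
	toy_lock_page(l);             /* 3. retake the page lock */

	if (!l->uptodate) {           /* 4. someone may have filled it meanwhile */
		l->reads++;
		l->uptodate = true;
	}

	toy_unlock_page(l);
	toy_glock_release(l);
}
```

The assert in toy_glock_acquire_shared() encodes the ordering rule: if
toy_readpage() tried to take the glock without first dropping the page
lock, the sketch would abort.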

In the past we have always been able to just use the generic code, since
it was written to not assume i_size was valid outside of the fs specific
locks. The aim has always been to use generic code as much as
possible, even though there are some cases where we've had to depart
from that for various reasons.

It appears that the filemap_fault issue has not been spotted
before. I'm not quite sure how that was missed - it suggests that we
have some missing tests, but I agree that it does need to be fixed. It
is a while since I last looked at that particular bit of code in detail,
so my memory may be a bit fuzzy.

Andreas, Bob, have I missed anything here?

Steve.



