From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-0.9 required=3.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, SPF_HELO_NONE,SPF_PASS autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id AE0CCC432C3 for ; Wed, 27 Nov 2019 16:29:53 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 68D222075C for ; Wed, 27 Nov 2019 16:29:53 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="ZPSJE2MV" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 68D222075C Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=redhat.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 15AF16B04C1; Wed, 27 Nov 2019 11:29:53 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 0E5706B04C2; Wed, 27 Nov 2019 11:29:53 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id EEEC26B04C3; Wed, 27 Nov 2019 11:29:52 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0059.hostedemail.com [216.40.44.59]) by kanga.kvack.org (Postfix) with ESMTP id D5A956B04C1 for ; Wed, 27 Nov 2019 11:29:52 -0500 (EST) Received: from smtpin11.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay02.hostedemail.com (Postfix) with SMTP id 9952C4DDE for ; Wed, 27 Nov 2019 16:29:52 +0000 (UTC) X-FDA: 76202593824.11.sand32_1be688b63f924 X-HE-Tag: sand32_1be688b63f924 X-Filterd-Recvd-Size: 9099 Received: from us-smtp-delivery-1.mimecast.com (us-smtp-1.mimecast.com [207.211.31.81]) by imf02.hostedemail.com (Postfix) with ESMTP for ; Wed, 27 Nov 2019 16:29:51 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1574872191; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=rcyN5dGoMhhF+8wU9SDW3LRuwC7oWgcs+VV9XVwODaY=; b=ZPSJE2MV810sp32bWNxVjcpzk4qnTJxvvONoNhVYOVz+vXFDdFiqT2tXROdrLHkYt0skOa wSBNdIpF17ZPS6xdsOR+93bAluci9v3HY75ijfSf/eX6Tumwow1sxxIkcWlMkCxPyGLhWe UiV0hYTgwSjh0udVoT48z5vN41FG1Iw= Received: from mail-oi1-f199.google.com (mail-oi1-f199.google.com [209.85.167.199]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-259-uO86tPIbMymeDeoTm96uFw-1; Wed, 27 Nov 2019 11:29:48 -0500 Received: by mail-oi1-f199.google.com with SMTP id 10so11371418ois.18 for ; Wed, 27 Nov 2019 08:29:48 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=kIu1VyYGNbaIKqeZUoUtiDvput7821e7oFhQavConxk=; b=HRlmNuy5oqZyEILoaDNL1VXuoWq0CP5CuOeT6Si4HafbZx5v+1qLRccPSn4eHS+gHe ZHM493Fuc3A8gRIDkr5UBhbucTqnWlAVvcb7DN+97y/E1IL3L2fxmSh3aJfQOSEQK8sL K2ZuodIOBcsqnlVE4aOmwSznw//m8r6ge6B3jB0JSwuVvydvxx4k12SAn6pCh+Mdw5wK jnKZO8oHWp+TD6WbSOWt1CDjWRVh3XnGLMQz42qC+zocFgQjoWVFXmXVcPVZFldU1oRe ZRVZaOeiZmrZ3nvrv6Ax8miDhbBQwVtaPEGly4rbgvga0Kb+ZZJjk1/rO8DuF4pix8Mx nkMQ== X-Gm-Message-State: APjAAAXT9cCrpvzpN6kElq5Veyo5hupW1gPYrWYpncLoN89+96tA4YNy Vgfu5HnRV5bozVgWS49l1pmELxTjkJ4zDWoOwbyamaqG4VP1GS+TvFa0qj5a3MGnW2A06I70bqq UNd1sWZ0XYYbC8QKXbKsOk8O9jjU= X-Received: by 2002:aca:d14:: with SMTP id 20mr4877850oin.178.1574872187663; Wed, 27 Nov 2019 08:29:47 -0800 (PST) X-Google-Smtp-Source: APXvYqw3yLKNT7wsw1AqyVf2knKKoyFNMFDOQF5O+7hYYWvVBZw57Km8zJVbATfvGULvCZMZmqbPRh8opF4b2Zj5pdE= X-Received: by 2002:aca:d14:: with SMTP id 20mr4877821oin.178.1574872187328; Wed, 27 Nov 2019 08:29:47 -0800 (PST) MIME-Version: 1.0 References: <157225677483.3442.4227193290486305330.stgit@buzz> <20191028124222.ld6u3dhhujfqcn7w@box> <20191028125702.xdfbs7rqhm3wer5t@box> <640bbe51-706b-8d9f-4abc-5f184de6a701@redhat.com> <22f04f02-86e4-b379-81c8-08c002a648f0@redhat.com> In-Reply-To: From: Andreas Gruenbacher Date: Wed, 27 Nov 2019 17:29:36 +0100 Message-ID: Subject: Re: [PATCH] mm/filemap: do not allocate cache pages beyond end of file at read To: Steven Whitehouse Cc: Linus Torvalds , =?UTF-8?Q?Andreas_Gr=C3=BCnbacher?= , Konstantin Khlebnikov , "Kirill A. Shutemov" , Linux-MM , Andrew Morton , Linux Kernel Mailing List , linux-fsdevel , Alexander Viro , Johannes Weiner , "cluster-devel@redhat.com" , Ronnie Sahlberg , Steve French , Bob Peterson X-MC-Unique: uO86tPIbMymeDeoTm96uFw-1 X-Mimecast-Spam-Score: 0 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Wed, Nov 27, 2019 at 4:42 PM Steven Whitehouse wro= te: > Hi, > > On 25/11/2019 17:05, Linus Torvalds wrote: > > On Mon, Nov 25, 2019 at 2:53 AM Steven Whitehouse = wrote: > >> Linus, is that roughly what you were thinking of? > > So the concept looks ok, but I don't really like the new flags as they > > seem to be gfs2-specific rather than generic. > > > > That said, I don't _gate_ them either, since they aren't in any > > critical code sequence, and it's not like they do anything really odd. > > > > I still think the whole gfs2 approach is broken. You're magically ok > > with using stale data from the cache just because it's cached, even if > > another client might have truncated the file or something. > > If another node tries to truncate the file, that will require an > exclusive glock, and in turn that means the all the other nodes will > have to drop their glock(s) shared or exclusive. That process > invalidates the page cache on those nodes, such that any further > requests on those nodes will find the cache empty and have to call into > the filesystem. > > If a page is truncated on another node, then when the local node gives > up its glock, after any copying (e.g. for read) has completed then the > truncate will take place. The local node will then have to reread any > data relating to new pages or return an error in case the next page to > be read has vanished due to the truncate. It is a pretty small window, > and the advantage is that in cases where the page is in cache, we can > directly use the cached page without having to call into the filesystem > at all. So it is page atomic in that sense. > > The overall aim here is to avoid taking (potentially slow) cluster locks > when at all possible, yet at the same time deliver close to local fs > semantics whenever we can. You can think of GFS2's glock concept (at > least as far as the inodes we are discussing here) as providing a > combination of (page) cache control and cluster (dlm) locking. > > > > > So you're ok with saying "the file used to be X bytes in size, so > > we'll just give you this data because we trust that the X is correct". > > > > But you're not ok to say "oh, the file used to be X bytes in size, but > > we don't want to give you a short read because it might not be correct > > any more". > > > > See the disconnect? You trust the size in one situation, but not in ano= ther one. > > Well we are not trusting the size at all... the original algorithm > worked entirely off "is this page in cache and uptodate?" and for > exactly the reason that we know the size in the inode might be out of > date, if we are not currently holding a glock in either shared or > exclusive mode. We also know that if there is a page in cache and > uptodate then we must be holding the glock too. > > > > > > I also don't really see that you *need* the new flag at all. Since > > you're doing to do a speculative read and then a real read anyway, and > > since the only thing that you seem to care about is the file size > > (because the *data* you will trust if it is cached), then why don't > > you just use the *existing* generic read, and *IFF* you get a > > truncated return value, then you go and double-check that the file > > hasn't changed in size? > > > > See what I'm saying? I think gfs2 is being very inconsistent in when > > it trusts the file size, and I don't see that you even need the new > > behavior that patch gives, because you might as well just use the > > existing code (just move the i_size check earlier, and then teach gfs2 > > to double-check the "I didn't get as much as I expected" case). We can identify short reads, but we won't get information about readahead back from generic_file_read_iter or filemap_fault. We could try to work around this with filesystem specific flags for ->readpage and ->readpages, but that would break down with multiple concurrent readers in addition to being a real mess. I'm currently out of better ideas that avoid duplicating the generic code. > > Linus > > I'll leave the finer details to Andreas here, since it is his patch, and > hopefully we can figure out a good path forward. We are perhaps also a > bit reluctant to go off and (nearly) duplicate code that is already in > the core vfs library functions, since that often leads to things getting > out of sync (our implementation of ->writepages is one case where that > happened in the past) and missing important bug fixes/features in some > cases. Hopefully though we can iterate on this a bit and come up with > something which will resolve all the issues, > > Steve. Thanks, Andreas