From: Milosz Tanski <milosz@adfin.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@infradead.org>,
	Jeremy Allison <jra@samba.org>,
	LKML <linux-kernel@vger.kernel.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"linux-aio@kvack.org" <linux-aio@kvack.org>,
	Mel Gorman <mgorman@suse.de>,
	Volker Lendecke <Volker.Lendecke@sernet.de>,
	Tejun Heo <tj@kernel.org>, Jeff Moyer <jmoyer@redhat.com>,
	"Theodore Ts'o" <tytso@mit.edu>,
	Al Viro <viro@zeniv.linux.org.uk>,
	Linux API <linux-api@vger.kernel.org>,
	Michael Kerrisk <mtk.manpages@gmail.com>,
	linux-arch@vger.kernel.org, Dave Chinner <david@fromorbit.com>
Subject: Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
Date: Mon, 30 Mar 2015 19:25:22 -0400
Message-ID: <CANP1eJFGYrXoWEc7xeeJeudk5tCSbzb1cezjPCTumRCfTVmWog@mail.gmail.com>
In-Reply-To: <20150330132625.52b1250527ca3dcda79e349e@linux-foundation.org>

On Mon, Mar 30, 2015 at 4:26 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Mon, 30 Mar 2015 00:36:04 -0700 Christoph Hellwig <hch@infradead.org> wrote:
>
>> On Fri, Mar 27, 2015 at 08:58:54AM -0700, Jeremy Allison wrote:
>> > The problem with the above is that we can't tell the difference
>> > between pread2() returning a short read because the pages are not
>> > in cache, or because someone truncated the file. So we need some
>> > way to differentiate this.
>>
>> Is a race vs truncate really that time critical that you can't
>> wait for the thread pool to do the second read to notice it?
>>
>> > My preference from userspace would be for pread2() to return
>> > EAGAIN if *all* the data requested is not available (where
>> > 'all' can be less than the size requested if the file has
>> > been truncated in the meantime).
>>
>> That is easily implementable, but I can see that for example web apps
>> would be happy to get as much as possible.  So if Samba can be ok
>> with short reads and only detecting the truncated case in the slow
>> path that would make life simpler.  Otherwise we might indeed need two
>> flags.
>
> The problem is that many applications (including samba!) want
> all-or-nothing behaviour, and preadv2() cannot provide it.  By the time
> preadv2() discovers a not-present page, it has already copied bulk data
> out to userspace.
>
> To fix this, preadv2() would need to take two passes across the pages,
> pinning them in between and somehow blocking out truncate.  That's a
> big change.
>
> With the current preadv2(), applications would have to do
>
>         nr_read = preadv2(..., offset, len, ...);
>         if (nr_read == len)
>                 process data;
>         else
>                 punt(offset + nr_read, len - nr_read);
>
> and the worker thread will later have to splice together the initial
> data and the later-arriving data, probably on another CPU, probably
> after the initial data has gone cache-cold.
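For concreteness, the short-read pattern quoted above could look roughly
like the sketch below. It is only a sketch under assumptions: it uses
RWF_NOWAIT, the flag name under which page-cache-only reads were later
merged into mainline (this patchset calls it RWF_NONBLOCK), and
read_cached_prefix() is a hypothetical helper, not something from the
series. Any error from preadv2() is treated as "nothing cached", so the
caller punts the whole range.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_NOWAIT
#define RWF_NOWAIT 0x00000008 /* value from the Linux UAPI headers */
#endif

/*
 * Hypothetical helper: copy whatever part of [offset, offset + len) is
 * already in the page cache into buf, without blocking on disk I/O.
 *
 * Returns the number of bytes copied (possibly 0).  *punt_off and
 * *punt_len describe the remainder that a worker thread would still
 * have to fetch with an ordinary blocking pread().
 */
static ssize_t read_cached_prefix(int fd, char *buf, size_t len, off_t offset,
                                  off_t *punt_off, size_t *punt_len)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    ssize_t nr_read = preadv2(fd, &iov, 1, offset, RWF_NOWAIT);

    /*
     * EAGAIN means no data was immediately available.  Other errors
     * (for example ENOSYS or EOPNOTSUPP on kernels or filesystems
     * without the flag) are handled the same way here: nothing was
     * read, so everything is punted.
     */
    if (nr_read < 0)
        nr_read = 0;

    assert((size_t)nr_read <= len);
    *punt_off = offset + nr_read;
    *punt_len = len - (size_t)nr_read;
    return nr_read;
}
```

A non-zero *punt_len does not tell the caller whether the tail is
simply uncached or the file was truncated; as discussed earlier in the
thread, only the blocking read in the slow path can tell those apart.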
>
> A cleaner solution is
>
>         if (fincore(fd, NULL, offset, len) == len) {
>                 preadv(..., offset, len);
>                 process data;
>         } else {
>                 punt(offset, len);
>         }
>
> This way all the data gets copied in a single hit and is cache-hot when
> userspace processes it.
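Since fincore() is not an existing syscall, the check in the quoted
snippet can only be approximated from userspace today. The sketch below
does that with mmap() plus mincore(); fincore_emul() is a hypothetical
name, and the return convention (bytes resident from offset up to the
first non-resident page, clamped to len) is my assumption about what
the proposed call would report. offset has to be page-aligned.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical userspace stand-in for fincore(fd, NULL, offset, len).
 * Returns the number of bytes, starting at offset, that are resident
 * in the page cache before the first non-resident page (at most len),
 * or -1 on error.
 */
static ssize_t fincore_emul(int fd, off_t offset, size_t len)
{
    long psz = sysconf(_SC_PAGESIZE);
    size_t pages, i, bytes;
    unsigned char *vec;
    void *map;

    assert(psz > 0);
    if (len == 0)
        return 0;

    pages = (len + (size_t)psz - 1) / (size_t)psz;
    map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, offset);
    if (map == MAP_FAILED)
        return -1;

    vec = malloc(pages);
    if (!vec || mincore(map, len, vec) != 0) {
        free(vec);
        munmap(map, len);
        return -1;
    }

    /* Bit 0 of each vector entry is set when the page is resident. */
    for (i = 0; i < pages && (vec[i] & 1); i++)
        ;

    free(vec);
    munmap(map, len);

    bytes = i * (size_t)psz;
    return (ssize_t)(bytes > len ? len : bytes);
}
```

A caller would compare the result against len and then either issue a
plain pread() or punt to the thread pool. The answer is only a
snapshot: a page can be evicted between the check and the read, which
is why the read that follows can still occasionally block.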
>
> Comparing fincore()+pread() to preadv2():
>
> pros:
>
> a) fincore() may be used to provide both all-or-nothing and
>    part-read-ok behaviour cleanly and with optimum cache behaviour.
>
> b) fincore() doesn't add overhead, complexity and stack depth to
>    core pagecache read() code.  Nor does it expand VFS data structures.

Actually, we're not expanding any VFS structures with the next
patchset. I've rebased the forthcoming patchset on top of Al's
vfs/linux-next tree to keep up with the refactoring already done in
some of the code paths I touched. The refactoring work done there
already adds a flags argument to struct kiocb for other reasons.

>
> c) with a non-NULL second argument, fincore provides the
>    mincore()-style page map.
>
> cons:
>
> d) fincore() is more expensive
>
> e) fincore() will very occasionally block
>
>
> Tradeoffs are involved.  To decide on the best path we should examine
> d).  I expect that the overhead will be significant for small reads but
> not significant for medium and large reads.  Needs quantifying.
>
> And I don't believe that e) will be a problem in the real world.  It's
> a significant increase in worst-case latency and a negligible increase
> in average latency.  I've asked at least three times for someone to
> explain why this is unacceptable and no explanation has been provided.

-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com
