From: Matthew Wilcox <willy@infradead.org>
To: "yukuai (C)" <yukuai3@huawei.com>
Cc: hch@infradead.org, darrick.wong@oracle.com,
	linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, houtao1@huawei.com,
	zhengbin13@huawei.com, yi.zhang@huawei.com
Subject: Re: [RFC] iomap: fix race between readahead and direct write
Date: Sat, 18 Jan 2020 23:58:28 -0800	[thread overview]
Message-ID: <20200119075828.GA4147@bombadil.infradead.org> (raw)
In-Reply-To: <fafa0550-184c-e59c-9b79-bd5d716a20cc@huawei.com>

On Sun, Jan 19, 2020 at 02:55:14PM +0800, yukuai (C) wrote:
> On 2020/1/19 14:14, Matthew Wilcox wrote:
> > I don't understand your reasoning here.  If another process wants to
> > access a page of the file which isn't currently in cache, it would have
> > to first read the page in from storage.  If it's under readahead, it
> > has to wait for the read to finish.  Why is the second case worse than
> > the first?  It seems better to me.
> 
> Thanks for your response! My worry is that, for example:
> 
> We read page 0, and trigger readahead to read n pages (0 to n-1). While
> in another thread, we read page n-1.
> 
> In the current implementation, if readahead is in the process of reading
> pages 0 to n-2, the later operation doesn't need to wait for the former
> one to finish. However, the later operation will have to wait if we add
> all pages to the page cache first. That is why I said it might cause a
> performance problem.

OK, but let's put some numbers on that.  Imagine that we're using
high-performance spinning rust, so we have an access latency of 5ms (200
IOPS), and we're accessing 20 consecutive pages which happen to have
their data contiguous on disk.  Our CPU is running at 5GHz and takes
about
100,000 cycles to submit an I/O, plus 1,000 cycles to add an extra page
to the I/O.
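
To make the arithmetic concrete, here's a throwaway userspace C program
that just evaluates that model; the constants are the assumptions stated
above, not measurements:

    #include <stdio.h>

    int main(void)
    {
        const double hz = 5e9;                  /* assumed 5GHz clock */
        const double submit_cycles = 100000.0;  /* cycles to submit an I/O */
        const double page_cycles = 1000.0;      /* cycles per extra page */
        const double access_us = 5000.0;        /* 5ms access time, 200 IOPS */
        const int pages = 20;

        double submit_us = submit_cycles / hz * 1e6;        /* 20us */
        double insert_us = pages * page_cycles / hz * 1e6;  /* 4us */

        printf("submit one I/O:    %6.0f us\n", submit_us);
        printf("insert %d pages:   %6.0f us\n", pages, insert_us);
        printf("wait for the disk: %6.0f us\n", access_us);
        /* how much longer the wait-for-I/O window is than the
         * page-insertion window: 5000us / 4us = 1250 */
        printf("window ratio:      %6.0f\n", access_us / insert_us);
        return 0;
    }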

Current implementation: Allocate 20 pages, place 19 of them in the cache,
fail to place the last one in the cache.  The later thread actually gets
to jump the queue and submit its bio first.  Its latency will be 100,000
cycles (20us) plus the 5ms access time.  But it only has 20,000 cycles
(4us) to hit this race, or it will end up behaving the same way as below.
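
In code, the current ordering looks roughly like this.  It's a
deliberately simplified sketch, not the actual fs/iomap code;
next_ra_page() and bio_add_ra_page() are hypothetical stand-ins for
pulling the next page off the readahead batch and adding it to the bio:

    while ((page = next_ra_page(pages))) {
        if (add_to_page_cache_lru(page, mapping, page->index,
                                  GFP_KERNEL)) {
            /* Lost the race: another thread already owns this
             * index, so the page is simply left out of our bio
             * and that thread submits its own read first. */
            put_page(page);
            continue;
        }
        bio_add_ra_page(bio, page);
    }
    submit_bio(bio);    /* covers only the pages we won */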

New implementation: Allocate 20 pages, place them all in the cache,
then spend 120,000 cycles building & submitting the I/O, and wait 5ms
for the I/O to complete.
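
The proposed ordering, in the same simplified style (page_array and
start are again stand-ins); the point is that a racing reader now finds
a locked page in the cache and sleeps in lock_page() until the bio
completes:

    for (i = 0; i < nr_pages; i++)
        add_to_page_cache_lru(page_array[i], mapping, start + i,
                              GFP_KERNEL);    /* every page in cache, locked */
    for (i = 0; i < nr_pages; i++)
        bio_add_ra_page(bio, page_array[i]);  /* ~120,000 cycles total */
    submit_bio(bio);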

But look how much more likely it is that the second thread will arrive
during the window where we're waiting for the I/O to complete: 5ms is
1250 times longer than 4us.

If it _does_ get the latency benefit of jumping the queue, the readahead
will create one or two I/Os.  If it hit page 18 instead of page 19, we'd
end up doing three I/Os: the first for page 18, then one for pages 0-17,
and one for page 19.  And that means the disk is going to be busy for
15ms (three 5ms accesses instead of one), delaying the next I/O for up
to 10ms.  It's actually beneficial in
the long term for the second thread to wait for the readahead to finish.

Oh, and the current ->readpages code has a race where if the page tagged
with PageReadahead ends up not being inserted, we'll lose that bit,
which means the readahead will just stop and have to restart (because
it will look to the readahead code like it's not being effective).
That's a far worse performance problem.
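
For reference, the trigger that losing the bit defeats lives in the
buffered read path; this is paraphrased from generic_file_buffered_read():

    /* PG_readahead on a page is the signal to kick off the next
     * batch of asynchronous readahead.  If the flagged page never
     * makes it into the cache, this never fires, and readahead
     * stops until a cache miss restarts it. */
    if (PageReadahead(page))
        page_cache_async_readahead(mapping, ra, filp, page,
                                   index, last_index - index);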


Thread overview: 19+ messages
2020-01-16  6:36 [RFC] iomap: fix race between readahead and direct write yu kuai
2020-01-16 15:32 ` Jan Kara
2020-01-17  9:39   ` yukuai (C)
2020-01-17 11:05     ` Jan Kara
2020-01-17 16:24       ` Darrick J. Wong
2020-01-19  1:25         ` yukuai (C)
2020-01-19  1:17       ` yukuai (C)
2020-01-20 11:42         ` Jan Kara
2020-01-18 23:08 ` Matthew Wilcox
2020-01-19  1:34   ` yukuai (C)
2020-01-19  1:42     ` Matthew Wilcox
2020-01-19  1:57       ` yukuai (C)
2020-01-19  2:51       ` yukuai (C)
2020-01-19  3:01         ` Gao Xiang
2020-01-19  3:15           ` yukuai (C)
2020-01-19  6:14         ` Matthew Wilcox
2020-01-19  6:55           ` yukuai (C)
2020-01-19  7:58             ` Matthew Wilcox [this message]
2020-01-19 11:21               ` yukuai (C)
