From: Jan Kara <jack@suse.cz>
To: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Jan Kara <jack@suse.cz>,
"Darrick J. Wong" <darrick.wong@oracle.com>,
Linux FS Devel <linux-fsdevel@vger.kernel.org>,
"linux-scsi@vger.kernel.org" <linux-scsi@vger.kernel.org>,
device-mapper development <dm-devel@redhat.com>,
linux-btrfs@vger.kernel.org, axboe@fb.com, zab@zabbo.net,
neilb@suse.de
Subject: Re: Proposal for annotating _unstable_ pages
Date: Thu, 21 May 2015 21:21:12 +0200
Message-ID: <20150521192112.GA2665@quack.suse.cz>
In-Reply-To: <20150521180954.GA27397@kmo-pixel>
On Thu 21-05-15 11:09:55, Kent Overstreet wrote:
> On Thu, May 21, 2015 at 06:54:53PM +0200, Jan Kara wrote:
> > On Wed 20-05-15 18:04:40, Kent Overstreet wrote:
> > > > Yeah. I never figured out a sane way to migrate pages and keep everything
> > > > else happy. Daniel Phillips is having a go at page forking for tux3; let's
> > > > see if the questions about that get resolved.
> > >
> > > That would be great, we need something.
> > >
> > > I'd also be really curious what btrfs is doing today - is it just bouncing
> > > everything internally, or did they come up with something more clever?
> >
> > Btrfs is just waiting for IO to complete.
> >
> > > > > Also, there's probably always going to be situations where we're reading or
> > > > > writing to pages user space can stomp on (dio) - IMO we need to add a bio flag
> > > > > to annotate this - "if you need this to be stable you have to bounce it".
> > > > > Otherwise either filesystems/block drivers are going to be stuck bouncing
> > > > > everything, or it'll just (continue to be) buggy.
> > > >
> > > > Well, for now there's BIO_SNAP_STABLE that forces the block layer to bounce it,
> > > > but right now ext3 is the last user of it, and afaict btrfs is the only other
> > > > FS that takes care of stable pages on its own.
> > >
> > > I have no idea what BIO_SNAP_STABLE was supposed to be for, but I don't see how
> > > it's useful for anything sane.
> >
> > It's for the case where the lower layer requires stable pages but the
> > upper layer isn't able to provide them (as is the case with ext3). Then the
> > block layer bounces the data for the caller.
> >
> > > But that's the complete opposite of the problem stable pages are supposed to
> > > solve: stable pages are for when the _lower_ layer (be it filesystem, bcache,
> > > md, lvm) needs the memory being either read to or written from (both, it's not
> > > just writes) to not be diddled over while the IO is in flight.
> > >
> > > Now, a point that I think has been missed is that stable pages are _not_ a
> > > complete solution, at least for consumers in the block layer.
> > >
> > > The situation today is that if I'm in the block layer, and I get handed a read
> > > or write bio, I _don't know_ if it's from something that's going to diddle over
> > > those pages or not. So if I require stable pages - be it for data checksumming
> > > or for other things - I've just got to bounce the bio myself.
> > >
> > > And then the really annoying thing is that if you've got stacked things that all
> > > need stable pages (maybe btrfs on top of bcache on top of md) - they _all_ have
> > > to assume the pages aren't going to be stable, so if they need them they _all_
> > > have to bounce - even though once the first layer bounced the bio that made it
> > > stable for everything underneath it.
> >
> > The current design is that if you need stable pages for your device, you
> > set the bdi capability BDI_CAP_STABLE_WRITES; the fs then takes care of not
> > scribbling over your page while it is under writeback, or uses
> > BIO_SNAP_STABLE if it cannot.
>
> But if I need stable pages, I still have to bounce because that _does not_
> guarantee stable pages, it only gives me stable pages for some of the IOs and in
> the lower layers you can't tell which is which.
>
> Do you see the problem? What good is BDI_CAP_STABLE_WRITES if it's not a
> guarantee and I can't tell if I need to bounce or not?
So fix the upper layers to make it a guarantee? You mentioned direct IO
needs fixing. Anything else?
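The current design quoted above works like this. The device advertises BDI_CAP_STABLE_WRITES. The filesystem then either leaves pages under writeback alone, or it falls back to BIO_SNAP_STABLE so that the block layer bounces them. The sketch below models this in userspace C. It is a minimal model under stated assumptions, not kernel code: the TOY_* flags, struct toy_bio and the helper functions are hypothetical stand-ins. "Scribbling" is modeled as overwriting one byte while the IO is notionally in flight. The model shows why the capability only becomes a guarantee once every writer participates, and that includes direct IO.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace stand-ins, NOT kernel definitions:
 * TOY_CAP_STABLE_WRITES plays the role of BDI_CAP_STABLE_WRITES,
 * TOY_BIO_SNAP_STABLE plays the role of BIO_SNAP_STABLE. */
#define TOY_CAP_STABLE_WRITES 0x1u
#define TOY_BIO_SNAP_STABLE   0x1u

enum toy_writer {
	TOY_FS_WAITS_ON_WRITEBACK, /* fs never modifies a page under writeback */
	TOY_FS_CANNOT_WAIT,        /* fs that cannot keep pages stable (ext3-like) */
	TOY_DIRECT_IO              /* user memory that nobody keeps stable */
};

struct toy_bio {
	unsigned flags;
	const char *data;  /* memory the device reads the write from */
	size_t len;
	char *bounce;      /* private copy, if the block layer made one */
};

/* Upper layer: react to a device that wants stable pages. */
static void toy_prepare_write(unsigned bdi_caps, enum toy_writer w,
			      struct toy_bio *bio)
{
	if (!(bdi_caps & TOY_CAP_STABLE_WRITES))
		return;                            /* device does not care */
	if (w == TOY_FS_CANNOT_WAIT)
		bio->flags |= TOY_BIO_SNAP_STABLE; /* ask block layer to bounce */
	/* TOY_FS_WAITS_ON_WRITEBACK: stable because the fs waits, no flag.
	 * TOY_DIRECT_IO: nothing happens; this is the gap in the thread. */
}

/* Block layer: bounce only when the submitter asked for it. */
static int toy_submit(struct toy_bio *bio)
{
	assert(bio->data != NULL && bio->bounce == NULL);
	if (bio->flags & TOY_BIO_SNAP_STABLE) {
		bio->bounce = malloc(bio->len);
		if (!bio->bounce)
			return -1;
		memcpy(bio->bounce, bio->data, bio->len);
		bio->data = bio->bounce;  /* device now reads the copy */
	}
	return 0;
}

/* Submit one write, let the writer scribble while it is "in flight",
 * and report whether the device still sees the submitted bytes. */
static bool toy_write_is_stable(enum toy_writer w)
{
	char page[16] = "payload";
	char expected[sizeof(page)];
	struct toy_bio bio = { 0, page, sizeof(page), NULL };
	bool stable;

	memcpy(expected, page, sizeof(page));
	toy_prepare_write(TOY_CAP_STABLE_WRITES, w, &bio);
	if (toy_submit(&bio) != 0)
		return false;

	if (w != TOY_FS_WAITS_ON_WRITEBACK)
		page[0] = 'X';            /* modify the page mid-IO */

	stable = memcmp(bio.data, expected, bio.len) == 0;
	free(bio.bounce);
	return stable;
}
```

In this model, toy_write_is_stable() returns true for both filesystem paths. It returns false for TOY_DIRECT_IO, and nothing in the bio tells a lower layer that this case applies.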
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
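Kent's alternative is to annotate the bio itself. A submitter would mark IO whose pages may change while in flight. The first layer that needs stable data bounces and clears the mark, and every layer below then sees stable data without copying again. The following is a minimal userspace sketch of that bounce-once rule for a three-layer stack, in the spirit of btrfs on bcache on md. It assumes a made-up flag, TOY_BIO_MAYBE_UNSTABLE, because the thread proposes such an annotation but does not name it. The struct toy_bio and struct toy_layer types are likewise hypothetical.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical flag: "the pages behind this bio may change while the IO
 * is in flight; if you need them stable, bounce". Name is made up. */
#define TOY_BIO_MAYBE_UNSTABLE 0x2u

struct toy_bio {
	unsigned flags;
	const char *data;  /* memory the IO is sourced from */
	size_t len;
	char *bounce;      /* the single private copy, if any layer made one */
};

struct toy_layer {
	const char *name;
	bool needs_stable; /* e.g. the layer checksums the data */
	int bounces;       /* copies made by this layer */
};

/* One stacked layer: bounce only if it needs stable data AND the bio is
 * annotated as possibly unstable. After the copy the data is stable for
 * every layer below, so the annotation is cleared. */
static int toy_layer_submit(struct toy_layer *l, struct toy_bio *bio)
{
	if (l->needs_stable && (bio->flags & TOY_BIO_MAYBE_UNSTABLE)) {
		char *copy = malloc(bio->len);
		if (!copy)
			return -1;
		memcpy(copy, bio->data, bio->len);
		bio->bounce = copy;  /* at most one copy: flag is cleared next */
		bio->data = copy;
		bio->flags &= ~TOY_BIO_MAYBE_UNSTABLE;
		l->bounces++;
	}
	return 0;
}

/* Push a bio down the stack, top layer first; returns total copies. */
static int toy_stack_submit(struct toy_layer *stack, size_t n,
			    struct toy_bio *bio)
{
	int total = 0;

	for (size_t i = 0; i < n; i++) {
		if (toy_layer_submit(&stack[i], bio) != 0)
			return -1;
		total += stack[i].bounces;
	}
	return total;
}

/* Three layers that all need stable data, like btrfs/bcache/md. */
static int toy_count_bounces(unsigned flags)
{
	struct toy_layer stack[] = {
		{ "fs", true, 0 }, { "cache", true, 0 }, { "raid", true, 0 },
	};
	char page[64] = "user data";
	struct toy_bio bio = { flags, page, sizeof(page), NULL };
	int n = toy_stack_submit(stack, sizeof(stack) / sizeof(stack[0]), &bio);

	free(bio.bounce);
	return n;
}
```

When the annotation is set, toy_count_bounces() reports a single copy for the whole stack. A bio without the annotation is trusted as stable, so no layer copies it. Under today's rules, where no layer can tell which case it has, each of the three layers would have to bounce on its own.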