linux-btrfs.vger.kernel.org archive mirror
From: Josef Bacik <josef@redhat.com>
To: Mike Fedyk <mfedyk@mikefedyk.com>
Cc: Josef Bacik <josef@redhat.com>, Chris Ball <cjb@laptop.org>,
	Nickolai Zeldovich <nickolai@csail.mit.edu>,
	linux-btrfs@vger.kernel.org
Subject: Re: zero-length files in snapshots
Date: Fri, 12 Feb 2010 11:32:47 -0500
Message-ID: <20100212163246.GC4191@localhost.localdomain>
In-Reply-To: <93cdabd21002120827k493a4c1ao2ba4b6840f2ab427@mail.gmail.com>

On Fri, Feb 12, 2010 at 08:27:00AM -0800, Mike Fedyk wrote:
> On Fri, Feb 12, 2010 at 8:22 AM, Josef Bacik <josef@redhat.com> wrote:
> > On Fri, Feb 12, 2010 at 08:18:01AM -0800, Mike Fedyk wrote:
> >> On Fri, Feb 12, 2010 at 7:19 AM, Josef Bacik <josef@redhat.com> wrote:
> >> > On Thu, Feb 11, 2010 at 08:50:48PM -0800, Mike Fedyk wrote:
> >> >> On Thu, Feb 11, 2010 at 7:11 PM, Chris Ball <cjb@laptop.org> wrote:
> >> >> >   > echo x1 > /mnt/x/d/foo.txt || exit 2
> >> >> >   > btrfsctl -s /mnt/x/snap /mnt/x/d
> >> >> >
> >> >> > You're just missing a sync/fsync() between these two lines.
> >> >> >
> >> >> > We argued on IRC a while ago about whether this is a sensible default;
> >> >> > cmason wants the no-sync version of snapshot creation to be available,
> >> >> > but was amenable to the idea of changing the default to be sync before
> >> >> > snapshot, since it was pointed out that no-one other than him had
> >> >> > understood we were supposed to be running sync first.
> >> >> >
> >> >> You're saying that it only snapshots the on-disk data structures and
> >> >> not the in-memory versions?  That can only lead to pain.  What do you
> >> >> do if something else during this race condition?  What would a sync do
> >> >> to solve this?  Have the semantics of sync been changed in btrfs from
> >> >> "sync everything that hasn't been written yet" to "sync this
> >> >> subvolume"?
> >> >>
> >> >
> >> > Welcome to delalloc.  You either get fast writes or you get all of your data on
> >> > the disk every 5 seconds.  If you don't like delalloc, use ext3.  The data
> >> > you've written to memory doesn't go down to disk unless explicitly told to, such
> >> > as
> >> >
> >> > 1) fsync - this is obvious
> >> > 2) vm - the vm has decided that this dirty page has been sitting around long
> >> > enough and should be written back to the disk, could happen now, could happen 10
> >> > years from now.
> >> > 3) sync - this is not as obvious.  sync doesn't mean anything other than "start
> >> > writing back dirty data to the fs", and returns before it's done.  For btrfs
> >> > what that means is we run through _every_ inode that has delalloc pages
> >> > associated with them and start writeback on them.  This will get most of your
> >> > data into the current transaction, which is when the snapshot happens.
> >> >
> >> > If you don't want empty files, do something like this
> >> >
> >> > btrfsctl -c /dir/to/volume
> >> > btrfsctl -s /dir/to/volume/snapshotname /dir/to/volume
> >> >
> >> > this is what we do with yum and its rollback plugin, and it works out quite
> >> > well.  Thanks,
> >> >
> >>
> >> Then you broke your ordering guarantee.  If the data isn't there, the
> >> meta-data shouldn't be there either.  So the snapshots made before the
> >> data hits a transaction shouldn't have the file at all.
> >
> > Nope, what is happening is
> >
> > fd = creat("file")  <- this is metadata that needs to be written
> > write(fd, buf)      <- because of delalloc there is no metadata that is created
> > for this operation, therefore it doesn't need to be written out.
> > close(fd)
> >
> > so the file has metadata created for it, which needs to be written out.  Because
> > of delalloc there are no extents created or anything for the data, therefore
> > there is nothing to write.  Thanks,
> >
>
> So file creation is effectively synchronous?  So I could create a
> benchmark that creates millions of files and it would be limited to
> the IO OP performance of the disks?
>
> Why does file creation need to hit the disk before the contents (with
> limits to size of data that can fit in one transaction)?

File creation isn't synchronous, it just modifies metadata, which needs to be
committed when the transaction commits.  So if you create millions of files you
are going to be held up every 30 seconds as the transaction commits and writes
all the files you were able to create within that 30 seconds, same as _any_
filesystem that does ordered mode.
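
A toy model of the creat/write/close sequence quoted above may make the
behaviour easier to see.  This is only a sketch in Python: the ToyFs class and
all of its methods are hypothetical names invented for illustration and do not
correspond to btrfs internals.  It assumes, as described in this thread, that
file creation joins the current transaction immediately, that a delalloc write
only dirties memory, and that a snapshot captures whatever the transaction
holds at that moment.

```python
# Toy model only: ToyFs and its methods are hypothetical, not btrfs code.
from dataclasses import dataclass, field


@dataclass
class ToyFs:
    # Metadata that will be written at the next transaction commit:
    # file name -> size recorded in the metadata.
    transaction: dict = field(default_factory=dict)
    # Data sitting in memory (delalloc), with no extents allocated yet:
    # file name -> pending bytes.
    dirty_pages: dict = field(default_factory=dict)

    def creat(self, name):
        # Creating a file modifies metadata, so it joins the transaction at once.
        self.transaction[name] = 0

    def write(self, name, data):
        # With delalloc the write only dirties memory; the transaction is untouched.
        self.dirty_pages[name] = self.dirty_pages.get(name, b"") + data

    def writeback(self):
        # Stand-in for fsync, sync, or VM writeback: the pending data is
        # accounted for in the metadata of the current transaction.
        for name, data in self.dirty_pages.items():
            self.transaction[name] += len(data)
        self.dirty_pages.clear()

    def snapshot(self):
        # A snapshot copies what the current transaction holds, nothing more.
        return dict(self.transaction)


fs = ToyFs()
fs.creat("foo.txt")
fs.write("foo.txt", b"x1\n")
early = fs.snapshot()   # taken before any writeback
fs.writeback()
late = fs.snapshot()    # taken after writeback
print(early)            # {'foo.txt': 0}
print(late)             # {'foo.txt': 3}
```

In the model, the early snapshot contains the file with length zero because
only the metadata had reached the transaction, which is the symptom reported
at the start of the thread.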

Creating a file is a metadata operation, and _any_ metadata operation has to be
committed to disk when the transaction commits in order to maintain a coherent
fs.  Thanks,

Josef
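
For an application that writes a file and wants its contents on disk before a
snapshot is taken, the fsync() suggestion quoted earlier in the thread could
look roughly like the Python sketch below.  The helper name write_durably is
hypothetical, and the sketch only covers the per-file fsync; taking the
snapshot itself (for example with the btrfsctl -s command shown above) is left
to the caller.

```python
import os
import tempfile


def write_durably(path, data):
    """Create path, write data to it, and fsync before closing.

    Hypothetical helper for illustration; not part of any btrfs tool.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        offset = 0
        # os.write may write fewer bytes than requested, so loop until done.
        while offset < len(data):
            offset += os.write(fd, data[offset:])
        # Ask the kernel to push this file's dirty data to disk now, rather
        # than waiting for VM writeback.
        os.fsync(fd)
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "foo.txt")
        write_durably(target, b"x1\n")
        print(os.path.getsize(target))
```

Calling fsync per file is only worthwhile when a few known files matter; for a
whole volume, the btrfsctl -c step quoted above is the approach described in
this thread.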
