linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* zero-length files in snapshots
@ 2010-02-12  1:49 Nickolai Zeldovich
  2010-02-12  3:11 ` Chris Ball
  0 siblings, 1 reply; 17+ messages in thread
From: Nickolai Zeldovich @ 2010-02-12  1:49 UTC (permalink / raw)
  To: linux-btrfs

I often get zero-length files in btrfs snapshots (when the original
files were not zero-length).  The shell script below reproduces this
problem on two Ubuntu machines, with Ubuntu kernels 2.6.31-17.54 and
2.6.32-12.17.  Is there some mistaken assumption I'm making here in
terms of how btrfsctl works?

Nickolai.

---

root@sahara:/# cat /tmp/btrbug.sh
#!/bin/sh -x
if id | grep -qv uid=0; then
    echo "Must run setup as root"
    exit 1
fi

if losetup -a | grep -q /dev/shm/fs.img; then
    echo "Loopback FS mounted, unmounting.."
    umount /mnt/x || exit 2
fi

rmmod btrfs
rmmod zlib_deflate
rmmod libcrc32c
modprobe btrfs

dd if=/dev/zero of=/dev/shm/fs.img bs=1024k count=256 || exit 2
mkfs -t btrfs /dev/shm/fs.img || exit 2

mkdir -p /mnt/x || exit 2
mount -o loop -t btrfs /dev/shm/fs.img /mnt/x || exit 2

mkdir /mnt/x/d || exit 2
echo x1 > /mnt/x/d/foo.txt || exit 2
btrfsctl -s /mnt/x/snap /mnt/x/d

wc -l /mnt/x/d/foo.txt
wc -l /mnt/x/snap/d/foo.txt

root@sahara:/# /tmp/btrbug.sh
+ id
+ grep -qv uid=0
+ losetup -a
+ grep -q /dev/shm/fs.img
+ echo Loopback FS mounted, unmounting..
Loopback FS mounted, unmounting..
+ umount /mnt/x
+ rmmod btrfs
+ rmmod zlib_deflate
+ rmmod libcrc32c
+ modprobe btrfs
+ dd if=/dev/zero of=/dev/shm/fs.img bs=1024k count=256
256+0 records in
256+0 records out
268435456 bytes (268 MB) copied, 0.231684 s, 1.2 GB/s
+ mkfs -t btrfs /dev/shm/fs.img

WARNING! - Btrfs Btrfs v0.19 IS EXPERIMENTAL
WARNING! - see http://btrfs.wiki.kernel.org before using

fs created label (null) on /dev/shm/fs.img
	nodesize 4096 leafsize 4096 sectorsize 4096 size 256.00MB
Btrfs Btrfs v0.19
+ mkdir -p /mnt/x
+ mount -o loop -t btrfs /dev/shm/fs.img /mnt/x
+ mkdir /mnt/x/d
+ echo x1
+ btrfsctl -s /mnt/x/snap /mnt/x/d
operation complete
Btrfs Btrfs v0.19
+ wc -l /mnt/x/d/foo.txt
1 /mnt/x/d/foo.txt
+ wc -l /mnt/x/snap/d/foo.txt
0 /mnt/x/snap/d/foo.txt
root@sahara:/#

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: zero-length files in snapshots
  2010-02-12  1:49 zero-length files in snapshots Nickolai Zeldovich
@ 2010-02-12  3:11 ` Chris Ball
  2010-02-12  4:50   ` Mike Fedyk
  0 siblings, 1 reply; 17+ messages in thread
From: Chris Ball @ 2010-02-12  3:11 UTC (permalink / raw)
  To: Nickolai Zeldovich; +Cc: linux-btrfs

Hi,

   > I often get zero-length files in btrfs snapshots (when the
   > original files were not zero-length).  The shell script below
   > reproduces this problem on two Ubuntu machines, with Ubuntu
   > kernels 2.6.31-17.54 and 2.6.32-12.17.  Is there some mistaken
   > assumption I'm making here in terms of how btrfsctl works?
   > [..]
   > 
   > echo x1 > /mnt/x/d/foo.txt || exit 2
   > btrfsctl -s /mnt/x/snap /mnt/x/d

You're just missing a sync/fsync() between these two lines.

We argued on IRC a while ago about whether this is a sensible default;
cmason wants the no-sync version of snapshot creation to be available,
but was amenable to the idea of changing the default to be sync before
snapshot, since it was pointed out that no-one other than him had
understood we were supposed to be running sync first.

- Chris.
-- 
Chris Ball   <cjb@laptop.org>
One Laptop Per Child

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: zero-length files in snapshots
  2010-02-12  3:11 ` Chris Ball
@ 2010-02-12  4:50   ` Mike Fedyk
  2010-02-12 15:19     ` Josef Bacik
  0 siblings, 1 reply; 17+ messages in thread
From: Mike Fedyk @ 2010-02-12  4:50 UTC (permalink / raw)
  To: Chris Ball; +Cc: Nickolai Zeldovich, linux-btrfs

On Thu, Feb 11, 2010 at 7:11 PM, Chris Ball <cjb@laptop.org> wrote:
> =C2=A0 > echo x1 > /mnt/x/d/foo.txt || exit 2
> =C2=A0 > btrfsctl -s /mnt/x/snap /mnt/x/d
>
> You're just missing a sync/fsync() between these two lines.
>
> We argued on IRC a while ago about whether this is a sensible default=
;
> cmason wants the no-sync version of snapshot creation to be available=
,
> but was amenable to the idea of changing the default to be sync befor=
e
> snapshot, since it was pointed out that no-one other than him had
> understood we were supposed to be running sync first.
>
You're saying that it only snapshots the on-disk data structures and
not the in-memory versions?  That can only lead to pain.  What do you
do if something else during this race condition?  What would a sync do
to solve this?  Have the semantics of sync been changed in btrfs from
"sync everything that hasn't been written yet" to "sync this
subvolume"?

=46rom what I understand what should be happening is much like what LVM=
 should do:

step 1: defer all other writes to subvolume (userspace processes get
stuck in D state until step 4)
step 2: sync all changes not already committed to subvolume
step 3: create snapshot
step 4: resume writes from userspace

Now if all 4 steps can be done with in-memory data structures without
forcing data (not necessarily meta-data) to disk, so much the better.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" =
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: zero-length files in snapshots
  2010-02-12  4:50   ` Mike Fedyk
@ 2010-02-12 15:19     ` Josef Bacik
  2010-02-12 16:18       ` Mike Fedyk
                         ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Josef Bacik @ 2010-02-12 15:19 UTC (permalink / raw)
  To: Mike Fedyk; +Cc: Chris Ball, Nickolai Zeldovich, linux-btrfs

On Thu, Feb 11, 2010 at 08:50:48PM -0800, Mike Fedyk wrote:
> On Thu, Feb 11, 2010 at 7:11 PM, Chris Ball <cjb@laptop.org> wrote:
> > =A0 > echo x1 > /mnt/x/d/foo.txt || exit 2
> > =A0 > btrfsctl -s /mnt/x/snap /mnt/x/d
> >
> > You're just missing a sync/fsync() between these two lines.
> >
> > We argued on IRC a while ago about whether this is a sensible defau=
lt;
> > cmason wants the no-sync version of snapshot creation to be availab=
le,
> > but was amenable to the idea of changing the default to be sync bef=
ore
> > snapshot, since it was pointed out that no-one other than him had
> > understood we were supposed to be running sync first.
> >
> You're saying that it only snapshots the on-disk data structures and
> not the in-memory versions?  That can only lead to pain.  What do you
> do if something else during this race condition?  What would a sync d=
o
> to solve this?  Have the semantics of sync been changed in btrfs from
> "sync everything that hasn't been written yet" to "sync this
> subvolume"?
>=20

Welcome to delalloc.  You either get fast writes or you get all of your=
 data on
the disk every 5 seconds.  If you don't like delalloc, use ext3.  The d=
ata
you've written to memory doesn't go down to disk unless explicitly told=
 to, such
as

1) fsync - this is obvious
2) vm - the vm has decided that this dirty page has been sitting around=
 long
enough and should be written back to the disk, could happen now, could =
happen 10
years from now.
3) sync - this is not as obvious.  sync doesn't mean anything than "sta=
rt
writing back dirty data to the fs", and returns before it's done.  For =
btrfs
what that means is we run through _every_ inode that has delalloc pages
associated with them and start writeback on them.  This will get most o=
f your
data into the current transaction, which is when the snapshot happens.

If you don't want empty files, do something like this

btrfsctl -c /dir/to/volume
btrfsctl -s /dir/to/volume/snapshotname /dir/to/volume

this is what we do with yum and its rollback plugin, and it works out q=
uite
well.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" =
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: zero-length files in snapshots
  2010-02-12 15:19     ` Josef Bacik
@ 2010-02-12 16:18       ` Mike Fedyk
  2010-02-12 16:22         ` Josef Bacik
  2010-02-12 18:22       ` Ravi Pinjala
  2010-02-12 19:10       ` Christoph Hellwig
  2 siblings, 1 reply; 17+ messages in thread
From: Mike Fedyk @ 2010-02-12 16:18 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Chris Ball, Nickolai Zeldovich, linux-btrfs

On Fri, Feb 12, 2010 at 7:19 AM, Josef Bacik <josef@redhat.com> wrote:
> On Thu, Feb 11, 2010 at 08:50:48PM -0800, Mike Fedyk wrote:
>> On Thu, Feb 11, 2010 at 7:11 PM, Chris Ball <cjb@laptop.org> wrote:
>> > =C2=A0 > echo x1 > /mnt/x/d/foo.txt || exit 2
>> > =C2=A0 > btrfsctl -s /mnt/x/snap /mnt/x/d
>> >
>> > You're just missing a sync/fsync() between these two lines.
>> >
>> > We argued on IRC a while ago about whether this is a sensible defa=
ult;
>> > cmason wants the no-sync version of snapshot creation to be availa=
ble,
>> > but was amenable to the idea of changing the default to be sync be=
fore
>> > snapshot, since it was pointed out that no-one other than him had
>> > understood we were supposed to be running sync first.
>> >
>> You're saying that it only snapshots the on-disk data structures and
>> not the in-memory versions? =C2=A0That can only lead to pain. =C2=A0=
What do you
>> do if something else during this race condition? =C2=A0What would a =
sync do
>> to solve this? =C2=A0Have the semantics of sync been changed in btrf=
s from
>> "sync everything that hasn't been written yet" to "sync this
>> subvolume"?
>>
>
> Welcome to delalloc. =C2=A0You either get fast writes or you get all =
of your data on
> the disk every 5 seconds. =C2=A0If you don't like delalloc, use ext3.=
 =C2=A0The data
> you've written to memory doesn't go down to disk unless explicitly to=
ld to, such
> as
>
> 1) fsync - this is obvious
> 2) vm - the vm has decided that this dirty page has been sitting arou=
nd long
> enough and should be written back to the disk, could happen now, coul=
d happen 10
> years from now.
> 3) sync - this is not as obvious. =C2=A0sync doesn't mean anything th=
an "start
> writing back dirty data to the fs", and returns before it's done. =C2=
=A0For btrfs
> what that means is we run through _every_ inode that has delalloc pag=
es
> associated with them and start writeback on them. =C2=A0This will get=
 most of your
> data into the current transaction, which is when the snapshot happens=
=2E
>
> If you don't want empty files, do something like this
>
> btrfsctl -c /dir/to/volume
> btrfsctl -s /dir/to/volume/snapshotname /dir/to/volume
>
> this is what we do with yum and its rollback plugin, and it works out=
 quite
> well. =C2=A0Thanks,
>

Then you broke your ordering guarantee.  If the data isn't there, the
meta-data shouldn't be there either.  So the snapshots made before the
data hits a transaction shouldn't have the file at all.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" =
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: zero-length files in snapshots
  2010-02-12 16:18       ` Mike Fedyk
@ 2010-02-12 16:22         ` Josef Bacik
  2010-02-12 16:27           ` Mike Fedyk
  0 siblings, 1 reply; 17+ messages in thread
From: Josef Bacik @ 2010-02-12 16:22 UTC (permalink / raw)
  To: Mike Fedyk; +Cc: Josef Bacik, Chris Ball, Nickolai Zeldovich, linux-btrfs

On Fri, Feb 12, 2010 at 08:18:01AM -0800, Mike Fedyk wrote:
> On Fri, Feb 12, 2010 at 7:19 AM, Josef Bacik <josef@redhat.com> wrote=
:
> > On Thu, Feb 11, 2010 at 08:50:48PM -0800, Mike Fedyk wrote:
> >> On Thu, Feb 11, 2010 at 7:11 PM, Chris Ball <cjb@laptop.org> wrote=
:
> >> > =A0 > echo x1 > /mnt/x/d/foo.txt || exit 2
> >> > =A0 > btrfsctl -s /mnt/x/snap /mnt/x/d
> >> >
> >> > You're just missing a sync/fsync() between these two lines.
> >> >
> >> > We argued on IRC a while ago about whether this is a sensible de=
fault;
> >> > cmason wants the no-sync version of snapshot creation to be avai=
lable,
> >> > but was amenable to the idea of changing the default to be sync =
before
> >> > snapshot, since it was pointed out that no-one other than him ha=
d
> >> > understood we were supposed to be running sync first.
> >> >
> >> You're saying that it only snapshots the on-disk data structures a=
nd
> >> not the in-memory versions? =A0That can only lead to pain. =A0What=
 do you
> >> do if something else during this race condition? =A0What would a s=
ync do
> >> to solve this? =A0Have the semantics of sync been changed in btrfs=
 from
> >> "sync everything that hasn't been written yet" to "sync this
> >> subvolume"?
> >>
> >
> > Welcome to delalloc. =A0You either get fast writes or you get all o=
f your data on
> > the disk every 5 seconds. =A0If you don't like delalloc, use ext3. =
=A0The data
> > you've written to memory doesn't go down to disk unless explicitly =
told to, such
> > as
> >
> > 1) fsync - this is obvious
> > 2) vm - the vm has decided that this dirty page has been sitting ar=
ound long
> > enough and should be written back to the disk, could happen now, co=
uld happen 10
> > years from now.
> > 3) sync - this is not as obvious. =A0sync doesn't mean anything tha=
n "start
> > writing back dirty data to the fs", and returns before it's done. =A0=
=46or btrfs
> > what that means is we run through _every_ inode that has delalloc p=
ages
> > associated with them and start writeback on them. =A0This will get =
most of your
> > data into the current transaction, which is when the snapshot happe=
ns.
> >
> > If you don't want empty files, do something like this
> >
> > btrfsctl -c /dir/to/volume
> > btrfsctl -s /dir/to/volume/snapshotname /dir/to/volume
> >
> > this is what we do with yum and its rollback plugin, and it works o=
ut quite
> > well. =A0Thanks,
> >
>=20
> Then you broke your ordering guarantee.  If the data isn't there, the
> meta-data shouldn't be there either.  So the snapshots made before th=
e
> data hits a transaction shouldn't have the file at all.

Nope, what is happening is

fd =3D creat("file")  <- this is metadata that needs to be written
write(fd, buf)      <- because of delalloc there is no metadata that is=
 created
for this operation, therefore it doesn't need to be written out.
close(fd)

so the file has metadata created for it, which needs to be written out.=
  Because
of delalloc there are no extents created or anything for the data, ther=
efore
there is nothing to write.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" =
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: zero-length files in snapshots
  2010-02-12 16:22         ` Josef Bacik
@ 2010-02-12 16:27           ` Mike Fedyk
  2010-02-12 16:32             ` Josef Bacik
  0 siblings, 1 reply; 17+ messages in thread
From: Mike Fedyk @ 2010-02-12 16:27 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Chris Ball, Nickolai Zeldovich, linux-btrfs

On Fri, Feb 12, 2010 at 8:22 AM, Josef Bacik <josef@redhat.com> wrote:
> On Fri, Feb 12, 2010 at 08:18:01AM -0800, Mike Fedyk wrote:
>> On Fri, Feb 12, 2010 at 7:19 AM, Josef Bacik <josef@redhat.com> wrot=
e:
>> > On Thu, Feb 11, 2010 at 08:50:48PM -0800, Mike Fedyk wrote:
>> >> On Thu, Feb 11, 2010 at 7:11 PM, Chris Ball <cjb@laptop.org> wrot=
e:
>> >> > =C2=A0 > echo x1 > /mnt/x/d/foo.txt || exit 2
>> >> > =C2=A0 > btrfsctl -s /mnt/x/snap /mnt/x/d
>> >> >
>> >> > You're just missing a sync/fsync() between these two lines.
>> >> >
>> >> > We argued on IRC a while ago about whether this is a sensible d=
efault;
>> >> > cmason wants the no-sync version of snapshot creation to be ava=
ilable,
>> >> > but was amenable to the idea of changing the default to be sync=
 before
>> >> > snapshot, since it was pointed out that no-one other than him h=
ad
>> >> > understood we were supposed to be running sync first.
>> >> >
>> >> You're saying that it only snapshots the on-disk data structures =
and
>> >> not the in-memory versions? =C2=A0That can only lead to pain. =C2=
=A0What do you
>> >> do if something else during this race condition? =C2=A0What would=
 a sync do
>> >> to solve this? =C2=A0Have the semantics of sync been changed in b=
trfs from
>> >> "sync everything that hasn't been written yet" to "sync this
>> >> subvolume"?
>> >>
>> >
>> > Welcome to delalloc. =C2=A0You either get fast writes or you get a=
ll of your data on
>> > the disk every 5 seconds. =C2=A0If you don't like delalloc, use ex=
t3. =C2=A0The data
>> > you've written to memory doesn't go down to disk unless explicitly=
 told to, such
>> > as
>> >
>> > 1) fsync - this is obvious
>> > 2) vm - the vm has decided that this dirty page has been sitting a=
round long
>> > enough and should be written back to the disk, could happen now, c=
ould happen 10
>> > years from now.
>> > 3) sync - this is not as obvious. =C2=A0sync doesn't mean anything=
 than "start
>> > writing back dirty data to the fs", and returns before it's done. =
=C2=A0For btrfs
>> > what that means is we run through _every_ inode that has delalloc =
pages
>> > associated with them and start writeback on them. =C2=A0This will =
get most of your
>> > data into the current transaction, which is when the snapshot happ=
ens.
>> >
>> > If you don't want empty files, do something like this
>> >
>> > btrfsctl -c /dir/to/volume
>> > btrfsctl -s /dir/to/volume/snapshotname /dir/to/volume
>> >
>> > this is what we do with yum and its rollback plugin, and it works =
out quite
>> > well. =C2=A0Thanks,
>> >
>>
>> Then you broke your ordering guarantee. =C2=A0If the data isn't ther=
e, the
>> meta-data shouldn't be there either. =C2=A0So the snapshots made bef=
ore the
>> data hits a transaction shouldn't have the file at all.
>
> Nope, what is happening is
>
> fd =3D creat("file") =C2=A0<- this is metadata that needs to be writt=
en
> write(fd, buf) =C2=A0 =C2=A0 =C2=A0<- because of delalloc there is no=
 metadata that is created
> for this operation, therefore it doesn't need to be written out.
> close(fd)
>
> so the file has metadata created for it, which needs to be written ou=
t. =C2=A0Because
> of delalloc there are no extents created or anything for the data, th=
erefore
> there is nothing to write. =C2=A0Thanks,
>

So file creation is effectively synchronous?  So I could create a
benchmark that creates millions of files and it would be limited to
the IO OP performance of the disks?

Why does file creation need to hit the disk before the contents (with
limits to size of data that can fit in one transaction)?
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" =
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: zero-length files in snapshots
  2010-02-12 16:27           ` Mike Fedyk
@ 2010-02-12 16:32             ` Josef Bacik
  2010-02-12 17:13               ` Mike Fedyk
  0 siblings, 1 reply; 17+ messages in thread
From: Josef Bacik @ 2010-02-12 16:32 UTC (permalink / raw)
  To: Mike Fedyk; +Cc: Josef Bacik, Chris Ball, Nickolai Zeldovich, linux-btrfs

On Fri, Feb 12, 2010 at 08:27:00AM -0800, Mike Fedyk wrote:
> On Fri, Feb 12, 2010 at 8:22 AM, Josef Bacik <josef@redhat.com> wrote=
:
> > On Fri, Feb 12, 2010 at 08:18:01AM -0800, Mike Fedyk wrote:
> >> On Fri, Feb 12, 2010 at 7:19 AM, Josef Bacik <josef@redhat.com> wr=
ote:
> >> > On Thu, Feb 11, 2010 at 08:50:48PM -0800, Mike Fedyk wrote:
> >> >> On Thu, Feb 11, 2010 at 7:11 PM, Chris Ball <cjb@laptop.org> wr=
ote:
> >> >> > =A0 > echo x1 > /mnt/x/d/foo.txt || exit 2
> >> >> > =A0 > btrfsctl -s /mnt/x/snap /mnt/x/d
> >> >> >
> >> >> > You're just missing a sync/fsync() between these two lines.
> >> >> >
> >> >> > We argued on IRC a while ago about whether this is a sensible=
 default;
> >> >> > cmason wants the no-sync version of snapshot creation to be a=
vailable,
> >> >> > but was amenable to the idea of changing the default to be sy=
nc before
> >> >> > snapshot, since it was pointed out that no-one other than him=
 had
> >> >> > understood we were supposed to be running sync first.
> >> >> >
> >> >> You're saying that it only snapshots the on-disk data structure=
s and
> >> >> not the in-memory versions? =A0That can only lead to pain. =A0W=
hat do you
> >> >> do if something else during this race condition? =A0What would =
a sync do
> >> >> to solve this? =A0Have the semantics of sync been changed in bt=
rfs from
> >> >> "sync everything that hasn't been written yet" to "sync this
> >> >> subvolume"?
> >> >>
> >> >
> >> > Welcome to delalloc. =A0You either get fast writes or you get al=
l of your data on
> >> > the disk every 5 seconds. =A0If you don't like delalloc, use ext=
3. =A0The data
> >> > you've written to memory doesn't go down to disk unless explicit=
ly told to, such
> >> > as
> >> >
> >> > 1) fsync - this is obvious
> >> > 2) vm - the vm has decided that this dirty page has been sitting=
 around long
> >> > enough and should be written back to the disk, could happen now,=
 could happen 10
> >> > years from now.
> >> > 3) sync - this is not as obvious. =A0sync doesn't mean anything =
than "start
> >> > writing back dirty data to the fs", and returns before it's done=
=2E =A0For btrfs
> >> > what that means is we run through _every_ inode that has delallo=
c pages
> >> > associated with them and start writeback on them. =A0This will g=
et most of your
> >> > data into the current transaction, which is when the snapshot ha=
ppens.
> >> >
> >> > If you don't want empty files, do something like this
> >> >
> >> > btrfsctl -c /dir/to/volume
> >> > btrfsctl -s /dir/to/volume/snapshotname /dir/to/volume
> >> >
> >> > this is what we do with yum and its rollback plugin, and it work=
s out quite
> >> > well. =A0Thanks,
> >> >
> >>
> >> Then you broke your ordering guarantee. =A0If the data isn't there=
, the
> >> meta-data shouldn't be there either. =A0So the snapshots made befo=
re the
> >> data hits a transaction shouldn't have the file at all.
> >
> > Nope, what is happening is
> >
> > fd =3D creat("file") =A0<- this is metadata that needs to be writte=
n
> > write(fd, buf) =A0 =A0 =A0<- because of delalloc there is no metada=
ta that is created
> > for this operation, therefore it doesn't need to be written out.
> > close(fd)
> >
> > so the file has metadata created for it, which needs to be written =
out. =A0Because
> > of delalloc there are no extents created or anything for the data, =
therefore
> > there is nothing to write. =A0Thanks,
> >
>=20
> So file creation is effectively synchronous?  So I could create a
> benchmark that creates millions of files and it would be limited to
> the IO OP performance of the disks?
>=20
> Why does file creation need to hit the disk before the contents (with
> limits to size of data that can fit in one transaction)?

=46ile creation isn't synchronous, it just modifies metadata, which nee=
ds to be
committed when the transaction commits.  So if you creat millions of fi=
les you
are going to be held up every 30 seconds as the transaction commits and=
 writes
all the files you were able to create within that 30 seconds, same as _=
any_
filesystem that does ordered mode.

Creating a file is a metadata operation, and _any_ metadata operation h=
as to be
committed to disk when the transaction commits in order to maintain a c=
oherent
fs.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" =
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: zero-length files in snapshots
  2010-02-12 16:32             ` Josef Bacik
@ 2010-02-12 17:13               ` Mike Fedyk
  2010-02-13 11:25                 ` Sander
  0 siblings, 1 reply; 17+ messages in thread
From: Mike Fedyk @ 2010-02-12 17:13 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Chris Ball, Nickolai Zeldovich, linux-btrfs

On Fri, Feb 12, 2010 at 8:32 AM, Josef Bacik <josef@redhat.com> wrote:
> On Fri, Feb 12, 2010 at 08:27:00AM -0800, Mike Fedyk wrote:
>> On Fri, Feb 12, 2010 at 8:22 AM, Josef Bacik <josef@redhat.com> wrot=
e:
>> > On Fri, Feb 12, 2010 at 08:18:01AM -0800, Mike Fedyk wrote:
>> >> On Fri, Feb 12, 2010 at 7:19 AM, Josef Bacik <josef@redhat.com> w=
rote:
>> >> > On Thu, Feb 11, 2010 at 08:50:48PM -0800, Mike Fedyk wrote:
>> >> >> On Thu, Feb 11, 2010 at 7:11 PM, Chris Ball <cjb@laptop.org> w=
rote:
>> >> >> > =C2=A0 > echo x1 > /mnt/x/d/foo.txt || exit 2
>> >> >> > =C2=A0 > btrfsctl -s /mnt/x/snap /mnt/x/d
>> >> >> >
>> >> >> > You're just missing a sync/fsync() between these two lines.
>> >> >> >
>> >> >> > We argued on IRC a while ago about whether this is a sensibl=
e default;
>> >> >> > cmason wants the no-sync version of snapshot creation to be =
available,
>> >> >> > but was amenable to the idea of changing the default to be s=
ync before
>> >> >> > snapshot, since it was pointed out that no-one other than hi=
m had
>> >> >> > understood we were supposed to be running sync first.
>> >> >> >
>> >> >> You're saying that it only snapshots the on-disk data structur=
es and
>> >> >> not the in-memory versions? =C2=A0That can only lead to pain. =
=C2=A0What do you
>> >> >> do if something else during this race condition? =C2=A0What wo=
uld a sync do
>> >> >> to solve this? =C2=A0Have the semantics of sync been changed i=
n btrfs from
>> >> >> "sync everything that hasn't been written yet" to "sync this
>> >> >> subvolume"?
>> >> >>
>> >> >
>> >> > Welcome to delalloc. =C2=A0You either get fast writes or you ge=
t all of your data on
>> >> > the disk every 5 seconds. =C2=A0If you don't like delalloc, use=
 ext3. =C2=A0The data
>> >> > you've written to memory doesn't go down to disk unless explici=
tly told to, such
>> >> > as
>> >> >
>> >> > 1) fsync - this is obvious
>> >> > 2) vm - the vm has decided that this dirty page has been sittin=
g around long
>> >> > enough and should be written back to the disk, could happen now=
, could happen 10
>> >> > years from now.
>> >> > 3) sync - this is not as obvious. =C2=A0sync doesn't mean anyth=
ing than "start
>> >> > writing back dirty data to the fs", and returns before it's don=
e. =C2=A0For btrfs
>> >> > what that means is we run through _every_ inode that has delall=
oc pages
>> >> > associated with them and start writeback on them. =C2=A0This wi=
ll get most of your
>> >> > data into the current transaction, which is when the snapshot h=
appens.
>> >> >
>> >> > If you don't want empty files, do something like this
>> >> >
>> >> > btrfsctl -c /dir/to/volume
>> >> > btrfsctl -s /dir/to/volume/snapshotname /dir/to/volume
>> >> >
>> >> > this is what we do with yum and its rollback plugin, and it wor=
ks out quite
>> >> > well. =C2=A0Thanks,
>> >> >
>> >>
>> >> Then you broke your ordering guarantee. =C2=A0If the data isn't t=
here, the
>> >> meta-data shouldn't be there either. =C2=A0So the snapshots made =
before the
>> >> data hits a transaction shouldn't have the file at all.
>> >
>> > Nope, what is happening is
>> >
>> > fd =3D creat("file") =C2=A0<- this is metadata that needs to be wr=
itten
>> > write(fd, buf) =C2=A0 =C2=A0 =C2=A0<- because of delalloc there is=
 no metadata that is created
>> > for this operation, therefore it doesn't need to be written out.
>> > close(fd)
>> >
>> > so the file has metadata created for it, which needs to be written=
 out. =C2=A0Because
>> > of delalloc there are no extents created or anything for the data,=
 therefore
>> > there is nothing to write. =C2=A0Thanks,
>> >
>>
>> So file creation is effectively synchronous? =C2=A0So I could create=
 a
>> benchmark that creates millions of files and it would be limited to
>> the IO OP performance of the disks?
>>
>> Why does file creation need to hit the disk before the contents (wit=
h
>> limits to size of data that can fit in one transaction)?
>
> File creation isn't synchronous, it just modifies metadata, which nee=
ds to be
> committed when the transaction commits. =C2=A0So if you creat million=
s of files you
> are going to be held up every 30 seconds as the transaction commits a=
nd writes
> all the files you were able to create within that 30 seconds, same as=
 _any_
> filesystem that does ordered mode.
>
> Creating a file is a metadata operation, and _any_ metadata operation=
 has to be
> committed to disk when the transaction commits in order to maintain a=
 coherent
> fs. =C2=A0Thanks,
>

Thanks, I understand better now.

What I still don't understand though is that the create could have
taken up to 30 seconds to commit and the same for the few bytes of
data, but a few ms later a snapshot was made and the metadata change
was there and the data change was not.  Could it have happened that
the snapshot would not have the newly created file and this was just a
timing issue that should not be relied upon?

I'm just wondering why that file was there at all.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" =
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: zero-length files in snapshots
  2010-02-12 15:19     ` Josef Bacik
  2010-02-12 16:18       ` Mike Fedyk
@ 2010-02-12 18:22       ` Ravi Pinjala
  2010-02-12 18:45         ` Josef Bacik
  2010-02-12 19:03         ` Chris Ball
  2010-02-12 19:10       ` Christoph Hellwig
  2 siblings, 2 replies; 17+ messages in thread
From: Ravi Pinjala @ 2010-02-12 18:22 UTC (permalink / raw)
  To: linux-btrfs

On 02/12/10 09:19, Josef Bacik wrote:
> On Thu, Feb 11, 2010 at 08:50:48PM -0800, Mike Fedyk wrote:
>> On Thu, Feb 11, 2010 at 7:11 PM, Chris Ball<cjb@laptop.org>  wrote:
>>>    >  echo x1>  /mnt/x/d/foo.txt || exit 2
>>>    >  btrfsctl -s /mnt/x/snap /mnt/x/d
>>>
>>> You're just missing a sync/fsync() between these two lines.
>>>
>>> We argued on IRC a while ago about whether this is a sensible default;
>>> cmason wants the no-sync version of snapshot creation to be available,
>>> but was amenable to the idea of changing the default to be sync before
>>> snapshot, since it was pointed out that no-one other than him had
>>> understood we were supposed to be running sync first.
>>>
>> You're saying that it only snapshots the on-disk data structures and
>> not the in-memory versions?  That can only lead to pain.  What do you
>> do if something else during this race condition?  What would a sync do
>> to solve this?  Have the semantics of sync been changed in btrfs from
>> "sync everything that hasn't been written yet" to "sync this
>> subvolume"?
>>
>
> Welcome to delalloc.  You either get fast writes or you get all of your data on
> the disk every 5 seconds.  If you don't like delalloc, use ext3.  The data
> you've written to memory doesn't go down to disk unless explicitly told to, such
> as
>
> 1) fsync - this is obvious
> 2) vm - the vm has decided that this dirty page has been sitting around long
> enough and should be written back to the disk, could happen now, could happen 10
> years from now.
> 3) sync - this is not as obvious.  sync doesn't mean anything than "start
> writing back dirty data to the fs", and returns before it's done.  For btrfs
> what that means is we run through _every_ inode that has delalloc pages
> associated with them and start writeback on them.  This will get most of your
> data into the current transaction, which is when the snapshot happens.
>
> If you don't want empty files, do something like this
>
> btrfsctl -c /dir/to/volume
> btrfsctl -s /dir/to/volume/snapshotname /dir/to/volume
>
> this is what we do with yum and its rollback plugin, and it works out quite
> well.  Thanks,
>
> Josef
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

Is there a race in there? It seems like if a process starts modifying a 
file between the sync and the snapshot, data could still be lost. Is 
there something else going on here that I'm missing that would prevent 
this race?

--Ravi

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: zero-length files in snapshots
  2010-02-12 18:22       ` Ravi Pinjala
@ 2010-02-12 18:45         ` Josef Bacik
  2010-02-12 19:03         ` Chris Ball
  1 sibling, 0 replies; 17+ messages in thread
From: Josef Bacik @ 2010-02-12 18:45 UTC (permalink / raw)
  To: Ravi Pinjala; +Cc: linux-btrfs

On Fri, Feb 12, 2010 at 12:22:12PM -0600, Ravi Pinjala wrote:
> On 02/12/10 09:19, Josef Bacik wrote:
>> On Thu, Feb 11, 2010 at 08:50:48PM -0800, Mike Fedyk wrote:
>>> On Thu, Feb 11, 2010 at 7:11 PM, Chris Ball<cjb@laptop.org>  wrote:
>>>>    >  echo x1>  /mnt/x/d/foo.txt || exit 2
>>>>    >  btrfsctl -s /mnt/x/snap /mnt/x/d
>>>>
>>>> You're just missing a sync/fsync() between these two lines.
>>>>
>>>> We argued on IRC a while ago about whether this is a sensible default;
>>>> cmason wants the no-sync version of snapshot creation to be available,
>>>> but was amenable to the idea of changing the default to be sync before
>>>> snapshot, since it was pointed out that no-one other than him had
>>>> understood we were supposed to be running sync first.
>>>>
>>> You're saying that it only snapshots the on-disk data structures and
>>> not the in-memory versions?  That can only lead to pain.  What do you
>>> do if something else during this race condition?  What would a sync do
>>> to solve this?  Have the semantics of sync been changed in btrfs from
>>> "sync everything that hasn't been written yet" to "sync this
>>> subvolume"?
>>>
>>
>> Welcome to delalloc.  You either get fast writes or you get all of your data on
>> the disk every 5 seconds.  If you don't like delalloc, use ext3.  The data
>> you've written to memory doesn't go down to disk unless explicitly told to, such
>> as
>>
>> 1) fsync - this is obvious
>> 2) vm - the vm has decided that this dirty page has been sitting around long
>> enough and should be written back to the disk, could happen now, could happen 10
>> years from now.
>> 3) sync - this is not as obvious.  sync doesn't mean anything than "start
>> writing back dirty data to the fs", and returns before it's done.  For btrfs
>> what that means is we run through _every_ inode that has delalloc pages
>> associated with them and start writeback on them.  This will get most of your
>> data into the current transaction, which is when the snapshot happens.
>>
>> If you don't want empty files, do something like this
>>
>> btrfsctl -c /dir/to/volume
>> btrfsctl -s /dir/to/volume/snapshotname /dir/to/volume
>>
>> this is what we do with yum and its rollback plugin, and it works out quite
>> well.  Thanks,
>>
>> Josef
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>
> Is there a race in there? It seems like if a process starts modifying a  
> file between the sync and the snapshot, data could still be lost. Is  
> there something else going on here that I'm missing that would prevent  
> this race?
>

Data won't be lost, it just won't be there in the snapshot, and will be there in
the source.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: zero-length files in snapshots
  2010-02-12 18:22       ` Ravi Pinjala
  2010-02-12 18:45         ` Josef Bacik
@ 2010-02-12 19:03         ` Chris Ball
  1 sibling, 0 replies; 17+ messages in thread
From: Chris Ball @ 2010-02-12 19:03 UTC (permalink / raw)
  To: Ravi Pinjala; +Cc: linux-btrfs

Hi,

   > Is there a race in there? It seems like if a process starts
   > modifying a file between the sync and the snapshot, data could
   > still be lost. Is there something else going on here that I'm
   > missing that would prevent this race?

AIUI, you're correct that a writer process concurrent to a snapshot
leads to a race with data that doesn't make it in to the snapshot, but
I think you're wrong in thinking that there's much we could do about
it -- either you miss writes between sync and snapshot, as we do now,
or we do sync-and-snapshot atomically, and the concurrent writes are
missed because we decided to block further writes from that process
before we took the snapshot.  The only real answer is to quiesce the
writer process before you begin.  Does that make sense?

- Chris.
-- 
Chris Ball   <cjb@laptop.org>
One Laptop Per Child

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: zero-length files in snapshots
  2010-02-12 15:19     ` Josef Bacik
  2010-02-12 16:18       ` Mike Fedyk
  2010-02-12 18:22       ` Ravi Pinjala
@ 2010-02-12 19:10       ` Christoph Hellwig
  2 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2010-02-12 19:10 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Mike Fedyk, Chris Ball, Nickolai Zeldovich, linux-btrfs

On Fri, Feb 12, 2010 at 10:19:40AM -0500, Josef Bacik wrote:
> 3) sync - this is not as obvious.  sync doesn't mean anything than "start
> writing back dirty data to the fs", and returns before it's done.  For btrfs
> what that means is we run through _every_ inode that has delalloc pages
> associated with them and start writeback on them.  This will get most of your
> data into the current transaction, which is when the snapshot happens.

sync does return synchronously on Linux, even if that is not guaranted
by Posix.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: zero-length files in snapshots
  2010-02-12 17:13               ` Mike Fedyk
@ 2010-02-13 11:25                 ` Sander
  2010-02-13 19:26                   ` Mike Fedyk
  0 siblings, 1 reply; 17+ messages in thread
From: Sander @ 2010-02-13 11:25 UTC (permalink / raw)
  To: Mike Fedyk; +Cc: Josef Bacik, Chris Ball, Nickolai Zeldovich, linux-btrfs

Mike Fedyk wrote (ao):
> On Fri, Feb 12, 2010 at 8:32 AM, Josef Bacik <josef@redhat.com> wrote:
> > Creating a file is a metadata operation, and _any_ metadata operation has to be
> > committed to disk when the transaction commits in order to maintain a coherent
> > fs. ??Thanks,
> 
> What I still don't understand though is that the create could have
> taken up to 30 seconds to commit and the same for the few bytes of
> data, but a few ms later a snapshot was made and the metadata change
> was there and the data change was not.  Could it have happened that
> the snapshot would not have the newly created file and this was just a
> timing issue that should not be relied upon?
> 
> I'm just wondering why that file was there at all.

I would say that is because the moment the file got created, the
resulting metadata was commited immediately. The data not yet.

	With kind regards, Sander

-- 
Humilis IT Services and Solutions
http://www.humilis.net

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: zero-length files in snapshots
  2010-02-13 11:25                 ` Sander
@ 2010-02-13 19:26                   ` Mike Fedyk
  2010-02-19 22:22                     ` Sage Weil
  0 siblings, 1 reply; 17+ messages in thread
From: Mike Fedyk @ 2010-02-13 19:26 UTC (permalink / raw)
  To: sander; +Cc: Josef Bacik, Chris Ball, Nickolai Zeldovich, linux-btrfs

On Sat, Feb 13, 2010 at 3:25 AM, Sander <sander@humilis.net> wrote:
> Mike Fedyk wrote (ao):
>> On Fri, Feb 12, 2010 at 8:32 AM, Josef Bacik <josef@redhat.com> wrot=
e:
>> > Creating a file is a metadata operation, and _any_ metadata operat=
ion has to be
>> > committed to disk when the transaction commits in order to maintai=
n a coherent
>> > fs. ??Thanks,
>>
>> What I still don't understand though is that the create could have
>> taken up to 30 seconds to commit and the same for the few bytes of
>> data, but a few ms later a snapshot was made and the metadata change
>> was there and the data change was not. =C2=A0Could it have happened =
that
>> the snapshot would not have the newly created file and this was just=
 a
>> timing issue that should not be relied upon?
>>
>> I'm just wondering why that file was there at all.
>
> I would say that is because the moment the file got created, the
> resulting metadata was commited immediately. The data not yet.
>

Josef explained it to me on IRC.  Meta-data changes like file creation
get added to the current transaction and snapshots start a new
transaction so that is why the empty file is in the snapshot.

The file is empty because with delayed allocation, the data has not
hit the filesystem yet and thus has no representation in filesystem
operations like snapshots.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" =
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: zero-length files in snapshots
  2010-02-13 19:26                   ` Mike Fedyk
@ 2010-02-19 22:22                     ` Sage Weil
  2010-02-25 18:57                       ` Goffredo Baroncelli
  0 siblings, 1 reply; 17+ messages in thread
From: Sage Weil @ 2010-02-19 22:22 UTC (permalink / raw)
  To: Mike Fedyk
  Cc: sander, Josef Bacik, Chris Ball, Nickolai Zeldovich, linux-btrfs

[-- Attachment #1: Type: TEXT/PLAIN, Size: 3658 bytes --]

On Sat, 13 Feb 2010, Mike Fedyk wrote:
> On Sat, Feb 13, 2010 at 3:25 AM, Sander <sander@humilis.net> wrote:
> > Mike Fedyk wrote (ao):
> >> On Fri, Feb 12, 2010 at 8:32 AM, Josef Bacik <josef@redhat.com> wrote:
> >> > Creating a file is a metadata operation, and _any_ metadata operation has to be
> >> > committed to disk when the transaction commits in order to maintain a coherent
> >> > fs. ??Thanks,
> >>
> >> What I still don't understand though is that the create could have
> >> taken up to 30 seconds to commit and the same for the few bytes of
> >> data, but a few ms later a snapshot was made and the metadata change
> >> was there and the data change was not.  Could it have happened that
> >> the snapshot would not have the newly created file and this was just a
> >> timing issue that should not be relied upon?
> >>
> >> I'm just wondering why that file was there at all.
> >
> > I would say that is because the moment the file got created, the
> > resulting metadata was commited immediately. The data not yet.
> 
> Josef explained it to me on IRC.  Meta-data changes like file creation
> get added to the current transaction and snapshots start a new
> transaction so that is why the empty file is in the snapshot.
> 
> The file is empty because with delayed allocation, the data has not
> hit the filesystem yet and thus has no representation in filesystem
> operations like snapshots.

You can make btrfs include the file data in the snapshot along with the 
metadata with the 'flushoncommit' mount option.  The problem is that this 
will make _all_ btrfs commits more expensive, as they'll block new 
operations during the commit while old data is being flushed out.

We could trivially make this happen only when there is a new snapshot, to 
get the behavior you expect (see patch below).  If the goal is to make a 
perfectly consistent snapshot of the file system, this is better than

	sync ; btrfsctl -s snap whatever

because there wouldn't be a window where metadata changes make it into the 
snapshot but file data does not.

Is there really a use case for the sort of 'lazy' snapshots with 
out-of-sync data and metadata (like 0-byte files)?  If so, we should add 
another ioctl for a full-blown snapshot so that users who _do_ want a 
fully consistent snapshot can get it.

If not, something like the below should be sufficient to make all 
snapshots fully consistent...

sage

---

From: Sage Weil <sage@newdream.net>
Date: Fri, 19 Feb 2010 14:13:50 -0800
Subject: [PATCH] Btrfs: flush data on snapshot creation

Flush any delalloc extents when we create a snapshot, so that recently
written file data is always included in the snapshot.

Signed-off-by: Sage Weil <sage@newdream.net>
---
 fs/btrfs/transaction.c |    5 +----
 1 files changed, 1 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index e83d4e1..f5b7029 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1084,13 +1084,10 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 
                mutex_unlock(&root->fs_info->trans_mutex);
 
-               if (flush_on_commit) {
+               if (flush_on_commit || snap_pending) {
                        btrfs_start_delalloc_inodes(root, 1);
                        ret = btrfs_wait_ordered_extents(root, 0, 1);
                        BUG_ON(ret);
-               } else if (snap_pending) {
-                       ret = btrfs_wait_ordered_extents(root, 0, 1);
-                       BUG_ON(ret);
                }
 
                /*
-- 
1.6.6.1

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: zero-length files in snapshots
  2010-02-19 22:22                     ` Sage Weil
@ 2010-02-25 18:57                       ` Goffredo Baroncelli
  0 siblings, 0 replies; 17+ messages in thread
From: Goffredo Baroncelli @ 2010-02-25 18:57 UTC (permalink / raw)
  To: linux-btrfs, Sage Weil

On Friday 19 February 2010, Sage Weil wrote:
[...]
> We could trivially make this happen only when there is a new snapshot, to 
> get the behavior you expect (see patch below).  If the goal is to make a 
> perfectly consistent snapshot of the file system, this is better than
> 
> 	sync ; btrfsctl -s snap whatever
> 
> because there wouldn't be a window where metadata changes make it into the 
> snapshot but file data does not.

I don't have the knowledge to say if your patch is good or not from a 
performance point of view, but to me, the behaviour of your patch seems a 
reasonable defaults.
I may accept that a crash can break a supposed sequence of a writing on the 
disk, so data which should be on the disk never reach the disk. But I can 
reduce the risk of this behaviour with an UPS.
Instead the fact that a snapshot may not taken the last data to me seems an 
un-acceptable behaviour.

Worse, this behaviour may lead to write code like

	do_sync(); do_snapshot();

which is difficult to optimise at the kernel level; instead if we put a sync 
before a snapshot in the core of the btrfs, even tough in the present there is 
performance problem, may have (even in a far future) a possible optimisation..

Regards
Goffredo
> Is there really a use case for the sort of 'lazy' snapshots with 
> out-of-sync data and metadata (like 0-byte files)?  If so, we should add 
> another ioctl for a full-blown snapshot so that users who _do_ want a 
> fully consistent snapshot can get it.
> 
> If not, something like the below should be sufficient to make all 
> snapshots fully consistent...
> 
> sage
> 
> ---
> 
> From: Sage Weil <sage@newdream.net>
> Date: Fri, 19 Feb 2010 14:13:50 -0800
> Subject: [PATCH] Btrfs: flush data on snapshot creation
> 
> Flush any delalloc extents when we create a snapshot, so that recently
> written file data is always included in the snapshot.
> 
> Signed-off-by: Sage Weil <sage@newdream.net>
> ---
>  fs/btrfs/transaction.c |    5 +----
>  1 files changed, 1 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index e83d4e1..f5b7029 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -1084,13 +1084,10 @@ int btrfs_commit_transaction(struct 
btrfs_trans_handle *trans,
>  
>                 mutex_unlock(&root->fs_info->trans_mutex);
>  
> -               if (flush_on_commit) {
> +               if (flush_on_commit || snap_pending) {
>                         btrfs_start_delalloc_inodes(root, 1);
>                         ret = btrfs_wait_ordered_extents(root, 0, 1);
>                         BUG_ON(ret);
> -               } else if (snap_pending) {
> -                       ret = btrfs_wait_ordered_extents(root, 0, 1);
> -                       BUG_ON(ret);
>                 }
>  
>                 /*
> -- 
> 1.6.6.1
> 


-- 
gpg key@ keyserver.linux.it: Goffredo Baroncelli (ghigo) <kreijackAtinwind.it>
Key fingerprint = 4769 7E51 5293 D36C 814E  C054 BF04 F161 3DC5 0512

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2010-02-25 18:57 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-02-12  1:49 zero-length files in snapshots Nickolai Zeldovich
2010-02-12  3:11 ` Chris Ball
2010-02-12  4:50   ` Mike Fedyk
2010-02-12 15:19     ` Josef Bacik
2010-02-12 16:18       ` Mike Fedyk
2010-02-12 16:22         ` Josef Bacik
2010-02-12 16:27           ` Mike Fedyk
2010-02-12 16:32             ` Josef Bacik
2010-02-12 17:13               ` Mike Fedyk
2010-02-13 11:25                 ` Sander
2010-02-13 19:26                   ` Mike Fedyk
2010-02-19 22:22                     ` Sage Weil
2010-02-25 18:57                       ` Goffredo Baroncelli
2010-02-12 18:22       ` Ravi Pinjala
2010-02-12 18:45         ` Josef Bacik
2010-02-12 19:03         ` Chris Ball
2010-02-12 19:10       ` Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).