* Question about writable ext4-snapshot
@ 2012-01-21  2:45 Robin Dong
  2012-01-21  4:24 ` Theodore Tso
  0 siblings, 1 reply; 7+ messages in thread
From: Robin Dong @ 2012-01-21  2:45 UTC (permalink / raw)
  To: amir73il; +Cc: Tao Ma, coly, Ext4 Developers List

Hello, Amir

I have recently been evaluating ext4-snapshot (on GitHub) for TAOBAO.
Snapshots of an ext4 fs are currently read-only, but we need to be able
to write data into a snapshot. We would also like to use ext4-snapshot
to run online fsck on our Hadoop clusters, but those clusters currently
use ext4 without a journal. So we have some questions:

1. Would it be possible to implement a writable ext4-snapshot?
2. Would it be possible to snapshot an ext4 fs without a journal?
3. What are the main difficulties in implementing the above?

Any reply will be appreciated.

-- 
Best regards,
Robin Dong


* Re: Question about writable ext4-snapshot
  2012-01-21  2:45 Question about writable ext4-snapshot Robin Dong
@ 2012-01-21  4:24 ` Theodore Tso
  2012-01-21  4:37   ` Andreas Dilger
  2012-01-21 16:09   ` Amir Goldstein
  0 siblings, 2 replies; 7+ messages in thread
From: Theodore Tso @ 2012-01-21  4:24 UTC (permalink / raw)
  To: Robin Dong; +Cc: Theodore Tso, amir73il, Tao Ma, coly, Ext4 Developers List


On Jan 20, 2012, at 9:45 PM, Robin Dong wrote:

> Hello, Amir
> 
> I have recently been evaluating ext4-snapshot (on GitHub) for TAOBAO.
> Snapshots of an ext4 fs are currently read-only, but we need to be able
> to write data into a snapshot. We would also like to use ext4-snapshot
> to run online fsck on our Hadoop clusters, but those clusters currently
> use ext4 without a journal. So we have some questions:
> 
> 1. Would it be possible to implement a writable ext4-snapshot?
> 2. Would it be possible to snapshot an ext4 fs without a journal?
> 3. What are the main difficulties in implementing the above?

Something else to consider is the device-mapper thin-provisioning approach. This approach does the snapshotting at the device-mapper layer, which means it is separate from the file system. It relies on the discard requests issued when a file is unlinked to know when blocks can be released from the snapshot. It also uses a granularity much smaller than that of traditional LVM-style snapshots.
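A toy Python model of the idea described above, for illustration only: each thin device maps virtual blocks to reference-counted blocks in a shared pool, a snapshot shares its origin's blocks, a write to a shared block allocates a new one, and a discard drops a reference so the block returns to the pool once nothing maps it. The class and method names are hypothetical; this is not dm-thin's implementation or on-disk format, and it tracks only block ownership (a real implementation would also copy the old contents when breaking sharing on a partial write).

```python
from collections import defaultdict


class ThinPool:
    """Toy thin-provisioning pool (hypothetical model, not dm-thin code)."""

    def __init__(self, nr_blocks):
        self.free = list(range(nr_blocks))  # unallocated physical blocks
        self.refs = defaultdict(int)        # physical block -> mapping count
        self.devs = {}                      # device name -> {virtual: physical}

    def create(self, name):
        self.devs[name] = {}

    def snapshot(self, origin, name):
        # Copy only the mapping table; the data blocks become shared.
        self.devs[name] = dict(self.devs[origin])
        for pblock in self.devs[name].values():
            self.refs[pblock] += 1

    def write(self, name, vblock):
        """Return the physical block that receives the write."""
        mapping = self.devs[name]
        old = mapping.get(vblock)
        if old is not None and self.refs[old] == 1:
            return old                      # exclusively owned: overwrite in place
        new = self.free.pop(0)              # IndexError if the pool is exhausted
        self.refs[new] = 1
        if old is not None:
            self._put(old)                  # break sharing (copy-on-write)
        mapping[vblock] = new
        return new

    def discard(self, name, vblock):
        # Issued by the filesystem when it frees a block, e.g. on unlink.
        old = self.devs[name].pop(vblock, None)
        if old is not None:
            self._put(old)

    def _put(self, pblock):
        self.refs[pblock] -= 1
        if self.refs[pblock] == 0:          # nothing maps it any more
            del self.refs[pblock]
            self.free.append(pblock)
```

In this model the pool has no way to learn that a block is unused until the filesystem sends a discard, which is why discard support matters for releasing space from snapshots.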

This code will still need a few months to mature (the thin-provisioning code was just merged into 3.2, but discard support isn't done yet, and the userspace support is lagging). But in the long run, this might be a very attractive way of providing multiple levels of writable snapshots in a clean and relatively simple way.

-- Ted




* Re: Question about writable ext4-snapshot
  2012-01-21  4:24 ` Theodore Tso
@ 2012-01-21  4:37   ` Andreas Dilger
  2012-01-21 16:09   ` Amir Goldstein
  1 sibling, 0 replies; 7+ messages in thread
From: Andreas Dilger @ 2012-01-21  4:37 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Robin Dong, Theodore Tso, amir73il, Tao Ma, coly, Ext4 Developers List

On Jan 20, 2012, at 9:45 PM, Robin Dong wrote:

> Hello, Amir
> 
> I have recently been evaluating ext4-snapshot (on GitHub) for TAOBAO.
> Snapshots of an ext4 fs are currently read-only, but we need to be able
> to write data into a snapshot. We would also like to use ext4-snapshot
> to run online fsck on our Hadoop clusters, but those clusters currently
> use ext4 without a journal.

When you write about online e2fsck, what exactly do you mean? It is already possible with LVM to create a read-only snapshot of a device and run a read-only e2fsck on it. This works because LVM snapshot creation is hooked into ext4 to freeze the filesystem and flush the journal before the snapshot is taken.

At this point, if the fsck is clean, then the original filesystem is also clean. This is the most common case. In the uncommon case that errors are detected on the snapshot, the filesystem would need to be taken offline to fix the problems.

By running the online fsck on the snapshot, one can be certain that the filesystem is clean and can reset the automatic checking date/mount counters.
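The procedure described above can be sketched as a short Python helper that only builds the command lines (it does not run them). The volume group, volume and snapshot names and the 1G snapshot size are placeholders; the flags are standard lvcreate(8), e2fsck(8), tune2fs(8) and lvremove(8) options, but check them against your distribution's versions before relying on this.

```python
import shlex

E2FSCK_NO_ERRORS = 0  # e2fsck(8): exit status 0 means no errors were found


def snapshot_fsck_plan(vg, lv, snap_size="1G"):
    """Commands for a read-only fsck of an LVM snapshot of /dev/<vg>/<lv>."""
    origin = f"/dev/{vg}/{lv}"
    snap = f"{lv}-fsck-snap"                  # hypothetical naming scheme
    snapdev = f"/dev/{vg}/{snap}"
    return [
        # Taking the snapshot freezes ext4 and flushes its journal first.
        ["lvcreate", "--snapshot", "--size", snap_size, "--name", snap, origin],
        # -f forces a full check; -n opens read-only and answers "no".
        ["e2fsck", "-f", "-n", snapdev],
        ["lvremove", "-f", snapdev],
    ]


def reset_check_counters(vg, lv, e2fsck_status):
    """tune2fs command resetting the origin's mount count and last-check
    time, or None if the snapshot was not clean."""
    if e2fsck_status != E2FSCK_NO_ERRORS:
        return None                           # schedule an offline fsck instead
    return ["tune2fs", "-C", "0", "-T", "now", f"/dev/{vg}/{lv}"]


if __name__ == "__main__":
    for cmd in snapshot_fsck_plan("vg0", "data"):
        print(shlex.join(cmd))
```

A cron wrapper would run the three commands in order and run the tune2fs command only if e2fsck exited with status 0; any other status means the origin filesystem should be checked offline.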

If you are thinking about online repair, that would be much more complex, but it may still be possible in some cases.

Cheers, Andreas


* Re: Question about writable ext4-snapshot
  2012-01-21  4:24 ` Theodore Tso
  2012-01-21  4:37   ` Andreas Dilger
@ 2012-01-21 16:09   ` Amir Goldstein
  2012-01-22  3:31     ` Robin Dong
  1 sibling, 1 reply; 7+ messages in thread
From: Amir Goldstein @ 2012-01-21 16:09 UTC (permalink / raw)
  To: Robin Dong
  Cc: Theodore Tso, Tao Ma, coly, Ext4 Developers List, Yongqiang Yang

On Sat, Jan 21, 2012 at 6:24 AM, Theodore Tso <tytso@mit.edu> wrote:
>
> On Jan 20, 2012, at 9:45 PM, Robin Dong wrote:
>
>> Hello, Amir
>>
>> I have recently been evaluating ext4-snapshot (on GitHub) for TAOBAO.
>> Snapshots of an ext4 fs are currently read-only, but we need to be able
>> to write data into a snapshot.
>> We would also like to use ext4-snapshot to run online fsck on our
>> Hadoop clusters, but those clusters currently use ext4 without a
>> journal. So we have some questions:
>>
>> 1. Would it be possible to implement a writable ext4-snapshot?
>> 2. Would it be possible to snapshot an ext4 fs without a journal?
>> 3. What are the main difficulties in implementing the above?
>

Hello Robin,

1. Writable snapshots (snapshot clones) are actually quite simple to
implement (a sparse file containing all changes relative to a read-only
snapshot). The real challenge is how to support snapshots of these
clones, and how to implement space reclaim efficiently (time-wise) when
deleting snapshots. Indeed, the LVM thin-provisioning target handles
space reclaim very efficiently.

2. I think it is possible, but I have never looked into it, so there may
be challenges that I haven't foreseen.
The obvious catch is that snapshots will not be reliable after a crash.
JBD ensures that metadata is not overwritten on disk before it has been
copied to the snapshot, but without a journal, after a crash, the
metadata could already have been overwritten, and you lose the original
data that was supposed to be copied to the snapshot.

3. I think I have already answered that question above, but the actual
difficulty really depends on your specific needs.
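A minimal Python sketch of the "sparse file of changes" idea from point 1, under simplifying assumptions: blocks live in dictionaries, a missing block is a hole, and the Snapshot/Clone names are hypothetical, not ext4-snapshot code. Reads check the clone's own delta first and fall through to the read-only parent.

```python
class Snapshot:
    """Read-only point-in-time image (toy model: block number -> bytes)."""

    def __init__(self, blocks):
        self._blocks = dict(blocks)

    def read(self, blk):
        return self._blocks.get(blk)  # None represents a hole


class Clone:
    """Writable clone: a sparse delta of changed blocks over a parent image."""

    def __init__(self, parent):
        self.parent = parent
        self.delta = {}               # only blocks written in the clone

    def read(self, blk):
        if blk in self.delta:
            return self.delta[blk]
        return self.parent.read(blk)  # unchanged: fall through to the parent

    def write(self, blk, data):
        self.delta[blk] = data        # the parent image is never modified


base = Snapshot({0: b"a", 1: b"b"})
clone = Clone(base)
clone.write(1, b"B")
print(clone.read(0), clone.read(1), base.read(1))  # b'a' b'B' b'b'
```

Layering another Clone on top of a clone only works if the lower clone stops changing, reads may have to walk the whole chain, and deleting an intermediate level means merging its delta into the levels above it. This hints at why snapshots of clones and space reclaim are the harder parts.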

> Something else to consider is the device-mapper thin-provisioning approach. This approach does the snapshotting at the device-mapper layer, which means it is separate from the file system. It relies on the discard requests issued when a file is unlinked to know when blocks can be released from the snapshot. It also uses a granularity much smaller than that of traditional LVM-style snapshots.
>
> This code will still need a few months to mature (the thin-provisioning code was just merged into 3.2, but discard support isn't done yet, and the userspace support is lagging). But in the long run, this might be a very attractive way of providing multiple levels of writable snapshots in a clean and relatively simple way.
>

There are some lengthy threads about LVM thinp vs. Ext4 snapshots here:
http://thread.gmane.org/gmane.comp.file-systems.ext4/25968/focus=26056
and here:
http://thread.gmane.org/gmane.comp.file-systems.ext4/26041

At the end of the day, the thinp target is a very powerful tool, but it
does not fit all use cases. In particular, it fragments the on-disk
layout of ext4 metadata, and benchmark results showing how this affects
performance have never been published.

Also, thinp needs to store quite a lot of metadata for the mapping of all
thinp blocks, and in order to keep this metadata durable without hurting
write performance, you will almost certainly need to store it on an SSD.
That is not a bad solution for a high-end server, but I'm not sure
everyone can afford it.
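For a rough sense of scale, the kernel's dm thin-provisioning documentation suggests sizing the metadata device at about 48 bytes per data block, rounded up to 2 MB if the result is smaller, and notes that heavy snapshot use may need more. The sketch below just does that arithmetic (the function name is hypothetical); it says nothing about SSD versus rotating-disk performance.

```python
KiB, MiB, GiB = 1 << 10, 1 << 20, 1 << 30


def thin_metadata_estimate(data_dev_bytes, data_block_bytes):
    """Rule-of-thumb dm-thin metadata size in bytes."""
    # dm-thin data blocks range from 64 KiB to 1 GiB, in multiples of 64 KiB.
    if not 64 * KiB <= data_block_bytes <= GiB or data_block_bytes % (64 * KiB):
        raise ValueError("data block size must be a multiple of 64 KiB in 64 KiB..1 GiB")
    estimate = 48 * data_dev_bytes // data_block_bytes  # ~48 bytes per data block
    return max(estimate, 2 * MiB)                        # never below 2 MiB


# 1 TiB pool with 64 KiB blocks: 2**24 blocks * 48 bytes = 768 MiB.
print(thin_metadata_estimate(1 << 40, 64 * KiB) // MiB)  # 768
```

Larger data blocks shrink the mapping, but they also make sharing and copy-on-write coarser.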

Amir.
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Question about writable ext4-snapshot
  2012-01-21 16:09   ` Amir Goldstein
@ 2012-01-22  3:31     ` Robin Dong
  2012-01-23  3:21       ` Ted Ts'o
  0 siblings, 1 reply; 7+ messages in thread
From: Robin Dong @ 2012-01-22  3:31 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Theodore Tso, Tao Ma, coly, Ext4 Developers List, Yongqiang Yang

Thanks for all your suggestions!
I will evaluate both thin provisioning and ext4-snapshot later.

-- 
Best regards,
Robin Dong


* Re: Question about writable ext4-snapshot
  2012-01-22  3:31     ` Robin Dong
@ 2012-01-23  3:21       ` Ted Ts'o
  2012-01-23 20:08         ` Amir Goldstein
  0 siblings, 1 reply; 7+ messages in thread
From: Ted Ts'o @ 2012-01-23  3:21 UTC (permalink / raw)
  To: Robin Dong
  Cc: Amir Goldstein, Tao Ma, coly, Ext4 Developers List, Yongqiang Yang

On Sun, Jan 22, 2012 at 11:31:31AM +0800, Robin Dong wrote:
> > At the end of the day, the thinp target is a very powerful tool, but
> > it does not fit all use cases. In particular, it fragments the
> > on-disk layout of ext4 metadata, and benchmark results showing how
> > this affects performance have never been published.

Amir,

Well, to be fair, your approach to snapshotting also causes
fragmentation.  If a file or a directory in the base image gets
modified while there is a read-only snapshot, the inode in the base
image gets fragmented as a result.
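A toy illustration of the effect described above, assuming a hypothetical policy in which a block rewritten while a snapshot exists gets a newly allocated location in the base file (the old location stays with the snapshot). Counting contiguous runs in the file's logical-to-physical block map shows a single rewrite turning one extent into three.

```python
def extents(block_map):
    """Count contiguous physical runs in a logical->physical block map."""
    runs, prev = 0, None
    for phys in block_map:
        if prev is None or phys != prev + 1:
            runs += 1                 # a new extent starts here
        prev = phys
    return runs


def relocate_on_write(block_map, logical, new_phys):
    """Hypothetical policy: the snapshot keeps the old physical block and the
    live file's logical block is remapped to a newly allocated one."""
    updated = list(block_map)
    updated[logical] = new_phys
    return updated


live = list(range(100, 108))              # one contiguous 8-block extent
print(extents(live))                      # 1
after = relocate_on_write(live, 3, 500)   # rewrite logical block 3
print(extents(after))                     # 3: 100-102, 500, 104-107
```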

It is true that thin provisioning in general tends to defeat the block
placement algorithms used by a file system, but it will be possible to
create snapshots of non-thinp volumes, which will address this issue.
Hopefully in the next 3-6 months, these things will be implemented
enough so that we can benchmark them and see for certain how well or
poorly this approach will work out.  I'm sure there will be a certain
number of tradeoffs for both approaches.

Regards,

					- Ted


* Re: Question about writable ext4-snapshot
  2012-01-23  3:21       ` Ted Ts'o
@ 2012-01-23 20:08         ` Amir Goldstein
  0 siblings, 0 replies; 7+ messages in thread
From: Amir Goldstein @ 2012-01-23 20:08 UTC (permalink / raw)
  To: Ted Ts'o
  Cc: Robin Dong, Tao Ma, coly, Ext4 Developers List, Yongqiang Yang

On Mon, Jan 23, 2012 at 5:21 AM, Ted Ts'o <tytso@mit.edu> wrote:
> On Sun, Jan 22, 2012 at 11:31:31AM +0800, Robin Dong wrote:
>> > At the end of the day, the thinp target is a very powerful tool, but
>> > it does not fit all use cases. In particular, it fragments the
>> > on-disk layout of ext4 metadata, and benchmark results showing how
>> > this affects performance have never been published.
>
> Amir,
>
> Well, to be fair, your approach to snapshotting also causes
> fragmentation.  If a file or a directory in the base image gets
> modified while there is a read-only snapshot, the inode in the base
> image gets fragmented as a result.

Yes, that's true, to some extent. Directory inodes, however, do not get
fragmented: all journaled metadata is copied aside via the JBD hooks.
My claim was about fragmentation of ext4 metadata, but fragmentation
of data is a problem in both approaches.
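A simplified Python model of copying metadata aside, assuming a hook that runs before each in-place change (standing in for the role JBD hooks play in ext4-snapshot); the class and method names are hypothetical. The original contents are saved into the snapshot on the first change only, and the live block keeps its location, so this path does not fragment the base image.

```python
class CopyAsideSnapshot:
    """Toy model: save a block's original contents before its first change."""

    def __init__(self, disk):
        self.disk = disk              # live filesystem: block number -> bytes
        self.saved = {}               # originals of blocks changed since the snapshot

    def before_modify(self, blk):
        # Hypothetical hook, standing in for the JBD hook in ext4-snapshot.
        if blk not in self.saved:     # copy only on the first change
            self.saved[blk] = self.disk[blk]

    def modify(self, blk, data):
        self.before_modify(blk)
        self.disk[blk] = data         # overwrite in place; location unchanged

    def read_snapshot(self, blk):
        if blk in self.saved:
            return self.saved[blk]
        return self.disk[blk]         # never changed: shared with the live fs


disk = {7: b"old", 8: b"same"}
snap = CopyAsideSnapshot(disk)
snap.modify(7, b"new")
snap.modify(7, b"newer")
print(snap.read_snapshot(7), snap.read_snapshot(8), disk[7])  # b'old' b'same' b'newer'
```

The model ignores durability. On a real disk the saved copy must be persistent before the overwrite reaches the disk; with a journal, the transaction commit provides that ordering, whereas without one a crash in between can lose the original contents, which is the no-journal problem raised earlier in the thread.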

>
> It is true that thin provisioning in general tends to defeat the block
> placement algorithms used by a file system, but it will be possible to
> create snapshots of non-thinp volumes, which will address this issue.
> Hopefully in the next 3-6 months, these things will be implemented
> enough so that we can benchmark them and see for certain how well or
> poorly this approach will work out.  I'm sure there will be a certain
> number of tradeoffs for both approaches.
>
> Regards,
>
>                                        - Ted