* Shared snapshots
@ 2009-12-16  8:05 Mikulas Patocka
  2009-12-16 20:39 ` Mike Snitzer
  2009-12-22  9:50 ` 张宇
  0 siblings, 2 replies; 24+ messages in thread
From: Mikulas Patocka @ 2009-12-16  8:05 UTC (permalink / raw)
  To: Alasdair G Kergon; +Cc: dm-devel

Hi

I uploaded new shared snapshots here:
http://people.redhat.com/mpatocka/patches/kernel/new-snapshots/devel/

changes:

- Broken into separate patches, one patch per file. The last patch adds all 
the files to the Makefile and makes everything compile. This allows you to ack 
patches individually. I didn't use stubs to compile intermediate patches; 
the problem is that if multiple patches modify the same file, general 
modifications to the code become harder. With one patch per file I can 
edit a file with a normal text editor and quilt regenerates the patches.

- Document on-disk format in dm-multisnap-mikulas-struct.h

- More functions are commented

- Removed commented-out /*printk ... */ statements. I left some #ifdef'ed 
debug routines, as they would be non-trivial to write again

- Fixed a bug on big-endian systems

- Exposed interface for snapshots-of-snapshots, tested that they work

Mikulas


* Re: Shared snapshots
  2009-12-16  8:05 Shared snapshots Mikulas Patocka
@ 2009-12-16 20:39 ` Mike Snitzer
  2009-12-17 16:32   ` Mike Snitzer
  2010-03-30 16:26   ` Mike Snitzer
  2009-12-22  9:50 ` 张宇
  1 sibling, 2 replies; 24+ messages in thread
From: Mike Snitzer @ 2009-12-16 20:39 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: device-mapper development, Alasdair G Kergon

On Wed, Dec 16 2009 at  3:05am -0500,
Mikulas Patocka <mpatocka@redhat.com> wrote:

> Hi
> 
> I uploaded new shared snapshots here:
> http://people.redhat.com/mpatocka/patches/kernel/new-snapshots/devel/

Just a general observation:
You have an empty line separating your comment blocks above functions.
I think eliminating that empty line would help join the comment block to
the function a bit better (as is common in the rest of the kernel).

Also, a comment block of the form:
/* single line comment above function */

Is generally done as:
/*
 * single line comment above function
 */

Hopefully I'm not triggering unhappy checkpatch.pl-type thoughts with
these recommendations :)
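
For instance, a comment attached directly to its function might look like
this (a hypothetical helper, shown purely to illustrate the layout; this is
a sketch, not code from the patches):

/*
 * Count how many chunks are needed to cover 'sectors' sectors.
 * (Hypothetical helper, shown only to illustrate the style: a multi-line
 * comment block sitting directly above the function, no blank line between.)
 */
static unsigned long example_sectors_to_chunks(unsigned long sectors,
					       unsigned long chunk_sectors)
{
	return (sectors + chunk_sectors - 1) / chunk_sectors;
}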

> changes:
> 
> - Broken to separate patches, one patch per file. The last patch adds all 
> the files to Makefile and makes it all compile. It allows you to ack 
> patches individually. I didn't use stubs to compile intermediate patches, 
> the problem is that if multiple patches modify the same file, general 
> modifications to the code become harder. With one-patch-per-file I can 
> edit it with normal text editor and quilt regenerates the patches.
> 
> - Document on-disk format in dm-multisnap-mikulas-struct.h
> 
> - More functions are commented
> 
> - Removed /*printk ... */ statements. I left some #ifdefed debug routines, 
> they are non-trivial to write again

I'm still seeing one in dm_bufio_client_create()

> - Fixed a bug on big-endian systems
> 
> - Exposed interface for snapshots-of-snapshots, tested that they work

Where is that interface documented?

As an aside, I have some ideas for improving
Documentation/device-mapper/dm-multisnapshot.txt
I'll just send a patch and we can go from there.

> Mikulas

BTW, I'm getting the following warning when I build the code:

drivers/md/dm-bufio.c: In function ‘write_endio’:
drivers/md/dm-bufio.c:725: warning: value computed is not used

A cast fixes it:

diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
index 1ab5304..be7b89b 100644
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -722,7 +722,7 @@ static void write_endio(struct bio *bio, int error)
 	b->write_error = error;
 	if (unlikely(error)) {
 		struct dm_bufio_client *c = b->c;
-		cmpxchg(&c->async_write_error, 0, error);
+		(void)cmpxchg(&c->async_write_error, 0, error);
 	}
 	BUG_ON(!test_bit(B_WRITING, &b->state));
 	smp_mb__before_clear_bit();


* Re: Shared snapshots
  2009-12-16 20:39 ` Mike Snitzer
@ 2009-12-17 16:32   ` Mike Snitzer
  2010-03-30 16:26   ` Mike Snitzer
  1 sibling, 0 replies; 24+ messages in thread
From: Mike Snitzer @ 2009-12-17 16:32 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: device-mapper development, Alasdair G Kergon

On Wed, Dec 16 2009 at  3:39pm -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Wed, Dec 16 2009 at  3:05am -0500,
> Mikulas Patocka <mpatocka@redhat.com> wrote:
...
> > - Exposed interface for snapshots-of-snapshots, tested that they work
> 
> Where is that interface documented?

I see the 'dmsetup message ...' command to do so is quietly mentioned in the
"Create new snapshot" section of the documentation.

> As an aside, I have some ideas for improving
> Documentation/device-mapper/dm-multisnapshot.txt
> I'll just send a patch and we can go from there.

The inlined [TODO: ...] comments in the following patch are things I'm
looking to understand by reviewing the code; but I think they should be
answered in the Documentation.  I wanted to get you my edits sooner
rather than later.

diff --git a/Documentation/device-mapper/dm-multisnapshot.txt b/Documentation/device-mapper/dm-multisnapshot.txt
index 0dff16e..d730d3a 100644
--- a/Documentation/device-mapper/dm-multisnapshot.txt
+++ b/Documentation/device-mapper/dm-multisnapshot.txt
@@ -1,64 +1,90 @@
-This snapshot implementation has shared storage and high number of snapshots.
+Device-mapper multiple snapshot support
+=======================================
 
-The work is split to two modules:
-dm-multisnapshot.ko - the general module
-dm-store-mikulas.ko - the snapshot store
+Device-mapper allows a single copy-on-write (COW) block device to be
+shared among multiple snapshots of an origin device.  This variant of dm
+snapshot is ideal for supporting high numbers of snapshots.  It also
+supports snapshots of snapshots.
 
-The modularity allows to load other snapshot stores.
+There is a single dm target:
+multisnapshot
 
-Usage:
-Create two logical volumes, one for origin and one for snapshots.
-(assume /dev/mapper/vg1-lv1 for origin and /dev/mapper/vg1-lv2 for snapshot in
-these examples)
+and associated shared COW storage modules:
+mikulas
+daniel
 
-Clear the first sector of the snapshot volume:
-dd if=/dev/zero of=/dev/mapper/vg1-lv2 bs=4096 count=1
+[TODO: expand on the benefits/design of each store; so as to help a user
+       decide between them?]
+
+*) multisnapshot <origin> <COW device> <chunksize>
+   <# generic args> <generic args> <shared COW store type>
+   <# shared COW store args> <shared COW store args>
+   [<# snapshot ids> <snapshot ids>]
 
 Table line arguments:
-- origin device
-- shared store device
-- chunk size
-- number of generic arguments
-- generic arguments
+- <origin> : origin device
+- <COW device> : shared COW store device
+- <chunksize> : chunk size
+- <# generic args> : number of generic arguments
+- <generic args> : generic arguments
 	sync-snapshots --- synchronize snapshots according to the list
 	preserve-on-error --- halt the origin on error in the snapshot store
-- shared store type
-- number of arguments for shared store type
-- shared store arguments
+- <shared COW store type> : shared COW store type
+	mikulas --- TODO
+	daniel --- TODO
+- <# shared COW store args> : number of arguments for shared COW store type
+- <shared COW store args> : shared COW store arguments
 	cache-threshold size --- a background write is started
 	cache-limit size --- a limit for metadata cache size
-if sync-snapshots was specified
-	- number of snapshot ids
-	- snapshot ids
+If 'sync-snapshots' was specified:
+- <# snapshot ids> : number of snapshot ids
+- <snapshot ids> : snapshot ids in desired sync order
+
+
+Usage
+=====
+*) Create two logical volumes, one for origin and one for snapshots.
+(The following examples assume /dev/mapper/vg1-lv1 for origin and
+/dev/mapper/vg1-lv2 for snapshot)
+
+*) Clear the first 4 sectors of the snapshot volume:
+[TODO: I see below that the store will create a new metadata structure if
+the snapshot device were zero'd.  What if it wasn't zero'd and the
+device still has data?  Appears the ctr will error out.  So will lvm
+blindly zero any device that it is told to use for a multisnap store?]
+dd if=/dev/zero of=/dev/mapper/vg1-lv2 bs=4096 count=1
 
-Load the shared snapshot driver:
+*) Load the shared snapshot driver:
 echo 0 `blockdev --getsize /dev/mapper/vg1-lv1` multisnapshot /dev/mapper/vg1-lv1 /dev/mapper/vg1-lv2 16 0 mikulas 0|dmsetup create ms
-(16 is the chunk size in 512-byte sectors. You can place different number there)
-This creates the origin store on /dev/mapper/ms. If the store was zeroed, it
-creates new structure, otherwise it loads existing structure.
+(16 is the chunk size in 512-byte sectors. You can place a different
+number there. [TODO: what is the limit?])
+This creates the origin store on /dev/mapper/ms. If the COW store was
+zeroed, it creates new structure, otherwise it loads existing structure.
 
 Once this is done, you should no longer access /dev/mapper/vg1-lv1 and
 /dev/mapper/vg1-lv2 and only use /dev/mapper/ms.
 
-Create new snapshot:
+*) Create new snapshot:
+[TODO: what is the '0' in the following messages?]
 dmsetup message /dev/mapper/ms 0 create
-	If you want to create snapshot-of-snapshot, use
+	If you want to create snapshot-of-snapshot, use:
 	dmsetup message /dev/mapper/ms 0 create_subsnap <snapID>
 dmsetup status /dev/mapper/ms
-	(this will find out the newly created snapshot ID)
+	(this will display the newly created snapshot ID)
+	[TODO: how will that scale? Does the status output sort based on
+	creation time?  maybe show example output?]
 dmsetup suspend /dev/mapper/ms
 dmsetup resume /dev/mapper/ms
 
-Attach the snapshot:
-echo 0 `blockdev --getsize /dev/mapper/vg1-lv1` multisnap-snap /dev/mapper/vg1-lv1 0|dmsetup create ms0
-(that '0' is the snapshot id ... you can use different number)
-This attaches the snapshot '0' on /dev/mapper/ms0
+*) Attach the snapshot:
+echo 0 `blockdev --getsize /dev/mapper/vg1-lv1` multisnap-snap /dev/mapper/vg1-lv1 <snapID>|dmsetup create ms0
+This attaches the snapshot with <snapID> to /dev/mapper/ms0
 
-Delete the snapshot:
-dmsetup message /dev/mapper/ms 0 delete 0
-(the parameter after "delete" is the snapshot id)
+*) Delete the snapshot:
+dmsetup message /dev/mapper/ms 0 delete <snapID>
 
-See status:
+*) See status:
+[TODO: could use some further cleanup.. maybe show example output?]
 dmsetup status prints these information about the multisnapshot device:
 - number of arguments befor the snapshot id list (5)
 - 0 on active storage, -error number on error (-ENOSPC, -EIO, etc.)
@@ -69,9 +95,9 @@ dmsetup status prints these information about the multisnapshot device:
 - a number of snapshots
 - existing snapshot IDs
 
-Unload it:
+*) Unload it:
 dmsetup remove ms
 dmsetup remove ms0
 ... etc. (note, once you unload the origin, the snapshots become inaccessible
-- the devices exist but they return -EIO on everything)
+- the devices exist but they return -EIO when accessed)
 


* Re: Shared snapshots
  2009-12-16  8:05 Shared snapshots Mikulas Patocka
  2009-12-16 20:39 ` Mike Snitzer
@ 2009-12-22  9:50 ` 张宇
  1 sibling, 0 replies; 24+ messages in thread
From: 张宇 @ 2009-12-22  9:50 UTC (permalink / raw)
  To: device-mapper development



Can the origin device or the COW device be resized dynamically?
Once the multisnapshot device 'ms' has been created with:
echo 0 `blockdev --getsize /dev/mapper/vg1-lv1` multisnapshot /dev/mapper/vg1-lv1 /dev/mapper/vg1-lv2 16 0 mikulas 0|dmsetup create ms
can all of vg1-lv1, vg1-lv2 and ms be resized dynamically?

thank you very much!


2009/12/16 Mikulas Patocka <mpatocka@redhat.com>

> Hi
>
> I uploaded new shared snapshots here:
> http://people.redhat.com/mpatocka/patches/kernel/new-snapshots/devel/
>
> changes:
>
> - Broken to separate patches, one patch per file. The last patch adds all
> the files to Makefile and makes it all compile. It allows you to ack
> patches individually. I didn't use stubs to compile intermediate patches,
> the problem is that if multiple patches modify the same file, general
> modifications to the code become harder. With one-patch-per-file I can
> edit it with normal text editor and quilt regenerates the patches.
>
> - Document on-disk format in dm-multisnap-mikulas-struct.h
>
> - More functions are commented
>
> - Removed /*printk ... */ statements. I left some #ifdefed debug routines,
> they are non-trivial to write again
>
> - Fixed a bug on big-endian systems
>
> - Exposed interface for snapshots-of-snapshots, tested that they work
>
> Mikulas
>
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel
>






* Re: Shared snapshots
  2009-12-16 20:39 ` Mike Snitzer
  2009-12-17 16:32   ` Mike Snitzer
@ 2010-03-30 16:26   ` Mike Snitzer
  2010-03-31  3:43     ` Mikulas Patocka
  1 sibling, 1 reply; 24+ messages in thread
From: Mike Snitzer @ 2010-03-30 16:26 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: device-mapper development, Alasdair G Kergon

On Wed, Dec 16 2009 at  3:39pm -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> As an aside, I have some ideas for improving
> Documentation/device-mapper/dm-multisnapshot.txt
> I'll just send a patch and we can go from there.

OK, here is the updated dm-multisnapshot.txt (finally):

Device-mapper multiple snapshot support
=======================================

Device-mapper allows a single copy-on-write (COW) block device to be
shared among multiple snapshots of an origin device.  This variant of dm
snapshot is ideal for supporting high numbers of snapshots.

There is a single dm target for the origin device:
multisnapshot

and associated shared COW storage modules:
mikulas - supports 2^32 snapshots and 2^32 snapshots of snapshots with
	  full consistency across crashes via journaling
daniel  - only supports 64 snapshots and does not provide consistency
          through journaling

The snapshots within the shared COW use a single dm target:
multisnap-snap

*) multisnapshot <origin> <COW device> <chunksize>
   <# generic args> <generic args> <shared COW store type>
   <# shared COW store args> <shared COW store args>
   [<# snapshot ids> <snapshot ids>]

Table line arguments:
- <origin> : origin device
- <COW device> : shared COW store device
- <chunksize> : chunk size in 512b sectors
- <# generic args> : number of generic arguments
- <generic args> : generic arguments
	sync-snapshots --- synchronize snapshots according to the list
	preserve-on-error --- halt the origin on error in the snapshot store
- <shared COW store type> : shared COW store type
	mikulas --- provided by the 'dm-store-mikulas' module
	daniel --- provided by the 'dm-store-daniel' module
- <# shared COW store args> : number of arguments for shared COW store type
- <shared COW store args> : shared COW store arguments
	cache-threshold size --- a background write is started
	cache-limit size --- a limit for metadata cache size
If 'sync-snapshots' was specified:
- <# snapshot ids> : number of snapshot ids
- <snapshot ids> : snapshot ids in desired sync order


*) multisnap-snap <origin> <snapshot id>

Table line arguments:
- <origin> : origin device
- <snapshot id> : id of the snapshot within the shared store


Status output:
*) multisnapshot <# output args> <errno> <new snapid>
   <total_sectors> <sectors_allocated> <metadata_sectors>
   <# snapshot ids> <snapshot ids>

Status line output arguments:
- <# shared COW store output args> : number of output arguments before
                                     snapshot id list
- <errno> : error number associated with the first error that occurred in
            the store (e.g. -EIO), 0 means the store is active with no errors
- <new snapid> : snapshot id that will be used for next snapshot, '-' if
                 no snapshot is in the process of being created
- <total_sectors> : total size of the shared store in 512b sectors
- <sectors_allocated> : number of sectors allocated for data and metadata
- <metadata_sectors> : number of sectors allocated for metadata
- <# snapshot ids> : number of snapshot ids
- <snapshot ids> : snapshot ids for snapshots in the store


Other tunables:
*) multisnapshot (when using 'mikulas' store)
The size of the metadata cache associated with the 'mikulas' shared COW
store defaults to 2% of system memory or 25% of vmalloc memory (whichever
is lower).  The size of the metadata cache may be overridden using
the 'dm_bufio_cache_size' module parameter when loading the
'dm-store-mikulas' module.  Alternatively, the size may be changed or
queried after the module is loaded via sysfs:
/sys/module/dm_store_mikulas/parameters/dm_bufio_cache_size


DM messages:
*) multisnapshot
   - create : creates next new snapshot id, reports created id through 'status'
     (the snapshot is created once the multisnapshot is suspended)
   - create_subsnap <snapshot id> : create subsnapshot of specified snapshot
   - delete <snapshot id> : delete the specified snapshot


Usage
=====
*) Create two logical volumes, one for origin and one for snapshots.
(The following examples assume /dev/sda for origin and /dev/sdb for snapshot)

*) Clear the first 4 KiB (8 sectors) of the snapshot volume:
dd if=/dev/zero of=/dev/sdb bs=4096 count=1
(Otherwise the multisnapshot target's constructor will fail)

*) Load the shared snapshot driver:
ORIGIN_BDEV_SIZE=`blockdev --getsize /dev/sda`
echo 0 $ORIGIN_BDEV_SIZE multisnapshot /dev/sda /dev/sdb 16 0 mikulas 0 | dmsetup create ms
('16' is the chunk size in 512-byte sectors. The chunk size may range
from 1 to 1024 512-byte sectors via lvm. DM's maximum chunk size is only
limited by 32-bit integer size and available memory)

This creates the multisnapshot device on /dev/mapper/ms. If the COW
store was zeroed, it creates a new structure, otherwise it loads
existing structure.

Once this is done, you should no longer access /dev/sda and
/dev/sdb and only use /dev/mapper/ms.

*) Create new snapshot:
('0' in the following dmsetup message commands means sector arg isn't needed)
dmsetup message /dev/mapper/ms 0 create
	If you want to create snapshot-of-snapshot, use:
	dmsetup message /dev/mapper/ms 0 create_subsnap <snapID>
dmsetup status /dev/mapper/ms
	(this will display the newly created snapshot ID)
dmsetup suspend /dev/mapper/ms
dmsetup resume /dev/mapper/ms

*) Attach the snapshot:
echo 0 $ORIGIN_BDEV_SIZE multisnap-snap /dev/sda <snapID> | dmsetup create ms0
This attaches the snapshot with <snapID> to /dev/mapper/ms0

*) Delete the snapshot:
dmsetup message /dev/mapper/ms 0 delete <snapID>

*) See shared store's status:
dmsetup status /dev/mapper/ms
(multisnapshot target's status output is documented above)

*) Unload it:
dmsetup remove ms
dmsetup remove ms0
... etc. (note, once you unload the origin, the snapshots become inaccessible
- the devices exist but they return -EIO when accessed)


* Re: Shared snapshots
  2010-03-30 16:26   ` Mike Snitzer
@ 2010-03-31  3:43     ` Mikulas Patocka
  0 siblings, 0 replies; 24+ messages in thread
From: Mikulas Patocka @ 2010-03-31  3:43 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: device-mapper development, Alasdair G Kergon

OK.

I removed the notion of cache-threshold and cache-limit, because the cache 
size is now set via sysfs. (I also removed the remains of them from the code.)

Mikulas


On Tue, 30 Mar 2010, Mike Snitzer wrote:

> On Wed, Dec 16 2009 at  3:39pm -0500,
> Mike Snitzer <snitzer@redhat.com> wrote:
> 
> > As an aside, I have some ideas for improving
> > Documentation/device-mapper/dm-multisnapshot.txt
> > I'll just send a patch and we can go from there.
> 
> OK, here is the updated dm-multisnapshot.txt (finally):
> 
> Device-mapper multiple snapshot support
> =======================================
> 
> Device-mapper allows a single copy-on-write (COW) block device to be
> shared among multiple snapshots of an origin device.  This variant of dm
> snapshot is ideal for supporting high numbers of snapshots.
> 
> There is a single dm target for the origin device:
> multisnapshot
> 
> and associated shared COW storage modules:
> mikulas - supports 2^32 snapshots and 2^32 snapshots of snapshots with
> 	  full consistency across crashes via journaling
> daniel  - only supports 64 snapshots and does not provide consistency
>           through journaling
> 
> The snapshots within the shared COW use a single dm target:
> multisnap-snap
> 
> *) multisnapshot <origin> <COW device> <chunksize>
>    <# generic args> <generic args> <shared COW store type>
>    <# shared COW store args> <shared COW store args>
>    [<# snapshot ids> <snapshot ids>]
> 
> Table line arguments:
> - <origin> : origin device
> - <COW device> : shared COW store device
> - <chunksize> : chunk size in 512b sectors
> - <# generic args> : number of generic arguments
> - <generic args> : generic arguments
> 	sync-snapshots --- synchronize snapshots according to the list
> 	preserve-on-error --- halt the origin on error in the snapshot store
> - <shared COW store type> : shared COW store type
> 	mikulas --- provided by the 'dm-store-mikulas' module
> 	daniel --- provided by the 'dm-store-daniel' module
> - <# shared COW store args> : number of arguments for shared COW store type
> - <shared COW store args> : shared COW store arguments
> 	cache-threshold size --- a background write is started
> 	cache-limit size --- a limit for metadata cache size
> If 'sync-snapshots' was specified:
> - <# snapshot ids> : number of snapshot ids
> - <snapshot ids> : snapshot ids in desired sync order
> 
> 
> *) multisnap-snap <origin> <snapshot id>
> 
> Table line arguments:
> - <origin> : origin device
> - <snapshot id> : id of the snapshot within the shared store
> 
> 
> Status output:
> *) multisnapshot <# output args> <errno> <new snapid>
>    <total_sectors> <sectors_allocated> <metadata_sectors>
>    <# snapshot ids> <snapshot ids>
> 
> Status line output arguments:
> - <# shared COW store output args> : number of output arguments before
>                                      snapshot id list
> - <errno> : error number associated with the first error that occurred in
>             the store (e.g. -EIO), 0 means the store is active with no errors
> - <new snapid> : snapshot id that will be used for next snapshot, '-' if
>                  no snapshot is in the process of being created
> - <total_sectors> : total size of the shared store in 512b sectors
> - <sectors_allocated> : number of sectors allocated for data and metadata
> - <metadata_sectors> : number of sectors allocated for metadata
> - <# snapshot ids> : number of snapshot ids
> - <snapshot ids> : snapshot ids for snapshots in the store
> 
> 
> Other tunables:
> *) multisnapshot (when using 'mikulas' store)
> The size of the metadata cache associated with the 'mikulas' shared COW
> store defaults to 2% of system memory or 25% of vmalloc memory (which
> ever is lower).  The size of the metadata cache may be overriden using
> the 'dm_bufio_cache_size' module parameter when loading the
> 'dm-store-mikulas' module.  Alternatively, the size may be changed or
> queried after the module is loaded via sysfs:
> /sys/module/dm_store_mikulas/parameters/dm_bufio_cache_size
> 
> 
> DM messages:
> *) multisnapshot
>    - create : creates next new snapshot id, reports created id through 'status'
>      (the snapshot is created once the multisnapshot is suspended)
>    - create_subsnap <snapshot id> : create subsnapshot of specified snapshot
>    - delete <snapshot id> : delete the specified snapshot
> 
> 
> Usage
> =====
> *) Create two logical volumes, one for origin and one for snapshots.
> (The following examples assume /dev/sda for origin and /dev/sdb for snapshot)
> 
> *) Clear the first 4 sectors of the snapshot volume:
> dd if=/dev/zero of=/dev/sdb bs=4096 count=1
> (Otherwise the multisnapshot target's constructor will fail)
> 
> *) Load the shared snapshot driver:
> ORIGIN_BDEV_SIZE=`blockdev --getsize /dev/sda`
> echo 0 $ORIGIN_BDEV_SIZE multisnapshot /dev/sda /dev/sdb 16 0 mikulas 0 | dmsetup create ms
> ('16' is the chunk size in 512-byte sectors. The chunk size may range
> from 1 to 1024 512-byte sectors via lvm. DM's maximum chunk size is only
> limited by 32-bit integer size and available memory)
> 
> This creates the multisnapshot device on /dev/mapper/ms. If the COW
> store was zeroed, it creates a new structure, otherwise it loads
> existing structure.
> 
> Once this is done, you should no longer access /dev/sda and
> /dev/sdb and only use /dev/mapper/ms.
> 
> *) Create new snapshot:
> ('0' in the following dmsetup message commands means sector arg isn't needed)
> dmsetup message /dev/mapper/ms 0 create
> 	If you want to create snapshot-of-snapshot, use:
> 	dmsetup message /dev/mapper/ms 0 create_subsnap <snapID>
> dmsetup status /dev/mapper/ms
> 	(this will display the newly created snapshot ID)
> dmsetup suspend /dev/mapper/ms
> dmsetup resume /dev/mapper/ms
> 
> *) Attach the snapshot:
> echo 0 $ORIGIN_BDEV_SIZE multisnap-snap /dev/sda <snapID> | dmsetup create ms0
> This attaches the snapshot with <snapID> to /dev/mapper/ms0
> 
> *) Delete the snapshot:
> dmsetup message /dev/mapper/ms 0 delete <snapID>
> 
> *) See shared store's status:
> dmsetup status /dev/mapper/ms
> (multisnapshot target's status output is documented above)
> 
> *) Unload it:
> dmsetup remove ms
> dmsetup remove ms0
> ... etc. (note, once you unload the origin, the snapshots become inaccessible
> - the devices exist but they return -EIO when accessed)
> 
> 


* shared snapshots
@ 2010-09-10 17:47 Mikulas Patocka
  0 siblings, 0 replies; 24+ messages in thread
From: Mikulas Patocka @ 2010-09-10 17:47 UTC (permalink / raw)
  To: hch; +Cc: Mike Snitzer, dm-devel

Hi

Mike said that you want to see my shared snapshots. I uploaded the current 
code to:
http://people.redhat.com/mpatocka/patches/kernel/new-snapshots/r22/
http://people.redhat.com/mpatocka/patches/userspace/new-snapshots/lvm-2.02.73/

It applies to Linux 2.6.35 (and also works with 2.6.34 and 2.6.33, but not 
earlier kernels) and LVM 2.02.73 (there is also an older version for LVM 2.02.70).

Mikulas


* Re: Shared snapshots
  2010-01-11  9:40     ` Pasi Kärkkäinen
@ 2010-01-11  9:49       ` Mikulas Patocka
  0 siblings, 0 replies; 24+ messages in thread
From: Mikulas Patocka @ 2010-01-11  9:49 UTC (permalink / raw)
  To: Pasi Kärkkäinen; +Cc: device-mapper development




On Mon, 11 Jan 2010, Pasi Kärkkäinen wrote:

> On Mon, Jan 11, 2010 at 04:12:43AM -0500, Mikulas Patocka wrote:
> > 
> > 
> > On Thu, 7 Jan 2010, Pasi Kärkkäinen wrote:
> > 
> > > On Wed, Jan 06, 2010 at 09:38:00AM -0500, Mikulas Patocka wrote:
> > > > Hi
> > > > 
> > > > I uploaded new code for shared snapshots at
> > > > http://people.redhat.com/mpatocka/patches/kernel/new-snapshots/devel/
> > > > 
> > > > Changes:
> > > > - two queues, one for reads and one for writes, to improve read latency 
> > > > when the device is write-congested
> > > > - limit number of reallocations in flight (VM subsystem sends all writes 
> > > > at once and without limit it creates big lags).
> > > > - fixed one place where it didn't handle on-disk errors
> > > > - simplified bitmap creation, removed some unneeded code
> > > > - new macro for error reporting, as suggested by Zdenek
> > > > - some comments added
> > > > - DM_ prefix added to some macros
> > > > 
> > > > Userspace for lvm 2.02.53 is at
> > > > http://people.redhat.com/mpatocka/patches/userspace/new-snapshots/lvm-2.02.54/
> > > > 
> > > > Please, someone, do offensive testing with this --- i.e. apply the 
> > > > patches, try it, and try all possible lvm commands with it, try to stress 
> > > > and crash it in various ways, etc. If we wait with testing until Alasdair 
> > > > reads the code, it won't get much testing...
> > > > 
> > > 
> > > Sorry for a stupid question, but what does "shared" mean? Support for CLVM?
> > > 
> > > -- Pasi
> > 
> > No, it is not clustered.
> >
> 
> Ok. Do you know if there has been work making snapshots work with CLVM? 
> 
> I remember there was some patches earlier, but no idea about the status
> of those..

There is kernel support, see 
http://people.redhat.com/mpatocka/patches/kernel/all/series.html (the last 
section, "Mikulas' clustering"). The patches are old, they may need 
conflict resolving.

But there is no userspace support. So, if you set it up manually with 
dmsetup and pass it a correct locking ID, it may work.

Mikulas





* Re: Shared snapshots
  2010-01-11  9:12   ` Mikulas Patocka
@ 2010-01-11  9:40     ` Pasi Kärkkäinen
  2010-01-11  9:49       ` Mikulas Patocka
  0 siblings, 1 reply; 24+ messages in thread
From: Pasi Kärkkäinen @ 2010-01-11  9:40 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: device-mapper development

On Mon, Jan 11, 2010 at 04:12:43AM -0500, Mikulas Patocka wrote:
> 
> 
> On Thu, 7 Jan 2010, Pasi Kärkkäinen wrote:
> 
> > On Wed, Jan 06, 2010 at 09:38:00AM -0500, Mikulas Patocka wrote:
> > > Hi
> > > 
> > > I uploaded new code for shared snapshots at
> > > http://people.redhat.com/mpatocka/patches/kernel/new-snapshots/devel/
> > > 
> > > Changes:
> > > - two queues, one for reads and one for writes, to improve read latency 
> > > when the device is write-congested
> > > - limit number of reallocations in flight (VM subsystem sends all writes 
> > > at once and without limit it creates big lags).
> > > - fixed one place where it didn't handle on-disk errors
> > > - simplified bitmap creation, removed some unneeded code
> > > - new macro for error reporting, as suggested by Zdenek
> > > - some comments added
> > > - DM_ prefix added to some macros
> > > 
> > > Userspace for lvm 2.02.53 is at
> > > http://people.redhat.com/mpatocka/patches/userspace/new-snapshots/lvm-2.02.54/
> > > 
> > > Please, someone, do offensive testing with this --- i.e. apply the 
> > > patches, try it, and try all possible lvm commands with it, try to stress 
> > > and crash it in various ways, etc. If we wait with testing until Alasdair 
> > > reads the code, it won't get much testing...
> > > 
> > 
> > Sorry for a stupid question, but what does "shared" mean? Support for CLVM?
> > 
> > -- Pasi
> 
> No, it is not clustered.
>

Ok. Do you know if there has been any work on making snapshots work with CLVM?

I remember there were some patches earlier, but I have no idea about the status
of those...

> It means that you can create more snapshots (actually 2^32), they have a 
> common snapshot store and data in this store are shared.
> 
> Normal snapshots get extremely ineffective when using multiple snapshots 
> --- for example, if you have 10 snapshots, data have to be written 10 
> times and any writes to the origin are 10 times slower. Shared snapshots 
> write data only once.
> 
> A possible use for them is to take periodic snapshot, for example once per 
> 5 minutes, to record system activity.
> 
> Another possible use is to create one master image with system image and 
> take many snapshots, each snapshot for one virtual machine.
>

This sounds excellent! Thanks for explaining.

-- Pasi


* Re: Shared snapshots
  2010-01-07 10:08     ` 张宇
@ 2010-01-11  9:14       ` Mikulas Patocka
  0 siblings, 0 replies; 24+ messages in thread
From: Mikulas Patocka @ 2010-01-11  9:14 UTC (permalink / raw)
  To: 张宇; +Cc: device-mapper development




On Thu, 7 Jan 2010, 张宇 wrote:

> both of the origin device and the snapshot device can't be resized
> dynamically?

You can extend the origin. You can't reduce the origin.
You can extend the shared snapshot store and you can't reduce it.
You can both extend and reduce individual snapshots.

Mikulas

> 2010/1/7 haad <haaaad@gmail.com>
> 
> >  On Thu, Jan 7, 2010 at 9:42 AM, Pasi Kärkkäinen <pasik@iki.fi> wrote:
> > > On Wed, Jan 06, 2010 at 09:38:00AM -0500, Mikulas Patocka wrote:
> > >> Hi
> > >>
> > >> I uploaded new code for shared snapshots at
> > >> http://people.redhat.com/mpatocka/patches/kernel/new-snapshots/devel/
> > >>
> > >> Changes:
> > >> - two queues, one for reads and one for writes, to improve read latency
> > >> when the device is write-congested
> > >> - limit number of reallocations in flight (VM subsystem sends all writes
> > >> at once and without limit it creates big lags).
> > >> - fixed one place where it didn't handle on-disk errors
> > >> - simplified bitmap creation, removed some unneeded code
> > >> - new macro for error reporting, as suggested by Zdenek
> > >> - some comments added
> > >> - DM_ prefix added to some macros
> > >>
> > >> Userspace for lvm 2.02.53 is at
> > >>
> > http://people.redhat.com/mpatocka/patches/userspace/new-snapshots/lvm-2.02.54/
> > >>
> > >> Please, someone, do offensive testing with this --- i.e. apply the
> > >> patches, try it, and try all possible lvm commands with it, try to
> > stress
> > >> and crash it in various ways, etc. If we wait with testing until
> > Alasdair
> > >> reads the code, it won't get much testing...
> > >>
> > >
> > > Sorry for a stupid question, but what does "shared" mean? Support for
> > CLVM?
> >
> > No all snapshots share one, common exception store I think.
> >
> >
> >
> > --
> >
> >
> > Regards.
> >
> > Adam
> >
> > --
> > dm-devel mailing list
> > dm-devel@redhat.com
> > https://www.redhat.com/mailman/listinfo/dm-devel
> >
> 





* Re: Shared snapshots
  2010-01-07  8:42 ` Pasi Kärkkäinen
  2010-01-07  9:05   ` haad
@ 2010-01-11  9:12   ` Mikulas Patocka
  2010-01-11  9:40     ` Pasi Kärkkäinen
  1 sibling, 1 reply; 24+ messages in thread
From: Mikulas Patocka @ 2010-01-11  9:12 UTC (permalink / raw)
  To: Pasi Kärkkäinen; +Cc: device-mapper development




On Thu, 7 Jan 2010, Pasi Kärkkäinen wrote:

> On Wed, Jan 06, 2010 at 09:38:00AM -0500, Mikulas Patocka wrote:
> > Hi
> > 
> > I uploaded new code for shared snapshots at
> > http://people.redhat.com/mpatocka/patches/kernel/new-snapshots/devel/
> > 
> > Changes:
> > - two queues, one for reads and one for writes, to improve read latency 
> > when the device is write-congested
> > - limit number of reallocations in flight (VM subsystem sends all writes 
> > at once and without limit it creates big lags).
> > - fixed one place where it didn't handle on-disk errors
> > - simplified bitmap creation, removed some unneeded code
> > - new macro for error reporting, as suggested by Zdenek
> > - some comments added
> > - DM_ prefix added to some macros
> > 
> > Userspace for lvm 2.02.53 is at
> > http://people.redhat.com/mpatocka/patches/userspace/new-snapshots/lvm-2.02.54/
> > 
> > Please, someone, do offensive testing with this --- i.e. apply the 
> > patches, try it, and try all possible lvm commands with it, try to stress 
> > and crash it in various ways, etc. If we wait with testing until Alasdair 
> > reads the code, it won't get much testing...
> > 
> 
> Sorry for a stupid question, but what does "shared" mean? Support for CLVM?
> 
> -- Pasi

No, it is not clustered.

It means that you can create many snapshots (actually 2^32); they have a 
common snapshot store and the data in this store is shared.

Normal snapshots become extremely inefficient when you use multiple 
snapshots --- for example, if you have 10 snapshots, data has to be written 
10 times and any write to the origin is 10 times slower. Shared snapshots 
write the data only once.

One possible use for them is to take periodic snapshots, for example once 
every 5 minutes, to record system activity.

Another possible use is to create one master system image and take many 
snapshots of it, one snapshot for each virtual machine.

Mikulas





* Re: Shared snapshots
  2010-01-07  9:05   ` haad
@ 2010-01-07 10:08     ` 张宇
  2010-01-11  9:14       ` Mikulas Patocka
  0 siblings, 1 reply; 24+ messages in thread
From: 张宇 @ 2010-01-07 10:08 UTC (permalink / raw)
  To: device-mapper development



So neither the origin device nor the snapshot device can be resized
dynamically?

2010/1/7 haad <haaaad@gmail.com>

>  On Thu, Jan 7, 2010 at 9:42 AM, Pasi Kärkkäinen <pasik@iki.fi> wrote:
> > On Wed, Jan 06, 2010 at 09:38:00AM -0500, Mikulas Patocka wrote:
> >> Hi
> >>
> >> I uploaded new code for shared snapshots at
> >> http://people.redhat.com/mpatocka/patches/kernel/new-snapshots/devel/
> >>
> >> Changes:
> >> - two queues, one for reads and one for writes, to improve read latency
> >> when the device is write-congested
> >> - limit number of reallocations in flight (VM subsystem sends all writes
> >> at once and without limit it creates big lags).
> >> - fixed one place where it didn't handle on-disk errors
> >> - simplified bitmap creation, removed some unneeded code
> >> - new macro for error reporting, as suggested by Zdenek
> >> - some comments added
> >> - DM_ prefix added to some macros
> >>
> >> Userspace for lvm 2.02.53 is at
> >>
> http://people.redhat.com/mpatocka/patches/userspace/new-snapshots/lvm-2.02.54/
> >>
> >> Please, someone, do offensive testing with this --- i.e. apply the
> >> patches, try it, and try all possible lvm commands with it, try to
> stress
> >> and crash it in various ways, etc. If we wait with testing until
> Alasdair
> >> reads the code, it won't get much testing...
> >>
> >
> > Sorry for a stupid question, but what does "shared" mean? Support for
> CLVM?
>
> No all snapshots share one, common exception store I think.
>
>
>
> --
>
>
> Regards.
>
> Adam
>
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel
>






* Re: Shared snapshots
  2010-01-07  8:42 ` Pasi Kärkkäinen
@ 2010-01-07  9:05   ` haad
  2010-01-07 10:08     ` 张宇
  2010-01-11  9:12   ` Mikulas Patocka
  1 sibling, 1 reply; 24+ messages in thread
From: haad @ 2010-01-07  9:05 UTC (permalink / raw)
  To: device-mapper development; +Cc: Mike Snitzer, Alasdair G Kergon

On Thu, Jan 7, 2010 at 9:42 AM, Pasi Kärkkäinen <pasik@iki.fi> wrote:
> On Wed, Jan 06, 2010 at 09:38:00AM -0500, Mikulas Patocka wrote:
>> Hi
>>
>> I uploaded new code for shared snapshots at
>> http://people.redhat.com/mpatocka/patches/kernel/new-snapshots/devel/
>>
>> Changes:
>> - two queues, one for reads and one for writes, to improve read latency
>> when the device is write-congested
>> - limit number of reallocations in flight (VM subsystem sends all writes
>> at once and without limit it creates big lags).
>> - fixed one place where it didn't handle on-disk errors
>> - simplified bitmap creation, removed some unneeded code
>> - new macro for error reporting, as suggested by Zdenek
>> - some comments added
>> - DM_ prefix added to some macros
>>
>> Userspace for lvm 2.02.53 is at
>> http://people.redhat.com/mpatocka/patches/userspace/new-snapshots/lvm-2.02.54/
>>
>> Please, someone, do offensive testing with this --- i.e. apply the
>> patches, try it, and try all possible lvm commands with it, try to stress
>> and crash it in various ways, etc. If we wait with testing until Alasdair
>> reads the code, it won't get much testing...
>>
>
> Sorry for a stupid question, but what does "shared" mean? Support for CLVM?

No. All snapshots share one common exception store, I think.



-- 


Regards.

Adam

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


* Re: Shared snapshots
  2010-01-06 14:38 Shared snapshots Mikulas Patocka
@ 2010-01-07  8:42 ` Pasi Kärkkäinen
  2010-01-07  9:05   ` haad
  2010-01-11  9:12   ` Mikulas Patocka
  0 siblings, 2 replies; 24+ messages in thread
From: Pasi Kärkkäinen @ 2010-01-07  8:42 UTC (permalink / raw)
  To: device-mapper development; +Cc: Mike Snitzer, Alasdair G Kergon

On Wed, Jan 06, 2010 at 09:38:00AM -0500, Mikulas Patocka wrote:
> Hi
> 
> I uploaded new code for shared snapshots at
> http://people.redhat.com/mpatocka/patches/kernel/new-snapshots/devel/
> 
> Changes:
> - two queues, one for reads and one for writes, to improve read latency 
> when the device is write-congested
> - limit number of reallocations in flight (VM subsystem sends all writes 
> at once and without limit it creates big lags).
> - fixed one place where it didn't handle on-disk errors
> - simplified bitmap creation, removed some unneeded code
> - new macro for error reporting, as suggested by Zdenek
> - some comments added
> - DM_ prefix added to some macros
> 
> Userspace for lvm 2.02.53 is at
> http://people.redhat.com/mpatocka/patches/userspace/new-snapshots/lvm-2.02.54/
> 
> Please, someone, do offensive testing with this --- i.e. apply the 
> patches, try it, and try all possible lvm commands with it, try to stress 
> and crash it in various ways, etc. If we wait with testing until Alasdair 
> reads the code, it won't get much testing...
> 

Sorry for a stupid question, but what does "shared" mean? Support for CLVM?

-- Pasi


* Shared snapshots
@ 2010-01-06 14:38 Mikulas Patocka
  2010-01-07  8:42 ` Pasi Kärkkäinen
  0 siblings, 1 reply; 24+ messages in thread
From: Mikulas Patocka @ 2010-01-06 14:38 UTC (permalink / raw)
  To: Alasdair G Kergon; +Cc: Mike Snitzer, dm-devel

Hi

I uploaded new code for shared snapshots at
http://people.redhat.com/mpatocka/patches/kernel/new-snapshots/devel/

Changes:
- two queues, one for reads and one for writes, to improve read latency 
when the device is write-congested
- limit the number of reallocations in flight (the VM subsystem sends all 
writes at once, and without a limit this creates big lags).
- fixed one place where it didn't handle on-disk errors
- simplified bitmap creation, removed some unneeded code
- new macro for error reporting, as suggested by Zdenek
- some comments added
- DM_ prefix added to some macros

Userspace for lvm 2.02.53 is at
http://people.redhat.com/mpatocka/patches/userspace/new-snapshots/lvm-2.02.54/

Please, someone, do offensive testing with this --- i.e. apply the 
patches, try it, and try all possible lvm commands with it, try to stress 
and crash it in various ways, etc. If we wait with testing until Alasdair 
reads the code, it won't get much testing...

Mikulas


* Re: Shared snapshots
  2009-05-07 15:44           ` Jonathan Brassow
@ 2009-05-11 10:44             ` Mikulas Patocka
  0 siblings, 0 replies; 24+ messages in thread
From: Mikulas Patocka @ 2009-05-11 10:44 UTC (permalink / raw)
  To: Jonathan Brassow; +Cc: dm-devel, Alasdair Kergon

> > Is your refactored code extensible to other exception stores?
> > - true. It is. But it was extensible even before refactoring. Before
> > refactoring, one could write any exception store driver (either unshared
> > or shared) and plug it into the unshared-master-driver or the
> > shared-master-driver. All you did was to join the unshared-master-driver
> > and the shared-master-driver to one chunk of code.
> 
> You make it sound as though there was a modular interface to begin with, but
> there was not.  Every time a new exception store was added in the past, it
> required new parsing code in the snapshot module, new 'if' statements to
> conditionalize for it, etc.

And if you want my exception store, you also need a set of patches to the 
generic code. You haven't made it any better - and no one can make it 
better - when someone else implements something completely new, more 
changes to the generic code will be needed.

> A bunch of patches have already been added to
> improve that situation - all of which were part of the refactored code.  If
> just 2 of the remaining 33 patches in my further refactoring are added
> upstream, then there is no need for the snapshot module to parse the exception
> store arguments.  In fact, there is no reason for the snapshot module to know
> anything about how the exception store does its job.

There are generic command-line arguments and arguments specific to the 
exception store. That's how I already wrote it.

> > Any other advantages?
> 
> Yes.  It is easier to write/stack/include cluster-aware exception store
> modules.

Why stack? The exception store sometimes performs actions on its own. If 
you stack anything on top of it, you won't make it cluster-aware.

All you need for clustering is to write a lock module that has these 
functions:

- take the lock (the return code indicates whether the lock was held 
by another node in the cluster since the last release --- so that a cache 
flush can be implemented)
- release the lock
- ask if someone is competing for the lock (in case of merging/background 
delete, we must hand over the store to another node if it wants to do 
something)

This lock can then be plugged into all the exception stores to make them 
clustered (it could even be plugged into the old non-shared store --- when 
the "take the lock" function indicates that someone else held the lock in 
the meantime, the non-shared store could just read any exceptions that the 
other node created and go on).

You don't want to develop a new cluster exception store or a new layer! All 
you need to develop is the lock.
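
A rough sketch of such a lock interface (purely illustrative; none of these
names exist in the posted patches):

/*
 * Purely illustrative sketch of the proposed cluster lock interface;
 * none of these names exist in the posted patches.
 */
struct dm_multisnap_cluster_lock_ops {
	/*
	 * Take the lock.  Returns 1 if another node held the lock since our
	 * last release (so the caller must flush and reload its metadata
	 * cache), 0 if not, or a negative errno on failure.
	 */
	int (*lock)(void *context);

	/* Release the lock. */
	void (*unlock)(void *context);

	/*
	 * Return non-zero if another node is waiting for the lock, so that
	 * long-running work (merge, background delete) can be handed over.
	 */
	int (*is_contended)(void *context);
};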

> > You can't do this. "delete" is a long-running operation, it needs to walk
> > the whole tree. Stopping an exception store during delete is not
> > acceptable on big devices. I have already implemented delete in the
> > background, it goes step-by-step and blocks concurrent access only for
> > these little steps, so it won't lock the computer up for minutes or hours.
> > It can also batch multiple delete requests just into one tree-walk.
> 
> Your approach is excellent.  However, it would be unacceptable if it cannot
> cease processing when device-mapper needs it to.  Certainly you built in the
> capability to stop and restart the process?

Yes, it stores the current position in the commit block, so in case of driver 
unload or crash, it resumes where it left off.

> I am not asking for the delete
> operation to be completed in order to suspend.  I am asking for the operation
> to halt for a suspend and continue on resume.  This is a fundamental
> requirement.  How will you know when to suspend if you are not informed by a
> 'suspend' call from the API?

There is no table reload needed now, so I don't have a suspend method. Maybe 
that will change with clustering, and I'll split the initialization into two 
parts: allocating structures (done in the constructor) and reading data (done 
in resume).
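
A minimal sketch of what such a split could look like in a dm target
(illustrative only; all names here are hypothetical and the remaining
target methods are omitted):

#include <linux/device-mapper.h>
#include <linux/module.h>
#include <linux/slab.h>

/*
 * Illustrative sketch only: allocate structures in the constructor and
 * read the on-disk metadata in resume.  Names are hypothetical; .dtr,
 * .map and the other target methods are omitted.
 */
struct example_ms {
	struct dm_dev *cow;
	/* in-memory metadata would live here */
};

static int example_ctr(struct dm_target *ti, unsigned argc, char **argv)
{
	struct example_ms *ms = kzalloc(sizeof(*ms), GFP_KERNEL);

	if (!ms)
		return -ENOMEM;
	/* parse arguments and grab devices here, but do not read metadata */
	ti->private = ms;
	return 0;
}

static void example_resume(struct dm_target *ti)
{
	struct example_ms *ms = ti->private;

	/* read (or re-read) the on-disk metadata here */
	(void)ms;
}

static struct target_type example_ms_target = {
	.name    = "example-multisnapshot",
	.version = {0, 0, 1},
	.module  = THIS_MODULE,
	.ctr     = example_ctr,
	.resume  = example_resume,
};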

> > > We absolutely need your exception store - there is no question about that.
> > > I
> > 
> > But it is not possible now. What it lacks:
> > - the snapshot ids are submitted by the user (in my implementation, the
> > exception store itself must make the snapshot id, it is design thing that
> > can't be changed)
> 
> Really?  It can't be changed?  I think it would be great if you added a
> translation table.  Userspace would pass down a number/name/UUID (it doesn't
> really matter, as long as it is unique among the shared snapshots for a given
> COW device), then your exception store could make its own "snapshot id" and
> associate it with what userspace has given it.
> 
> This translation table could clean up a number of complaints that we have had
> about your implementation.  Now, all you should need is a simple CTR + RESUME
> to get a snapshot device going - same as the legacy snapshot targets and every
> other device-mapper target without exception.  There would no longer be a need
> to use 'message' to create a snapshot and then a further CTR+RESUME to
> instantiate it.

I'll write the userspace lvm code and if I find out that it'd be better to 
implement the translation table in the kernel instead of in the userspace, 
I'll do it. So far, I'm undecided on this.

> Additional benefits could be gained from this.  I no longer see a need to have
> the origin constructor changed...  Now things seem to be coming more in-line
> with the way the userspace/kernel interface was in the past.  That should
> simplify userspace coding.
> 
> Just this one thing could fix so many complaints... can you not do this?
> 
> > - it lacks read-once-write-many copies (they are needed only on my store,
> > not on Fujita-Daniel's store)
> 
> I need to understand this better.  Perhaps you could help me?  This could be
> your greatest argument.  It would not deter me from asking for a exception
> store/snapshot boundary API, nor will it deter me from demanding better
> userspace/kernel boundary compliance.  However, it may make me realize that
> retrofitting the old snapshot module to also be able to handle shared
> exception stores is not the best.  (This would require me to remove 2 of the
> 33 outstanding patches I have posted.)
> 
> If you help me, I will probably understand this code more quickly.  Otherwise,
> it will take me longer.  If I can see no solution, I will very easily accept
> that the current snapshot module is not the place to handle shared exception
> stores.  I have to warn you though... I can be good at finding solutions to
> problems.

My store is a B+-tree whose entries contain (old chunk number, starting 
snapshot id, ending snapshot id, new chunk number) (struct 
dm_multisnap_bt_entry). So it can only share chunks across consecutive 
ranges of snapshot ids. Snapshot ids are assigned sequentially. If there 
were no writes to the snapshots, the sharing would always be perfect --- 
i.e. chunks would always be shared whenever they could be.
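
Conceptually, one entry looks something like this (the field names are
illustrative, not necessarily the actual layout in
dm-multisnap-mikulas-struct.h):

#include <linux/types.h>

/*
 * Conceptual sketch of one B+-tree entry as described above; the field
 * names are illustrative and not necessarily the actual on-disk layout.
 */
struct example_bt_entry {
	__u64 orig_chunk;	/* chunk number in the origin device */
	__u32 snap_from;	/* first snapshot id sharing this copy */
	__u32 snap_to;		/* last snapshot id sharing this copy */
	__u64 new_chunk;	/* chunk number in the shared COW store */
};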

But if you write to a snapshot, the sharing becomes imperfect. Suppose 
that you have snapshots 1,2,3,4,5. You write to snapshot 3. You create a 
record for the range (3-3). Now you write to the origin. In this case two 
chunks must be allocated and two copies performed, one for the range (1-2) 
and the other for the range (4-5).

kcopyd can perform up to 8 simultaneous writes, so I use this 
functionality. I make a kcopyd call that reads the data once and makes 
several writes.
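
A minimal sketch of such a read-once-write-many copy (illustrative only;
the helper and its parameters are hypothetical, only the dm-io/dm-kcopyd
structures and calls are real kernel API):

#include <linux/kernel.h>
#include <linux/dm-io.h>
#include <linux/dm-kcopyd.h>

/*
 * Illustrative sketch only: one read source fanned out to up to 8 kcopyd
 * destinations.  The helper and its parameters are hypothetical.
 */
static void example_copy_done(int read_err, unsigned long write_err,
			      void *context)
{
	/* if more than 8 destinations were needed, resubmit the bio here */
}

static void example_dispatch_copies(struct dm_kcopyd_client *kc,
				    struct block_device *origin_bdev,
				    struct block_device *cow_bdev,
				    sector_t src_sector, sector_t chunk_sectors,
				    sector_t *dest_sectors, unsigned n_dests,
				    void *context)
{
	struct dm_io_region src, dests[8];
	unsigned i, n = min(n_dests, 8u);

	src.bdev = origin_bdev;
	src.sector = src_sector;
	src.count = chunk_sectors;

	for (i = 0; i < n; i++) {
		dests[i].bdev = cow_bdev;
		dests[i].sector = dest_sectors[i];
		dests[i].count = chunk_sectors;
	}

	/* one read from the origin, up to 8 writes to the shared store */
	dm_kcopyd_copy(kc, &src, n, dests, 0, example_copy_done, context);
}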

> > - it lacks snapshot-to-snapshot copies for writable snapshots (for
> > Fujita-Daniel's code you need just one copy, for my code you need at most
> > two).
> 
> I think there are solutions to this that are not that complicated.
> 
> > - it lacks the possibility to re-submit bios to the exception store to do
> > more relocations after some of the relocations have been finished. (in
> > extreme cases of many writes to the snapshots, there may be many
> > relocations on my exception store --- it is no need to optimize for these
> > cases, but they must be handled correctly).
>
> I also would like to understand this limitation more.

See the above paragraph. If a copy to more than 8 places is needed, I 
just fill all 8 slots, submit the job to kcopyd, and when it finishes, I 
resubmit the bio to the entry path, do the lookup for new copies again, and 
if there are more, fill up to 8 new kcopyd destinations ... and so on, 
until everything is processed.

Note that this only happens if someone writes to the snapshots; if not, 
there will always be perfect sharing and at most one copy.

> > Why don't do this refactoring and why not have common driver for shared
> > and non-shared exception store? - it already took 5 months, it is still
> > incomplete and needs non-trivial changes to be at least as functional as
> > if this hadn't happened at all.
> > 
> > If you're still thinking about "cleanness", read the first paragraph about
> > loss of reason again. Back into reality --- if you spend 5 months
> > shuffling code around without adding any functionality it is only
> > reasonable to do it if it will save more than 5 months in the long term.
> > And it won't.
> > 
> > It is no shame, it occasionaly happens to anyone. To me it happened on
> > other projects too that I had these feelings "wow, it will be super clean
> > and easy" and when I got into typing the code, it turned out that it is
> > much more time-consuming that I anticipated. And when it happend to me, I
> > erased the code and changed my perception of "cleanness" - it's the only
> > thing one can do.
> 
> Final comments...
> 
> You and I are in agreement that we don't want to waste time; and certainly,
> there is no time to waste.  We want a good solution.  Let us figure out what
> we both want; and determine from there what pieces of code we can keep and
> what needs to be adjusted.  I will start...
> 
> Here is what I demand:
> #1) A clean API that separates the snapshot module and the exception store
> implementations.  This facilitates others who wish to write new exception
> stores - including me, who is responsible for writing the cluster-aware
> exception store.  I would appreciate it if some forethought was given to
> cluster implementations.  I am not saying that there cannot be a "shared" API
> and a "non-shared" API.  Go ahead, write your own - it shouldn't take too
> long.

I have already done that.

I started with my exception store only. Later I refactored it to handle 
both my code and Fujita's. This refactoring took me only one week.

> I'd like to think that there can be one API that can satisfy the needs
> of both (perhaps even with a flag that signifies "shared" vs "non-shared if
> necessary?).

This is what you did, but the shared driver is as big as both of the smaller 
drivers combined. So there is no advantage to it.

> I will continue to think that it is possible to make this happen
> until I am proved or can prove to myself otherwise.  This does /not/ mean I
> will block any attempts to do something other than a singular, all-inclusive
> API.
> 
> #2) Don't break the current model of userspace/kernel interaction unless it is
> absolutely necessary.  'delete'ing a single snapshot from a share exception
> store is *potentially* one of these cases (although I can think of good
> solutions).  Right now, when we want to delete a snapshot, we simply wipe the
> COW device - obviously, this doesn't work when the COW is shared.  However,
> forcing userspace to know about snapid's which are generated by the kernel and
> overloading the message function is not acceptable.  I think there is also
> much room for improvement once that is fixed to minimize changes to the
> constructor tables and reduce the burden on user-space tools.

This is too one-sided a view. The lookup code will be written either in the 
kernel or in userspace. It has to be written anyway.

I won't argue about "what would be easier if someone were writing the 
userspace code". I'll just start writing the userspace code now. If I get a 
strong feeling that snapids and the create message are complicating it too 
much, I'll implement the lookup in the kernel.

Keep in mind that the task here is not to create some emotion of 
"cleanness", but to save coding time.

> Here is what I am willing to sacrifice if it proves necessary:
> #1) I am willing to give in on having two snapshot modules.  I have always
> stated this as a possibility.  It is very possible that it makes more sense
> this way instead of having one snapshot module be able to interact with all
> exception stores.  This is even more palatable if you consider that the legacy
> snapshots will be largely abandoned.  This would mean abandoning some of the
> patches I've posted upstream, but others still remain quite useful.  More than
> one snapshot-merge target might be required in this case.  There are a number
> of other disadvantages to this that I (maybe not others) would be willing to
> set aside based solely on your word that the advantages outweigh the
> disadvantages.
> 
> #2) I am willing to put in any extra work required to write the cluster-aware
> modules - even if two APIs and two snapshot modules end up being required.

See above --- don't write the clustered exception store layer, write the 
cluster lock. The lock can be plugged into both the old and the new 
snapshots at little cost.
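
To sketch what I mean (the names below are made up for illustration, this 
is not an existing interface), the cluster lock would be a tiny module 
with roughly this shape, called around metadata access by either snapshot 
target:

/* illustrative only --- hypothetical names, not code from any patch set */
struct dm_cluster_lock;

/* one lock instance per cow device, identified by a cluster-wide uuid */
struct dm_cluster_lock *dm_cluster_lock_create(const char *uuid);
void dm_cluster_lock_destroy(struct dm_cluster_lock *lock);

/* taken around any read or update of shared on-disk metadata;
   "for_write" requests an exclusive lock, otherwise a shared one */
int dm_cluster_lock_acquire(struct dm_cluster_lock *lock, int for_write);
void dm_cluster_lock_release(struct dm_cluster_lock *lock);

Both the old snapshot target and my driver could call these few functions 
around their metadata I/O, and the locking module would be the only 
cluster-aware piece of code.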

> Mikulas, I get the sense that you have looked at the API with a quick eye for
> why it can't be done, instead of whether it can be done.  The conclusions may
> ultimately be the same; but I am asking you to take a look and see what *can*
> be done.  What compromises can be made?  How far are you willing to come?  I
> don't think my base demands are too much to ask, nor do I think they
> materially differ from what is required for upstream inclusion.

I'm not saying that your code won't work. If you put a couple of months 
into it, you could definitely make it work. But why do it if it's 
already done?

If you want to do something useful, you can clean up the deficiencies in 
Fujita's code and implement a journal in it. Or you can write that cluster 
lock as I proposed above, and when I finish the userspace code, I'll plug 
it into the kernel driver.

I'm just suggesting that you shouldn't waste more time shuffling code 
around.

Mikulas

> Thanks for your continued effort on this,
> brassow

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Shared snapshots
  2009-05-06 16:06         ` Mikulas Patocka
@ 2009-05-07 15:44           ` Jonathan Brassow
  2009-05-11 10:44             ` Mikulas Patocka
  0 siblings, 1 reply; 24+ messages in thread
From: Jonathan Brassow @ 2009-05-07 15:44 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: dm-devel, Alasdair Kergon


On May 6, 2009, at 11:06 AM, Mikulas Patocka wrote:

> Hi
>
>> I'd like to state that the point of the refactoring work was not to  
>> reduce the
>> size of the code.  The point was to provide a clear and clean  
>> interface for
>> all exception stores - new and old - and do it in a way that  
>> adheres to the
>> principles of the general device-mapper target API.
>
> First let me write about possible arguments regarding "cleanness",
> "niceness" etc.
>
> I think that what is really happening here is loss of reason.
>
> In general, loss of reason is a cognitive bias that happens in the
> following scenario (not only in software engineering, but in other  
> social
> situations as well):
>
> * people want to reach goal G
> * in order to reach the goal G, people do action A
> * they do it thoroughly, consistently, over the long time
> * now, "loss of reason", an error inside human brain happens: people  
> start
> slowly forgetting about the goal G. And they start viewing the  
> action A as
> being "good", "nice", "right", "the way it should be done"
> * from this point on, people are doing action A even in situations  
> when
> it damages the goal G
> * people's performance decreases, they no longer attempt to reach real
> objective goals, all they want to reach is their own emotion of  
> "niceness"
> inside their head.
>
> Now, how does it apply to our snapshot development scenario?
>
> * programmers want to reduce coding time and maintenance time (goal G)
> * to reach this goal, programmers share their code, they sometimes
> refactor code to be better shareable (action A)
> * it has been done consistently over the long time for various  
> projects
> * now people start considering sharing code and refactoring as "good",
> "clean", "the way software should be written" or whatever. They  
> completely
> forgot about the initial goal that was there (save developers' time)
> * and now we are refactoring the snapshot code even if it increases our
> development time, it is not going to save the time in the future,  
> but we
> consider this "right", "clean", "clear", "nice"
>
> Do you see how these "universal"/"emotional"/"aesthetic" arguments,
> not backed by anything, are void?
>
>
> Now, if we ignore the emotional stuff and look at it from the real  
> point
> of view.
>
> Is the refactored code saving development time now?
> - it isn't. It already ate 5 months and it will just eat more before  
> it
> gets functional.

I agree.  I think there will be additional time involved to get the  
refactored code ready.  I also believe that it will take additional  
time to get your current solution ready.  There is also more involved  
here than just development time.  You must also consider the time it  
takes to merge things upstream.  If you are unwilling to make any  
concessions on your code, it is unlikely to go upstream.  Where will  
you be then?

> Will the refactored code be saving maintenance time in the future?
> - it won't. The code is as big as if the refactoring hadn't happened  
> (and
> it'll be just bigger if you implement the missing functionality).
>
> Anyway, two smaller drivers are more maintainable than one big driver of
> double the size. A person who doesn't understand the code will have to
> read just half of the code to understand the logic if he wants to modify
> one of the smaller drivers. It also reduces testing time: if you modify
> one smaller driver, you are certain that the second driver is intact and
> you don't have to test it.

I see (have seen) your point; and while I respectfully disagree, I  
still think I have ideas where we can meet in the middle to come up  
with a solution.  (More on this later.)

> Is your refactored code extensible to other exception stores?
> - true. It is. But it was extensible even before refactoring. Before
> refactoring, one could write any exception store driver (either  
> unshared
> or shared) and plug it into the unshared-master-driver or the
> shared-master-driver. All you did was to join the unshared-master- 
> driver
> and the shared-master-driver to one chunk of code.

You make it sound as though there was a modular interface to begin  
with, but there was not.  Every time a new exception store was added  
in the past, it required new parsing code in the snapshot module, new  
'if' statements to conditionalize for it, etc.  A bunch of patches  
have already been added to improve that situation - all of which were  
part of the refactored code.  If just 2 of the remaining 33 patches in  
my further refactoring are added upstream, then there is no need for  
the snapshot module to parse the exception store arguments.  In fact,  
there is no reason for the snapshot module to know anything about how  
the exception store does its job.

> Any other advantages?

Yes.  It is easier to write/stack/include cluster-aware exception  
store modules.

> Regarding the bugs:
> 56MB origin, 160MB snapshot store (128MB of that are reserved for
> non-existing journal, so it will overflow). Run mkfs.reiserfs on the
> origin. Ends up with a lot of messages like this:
> May  6 13:36:10 slunicko kernel: attempt to access beyond end of  
> device
> May  6 13:36:10 slunicko kernel: dm-5: rw=25, want=1048784,  
> limit=327680
> May  6 13:36:11 slunicko kernel: attempt to access beyond end of  
> device
> May  6 13:36:11 slunicko kernel: dm-5: rw=25, want=1376256,  
> limit=327680
> May  6 13:36:11 slunicko kernel: attempt to access beyond end of  
> device
> and the process gets stuck in "D" state in
> __wait_on_bit/wait_on_page_writeback_range. Very likely the bio was lost
> somewhere and never returned with either error or success.
>
> I have found a couple of bugs in Fujita/Daniel's exception store long
> time ago (when I was porting it to my driver), the most serious one I
> found was this:
> @@ -510,7 +510,7 @@ static int read_new_bitmap_chunk(struct
>        chunk_io(store, ss->cur_bitmap_chunk, WRITE, 1, ss->bitmap);
>
>        ss->cur_bitmap_chunk++;
> -       if (ss->cur_bitmap_chunk == ss->nr_bitmap_chunks)
> +       if (ss->cur_bitmap_chunk == ss->nr_bitmap_chunks + FIRST_BITMAP_CHUNK)
>                ss->cur_bitmap_chunk = FIRST_BITMAP_CHUNK;
>
>        chunk_io(store, ss->cur_bitmap_chunk, READ, 1,  ss->bitmap);
> ... but it doesn't fix this problem. So there's something else.
>
> Maybe the bad accesses could be due to endianness problems; I use sparc64.
> The lost bio can't be caused by endianness, so this needs to be
> investigated.
>
>
> Writable snapshots:
> create an origin and two snapshots. Write to the origin (it creates a
> shared chunk for both snapshots). Write to the same chunk in one of the
> snapshots. The write is lost. Read it again and the data is not there.
>
> In this situation you need to copy the chunk from the snapshot store  
> to
> the snapshot store (you can't copy it from the origin). I was  
> looking for
> the code that handles these snapshot-to-snapshot copies, but I  
> didn't find
> it. So the problem is not some silly typo that could be fixed with one
> line - the problem is that the code to handle it wasn't written.

Thank you for explaining these bugs.  You are right, the code to 
handle snapshot-to-snapshot copy does not exist!  It should not be  
difficult to fix these bugs.

>> I tried to ensure that all necessary API functions were in place.   
>> Like
>> "suspend", which is not used by any of the legacy exception stores,  
>> but will
>> be useful when it comes time to quiesce a shared exception store  
>> that may be
>> performing a "delete" operation.
>
> You can't do this. "delete" is a long-running operation, it needs to  
> walk
> the whole tree. Stopping an exception store during delete is not
> acceptable on big devices. I have already implemented delete in the
> background, it goes step-by-step and blocks concurrent access only for
> these little steps, so it won't lock the computer up for minutes or  
> hours.
> It can also batch multiple delete requests just into one tree-walk.

Your approach is excellent.  However, it would be unacceptable if it  
cannot cease processing when device-mapper needs it to.  Certainly you  
built in the capability to stop and restart the process?  I am not  
asking for the delete operation to be completed in order to suspend.   
I am asking for the operation to halt for a suspend and continue on  
resume.  This is a fundamental requirement.  How will you know when to  
suspend if you are not informed by a 'suspend' call from the API?
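
To be concrete about the shape of the hook I am asking for (the structure 
below is only a sketch with hypothetical names, not the API patches as 
posted), something along these lines is all that is needed:

/* sketch only --- hypothetical names */
struct dm_exception_store;

struct dm_exception_store_type {
        int  (*ctr)(struct dm_exception_store *store,
                    unsigned argc, char **argv);
        void (*dtr)(struct dm_exception_store *store);

        /* pause background work (e.g. an in-progress delete) and flush
           metadata so the device can be suspended cleanly */
        void (*suspend)(struct dm_exception_store *store);

        /* pick the background work up again once the table is resumed */
        void (*resume)(struct dm_exception_store *store);
};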

>> Please note, that there probably won't be that much code sharing  
>> between
>> exception store modules; but the snapshot code is completely  
>> agnostic to the
>> type of exception store used.  This allows an infinite variety of  
>> exception
>> stores without change to the snapshot code.  This is also nice  
>> because changes
>> in the snapshot code should not require work in the exception  
>> stores and vice 
>> versa. Userspace code should also be reduced due to identical  
>> mechanisms used
>> to communicate and operate snapshots with legacy or future  
>> exception stores -
>> you need only change the constructor arguments.  Backwards  
>> compatibility is
>> improved and the meaning of DM ioctls doesn't change depending on  
>> the target
>> type.  It is a necessity that we have new exception stores, but I  
>> don't think
>> that necessitates a new snapshot target - something which, I think,  
>> would
>> increase our maintenance costs because we now support two entirely  
>> different
>> snapshot implementations.
>>
>> We absolutely need your exception store - there is no question  
>> about that.  I
>
> But it is not possible now. What it lacks:
> - the snapshot ids are submitted by the user (in my implementation,  
> the
> exception store itself must make the snapshot id; it is a design thing
> that can't be changed)

Really?  It can't be changed?  I think it would be great if you added  
a translation table.  Userspace would pass down a number/name/UUID (it  
doesn't really matter, as long as it is unique among the shared  
snapshots for a given COW device), then your exception store could  
make its own "snapshot id" and associate it with what userspace has  
given it.

This translation table could clean up a number of complaints that we  
have had about your implementation.  Now, all you should need is a  
simple CTR + RESUME to get a snapshot device going - same as the  
legacy snapshot targets and every other device-mapper target without  
exception.  There would no longer be a need to use 'message' to create  
a snapshot and then a further CTR+RESUME to instantiate it.

Additional benefits could be gained from this.  I no longer see a need  
to have the origin constructor changed...  Now things seem to be  
coming more in-line with the way the userspace/kernel interface was in  
the past.  That should simplify userspace coding.

Just this one thing could fix so many complaints... can you not do this?
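
As a rough sketch of the translation table idea (everything below is 
hypothetical, just to show how little would be involved):

#include <linux/list.h>
#include <linux/types.h>

#define SNAP_HANDLE_LEN 64              /* arbitrary size for this sketch */

/* one entry per snapshot, kept in the shared store's metadata */
struct snap_handle_map {
        struct list_head list;          /* or a hash table / btree entry */
        char handle[SNAP_HANDLE_LEN];   /* name/UUID chosen by userspace */
        __u64 snapid;                   /* id generated by the store */
};

On the snapshot constructor, the store would look the handle up; if it is 
not found, it would allocate a new internal snapid, record the pair in its 
metadata and use it from then on.  Userspace never needs to see the snapid.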

> - it lacks read-once-write-many copies (they are needed only on my  
> store,
> not on Fujita-Daniel's store)

I need to understand this better.  Perhaps you could help me?  This  
could be your greatest argument.  It would not deter me from asking  
for a exception store/snapshot boundary API, nor will it deter me from  
demanding better userspace/kernel boundary compliance.  However, it  
may make me realize that retrofitting the old snapshot module to also  
be able to handle shared exception stores is not the best.  (This  
would require me to remove 2 of the 33 outstanding patches I have  
posted.)

If you help me, I will probably understand this code more quickly.   
Otherwise, it will take me longer.  If I can see no solution, I will  
very easily accept that the current snapshot module is not the place  
to handle shared exception stores.  I have to warn you though... I can  
be good at finding solutions to problems.

> - it lacks snapshot-to-snapshot copies for writable snapshots (for
> Fujita-Daniel's code you need just one copy, for my code you need at  
> most
> two).

I think there are solutions to this that are not that complicated.

> - it lacks the possibility to re-submit bios to the exception store  
> to do
> more relocations after some of the relocations have been finished. (in
> extreme cases of many writes to the snapshots, there may be many
> relocations on my exception store --- there is no need to optimize for
> these cases, but they must be handled correctly).

I also would like to understand this limitation more.

> - lame delete that blocks the whole device.

What?  There is no reason to block the whole device for a delete -  
that is an exception store implementation detail that has nothing to  
do with the snapshot module or the exception store API.

>> agree with your complaints about the short-comings of the shared  
>> exception
>> store that was ported to the interface (e.g. it has no journaling,  
>> 64 share
>> limit before needing a new store, memory issues).  I ported the  
>> shared
>> exception store to the interface (along with creating a cluster-aware
>> exception store) so I could feel confident that I got the interface  
>> correct.
>> The "shared" exception store should not be viewed as complete; even  
>> though I
>> think the exception store interface is.  Please refrain from  
>> judging the API
>> based on incompleteness in the implementation of one exception store.
>
> This is not about the lameness of Fujita's code. What I mean is that your
> interface is too tied to Fujita-Daniel's store. It is not the case that
> you can just take my store and plug it in. It still needs many generic
> changes. It is not as beautiful as you described --- it is not "completely
> agnostic to the type of exception store used". And I am concerned --- why
> spend more time making the changes if a generic driver that has them is
> already written?

Now trying to aggregate comments... see below.

>
>
>> [However, I have tested the "shared" exception store and did not  
>> suffer from
>> the same problems you did.  I would be interested to have you fill  
>> me in on
>> the problems you encountered - "doesn't support writable snapshots  
>> correctly,
>> it deadlocks when the exception store overflows, it prints warnings  
>> about
>> circular locking"].  I believe I have the interface correct, and I  
>> am not
>> aware of changes that would have to be made to accommodate your  
>> exception
>> store.
>>
>> I also think you have a number of great ideas in your higher level  
>> snapshot
>> code.  I'd love to see these added to the current snapshot module.   
>> I would
>> even be willing to abandon the current snapshot module and make  
>> yours the only
>> one in the kernel iff it adhered to the following principles:
>>
>> 1) It is completely agnostic to all legacy and future exception  
>> stores
>
> My code does support all existing and possible future shared
> snapshots. It doesn't support non-shared snapshots. They are well
> implemented already and there is no reason to reinvent the wheel.

Now trying to aggregate comments... see below.

>
>
>> (e.g.
>> it should know nothing about snapid's and such, as this information  
>> aids in
>> how/where an exception store places data - something that should be  
>> known only
>> to the exception store implementation)  Without a clean break, the  
>> API is
>> meaningless.
>
> My code works in such a way that the kernel creates a snapid, it  
> passes it
> to the userspace, the userspace is supposed to store the snapid in  
> its lvm
> metadata and passes it back to the kernel. The userspace views the  
> snapid
> as an opaque value --- it doesn't look at the snapid, it just passes  
> it
> back and forth.
>
> If you wanted to change it to uuids in the kernel, it is changeable.

Now trying to aggregate comments... see below.

>
>
>> 2) The current meaning of DM ioctls is preserved.  We shouldn't  
>> have to
>> create/resume a snapshot device and then use messages or other  
>> mechanisms to
>> instantiate the shares.  This slippery slope leads us to a place  
>> where each
>> target gets to choose its userspace interaction and the defined  
>> functions for
>> that purpose become meaningless.
>
> It is changeable to uuids if the kernel maintained a uuid->snapid
> translation table in the exception store. It is possible, though we must
> know why. "snapids are submitted by userspace" is not possible.
>
>> Ultimately, I think the exception store API that I have put in  
>> place would be
>> the glue between the upper snapshot layer and the lower exception  
>> store
>> modules - whatever generation of these layers happens to exist.  I  
>> believe the
>> API follows in the models laid down by the dm-raid1/log and
>> multipath/read-balance interfaces.  I think this API is a natural and
>> necessary progression.
>>
>> I would also take issue with the fact that if the code size is  
>> increased, it
>> decreases maintainability.  I think maintainability is improved  
>> with the clean
>> separation of snapshot code and exception store modules; just as I  
>> think
>> maintainability is improved by having a file system switch that  
>> separates the
>> different file systems and provides a clean method for them to  
>> communicate
>> with higher layers.
>>
>> I welcome further thoughts,
>> brassow
>
> Why not do this refactoring, and why not have a common driver for shared
> and non-shared exception stores? Because it already took 5 months, it is
> still incomplete, and it needs non-trivial changes to be at least as
> functional as if this hadn't happened at all.
>
> If you're still thinking about "cleanness", read the first paragraph  
> about
> loss of reason again. Back into reality --- if you spend 5 months
> shuffling code around without adding any functionality it is only
> reasonable to do it if it will save more than 5 months in the long  
> term.
> And it won't.
>
> It is no shame, it occasionally happens to anyone. It happened to me on
> other projects too: I had these feelings of "wow, it will be super clean
> and easy", and when I got into typing the code, it turned out that it is
> much more time-consuming than I anticipated. And when it happened to me, I
> erased the code and changed my perception of "cleanness" - it's the only
> thing one can do.

Final comments...

You and I are in agreement that we don't want to waste time; and  
certainly, there is no time to waste.  We want a good solution.  Let  
us figure out what we both want; and determine from there what pieces  
of code we can keep and what needs to be adjusted.  I will start...

Here is what I demand:
#1) A clean API that separates the snapshot module and the exception  
store implementations.  This facilitates others who wish to write new  
exception stores - including me, who is responsible for writing the  
cluster-aware exception store.  I would appreciate it if some  
forethought was given to cluster implementations.  I am not saying  
that there cannot be a "shared" API and a "non-shared" API.  Go ahead,  
write your own - it shouldn't take too long.  I'd like to think that  
there can be one API that can satisfy the needs of both (perhaps even  
with a flag that signifies "shared" vs "non-shared" if necessary?).  I 
will continue to think that it is possible to make this happen until I  
am proved or can prove to myself otherwise.  This does /not/ mean I  
will block any attempts to do something other than a singular, all- 
inclusive API.

#2) Don't break the current model of userspace/kernel interaction  
unless it is absolutely necessary.  'delete'ing a single snapshot from  
a shared exception store is *potentially* one of these cases (although 
I can think of good solutions).  Right now, when we want to delete a  
snapshot, we simply wipe the COW device - obviously, this doesn't work  
when the COW is shared.  However, forcing userspace to know about  
snapid's which are generated by the kernel and overloading the message  
function is not acceptable.  I think there is also much room for  
improvement once that is fixed to minimize changes to the constructor  
tables and reduce the burden on user-space tools.

Here is what I am willing to sacrifice if it proves necessary:
#1) I am willing to give in on having two snapshot modules.  I have  
always stated this as a possibility.  It is very possible that it  
makes more sense this way instead of having one snapshot module be  
able to interact with all exception stores.  This is even more  
palatable if you consider that the legacy snapshots will be largely  
abandoned.  This would mean abandoning some of the patches I've posted  
upstream, but others still remain quite useful.  More than one  
snapshot-merge target might be required in this case.  There are a  
number of other disadvantages to this that I (maybe not others) would  
be willing to set aside based solely on your word that the advantages  
outweigh the disadvantages.

#2) I am willing to put in any extra work required to write the  
cluster-aware modules - even if two APIs and two snapshot modules end  
up being required.

Mikulas, I get the sense that you have looked at the API with a quick 
eye for why it can't be done, instead of whether it can be done.  The 
conclusions may ultimately be the same; but I am asking you to take a  
look and see what *can* be done.  What compromises can be made?  How  
far are you willing to come?  I don't think my base demands are too  
much to ask, nor do I think they materially differ from what is  
required for upstream inclusion.

Thanks for your continued effort on this,
  brassow

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Shared snapshots
       [not found]       ` <15B46C78-2846-43C4-8090-1EC362E24FD7@redhat.com>
@ 2009-05-06 16:06         ` Mikulas Patocka
  2009-05-07 15:44           ` Jonathan Brassow
  0 siblings, 1 reply; 24+ messages in thread
From: Mikulas Patocka @ 2009-05-06 16:06 UTC (permalink / raw)
  To: Jonathan Brassow; +Cc: dm-devel, Alasdair Kergon

Hi

> I'd like to state that the point of the refactoring work was not to reduce the
> size of the code.  The point was to provide a clear and clean interface for
> all exception stores - new and old - and do it in a way that adheres to the
> principles of the general device-mapper target API.

First let me write about possible arguments regarding "cleanness", 
"niceness" etc.

I think that what is really happening here is loss of reason.

In general, loss of reason is a cognitive bias that happens in the 
following scenario (not only in software engineering, but in other social 
situations as well):

* people want to reach goal G
* in order to reach the goal G, people do action A
* they do it thoroughly, consistently, over a long time
* now "loss of reason", an error inside the human brain, happens: people start 
slowly forgetting about the goal G and start viewing the action A as 
being "good", "nice", "right", "the way it should be done"
* from this point on, people keep doing action A even in situations where 
it damages the goal G
* people's performance decreases; they no longer attempt to reach real 
objective goals, all they want to reach is their own emotion of "niceness" 
inside their head.

Now, how does it apply to our snapshot development scenario?

* programmers want to reduce coding time and maintenance time (goal G)
* to reach this goal, programmers share their code, they sometimes 
refactor code to be better shareable (action A)
* it has been done consistently over a long time for various projects
* now people start considering sharing code and refactoring as "good", 
"clean", "the way software should be written" or whatever. They completely 
forget about the initial goal that was there (saving developers' time)
* and now we are refactoring the snapshot code even though it increases our 
development time and is not going to save time in the future, but we 
consider this "right", "clean", "clear", "nice"

Do you see how these "universal"/"emotional"/"aesthetic" arguments, not 
backed by anything, are void?


Now, let's ignore the emotional stuff and look at it from a practical point 
of view.

Is the refactored code saving development time now?
- it isn't. It already ate 5 months and it will just eat more before it 
gets functional.

Will the refactored code be saving maintenance time in the future?
- it won't. The code is as big as if the refactoring hadn't happened (and 
it'll be just bigger if you implement the missing functionality).

Anyway, two smaller drivers are more maintainable than one big driver of 
double the size. A person who doesn't understand the code will have to 
read just half of the code to understand the logic if he wants to modify 
one of the smaller drivers. It also reduces testing time: if you modify 
one smaller driver, you are certain that the second driver is intact and 
you don't have to test it.

Is your refactored code extensible to other exception stores?
- true. It is. But it was extensible even before refactoring. Before 
refactoring, one could write any exception store driver (either unshared 
or shared) and plug it into the unshared-master-driver or the 
shared-master-driver. All you did was to join the unshared-master-driver 
and the shared-master-driver into one chunk of code.

Any other advantages?




Regarding the bugs:
56MB origin, 160MB snapshot store (128MB of that is reserved for the 
non-existent journal, so it will overflow). Run mkfs.reiserfs on the 
origin. Ends up with a lot of messages like this:
May  6 13:36:10 slunicko kernel: attempt to access beyond end of device
May  6 13:36:10 slunicko kernel: dm-5: rw=25, want=1048784, limit=327680
May  6 13:36:11 slunicko kernel: attempt to access beyond end of device
May  6 13:36:11 slunicko kernel: dm-5: rw=25, want=1376256, limit=327680
May  6 13:36:11 slunicko kernel: attempt to access beyond end of device
and the process gets stuck in "D" state in 
__wait_on_bit/wait_on_page_writeback_range. Very likely the bio was lost 
somewhere and never returned with either error or success.

I found a couple of bugs in Fujita/Daniel's exception store a long 
time ago (when I was porting it to my driver); the most serious one I 
found was this:
@@ -510,7 +510,7 @@ static int read_new_bitmap_chunk(struct
        chunk_io(store, ss->cur_bitmap_chunk, WRITE, 1, ss->bitmap);

        ss->cur_bitmap_chunk++;
-       if (ss->cur_bitmap_chunk == ss->nr_bitmap_chunks)
+       if (ss->cur_bitmap_chunk == ss->nr_bitmap_chunks + FIRST_BITMAP_CHUNK)
                ss->cur_bitmap_chunk = FIRST_BITMAP_CHUNK;

        chunk_io(store, ss->cur_bitmap_chunk, READ, 1,  ss->bitmap);
... but it doesn't fix this problem. So there's something else.

Maybe the bad accesses could be due to endianness problems; I use sparc64. 
The lost bio can't be caused by endianness, so this needs to be investigated.


Writable snapshots:
create an origin and two snapshots. Write to the origin (it creates a shared 
chunk for both snapshots). Write to the same chunk in one of the snapshots. 
The write is lost. Read it again and the data is not there.

In this situation you need to copy the chunk from the snapshot store to 
the snapshot store (you can't copy it from the origin). I was looking for 
the code that handles these snapshot-to-snapshot copies, but I didn't find 
it. So the problem is not some silly typo that could be fixed with one 
line - the problem is that the code to handle it wasn't written.
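
Just to make explicit what the missing code has to do (a sketch with 
hypothetical helpers, not code from either implementation):

/* sketch only --- all types and helpers here are hypothetical */
typedef unsigned long long chunk_t;
struct store;
struct snapshot;

int chunk_is_shared(struct store *s, struct snapshot *snap, chunk_t chunk);
chunk_t chunk_location(struct store *s, struct snapshot *snap, chunk_t chunk);
chunk_t alloc_chunk(struct store *s);
void copy_chunk(struct store *s, chunk_t from, chunk_t to);
void remap_private(struct store *s, struct snapshot *snap,
                   chunk_t chunk, chunk_t to);

static void prepare_snapshot_write(struct store *s, struct snapshot *snap,
                                   chunk_t chunk)
{
        if (chunk_is_shared(s, snap, chunk)) {
                /* the other snapshot still needs the old data, and the
                   origin already holds newer data, so the copy must go
                   from the snapshot store to the snapshot store */
                chunk_t to = alloc_chunk(s);
                copy_chunk(s, chunk_location(s, snap, chunk), to);
                remap_private(s, snap, chunk, to);
        }
        /* only now can the write be applied to this snapshot's chunk */
}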

> I tried to ensure that all necessary API functions were in place.  Like
> "suspend", which is not used by any of the legacy exception stores, but will
> be useful when it comes time to quiesce a shared exception store that may be
> performing a "delete" operation.

You can't do this. "delete" is a long-running operation; it needs to walk 
the whole tree. Stopping an exception store during delete is not 
acceptable on big devices. I have already implemented delete in the 
background: it goes step by step and blocks concurrent access only for 
these little steps, so it won't lock the computer up for minutes or hours. 
It can also batch multiple delete requests into just one tree walk.
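
The structure of it is roughly this (a simplified sketch with hypothetical 
helpers, not the actual multisnapshot code):

/* sketch only --- hypothetical helpers */
struct store;

void store_lock(struct store *s);
void store_unlock(struct store *s);
/* frees a bounded number of tree entries belonging to any snapshot queued
   for deletion; returns nonzero while more work remains */
int delete_step(struct store *s, unsigned max_entries);

static void background_delete(struct store *s)
{
        int more;

        do {
                store_lock(s);
                more = delete_step(s, 64);      /* one small step */
                store_unlock(s);
                /* the lock is dropped between steps, so normal I/O is
                   blocked only for the duration of one step; deletes of
                   several snapshots are batched into the same tree walk */
        } while (more);
}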

> Please note, that there probably won't be that much code sharing between
> exception store modules; but the snapshot code is completely agnostic to the
> type of exception store used.  This allows an infinite variety of exception
> stores without change to the snapshot code.  This is also nice because changes
> in the snapshot code should not require work in the exception stores and vice
> versa. Userspace code should also be reduced due to identical mechanisms used
> to communicate and operate snapshots with legacy or future exception stores -
> you need only change the constructor arguments.  Backwards compatibility is
> improved and the meaning of DM ioctls doesn't change depending on the target
> type.  It is a necessity that we have new exception stores, but I don't think
> that necessitates a new snapshot target - something which, I think, would
> increase our maintenance costs because we now support two entirely different
> snapshot implementations.
> 
> We absolutely need your exception store - there is no question about that.  I

But it is not possible now. What it lacks:
- the snapshot ids are submitted by the user (in my implementation, the 
exception store itself must make the snapshot id; it is a design thing that 
can't be changed)
- it lacks read-once-write-many copies (they are needed only on my store, 
not on Fujita-Daniel's store)
- it lacks snapshot-to-snapshot copies for writable snapshots (for 
Fujita-Daniel's code you need just one copy, for my code you need at most 
two).
- it lacks the possibility to re-submit bios to the exception store to do 
more relocations after some of the relocations have been finished. (in 
extreme cases of many writes to the snapshots, there may be many 
relocations on my exception store --- there is no need to optimize for these 
cases, but they must be handled correctly).
- lame delete that blocks the whole device.

> agree with your complaints about the short-comings of the shared exception
> store that was ported to the interface (e.g. it has no journaling, 64 share
> limit before needing a new store, memory issues).  I ported the shared
> exception store to the interface (along with creating a cluster-aware
> exception store) so I could feel confident that I got the interface correct.
> The "shared" exception store should not be viewed as complete; even though I
> think the exception store interface is.  Please refrain from judging the API
> based on incompleteness in the implementation of one exception store.

This is not about the lameness of Fujita's code. What I mean is that your 
interface is too tied to Fujita-Daniel's store. It is not the case that you 
can just take my store and plug it in. It still needs many generic changes. 
It is not as beautiful as you described --- it is not "completely agnostic 
to the type of exception store used". And I am concerned --- why spend more 
time making the changes if a generic driver that has them is already 
written?

> [However, I have tested the "shared" exception store and did not suffer from
> the same problems you did.  I would be interested to have you fill me in on
> the problems you encountered - "doesn't support writable snapshots correctly,
> it deadlocks when the exception store overflows, it prints warnings about
> circular locking"].  I believe I have the interface correct, and I am not
> aware of changes that would have to be made to accommodate your exception
> store.
> 
> I also think you have a number of great ideas in your higher level snapshot
> code.  I'd love to see these added to the current snapshot module.  I would
> even be willing to abandon the current snapshot module and make yours the only
> one in the kernel iff it adhered to the following principles:
>
> 1) It is completely agnostic to all legacy and future exception stores

My code does support all existing and possible future shared 
snapshots. It doesn't support non-shared snapshots. They are well 
implemented already and there is no reason to reinvent the wheel.

> (e.g.
> it should know nothing about snapid's and such, as this information aids in
> how/where an exception store places data - something that should be known only
> to the exception store implementation)  Without a clean break, the API is
> meaningless.

My code works in such a way that the kernel creates a snapid and passes it 
to the userspace; the userspace is supposed to store the snapid in its LVM 
metadata and pass it back to the kernel. The userspace views the snapid 
as an opaque value --- it doesn't look at the snapid, it just passes it 
back and forth.

If you wanted to change it to uuids in the kernel, it is changeable.

> 2) The current meaning of DM ioctls is preserved.  We shouldn't have to
> create/resume a snapshot device and then use messages or other mechanisms to
> instantiate the shares.  This slippery slope leads us to a place where each
> target gets to choose its userspace interaction and the defined functions for
> that purpose become meaningless.

It is changeable to uuids if the kernel maintained a uuid->snapid 
translation table in the exception store. It is possible, though we must 
know why. "snapids are submitted by userspace" is not possible.

> Ultimately, I think the exception store API that I have put in place would be
> the glue between the upper snapshot layer and the lower exception store
> modules - whatever generation of these layers happens to exist.  I believe the
> API follows in the models laid down by the dm-raid1/log and
> multipath/read-balance interfaces.  I think this API is a natural and
> necessary progression.
> 
> I would also take issue with the fact that if the code size is increased, it
> decreases maintainability.  I think maintainability is improved with the clean
> separation of snapshot code and exception store modules; just as I think
> maintainability is improved by having a file system switch that separates the
> different file systems and provides a clean method for them to communicate
> with higher layers.
> 
> I welcome further thoughts,
> brassow

Why not do this refactoring, and why not have a common driver for shared 
and non-shared exception stores? Because it already took 5 months, it is 
still incomplete, and it needs non-trivial changes to be at least as 
functional as if this hadn't happened at all.

If you're still thinking about "cleanness", read the first paragraph about 
loss of reason again. Back into reality --- if you spend 5 months 
shuffling code around without adding any functionality it is only 
reasonable to do it if it will save more than 5 months in the long term. 
And it won't.

It is no shame, it occasionally happens to anyone. It happened to me on 
other projects too: I had these feelings of "wow, it will be super clean 
and easy", and when I got into typing the code, it turned out that it is 
much more time-consuming than I anticipated. And when it happened to me, I 
erased the code and changed my perception of "cleanness" - it's the only 
thing one can do.

Mikulas

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: shared snapshots
  2009-02-06  7:18     ` FUJITA Tomonori
@ 2009-02-20 23:58       ` Jonathan Brassow
  0 siblings, 0 replies; 24+ messages in thread
From: Jonathan Brassow @ 2009-02-20 23:58 UTC (permalink / raw)
  To: device-mapper development


On Feb 6, 2009, at 1:18 AM, FUJITA Tomonori wrote:

> On Thu, 5 Feb 2009 16:17:29 -0600
> Jonathan Brassow <jbrassow@redhat.com> wrote:
>
>>
>> On Feb 3, 2009, at 12:48 AM, FUJITA Tomonori wrote:
>>
>>> But surely supporting the shared implementation(s) on the top of the
>>> existing snapshot code needs more refactoring. Jonathan, have you
>>> already started to work on it?
>>
>> I have a couple pending patches that I haven't sent to the list yet.
>> I'm hoping to get back to exception store API patches in a week or
>> two.  In the mean time, the current exception store API patches are
>> ready for review.  The additions to facilitate shared exception  
>> stores
>> will go on top of that patch set.
>
> Can you post the pending patches to the mailing list or put on a web
> site, please?

Guess I needed the whole two weeks, but I just posted the next wave of  
patches to the list.  I have to go back to some other work again next  
week, but I'm hoping to have time to dig into the "shared" exception  
store code to help you guys integrate into the given structure.  If  
not next week, then the week after.  :(

  brassow

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: shared snapshots
  2009-02-05 22:17   ` Jonathan Brassow
@ 2009-02-06  7:18     ` FUJITA Tomonori
  2009-02-20 23:58       ` Jonathan Brassow
  0 siblings, 1 reply; 24+ messages in thread
From: FUJITA Tomonori @ 2009-02-06  7:18 UTC (permalink / raw)
  To: dm-devel

On Thu, 5 Feb 2009 16:17:29 -0600
Jonathan Brassow <jbrassow@redhat.com> wrote:

> 
> On Feb 3, 2009, at 12:48 AM, FUJITA Tomonori wrote:
> 
> > But surely supporting the shared implementation(s) on the top of the
> > existing snapshot code needs more refactoring. Jonathan, have you
> > already started to work on it?
> 
> I have a couple pending patches that I haven't sent to the list yet.   
> I'm hoping to get back to exception store API patches in a week or  
> two.  In the mean time, the current exception store API patches are  
> ready for review.  The additions to facilitate shared exception stores  
> will go on top of that patch set.

Can you post the pending patches to the mailing list or put on a web
site, please?

Thanks,

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: shared snapshots
  2009-02-03  6:48 ` FUJITA Tomonori
  2009-02-05 22:17   ` Jonathan Brassow
@ 2009-02-06  3:44   ` Mikulas Patocka
  1 sibling, 0 replies; 24+ messages in thread
From: Mikulas Patocka @ 2009-02-06  3:44 UTC (permalink / raw)
  To: agk; +Cc: dm-devel



On Tue, 3 Feb 2009, FUJITA Tomonori wrote:

> On Tue, 3 Feb 2009 01:25:30 -0500 (EST)
> Mikulas Patocka <mpatocka@redhat.com> wrote:
> 
> > If you are going to put shared storage into your snapshots, look at this: 
> > http://people.redhat.com/mpatocka/patches/kernel/new-snapshots/devel/
> > 
> > --- this is my snapshot store with Fujita/Daniel's store as a loadable 
> > module.
> > 
> > So if you are going to do shared storage interface, you must accommodate 
> > both of these implementations into it.
> > 
> > I still don't understand, how do you intend to patch it on the top of 
> > existing snapshot code. Definitely, I wouldn't say that it will be an easy 
> > task. Look at the code and have fun with it.
> 
> I expect that we will support the shared implementation(s) on top
> of the existing snapshot code instead of inventing a new target_type
> (as Mikulas does with multisnap_*).

Alasdair wants to build it on top of the existing snapshots, but I haven't 
heard any specific argument from him, just generalized talk about how 
sharing code is good.

So please start giving some specific, non-generalized ideas about how you 
imagine doing it.

I.e. instead of saying "sharing code is good", download the stuff from 
http://people.redhat.com/mpatocka/patches/kernel/new-snapshots/devel/ and 
say *EXACTLY* which code (structures, functions...) you want to share 
with the existing snapshots.

Instead of saying "we treat all snapshots like objects with method tables", 
look at the actual method tables of the in-kernel snapshots and my snapshots, 
and say *WHICH* methods exactly you are going to join.

--- then, we can discuss it and go somewhere. Abstract ideas about 
software engineering make no sense at this stage.

> If not, we can live without refactoring the existing snapshot code.
> 
> But surely supporting the shared implementation(s) on the top of the
> existing snapshot code needs more refactoring. Jonathan, have you
> already started to work on it?
> 
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel

Mikulas

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: shared snapshots
  2009-02-03  6:48 ` FUJITA Tomonori
@ 2009-02-05 22:17   ` Jonathan Brassow
  2009-02-06  7:18     ` FUJITA Tomonori
  2009-02-06  3:44   ` Mikulas Patocka
  1 sibling, 1 reply; 24+ messages in thread
From: Jonathan Brassow @ 2009-02-05 22:17 UTC (permalink / raw)
  To: device-mapper development


On Feb 3, 2009, at 12:48 AM, FUJITA Tomonori wrote:

> But surely supporting the shared implementation(s) on the top of the
> existing snapshot code needs more refactoring. Jonathan, have you
> already started to work on it?

I have a couple pending patches that I haven't sent to the list yet.   
I'm hoping to get back to exception store API patches in a week or  
two.  In the mean time, the current exception store API patches are  
ready for review.  The additions to facilitate shared exception stores  
will go on top of that patch set.

brassow

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: shared snapshots
  2009-02-03  6:25 shared snapshots Mikulas Patocka
@ 2009-02-03  6:48 ` FUJITA Tomonori
  2009-02-05 22:17   ` Jonathan Brassow
  2009-02-06  3:44   ` Mikulas Patocka
  0 siblings, 2 replies; 24+ messages in thread
From: FUJITA Tomonori @ 2009-02-03  6:48 UTC (permalink / raw)
  To: dm-devel; +Cc: agk

On Tue, 3 Feb 2009 01:25:30 -0500 (EST)
Mikulas Patocka <mpatocka@redhat.com> wrote:

> If you are going to put shared storage into your snapshots, look at this: 
> http://people.redhat.com/mpatocka/patches/kernel/new-snapshots/devel/
> 
> --- this is my snapshot store with Fujita/Daniel's store as a loadable 
> module.
> 
> So if you are going to do shared storage interface, you must accommodate 
> both of these implementations into it.
> 
> I still don't understand how you intend to patch it on top of the 
> existing snapshot code. Definitely, I wouldn't say that it will be an easy 
> task. Look at the code and have fun with it.

I expect that we will support the shared implementation(s) on top
of the existing snapshot code instead of inventing a new target_type
(as Mikulas does with multisnap_*).

If not, we can live without refactoring the existing snapshot code.

But surely supporting the shared implementation(s) on top of the
existing snapshot code needs more refactoring. Jonathan, have you
already started to work on it?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* shared snapshots
@ 2009-02-03  6:25 Mikulas Patocka
  2009-02-03  6:48 ` FUJITA Tomonori
  0 siblings, 1 reply; 24+ messages in thread
From: Mikulas Patocka @ 2009-02-03  6:25 UTC (permalink / raw)
  To: Jonathan Brassow; +Cc: dm-devel, Alasdair G Kergon

Hi

If you are going to put shared storage into your snapshots, look at this: 
http://people.redhat.com/mpatocka/patches/kernel/new-snapshots/devel/

--- this is my snapshot store with Fujita/Daniel's store as a loadable 
module.

So if you are going to do a shared storage interface, you must accommodate 
both of these implementations in it.

I still don't understand how you intend to patch it on top of the existing 
snapshot code. Definitely, I wouldn't say that it will be an easy task. Look 
at the code and have fun with it.

Mikulas

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2010-09-10 17:47 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-12-16  8:05 Shared snapshots Mikulas Patocka
2009-12-16 20:39 ` Mike Snitzer
2009-12-17 16:32   ` Mike Snitzer
2010-03-30 16:26   ` Mike Snitzer
2010-03-31  3:43     ` Mikulas Patocka
2009-12-22  9:50 ` 张宇
  -- strict thread matches above, loose matches on Subject: below --
2010-09-10 17:47 shared snapshots Mikulas Patocka
2010-01-06 14:38 Shared snapshots Mikulas Patocka
2010-01-07  8:42 ` Pasi Kärkkäinen
2010-01-07  9:05   ` haad
2010-01-07 10:08     ` 张宇
2010-01-11  9:14       ` Mikulas Patocka
2010-01-11  9:12   ` Mikulas Patocka
2010-01-11  9:40     ` Pasi Kärkkäinen
2010-01-11  9:49       ` Mikulas Patocka
     [not found] <1240610034.7392.284.camel@p670.boston.redhat.com>
     [not found] ` <Pine.LNX.4.64.0904270815420.18227@hs20-bc2-1.build.redhat.com>
     [not found]   ` <1240836879.1759.16.camel@p670.boston.redhat.com>
     [not found]     ` <Pine.LNX.4.64.0905040741190.20446@hs20-bc2-1.build.redhat.com>
     [not found]       ` <15B46C78-2846-43C4-8090-1EC362E24FD7@redhat.com>
2009-05-06 16:06         ` Mikulas Patocka
2009-05-07 15:44           ` Jonathan Brassow
2009-05-11 10:44             ` Mikulas Patocka
2009-02-03  6:25 shared snapshots Mikulas Patocka
2009-02-03  6:48 ` FUJITA Tomonori
2009-02-05 22:17   ` Jonathan Brassow
2009-02-06  7:18     ` FUJITA Tomonori
2009-02-20 23:58       ` Jonathan Brassow
2009-02-06  3:44   ` Mikulas Patocka
