* Best insertion point for storage shim
@ 2012-08-24 15:49 Stephen Perkins
  2012-08-24 16:28 ` Tommi Virtanen
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Stephen Perkins @ 2012-08-24 15:49 UTC (permalink / raw)
  To: ceph-devel

Hi all,

I'd like to get feedback from folks as to where the best place would be to
insert a "shim" into the RADOS object storage.

Currently, you can configure RADOS to use copy based storage to store
redundant copies of a file (I like 3 redundant copies so I will use that as
an example).  So... each file is stored in three locations on independent
hardware.   The redundancy has a cost of 3x the storage.

I would assume that it is "possible" to configure RADOS to store only 1 copy
of a file (bear with me here).

I'd like to see where it may be possible to insert a "shim" in the storage
such that I can take the file to be stored and apply some erasure coding to
it. Therefore, the file now becomes multiple files that are handed off to
RADOS.  

The shim would also have to take read file requests and read some small
portion of the fragments and recombine.
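
Roughly, here is the kind of thing I imagine the shim doing, sketched with a
trivial single-parity code just to make the idea concrete (a real shim would
use Reed-Solomon, e.g. zfec as Tahoe-LAFS does, and would hand each fragment
off to RADOS rather than keep it in memory):

    # Minimal sketch: split an object into K data fragments plus one
    # XOR parity fragment, so any single lost fragment can be rebuilt.
    K = 3

    def encode(data, k=K):
        """Return k data fragments plus one parity fragment, and the pad length."""
        pad = (-len(data)) % k
        data += b"\0" * pad
        size = len(data) // k
        frags = [data[i * size:(i + 1) * size] for i in range(k)]
        parity = bytearray(size)
        for frag in frags:
            for i, byte in enumerate(frag):
                parity[i] ^= byte
        return frags + [bytes(parity)], pad

    def decode(frags, pad, k=K):
        """Rebuild the original data from k+1 fragments, at most one of them None."""
        missing = [i for i, f in enumerate(frags) if f is None]
        if missing:
            size = len(next(f for f in frags if f is not None))
            rebuilt = bytearray(size)
            for f in frags:
                if f is not None:
                    for i, byte in enumerate(f):
                        rebuilt[i] ^= byte
            frags[missing[0]] = bytes(rebuilt)
        data = b"".join(frags[:k])
        return data[:len(data) - pad] if pad else data

    frags, pad = encode(b"hello rados, this is a test payload")
    frags[1] = None                 # simulate losing one fragment
    assert decode(frags, pad) == b"hello rados, this is a test payload"

The real write path would store each fragment as its own object (or in its
own pool), and the read path would only need to fetch some subset of them.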

Basically... what I am asking is...  where would be the best place to start
looking at adding this:
	https://tahoe-lafs.org/trac/tahoe-lafs#
	
(just the erasure coded part).

Here is the real rationale.  Extreme availability at only 1.3x or 1.6x
redundancy:

	http://www.zdnet.com/videos/whiteboard/dispersed-storage/156114

Thoughts appreciated,

- Steve

P.S. Yes... I posted on this earlier.  Microsoft Azure storage takes this
approach, lazily erasure-coding inactive files to significantly reduce
storage costs while increasing reliability.




* Re: Best insertion point for storage shim
  2012-08-24 15:49 Best insertion point for storage shim Stephen Perkins
@ 2012-08-24 16:28 ` Tommi Virtanen
  2012-08-24 16:42 ` Atchley, Scott
  2012-08-24 16:42 ` Sage Weil
  2 siblings, 0 replies; 8+ messages in thread
From: Tommi Virtanen @ 2012-08-24 16:28 UTC (permalink / raw)
  To: Stephen Perkins; +Cc: ceph-devel

On Fri, Aug 24, 2012 at 8:49 AM, Stephen Perkins <perkins@netmass.com> wrote:
> I'd like to get feedback from folks as to where the best place would be to
> insert a "shim" into the RADOS object storage.
...
> I would assume that it is "possible" to configure RADOS to store only 1 copy
> of a file (bear with me here).

RADOS stores objects, not files.

The Ceph File System stores file metadata in MDS journals and
directory objects, and stripes file data over several objects.

> I'd like to see where it may be possible to insert a "shim" in the storage
> such that I can take the file to be stored and apply some erasure coding to
> it. Therefore, the file now becomes multiple files that are handed off to
> RADOS.
>
> The shim would also have to take read file requests and read some small
> portion of the fragments and recombine.

Sounds like you want a client library on top of RADOS.

Getting that even remotely performant is going to be a huge
engineering challenge. Zooko will tell you that Tahoe-LAFS ain't a
filesystem, largely for that reason (and if you do run into him, say
hi from Tv!).

Good luck!


* Re: Best insertion point for storage shim
  2012-08-24 15:49 Best insertion point for storage shim Stephen Perkins
  2012-08-24 16:28 ` Tommi Virtanen
@ 2012-08-24 16:42 ` Atchley, Scott
  2012-08-24 16:42 ` Sage Weil
  2 siblings, 0 replies; 8+ messages in thread
From: Atchley, Scott @ 2012-08-24 16:42 UTC (permalink / raw)
  To: Stephen Perkins; +Cc: ceph-devel

On Aug 24, 2012, at 11:49 AM, Stephen Perkins wrote:

> Hi all,
> 
> I'd like to get feedback from folks as to where the best place would be to
> insert a "shim" into the RADOS object storage.
> 
> Currently, you can configure RADOS to use copy based storage to store
> redundant copies of a file (I like 3 redundant copies so I will use that as
> an example).  So... each file is stored in three locations on independent
> hardware.   The redundancy has a cost of 3x the storage.
> 
> I would assume that it is "possible" to configure RADOS to store only 1 copy
> of a file (bear with me here).
> 
> I'd like to see where it may be possible to insert a "shim" in the storage
> such that I can take the file to be stored and apply some erasure coding to
> it. Therefore, the file now becomes multiple files that are handed off to
> RADOS.
> 
> The shim would also have to take read file requests and read some small
> portion of the fragments and recombine.

This sounds more like a modification to the POSIX file system interface than to the RADOS object store, which knows nothing of files.

> Basically... what I am asking is...  where would be the best place to start
> looking at adding this:
> 	https://tahoe-lafs.org/trac/tahoe-lafs#
> 	
> (just the erasure coded part).
> 
> Here is the real rationale.  Extreme availability at only 1.3 or 1.6 time
> redundancy:
> 
> 	http://www.zdnet.com/videos/whiteboard/dispersed-storage/156114

The "extreme" reliability is a bit oversold. I worked on a project a decade ago that stored blocks of files over servers scattered around the globe. Each block was checksummed and optionally encrypted (they were not our servers, so we did not assume that we could trust the admins). To handle reliability, we implemented both replication (copies) and error coding (Reed-Solomon based erasure coding). There is a trade-off between the two.

Copies are nice since they do not require extra computation, and they can be handled between servers so that the client only has to store the data once (which is what the Ceph file system does). Copies also let you load-balance reads over more servers and increase read throughput (Ceph does not do this explicitly, but since copies are pseudo-randomly placed, they _should_ load-balance on average). With good CRUSH rules, they also provide better fault tolerance (e.g. if a rack goes down, pull from a copy on another rack). With N copies you can tolerate at most N-1 failures, and your total usable storage is 1/Nth of the raw capacity.

Error coding allows you to tolerate a greater number of failures at the expense of computation and memory usage. When using error coding, you break up a file into blocks (as mentioned in the video). For each set of M data blocks (the coding block set size), you create one or more (N) coding blocks. In the video example, 1.3 corresponds to one coding block per three data blocks (N=1, M=3): the original data can be recomputed from any 3 of the M + N = 4 blocks, so you can tolerate losing any one of them. A level of 1.6 is simply two coding blocks per three data blocks, which can survive losing any two blocks. Using three coding blocks per three data blocks (not mentioned in the video) lets you survive three failures at the cost of half the raw capacity, which is clearly a win over simple replication.
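
To make the arithmetic concrete, here is a tiny illustration of those
trade-offs (my own numbers, nothing Ceph-specific):

    def overhead(m, n):
        """Raw-to-usable storage ratio for m data blocks plus n coding blocks."""
        return float(m + n) / m

    # (m data, n coding) -> storage overhead and failures tolerated
    for m, n in [(3, 1), (3, 2), (3, 3)]:
        print("m=%d n=%d: %.2fx raw storage, survives %d lost blocks"
              % (m, n, overhead(m, n), n))

    # Plain replication for comparison: N copies cost Nx raw storage
    # and survive N-1 lost copies.
    for copies in (2, 3):
        print("%d copies: %dx raw storage, survives %d lost copies"
              % (copies, copies, copies - 1))

The (3, 1) and (3, 2) rows are the 1.3 and 1.6 figures from the video.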

The downside is that calculating the erasure coding is not cheap, and it requires an extra block's worth of memory until it is complete. It is best to implement the coding at the client, since the client has all the data; servers do not, and would have to copy the data to whichever server performs the computation. It is possible to pipeline the storing of blocks and hopefully mask this cost, but it adds to the processing requirements for normal usage (not to mention when handling failures). Also, if you need to read a block that is not available, you are no longer reading one block (e.g. 4 MB) but the whole coding set (M blocks of 4 MB each), which increases the network traffic M times.
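
With the numbers above (4 MB blocks, M = 3), the degraded-read penalty
looks like this:

    block_mb = 4                      # size of one block
    m = 3                             # data blocks per coding set
    normal_read = block_mb            # healthy read: just the block you want
    degraded_read = m * block_mb      # lost block: fetch m surviving blocks and recompute
    print(normal_read, degraded_read) # 4 MB vs 12 MB -- 3x the network traffic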

Erasure coding is no magic bullet. It has its uses, but it is complicated and it increases the computing resources required.

Scott




* Re: Best insertion point for storage shim
  2012-08-24 15:49 Best insertion point for storage shim Stephen Perkins
  2012-08-24 16:28 ` Tommi Virtanen
  2012-08-24 16:42 ` Atchley, Scott
@ 2012-08-24 16:42 ` Sage Weil
  2012-08-31 14:37   ` Stephen Perkins
  2 siblings, 1 reply; 8+ messages in thread
From: Sage Weil @ 2012-08-24 16:42 UTC (permalink / raw)
  To: Stephen Perkins; +Cc: ceph-devel

On Fri, 24 Aug 2012, Stephen Perkins wrote:
> Hi all,
> 
> I'd like to get feedback from folks as to where the best place would be to
> insert a "shim" into the RADOS object storage.
> 
> Currently, you can configure RADOS to use copy based storage to store
> redundant copies of a file (I like 3 redundant copies so I will use that as
> an example).  So... each file is stored in three locations on independent
> hardware.   The redundancy has a cost of 3x the storage.
> 
> I would assume that it is "possible" to configure RADOS to store only 1 copy
> of a file (bear with me here).
> 
> I'd like to see where it may be possible to insert a "shim" in the storage
> such that I can take the file to be stored and apply some erasure coding to
> it. Therefore, the file now becomes multiple files that are handed off to
> RADOS.  
> 
> The shim would also have to take read file requests and read some small
> portion of the fragments and recombine.
> 
> Basically... what I am asking is...  where would be the best place to start
> looking at adding this:
> 	https://tahoe-lafs.org/trac/tahoe-lafs#
> 	
> (just the erasure coded part).
> 
> Here is the real rationale.  Extreme availability at only 1.3 or 1.6 time
> redundancy:
> 
> 	http://www.zdnet.com/videos/whiteboard/dispersed-storage/156114
> 
> Thoughts appreciated,

The good news is that CRUSH has a mode that is intended for erasure/parity 
coding, and there are fields reserved in many ceph structures to support 
this type of thing.  The bad news is that in order to make it work it 
needs to live inside of rados, not on top of it.  The reason is that you 
need to separate the fragments across devices/failure domains/etc, which 
happens at the PG level; users of librados have no control over that 
(objects are randomly hashed into PGs, and then PGs are mapped to 
devices).

And in order to implement it properly, a lot of code shuffling and wading 
through OSD internals will be necessary.  There are some basic 
abstractions in place, but they are largely ignored and need to be shifted 
around because replication has been the only implementation for some time 
now.

I think the only way to layer this on top of rados and align your 
fragments with failure domains would be to create N different pools with 
distinct devices, and store one fragment in each pool...
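
Very roughly, from a librados client that layering would look something
like the sketch below (the frag-pool-* names are hypothetical, and each
pool would need its own CRUSH rule pinning it to a distinct set of
devices/failure domains):

    import rados

    K = 4   # fragments per object, one per pool

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()

    # Assumed to already exist, each mapped to disjoint failure domains
    # by its CRUSH rule.
    ioctxs = [cluster.open_ioctx("frag-pool-%d" % i) for i in range(K)]

    def put_fragments(name, fragments):
        """Store one erasure-coded fragment of `name` in each pool."""
        for ioctx, frag in zip(ioctxs, fragments):
            ioctx.write_full(name, frag)

    def get_fragments(name, frag_size):
        """Read back whichever fragments survive; the decoder fills the gaps."""
        frags = []
        for ioctx in ioctxs:
            try:
                frags.append(ioctx.read(name, length=frag_size))
            except rados.ObjectNotFound:
                frags.append(None)
        return frags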

sage


* RE: Best insertion point for storage shim
  2012-08-24 16:42 ` Sage Weil
@ 2012-08-31 14:37   ` Stephen Perkins
  2012-08-31 15:15     ` Tommi Virtanen
  0 siblings, 1 reply; 8+ messages in thread
From: Stephen Perkins @ 2012-08-31 14:37 UTC (permalink / raw)
  To: 'Sage Weil'; +Cc: ceph-devel

Hi all,

Excellent points all.  Many thanks for the clarification. 

*Tommi* - Yep... I wrote file but had object in mind.  However... now that
you bring up the distinction... I may actually mean file (see below).  I
don't know Zooko personally, but will definitely pass that along if I meet
him!

*Scott* - Agreed.

As to the performance... I am also in agreement.  My thoughts were to make
the operation lazy.  By this, I mean that items that are not likely to
change much (think archive items) could be converted from N copies to an
erasure-coded equivalent.  The lazy piece would help reduce the processing
overhead.  The encoded items could also be decoded back to N copies on
demand if they are accessed more often than a given threshold.  This is not
exactly tiered storage... but it has many of the same characteristics.

Sage... it may be that your second approach, N storage pools with one
fragment written to each, is the best one.  The reasoning is that I'm not
sure RADOS would have any idea of "which" objects are candidates for lazy
erasure coding.  If done closer to the POSIX level, files and directories
that have not been accessed recently could become candidates for the
coding.

My personal desire is to have this available for archiving large,
file-based datasets in an economical fashion.  The files would be
"generated" by commercial file archiving software (with the source data
contained on other systems) and would be stored on a Ceph cluster via
either CephFS or an RBD device with a standard file system on it.

Then, because of domain-specific knowledge about the data (i.e. it is
archive data), I would know that much of the data will probably never be
touched again.  IMHO, that makes it good candidate data for erasure coding.

One approach would be to have a standard CephFS mount configured with
RADOS keeping N copies of the data.  The second would be a new RS-CephFS
(Reed-Solomon encoded) mount point (possibly using Sage's many-storage-pools
approach).  Then... using available tiering software, files could be
"moved" from one mount point to the other based on some criteria.  Pointers
on the original mount point make this basically invisible.  If a file is
accessed too many times, it can be "moved" back to the CephFS mount point.
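
As a rough sketch of the tiering piece (hypothetical paths and a made-up
atime policy; the symlink is the "pointer" I have in mind):

    import os
    import shutil
    import time

    REPL_ROOT = "/mnt/cephfs"      # replicated CephFS mount (hypothetical path)
    EC_ROOT = "/mnt/rscephfs"      # erasure-coded mount (hypothetical path)
    COLD_AFTER = 90 * 24 * 3600    # demote files untouched for 90 days

    def demote_cold_files():
        """Move cold files to the erasure-coded mount, leaving a symlink behind."""
        now = time.time()
        for dirpath, _, filenames in os.walk(REPL_ROOT):
            for fn in filenames:
                src = os.path.join(dirpath, fn)
                if os.path.islink(src) or now - os.stat(src).st_atime < COLD_AFTER:
                    continue           # already demoted, or still warm
                dst = os.path.join(EC_ROOT, os.path.relpath(src, REPL_ROOT))
                os.makedirs(os.path.dirname(dst), exist_ok=True)
                shutil.move(src, dst)  # copies across the two mounts
                os.symlink(dst, src)   # "pointer" keeps the old path working

Promotion back would be the same move in reverse, triggered by an access
counter or similar.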

Would this require 2 clusters because of the need to have RADOS keep N
copies on one and 1 copy on the other? 

I appreciate the discussion... it is helping me fashion what I'm really
interested in...

- Steve


* Re: Best insertion point for storage shim
  2012-08-31 14:37   ` Stephen Perkins
@ 2012-08-31 15:15     ` Tommi Virtanen
  2012-08-31 15:59       ` Atchley, Scott
  0 siblings, 1 reply; 8+ messages in thread
From: Tommi Virtanen @ 2012-08-31 15:15 UTC (permalink / raw)
  To: Stephen Perkins; +Cc: Sage Weil, ceph-devel

On Fri, Aug 31, 2012 at 10:37 AM, Stephen Perkins <perkins@netmass.com> wrote:
> Would this require 2 clusters because of the need to have RADOS keep N
> copies on one and 1 copy on the other?

That's doable with just multiple RADOS pools, no need for multiple clusters.

And CephFS is even able to pick what pool to put a file in, at the
time of its creation (see set_layout).
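
For example, on a client new enough to expose layouts as virtual xattrs,
choosing the pool for a new (still empty) file looks roughly like this;
the attribute name and "ecpool" are assumptions, and older clients would
use the cephfs tool's set_layout command instead:

    import os

    path = "/mnt/cephfs/archive/newfile"   # hypothetical CephFS path
    open(path, "w").close()                # layout can only be set while the file is empty

    os.setxattr(path, "ceph.file.layout.pool", b"ecpool")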


* Re: Best insertion point for storage shim
  2012-08-31 15:15     ` Tommi Virtanen
@ 2012-08-31 15:59       ` Atchley, Scott
  2012-08-31 16:08         ` Tommi Virtanen
  0 siblings, 1 reply; 8+ messages in thread
From: Atchley, Scott @ 2012-08-31 15:59 UTC (permalink / raw)
  To: Tommi Virtanen; +Cc: Stephen Perkins, Sage Weil, ceph-devel

On Aug 31, 2012, at 11:15 AM, Tommi Virtanen wrote:

> On Fri, Aug 31, 2012 at 10:37 AM, Stephen Perkins <perkins@netmass.com> wrote:
>> Would this require 2 clusters because of the need to have RADOS keep N
>> copies on one and 1 copy on the other?
> 
> That's doable with just multiple RADOS pools, no need for multiple clusters.
> 
> And CephFS is even able to pick what pool to put a file in, at the
> time of its creation (see set_layout).

I think what he is looking for is not to bring data to a client to convert from replication to/from erasure coding, but to have the servers do it based on some metric _or_ have the client indicate which file needs to be converted and have the servers do the work.

I believe what you are saying is that I can have a directory using the replicated pool and another directory (or sub-directory) that uses the coding pool. The client would then copy the file from one directory to the other. The question becomes "Who does the erasure encoding?". The client (read back from the replica pool and write to the erasure pool) or the servers (copy data to the erasure pool and calculate on the servers)?

Scott


* Re: Best insertion point for storage shim
  2012-08-31 15:59       ` Atchley, Scott
@ 2012-08-31 16:08         ` Tommi Virtanen
  0 siblings, 0 replies; 8+ messages in thread
From: Tommi Virtanen @ 2012-08-31 16:08 UTC (permalink / raw)
  To: Atchley, Scott; +Cc: Stephen Perkins, Sage Weil, ceph-devel

On Fri, Aug 31, 2012 at 11:59 AM, Atchley, Scott <atchleyes@ornl.gov> wrote:
> I think what he is looking for is not to bring data to a client to convert from replication to/from erasure coding, but to have the servers do it based on some metric _or_ have the client indicate which file needs to be converted and have the servers do the work.
>
> I believe what you are saying is that I can have a directory using the replicated pool and another directory (or sub-directory) that uses the coding pool. The client would then copy the file from one directory to the other. The question becomes "Who does the erasure encoding?". The client (read back from the replica pool and write to the erasure pool) or the servers (copy data to the erasure pool and calculate on the servers)?

I was aiming more for "I can see where this could sit in the
architecture". You don't have to make the client effectively download
and re-upload a file; that logic can be pushed closer to the OSDs.

