All of lore.kernel.org
 help / color / mirror / Atom feed
* 2 related bluestore questions
@ 2016-05-09 18:31 Sage Weil
  2016-05-10 12:17 ` Igor Fedotov
  0 siblings, 1 reply; 28+ messages in thread
From: Sage Weil @ 2016-05-09 18:31 UTC (permalink / raw)
  To: ifedotov, allen.samuels; +Cc: ceph-devel

1. In 7fb649a3800a5653f5f7ddf942c53503f88ad3f1 I added an extent_ref_map_t 
to the blob_t.  This lets us keep track, for each blob, of references to 
the logical blob extents (in addition to the raw num_refs that just counts 
how many lextent_t's point to us).  It will let us make decisions about 
deallocating unused portions of the blob that are no longer referenced 
(e.g., when we are uncompressed).  It will also let us sanely reason 
about whether we can write into the blob's allocated space that is not 
referenced (e.g., past end of object/file, but within a min_alloc_size 
chunk).

The downside is that it's a bit more metadata to maintain.  OTOH, we need 
it in many cases, and it would be slow/tedious to create it on the fly.

I think yes, though some minor changes to the current extent_ref_map_t are 
needed, since it currently has weird assumptoins about empty meaning a ref 
count of 1.

2. Allow lextent_t's to be byte-granularity.

For example, if we write 10 bytes into the object, we'd have a blob of 
min_alloc_size, and an lextent_t that indicates [0,10) points to that 
blob.

The upside here is that truncate and zero are trivial updates to the 
lextent map and never need to do any IO--we just punch holes in our 
mapping.

The downside is that we might get odd mappings like

 0: 0~10->1
 4000: 4000~96->1

after a hole (10~3990) has been punched, and we may need to piece the 
mapping back together.  I think we will need most of this complexity 
(e.g., merging adjacent lextents that map to adjacent regions of the same 
blob) anyway.

Hmm, there is probably some other downside but now I can't think of a good 
reason not to do this.  It'll basically put all of the onus on the write 
code to do the right thing... which is probably a good thing.

Yes?


Also, one note on the WAL changes: we'll need to have any read portion of 
a wal event include the raw pextents *and* the associated checksum(s).  
This is because the events need to be idempotent and may overwrite the 
read region, or interact with wal ops that come before/after.

sage

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: 2 related bluestore questions
  2016-05-09 18:31 2 related bluestore questions Sage Weil
@ 2016-05-10 12:17 ` Igor Fedotov
  2016-05-10 12:53   ` Sage Weil
  0 siblings, 1 reply; 28+ messages in thread
From: Igor Fedotov @ 2016-05-10 12:17 UTC (permalink / raw)
  To: Sage Weil, allen.samuels; +Cc: ceph-devel

Hi Sage,
Please find my comments below.

WRT 1. there is an alternative approach that doesn't need persistent 
refmap. It works for non-shared bnode only though. In fact one can build 
such a refmap using onode's lextents map pretty easy. It looks like any 
procedure that requires such a refmap has a logical offset as an input. 
This provides an appropriate lextent referring to some blob we need 
refmap for. What we need to do for blob's refmap building is to 
enumerate lextents within +-max_blob_size range from the original 
loffset. I suppose we are going to avoid small lextent entries most of 
time by merging them thus such enumeration should be short enough. Most 
probably such refmap build is needed for background wal procedure (or 
its replacement - see below) thus it wouldn't affect primary write path 
performance. And this procedure will require some neighboring lxtent 
enumeration to detect  lextents to merge anyway.

Actually I don't have strong opinion which approach is better.  Just a 
minor point that tracking persistent refmap is a bit more complex and 
space consuming.

WRT to 2. IMO single  byte granularity is OK. Initial write request 
handling can create lextents of any size depending on the input data 
blocks. But we will try to eliminate it during wal processing to have 
larger extents and better space usage though.

WRT WAL changes. My idea is to replace WAL with a bit different extent 
merge (defragmenter, garbage collector, space optimizer - whatever name 
of your choice) process. The main difference - current WAL 
implementation tracks some user data and thus it's a part of the 
consistency model (i.e. one has to check if data block is in the WAL). 
In my approach data is always consistent without such a service. At the 
first write handling step we always write data to the store by 
allocating new blob and modifying lextent map. And apply corresponding 
checksum using regular means if needed. Thus we always have consistent 
data in lextent/blob structures. And defragmenter process is just a 
cleanup/optimization thread that merges sparse lextents to improve space 
utilization. To avoid full lextent map enumeration during 
defragmentation ExtentManager (or whatever entity that handles writes) 
may return some 'hints' where space optimization should be applied. This 
is to be done at initial write processing. Such hint is most probably 
just a logical offset or some interval within object logical space. 
Write handler provides such a hint if it detects (by lextent map 
inspection) that optimization is required, e.g. in case of partial 
lextent overwrite, big hole punch, sparse small lextents etc. Pending 
optimization tasks (list of hints) are maintained by the BlueStore and 
passed to EM (or another corresponding entity) for processing in the 
context of a specific thread. Based of such hints defragmenter locates 
lextents to merge and do the job: Read/Modify/Write multiple lextents 
and/or blobs. Optionally this can be done with with some delay to care 
write burst within a specific object region. Another point is that hint 
list can be potentially tracked without KV store (some in-memory data 
structure is enough) as there is no mandatory need for its replay in 
case of OSD failure - data are always consistent at the store and 
failure can lead to some local space ineffectiveness only. That's a rare 
case though.

What do you think about this approach?

Thanks,
Igor

On 09.05.2016 21:31, Sage Weil wrote:
> 1. In 7fb649a3800a5653f5f7ddf942c53503f88ad3f1 I added an extent_ref_map_t
> to the blob_t.  This lets us keep track, for each blob, of references to
> the logical blob extents (in addition to the raw num_refs that just counts
> how many lextent_t's point to us).  It will let us make decisions about
> deallocating unused portions of the blob that are no longer referenced
> (e.g., when we are uncompressed).  It will also let us sanely reason
> about whether we can write into the blob's allocated space that is not
> referenced (e.g., past end of object/file, but within a min_alloc_size
> chunk).
>
> The downside is that it's a bit more metadata to maintain.  OTOH, we need
> it in many cases, and it would be slow/tedious to create it on the fly.
>
> I think yes, though some minor changes to the current extent_ref_map_t are
> needed, since it currently has weird assumptoins about empty meaning a ref
> count of 1.
>
> 2. Allow lextent_t's to be byte-granularity.
>
> For example, if we write 10 bytes into the object, we'd have a blob of
> min_alloc_size, and an lextent_t that indicates [0,10) points to that
> blob.
>
> The upside here is that truncate and zero are trivial updates to the
> lextent map and never need to do any IO--we just punch holes in our
> mapping.
>
> The downside is that we might get odd mappings like
>
>   0: 0~10->1
>   4000: 4000~96->1
>
> after a hole (10~3990) has been punched, and we may need to piece the
> mapping back together.  I think we will need most of this complexity
> (e.g., merging adjacent lextents that map to adjacent regions of the same
> blob) anyway.
>
> Hmm, there is probably some other downside but now I can't think of a good
> reason not to do this.  It'll basically put all of the onus on the write
> code to do the right thing... which is probably a good thing.
>
> Yes?
>
>
> Also, one note on the WAL changes: we'll need to have any read portion of
> a wal event include the raw pextents *and* the associated checksum(s).
> This is because the events need to be idempotent and may overwrite the
> read region, or interact with wal ops that come before/after.
>
> sage


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: 2 related bluestore questions
  2016-05-10 12:17 ` Igor Fedotov
@ 2016-05-10 12:53   ` Sage Weil
  2016-05-10 14:41     ` Igor Fedotov
  0 siblings, 1 reply; 28+ messages in thread
From: Sage Weil @ 2016-05-10 12:53 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: allen.samuels, ceph-devel

On Tue, 10 May 2016, Igor Fedotov wrote:
> Hi Sage,
> Please find my comments below.
> 
> WRT 1. there is an alternative approach that doesn't need persistent refmap.
> It works for non-shared bnode only though. In fact one can build such a refmap
> using onode's lextents map pretty easy. It looks like any procedure that
> requires such a refmap has a logical offset as an input. This provides an
> appropriate lextent referring to some blob we need refmap for. What we need to
> do for blob's refmap building is to enumerate lextents within +-max_blob_size
> range from the original loffset. I suppose we are going to avoid small lextent
> entries most of time by merging them thus such enumeration should be short
> enough. Most probably such refmap build is needed for background wal procedure
> (or its replacement - see below) thus it wouldn't affect primary write path
> performance. And this procedure will require some neighboring lxtent
> enumeration to detect  lextents to merge anyway.
> 
> Actually I don't have strong opinion which approach is better.  Just a minor
> point that tracking persistent refmap is a bit more complex and space
> consuming.

Yeah, that's my only real concern--and mostly on the memory allocation 
side, less so on the size of the encoded metadata.  Since the alternative 
only works in the non-shared bnode case, I think it'll be simpler to only 
implement one approach for now, and consider optimizing later, since we'd 
have to implement to share-capable approach either way.  (For example, 
most blobs will have one reference for their full range; we could probably 
represent this as an empty map with a bit of care.)

> WRT to 2. IMO single  byte granularity is OK. Initial write request handling
> can create lextents of any size depending on the input data blocks. But we
> will try to eliminate it during wal processing to have larger extents and
> better space usage though.

Ok cool.
 
> WRT WAL changes. My idea is to replace WAL with a bit different extent merge
> (defragmenter, garbage collector, space optimizer - whatever name of your
> choice) process. The main difference - current WAL implementation tracks some
> user data and thus it's a part of the consistency model (i.e. one has to check
> if data block is in the WAL). In my approach data is always consistent without
> such a service. At the first write handling step we always write data to the
> store by allocating new blob and modifying lextent map. And apply
> corresponding checksum using regular means if needed. Thus we always have
> consistent data in lextent/blob structures. And defragmenter process is just a
> cleanup/optimization thread that merges sparse lextents to improve space
> utilization. To avoid full lextent map enumeration during defragmentation
> ExtentManager (or whatever entity that handles writes) may return some 'hints'
> where space optimization should be applied. This is to be done at initial
> write processing. Such hint is most probably just a logical offset or some
> interval within object logical space. Write handler provides such a hint if it
> detects (by lextent map inspection) that optimization is required, e.g. in
> case of partial lextent overwrite, big hole punch, sparse small lextents etc.
> Pending optimization tasks (list of hints) are maintained by the BlueStore and
> passed to EM (or another corresponding entity) for processing in the context
> of a specific thread. Based of such hints defragmenter locates lextents to
> merge and do the job: Read/Modify/Write multiple lextents and/or blobs.
> Optionally this can be done with with some delay to care write burst within a
> specific object region. Another point is that hint list can be potentially
> tracked without KV store (some in-memory data structure is enough) as there is
> no mandatory need for its replay in case of OSD failure - data are always
> consistent at the store and failure can lead to some local space
> ineffectiveness only. That's a rare case though.
> 
> What do you think about this approach?

My concern is that it makes a simple overwrite less IO efficient because 
you have to (1) write a new (temporary-ish) blob, (2) commit the kv 
transaction, and then (3) write an updated/merged blob, then (4) commit 
the kv txn for new blob.  And if I understand the proposal correctly any 
overwrite is still off-limits because you can't to the overwrite IO 
atomically with the kv commit.  Is that right?

Making the wal part of the consistency model is more complex, but it means 
we can (1) log our intent to overwrite atomically with the kv txn commit, 
and then (2) do the async overwrite.  It will get a bit more complex 
because we'll be doing metadata updates as part of the wal completion, but 
it's not a big step from where we are now, and I think the performance 
benefit will be worth it.

I think we'll still want a gc/cleanup/optimizer async process like you 
describe, but it can be driven by wal hints or whatever other mechanism we 
like.

sage

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: 2 related bluestore questions
  2016-05-10 12:53   ` Sage Weil
@ 2016-05-10 14:41     ` Igor Fedotov
  2016-05-10 15:39       ` Sage Weil
  0 siblings, 1 reply; 28+ messages in thread
From: Igor Fedotov @ 2016-05-10 14:41 UTC (permalink / raw)
  To: Sage Weil; +Cc: allen.samuels, ceph-devel


On 10.05.2016 15:53, Sage Weil wrote:
> On Tue, 10 May 2016, Igor Fedotov wrote:
>> Hi Sage,
>> Please find my comments below.
>>
>> WRT 1. there is an alternative approach that doesn't need persistent refmap.
>> It works for non-shared bnode only though. In fact one can build such a refmap
>> using onode's lextents map pretty easy. It looks like any procedure that
>> requires such a refmap has a logical offset as an input. This provides an
>> appropriate lextent referring to some blob we need refmap for. What we need to
>> do for blob's refmap building is to enumerate lextents within +-max_blob_size
>> range from the original loffset. I suppose we are going to avoid small lextent
>> entries most of time by merging them thus such enumeration should be short
>> enough. Most probably such refmap build is needed for background wal procedure
>> (or its replacement - see below) thus it wouldn't affect primary write path
>> performance. And this procedure will require some neighboring lxtent
>> enumeration to detect  lextents to merge anyway.
>>
>> Actually I don't have strong opinion which approach is better.  Just a minor
>> point that tracking persistent refmap is a bit more complex and space
>> consuming.
> Yeah, that's my only real concern--and mostly on the memory allocation
> side, less so on the size of the encoded metadata.  Since the alternative
> only works in the non-shared bnode case, I think it'll be simpler to only
> implement one approach for now, and consider optimizing later, since we'd
> have to implement to share-capable approach either way.  (For example,
> most blobs will have one reference for their full range; we could probably
> represent this as an empty map with a bit of care.)
So the initial approach is to have refmap, right?
>> WRT to 2. IMO single  byte granularity is OK. Initial write request handling
>> can create lextents of any size depending on the input data blocks. But we
>> will try to eliminate it during wal processing to have larger extents and
>> better space usage though.
> Ok cool.
>   
>> WRT WAL changes. My idea is to replace WAL with a bit different extent merge
>> (defragmenter, garbage collector, space optimizer - whatever name of your
>> choice) process. The main difference - current WAL implementation tracks some
>> user data and thus it's a part of the consistency model (i.e. one has to check
>> if data block is in the WAL). In my approach data is always consistent without
>> such a service. At the first write handling step we always write data to the
>> store by allocating new blob and modifying lextent map. And apply
>> corresponding checksum using regular means if needed. Thus we always have
>> consistent data in lextent/blob structures. And defragmenter process is just a
>> cleanup/optimization thread that merges sparse lextents to improve space
>> utilization. To avoid full lextent map enumeration during defragmentation
>> ExtentManager (or whatever entity that handles writes) may return some 'hints'
>> where space optimization should be applied. This is to be done at initial
>> write processing. Such hint is most probably just a logical offset or some
>> interval within object logical space. Write handler provides such a hint if it
>> detects (by lextent map inspection) that optimization is required, e.g. in
>> case of partial lextent overwrite, big hole punch, sparse small lextents etc.
>> Pending optimization tasks (list of hints) are maintained by the BlueStore and
>> passed to EM (or another corresponding entity) for processing in the context
>> of a specific thread. Based of such hints defragmenter locates lextents to
>> merge and do the job: Read/Modify/Write multiple lextents and/or blobs.
>> Optionally this can be done with with some delay to care write burst within a
>> specific object region. Another point is that hint list can be potentially
>> tracked without KV store (some in-memory data structure is enough) as there is
>> no mandatory need for its replay in case of OSD failure - data are always
>> consistent at the store and failure can lead to some local space
>> ineffectiveness only. That's a rare case though.
>>
>> What do you think about this approach?
> My concern is that it makes a simple overwrite less IO efficient because
> you have to (1) write a new (temporary-ish) blob, (2) commit the kv
> transaction, and then (3) write an updated/merged blob, then (4) commit
> the kv txn for new blob.
Yes, that's true. But there are some concerns about WAL case as well:
1) Are you sure that writing larger KV record ( metadata + user data ) 
is better than direct data write to the store + smaller KV (metadata 
only) update?

2) Either WAL records will increase or we need to have both WAL and 
optimizer simultaneously. Especially for compressed case. As far as I 
understand currently WAL record has up to block_size bytes of user data. 
With blob introduction this raises up to max_blob_size ( 
N*min_alloc_size). Or we'll need to maintain both WAL and optimizer
E.g. there is an lextent 0~256K and overwrite 1K~254K, block size = 4K
For no checksum and compression case WAL records are 2 * 3K
For checksum case WAL records are 2 * (max( csum_block_size, block_size) 
- 1K)
For compression case WAL records are 2 * (max( max_blob_size, 
block_size) - 1K) or do that temporary blob allocation.

3) WAL apply locks the subsequent read until its completion. I.e. 
subsequent read has to wait until WAL apply is completed ( o->flush() 
call in _do_read()). In case of optimizer approach lock can be postponed 
as optimizer doesn't need to perform the task immediately.

> And if I understand the proposal correctly any
> overwrite is still off-limits because you can't to the overwrite IO
> atomically with the kv commit.  Is that right?
Could you please elaborate - not sure I understand the question.
> Making the wal part of the consistency model is more complex, but it means
> we can (1) log our intent to overwrite atomically with the kv txn commit,
> and then (2) do the async overwrite.  It will get a bit more complex
> because we'll be doing metadata updates as part of the wal completion, but
> it's not a big step from where we are now, and I think the performance
> benefit will be worth it.
May I have some example how it's supposed to work please?

> I think we'll still want a gc/cleanup/optimizer async process like you
> describe, but it can be driven by wal hints or whatever other mechanism we
> like.
>
> sage


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: 2 related bluestore questions
  2016-05-10 14:41     ` Igor Fedotov
@ 2016-05-10 15:39       ` Sage Weil
  2016-05-11  1:10         ` Sage Weil
  0 siblings, 1 reply; 28+ messages in thread
From: Sage Weil @ 2016-05-10 15:39 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: allen.samuels, ceph-devel

On Tue, 10 May 2016, Igor Fedotov wrote:
> On 10.05.2016 15:53, Sage Weil wrote:
> > On Tue, 10 May 2016, Igor Fedotov wrote:
> > > Hi Sage,
> > > Please find my comments below.
> > > 
> > > WRT 1. there is an alternative approach that doesn't need persistent
> > > refmap.
> > > It works for non-shared bnode only though. In fact one can build such a
> > > refmap
> > > using onode's lextents map pretty easy. It looks like any procedure that
> > > requires such a refmap has a logical offset as an input. This provides an
> > > appropriate lextent referring to some blob we need refmap for. What we
> > > need to
> > > do for blob's refmap building is to enumerate lextents within
> > > +-max_blob_size
> > > range from the original loffset. I suppose we are going to avoid small
> > > lextent
> > > entries most of time by merging them thus such enumeration should be short
> > > enough. Most probably such refmap build is needed for background wal
> > > procedure
> > > (or its replacement - see below) thus it wouldn't affect primary write
> > > path
> > > performance. And this procedure will require some neighboring lxtent
> > > enumeration to detect  lextents to merge anyway.
> > > 
> > > Actually I don't have strong opinion which approach is better.  Just a
> > > minor
> > > point that tracking persistent refmap is a bit more complex and space
> > > consuming.
> > Yeah, that's my only real concern--and mostly on the memory allocation
> > side, less so on the size of the encoded metadata.  Since the alternative
> > only works in the non-shared bnode case, I think it'll be simpler to only
> > implement one approach for now, and consider optimizing later, since we'd
> > have to implement to share-capable approach either way.  (For example,
> > most blobs will have one reference for their full range; we could probably
> > represent this as an empty map with a bit of care.)
> So the initial approach is to have refmap, right?

Yeah.

Also, yesterday while writing some helpers I realized it will actually 
simplify things greatly to rely on the ref_map only and get rid of 
num_refs (which is redundant anyway)... I think I'll do that too.

> > > WRT WAL changes. My idea is to replace WAL with a bit different 
> > > extent merge (defragmenter, garbage collector, space optimizer - 
> > > whatever name of your choice) process. The main difference - current 
> > > WAL implementation tracks some user data and thus it's a part of the 
> > > consistency model (i.e. one has to check if data block is in the 
> > > WAL). In my approach data is always consistent without such a 
> > > service. At the first write handling step we always write data to 
> > > the store by allocating new blob and modifying lextent map. And 
> > > apply corresponding checksum using regular means if needed. Thus we 
> > > always have consistent data in lextent/blob structures. And 
> > > defragmenter process is just a cleanup/optimization thread that 
> > > merges sparse lextents to improve space utilization. To avoid full 
> > > lextent map enumeration during defragmentation ExtentManager (or 
> > > whatever entity that handles writes) may return some 'hints' where 
> > > space optimization should be applied. This is to be done at initial 
> > > write processing. Such hint is most probably just a logical offset 
> > > or some interval within object logical space. Write handler provides 
> > > such a hint if it detects (by lextent map inspection) that 
> > > optimization is required, e.g. in case of partial lextent overwrite, 
> > > big hole punch, sparse small lextents etc. Pending optimization 
> > > tasks (list of hints) are maintained by the BlueStore and passed to 
> > > EM (or another corresponding entity) for processing in the context 
> > > of a specific thread. Based of such hints defragmenter locates 
> > > lextents to merge and do the job: Read/Modify/Write multiple 
> > > lextents and/or blobs. Optionally this can be done with with some 
> > > delay to care write burst within a specific object region. Another 
> > > point is that hint list can be potentially tracked without KV store 
> > > (some in-memory data structure is enough) as there is no mandatory 
> > > need for its replay in case of OSD failure - data are always 
> > > consistent at the store and failure can lead to some local space 
> > > ineffectiveness only. That's a rare case though.
> > > 
> > > What do you think about this approach?
> > My concern is that it makes a simple overwrite less IO efficient because
> > you have to (1) write a new (temporary-ish) blob, (2) commit the kv
> > transaction, and then (3) write an updated/merged blob, then (4) commit
> > the kv txn for new blob.
> Yes, that's true. But there are some concerns about WAL case as well:
> 1) Are you sure that writing larger KV record ( metadata + user data ) is
> better than direct data write to the store + smaller KV (metadata only)
> update?

Only sometimes.  Normally this is what the min_alloc_size knob controls.. 
whether to wal + overwrite or do a new allocation and cow/fragment.

> 2) Either WAL records will increase or we need to have both WAL and optimizer
> simultaneously. Especially for compressed case. As far as I understand
> currently WAL record has up to block_size bytes of user data. 

It might be up to min_alloc_size, actually

> With blob
> introduction this raises up to max_blob_size ( N*min_alloc_size). Or we'll
> need to maintain both WAL and optimizer
> E.g. there is an lextent 0~256K and overwrite 1K~254K, block size = 4K
> For no checksum and compression case WAL records are 2 * 3K
> For checksum case WAL records are 2 * (max( csum_block_size, block_size) - 1K)
> For compression case WAL records are 2 * (max( max_blob_size, block_size) -
> 1K) or do that temporary blob allocation.

Yeah.  I think this becomes a matter of "policy" in the write code.  If we 
make the wal machinery generic enough to do this, then we can decide 
whether to 

 1) do a sync read, write to a new blob, then commit
 2) commit a wal event, do async read/modify/write
 3) write to a new blob with byte-granular lex map, commit.  maybe clean 
up later.

The ondisk format, wal machinery, and read path wouldn't care which choice 
we made on write.

> 3) WAL apply locks the subsequent read until its completion. I.e. subsequent
> read has to wait until WAL apply is completed ( o->flush() call in
> _do_read()). In case of optimizer approach lock can be postponed as optimizer
> doesn't need to perform the task immediately.

In practice the wal flush is almost never hit because we rare read after 
write.  We can also optimize this later when we add a buffer cache.

> > And if I understand the proposal correctly any
> > overwrite is still off-limits because you can't to the overwrite IO
> > atomically with the kv commit.  Is that right?
> Could you please elaborate - not sure I understand the question.

Let's say there's no compression or checksum, and I have a 4K overwrite 
inside a large blob.  Without WAL, the best we can do is write a 
minimally-sized new blob elsewhere (read 64K-4K, write 64K to a new 
allocation, commit with lextent refs to new blob).  There's no way to 
simply overwrite 4K within the existing allocation because if we do the 
overwrite and crash we have corrupted old object state.  This is the main 
problem the wal is intended so resolve--the ability to 'atomically' 
overwrite data as part of a transaction.

> > Making the wal part of the consistency model is more complex, but it means
> > we can (1) log our intent to overwrite atomically with the kv txn commit,
> > and then (2) do the async overwrite.  It will get a bit more complex
> > because we'll be doing metadata updates as part of the wal completion, but
> > it's not a big step from where we are now, and I think the performance
> > benefit will be worth it.
> May I have some example how it's supposed to work please?

At a high level,

1- prepare kv commit.  it includes a wal_op_t that describes a 
read/modify/write of a csum block within some existing blob.
2- commit the kv txn
3- do the wal read/modify/write
4- calculate new csum
5- in the wal 'cleanup' transaction (which removes the wal event), 
also update blob csum_data in the onode or bnode.

In practice, I think we want to pass TransContext down from _wal_apply 
into _do_wal_op, and put modified onode or bnode in the txc dirty list.  
(We'll need to clear it after the current _txc_finalize call so that it 
doesn't have the first phase's dirty stuff still there.)  Then, in 
_kv_sync_thread, where the

      // cleanup sync wal keys

stuff is, we probably want to have a helper that captures the wal event 
removal *and* any other stuff we need to do.. like update onodes and 
bnodes.  It'll look similar to _txc_finalize, I think.

Then the main piece is how to modify the bluestore_wal_op_t to describe 
which blob metadata we're modifying and how to do the 
whole read/modify/write operation.  I think

 - we need to bundle the csum data for anything we read.  we can probably 
just put a blob_t in here, since it includes the extents and csum metadata 
all together.
 - we need to describe where the blob exists (which onode or bnode owns 
it, and what its id is) so that do_wal_op can find it and update it.
   * we might want to optimize the normal path so that we can use the 
in-memory copy without doing a lookup

It probably means a mostly rewritten wal_op_t type.  I think the ops we 
need to capture are

 - overwrite
 - read / modify / write (e.g., partial blocks)
 - read / modify / write (and update csum metadata)
 - read / modify / compress / write (and update csum metadata)
 - read / write elsewhere (e.g., the current copy op, used for cow)

Since compression is thrown in there, we probably need to be able to 
allocate in the do_wal_op path too.  I think that'll be okay... it's 
making the wal finalize kv look more and more like the current 
txc_finalize.  That probably means if we're careful we can use the same 
code for both?

sage

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: 2 related bluestore questions
  2016-05-10 15:39       ` Sage Weil
@ 2016-05-11  1:10         ` Sage Weil
  2016-05-11 12:11           ` Igor Fedotov
  2016-05-11 12:39           ` Igor Fedotov
  0 siblings, 2 replies; 28+ messages in thread
From: Sage Weil @ 2016-05-11  1:10 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: allen.samuels, ceph-devel

On Tue, 10 May 2016, Sage Weil wrote:
> > > Making the wal part of the consistency model is more complex, but it means
> > > we can (1) log our intent to overwrite atomically with the kv txn commit,
> > > and then (2) do the async overwrite.  It will get a bit more complex
> > > because we'll be doing metadata updates as part of the wal completion, but
> > > it's not a big step from where we are now, and I think the performance
> > > benefit will be worth it.
> > May I have some example how it's supposed to work please?
> 
> At a high level,
> 
> 1- prepare kv commit.  it includes a wal_op_t that describes a 
> read/modify/write of a csum block within some existing blob.
> 2- commit the kv txn
> 3- do the wal read/modify/write
> 4- calculate new csum
> 5- in the wal 'cleanup' transaction (which removes the wal event), 
> also update blob csum_data in the onode or bnode.
> 
> In practice, I think we want to pass TransContext down from _wal_apply 
> into _do_wal_op, and put modified onode or bnode in the txc dirty list.  
> (We'll need to clear it after the current _txc_finalize call so that it 
> doesn't have the first phase's dirty stuff still there.)  Then, in 
> _kv_sync_thread, where the
> 
>       // cleanup sync wal keys
> 
> stuff is, we probably want to have a helper that captures the wal event 
> removal *and* any other stuff we need to do.. like update onodes and 
> bnodes.  It'll look similar to _txc_finalize, I think.
> 
> Then the main piece is how to modify the bluestore_wal_op_t to describe 
> which blob metadata we're modifying and how to do the 
> whole read/modify/write operation.  I think
> 
>  - we need to bundle the csum data for anything we read.  we can probably 
> just put a blob_t in here, since it includes the extents and csum metadata 
> all together.
>  - we need to describe where the blob exists (which onode or bnode owns 
> it, and what its id is) so that do_wal_op can find it and update it.
>    * we might want to optimize the normal path so that we can use the 
> in-memory copy without doing a lookup
> 
> It probably means a mostly rewritten wal_op_t type.  I think the ops we 
> need to capture are
> 
>  - overwrite
>  - read / modify / write (e.g., partial blocks)
>  - read / modify / write (and update csum metadata)
>  - read / modify / compress / write (and update csum metadata)
>  - read / write elsewhere (e.g., the current copy op, used for cow)
> 
> Since compression is thrown in there, we probably need to be able to 
> allocate in the do_wal_op path too.  I think that'll be okay... it's 
> making the wal finalize kv look more and more like the current 
> txc_finalize.  That probably means if we're careful we can use the same 
> code for both?


I took a stab at a revised wal_op_t here:

	https://github.com/liewegas/ceph/blob/wip-bluestore-write/src/os/bluestore/bluestore_types.h#L595-L605

This is enough to implement the basic wal overwrite case here:

	https://github.com/liewegas/ceph/blob/wip-bluestore-write/src/os/bluestore/BlueStore.cc#L5522-L5578

It's overkill for that, but something like this ought to be sufficiently 
general to express the more complicated wal (and compaction/gc/cleanup) 
operations, where we are reading bits of data from lots of different 
previous blobs, verifying checksums, and then assembling the results into 
a new buffer that gets written somewhere else.  The read_extent_map and 
write_map offsets are logical offsets in a buffer we assemble and then 
write to b_off~b_len in the specific blob.  I didn't get to the _do_wal_op 
part that actually does it, but it would do the final write, csum 
calculation, and metadata update.  Probably... the allocation would happen 
then too, if the specified blob didn't already have pextents.  Tha way 
we can do compression at that stage as well?

What do you think?

sage


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: 2 related bluestore questions
  2016-05-11  1:10         ` Sage Weil
@ 2016-05-11 12:11           ` Igor Fedotov
  2016-05-11 13:10             ` Sage Weil
  2016-05-11 12:39           ` Igor Fedotov
  1 sibling, 1 reply; 28+ messages in thread
From: Igor Fedotov @ 2016-05-11 12:11 UTC (permalink / raw)
  To: Sage Weil; +Cc: allen.samuels, ceph-devel



On 11.05.2016 4:10, Sage Weil wrote:
> On Tue, 10 May 2016, Sage Weil wrote:
>>>> Making the wal part of the consistency model is more complex, but it means
>>>> we can (1) log our intent to overwrite atomically with the kv txn commit,
>>>> and then (2) do the async overwrite.  It will get a bit more complex
>>>> because we'll be doing metadata updates as part of the wal completion, but
>>>> it's not a big step from where we are now, and I think the performance
>>>> benefit will be worth it.
>>> May I have some example how it's supposed to work please?
>> At a high level,
>>
>> 1- prepare kv commit.  it includes a wal_op_t that describes a
>> read/modify/write of a csum block within some existing blob.
>> 2- commit the kv txn
>> 3- do the wal read/modify/write
>> 4- calculate new csum
>> 5- in the wal 'cleanup' transaction (which removes the wal event),
>> also update blob csum_data in the onode or bnode.
>>
>> In practice, I think we want to pass TransContext down from _wal_apply
>> into _do_wal_op, and put modified onode or bnode in the txc dirty list.
>> (We'll need to clear it after the current _txc_finalize call so that it
>> doesn't have the first phase's dirty stuff still there.)  Then, in
>> _kv_sync_thread, where the
>>
>>        // cleanup sync wal keys
>>
>> stuff is, we probably want to have a helper that captures the wal event
>> removal *and* any other stuff we need to do.. like update onodes and
>> bnodes.  It'll look similar to _txc_finalize, I think.
>>
>> Then the main piece is how to modify the bluestore_wal_op_t to describe
>> which blob metadata we're modifying and how to do the
>> whole read/modify/write operation.  I think
>>
>>   - we need to bundle the csum data for anything we read.  we can probably
>> just put a blob_t in here, since it includes the extents and csum metadata
>> all together.
>>   - we need to describe where the blob exists (which onode or bnode owns
>> it, and what its id is) so that do_wal_op can find it and update it.
>>     * we might want to optimize the normal path so that we can use the
>> in-memory copy without doing a lookup
>>
>> It probably means a mostly rewritten wal_op_t type.  I think the ops we
>> need to capture are
>>
>>   - overwrite
>>   - read / modify / write (e.g., partial blocks)
>>   - read / modify / write (and update csum metadata)
>>   - read / modify / compress / write (and update csum metadata)
>>   - read / write elsewhere (e.g., the current copy op, used for cow)
>>
>> Since compression is thrown in there, we probably need to be able to
>> allocate in the do_wal_op path too.  I think that'll be okay... it's
>> making the wal finalize kv look more and more like the current
>> txc_finalize.  That probably means if we're careful we can use the same
>> code for both?
>
> I took a stab at a revised wal_op_t here:
>
> 	https://github.com/liewegas/ceph/blob/wip-bluestore-write/src/os/bluestore/bluestore_types.h#L595-L605
>
> This is enough to implement the basic wal overwrite case here:
>
> 	https://github.com/liewegas/ceph/blob/wip-bluestore-write/src/os/bluestore/BlueStore.cc#L5522-L5578
>
> It's overkill for that, but something like this ought to be sufficiently
> general to express the more complicated wal (and compaction/gc/cleanup)
> operations, where we are reading bits of data from lots of different
> previous blobs, verifying checksums, and then assembling the results into
> a new buffer that gets written somewhere else.  The read_extent_map and
> write_map offsets are logical offsets in a buffer we assemble and then
> write to b_off~b_len in the specific blob.  I didn't get to the _do_wal_op
> part that actually does it, but it would do the final write, csum
> calculation, and metadata update.  Probably... the allocation would happen
> then too, if the specified blob didn't already have pextents.  Tha way
> we can do compression at that stage as well?
>
> What do you think?
Not completely sure that it's a good idea to have read stage description 
stored in WAL record? Wouldn't that produce any 
conflicts/inconsistencies when multiple WAL records deal with the same 
or close lextents and previous WAL updates lextents to read. May be it's 
better to prepare such a description exactly when WAL is applied? And 
WAL record to have just a basic write info?

And for GC/cleanup process this becomes even more important as the task 
may be deferred for a while and lextent map may be significantly altered.

And I suppose blob id should be 2 not 1 here:
https://github.com/liewegas/ceph/blob/wip-bluestore-write/src/os/bluestore/BlueStore.cc#L5545

> sage
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: 2 related bluestore questions
  2016-05-11  1:10         ` Sage Weil
  2016-05-11 12:11           ` Igor Fedotov
@ 2016-05-11 12:39           ` Igor Fedotov
  2016-05-11 14:35             ` Sage Weil
  1 sibling, 1 reply; 28+ messages in thread
From: Igor Fedotov @ 2016-05-11 12:39 UTC (permalink / raw)
  To: Sage Weil; +Cc: allen.samuels, ceph-devel

Sage,

thanks a lot for detailed overview.

Below is a brief proposal on Extent Manager extension to support WAL.
Please tell me what's your final decision its fortune? Are there any 
plans to use it or we completely rebase to the code you are woring on. 
Is there any sense for me to to continue working on it or I should 
switch to something in your stream.

Here is the idea that's pretty simple:
1) Add new method(s)  to BlockOpInterface to add WAL record. EX to call 
it when corresponding overwrite (or whatever other case) is detected.
2) Add method(s) to ExtentManger to process WAL record. They are to be 
called from _wal_apply/_do_wal_op and EM to perform corresponding RMW 
tasks. It does that using other regular BlockOpInterface methods ( 
read/write/allocate etc) thus Bluestore environment should be ready to 
handle them during the WAL phase as it does during the initial write 
hanlding one. But it looks like we need that anyway - EM doesn't create 
any additional complexities here.

And a couple of general notes on WAL apply handling that I realized 
recently and we should probably care of.
1) when WAL is applied on compressed blobs one should always allocate 
and use a new blob. Even for the case when resulting buffer fit into the 
existing blob. The rationale for that is that if we crash after the 
write phase  we and before KV update we can't properly execute WAL 
replay. Source data is altered and we are unable to repeat RMW.
2) Similarly during WAL replay after the failure one should disable 
checksum verification for existing data read if any. The rationale is 
the same - we can crash after the write and before KV update thus 
previous checksum in KV is invalid.

Thanks,
Igor.

On 11.05.2016 4:10, Sage Weil wrote:
> On Tue, 10 May 2016, Sage Weil wrote:
>>>> Making the wal part of the consistency model is more complex, but it means
>>>> we can (1) log our intent to overwrite atomically with the kv txn commit,
>>>> and then (2) do the async overwrite.  It will get a bit more complex
>>>> because we'll be doing metadata updates as part of the wal completion, but
>>>> it's not a big step from where we are now, and I think the performance
>>>> benefit will be worth it.
>>> May I have some example how it's supposed to work please?
>> At a high level,
>>
>> 1- prepare kv commit.  it includes a wal_op_t that describes a
>> read/modify/write of a csum block within some existing blob.
>> 2- commit the kv txn
>> 3- do the wal read/modify/write
>> 4- calculate new csum
>> 5- in the wal 'cleanup' transaction (which removes the wal event),
>> also update blob csum_data in the onode or bnode.
>>
>> In practice, I think we want to pass TransContext down from _wal_apply
>> into _do_wal_op, and put modified onode or bnode in the txc dirty list.
>> (We'll need to clear it after the current _txc_finalize call so that it
>> doesn't have the first phase's dirty stuff still there.)  Then, in
>> _kv_sync_thread, where the
>>
>>        // cleanup sync wal keys
>>
>> stuff is, we probably want to have a helper that captures the wal event
>> removal *and* any other stuff we need to do.. like update onodes and
>> bnodes.  It'll look similar to _txc_finalize, I think.
>>
>> Then the main piece is how to modify the bluestore_wal_op_t to describe
>> which blob metadata we're modifying and how to do the
>> whole read/modify/write operation.  I think
>>
>>   - we need to bundle the csum data for anything we read.  we can probably
>> just put a blob_t in here, since it includes the extents and csum metadata
>> all together.
>>   - we need to describe where the blob exists (which onode or bnode owns
>> it, and what its id is) so that do_wal_op can find it and update it.
>>     * we might want to optimize the normal path so that we can use the
>> in-memory copy without doing a lookup
>>
>> It probably means a mostly rewritten wal_op_t type.  I think the ops we
>> need to capture are
>>
>>   - overwrite
>>   - read / modify / write (e.g., partial blocks)
>>   - read / modify / write (and update csum metadata)
>>   - read / modify / compress / write (and update csum metadata)
>>   - read / write elsewhere (e.g., the current copy op, used for cow)
>>
>> Since compression is thrown in there, we probably need to be able to
>> allocate in the do_wal_op path too.  I think that'll be okay... it's
>> making the wal finalize kv look more and more like the current
>> txc_finalize.  That probably means if we're careful we can use the same
>> code for both?
>
> I took a stab at a revised wal_op_t here:
>
> 	https://github.com/liewegas/ceph/blob/wip-bluestore-write/src/os/bluestore/bluestore_types.h#L595-L605
>
> This is enough to implement the basic wal overwrite case here:
>
> 	https://github.com/liewegas/ceph/blob/wip-bluestore-write/src/os/bluestore/BlueStore.cc#L5522-L5578
>
> It's overkill for that, but something like this ought to be sufficiently
> general to express the more complicated wal (and compaction/gc/cleanup)
> operations, where we are reading bits of data from lots of different
> previous blobs, verifying checksums, and then assembling the results into
> a new buffer that gets written somewhere else.  The read_extent_map and
> write_map offsets are logical offsets in a buffer we assemble and then
> write to b_off~b_len in the specific blob.  I didn't get to the _do_wal_op
> part that actually does it, but it would do the final write, csum
> calculation, and metadata update.  Probably... the allocation would happen
> then too, if the specified blob didn't already have pextents.  Tha way
> we can do compression at that stage as well?
>
> What do you think?
>
> sage
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: 2 related bluestore questions
  2016-05-11 12:11           ` Igor Fedotov
@ 2016-05-11 13:10             ` Sage Weil
  2016-05-11 13:45               ` Igor Fedotov
  0 siblings, 1 reply; 28+ messages in thread
From: Sage Weil @ 2016-05-11 13:10 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: allen.samuels, ceph-devel

On Wed, 11 May 2016, Igor Fedotov wrote:
> > I took a stab at a revised wal_op_t here:
> > 
> > 	https://github.com/liewegas/ceph/blob/wip-bluestore-write/src/os/bluestore/bluestore_types.h#L595-L605
> > 
> > This is enough to implement the basic wal overwrite case here:
> > 
> > 	https://github.com/liewegas/ceph/blob/wip-bluestore-write/src/os/bluestore/BlueStore.cc#L5522-L5578
> > 
> > It's overkill for that, but something like this ought to be sufficiently
> > general to express the more complicated wal (and compaction/gc/cleanup)
> > operations, where we are reading bits of data from lots of different
> > previous blobs, verifying checksums, and then assembling the results into
> > a new buffer that gets written somewhere else.  The read_extent_map and
> > write_map offsets are logical offsets in a buffer we assemble and then
> > write to b_off~b_len in the specific blob.  I didn't get to the _do_wal_op
> > part that actually does it, but it would do the final write, csum
> > calculation, and metadata update.  Probably... the allocation would happen
> > then too, if the specified blob didn't already have pextents.  Tha way
> > we can do compression at that stage as well?
> > 
> > What do you think?
> Not completely sure that it's a good idea to have read stage description
> stored in WAL record? Wouldn't that produce any conflicts/inconsistencies when
> multiple WAL records deal with the same or close lextents and previous WAL
> updates lextents to read. May be it's better to prepare such a description
> exactly when WAL is applied? And WAL record to have just a basic write info?

Yeah, I think this is a problem.  I see two basic paths:

 - We do a wal flush before queueing a new wal event to avoid races like 
this. Or perhaps we only do it when the wal event(s) touch the same 
blob(s).  That's simple to reason about, but means that a series 
of small IOs to the same object (or blob) will serialize the kv commit and 
wal r/m/w operations.  (Note that this is no worse than the naive approach 
of doing the read part up front, and it only happens when you have 
successive wal ops on the same object (or blob)).

 - We describe the wal read-side in terms of the current onode state.  For 
example, 'read object offset 0..100, use provided buffer for 100..4096, 
overwrite block'.  That can be pipelined.  But there are other 
operations that would require we flush the wal events, like a truncate or 
zero or other write that clobbers that region of the object.  
Maybe/hopefully in those cases we don't care (it no longer matters that 
this wal event do the write we originally intended) but we'd need 
to think pretty carefully about it.  FWIW, truncate already does an 
o->flush().

> And for GC/cleanup process this becomes even more important as the task 
> may be deferred for a while and lextent map may be significantly 
> altered.

I get the feeling that the GC process can either (1) write new blobs in 
new locations and do an atomic transition, without interacting with the 
wal events at all, or (2) we just do the work once we committed to it. I 
think the potential benefit of committing to do wal work and then changing 
our mind is pretty small.

> And I suppose blob id should be 2 not 1 here:
> https://github.com/liewegas/ceph/blob/wip-bluestore-write/src/os/bluestore/BlueStore.cc#L5545

Ah, yes, thanks!
sage

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: 2 related bluestore questions
  2016-05-11 13:10             ` Sage Weil
@ 2016-05-11 13:45               ` Igor Fedotov
  2016-05-11 13:57                 ` Sage Weil
  0 siblings, 1 reply; 28+ messages in thread
From: Igor Fedotov @ 2016-05-11 13:45 UTC (permalink / raw)
  To: Sage Weil; +Cc: allen.samuels, ceph-devel


On 11.05.2016 16:10, Sage Weil wrote:
> On Wed, 11 May 2016, Igor Fedotov wrote:
>>> I took a stab at a revised wal_op_t here:
>>>
>>> 	https://github.com/liewegas/ceph/blob/wip-bluestore-write/src/os/bluestore/bluestore_types.h#L595-L605
>>>
>>> This is enough to implement the basic wal overwrite case here:
>>>
>>> 	https://github.com/liewegas/ceph/blob/wip-bluestore-write/src/os/bluestore/BlueStore.cc#L5522-L5578
>>>
>>> It's overkill for that, but something like this ought to be sufficiently
>>> general to express the more complicated wal (and compaction/gc/cleanup)
>>> operations, where we are reading bits of data from lots of different
>>> previous blobs, verifying checksums, and then assembling the results into
>>> a new buffer that gets written somewhere else.  The read_extent_map and
>>> write_map offsets are logical offsets in a buffer we assemble and then
>>> write to b_off~b_len in the specific blob.  I didn't get to the _do_wal_op
>>> part that actually does it, but it would do the final write, csum
>>> calculation, and metadata update.  Probably... the allocation would happen
>>> then too, if the specified blob didn't already have pextents.  Tha way
>>> we can do compression at that stage as well?
>>>
>>> What do you think?
>> Not completely sure that it's a good idea to have read stage description
>> stored in WAL record? Wouldn't that produce any conflicts/inconsistencies when
>> multiple WAL records deal with the same or close lextents and previous WAL
>> updates lextents to read. May be it's better to prepare such a description
>> exactly when WAL is applied? And WAL record to have just a basic write info?
> Yeah, I think this is a problem.  I see two basic paths:
>
>   - We do a wal flush before queueing a new wal event to avoid races like
> this. Or perhaps we only do it when the wal event(s) touch the same
> blob(s).  That's simple to reason about, but means that a series
> of small IOs to the same object (or blob) will serialize the kv commit and
> wal r/m/w operations.  (Note that this is no worse than the naive approach
> of doing the read part up front, and it only happens when you have
> successive wal ops on the same object (or blob)).
>
>   - We describe the wal read-side in terms of the current onode state.  For
> example, 'read object offset 0..100, use provided buffer for 100..4096,
> overwrite block'.  That can be pipelined.  But there are other
> operations that would require we flush the wal events, like a truncate or
> zero or other write that clobbers that region of the object.
> Maybe/hopefully in those cases we don't care (it no longer matters that
> this wal event do the write we originally intended) but we'd need
> to think pretty carefully about it.  FWIW, truncate already does an
> o->flush().
I'd prefer the second approach. Probably with some modification...
As far as I understand with the approach above you are trying to locate 
all write logic at a single place and have WAL machinery as a 
straightforward executor for already prepared tasks. Not sure this is 
beneficial enough. But definitely it's more complex and error-prone. And 
potentially you will need extend WAL machinery task description from 
time to time...
As an alternative one can eliminate that read description in WAL record 
at all. Let's simply record what loffset we are going to write to and 
data itself. Thus we have simple write request description.
And when WAL is applied corresponding code should determine how to do 
the write properly using the current lextent/blob maps state. This way 
Write Op apply can be just a regular write handling that performs sync 
RMW or any other implementation depending on the current state, some 
policy, or whatever else that fits the best at the specific moment.

>> And for GC/cleanup process this becomes even more important as the task
>> may be deferred for a while and lextent map may be significantly
>> altered.
> I get the feeling that the GC process can either (1) write new blobs in
> new locations and do an atomic transition, without interacting with the
> wal events at all, or (2) we just do the work once we committed to it. I
> think the potential benefit of committing to do wal work and then changing
> our mind is pretty small.
Yes that's probably true. My concern was regarding the attempt to 
describe GC tasks using wal_op struct with read description embedded. 
IMHO for GC that's even more inappropriate than for WAL as GC alters 
lextent map more massively.

>
>> And I suppose blob id should be 2 not 1 here:
>> https://github.com/liewegas/ceph/blob/wip-bluestore-write/src/os/bluestore/BlueStore.cc#L5545
> Ah, yes, thanks!
> sage


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: 2 related bluestore questions
  2016-05-11 13:45               ` Igor Fedotov
@ 2016-05-11 13:57                 ` Sage Weil
  2016-05-11 20:54                   ` Sage Weil
  0 siblings, 1 reply; 28+ messages in thread
From: Sage Weil @ 2016-05-11 13:57 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: allen.samuels, ceph-devel

On Wed, 11 May 2016, Igor Fedotov wrote:
> On 11.05.2016 16:10, Sage Weil wrote:
> > On Wed, 11 May 2016, Igor Fedotov wrote:
> > > > I took a stab at a revised wal_op_t here:
> > > > 
> > > > 	https://github.com/liewegas/ceph/blob/wip-bluestore-write/src/os/bluestore/bluestore_types.h#L595-L605
> > > > 
> > > > This is enough to implement the basic wal overwrite case here:
> > > > 
> > > > 	https://github.com/liewegas/ceph/blob/wip-bluestore-write/src/os/bluestore/BlueStore.cc#L5522-L5578
> > > > 
> > > > It's overkill for that, but something like this ought to be sufficiently
> > > > general to express the more complicated wal (and compaction/gc/cleanup)
> > > > operations, where we are reading bits of data from lots of different
> > > > previous blobs, verifying checksums, and then assembling the results
> > > > into
> > > > a new buffer that gets written somewhere else.  The read_extent_map and
> > > > write_map offsets are logical offsets in a buffer we assemble and then
> > > > write to b_off~b_len in the specific blob.  I didn't get to the
> > > > _do_wal_op
> > > > part that actually does it, but it would do the final write, csum
> > > > calculation, and metadata update.  Probably... the allocation would
> > > > happen
> > > > then too, if the specified blob didn't already have pextents.  Tha way
> > > > we can do compression at that stage as well?
> > > > 
> > > > What do you think?
> > > Not completely sure that it's a good idea to have read stage description
> > > stored in WAL record? Wouldn't that produce any conflicts/inconsistencies
> > > when
> > > multiple WAL records deal with the same or close lextents and previous WAL
> > > updates lextents to read. May be it's better to prepare such a description
> > > exactly when WAL is applied? And WAL record to have just a basic write
> > > info?
> > Yeah, I think this is a problem.  I see two basic paths:
> > 
> >   - We do a wal flush before queueing a new wal event to avoid races like
> > this. Or perhaps we only do it when the wal event(s) touch the same
> > blob(s).  That's simple to reason about, but means that a series
> > of small IOs to the same object (or blob) will serialize the kv commit and
> > wal r/m/w operations.  (Note that this is no worse than the naive approach
> > of doing the read part up front, and it only happens when you have
> > successive wal ops on the same object (or blob)).
> > 
> >   - We describe the wal read-side in terms of the current onode state.  For
> > example, 'read object offset 0..100, use provided buffer for 100..4096,
> > overwrite block'.  That can be pipelined.  But there are other
> > operations that would require we flush the wal events, like a truncate or
> > zero or other write that clobbers that region of the object.
> > Maybe/hopefully in those cases we don't care (it no longer matters that
> > this wal event do the write we originally intended) but we'd need
> > to think pretty carefully about it.  FWIW, truncate already does an
> > o->flush().
> I'd prefer the second approach. Probably with some modification...
> As far as I understand with the approach above you are trying to locate all
> write logic at a single place and have WAL machinery as a straightforward
> executor for already prepared tasks. Not sure this is beneficial enough. But
> definitely it's more complex and error-prone. And potentially you will need
> extend WAL machinery task description from time to time...
> As an alternative one can eliminate that read description in WAL record at
> all. Let's simply record what loffset we are going to write to and data
> itself. Thus we have simple write request description.
> And when WAL is applied corresponding code should determine how to do the
> write properly using the current lextent/blob maps state. This way Write Op
> apply can be just a regular write handling that performs sync RMW or any other
> implementation depending on the current state, some policy, or whatever else
> that fits the best at the specific moment.

I like that way better!  We can just add a force_sync argument to 
_do_write.  That also lets us trivially disable wal (by forcing sync w/ 
a config option or whatever).

The downside is that any logically conflicting request (an overlapping 
write or truncate or zero) needs to drain the wal events, whereas with a 
lower-level wal description there might be cases where we can ignore the 
wal operation.  I suspect the trivial solution of o->flush() on 
write/truncate/zero will be pretty visible in benchmarks.  Tracking 
in-flight wal ops with an interval_set would probably work well enough.

sage

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: 2 related bluestore questions
  2016-05-11 12:39           ` Igor Fedotov
@ 2016-05-11 14:35             ` Sage Weil
  0 siblings, 0 replies; 28+ messages in thread
From: Sage Weil @ 2016-05-11 14:35 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: allen.samuels, ceph-devel

On Wed, 11 May 2016, Igor Fedotov wrote:
> Sage,
> 
> thanks a lot for detailed overview.
> 
> Below is a brief proposal on Extent Manager extension to support WAL.
> Please tell me what's your final decision its fortune? Are there any plans to
> use it or we completely rebase to the code you are woring on. Is there any
> sense for me to to continue working on it or I should switch to something in
> your stream.

I would prefer to work from my stream to flesh out the read and write 
code, and then move to an ExtentManager modularization once we understand 
the dependencies of the write logic on other BlueStore state.  FWIW I've 
been pushing some of the mapping logic into the bluestore_types.h types 
and adding unit tests for those, so that hopefully it is more minimal.

> Here is the idea that's pretty simple:
> 1) Add new method(s)  to BlockOpInterface to add WAL record. EX to call it
> when corresponding overwrite (or whatever other case) is detected.
> 2) Add method(s) to ExtentManger to process WAL record. They are to be called
> from _wal_apply/_do_wal_op and EM to perform corresponding RMW tasks. It does
> that using other regular BlockOpInterface methods ( read/write/allocate etc)
> thus Bluestore environment should be ready to handle them during the WAL phase
> as it does during the initial write hanlding one. But it looks like we need
> that anyway - EM doesn't create any additional complexities here.

I think your logical-level wal write suggestion will make this work, but 
I'd like to get it wired up so that the wal operations can have metadata 
side-effects before worrying about the interface.

> And a couple of general notes on WAL apply handling that I realized recently
> and we should probably care of.
> 1) when WAL is applied on compressed blobs one should always allocate and use
> a new blob. Even for the case when resulting buffer fit into the existing
> blob. The rationale for that is that if we crash after the write phase  we and
> before KV update we can't properly execute WAL replay. Source data is altered
> and we are unable to repeat RMW.

Definitely.

> 2) Similarly during WAL replay after the failure one should disable checksum
> verification for existing data read if any. The rationale is the same - we can
> crash after the write and before KV update thus previous checksum in KV is
> invalid.

Yeah, I've been mulling this one over.  What we could do is make 
the checksum verification on the read portion try both the previous 
checksum and the to-be-written checksum and see if either of them matches.  
The problem is that the write could be torn at a block boundary, so we 
might have to try a bunch of different combinations.  Or, we could just 
disable the csum check in the wal replay r/m/w path.

I don't really like the idea of amplifying a 4k write into a 
min_alloc_size (e.g., 64k) read + write elsewhere.  Even on a spinning 
disk, which doesn't really care about the 4k vs 64k, the new write will 
generally be in a different track so it'll be an additional seek.  :(

sage



> 
> Thanks,
> Igor.
> 
> On 11.05.2016 4:10, Sage Weil wrote:
> > On Tue, 10 May 2016, Sage Weil wrote:
> > > > > Making the wal part of the consistency model is more complex, but it
> > > > > means
> > > > > we can (1) log our intent to overwrite atomically with the kv txn
> > > > > commit,
> > > > > and then (2) do the async overwrite.  It will get a bit more complex
> > > > > because we'll be doing metadata updates as part of the wal completion,
> > > > > but
> > > > > it's not a big step from where we are now, and I think the performance
> > > > > benefit will be worth it.
> > > > May I have some example how it's supposed to work please?
> > > At a high level,
> > > 
> > > 1- prepare kv commit.  it includes a wal_op_t that describes a
> > > read/modify/write of a csum block within some existing blob.
> > > 2- commit the kv txn
> > > 3- do the wal read/modify/write
> > > 4- calculate new csum
> > > 5- in the wal 'cleanup' transaction (which removes the wal event),
> > > also update blob csum_data in the onode or bnode.
> > > 
> > > In practice, I think we want to pass TransContext down from _wal_apply
> > > into _do_wal_op, and put modified onode or bnode in the txc dirty list.
> > > (We'll need to clear it after the current _txc_finalize call so that it
> > > doesn't have the first phase's dirty stuff still there.)  Then, in
> > > _kv_sync_thread, where the
> > > 
> > >        // cleanup sync wal keys
> > > 
> > > stuff is, we probably want to have a helper that captures the wal event
> > > removal *and* any other stuff we need to do.. like update onodes and
> > > bnodes.  It'll look similar to _txc_finalize, I think.
> > > 
> > > Then the main piece is how to modify the bluestore_wal_op_t to describe
> > > which blob metadata we're modifying and how to do the
> > > whole read/modify/write operation.  I think
> > > 
> > >   - we need to bundle the csum data for anything we read.  we can probably
> > > just put a blob_t in here, since it includes the extents and csum metadata
> > > all together.
> > >   - we need to describe where the blob exists (which onode or bnode owns
> > > it, and what its id is) so that do_wal_op can find it and update it.
> > >     * we might want to optimize the normal path so that we can use the
> > > in-memory copy without doing a lookup
> > > 
> > > It probably means a mostly rewritten wal_op_t type.  I think the ops we
> > > need to capture are
> > > 
> > >   - overwrite
> > >   - read / modify / write (e.g., partial blocks)
> > >   - read / modify / write (and update csum metadata)
> > >   - read / modify / compress / write (and update csum metadata)
> > >   - read / write elsewhere (e.g., the current copy op, used for cow)
> > > 
> > > Since compression is thrown in there, we probably need to be able to
> > > allocate in the do_wal_op path too.  I think that'll be okay... it's
> > > making the wal finalize kv look more and more like the current
> > > txc_finalize.  That probably means if we're careful we can use the same
> > > code for both?
> > 
> > I took a stab at a revised wal_op_t here:
> > 
> > 	https://github.com/liewegas/ceph/blob/wip-bluestore-write/src/os/bluestore/bluestore_types.h#L595-L605
> > 
> > This is enough to implement the basic wal overwrite case here:
> > 
> > 	https://github.com/liewegas/ceph/blob/wip-bluestore-write/src/os/bluestore/BlueStore.cc#L5522-L5578
> > 
> > It's overkill for that, but something like this ought to be sufficiently
> > general to express the more complicated wal (and compaction/gc/cleanup)
> > operations, where we are reading bits of data from lots of different
> > previous blobs, verifying checksums, and then assembling the results into
> > a new buffer that gets written somewhere else.  The read_extent_map and
> > write_map offsets are logical offsets in a buffer we assemble and then
> > write to b_off~b_len in the specific blob.  I didn't get to the _do_wal_op
> > part that actually does it, but it would do the final write, csum
> > calculation, and metadata update.  Probably... the allocation would happen
> > then too, if the specified blob didn't already have pextents.  Tha way
> > we can do compression at that stage as well?
> > 
> > What do you think?
> > 
> > sage
> > 
> 
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: 2 related bluestore questions
  2016-05-11 13:57                 ` Sage Weil
@ 2016-05-11 20:54                   ` Sage Weil
  2016-05-11 21:38                     ` Allen Samuels
  2016-05-12 14:27                     ` Igor Fedotov
  0 siblings, 2 replies; 28+ messages in thread
From: Sage Weil @ 2016-05-11 20:54 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: allen.samuels, ceph-devel

On Wed, 11 May 2016, Sage Weil wrote:
> On Wed, 11 May 2016, Igor Fedotov wrote:
> > On 11.05.2016 16:10, Sage Weil wrote:
> > > On Wed, 11 May 2016, Igor Fedotov wrote:
> > > > > I took a stab at a revised wal_op_t here:
> > > > > 
> > > > > 	https://github.com/liewegas/ceph/blob/wip-bluestore-write/src/os/bluestore/bluestore_types.h#L595-L605
> > > > > 
> > > > > This is enough to implement the basic wal overwrite case here:
> > > > > 
> > > > > 	https://github.com/liewegas/ceph/blob/wip-bluestore-write/src/os/bluestore/BlueStore.cc#L5522-L5578
> > > > > 
> > > > > It's overkill for that, but something like this ought to be sufficiently
> > > > > general to express the more complicated wal (and compaction/gc/cleanup)
> > > > > operations, where we are reading bits of data from lots of different
> > > > > previous blobs, verifying checksums, and then assembling the results
> > > > > into
> > > > > a new buffer that gets written somewhere else.  The read_extent_map and
> > > > > write_map offsets are logical offsets in a buffer we assemble and then
> > > > > write to b_off~b_len in the specific blob.  I didn't get to the
> > > > > _do_wal_op
> > > > > part that actually does it, but it would do the final write, csum
> > > > > calculation, and metadata update.  Probably... the allocation would
> > > > > happen
> > > > > then too, if the specified blob didn't already have pextents.  Tha way
> > > > > we can do compression at that stage as well?
> > > > > 
> > > > > What do you think?
> > > > Not completely sure that it's a good idea to have read stage description
> > > > stored in WAL record? Wouldn't that produce any conflicts/inconsistencies
> > > > when
> > > > multiple WAL records deal with the same or close lextents and previous WAL
> > > > updates lextents to read. May be it's better to prepare such a description
> > > > exactly when WAL is applied? And WAL record to have just a basic write
> > > > info?
> > > Yeah, I think this is a problem.  I see two basic paths:
> > > 
> > >   - We do a wal flush before queueing a new wal event to avoid races like
> > > this. Or perhaps we only do it when the wal event(s) touch the same
> > > blob(s).  That's simple to reason about, but means that a series
> > > of small IOs to the same object (or blob) will serialize the kv commit and
> > > wal r/m/w operations.  (Note that this is no worse than the naive approach
> > > of doing the read part up front, and it only happens when you have
> > > successive wal ops on the same object (or blob)).
> > > 
> > >   - We describe the wal read-side in terms of the current onode state.  For
> > > example, 'read object offset 0..100, use provided buffer for 100..4096,
> > > overwrite block'.  That can be pipelined.  But there are other
> > > operations that would require we flush the wal events, like a truncate or
> > > zero or other write that clobbers that region of the object.
> > > Maybe/hopefully in those cases we don't care (it no longer matters that
> > > this wal event do the write we originally intended) but we'd need
> > > to think pretty carefully about it.  FWIW, truncate already does an
> > > o->flush().
> > I'd prefer the second approach. Probably with some modification...
> > As far as I understand with the approach above you are trying to locate all
> > write logic at a single place and have WAL machinery as a straightforward
> > executor for already prepared tasks. Not sure this is beneficial enough. But
> > definitely it's more complex and error-prone. And potentially you will need
> > extend WAL machinery task description from time to time...
> > As an alternative one can eliminate that read description in WAL record at
> > all. Let's simply record what loffset we are going to write to and data
> > itself. Thus we have simple write request description.
> > And when WAL is applied corresponding code should determine how to do the
> > write properly using the current lextent/blob maps state. This way Write Op
> > apply can be just a regular write handling that performs sync RMW or any other
> > implementation depending on the current state, some policy, or whatever else
> > that fits the best at the specific moment.
> 
> I like that way better!  We can just add a force_sync argument to 
> _do_write.  That also lets us trivially disable wal (by forcing sync w/ 
> a config option or whatever).
> 
> The downside is that any logically conflicting request (an overlapping 
> write or truncate or zero) needs to drain the wal events, whereas with a 
> lower-level wal description there might be cases where we can ignore the 
> wal operation.  I suspect the trivial solution of o->flush() on 
> write/truncate/zero will be pretty visible in benchmarks.  Tracking 
> in-flight wal ops with an interval_set would probably work well enough.

Hmm, I'm not sure this will pan out.  The main problem is that if we call 
back into the write code (with a sync flag), we will have to do write 
IO, and this wreaks havoc on our otherwise (mostly) orderly state machine.  
I think it can be done if we build in a similar guard like _txc_finish_io 
so that we wait for the wal events to also complete IO in order before 
committing them.  I think.

But the other problem is the checksum thing that came up in another 
thread, where the read-side of a read/modify/write might fail teh checksum 
because the wal write hit disk but the kv portion didn't commit. I see a 
few options:

 1) If there are checksums and we're doing a sub-block overwrite, we 
have to write/cow it elsewhere.  This probably means min_alloc_size cow 
operations for small writes.  In which case we needn't bother doing a wal 
even in the first place--the whole point is to enable an overwrite.

 2) We do loose checksum validation that will accept either the old 
checksum or the expected new checksum for the read stage.  This handles 
these two crash cases:

 * kv commit of op + wal event
   <crash here, or>
 * do wal io (completely)
   <crash before cleaning up wal event>
 * kv cleanup of wal event

but not the case where we only partially complete the wal io.  Which means 
there is a small probability is "corrupt" ourselves on crash (not really 
corrupt, but confuse ourselves such that we refuse to replay the 
wal events on startup).

 3) Same as 2, but simply warn if we fail that read-side checksum on 
replay.  This basically introduces a *very* small window which could allow 
an ondisk corruption to get absorbed into our checksum.  This could just 
be #2 + a config option so we warn instead of erroring out.

 4) Same as 2, but we try every combination of old and new data on 
block/sector boundaries to find a valid checksum on the read-side.

I think #1 is a non-starter because it turns a 4K write into a 64K read + 
seek + 64K write on an HDD.  Or forces us to run with min_alloc_size=4K on 
HDD, which would risk very bad fragmentation.

Which makes we want #3 (initially) and then #4.  But... if we do the "wal 
is just a logical write", that means this weird replay handling logic 
creeps into the normal write path.

I'm currently leaning toward keeping the wal events special (lower-level), 
but doing what we can to make it work with the same mid- to low-level 
helper functions (for reading and verifying blobs, etc.).

sage

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: 2 related bluestore questions
  2016-05-11 20:54                   ` Sage Weil
@ 2016-05-11 21:38                     ` Allen Samuels
  2016-05-12  2:58                       ` Sage Weil
  2016-05-12 14:29                       ` Igor Fedotov
  2016-05-12 14:27                     ` Igor Fedotov
  1 sibling, 2 replies; 28+ messages in thread
From: Allen Samuels @ 2016-05-11 21:38 UTC (permalink / raw)
  To: Sage Weil, Igor Fedotov; +Cc: ceph-devel

Sorry, still on vacation and I haven't really wrapped my head around everything that's being discussed. However, w.r.t. wal operations, I would strongly favor an approach that minimizes the amount of "future" operations that are recorded (which I'll call intentions -- i.e., non-binding hints about extra work that needs to get done). Much of the complexity here is because the intentions -- after being recorded -- will need to be altered based on subsequent operations. Hence every write operation will need to digest the historical intentions and potentially update them -- this is VERY complex, potentially much more complex than code that simply examines the current state and re-determines the correct next operation (i.e., de-wal, gc, etc.)

Additional complexity arises because you're recording two sets of state that require consistency checking -- in my experience, this road leads to perdition....

Maybe I've missed something critical (vacation-headed) if so, please ignore.


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions 

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com


> -----Original Message-----
> From: Sage Weil [mailto:sage@newdream.net]
> Sent: Thursday, May 12, 2016 6:54 AM
> To: Igor Fedotov <ifedotov@mirantis.com>
> Cc: Allen Samuels <Allen.Samuels@sandisk.com>; ceph-
> devel@vger.kernel.org
> Subject: Re: 2 related bluestore questions
> 
> On Wed, 11 May 2016, Sage Weil wrote:
> > On Wed, 11 May 2016, Igor Fedotov wrote:
> > > On 11.05.2016 16:10, Sage Weil wrote:
> > > > On Wed, 11 May 2016, Igor Fedotov wrote:
> > > > > > I took a stab at a revised wal_op_t here:
> > > > > >
> > > > > >
> > > > > > https://github.com/liewegas/ceph/blob/wip-bluestore-write/src/
> > > > > > os/bluestore/bluestore_types.h#L595-L605
> > > > > >
> > > > > > This is enough to implement the basic wal overwrite case here:
> > > > > >
> > > > > >
> > > > > > https://github.com/liewegas/ceph/blob/wip-bluestore-write/src/
> > > > > > os/bluestore/BlueStore.cc#L5522-L5578
> > > > > >
> > > > > > It's overkill for that, but something like this ought to be
> > > > > > sufficiently general to express the more complicated wal (and
> > > > > > compaction/gc/cleanup) operations, where we are reading bits
> > > > > > of data from lots of different previous blobs, verifying
> > > > > > checksums, and then assembling the results into a new buffer
> > > > > > that gets written somewhere else.  The read_extent_map and
> > > > > > write_map offsets are logical offsets in a buffer we assemble
> > > > > > and then write to b_off~b_len in the specific blob.  I didn't
> > > > > > get to the _do_wal_op part that actually does it, but it would
> > > > > > do the final write, csum calculation, and metadata update.
> > > > > > Probably... the allocation would happen then too, if the
> > > > > > specified blob didn't already have pextents.  Tha way we can
> > > > > > do compression at that stage as well?
> > > > > >
> > > > > > What do you think?
> > > > > Not completely sure that it's a good idea to have read stage
> > > > > description stored in WAL record? Wouldn't that produce any
> > > > > conflicts/inconsistencies when multiple WAL records deal with
> > > > > the same or close lextents and previous WAL updates lextents to
> > > > > read. May be it's better to prepare such a description exactly
> > > > > when WAL is applied? And WAL record to have just a basic write
> > > > > info?
> > > > Yeah, I think this is a problem.  I see two basic paths:
> > > >
> > > >   - We do a wal flush before queueing a new wal event to avoid
> > > > races like this. Or perhaps we only do it when the wal event(s)
> > > > touch the same blob(s).  That's simple to reason about, but means
> > > > that a series of small IOs to the same object (or blob) will
> > > > serialize the kv commit and wal r/m/w operations.  (Note that this
> > > > is no worse than the naive approach of doing the read part up
> > > > front, and it only happens when you have successive wal ops on the
> same object (or blob)).
> > > >
> > > >   - We describe the wal read-side in terms of the current onode
> > > > state.  For example, 'read object offset 0..100, use provided
> > > > buffer for 100..4096, overwrite block'.  That can be pipelined.
> > > > But there are other operations that would require we flush the wal
> > > > events, like a truncate or zero or other write that clobbers that region
> of the object.
> > > > Maybe/hopefully in those cases we don't care (it no longer matters
> > > > that this wal event do the write we originally intended) but we'd
> > > > need to think pretty carefully about it.  FWIW, truncate already
> > > > does an
> > > > o->flush().
> > > I'd prefer the second approach. Probably with some modification...
> > > As far as I understand with the approach above you are trying to
> > > locate all write logic at a single place and have WAL machinery as a
> > > straightforward executor for already prepared tasks. Not sure this
> > > is beneficial enough. But definitely it's more complex and
> > > error-prone. And potentially you will need extend WAL machinery task
> description from time to time...
> > > As an alternative one can eliminate that read description in WAL
> > > record at all. Let's simply record what loffset we are going to
> > > write to and data itself. Thus we have simple write request description.
> > > And when WAL is applied corresponding code should determine how to
> > > do the write properly using the current lextent/blob maps state.
> > > This way Write Op apply can be just a regular write handling that
> > > performs sync RMW or any other implementation depending on the
> > > current state, some policy, or whatever else that fits the best at the
> specific moment.
> >
> > I like that way better!  We can just add a force_sync argument to
> > _do_write.  That also lets us trivially disable wal (by forcing sync
> > w/ a config option or whatever).
> >
> > The downside is that any logically conflicting request (an overlapping
> > write or truncate or zero) needs to drain the wal events, whereas with
> > a lower-level wal description there might be cases where we can ignore
> > the wal operation.  I suspect the trivial solution of o->flush() on
> > write/truncate/zero will be pretty visible in benchmarks.  Tracking
> > in-flight wal ops with an interval_set would probably work well enough.
> 
> Hmm, I'm not sure this will pan out.  The main problem is that if we call back
> into the write code (with a sync flag), we will have to do write IO, and this
> wreaks havoc on our otherwise (mostly) orderly state machine.
> I think it can be done if we build in a similar guard like _txc_finish_io so that
> we wait for the wal events to also complete IO in order before committing
> them.  I think.
> 
> But the other problem is the checksum thing that came up in another thread,
> where the read-side of a read/modify/write might fail teh checksum because
> the wal write hit disk but the kv portion didn't commit. I see a few options:
> 
>  1) If there are checksums and we're doing a sub-block overwrite, we have to
> write/cow it elsewhere.  This probably means min_alloc_size cow operations
> for small writes.  In which case we needn't bother doing a wal even in the
> first place--the whole point is to enable an overwrite.
> 
>  2) We do loose checksum validation that will accept either the old checksum
> or the expected new checksum for the read stage.  This handles these two
> crash cases:
> 
>  * kv commit of op + wal event
>    <crash here, or>
>  * do wal io (completely)
>    <crash before cleaning up wal event>
>  * kv cleanup of wal event
> 
> but not the case where we only partially complete the wal io.  Which means
> there is a small probability is "corrupt" ourselves on crash (not really corrupt,
> but confuse ourselves such that we refuse to replay the wal events on
> startup).
> 
>  3) Same as 2, but simply warn if we fail that read-side checksum on replay.
> This basically introduces a *very* small window which could allow an ondisk
> corruption to get absorbed into our checksum.  This could just be #2 + a
> config option so we warn instead of erroring out.
> 
>  4) Same as 2, but we try every combination of old and new data on
> block/sector boundaries to find a valid checksum on the read-side.
> 
> I think #1 is a non-starter because it turns a 4K write into a 64K read + seek +
> 64K write on an HDD.  Or forces us to run with min_alloc_size=4K on HDD,
> which would risk very bad fragmentation.
> 
> Which makes we want #3 (initially) and then #4.  But... if we do the "wal is
> just a logical write", that means this weird replay handling logic creeps into
> the normal write path.
> 
> I'm currently leaning toward keeping the wal events special (lower-level), but
> doing what we can to make it work with the same mid- to low-level helper
> functions (for reading and verifying blobs, etc.).
> 
> sage

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: 2 related bluestore questions
  2016-05-11 21:38                     ` Allen Samuels
@ 2016-05-12  2:58                       ` Sage Weil
  2016-05-12 11:54                         ` Allen Samuels
                                           ` (2 more replies)
  2016-05-12 14:29                       ` Igor Fedotov
  1 sibling, 3 replies; 28+ messages in thread
From: Sage Weil @ 2016-05-12  2:58 UTC (permalink / raw)
  To: Allen Samuels; +Cc: Igor Fedotov, ceph-devel

On Wed, 11 May 2016, Allen Samuels wrote:
> Sorry, still on vacation and I haven't really wrapped my head around 
> everything that's being discussed. However, w.r.t. wal operations, I 
> would strongly favor an approach that minimizes the amount of "future" 
> operations that are recorded (which I'll call intentions -- i.e., 
> non-binding hints about extra work that needs to get done). Much of the 
> complexity here is because the intentions -- after being recorded -- 
> will need to be altered based on subsequent operations. Hence every 
> write operation will need to digest the historical intentions and 
> potentially update them -- this is VERY complex, potentially much more 
> complex than code that simply examines the current state and 
> re-determines the correct next operation (i.e., de-wal, gc, etc.)
> 
> Additional complexity arises because you're recording two sets of state 
> that require consistency checking -- in my experience, this road leads 
> to perdition....

I agree is has to be something manageable that we can reason about.  I
think the question for me is mostly about which path minimizes the 
complexity while still getting us a reasonable level of performance.

I had one new thought, see below...

> > > The downside is that any logically conflicting request (an overlapping
> > > write or truncate or zero) needs to drain the wal events, whereas with
> > > a lower-level wal description there might be cases where we can ignore
> > > the wal operation.  I suspect the trivial solution of o->flush() on
> > > write/truncate/zero will be pretty visible in benchmarks.  Tracking
> > > in-flight wal ops with an interval_set would probably work well enough.
> > 
> > Hmm, I'm not sure this will pan out.  The main problem is that if we call back
> > into the write code (with a sync flag), we will have to do write IO, and this
> > wreaks havoc on our otherwise (mostly) orderly state machine.
> > I think it can be done if we build in a similar guard like _txc_finish_io so that
> > we wait for the wal events to also complete IO in order before committing
> > them.  I think.
> > 
> > But the other problem is the checksum thing that came up in another thread,
> > where the read-side of a read/modify/write might fail teh checksum because
> > the wal write hit disk but the kv portion didn't commit. I see a few options:
> > 
> >  1) If there are checksums and we're doing a sub-block overwrite, we have to
> > write/cow it elsewhere.  This probably means min_alloc_size cow operations
> > for small writes.  In which case we needn't bother doing a wal even in the
> > first place--the whole point is to enable an overwrite.
> > 
> >  2) We do loose checksum validation that will accept either the old checksum
> > or the expected new checksum for the read stage.  This handles these two
> > crash cases:
> > 
> >  * kv commit of op + wal event
> >    <crash here, or>
> >  * do wal io (completely)
> >    <crash before cleaning up wal event>
> >  * kv cleanup of wal event
> > 
> > but not the case where we only partially complete the wal io.  Which means
> > there is a small probability is "corrupt" ourselves on crash (not really corrupt,
> > but confuse ourselves such that we refuse to replay the wal events on
> > startup).
> > 
> >  3) Same as 2, but simply warn if we fail that read-side checksum on replay.
> > This basically introduces a *very* small window which could allow an ondisk
> > corruption to get absorbed into our checksum.  This could just be #2 + a
> > config option so we warn instead of erroring out.
> > 
> >  4) Same as 2, but we try every combination of old and new data on
> > block/sector boundaries to find a valid checksum on the read-side.
> > 
> > I think #1 is a non-starter because it turns a 4K write into a 64K read + seek +
> > 64K write on an HDD.  Or forces us to run with min_alloc_size=4K on HDD,
> > which would risk very bad fragmentation.
> > 
> > Which makes we want #3 (initially) and then #4.  But... if we do the "wal is
> > just a logical write", that means this weird replay handling logic creeps into
> > the normal write path.
> > 
> > I'm currently leaning toward keeping the wal events special (lower-level), but
> > doing what we can to make it work with the same mid- to low-level helper
> > functions (for reading and verifying blobs, etc.).

It occured to me that this checksum consistency issue only comes up when 
we are updating something that is smaller than the csum block size.  And 
the real source of the problem is that you have a sequence of

 1- journal intent (kv wal item)
 2- do read io
 3- verify csum
 4- do write io
 5- cancel intent (remove kv wal item)

If we have an order like

 1- do read io
 2- journal intent for entire csum chunk (kv wal item)
 3- do write io
 4- cancel intent

Then the issue goes away.  And I'm thinking if the csum chunk is big 
enough that the #2 step is too big of a wal item to perform well, then the 
problem is your choice of csum block size, not the approach.  I.e., use a 
4kb csum block size for rbd images, and use large blocks (128k, 512k, 
whatever) only for things that never see random overwrites (rgw data).

If that is good enough, then it might also mean that we can make the wal 
operations never do reads--just (over)writes, further simplifying things 
on that end.  In the jewel bluestore the only times we do reads is for 
partial block updates (do we really care about these?  a buffer cache 
could absorb them when it matters) and for copy/cow operations post-clone 
(which i think are simple enough to be deal with separately).

sage

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: 2 related bluestore questions
  2016-05-12  2:58                       ` Sage Weil
@ 2016-05-12 11:54                         ` Allen Samuels
  2016-05-12 14:47                           ` Igor Fedotov
  2016-05-12 14:38                         ` Igor Fedotov
  2016-05-12 16:37                         ` Igor Fedotov
  2 siblings, 1 reply; 28+ messages in thread
From: Allen Samuels @ 2016-05-12 11:54 UTC (permalink / raw)
  To: Sage Weil; +Cc: Igor Fedotov, ceph-devel

Ok, I hope I'm starting to understand. 

The real source of the problem is the desire to "write in place". The kv wal item (intention) and the associated complexity are associated with the need to recognize that a chunk may have an incorrect checksum for a transient period of time because of an in-place update operation.

One simple solution to this is to eliminate write-in-place. You're already reading a bunch of data, computing the new checksums and then writing out the correct data. If you cow the write, then all of the torn page stuff goes away and you cut down the number of kv updates significantly (no intention is required, only a single kv update is required for the entire cow process). Then the only thing that's required to make this perform more or less equivalent to the complicated logic that you're dealing with now is to make sure that the COW destination location is "close" to the original source data location. Basically, you'll want to modify the allocation algorithm so as to leave some "space" in each cylinder so that the COW destination can always be close to the source (you only need a few p
 er cylinder since each COW frees up another one). This is a tiny bit ugly, but much much simpler than all of the machinations that you're going through now. 

I would point out that one important optimization is to do a good job of handling append log files. These will look like multiple-write-in-places that will have to get merged. I believe that this optimization is particularly easy to handle with the COW approach that I described.

I would point out that nothing prevents an addition of a non-COW update-in-place at a future time. 

Allen Samuels
Software Architect, Fellow, Systems and Software Solutions 

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com

> -----Original Message-----
> From: Sage Weil [mailto:sage@newdream.net]
> Sent: Thursday, May 12, 2016 12:58 PM
> To: Allen Samuels <Allen.Samuels@sandisk.com>
> Cc: Igor Fedotov <ifedotov@mirantis.com>; ceph-devel@vger.kernel.org
> Subject: RE: 2 related bluestore questions
> 
> On Wed, 11 May 2016, Allen Samuels wrote:
> > Sorry, still on vacation and I haven't really wrapped my head around
> > everything that's being discussed. However, w.r.t. wal operations, I
> > would strongly favor an approach that minimizes the amount of "future"
> > operations that are recorded (which I'll call intentions -- i.e.,
> > non-binding hints about extra work that needs to get done). Much of
> > the complexity here is because the intentions -- after being recorded
> > -- will need to be altered based on subsequent operations. Hence every
> > write operation will need to digest the historical intentions and
> > potentially update them -- this is VERY complex, potentially much more
> > complex than code that simply examines the current state and
> > re-determines the correct next operation (i.e., de-wal, gc, etc.)
> >
> > Additional complexity arises because you're recording two sets of
> > state that require consistency checking -- in my experience, this road
> > leads to perdition....
> 
> I agree is has to be something manageable that we can reason about.  I think
> the question for me is mostly about which path minimizes the complexity
> while still getting us a reasonable level of performance.
> 
> I had one new thought, see below...
> 
> > > > The downside is that any logically conflicting request (an
> > > > overlapping write or truncate or zero) needs to drain the wal
> > > > events, whereas with a lower-level wal description there might be
> > > > cases where we can ignore the wal operation.  I suspect the
> > > > trivial solution of o->flush() on write/truncate/zero will be
> > > > pretty visible in benchmarks.  Tracking in-flight wal ops with an
> interval_set would probably work well enough.
> > >
> > > Hmm, I'm not sure this will pan out.  The main problem is that if we
> > > call back into the write code (with a sync flag), we will have to do
> > > write IO, and this wreaks havoc on our otherwise (mostly) orderly state
> machine.
> > > I think it can be done if we build in a similar guard like
> > > _txc_finish_io so that we wait for the wal events to also complete
> > > IO in order before committing them.  I think.
> > >
> > > But the other problem is the checksum thing that came up in another
> > > thread, where the read-side of a read/modify/write might fail teh
> > > checksum because the wal write hit disk but the kv portion didn't commit.
> I see a few options:
> > >
> > >  1) If there are checksums and we're doing a sub-block overwrite, we
> > > have to write/cow it elsewhere.  This probably means min_alloc_size
> > > cow operations for small writes.  In which case we needn't bother
> > > doing a wal even in the first place--the whole point is to enable an
> overwrite.
> > >
> > >  2) We do loose checksum validation that will accept either the old
> > > checksum or the expected new checksum for the read stage.  This
> > > handles these two crash cases:
> > >
> > >  * kv commit of op + wal event
> > >    <crash here, or>
> > >  * do wal io (completely)
> > >    <crash before cleaning up wal event>
> > >  * kv cleanup of wal event
> > >
> > > but not the case where we only partially complete the wal io.  Which
> > > means there is a small probability is "corrupt" ourselves on crash
> > > (not really corrupt, but confuse ourselves such that we refuse to
> > > replay the wal events on startup).
> > >
> > >  3) Same as 2, but simply warn if we fail that read-side checksum on
> replay.
> > > This basically introduces a *very* small window which could allow an
> > > ondisk corruption to get absorbed into our checksum.  This could
> > > just be #2 + a config option so we warn instead of erroring out.
> > >
> > >  4) Same as 2, but we try every combination of old and new data on
> > > block/sector boundaries to find a valid checksum on the read-side.
> > >
> > > I think #1 is a non-starter because it turns a 4K write into a 64K
> > > read + seek + 64K write on an HDD.  Or forces us to run with
> > > min_alloc_size=4K on HDD, which would risk very bad fragmentation.
> > >
> > > Which makes we want #3 (initially) and then #4.  But... if we do the
> > > "wal is just a logical write", that means this weird replay handling
> > > logic creeps into the normal write path.
> > >
> > > I'm currently leaning toward keeping the wal events special
> > > (lower-level), but doing what we can to make it work with the same
> > > mid- to low-level helper functions (for reading and verifying blobs, etc.).
> 
> It occured to me that this checksum consistency issue only comes up when
> we are updating something that is smaller than the csum block size.  And the
> real source of the problem is that you have a sequence of
> 
>  1- journal intent (kv wal item)
>  2- do read io
>  3- verify csum
>  4- do write io
>  5- cancel intent (remove kv wal item)
> 
> If we have an order like
> 
>  1- do read io
>  2- journal intent for entire csum chunk (kv wal item)
>  3- do write io
>  4- cancel intent
> 
> Then the issue goes away.  And I'm thinking if the csum chunk is big enough
> that the #2 step is too big of a wal item to perform well, then the problem is
> your choice of csum block size, not the approach.  I.e., use a 4kb csum block
> size for rbd images, and use large blocks (128k, 512k,
> whatever) only for things that never see random overwrites (rgw data).
> 
> If that is good enough, then it might also mean that we can make the wal
> operations never do reads--just (over)writes, further simplifying things on
> that end.  In the jewel bluestore the only times we do reads is for partial
> block updates (do we really care about these?  a buffer cache could absorb
> them when it matters) and for copy/cow operations post-clone (which i think
> are simple enough to be deal with separately).
> 
> sage

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: 2 related bluestore questions
  2016-05-11 20:54                   ` Sage Weil
  2016-05-11 21:38                     ` Allen Samuels
@ 2016-05-12 14:27                     ` Igor Fedotov
  2016-05-12 15:06                       ` Sage Weil
  1 sibling, 1 reply; 28+ messages in thread
From: Igor Fedotov @ 2016-05-12 14:27 UTC (permalink / raw)
  To: Sage Weil; +Cc: allen.samuels, ceph-devel



On 11.05.2016 23:54, Sage Weil wrote:
> On Wed, 11 May 2016, Sage Weil wrote:
>> On Wed, 11 May 2016, Igor Fedotov wrote:
>>
>> I like that way better!  We can just add a force_sync argument to
>> _do_write.  That also lets us trivially disable wal (by forcing sync w/
>> a config option or whatever).
>>
>> The downside is that any logically conflicting request (an overlapping
>> write or truncate or zero) needs to drain the wal events, whereas with a
>> lower-level wal description there might be cases where we can ignore the
>> wal operation.  I suspect the trivial solution of o->flush() on
>> write/truncate/zero will be pretty visible in benchmarks.  Tracking
>> in-flight wal ops with an interval_set would probably work well enough.
> Hmm, I'm not sure this will pan out.  The main problem is that if we call
> back into the write code (with a sync flag), we will have to do write
> IO, and this wreaks havoc on our otherwise (mostly) orderly state machine.
> I think it can be done if we build in a similar guard like _txc_finish_io
> so that we wait for the wal events to also complete IO in order before
> committing them.  I think.
>
> But the other problem is the checksum thing that came up in another
> thread, where the read-side of a read/modify/write might fail teh checksum
> because the wal write hit disk but the kv portion didn't commit. I see a
> few options:
>
>   1) If there are checksums and we're doing a sub-block overwrite, we
> have to write/cow it elsewhere.  This probably means min_alloc_size cow
> operations for small writes.  In which case we needn't bother doing a wal
> even in the first place--the whole point is to enable an overwrite.
>
>   2) We do loose checksum validation that will accept either the old
> checksum or the expected new checksum for the read stage.  This handles
> these two crash cases:
Probably I missed something but It seems to me that we don't have any 
'expected new checksum' for the whole new block after the crash.
What we can have are old block checksum in KV and checksum for 
overwritten portion of the block in WAL. To have full new checksum one 
has to do the read and store new checksum to KV afterwards.
Or you mean write+KV update under 'do  wal io'?
>   * kv commit of op + wal event
>     <crash here, or>
>   * do wal io (completely)
>     <crash before cleaning up wal event>
>   * kv cleanup of wal event
>
> but not the case where we only partially complete the wal io.  Which means
> there is a small probability is "corrupt" ourselves on crash (not really
> corrupt, but confuse ourselves such that we refuse to replay the
> wal events on startup).
>
>   3) Same as 2, but simply warn if we fail that read-side checksum on
> replay.  This basically introduces a *very* small window which could allow
> an ondisk corruption to get absorbed into our checksum.  This could just
> be #2 + a config option so we warn instead of erroring out.
>
>   4) Same as 2, but we try every combination of old and new data on
> block/sector boundaries to find a valid checksum on the read-side.
Still unclear for me where can we get old data from when we've just 
overwritten them?
E.g. old block  was
<1,2,3> and the new one <4> with the resulting one = <4,2,3>
We have checksums for <1,2,3> and for <4> in KV. And <4,2,3> block at 
the disk. How one can detect an error in an invalud <4,5,3> block unless 
we store checksum for <4,2,3> before the write?

>
> I think #1 is a non-starter because it turns a 4K write into a 64K read +
> seek + 64K write on an HDD.  Or forces us to run with min_alloc_size=4K on
> HDD, which would risk very bad fragmentation.
>
> Which makes we want #3 (initially) and then #4.  But... if we do the "wal
> is just a logical write", that means this weird replay handling logic
> creeps into the normal write path.
>
> I'm currently leaning toward keeping the wal events special (lower-level),
> but doing what we can to make it work with the same mid- to low-level
> helper functions (for reading and verifying blobs, etc.).
>
> sage


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: 2 related bluestore questions
  2016-05-11 21:38                     ` Allen Samuels
  2016-05-12  2:58                       ` Sage Weil
@ 2016-05-12 14:29                       ` Igor Fedotov
  1 sibling, 0 replies; 28+ messages in thread
From: Igor Fedotov @ 2016-05-12 14:29 UTC (permalink / raw)
  To: Allen Samuels, Sage Weil; +Cc: ceph-devel



On 12.05.2016 0:38, Allen Samuels wrote:
> Sorry, still on vacation and I haven't really wrapped my head around everything that's being discussed. However, w.r.t. wal operations, I would strongly favor an approach that minimizes the amount of "future" operations that are recorded (which I'll call intentions -- i.e., non-binding hints about extra work that needs to get done). Much of the complexity here is because the intentions -- after being recorded -- will need to be altered based on subsequent operations. Hence every write operation will need to digest the historical intentions and potentially update them -- this is VERY complex, potentially much more complex than code that simply examines the current state and re-determines the correct next operation (i.e., de-wal, gc, etc.)
>
> Additional complexity arises because you're recording two sets of state that require consistency checking -- in my experience, this road leads to perdition....
Completely agree
> Maybe I've missed something critical (vacation-headed) if so, please ignore.
>
>
> Allen Samuels
> Software Architect, Fellow, Systems and Software Solutions
>
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416
> allen.samuels@SanDisk.com
>
>
>> -----Original Message-----
>> From: Sage Weil [mailto:sage@newdream.net]
>> Sent: Thursday, May 12, 2016 6:54 AM
>> To: Igor Fedotov <ifedotov@mirantis.com>
>> Cc: Allen Samuels <Allen.Samuels@sandisk.com>; ceph-
>> devel@vger.kernel.org
>> Subject: Re: 2 related bluestore questions
>>
>> On Wed, 11 May 2016, Sage Weil wrote:
>>> On Wed, 11 May 2016, Igor Fedotov wrote:
>>>> On 11.05.2016 16:10, Sage Weil wrote:
>>>>> On Wed, 11 May 2016, Igor Fedotov wrote:
>>>>>>> I took a stab at a revised wal_op_t here:
>>>>>>>
>>>>>>>
>>>>>>> https://github.com/liewegas/ceph/blob/wip-bluestore-write/src/
>>>>>>> os/bluestore/bluestore_types.h#L595-L605
>>>>>>>
>>>>>>> This is enough to implement the basic wal overwrite case here:
>>>>>>>
>>>>>>>
>>>>>>> https://github.com/liewegas/ceph/blob/wip-bluestore-write/src/
>>>>>>> os/bluestore/BlueStore.cc#L5522-L5578
>>>>>>>
>>>>>>> It's overkill for that, but something like this ought to be
>>>>>>> sufficiently general to express the more complicated wal (and
>>>>>>> compaction/gc/cleanup) operations, where we are reading bits
>>>>>>> of data from lots of different previous blobs, verifying
>>>>>>> checksums, and then assembling the results into a new buffer
>>>>>>> that gets written somewhere else.  The read_extent_map and
>>>>>>> write_map offsets are logical offsets in a buffer we assemble
>>>>>>> and then write to b_off~b_len in the specific blob.  I didn't
>>>>>>> get to the _do_wal_op part that actually does it, but it would
>>>>>>> do the final write, csum calculation, and metadata update.
>>>>>>> Probably... the allocation would happen then too, if the
>>>>>>> specified blob didn't already have pextents.  Tha way we can
>>>>>>> do compression at that stage as well?
>>>>>>>
>>>>>>> What do you think?
>>>>>> Not completely sure that it's a good idea to have read stage
>>>>>> description stored in WAL record? Wouldn't that produce any
>>>>>> conflicts/inconsistencies when multiple WAL records deal with
>>>>>> the same or close lextents and previous WAL updates lextents to
>>>>>> read. May be it's better to prepare such a description exactly
>>>>>> when WAL is applied? And WAL record to have just a basic write
>>>>>> info?
>>>>> Yeah, I think this is a problem.  I see two basic paths:
>>>>>
>>>>>    - We do a wal flush before queueing a new wal event to avoid
>>>>> races like this. Or perhaps we only do it when the wal event(s)
>>>>> touch the same blob(s).  That's simple to reason about, but means
>>>>> that a series of small IOs to the same object (or blob) will
>>>>> serialize the kv commit and wal r/m/w operations.  (Note that this
>>>>> is no worse than the naive approach of doing the read part up
>>>>> front, and it only happens when you have successive wal ops on the
>> same object (or blob)).
>>>>>    - We describe the wal read-side in terms of the current onode
>>>>> state.  For example, 'read object offset 0..100, use provided
>>>>> buffer for 100..4096, overwrite block'.  That can be pipelined.
>>>>> But there are other operations that would require we flush the wal
>>>>> events, like a truncate or zero or other write that clobbers that region
>> of the object.
>>>>> Maybe/hopefully in those cases we don't care (it no longer matters
>>>>> that this wal event do the write we originally intended) but we'd
>>>>> need to think pretty carefully about it.  FWIW, truncate already
>>>>> does an
>>>>> o->flush().
>>>> I'd prefer the second approach. Probably with some modification...
>>>> As far as I understand with the approach above you are trying to
>>>> locate all write logic at a single place and have WAL machinery as a
>>>> straightforward executor for already prepared tasks. Not sure this
>>>> is beneficial enough. But definitely it's more complex and
>>>> error-prone. And potentially you will need extend WAL machinery task
>> description from time to time...
>>>> As an alternative one can eliminate that read description in WAL
>>>> record at all. Let's simply record what loffset we are going to
>>>> write to and data itself. Thus we have simple write request description.
>>>> And when WAL is applied corresponding code should determine how to
>>>> do the write properly using the current lextent/blob maps state.
>>>> This way Write Op apply can be just a regular write handling that
>>>> performs sync RMW or any other implementation depending on the
>>>> current state, some policy, or whatever else that fits the best at the
>> specific moment.
>>> I like that way better!  We can just add a force_sync argument to
>>> _do_write.  That also lets us trivially disable wal (by forcing sync
>>> w/ a config option or whatever).
>>>
>>> The downside is that any logically conflicting request (an overlapping
>>> write or truncate or zero) needs to drain the wal events, whereas with
>>> a lower-level wal description there might be cases where we can ignore
>>> the wal operation.  I suspect the trivial solution of o->flush() on
>>> write/truncate/zero will be pretty visible in benchmarks.  Tracking
>>> in-flight wal ops with an interval_set would probably work well enough.
>> Hmm, I'm not sure this will pan out.  The main problem is that if we call back
>> into the write code (with a sync flag), we will have to do write IO, and this
>> wreaks havoc on our otherwise (mostly) orderly state machine.
>> I think it can be done if we build in a similar guard like _txc_finish_io so that
>> we wait for the wal events to also complete IO in order before committing
>> them.  I think.
>>
>> But the other problem is the checksum thing that came up in another thread,
>> where the read-side of a read/modify/write might fail teh checksum because
>> the wal write hit disk but the kv portion didn't commit. I see a few options:
>>
>>   1) If there are checksums and we're doing a sub-block overwrite, we have to
>> write/cow it elsewhere.  This probably means min_alloc_size cow operations
>> for small writes.  In which case we needn't bother doing a wal even in the
>> first place--the whole point is to enable an overwrite.
>>
>>   2) We do loose checksum validation that will accept either the old checksum
>> or the expected new checksum for the read stage.  This handles these two
>> crash cases:
>>
>>   * kv commit of op + wal event
>>     <crash here, or>
>>   * do wal io (completely)
>>     <crash before cleaning up wal event>
>>   * kv cleanup of wal event
>>
>> but not the case where we only partially complete the wal io.  Which means
>> there is a small probability is "corrupt" ourselves on crash (not really corrupt,
>> but confuse ourselves such that we refuse to replay the wal events on
>> startup).
>>
>>   3) Same as 2, but simply warn if we fail that read-side checksum on replay.
>> This basically introduces a *very* small window which could allow an ondisk
>> corruption to get absorbed into our checksum.  This could just be #2 + a
>> config option so we warn instead of erroring out.
>>
>>   4) Same as 2, but we try every combination of old and new data on
>> block/sector boundaries to find a valid checksum on the read-side.
>>
>> I think #1 is a non-starter because it turns a 4K write into a 64K read + seek +
>> 64K write on an HDD.  Or forces us to run with min_alloc_size=4K on HDD,
>> which would risk very bad fragmentation.
>>
>> Which makes we want #3 (initially) and then #4.  But... if we do the "wal is
>> just a logical write", that means this weird replay handling logic creeps into
>> the normal write path.
>>
>> I'm currently leaning toward keeping the wal events special (lower-level), but
>> doing what we can to make it work with the same mid- to low-level helper
>> functions (for reading and verifying blobs, etc.).
>>
>> sage


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: 2 related bluestore questions
  2016-05-12  2:58                       ` Sage Weil
  2016-05-12 11:54                         ` Allen Samuels
@ 2016-05-12 14:38                         ` Igor Fedotov
  2016-05-12 16:37                         ` Igor Fedotov
  2 siblings, 0 replies; 28+ messages in thread
From: Igor Fedotov @ 2016-05-12 14:38 UTC (permalink / raw)
  To: Sage Weil, Allen Samuels; +Cc: ceph-devel



On 12.05.2016 5:58, Sage Weil wrote:
> On Wed, 11 May 2016, Allen Samuels wrote:
>> Sorry, still on vacation and I haven't really wrapped my head around
>> everything that's being discussed. However, w.r.t. wal operations, I
>> would strongly favor an approach that minimizes the amount of "future"
>> operations that are recorded (which I'll call intentions -- i.e.,
>> non-binding hints about extra work that needs to get done). Much of the
>> complexity here is because the intentions -- after being recorded --
>> will need to be altered based on subsequent operations. Hence every
>> write operation will need to digest the historical intentions and
>> potentially update them -- this is VERY complex, potentially much more
>> complex than code that simply examines the current state and
>> re-determines the correct next operation (i.e., de-wal, gc, etc.)
>>
>> Additional complexity arises because you're recording two sets of state
>> that require consistency checking -- in my experience, this road leads
>> to perdition....
> I agree is has to be something manageable that we can reason about.  I
> think the question for me is mostly about which path minimizes the
> complexity while still getting us a reasonable level of performance.
>
> I had one new thought, see below...
>
>>>> The downside is that any logically conflicting request (an overlapping
>>>> write or truncate or zero) needs to drain the wal events, whereas with
>>>> a lower-level wal description there might be cases where we can ignore
>>>> the wal operation.  I suspect the trivial solution of o->flush() on
>>>> write/truncate/zero will be pretty visible in benchmarks.  Tracking
>>>> in-flight wal ops with an interval_set would probably work well enough.
>>> Hmm, I'm not sure this will pan out.  The main problem is that if we call back
>>> into the write code (with a sync flag), we will have to do write IO, and this
>>> wreaks havoc on our otherwise (mostly) orderly state machine.
>>> I think it can be done if we build in a similar guard like _txc_finish_io so that
>>> we wait for the wal events to also complete IO in order before committing
>>> them.  I think.
>>>
>>> But the other problem is the checksum thing that came up in another thread,
>>> where the read-side of a read/modify/write might fail teh checksum because
>>> the wal write hit disk but the kv portion didn't commit. I see a few options:
>>>
>>>   1) If there are checksums and we're doing a sub-block overwrite, we have to
>>> write/cow it elsewhere.  This probably means min_alloc_size cow operations
>>> for small writes.  In which case we needn't bother doing a wal even in the
>>> first place--the whole point is to enable an overwrite.
>>>
>>>   2) We do loose checksum validation that will accept either the old checksum
>>> or the expected new checksum for the read stage.  This handles these two
>>> crash cases:
>>>
>>>   * kv commit of op + wal event
>>>     <crash here, or>
>>>   * do wal io (completely)
>>>     <crash before cleaning up wal event>
>>>   * kv cleanup of wal event
>>>
>>> but not the case where we only partially complete the wal io.  Which means
>>> there is a small probability is "corrupt" ourselves on crash (not really corrupt,
>>> but confuse ourselves such that we refuse to replay the wal events on
>>> startup).
>>>
>>>   3) Same as 2, but simply warn if we fail that read-side checksum on replay.
>>> This basically introduces a *very* small window which could allow an ondisk
>>> corruption to get absorbed into our checksum.  This could just be #2 + a
>>> config option so we warn instead of erroring out.
>>>
>>>   4) Same as 2, but we try every combination of old and new data on
>>> block/sector boundaries to find a valid checksum on the read-side.
>>>
>>> I think #1 is a non-starter because it turns a 4K write into a 64K read + seek +
>>> 64K write on an HDD.  Or forces us to run with min_alloc_size=4K on HDD,
>>> which would risk very bad fragmentation.
>>>
>>> Which makes we want #3 (initially) and then #4.  But... if we do the "wal is
>>> just a logical write", that means this weird replay handling logic creeps into
>>> the normal write path.
>>>
>>> I'm currently leaning toward keeping the wal events special (lower-level), but
>>> doing what we can to make it work with the same mid- to low-level helper
>>> functions (for reading and verifying blobs, etc.).
> It occured to me that this checksum consistency issue only comes up when
> we are updating something that is smaller than the csum block size.  And
> the real source of the problem is that you have a sequence of
>
>   1- journal intent (kv wal item)
>   2- do read io
>   3- verify csum
>   4- do write io
>   5- cancel intent (remove kv wal item)
>
> If we have an order like
>
>   1- do read io
>   2- journal intent for entire csum chunk (kv wal item)
>   3- do write io
>   4- cancel intent

I suspect this will cause consistency issues when handling multiple 
writes for the same extent if subsequent write doesn't wait for WAL 
apply completion.
E.g. we have block <1,2,3> and two writes <4,,> & <,,5>
In your case the second WAL will have <1,2,5> block instead of <4,2,5> one.

And remember you have o->flush for reading and don't have one for 
writing. But for your case you're introducing o->flush for writing as 
well to perform a read...

>
> Then the issue goes away.  And I'm thinking if the csum chunk is big
> enough that the #2 step is too big of a wal item to perform well, then the
> problem is your choice of csum block size, not the approach.  I.e., use a
> 4kb csum block size for rbd images, and use large blocks (128k, 512k,
> whatever) only for things that never see random overwrites (rgw data).
>
> If that is good enough, then it might also mean that we can make the wal
> operations never do reads--just (over)writes, further simplifying things
> on that end.  In the jewel bluestore the only times we do reads is for
> partial block updates (do we really care about these?  a buffer cache
> could absorb them when it matters) and for copy/cow operations post-clone
> (which i think are simple enough to be deal with separately).
>
> sage


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: 2 related bluestore questions
  2016-05-12 11:54                         ` Allen Samuels
@ 2016-05-12 14:47                           ` Igor Fedotov
  0 siblings, 0 replies; 28+ messages in thread
From: Igor Fedotov @ 2016-05-12 14:47 UTC (permalink / raw)
  To: Allen Samuels, Sage Weil; +Cc: ceph-devel


On 12.05.2016 14:54, Allen Samuels wrote:
> Ok, I hope I'm starting to understand.
>
> The real source of the problem is the desire to "write in place". The kv wal item (intention) and the associated complexity are associated with the need to recognize that a chunk may have an incorrect checksum for a transient period of time because of an in-place update operation.
+1
>
> One simple solution to this is to eliminate write-in-place. You're already reading a bunch of data, computing the new checksums and then writing out the correct data. If you cow the write, then all of the torn page stuff goes away and you cut down the number of kv updates significantly (no intention is required, only a single kv update is required for the entire cow process). Then the only thing that's required to make this perform more or less equivalent to the complicated logic that you're dealing with now is to make sure that the COW destination location is "close" to the original source data location. Basically, you'll want to modify the allocation algorithm so as to leave some "space" in each cylinder so that the COW destination can always be close to the source (you only need a few
  per cylinder since each COW frees up another one). This is a tiny bit ugly, but much much simpler than all of the machinations that you're going through now.
Why do we need close COW destination? To reduce drive head re-position 
or what?
>
> I would point out that one important optimization is to do a good job of handling append log files. These will look like multiple-write-in-places that will have to get merged. I believe that this optimization is particularly easy to handle with the COW approach that I described.
Well, this starts to remind me original proposal I did for WAL/GC 
implementation for ExtentManager ;)
- just do every write to a new location and update lextent map then 
perform lazy/deferred garbage collection/defragmentation.


> I would point out that nothing prevents an addition of a non-COW update-in-place at a future time.
>
> Allen Samuels
> Software Architect, Fellow, Systems and Software Solutions
>
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416
> allen.samuels@SanDisk.com
>
>> -----Original Message-----
>> From: Sage Weil [mailto:sage@newdream.net]
>> Sent: Thursday, May 12, 2016 12:58 PM
>> To: Allen Samuels <Allen.Samuels@sandisk.com>
>> Cc: Igor Fedotov <ifedotov@mirantis.com>; ceph-devel@vger.kernel.org
>> Subject: RE: 2 related bluestore questions
>>
>> On Wed, 11 May 2016, Allen Samuels wrote:
>>> Sorry, still on vacation and I haven't really wrapped my head around
>>> everything that's being discussed. However, w.r.t. wal operations, I
>>> would strongly favor an approach that minimizes the amount of "future"
>>> operations that are recorded (which I'll call intentions -- i.e.,
>>> non-binding hints about extra work that needs to get done). Much of
>>> the complexity here is because the intentions -- after being recorded
>>> -- will need to be altered based on subsequent operations. Hence every
>>> write operation will need to digest the historical intentions and
>>> potentially update them -- this is VERY complex, potentially much more
>>> complex than code that simply examines the current state and
>>> re-determines the correct next operation (i.e., de-wal, gc, etc.)
>>>
>>> Additional complexity arises because you're recording two sets of
>>> state that require consistency checking -- in my experience, this road
>>> leads to perdition....
>> I agree is has to be something manageable that we can reason about.  I think
>> the question for me is mostly about which path minimizes the complexity
>> while still getting us a reasonable level of performance.
>>
>> I had one new thought, see below...
>>
>>>>> The downside is that any logically conflicting request (an
>>>>> overlapping write or truncate or zero) needs to drain the wal
>>>>> events, whereas with a lower-level wal description there might be
>>>>> cases where we can ignore the wal operation.  I suspect the
>>>>> trivial solution of o->flush() on write/truncate/zero will be
>>>>> pretty visible in benchmarks.  Tracking in-flight wal ops with an
>> interval_set would probably work well enough.
>>>> Hmm, I'm not sure this will pan out.  The main problem is that if we
>>>> call back into the write code (with a sync flag), we will have to do
>>>> write IO, and this wreaks havoc on our otherwise (mostly) orderly state
>> machine.
>>>> I think it can be done if we build in a similar guard like
>>>> _txc_finish_io so that we wait for the wal events to also complete
>>>> IO in order before committing them.  I think.
>>>>
>>>> But the other problem is the checksum thing that came up in another
>>>> thread, where the read-side of a read/modify/write might fail teh
>>>> checksum because the wal write hit disk but the kv portion didn't commit.
>> I see a few options:
>>>>   1) If there are checksums and we're doing a sub-block overwrite, we
>>>> have to write/cow it elsewhere.  This probably means min_alloc_size
>>>> cow operations for small writes.  In which case we needn't bother
>>>> doing a wal even in the first place--the whole point is to enable an
>> overwrite.
>>>>   2) We do loose checksum validation that will accept either the old
>>>> checksum or the expected new checksum for the read stage.  This
>>>> handles these two crash cases:
>>>>
>>>>   * kv commit of op + wal event
>>>>     <crash here, or>
>>>>   * do wal io (completely)
>>>>     <crash before cleaning up wal event>
>>>>   * kv cleanup of wal event
>>>>
>>>> but not the case where we only partially complete the wal io.  Which
>>>> means there is a small probability is "corrupt" ourselves on crash
>>>> (not really corrupt, but confuse ourselves such that we refuse to
>>>> replay the wal events on startup).
>>>>
>>>>   3) Same as 2, but simply warn if we fail that read-side checksum on
>> replay.
>>>> This basically introduces a *very* small window which could allow an
>>>> ondisk corruption to get absorbed into our checksum.  This could
>>>> just be #2 + a config option so we warn instead of erroring out.
>>>>
>>>>   4) Same as 2, but we try every combination of old and new data on
>>>> block/sector boundaries to find a valid checksum on the read-side.
>>>>
>>>> I think #1 is a non-starter because it turns a 4K write into a 64K
>>>> read + seek + 64K write on an HDD.  Or forces us to run with
>>>> min_alloc_size=4K on HDD, which would risk very bad fragmentation.
>>>>
>>>> Which makes we want #3 (initially) and then #4.  But... if we do the
>>>> "wal is just a logical write", that means this weird replay handling
>>>> logic creeps into the normal write path.
>>>>
>>>> I'm currently leaning toward keeping the wal events special
>>>> (lower-level), but doing what we can to make it work with the same
>>>> mid- to low-level helper functions (for reading and verifying blobs, etc.).
>> It occured to me that this checksum consistency issue only comes up when
>> we are updating something that is smaller than the csum block size.  And the
>> real source of the problem is that you have a sequence of
>>
>>   1- journal intent (kv wal item)
>>   2- do read io
>>   3- verify csum
>>   4- do write io
>>   5- cancel intent (remove kv wal item)
>>
>> If we have an order like
>>
>>   1- do read io
>>   2- journal intent for entire csum chunk (kv wal item)
>>   3- do write io
>>   4- cancel intent
>>
>> Then the issue goes away.  And I'm thinking if the csum chunk is big enough
>> that the #2 step is too big of a wal item to perform well, then the problem is
>> your choice of csum block size, not the approach.  I.e., use a 4kb csum block
>> size for rbd images, and use large blocks (128k, 512k,
>> whatever) only for things that never see random overwrites (rgw data).
>>
>> If that is good enough, then it might also mean that we can make the wal
>> operations never do reads--just (over)writes, further simplifying things on
>> that end.  In the jewel bluestore the only times we do reads is for partial
>> block updates (do we really care about these?  a buffer cache could absorb
>> them when it matters) and for copy/cow operations post-clone (which i think
>> are simple enough to be deal with separately).
>>
>> sage


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: 2 related bluestore questions
  2016-05-12 14:27                     ` Igor Fedotov
@ 2016-05-12 15:06                       ` Sage Weil
  0 siblings, 0 replies; 28+ messages in thread
From: Sage Weil @ 2016-05-12 15:06 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: allen.samuels, ceph-devel

On Thu, 12 May 2016, Igor Fedotov wrote:
> On 11.05.2016 23:54, Sage Weil wrote:
> > On Wed, 11 May 2016, Sage Weil wrote:
> > > On Wed, 11 May 2016, Igor Fedotov wrote:
> > > 
> > > I like that way better!  We can just add a force_sync argument to
> > > _do_write.  That also lets us trivially disable wal (by forcing sync w/
> > > a config option or whatever).
> > > 
> > > The downside is that any logically conflicting request (an overlapping
> > > write or truncate or zero) needs to drain the wal events, whereas with a
> > > lower-level wal description there might be cases where we can ignore the
> > > wal operation.  I suspect the trivial solution of o->flush() on
> > > write/truncate/zero will be pretty visible in benchmarks.  Tracking
> > > in-flight wal ops with an interval_set would probably work well enough.
> > Hmm, I'm not sure this will pan out.  The main problem is that if we call
> > back into the write code (with a sync flag), we will have to do write
> > IO, and this wreaks havoc on our otherwise (mostly) orderly state machine.
> > I think it can be done if we build in a similar guard like _txc_finish_io
> > so that we wait for the wal events to also complete IO in order before
> > committing them.  I think.
> > 
> > But the other problem is the checksum thing that came up in another
> > thread, where the read-side of a read/modify/write might fail teh checksum
> > because the wal write hit disk but the kv portion didn't commit. I see a
> > few options:
> > 
> >   1) If there are checksums and we're doing a sub-block overwrite, we
> > have to write/cow it elsewhere.  This probably means min_alloc_size cow
> > operations for small writes.  In which case we needn't bother doing a wal
> > even in the first place--the whole point is to enable an overwrite.
> > 
> >   2) We do loose checksum validation that will accept either the old
> > checksum or the expected new checksum for the read stage.  This handles
> > these two crash cases:
> Probably I missed something but It seems to me that we don't have any
> 'expected new checksum' for the whole new block after the crash.
> What we can have are old block checksum in KV and checksum for overwritten
> portion of the block in WAL. To have full new checksum one has to do the read
> and store new checksum to KV afterwards.
> Or you mean write+KV update under 'do  wal io'?

Yeah, you're right, I'm speaking nonsense!

Bottom line, we can't do partial checksum-block r/m/w overwrites (unless 
the read part of the r/m/w happens beforehand, turning it into a full 
block overwrite).

sage

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: 2 related bluestore questions
  2016-05-12  2:58                       ` Sage Weil
  2016-05-12 11:54                         ` Allen Samuels
  2016-05-12 14:38                         ` Igor Fedotov
@ 2016-05-12 16:37                         ` Igor Fedotov
  2016-05-12 16:43                           ` Sage Weil
  2 siblings, 1 reply; 28+ messages in thread
From: Igor Fedotov @ 2016-05-12 16:37 UTC (permalink / raw)
  To: Sage Weil, Allen Samuels; +Cc: ceph-devel

Yet another potential issue with WAL I can imagine:

Let's have some small write going to WAL followed by an larger aligned 
overwrite to the same extent that bypasses WAL. Is it possible if the 
first write is processed later and overwrites the second one? I think so.

This way we can probably come to the conclusion that all requests should 
be processed in-sequence. One should prohibit multiple flows for 
requests processing as this may eliminate their order.

Yeah - I'm attacking WAL concept this way...


Thanks,
Igor

On 12.05.2016 5:58, Sage Weil wrote:
> On Wed, 11 May 2016, Allen Samuels wrote:
>> Sorry, still on vacation and I haven't really wrapped my head around
>> everything that's being discussed. However, w.r.t. wal operations, I
>> would strongly favor an approach that minimizes the amount of "future"
>> operations that are recorded (which I'll call intentions -- i.e.,
>> non-binding hints about extra work that needs to get done). Much of the
>> complexity here is because the intentions -- after being recorded --
>> will need to be altered based on subsequent operations. Hence every
>> write operation will need to digest the historical intentions and
>> potentially update them -- this is VERY complex, potentially much more
>> complex than code that simply examines the current state and
>> re-determines the correct next operation (i.e., de-wal, gc, etc.)
>>
>> Additional complexity arises because you're recording two sets of state
>> that require consistency checking -- in my experience, this road leads
>> to perdition....
> I agree is has to be something manageable that we can reason about.  I
> think the question for me is mostly about which path minimizes the
> complexity while still getting us a reasonable level of performance.
>
> I had one new thought, see below...
>
>>>> The downside is that any logically conflicting request (an overlapping
>>>> write or truncate or zero) needs to drain the wal events, whereas with
>>>> a lower-level wal description there might be cases where we can ignore
>>>> the wal operation.  I suspect the trivial solution of o->flush() on
>>>> write/truncate/zero will be pretty visible in benchmarks.  Tracking
>>>> in-flight wal ops with an interval_set would probably work well enough.
>>> Hmm, I'm not sure this will pan out.  The main problem is that if we call back
>>> into the write code (with a sync flag), we will have to do write IO, and this
>>> wreaks havoc on our otherwise (mostly) orderly state machine.
>>> I think it can be done if we build in a similar guard like _txc_finish_io so that
>>> we wait for the wal events to also complete IO in order before committing
>>> them.  I think.
>>>
>>> But the other problem is the checksum thing that came up in another thread,
>>> where the read-side of a read/modify/write might fail teh checksum because
>>> the wal write hit disk but the kv portion didn't commit. I see a few options:
>>>
>>>   1) If there are checksums and we're doing a sub-block overwrite, we have to
>>> write/cow it elsewhere.  This probably means min_alloc_size cow operations
>>> for small writes.  In which case we needn't bother doing a wal even in the
>>> first place--the whole point is to enable an overwrite.
>>>
>>>   2) We do loose checksum validation that will accept either the old checksum
>>> or the expected new checksum for the read stage.  This handles these two
>>> crash cases:
>>>
>>>   * kv commit of op + wal event
>>>     <crash here, or>
>>>   * do wal io (completely)
>>>     <crash before cleaning up wal event>
>>>   * kv cleanup of wal event
>>>
>>> but not the case where we only partially complete the wal io.  Which means
>>> there is a small probability is "corrupt" ourselves on crash (not really corrupt,
>>> but confuse ourselves such that we refuse to replay the wal events on
>>> startup).
>>>
>>>   3) Same as 2, but simply warn if we fail that read-side checksum on replay.
>>> This basically introduces a *very* small window which could allow an ondisk
>>> corruption to get absorbed into our checksum.  This could just be #2 + a
>>> config option so we warn instead of erroring out.
>>>
>>>   4) Same as 2, but we try every combination of old and new data on
>>> block/sector boundaries to find a valid checksum on the read-side.
>>>
>>> I think #1 is a non-starter because it turns a 4K write into a 64K read + seek +
>>> 64K write on an HDD.  Or forces us to run with min_alloc_size=4K on HDD,
>>> which would risk very bad fragmentation.
>>>
>>> Which makes we want #3 (initially) and then #4.  But... if we do the "wal is
>>> just a logical write", that means this weird replay handling logic creeps into
>>> the normal write path.
>>>
>>> I'm currently leaning toward keeping the wal events special (lower-level), but
>>> doing what we can to make it work with the same mid- to low-level helper
>>> functions (for reading and verifying blobs, etc.).
> It occured to me that this checksum consistency issue only comes up when
> we are updating something that is smaller than the csum block size.  And
> the real source of the problem is that you have a sequence of
>
>   1- journal intent (kv wal item)
>   2- do read io
>   3- verify csum
>   4- do write io
>   5- cancel intent (remove kv wal item)
>
> If we have an order like
>
>   1- do read io
>   2- journal intent for entire csum chunk (kv wal item)
>   3- do write io
>   4- cancel intent
>
> Then the issue goes away.  And I'm thinking if the csum chunk is big
> enough that the #2 step is too big of a wal item to perform well, then the
> problem is your choice of csum block size, not the approach.  I.e., use a
> 4kb csum block size for rbd images, and use large blocks (128k, 512k,
> whatever) only for things that never see random overwrites (rgw data).
>
> If that is good enough, then it might also mean that we can make the wal
> operations never do reads--just (over)writes, further simplifying things
> on that end.  In the jewel bluestore the only times we do reads is for
> partial block updates (do we really care about these?  a buffer cache
> could absorb them when it matters) and for copy/cow operations post-clone
> (which i think are simple enough to be deal with separately).
>
> sage


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: 2 related bluestore questions
  2016-05-12 16:37                         ` Igor Fedotov
@ 2016-05-12 16:43                           ` Sage Weil
  2016-05-12 16:45                             ` Igor Fedotov
  0 siblings, 1 reply; 28+ messages in thread
From: Sage Weil @ 2016-05-12 16:43 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: Allen Samuels, ceph-devel

On Thu, 12 May 2016, Igor Fedotov wrote:
> Yet another potential issue with WAL I can imagine:
> 
> Let's have some small write going to WAL followed by an larger aligned
> overwrite to the same extent that bypasses WAL. Is it possible if the first
> write is processed later and overwrites the second one? I think so.

Yeah, that would be chaos.  The wal ops are already ordered by the 
sequencer (or ordered globally, if bluestore_sync_wal_apply=true), so this 
can't happen.

sage


> This way we can probably come to the conclusion that all requests should be
> processed in-sequence. One should prohibit multiple flows for requests
> processing as this may eliminate their order.
> 
> Yeah - I'm attacking WAL concept this way...
> 
> 
> Thanks,
> Igor
> 
> On 12.05.2016 5:58, Sage Weil wrote:
> > On Wed, 11 May 2016, Allen Samuels wrote:
> > > Sorry, still on vacation and I haven't really wrapped my head around
> > > everything that's being discussed. However, w.r.t. wal operations, I
> > > would strongly favor an approach that minimizes the amount of "future"
> > > operations that are recorded (which I'll call intentions -- i.e.,
> > > non-binding hints about extra work that needs to get done). Much of the
> > > complexity here is because the intentions -- after being recorded --
> > > will need to be altered based on subsequent operations. Hence every
> > > write operation will need to digest the historical intentions and
> > > potentially update them -- this is VERY complex, potentially much more
> > > complex than code that simply examines the current state and
> > > re-determines the correct next operation (i.e., de-wal, gc, etc.)
> > > 
> > > Additional complexity arises because you're recording two sets of state
> > > that require consistency checking -- in my experience, this road leads
> > > to perdition....
> > I agree is has to be something manageable that we can reason about.  I
> > think the question for me is mostly about which path minimizes the
> > complexity while still getting us a reasonable level of performance.
> > 
> > I had one new thought, see below...
> > 
> > > > > The downside is that any logically conflicting request (an overlapping
> > > > > write or truncate or zero) needs to drain the wal events, whereas with
> > > > > a lower-level wal description there might be cases where we can ignore
> > > > > the wal operation.  I suspect the trivial solution of o->flush() on
> > > > > write/truncate/zero will be pretty visible in benchmarks.  Tracking
> > > > > in-flight wal ops with an interval_set would probably work well
> > > > > enough.
> > > > Hmm, I'm not sure this will pan out.  The main problem is that if we
> > > > call back
> > > > into the write code (with a sync flag), we will have to do write IO, and
> > > > this
> > > > wreaks havoc on our otherwise (mostly) orderly state machine.
> > > > I think it can be done if we build in a similar guard like
> > > > _txc_finish_io so that
> > > > we wait for the wal events to also complete IO in order before
> > > > committing
> > > > them.  I think.
> > > > 
> > > > But the other problem is the checksum thing that came up in another
> > > > thread,
> > > > where the read-side of a read/modify/write might fail teh checksum
> > > > because
> > > > the wal write hit disk but the kv portion didn't commit. I see a few
> > > > options:
> > > > 
> > > >   1) If there are checksums and we're doing a sub-block overwrite, we
> > > > have to
> > > > write/cow it elsewhere.  This probably means min_alloc_size cow
> > > > operations
> > > > for small writes.  In which case we needn't bother doing a wal even in
> > > > the
> > > > first place--the whole point is to enable an overwrite.
> > > > 
> > > >   2) We do loose checksum validation that will accept either the old
> > > > checksum
> > > > or the expected new checksum for the read stage.  This handles these two
> > > > crash cases:
> > > > 
> > > >   * kv commit of op + wal event
> > > >     <crash here, or>
> > > >   * do wal io (completely)
> > > >     <crash before cleaning up wal event>
> > > >   * kv cleanup of wal event
> > > > 
> > > > but not the case where we only partially complete the wal io.  Which
> > > > means
> > > > there is a small probability is "corrupt" ourselves on crash (not really
> > > > corrupt,
> > > > but confuse ourselves such that we refuse to replay the wal events on
> > > > startup).
> > > > 
> > > >   3) Same as 2, but simply warn if we fail that read-side checksum on
> > > > replay.
> > > > This basically introduces a *very* small window which could allow an
> > > > ondisk
> > > > corruption to get absorbed into our checksum.  This could just be #2 + a
> > > > config option so we warn instead of erroring out.
> > > > 
> > > >   4) Same as 2, but we try every combination of old and new data on
> > > > block/sector boundaries to find a valid checksum on the read-side.
> > > > 
> > > > I think #1 is a non-starter because it turns a 4K write into a 64K read
> > > > + seek +
> > > > 64K write on an HDD.  Or forces us to run with min_alloc_size=4K on HDD,
> > > > which would risk very bad fragmentation.
> > > > 
> > > > Which makes we want #3 (initially) and then #4.  But... if we do the
> > > > "wal is
> > > > just a logical write", that means this weird replay handling logic
> > > > creeps into
> > > > the normal write path.
> > > > 
> > > > I'm currently leaning toward keeping the wal events special
> > > > (lower-level), but
> > > > doing what we can to make it work with the same mid- to low-level helper
> > > > functions (for reading and verifying blobs, etc.).
> > It occured to me that this checksum consistency issue only comes up when
> > we are updating something that is smaller than the csum block size.  And
> > the real source of the problem is that you have a sequence of
> > 
> >   1- journal intent (kv wal item)
> >   2- do read io
> >   3- verify csum
> >   4- do write io
> >   5- cancel intent (remove kv wal item)
> > 
> > If we have an order like
> > 
> >   1- do read io
> >   2- journal intent for entire csum chunk (kv wal item)
> >   3- do write io
> >   4- cancel intent
> > 
> > Then the issue goes away.  And I'm thinking if the csum chunk is big
> > enough that the #2 step is too big of a wal item to perform well, then the
> > problem is your choice of csum block size, not the approach.  I.e., use a
> > 4kb csum block size for rbd images, and use large blocks (128k, 512k,
> > whatever) only for things that never see random overwrites (rgw data).
> > 
> > If that is good enough, then it might also mean that we can make the wal
> > operations never do reads--just (over)writes, further simplifying things
> > on that end.  In the jewel bluestore the only times we do reads is for
> > partial block updates (do we really care about these?  a buffer cache
> > could absorb them when it matters) and for copy/cow operations post-clone
> > (which i think are simple enough to be deal with separately).
> > 
> > sage
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: 2 related bluestore questions
  2016-05-12 16:43                           ` Sage Weil
@ 2016-05-12 16:45                             ` Igor Fedotov
  2016-05-12 16:48                               ` Sage Weil
  0 siblings, 1 reply; 28+ messages in thread
From: Igor Fedotov @ 2016-05-12 16:45 UTC (permalink / raw)
  To: Sage Weil; +Cc: Allen Samuels, ceph-devel

The second write in my example isn't processed through WAL - it's large 
and overwrites the whole blob...


On 12.05.2016 19:43, Sage Weil wrote:
> On Thu, 12 May 2016, Igor Fedotov wrote:
>> Yet another potential issue with WAL I can imagine:
>>
>> Let's have some small write going to WAL followed by an larger aligned
>> overwrite to the same extent that bypasses WAL. Is it possible if the first
>> write is processed later and overwrites the second one? I think so.
> Yeah, that would be chaos.  The wal ops are already ordered by the
> sequencer (or ordered globally, if bluestore_sync_wal_apply=true), so this
> can't happen.
>
> sage
>
>
>> This way we can probably come to the conclusion that all requests should be
>> processed in-sequence. One should prohibit multiple flows for requests
>> processing as this may eliminate their order.
>>
>> Yeah - I'm attacking WAL concept this way...
>>
>>
>> Thanks,
>> Igor
>>
>> On 12.05.2016 5:58, Sage Weil wrote:
>>> On Wed, 11 May 2016, Allen Samuels wrote:
>>>> Sorry, still on vacation and I haven't really wrapped my head around
>>>> everything that's being discussed. However, w.r.t. wal operations, I
>>>> would strongly favor an approach that minimizes the amount of "future"
>>>> operations that are recorded (which I'll call intentions -- i.e.,
>>>> non-binding hints about extra work that needs to get done). Much of the
>>>> complexity here is because the intentions -- after being recorded --
>>>> will need to be altered based on subsequent operations. Hence every
>>>> write operation will need to digest the historical intentions and
>>>> potentially update them -- this is VERY complex, potentially much more
>>>> complex than code that simply examines the current state and
>>>> re-determines the correct next operation (i.e., de-wal, gc, etc.)
>>>>
>>>> Additional complexity arises because you're recording two sets of state
>>>> that require consistency checking -- in my experience, this road leads
>>>> to perdition....
>>> I agree is has to be something manageable that we can reason about.  I
>>> think the question for me is mostly about which path minimizes the
>>> complexity while still getting us a reasonable level of performance.
>>>
>>> I had one new thought, see below...
>>>
>>>>>> The downside is that any logically conflicting request (an overlapping
>>>>>> write or truncate or zero) needs to drain the wal events, whereas with
>>>>>> a lower-level wal description there might be cases where we can ignore
>>>>>> the wal operation.  I suspect the trivial solution of o->flush() on
>>>>>> write/truncate/zero will be pretty visible in benchmarks.  Tracking
>>>>>> in-flight wal ops with an interval_set would probably work well
>>>>>> enough.
>>>>> Hmm, I'm not sure this will pan out.  The main problem is that if we
>>>>> call back
>>>>> into the write code (with a sync flag), we will have to do write IO, and
>>>>> this
>>>>> wreaks havoc on our otherwise (mostly) orderly state machine.
>>>>> I think it can be done if we build in a similar guard like
>>>>> _txc_finish_io so that
>>>>> we wait for the wal events to also complete IO in order before
>>>>> committing
>>>>> them.  I think.
>>>>>
>>>>> But the other problem is the checksum thing that came up in another
>>>>> thread,
>>>>> where the read-side of a read/modify/write might fail teh checksum
>>>>> because
>>>>> the wal write hit disk but the kv portion didn't commit. I see a few
>>>>> options:
>>>>>
>>>>>    1) If there are checksums and we're doing a sub-block overwrite, we
>>>>> have to
>>>>> write/cow it elsewhere.  This probably means min_alloc_size cow
>>>>> operations
>>>>> for small writes.  In which case we needn't bother doing a wal even in
>>>>> the
>>>>> first place--the whole point is to enable an overwrite.
>>>>>
>>>>>    2) We do loose checksum validation that will accept either the old
>>>>> checksum
>>>>> or the expected new checksum for the read stage.  This handles these two
>>>>> crash cases:
>>>>>
>>>>>    * kv commit of op + wal event
>>>>>      <crash here, or>
>>>>>    * do wal io (completely)
>>>>>      <crash before cleaning up wal event>
>>>>>    * kv cleanup of wal event
>>>>>
>>>>> but not the case where we only partially complete the wal io.  Which
>>>>> means
>>>>> there is a small probability is "corrupt" ourselves on crash (not really
>>>>> corrupt,
>>>>> but confuse ourselves such that we refuse to replay the wal events on
>>>>> startup).
>>>>>
>>>>>    3) Same as 2, but simply warn if we fail that read-side checksum on
>>>>> replay.
>>>>> This basically introduces a *very* small window which could allow an
>>>>> ondisk
>>>>> corruption to get absorbed into our checksum.  This could just be #2 + a
>>>>> config option so we warn instead of erroring out.
>>>>>
>>>>>    4) Same as 2, but we try every combination of old and new data on
>>>>> block/sector boundaries to find a valid checksum on the read-side.
>>>>>
>>>>> I think #1 is a non-starter because it turns a 4K write into a 64K read
>>>>> + seek +
>>>>> 64K write on an HDD.  Or forces us to run with min_alloc_size=4K on HDD,
>>>>> which would risk very bad fragmentation.
>>>>>
>>>>> Which makes we want #3 (initially) and then #4.  But... if we do the
>>>>> "wal is
>>>>> just a logical write", that means this weird replay handling logic
>>>>> creeps into
>>>>> the normal write path.
>>>>>
>>>>> I'm currently leaning toward keeping the wal events special
>>>>> (lower-level), but
>>>>> doing what we can to make it work with the same mid- to low-level helper
>>>>> functions (for reading and verifying blobs, etc.).
>>> It occured to me that this checksum consistency issue only comes up when
>>> we are updating something that is smaller than the csum block size.  And
>>> the real source of the problem is that you have a sequence of
>>>
>>>    1- journal intent (kv wal item)
>>>    2- do read io
>>>    3- verify csum
>>>    4- do write io
>>>    5- cancel intent (remove kv wal item)
>>>
>>> If we have an order like
>>>
>>>    1- do read io
>>>    2- journal intent for entire csum chunk (kv wal item)
>>>    3- do write io
>>>    4- cancel intent
>>>
>>> Then the issue goes away.  And I'm thinking if the csum chunk is big
>>> enough that the #2 step is too big of a wal item to perform well, then the
>>> problem is your choice of csum block size, not the approach.  I.e., use a
>>> 4kb csum block size for rbd images, and use large blocks (128k, 512k,
>>> whatever) only for things that never see random overwrites (rgw data).
>>>
>>> If that is good enough, then it might also mean that we can make the wal
>>> operations never do reads--just (over)writes, further simplifying things
>>> on that end.  In the jewel bluestore the only times we do reads is for
>>> partial block updates (do we really care about these?  a buffer cache
>>> could absorb them when it matters) and for copy/cow operations post-clone
>>> (which i think are simple enough to be deal with separately).
>>>
>>> sage
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: 2 related bluestore questions
  2016-05-12 16:45                             ` Igor Fedotov
@ 2016-05-12 16:48                               ` Sage Weil
  2016-05-12 16:52                                 ` Igor Fedotov
  0 siblings, 1 reply; 28+ messages in thread
From: Sage Weil @ 2016-05-12 16:48 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: Allen Samuels, ceph-devel

On Thu, 12 May 2016, Igor Fedotov wrote:
> The second write in my example isn't processed through WAL - it's large and
> overwrites the whole blob...

If it's large, it wouldn't overwrite--it would go to newly allocated 
space.  We can *never* overwrite without wal or else we corrupt previous 
data...

sage


> 
> 
> On 12.05.2016 19:43, Sage Weil wrote:
> > On Thu, 12 May 2016, Igor Fedotov wrote:
> > > Yet another potential issue with WAL I can imagine:
> > > 
> > > Let's have some small write going to WAL followed by an larger aligned
> > > overwrite to the same extent that bypasses WAL. Is it possible if the
> > > first
> > > write is processed later and overwrites the second one? I think so.
> > Yeah, that would be chaos.  The wal ops are already ordered by the
> > sequencer (or ordered globally, if bluestore_sync_wal_apply=true), so this
> > can't happen.
> > 
> > sage
> > 
> > 
> > > This way we can probably come to the conclusion that all requests should
> > > be
> > > processed in-sequence. One should prohibit multiple flows for requests
> > > processing as this may eliminate their order.
> > > 
> > > Yeah - I'm attacking WAL concept this way...
> > > 
> > > 
> > > Thanks,
> > > Igor
> > > 
> > > On 12.05.2016 5:58, Sage Weil wrote:
> > > > On Wed, 11 May 2016, Allen Samuels wrote:
> > > > > Sorry, still on vacation and I haven't really wrapped my head around
> > > > > everything that's being discussed. However, w.r.t. wal operations, I
> > > > > would strongly favor an approach that minimizes the amount of "future"
> > > > > operations that are recorded (which I'll call intentions -- i.e.,
> > > > > non-binding hints about extra work that needs to get done). Much of
> > > > > the
> > > > > complexity here is because the intentions -- after being recorded --
> > > > > will need to be altered based on subsequent operations. Hence every
> > > > > write operation will need to digest the historical intentions and
> > > > > potentially update them -- this is VERY complex, potentially much more
> > > > > complex than code that simply examines the current state and
> > > > > re-determines the correct next operation (i.e., de-wal, gc, etc.)
> > > > > 
> > > > > Additional complexity arises because you're recording two sets of
> > > > > state
> > > > > that require consistency checking -- in my experience, this road leads
> > > > > to perdition....
> > > > I agree is has to be something manageable that we can reason about.  I
> > > > think the question for me is mostly about which path minimizes the
> > > > complexity while still getting us a reasonable level of performance.
> > > > 
> > > > I had one new thought, see below...
> > > > 
> > > > > > > The downside is that any logically conflicting request (an
> > > > > > > overlapping
> > > > > > > write or truncate or zero) needs to drain the wal events, whereas
> > > > > > > with
> > > > > > > a lower-level wal description there might be cases where we can
> > > > > > > ignore
> > > > > > > the wal operation.  I suspect the trivial solution of o->flush()
> > > > > > > on
> > > > > > > write/truncate/zero will be pretty visible in benchmarks.
> > > > > > > Tracking
> > > > > > > in-flight wal ops with an interval_set would probably work well
> > > > > > > enough.
> > > > > > Hmm, I'm not sure this will pan out.  The main problem is that if we
> > > > > > call back
> > > > > > into the write code (with a sync flag), we will have to do write IO,
> > > > > > and
> > > > > > this
> > > > > > wreaks havoc on our otherwise (mostly) orderly state machine.
> > > > > > I think it can be done if we build in a similar guard like
> > > > > > _txc_finish_io so that
> > > > > > we wait for the wal events to also complete IO in order before
> > > > > > committing
> > > > > > them.  I think.
> > > > > > 
> > > > > > But the other problem is the checksum thing that came up in another
> > > > > > thread,
> > > > > > where the read-side of a read/modify/write might fail teh checksum
> > > > > > because
> > > > > > the wal write hit disk but the kv portion didn't commit. I see a few
> > > > > > options:
> > > > > > 
> > > > > >    1) If there are checksums and we're doing a sub-block overwrite,
> > > > > > we
> > > > > > have to
> > > > > > write/cow it elsewhere.  This probably means min_alloc_size cow
> > > > > > operations
> > > > > > for small writes.  In which case we needn't bother doing a wal even
> > > > > > in
> > > > > > the
> > > > > > first place--the whole point is to enable an overwrite.
> > > > > > 
> > > > > >    2) We do loose checksum validation that will accept either the
> > > > > > old
> > > > > > checksum
> > > > > > or the expected new checksum for the read stage.  This handles these
> > > > > > two
> > > > > > crash cases:
> > > > > > 
> > > > > >    * kv commit of op + wal event
> > > > > >      <crash here, or>
> > > > > >    * do wal io (completely)
> > > > > >      <crash before cleaning up wal event>
> > > > > >    * kv cleanup of wal event
> > > > > > 
> > > > > > but not the case where we only partially complete the wal io.  Which
> > > > > > means
> > > > > > there is a small probability is "corrupt" ourselves on crash (not
> > > > > > really
> > > > > > corrupt,
> > > > > > but confuse ourselves such that we refuse to replay the wal events
> > > > > > on
> > > > > > startup).
> > > > > > 
> > > > > >    3) Same as 2, but simply warn if we fail that read-side checksum
> > > > > > on
> > > > > > replay.
> > > > > > This basically introduces a *very* small window which could allow an
> > > > > > ondisk
> > > > > > corruption to get absorbed into our checksum.  This could just be #2
> > > > > > + a
> > > > > > config option so we warn instead of erroring out.
> > > > > > 
> > > > > >    4) Same as 2, but we try every combination of old and new data on
> > > > > > block/sector boundaries to find a valid checksum on the read-side.
> > > > > > 
> > > > > > I think #1 is a non-starter because it turns a 4K write into a 64K
> > > > > > read
> > > > > > + seek +
> > > > > > 64K write on an HDD.  Or forces us to run with min_alloc_size=4K on
> > > > > > HDD,
> > > > > > which would risk very bad fragmentation.
> > > > > > 
> > > > > > Which makes we want #3 (initially) and then #4.  But... if we do the
> > > > > > "wal is
> > > > > > just a logical write", that means this weird replay handling logic
> > > > > > creeps into
> > > > > > the normal write path.
> > > > > > 
> > > > > > I'm currently leaning toward keeping the wal events special
> > > > > > (lower-level), but
> > > > > > doing what we can to make it work with the same mid- to low-level
> > > > > > helper
> > > > > > functions (for reading and verifying blobs, etc.).
> > > > It occured to me that this checksum consistency issue only comes up when
> > > > we are updating something that is smaller than the csum block size.  And
> > > > the real source of the problem is that you have a sequence of
> > > > 
> > > >    1- journal intent (kv wal item)
> > > >    2- do read io
> > > >    3- verify csum
> > > >    4- do write io
> > > >    5- cancel intent (remove kv wal item)
> > > > 
> > > > If we have an order like
> > > > 
> > > >    1- do read io
> > > >    2- journal intent for entire csum chunk (kv wal item)
> > > >    3- do write io
> > > >    4- cancel intent
> > > > 
> > > > Then the issue goes away.  And I'm thinking if the csum chunk is big
> > > > enough that the #2 step is too big of a wal item to perform well, then
> > > > the
> > > > problem is your choice of csum block size, not the approach.  I.e., use
> > > > a
> > > > 4kb csum block size for rbd images, and use large blocks (128k, 512k,
> > > > whatever) only for things that never see random overwrites (rgw data).
> > > > 
> > > > If that is good enough, then it might also mean that we can make the wal
> > > > operations never do reads--just (over)writes, further simplifying things
> > > > on that end.  In the jewel bluestore the only times we do reads is for
> > > > partial block updates (do we really care about these?  a buffer cache
> > > > could absorb them when it matters) and for copy/cow operations
> > > > post-clone
> > > > (which i think are simple enough to be deal with separately).
> > > > 
> > > > sage
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > 
> > > 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: 2 related bluestore questions
  2016-05-12 16:48                               ` Sage Weil
@ 2016-05-12 16:52                                 ` Igor Fedotov
  2016-05-12 17:09                                   ` Sage Weil
  0 siblings, 1 reply; 28+ messages in thread
From: Igor Fedotov @ 2016-05-12 16:52 UTC (permalink / raw)
  To: Sage Weil; +Cc: Allen Samuels, ceph-devel

Well, it goes to the new space and updates all the maps.

Then WAL comes to action - where will it write? To the new location? And 
overwrite new data?


On 12.05.2016 19:48, Sage Weil wrote:
> On Thu, 12 May 2016, Igor Fedotov wrote:
>> The second write in my example isn't processed through WAL - it's large and
>> overwrites the whole blob...
> If it's large, it wouldn't overwrite--it would go to newly allocated
> space.  We can *never* overwrite without wal or else we corrupt previous
> data...
>
> sage
>
>
>>
>> On 12.05.2016 19:43, Sage Weil wrote:
>>> On Thu, 12 May 2016, Igor Fedotov wrote:
>>>> Yet another potential issue with WAL I can imagine:
>>>>
>>>> Let's have some small write going to WAL followed by an larger aligned
>>>> overwrite to the same extent that bypasses WAL. Is it possible if the
>>>> first
>>>> write is processed later and overwrites the second one? I think so.
>>> Yeah, that would be chaos.  The wal ops are already ordered by the
>>> sequencer (or ordered globally, if bluestore_sync_wal_apply=true), so this
>>> can't happen.
>>>
>>> sage
>>>
>>>
>>>> This way we can probably come to the conclusion that all requests should
>>>> be
>>>> processed in-sequence. One should prohibit multiple flows for requests
>>>> processing as this may eliminate their order.
>>>>
>>>> Yeah - I'm attacking WAL concept this way...
>>>>
>>>>
>>>> Thanks,
>>>> Igor
>>>>
>>>> On 12.05.2016 5:58, Sage Weil wrote:
>>>>> On Wed, 11 May 2016, Allen Samuels wrote:
>>>>>> Sorry, still on vacation and I haven't really wrapped my head around
>>>>>> everything that's being discussed. However, w.r.t. wal operations, I
>>>>>> would strongly favor an approach that minimizes the amount of "future"
>>>>>> operations that are recorded (which I'll call intentions -- i.e.,
>>>>>> non-binding hints about extra work that needs to get done). Much of
>>>>>> the
>>>>>> complexity here is because the intentions -- after being recorded --
>>>>>> will need to be altered based on subsequent operations. Hence every
>>>>>> write operation will need to digest the historical intentions and
>>>>>> potentially update them -- this is VERY complex, potentially much more
>>>>>> complex than code that simply examines the current state and
>>>>>> re-determines the correct next operation (i.e., de-wal, gc, etc.)
>>>>>>
>>>>>> Additional complexity arises because you're recording two sets of
>>>>>> state
>>>>>> that require consistency checking -- in my experience, this road leads
>>>>>> to perdition....
>>>>> I agree is has to be something manageable that we can reason about.  I
>>>>> think the question for me is mostly about which path minimizes the
>>>>> complexity while still getting us a reasonable level of performance.
>>>>>
>>>>> I had one new thought, see below...
>>>>>
>>>>>>>> The downside is that any logically conflicting request (an
>>>>>>>> overlapping
>>>>>>>> write or truncate or zero) needs to drain the wal events, whereas
>>>>>>>> with
>>>>>>>> a lower-level wal description there might be cases where we can
>>>>>>>> ignore
>>>>>>>> the wal operation.  I suspect the trivial solution of o->flush()
>>>>>>>> on
>>>>>>>> write/truncate/zero will be pretty visible in benchmarks.
>>>>>>>> Tracking
>>>>>>>> in-flight wal ops with an interval_set would probably work well
>>>>>>>> enough.
>>>>>>> Hmm, I'm not sure this will pan out.  The main problem is that if we
>>>>>>> call back
>>>>>>> into the write code (with a sync flag), we will have to do write IO,
>>>>>>> and
>>>>>>> this
>>>>>>> wreaks havoc on our otherwise (mostly) orderly state machine.
>>>>>>> I think it can be done if we build in a similar guard like
>>>>>>> _txc_finish_io so that
>>>>>>> we wait for the wal events to also complete IO in order before
>>>>>>> committing
>>>>>>> them.  I think.
>>>>>>>
>>>>>>> But the other problem is the checksum thing that came up in another
>>>>>>> thread,
>>>>>>> where the read-side of a read/modify/write might fail teh checksum
>>>>>>> because
>>>>>>> the wal write hit disk but the kv portion didn't commit. I see a few
>>>>>>> options:
>>>>>>>
>>>>>>>     1) If there are checksums and we're doing a sub-block overwrite,
>>>>>>> we
>>>>>>> have to
>>>>>>> write/cow it elsewhere.  This probably means min_alloc_size cow
>>>>>>> operations
>>>>>>> for small writes.  In which case we needn't bother doing a wal even
>>>>>>> in
>>>>>>> the
>>>>>>> first place--the whole point is to enable an overwrite.
>>>>>>>
>>>>>>>     2) We do loose checksum validation that will accept either the
>>>>>>> old
>>>>>>> checksum
>>>>>>> or the expected new checksum for the read stage.  This handles these
>>>>>>> two
>>>>>>> crash cases:
>>>>>>>
>>>>>>>     * kv commit of op + wal event
>>>>>>>       <crash here, or>
>>>>>>>     * do wal io (completely)
>>>>>>>       <crash before cleaning up wal event>
>>>>>>>     * kv cleanup of wal event
>>>>>>>
>>>>>>> but not the case where we only partially complete the wal io.  Which
>>>>>>> means
>>>>>>> there is a small probability is "corrupt" ourselves on crash (not
>>>>>>> really
>>>>>>> corrupt,
>>>>>>> but confuse ourselves such that we refuse to replay the wal events
>>>>>>> on
>>>>>>> startup).
>>>>>>>
>>>>>>>     3) Same as 2, but simply warn if we fail that read-side checksum
>>>>>>> on
>>>>>>> replay.
>>>>>>> This basically introduces a *very* small window which could allow an
>>>>>>> ondisk
>>>>>>> corruption to get absorbed into our checksum.  This could just be #2
>>>>>>> + a
>>>>>>> config option so we warn instead of erroring out.
>>>>>>>
>>>>>>>     4) Same as 2, but we try every combination of old and new data on
>>>>>>> block/sector boundaries to find a valid checksum on the read-side.
>>>>>>>
>>>>>>> I think #1 is a non-starter because it turns a 4K write into a 64K
>>>>>>> read
>>>>>>> + seek +
>>>>>>> 64K write on an HDD.  Or forces us to run with min_alloc_size=4K on
>>>>>>> HDD,
>>>>>>> which would risk very bad fragmentation.
>>>>>>>
>>>>>>> Which makes we want #3 (initially) and then #4.  But... if we do the
>>>>>>> "wal is
>>>>>>> just a logical write", that means this weird replay handling logic
>>>>>>> creeps into
>>>>>>> the normal write path.
>>>>>>>
>>>>>>> I'm currently leaning toward keeping the wal events special
>>>>>>> (lower-level), but
>>>>>>> doing what we can to make it work with the same mid- to low-level
>>>>>>> helper
>>>>>>> functions (for reading and verifying blobs, etc.).
>>>>> It occured to me that this checksum consistency issue only comes up when
>>>>> we are updating something that is smaller than the csum block size.  And
>>>>> the real source of the problem is that you have a sequence of
>>>>>
>>>>>     1- journal intent (kv wal item)
>>>>>     2- do read io
>>>>>     3- verify csum
>>>>>     4- do write io
>>>>>     5- cancel intent (remove kv wal item)
>>>>>
>>>>> If we have an order like
>>>>>
>>>>>     1- do read io
>>>>>     2- journal intent for entire csum chunk (kv wal item)
>>>>>     3- do write io
>>>>>     4- cancel intent
>>>>>
>>>>> Then the issue goes away.  And I'm thinking if the csum chunk is big
>>>>> enough that the #2 step is too big of a wal item to perform well, then
>>>>> the
>>>>> problem is your choice of csum block size, not the approach.  I.e., use
>>>>> a
>>>>> 4kb csum block size for rbd images, and use large blocks (128k, 512k,
>>>>> whatever) only for things that never see random overwrites (rgw data).
>>>>>
>>>>> If that is good enough, then it might also mean that we can make the wal
>>>>> operations never do reads--just (over)writes, further simplifying things
>>>>> on that end.  In the jewel bluestore the only times we do reads is for
>>>>> partial block updates (do we really care about these?  a buffer cache
>>>>> could absorb them when it matters) and for copy/cow operations
>>>>> post-clone
>>>>> (which i think are simple enough to be deal with separately).
>>>>>
>>>>> sage
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: 2 related bluestore questions
  2016-05-12 16:52                                 ` Igor Fedotov
@ 2016-05-12 17:09                                   ` Sage Weil
  2016-05-13 17:07                                     ` Igor Fedotov
  0 siblings, 1 reply; 28+ messages in thread
From: Sage Weil @ 2016-05-12 17:09 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: Allen Samuels, ceph-devel

On Thu, 12 May 2016, Igor Fedotov wrote:
> Well, it goes to the new space and updates all the maps.
> 
> Then WAL comes to action - where will it write? To the new location? And
> overwrite new data?

The the old location.  The modes I describe in the doc all are in terms of 
pextents (with the possible exception of E+F) for this reason... deferred 
IO, not deferred logical operations.

	https://github.com/liewegas/ceph/commit/c7cb76889669bf2c1abffd69f05d1c9e15c41e3c#commitcomment-17453409

For E+F, the source is always immutable (compressed blob or clone).  To 
avoid this sort of race on the destination... I'm not sure.  I'm sort of 
wondering, though, if we should even bother with the 'cow' part.  We used 
to have to do this because we didn't have a lextent mapping.  Now, if we 
have a small overwrite of a cloned/shared lextent/blob, we can

 - allocate a new min_alloc_size blob
 - write the new data into the relevant block in that blob
 - update lextent_t map *only for those bytes*

There's not write- or read-amp that way, and if we later get more random 
overwrites nearby they can just fill in the other unused parts of the blob 
and eventually the lextent mapping will merge/simplify to reference the 
whole thing.  (I'm assuming that if we wrote at, say, object offset 68k 
and min_alloc_size is 64k, we'd write at offset 4k in the new 64k blob, so 
that later when adjacent blocks get filled in it would be contiguous.)

Anyway, that would be *no* copy/cow type wal events at all.  The only 
read-like thing that would remain would be C, which is a pretty trivial 
case (no csum, no comp, just a read/modify/write of a partial block.)  I 
think it also means that no wal events would need to do metadata (csum) 
updates after all.

I pushed an update to that doc:

	https://github.com/liewegas/ceph/blob/76ab431ec2aed0b90f2f0354d89f4bccd23e7ae2/doc/dev/bluestore.rst

The D case may or may not be worth it.  It's nice for efficient small 
overwrites of big compressed blobs.  OTOH, E accomplishes the same thing 
at the expense of using a bit more disk space.  (For SSDs, E won't matter, 
since min_alloc_size would be 4K anyway.)

sage



> 
> 
> On 12.05.2016 19:48, Sage Weil wrote:
> > On Thu, 12 May 2016, Igor Fedotov wrote:
> > > The second write in my example isn't processed through WAL - it's large
> > > and
> > > overwrites the whole blob...
> > If it's large, it wouldn't overwrite--it would go to newly allocated
> > space.  We can *never* overwrite without wal or else we corrupt previous
> > data...
> > 
> > sage
> > 
> > 
> > > 
> > > On 12.05.2016 19:43, Sage Weil wrote:
> > > > On Thu, 12 May 2016, Igor Fedotov wrote:
> > > > > Yet another potential issue with WAL I can imagine:
> > > > > 
> > > > > Let's have some small write going to WAL followed by an larger aligned
> > > > > overwrite to the same extent that bypasses WAL. Is it possible if the
> > > > > first
> > > > > write is processed later and overwrites the second one? I think so.
> > > > Yeah, that would be chaos.  The wal ops are already ordered by the
> > > > sequencer (or ordered globally, if bluestore_sync_wal_apply=true), so
> > > > this
> > > > can't happen.
> > > > 
> > > > sage
> > > > 
> > > > 
> > > > > This way we can probably come to the conclusion that all requests
> > > > > should
> > > > > be
> > > > > processed in-sequence. One should prohibit multiple flows for requests
> > > > > processing as this may eliminate their order.
> > > > > 
> > > > > Yeah - I'm attacking WAL concept this way...
> > > > > 
> > > > > 
> > > > > Thanks,
> > > > > Igor
> > > > > 
> > > > > On 12.05.2016 5:58, Sage Weil wrote:
> > > > > > On Wed, 11 May 2016, Allen Samuels wrote:
> > > > > > > Sorry, still on vacation and I haven't really wrapped my head
> > > > > > > around
> > > > > > > everything that's being discussed. However, w.r.t. wal operations,
> > > > > > > I
> > > > > > > would strongly favor an approach that minimizes the amount of
> > > > > > > "future"
> > > > > > > operations that are recorded (which I'll call intentions -- i.e.,
> > > > > > > non-binding hints about extra work that needs to get done). Much
> > > > > > > of
> > > > > > > the
> > > > > > > complexity here is because the intentions -- after being recorded
> > > > > > > --
> > > > > > > will need to be altered based on subsequent operations. Hence
> > > > > > > every
> > > > > > > write operation will need to digest the historical intentions and
> > > > > > > potentially update them -- this is VERY complex, potentially much
> > > > > > > more
> > > > > > > complex than code that simply examines the current state and
> > > > > > > re-determines the correct next operation (i.e., de-wal, gc, etc.)
> > > > > > > 
> > > > > > > Additional complexity arises because you're recording two sets of
> > > > > > > state
> > > > > > > that require consistency checking -- in my experience, this road
> > > > > > > leads
> > > > > > > to perdition....
> > > > > > I agree is has to be something manageable that we can reason about.
> > > > > > I
> > > > > > think the question for me is mostly about which path minimizes the
> > > > > > complexity while still getting us a reasonable level of performance.
> > > > > > 
> > > > > > I had one new thought, see below...
> > > > > > 
> > > > > > > > > The downside is that any logically conflicting request (an
> > > > > > > > > overlapping
> > > > > > > > > write or truncate or zero) needs to drain the wal events,
> > > > > > > > > whereas
> > > > > > > > > with
> > > > > > > > > a lower-level wal description there might be cases where we
> > > > > > > > > can
> > > > > > > > > ignore
> > > > > > > > > the wal operation.  I suspect the trivial solution of
> > > > > > > > > o->flush()
> > > > > > > > > on
> > > > > > > > > write/truncate/zero will be pretty visible in benchmarks.
> > > > > > > > > Tracking
> > > > > > > > > in-flight wal ops with an interval_set would probably work
> > > > > > > > > well
> > > > > > > > > enough.
> > > > > > > > Hmm, I'm not sure this will pan out.  The main problem is that
> > > > > > > > if we
> > > > > > > > call back
> > > > > > > > into the write code (with a sync flag), we will have to do write
> > > > > > > > IO,
> > > > > > > > and
> > > > > > > > this
> > > > > > > > wreaks havoc on our otherwise (mostly) orderly state machine.
> > > > > > > > I think it can be done if we build in a similar guard like
> > > > > > > > _txc_finish_io so that
> > > > > > > > we wait for the wal events to also complete IO in order before
> > > > > > > > committing
> > > > > > > > them.  I think.
> > > > > > > > 
> > > > > > > > But the other problem is the checksum thing that came up in
> > > > > > > > another
> > > > > > > > thread,
> > > > > > > > where the read-side of a read/modify/write might fail teh
> > > > > > > > checksum
> > > > > > > > because
> > > > > > > > the wal write hit disk but the kv portion didn't commit. I see a
> > > > > > > > few
> > > > > > > > options:
> > > > > > > > 
> > > > > > > >     1) If there are checksums and we're doing a sub-block
> > > > > > > > overwrite,
> > > > > > > > we
> > > > > > > > have to
> > > > > > > > write/cow it elsewhere.  This probably means min_alloc_size cow
> > > > > > > > operations
> > > > > > > > for small writes.  In which case we needn't bother doing a wal
> > > > > > > > even
> > > > > > > > in
> > > > > > > > the
> > > > > > > > first place--the whole point is to enable an overwrite.
> > > > > > > > 
> > > > > > > >     2) We do loose checksum validation that will accept either
> > > > > > > > the
> > > > > > > > old
> > > > > > > > checksum
> > > > > > > > or the expected new checksum for the read stage.  This handles
> > > > > > > > these
> > > > > > > > two
> > > > > > > > crash cases:
> > > > > > > > 
> > > > > > > >     * kv commit of op + wal event
> > > > > > > >       <crash here, or>
> > > > > > > >     * do wal io (completely)
> > > > > > > >       <crash before cleaning up wal event>
> > > > > > > >     * kv cleanup of wal event
> > > > > > > > 
> > > > > > > > but not the case where we only partially complete the wal io.
> > > > > > > > Which
> > > > > > > > means
> > > > > > > > there is a small probability is "corrupt" ourselves on crash
> > > > > > > > (not
> > > > > > > > really
> > > > > > > > corrupt,
> > > > > > > > but confuse ourselves such that we refuse to replay the wal
> > > > > > > > events
> > > > > > > > on
> > > > > > > > startup).
> > > > > > > > 
> > > > > > > >     3) Same as 2, but simply warn if we fail that read-side
> > > > > > > > checksum
> > > > > > > > on
> > > > > > > > replay.
> > > > > > > > This basically introduces a *very* small window which could
> > > > > > > > allow an
> > > > > > > > ondisk
> > > > > > > > corruption to get absorbed into our checksum.  This could just
> > > > > > > > be #2
> > > > > > > > + a
> > > > > > > > config option so we warn instead of erroring out.
> > > > > > > > 
> > > > > > > >     4) Same as 2, but we try every combination of old and new
> > > > > > > > data on
> > > > > > > > block/sector boundaries to find a valid checksum on the
> > > > > > > > read-side.
> > > > > > > > 
> > > > > > > > I think #1 is a non-starter because it turns a 4K write into a
> > > > > > > > 64K
> > > > > > > > read
> > > > > > > > + seek +
> > > > > > > > 64K write on an HDD.  Or forces us to run with min_alloc_size=4K
> > > > > > > > on
> > > > > > > > HDD,
> > > > > > > > which would risk very bad fragmentation.
> > > > > > > > 
> > > > > > > > Which makes we want #3 (initially) and then #4.  But... if we do
> > > > > > > > the
> > > > > > > > "wal is
> > > > > > > > just a logical write", that means this weird replay handling
> > > > > > > > logic
> > > > > > > > creeps into
> > > > > > > > the normal write path.
> > > > > > > > 
> > > > > > > > I'm currently leaning toward keeping the wal events special
> > > > > > > > (lower-level), but
> > > > > > > > doing what we can to make it work with the same mid- to
> > > > > > > > low-level
> > > > > > > > helper
> > > > > > > > functions (for reading and verifying blobs, etc.).
> > > > > > It occured to me that this checksum consistency issue only comes up
> > > > > > when
> > > > > > we are updating something that is smaller than the csum block size.
> > > > > > And
> > > > > > the real source of the problem is that you have a sequence of
> > > > > > 
> > > > > >     1- journal intent (kv wal item)
> > > > > >     2- do read io
> > > > > >     3- verify csum
> > > > > >     4- do write io
> > > > > >     5- cancel intent (remove kv wal item)
> > > > > > 
> > > > > > If we have an order like
> > > > > > 
> > > > > >     1- do read io
> > > > > >     2- journal intent for entire csum chunk (kv wal item)
> > > > > >     3- do write io
> > > > > >     4- cancel intent
> > > > > > 
> > > > > > Then the issue goes away.  And I'm thinking if the csum chunk is big
> > > > > > enough that the #2 step is too big of a wal item to perform well,
> > > > > > then
> > > > > > the
> > > > > > problem is your choice of csum block size, not the approach.  I.e.,
> > > > > > use
> > > > > > a
> > > > > > 4kb csum block size for rbd images, and use large blocks (128k,
> > > > > > 512k,
> > > > > > whatever) only for things that never see random overwrites (rgw
> > > > > > data).
> > > > > > 
> > > > > > If that is good enough, then it might also mean that we can make the
> > > > > > wal
> > > > > > operations never do reads--just (over)writes, further simplifying
> > > > > > things
> > > > > > on that end.  In the jewel bluestore the only times we do reads is
> > > > > > for
> > > > > > partial block updates (do we really care about these?  a buffer
> > > > > > cache
> > > > > > could absorb them when it matters) and for copy/cow operations
> > > > > > post-clone
> > > > > > (which i think are simple enough to be deal with separately).
> > > > > > 
> > > > > > sage
> > > > > --
> > > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > > > in
> > > > > the body of a message to majordomo@vger.kernel.org
> > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > 
> > > > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > 
> > > 
> 
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: 2 related bluestore questions
  2016-05-12 17:09                                   ` Sage Weil
@ 2016-05-13 17:07                                     ` Igor Fedotov
  0 siblings, 0 replies; 28+ messages in thread
From: Igor Fedotov @ 2016-05-13 17:07 UTC (permalink / raw)
  To: Sage Weil; +Cc: Allen Samuels, ceph-devel



On 12.05.2016 20:09, Sage Weil wrote:
> On Thu, 12 May 2016, Igor Fedotov wrote:
>> Well, it goes to the new space and updates all the maps.
>>
>> Then WAL comes to action - where will it write? To the new location? And
>> overwrite new data?
> The the old location.  The modes I describe in the doc all are in terms of
> pextents (with the possible exception of E+F) for this reason... deferred
> IO, not deferred logical operations.
Sounds almost good.
Sorry for obtrusiveness but IMHO there is still a minor chance that 
destination pextent (released by the second write bypassing WAL) is 
reallocated for another object and thus the write might destroy data there.
> 	https://github.com/liewegas/ceph/commit/c7cb76889669bf2c1abffd69f05d1c9e15c41e3c#commitcomment-17453409
>
> For E+F, the source is always immutable (compressed blob or clone).  To
> avoid this sort of race on the destination... I'm not sure.  I'm sort of
> wondering, though, if we should even bother with the 'cow' part.  We used
> to have to do this because we didn't have a lextent mapping.  Now, if we
> have a small overwrite of a cloned/shared lextent/blob, we can
>
>   - allocate a new min_alloc_size blob
>   - write the new data into the relevant block in that blob
>   - update lextent_t map *only for those bytes*
That's cool!
The only (minor?) drawback I can see is potential read ineffectiveness 
when csum is enabled. One should read both overlapping blocks to 
calculate csum when read happens for the spanning interval.
> There's not write- or read-amp that way, and if we later get more random
> overwrites nearby they can just fill in the other unused parts of the blob
> and eventually the lextent mapping will merge/simplify to reference the
> whole thing.  (I'm assuming that if we wrote at, say, object offset 68k
> and min_alloc_size is 64k, we'd write at offset 4k in the new 64k blob, so
> that later when adjacent blocks get filled in it would be contiguous.)
> Anyway, that would be *no* copy/cow type wal events at all.  The only
> read-like thing that would remain would be C, which is a pretty trivial
> case (no csum, no comp, just a read/modify/write of a partial block.)  I
> think it also means that no wal events would need to do metadata (csum)
> updates after all.
>
> I pushed an update to that doc:
>
> 	https://github.com/liewegas/ceph/blob/76ab431ec2aed0b90f2f0354d89f4bccd23e7ae2/doc/dev/bluestore.rst
>
> The D case may or may not be worth it.  It's nice for efficient small
> overwrites of big compressed blobs.  OTOH, E accomplishes the same thing
> at the expense of using a bit more disk space.  (For SSDs, E won't matter,
> since min_alloc_size would be 4K anyway.)
>
> sage
>
>
>
>>
>> On 12.05.2016 19:48, Sage Weil wrote:
>>> On Thu, 12 May 2016, Igor Fedotov wrote:
>>>> The second write in my example isn't processed through WAL - it's large
>>>> and
>>>> overwrites the whole blob...
>>> If it's large, it wouldn't overwrite--it would go to newly allocated
>>> space.  We can *never* overwrite without wal or else we corrupt previous
>>> data...
>>>
>>> sage
>>>
>>>
>>>> On 12.05.2016 19:43, Sage Weil wrote:
>>>>> On Thu, 12 May 2016, Igor Fedotov wrote:
>>>>>> Yet another potential issue with WAL I can imagine:
>>>>>>
>>>>>> Let's have some small write going to WAL followed by an larger aligned
>>>>>> overwrite to the same extent that bypasses WAL. Is it possible if the
>>>>>> first
>>>>>> write is processed later and overwrites the second one? I think so.
>>>>> Yeah, that would be chaos.  The wal ops are already ordered by the
>>>>> sequencer (or ordered globally, if bluestore_sync_wal_apply=true), so
>>>>> this
>>>>> can't happen.
>>>>>
>>>>> sage
>>>>>
>>>>>
>>>>>> This way we can probably come to the conclusion that all requests
>>>>>> should
>>>>>> be
>>>>>> processed in-sequence. One should prohibit multiple flows for requests
>>>>>> processing as this may eliminate their order.
>>>>>>
>>>>>> Yeah - I'm attacking WAL concept this way...
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Igor
>>>>>>
>>>>>> On 12.05.2016 5:58, Sage Weil wrote:
>>>>>>> On Wed, 11 May 2016, Allen Samuels wrote:
>>>>>>>> Sorry, still on vacation and I haven't really wrapped my head
>>>>>>>> around
>>>>>>>> everything that's being discussed. However, w.r.t. wal operations,
>>>>>>>> I
>>>>>>>> would strongly favor an approach that minimizes the amount of
>>>>>>>> "future"
>>>>>>>> operations that are recorded (which I'll call intentions -- i.e.,
>>>>>>>> non-binding hints about extra work that needs to get done). Much
>>>>>>>> of
>>>>>>>> the
>>>>>>>> complexity here is because the intentions -- after being recorded
>>>>>>>> --
>>>>>>>> will need to be altered based on subsequent operations. Hence
>>>>>>>> every
>>>>>>>> write operation will need to digest the historical intentions and
>>>>>>>> potentially update them -- this is VERY complex, potentially much
>>>>>>>> more
>>>>>>>> complex than code that simply examines the current state and
>>>>>>>> re-determines the correct next operation (i.e., de-wal, gc, etc.)
>>>>>>>>
>>>>>>>> Additional complexity arises because you're recording two sets of
>>>>>>>> state
>>>>>>>> that require consistency checking -- in my experience, this road
>>>>>>>> leads
>>>>>>>> to perdition....
>>>>>>> I agree is has to be something manageable that we can reason about.
>>>>>>> I
>>>>>>> think the question for me is mostly about which path minimizes the
>>>>>>> complexity while still getting us a reasonable level of performance.
>>>>>>>
>>>>>>> I had one new thought, see below...
>>>>>>>
>>>>>>>>>> The downside is that any logically conflicting request (an
>>>>>>>>>> overlapping
>>>>>>>>>> write or truncate or zero) needs to drain the wal events,
>>>>>>>>>> whereas
>>>>>>>>>> with
>>>>>>>>>> a lower-level wal description there might be cases where we
>>>>>>>>>> can
>>>>>>>>>> ignore
>>>>>>>>>> the wal operation.  I suspect the trivial solution of
>>>>>>>>>> o->flush()
>>>>>>>>>> on
>>>>>>>>>> write/truncate/zero will be pretty visible in benchmarks.
>>>>>>>>>> Tracking
>>>>>>>>>> in-flight wal ops with an interval_set would probably work
>>>>>>>>>> well
>>>>>>>>>> enough.
>>>>>>>>> Hmm, I'm not sure this will pan out.  The main problem is that
>>>>>>>>> if we
>>>>>>>>> call back
>>>>>>>>> into the write code (with a sync flag), we will have to do write
>>>>>>>>> IO,
>>>>>>>>> and
>>>>>>>>> this
>>>>>>>>> wreaks havoc on our otherwise (mostly) orderly state machine.
>>>>>>>>> I think it can be done if we build in a similar guard like
>>>>>>>>> _txc_finish_io so that
>>>>>>>>> we wait for the wal events to also complete IO in order before
>>>>>>>>> committing
>>>>>>>>> them.  I think.
>>>>>>>>>
>>>>>>>>> But the other problem is the checksum thing that came up in
>>>>>>>>> another
>>>>>>>>> thread,
>>>>>>>>> where the read-side of a read/modify/write might fail teh
>>>>>>>>> checksum
>>>>>>>>> because
>>>>>>>>> the wal write hit disk but the kv portion didn't commit. I see a
>>>>>>>>> few
>>>>>>>>> options:
>>>>>>>>>
>>>>>>>>>      1) If there are checksums and we're doing a sub-block
>>>>>>>>> overwrite,
>>>>>>>>> we
>>>>>>>>> have to
>>>>>>>>> write/cow it elsewhere.  This probably means min_alloc_size cow
>>>>>>>>> operations
>>>>>>>>> for small writes.  In which case we needn't bother doing a wal
>>>>>>>>> even
>>>>>>>>> in
>>>>>>>>> the
>>>>>>>>> first place--the whole point is to enable an overwrite.
>>>>>>>>>
>>>>>>>>>      2) We do loose checksum validation that will accept either
>>>>>>>>> the
>>>>>>>>> old
>>>>>>>>> checksum
>>>>>>>>> or the expected new checksum for the read stage.  This handles
>>>>>>>>> these
>>>>>>>>> two
>>>>>>>>> crash cases:
>>>>>>>>>
>>>>>>>>>      * kv commit of op + wal event
>>>>>>>>>        <crash here, or>
>>>>>>>>>      * do wal io (completely)
>>>>>>>>>        <crash before cleaning up wal event>
>>>>>>>>>      * kv cleanup of wal event
>>>>>>>>>
>>>>>>>>> but not the case where we only partially complete the wal io.
>>>>>>>>> Which
>>>>>>>>> means
>>>>>>>>> there is a small probability is "corrupt" ourselves on crash
>>>>>>>>> (not
>>>>>>>>> really
>>>>>>>>> corrupt,
>>>>>>>>> but confuse ourselves such that we refuse to replay the wal
>>>>>>>>> events
>>>>>>>>> on
>>>>>>>>> startup).
>>>>>>>>>
>>>>>>>>>      3) Same as 2, but simply warn if we fail that read-side
>>>>>>>>> checksum
>>>>>>>>> on
>>>>>>>>> replay.
>>>>>>>>> This basically introduces a *very* small window which could
>>>>>>>>> allow an
>>>>>>>>> ondisk
>>>>>>>>> corruption to get absorbed into our checksum.  This could just
>>>>>>>>> be #2
>>>>>>>>> + a
>>>>>>>>> config option so we warn instead of erroring out.
>>>>>>>>>
>>>>>>>>>      4) Same as 2, but we try every combination of old and new
>>>>>>>>> data on
>>>>>>>>> block/sector boundaries to find a valid checksum on the
>>>>>>>>> read-side.
>>>>>>>>>
>>>>>>>>> I think #1 is a non-starter because it turns a 4K write into a
>>>>>>>>> 64K
>>>>>>>>> read
>>>>>>>>> + seek +
>>>>>>>>> 64K write on an HDD.  Or forces us to run with min_alloc_size=4K
>>>>>>>>> on
>>>>>>>>> HDD,
>>>>>>>>> which would risk very bad fragmentation.
>>>>>>>>>
>>>>>>>>> Which makes we want #3 (initially) and then #4.  But... if we do
>>>>>>>>> the
>>>>>>>>> "wal is
>>>>>>>>> just a logical write", that means this weird replay handling
>>>>>>>>> logic
>>>>>>>>> creeps into
>>>>>>>>> the normal write path.
>>>>>>>>>
>>>>>>>>> I'm currently leaning toward keeping the wal events special
>>>>>>>>> (lower-level), but
>>>>>>>>> doing what we can to make it work with the same mid- to
>>>>>>>>> low-level
>>>>>>>>> helper
>>>>>>>>> functions (for reading and verifying blobs, etc.).
>>>>>>> It occured to me that this checksum consistency issue only comes up
>>>>>>> when
>>>>>>> we are updating something that is smaller than the csum block size.
>>>>>>> And
>>>>>>> the real source of the problem is that you have a sequence of
>>>>>>>
>>>>>>>      1- journal intent (kv wal item)
>>>>>>>      2- do read io
>>>>>>>      3- verify csum
>>>>>>>      4- do write io
>>>>>>>      5- cancel intent (remove kv wal item)
>>>>>>>
>>>>>>> If we have an order like
>>>>>>>
>>>>>>>      1- do read io
>>>>>>>      2- journal intent for entire csum chunk (kv wal item)
>>>>>>>      3- do write io
>>>>>>>      4- cancel intent
>>>>>>>
>>>>>>> Then the issue goes away.  And I'm thinking if the csum chunk is big
>>>>>>> enough that the #2 step is too big of a wal item to perform well,
>>>>>>> then
>>>>>>> the
>>>>>>> problem is your choice of csum block size, not the approach.  I.e.,
>>>>>>> use
>>>>>>> a
>>>>>>> 4kb csum block size for rbd images, and use large blocks (128k,
>>>>>>> 512k,
>>>>>>> whatever) only for things that never see random overwrites (rgw
>>>>>>> data).
>>>>>>>
>>>>>>> If that is good enough, then it might also mean that we can make the
>>>>>>> wal
>>>>>>> operations never do reads--just (over)writes, further simplifying
>>>>>>> things
>>>>>>> on that end.  In the jewel bluestore the only times we do reads is
>>>>>>> for
>>>>>>> partial block updates (do we really care about these?  a buffer
>>>>>>> cache
>>>>>>> could absorb them when it matters) and for copy/cow operations
>>>>>>> post-clone
>>>>>>> (which i think are simple enough to be deal with separately).
>>>>>>>
>>>>>>> sage
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>> in
>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>
>>>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>>
>>


^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2016-05-13 17:07 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-05-09 18:31 2 related bluestore questions Sage Weil
2016-05-10 12:17 ` Igor Fedotov
2016-05-10 12:53   ` Sage Weil
2016-05-10 14:41     ` Igor Fedotov
2016-05-10 15:39       ` Sage Weil
2016-05-11  1:10         ` Sage Weil
2016-05-11 12:11           ` Igor Fedotov
2016-05-11 13:10             ` Sage Weil
2016-05-11 13:45               ` Igor Fedotov
2016-05-11 13:57                 ` Sage Weil
2016-05-11 20:54                   ` Sage Weil
2016-05-11 21:38                     ` Allen Samuels
2016-05-12  2:58                       ` Sage Weil
2016-05-12 11:54                         ` Allen Samuels
2016-05-12 14:47                           ` Igor Fedotov
2016-05-12 14:38                         ` Igor Fedotov
2016-05-12 16:37                         ` Igor Fedotov
2016-05-12 16:43                           ` Sage Weil
2016-05-12 16:45                             ` Igor Fedotov
2016-05-12 16:48                               ` Sage Weil
2016-05-12 16:52                                 ` Igor Fedotov
2016-05-12 17:09                                   ` Sage Weil
2016-05-13 17:07                                     ` Igor Fedotov
2016-05-12 14:29                       ` Igor Fedotov
2016-05-12 14:27                     ` Igor Fedotov
2016-05-12 15:06                       ` Sage Weil
2016-05-11 12:39           ` Igor Fedotov
2016-05-11 14:35             ` Sage Weil

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.