All of lore.kernel.org
 help / color / mirror / Atom feed
* clone_range in BlueStore
@ 2017-01-30 13:49 Igor Fedotov
  2017-01-30 14:18 ` Sage Weil
  0 siblings, 1 reply; 4+ messages in thread
From: Igor Fedotov @ 2017-01-30 13:49 UTC (permalink / raw)
  To: Sage Weil, ceph-devel

Hi Sage,

It looks like there is some bug somewhere in 
BlueStore/store_test/clone_range.

I'm occasionally hitting an assert on mismatched data in read result 
while performing SyntheticMatrixCsumVsCompression/2 test case.

--- buffer mismatch between offset 0x7400 and 0xa200, total 0x19e00
--- expected:
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
|................|
*

00006c00  39 35 31 37 32 37 31 34  34 31 38 39 31 33 37 39 
|9517271441891379|

<skipped>

00007400  30 31 33 32 34 39 35 35  30 38 32 37 39 32 37 31 
|0132495508279271|

00007410  37 37 37 31 31 38 31 37  33 36 32 36 33 33 31 34 
|7771181736263314|

--- actual:
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
|................|
*
00006c00  39 35 31 37 32 37 31 34  34 31 38 39 31 33 37 39 
|9517271441891379|

<skipped>

00007400  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
|................|
*
0000a200  32 35 32 32 38 33 31 34  35 38 37 36 34 35 36 33 
|2522831458764563|

Multiple runs are required to hit that though...

I did some analysis and it seems that there are some issues with 
clone_range2 stuff.

First of all - do we have any limits prerequisites on src/dst offsets in 
this request? E.g. should they be aligned similarly within alloc unit 
boundaries? I recall some discussions on that

a while ago.

store_test doesn't have any as far as I can see, e.g. (min_alloc_size = 
0x10000)

  "ops": [
         {
             "op_num": 0,
             "op_name": "clonerange2",
             "collection": "555.0_head",
             "src_oid": 
"#555:3b000000:::OBJ_731aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa:head#",
             "dst_oid": "#555:c7000000:::OBJ_738:7cfc81ab#",
             "src_offset": 107520,
`            "len": 78336,
             "dst_offset": 27648
         }
     ]

This results in potentially invalid blobs for the destination objects, 
see extent starting at 0x7400 below - it has blob offset = 0 and hence 
blob isn't aligned with min_alloc_size:

2017-01-30 03:57:17.802440 7f0036a20700 15 
bluestore(bluestore.test_temp_dir) read 555.0_head 
#555:c7000000:::OBJ_738:7cfc81ab# 0x0~19e0
0
2017-01-30 03:57:17.802448 7f0036a20700 30 
bluestore.OnodeSpace(0x55eb49789b78 in 0x55eb45dd0620) lookup
2017-01-30 03:57:17.802450 7f0036a20700 30 
bluestore.OnodeSpace(0x55eb49789b78 in 0x55eb45dd0620) lookup 
#555:c7000000:::OBJ_738:7cfc81a
b# hit 0x55eb49874700
2017-01-30 03:57:17.802453 7f0036a20700 20 
bluestore(bluestore.test_temp_dir) _do_read 0x0~19e00 size 0x19e00 (105984)
2017-01-30 03:57:17.802455 7f0036a20700 20 
bluestore.onode(0x55eb49874700) flush done
2017-01-30 03:57:17.802456 7f0036a20700 30 
bluestore.extentmap(0x55eb49874850) fault_range 0x0~19e00
2017-01-30 03:57:17.802457 7f0036a20700 30 
bluestore(bluestore.test_temp_dir) _dump_onode 0x55eb49874700 
#555:c7000000:::OBJ_738:7cfc81a
b# nid 17377 size 0x19e00 (105984) expected_object_size 2097152 
expected_write_size 4096 in 0 shards
2017-01-30 03:57:17.802461 7f0036a20700 30 
bluestore(bluestore.test_temp_dir) _dump_extent_map  0x6c00~800: 
0x3800~800 Blob(0x55eb5047a4
60 blob([0x40190000~4000] csum+has_unused+shared crc32c/0x1000 
unused=0xff) ref_map(0x3800~800=1) SharedBlob(0x55eb4c2f49f0 sbid 0x3adf
loaded shared_blob(ref_map(0x40190000~4000=2))))
2017-01-30 03:57:17.802469 7f0036a20700 30 
bluestore(bluestore.test_temp_dir) _dump_extent_map      csum: 
[0,0,f1e4ed4a,417bbe91]
2017-01-30 03:57:17.802472 7f0036a20700 30 
bluestore(bluestore.test_temp_dir) _dump_extent_map  0x7400~12a00: 
0x0~12a00 Blob(0x55eb4dc87
b80 blob([0x40194000~18000] csum+shared crc32c/0x1000) 
ref_map(0x0~12a00=1) SharedBlob(0x55eb4c2f5180 sbid 0x3ae0 loaded 
shared_blob(ref
_map(0x40194000~18000=3))))
2017-01-30 03:57:17.802479 7f0036a20700 30 
bluestore(bluestore.test_temp_dir) _dump_extent_map      csum: 
[d1f849c5,fbe516b8,518379f8,b8
b944c8,18b7be23,2b6562d5,51de5770,40988db7,bf7fd7f3,14744e41,eddcb459,639b3350,d038700c,80ffc21e,d7f4edb3,a7ae1a9,f123b379,dfb76444,8ac0
3032,c1cbff33,629e4868,12d9f0ea,5d50ca8c,b7ce671d]
2017-01-30 03:57:17.802484 7f0036a20700 30 
bluestore(bluestore.test_temp_dir) _dump_extent_map 0x0~18000 
buffer(0x55eb45deb020 spa
ce 0x55eb4c2f51d8 0x0~18000 clean)
2017-01-30 03:57:17.802487 7f0036a20700 30 
bluestore(bluestore.test_temp_dir) _do_read  hole 0x0~6c00
2017-01-30 03:57:17.802490 7f0036a20700 20 
bluestore(bluestore.test_temp_dir) _do_read  blob Blob(0x55eb5047a460 
blob([0x40190000~4000]
csum+has_unused+shared crc32c/0x1000 unused=0xff) ref_map(0x3800~800=1) 
SharedBlob(0x55eb4c2f49f0 sbid 0x3adf loaded shared_blob(ref_map
(0x40190000~4000=2)))) need 0x3800~800 cache has 0x[]
2017-01-30 03:57:17.802495 7f0036a20700 30 
bluestore(bluestore.test_temp_dir) _do_read    will read 0x6c00: 0x3800~800
2017-01-30 03:57:17.802509 7f0036a20700 20 
bluestore(bluestore.test_temp_dir) _do_read  blob Blob(0x55eb4dc87b80 
blob([0x40194000~18000]
  csum+shared crc32c/0x1000) ref_map(0x0~12a00=1) 
SharedBlob(0x55eb4c2f5180 sbid 0x3ae0 loaded 
shared_blob(ref_map(0x40194000~18000=3))))
  need 0x0~12a00 cache has 0x[0~12a00]
2017-01-30 03:57:17.802515 7f0036a20700 30 
bluestore(bluestore.test_temp_dir) _do_read    use cache 0x7400: 0x0~12a00
2017-01-30 03:57:17.802519 7f0036a20700 20 
bluestore(bluestore.test_temp_dir) _do_read  blob Blob(0x55eb5047a460 
blob([0x40190000~4000] csum+has_unused+shared crc32c/0x1000 unused=0xff) 
ref_map(0x3800~800=1) SharedBlob(0x55eb4c2f49f0 sbid 0x3adf loaded 
shared_blob(ref_map(0x40190000~4000=2)))) need 0x0x6c00:3800~800
2017-01-30 03:57:17.802529 7f0036a20700 20 
bluestore(bluestore.test_temp_dir) _do_read    region 0x6c00: 0x3800~800 
reading 0x3000~1000

I haven't unwind all the clone_range transformations that lead to this 
state yet. In the example above source object already has the same 
unaligned extents issue.

But anyway it appears that clone_range neither care nor assert on 
unaligned input offsets...

I can share a couple of logs if needed..

Any comments?

Thanks,

Igor



^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: clone_range in BlueStore
  2017-01-30 13:49 clone_range in BlueStore Igor Fedotov
@ 2017-01-30 14:18 ` Sage Weil
  2017-01-30 15:04   ` Igor Fedotov
  0 siblings, 1 reply; 4+ messages in thread
From: Sage Weil @ 2017-01-30 14:18 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: ceph-devel

On Mon, 30 Jan 2017, Igor Fedotov wrote:
> Hi Sage,
> 
> It looks like there is some bug somewhere in BlueStore/store_test/clone_range.
> 
> I'm occasionally hitting an assert on mismatched data in read result while
> performing SyntheticMatrixCsumVsCompression/2 test case.
> 
> --- buffer mismatch between offset 0x7400 and 0xa200, total 0x19e00
> --- expected:
> 00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 |................|
> *
> 
> 00006c00  39 35 31 37 32 37 31 34  34 31 38 39 31 33 37 39 |9517271441891379|
> 
> <skipped>
> 
> 00007400  30 31 33 32 34 39 35 35  30 38 32 37 39 32 37 31 |0132495508279271|
> 
> 00007410  37 37 37 31 31 38 31 37  33 36 32 36 33 33 31 34 |7771181736263314|
> 
> --- actual:
> 00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 |................|
> *
> 00006c00  39 35 31 37 32 37 31 34  34 31 38 39 31 33 37 39 |9517271441891379|
> 
> <skipped>
> 
> 00007400  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 |................|
> *
> 0000a200  32 35 32 32 38 33 31 34  35 38 37 36 34 35 36 33 |2522831458764563|
> 
> Multiple runs are required to hit that though...
> 
> I did some analysis and it seems that there are some issues with clone_range2
> stuff.
> 
> First of all - do we have any limits prerequisites on src/dst offsets in this
> request? E.g. should they be aligned similarly within alloc unit boundaries? I
> recall some discussions on that a while ago.
> 
> store_test doesn't have any as far as I can see, e.g. (min_alloc_size =
> 0x10000)
> 
>  "ops": [
>         {
>             "op_num": 0,
>             "op_name": "clonerange2",
>             "collection": "555.0_head",
>             "src_oid":
> "#555:3b000000:::OBJ_731aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
> aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
> aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa:head#",
>             "dst_oid": "#555:c7000000:::OBJ_738:7cfc81ab#",
>             "src_offset": 107520,
> `            "len": 78336,
>             "dst_offset": 27648
>         }
>     ]
> 
> This results in potentially invalid blobs for the destination objects, see
> extent starting at 0x7400 below - it has blob offset = 0 and hence blob isn't
> aligned with min_alloc_size:

Oh, right.

Well, the good news is the OSD no longer has any callers for which the src 
and dst clone_range offsets are different, so we could simply assert that 
they match.  That's the simplest fix.  It's party a question of whether 
we expect future cases where we will need to clone between offsets.  
Perhaps we assert for now but don't clean up the interface in case 
we need to backtrack later?

Or we could do something more limited.  The problem below is less about 
min_alloc_size and more that it's not block aligned, I think, right?  We 
could make clone_range fall back to the read/write path if the alignment 
does not match the block device...

sage

> 
> 2017-01-30 03:57:17.802440 7f0036a20700 15 bluestore(bluestore.test_temp_dir)
> read 555.0_head #555:c7000000:::OBJ_738:7cfc81ab# 0x0~19e0
> 0
> 2017-01-30 03:57:17.802448 7f0036a20700 30 bluestore.OnodeSpace(0x55eb49789b78
> in 0x55eb45dd0620) lookup
> 2017-01-30 03:57:17.802450 7f0036a20700 30 bluestore.OnodeSpace(0x55eb49789b78
> in 0x55eb45dd0620) lookup #555:c7000000:::OBJ_738:7cfc81a
> b# hit 0x55eb49874700
> 2017-01-30 03:57:17.802453 7f0036a20700 20 bluestore(bluestore.test_temp_dir)
> _do_read 0x0~19e00 size 0x19e00 (105984)
> 2017-01-30 03:57:17.802455 7f0036a20700 20 bluestore.onode(0x55eb49874700)
> flush done
> 2017-01-30 03:57:17.802456 7f0036a20700 30 bluestore.extentmap(0x55eb49874850)
> fault_range 0x0~19e00
> 2017-01-30 03:57:17.802457 7f0036a20700 30 bluestore(bluestore.test_temp_dir)
> _dump_onode 0x55eb49874700 #555:c7000000:::OBJ_738:7cfc81a
> b# nid 17377 size 0x19e00 (105984) expected_object_size 2097152
> expected_write_size 4096 in 0 shards
> 2017-01-30 03:57:17.802461 7f0036a20700 30 bluestore(bluestore.test_temp_dir)
> _dump_extent_map  0x6c00~800: 0x3800~800 Blob(0x55eb5047a4
> 60 blob([0x40190000~4000] csum+has_unused+shared crc32c/0x1000 unused=0xff)
> ref_map(0x3800~800=1) SharedBlob(0x55eb4c2f49f0 sbid 0x3adf
> loaded shared_blob(ref_map(0x40190000~4000=2))))
> 2017-01-30 03:57:17.802469 7f0036a20700 30 bluestore(bluestore.test_temp_dir)
> _dump_extent_map      csum: [0,0,f1e4ed4a,417bbe91]
> 2017-01-30 03:57:17.802472 7f0036a20700 30 bluestore(bluestore.test_temp_dir)
> _dump_extent_map  0x7400~12a00: 0x0~12a00 Blob(0x55eb4dc87
> b80 blob([0x40194000~18000] csum+shared crc32c/0x1000) ref_map(0x0~12a00=1)
> SharedBlob(0x55eb4c2f5180 sbid 0x3ae0 loaded shared_blob(ref
> _map(0x40194000~18000=3))))
> 2017-01-30 03:57:17.802479 7f0036a20700 30 bluestore(bluestore.test_temp_dir)
> _dump_extent_map      csum: [d1f849c5,fbe516b8,518379f8,b8
> b944c8,18b7be23,2b6562d5,51de5770,40988db7,bf7fd7f3,14744e41,eddcb459,639b3350,d038700c,80ffc21e,d7f4edb3,a7ae1a9,f123b379,dfb76444,8ac0
> 3032,c1cbff33,629e4868,12d9f0ea,5d50ca8c,b7ce671d]
> 2017-01-30 03:57:17.802484 7f0036a20700 30 bluestore(bluestore.test_temp_dir)
> _dump_extent_map 0x0~18000 buffer(0x55eb45deb020 spa
> ce 0x55eb4c2f51d8 0x0~18000 clean)
> 2017-01-30 03:57:17.802487 7f0036a20700 30 bluestore(bluestore.test_temp_dir)
> _do_read  hole 0x0~6c00
> 2017-01-30 03:57:17.802490 7f0036a20700 20 bluestore(bluestore.test_temp_dir)
> _do_read  blob Blob(0x55eb5047a460 blob([0x40190000~4000]
> csum+has_unused+shared crc32c/0x1000 unused=0xff) ref_map(0x3800~800=1)
> SharedBlob(0x55eb4c2f49f0 sbid 0x3adf loaded shared_blob(ref_map
> (0x40190000~4000=2)))) need 0x3800~800 cache has 0x[]
> 2017-01-30 03:57:17.802495 7f0036a20700 30 bluestore(bluestore.test_temp_dir)
> _do_read    will read 0x6c00: 0x3800~800
> 2017-01-30 03:57:17.802509 7f0036a20700 20 bluestore(bluestore.test_temp_dir)
> _do_read  blob Blob(0x55eb4dc87b80 blob([0x40194000~18000]
>  csum+shared crc32c/0x1000) ref_map(0x0~12a00=1) SharedBlob(0x55eb4c2f5180
> sbid 0x3ae0 loaded shared_blob(ref_map(0x40194000~18000=3))))
>  need 0x0~12a00 cache has 0x[0~12a00]
> 2017-01-30 03:57:17.802515 7f0036a20700 30 bluestore(bluestore.test_temp_dir)
> _do_read    use cache 0x7400: 0x0~12a00
> 2017-01-30 03:57:17.802519 7f0036a20700 20 bluestore(bluestore.test_temp_dir)
> _do_read  blob Blob(0x55eb5047a460 blob([0x40190000~4000]
> csum+has_unused+shared crc32c/0x1000 unused=0xff) ref_map(0x3800~800=1)
> SharedBlob(0x55eb4c2f49f0 sbid 0x3adf loaded
> shared_blob(ref_map(0x40190000~4000=2)))) need 0x0x6c00:3800~800
> 2017-01-30 03:57:17.802529 7f0036a20700 20 bluestore(bluestore.test_temp_dir)
> _do_read    region 0x6c00: 0x3800~800 reading 0x3000~1000
> 
> I haven't unwind all the clone_range transformations that lead to this state
> yet. In the example above source object already has the same unaligned extents
> issue.
> 
> But anyway it appears that clone_range neither care nor assert on unaligned
> input offsets...
> 
> I can share a couple of logs if needed..
> 
> Any comments?
> 
> Thanks,
> 
> Igor
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: clone_range in BlueStore
  2017-01-30 14:18 ` Sage Weil
@ 2017-01-30 15:04   ` Igor Fedotov
  2017-01-30 15:10     ` Sage Weil
  0 siblings, 1 reply; 4+ messages in thread
From: Igor Fedotov @ 2017-01-30 15:04 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel


On 30.01.2017 17:18, Sage Weil wrote:
> On Mon, 30 Jan 2017, Igor Fedotov wrote:
>> Hi Sage,
>>
>> It looks like there is some bug somewhere in BlueStore/store_test/clone_range.
>>
>> I'm occasionally hitting an assert on mismatched data in read result while
>> performing SyntheticMatrixCsumVsCompression/2 test case.
>>
>> --- buffer mismatch between offset 0x7400 and 0xa200, total 0x19e00
>> --- expected:
>> 00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 |................|
>> *
>>
>> 00006c00  39 35 31 37 32 37 31 34  34 31 38 39 31 33 37 39 |9517271441891379|
>>
>> <skipped>
>>
>> 00007400  30 31 33 32 34 39 35 35  30 38 32 37 39 32 37 31 |0132495508279271|
>>
>> 00007410  37 37 37 31 31 38 31 37  33 36 32 36 33 33 31 34 |7771181736263314|
>>
>> --- actual:
>> 00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 |................|
>> *
>> 00006c00  39 35 31 37 32 37 31 34  34 31 38 39 31 33 37 39 |9517271441891379|
>>
>> <skipped>
>>
>> 00007400  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 |................|
>> *
>> 0000a200  32 35 32 32 38 33 31 34  35 38 37 36 34 35 36 33 |2522831458764563|
>>
>> Multiple runs are required to hit that though...
>>
>> I did some analysis and it seems that there are some issues with clone_range2
>> stuff.
>>
>> First of all - do we have any limits prerequisites on src/dst offsets in this
>> request? E.g. should they be aligned similarly within alloc unit boundaries? I
>> recall some discussions on that a while ago.
>>
>> store_test doesn't have any as far as I can see, e.g. (min_alloc_size =
>> 0x10000)
>>
>>   "ops": [
>>          {
>>              "op_num": 0,
>>              "op_name": "clonerange2",
>>              "collection": "555.0_head",
>>              "src_oid":
>> "#555:3b000000:::OBJ_731aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
>> aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
>> aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa:head#",
>>              "dst_oid": "#555:c7000000:::OBJ_738:7cfc81ab#",
>>              "src_offset": 107520,
>> `            "len": 78336,
>>              "dst_offset": 27648
>>          }
>>      ]
>>
>> This results in potentially invalid blobs for the destination objects, see
>> extent starting at 0x7400 below - it has blob offset = 0 and hence blob isn't
>> aligned with min_alloc_size:
> Oh, right.
>
> Well, the good news is the OSD no longer has any callers for which the src
> and dst clone_range offsets are different, so we could simply assert that
> they match.  That's the simplest fix.  It's party a question of whether
> we expect future cases where we will need to clone between offsets.
> Perhaps we assert for now but don't clean up the interface in case
> we need to backtrack later?
>
> Or we could do something more limited.  The problem below is less about
> min_alloc_size and more that it's not block aligned, I think, right?  We
Perhaps you're right and that's rather about block aligned 
blobs/extents. But I'm a bit worried about having AU-unaligned blobs. 
IMO we don't test such cases much. I think it's better to produce 
extents/blobs similarly to the mainstream write  path, i.e. 
AU-alignment. Other approaches are more error-prone and much harder to 
catch due to their rarity. On the other hand we have a capability to 
modify AU-size on the flight and hence we violate AU-alignment 
requirement this way too....
>   
> could make clone_range fall back to the read/write path if the alignment
> does not match the block device...
Yeah, that makes sense. Especially we do R/W for unaligned head/tail only..

Actually my major concern is a broken store_test for now. Should we 
force aligned-only offsets there at the moment?
>
> sage
>
>> 2017-01-30 03:57:17.802440 7f0036a20700 15 bluestore(bluestore.test_temp_dir)
>> read 555.0_head #555:c7000000:::OBJ_738:7cfc81ab# 0x0~19e0
>> 0
>> 2017-01-30 03:57:17.802448 7f0036a20700 30 bluestore.OnodeSpace(0x55eb49789b78
>> in 0x55eb45dd0620) lookup
>> 2017-01-30 03:57:17.802450 7f0036a20700 30 bluestore.OnodeSpace(0x55eb49789b78
>> in 0x55eb45dd0620) lookup #555:c7000000:::OBJ_738:7cfc81a
>> b# hit 0x55eb49874700
>> 2017-01-30 03:57:17.802453 7f0036a20700 20 bluestore(bluestore.test_temp_dir)
>> _do_read 0x0~19e00 size 0x19e00 (105984)
>> 2017-01-30 03:57:17.802455 7f0036a20700 20 bluestore.onode(0x55eb49874700)
>> flush done
>> 2017-01-30 03:57:17.802456 7f0036a20700 30 bluestore.extentmap(0x55eb49874850)
>> fault_range 0x0~19e00
>> 2017-01-30 03:57:17.802457 7f0036a20700 30 bluestore(bluestore.test_temp_dir)
>> _dump_onode 0x55eb49874700 #555:c7000000:::OBJ_738:7cfc81a
>> b# nid 17377 size 0x19e00 (105984) expected_object_size 2097152
>> expected_write_size 4096 in 0 shards
>> 2017-01-30 03:57:17.802461 7f0036a20700 30 bluestore(bluestore.test_temp_dir)
>> _dump_extent_map  0x6c00~800: 0x3800~800 Blob(0x55eb5047a4
>> 60 blob([0x40190000~4000] csum+has_unused+shared crc32c/0x1000 unused=0xff)
>> ref_map(0x3800~800=1) SharedBlob(0x55eb4c2f49f0 sbid 0x3adf
>> loaded shared_blob(ref_map(0x40190000~4000=2))))
>> 2017-01-30 03:57:17.802469 7f0036a20700 30 bluestore(bluestore.test_temp_dir)
>> _dump_extent_map      csum: [0,0,f1e4ed4a,417bbe91]
>> 2017-01-30 03:57:17.802472 7f0036a20700 30 bluestore(bluestore.test_temp_dir)
>> _dump_extent_map  0x7400~12a00: 0x0~12a00 Blob(0x55eb4dc87
>> b80 blob([0x40194000~18000] csum+shared crc32c/0x1000) ref_map(0x0~12a00=1)
>> SharedBlob(0x55eb4c2f5180 sbid 0x3ae0 loaded shared_blob(ref
>> _map(0x40194000~18000=3))))
>> 2017-01-30 03:57:17.802479 7f0036a20700 30 bluestore(bluestore.test_temp_dir)
>> _dump_extent_map      csum: [d1f849c5,fbe516b8,518379f8,b8
>> b944c8,18b7be23,2b6562d5,51de5770,40988db7,bf7fd7f3,14744e41,eddcb459,639b3350,d038700c,80ffc21e,d7f4edb3,a7ae1a9,f123b379,dfb76444,8ac0
>> 3032,c1cbff33,629e4868,12d9f0ea,5d50ca8c,b7ce671d]
>> 2017-01-30 03:57:17.802484 7f0036a20700 30 bluestore(bluestore.test_temp_dir)
>> _dump_extent_map 0x0~18000 buffer(0x55eb45deb020 spa
>> ce 0x55eb4c2f51d8 0x0~18000 clean)
>> 2017-01-30 03:57:17.802487 7f0036a20700 30 bluestore(bluestore.test_temp_dir)
>> _do_read  hole 0x0~6c00
>> 2017-01-30 03:57:17.802490 7f0036a20700 20 bluestore(bluestore.test_temp_dir)
>> _do_read  blob Blob(0x55eb5047a460 blob([0x40190000~4000]
>> csum+has_unused+shared crc32c/0x1000 unused=0xff) ref_map(0x3800~800=1)
>> SharedBlob(0x55eb4c2f49f0 sbid 0x3adf loaded shared_blob(ref_map
>> (0x40190000~4000=2)))) need 0x3800~800 cache has 0x[]
>> 2017-01-30 03:57:17.802495 7f0036a20700 30 bluestore(bluestore.test_temp_dir)
>> _do_read    will read 0x6c00: 0x3800~800
>> 2017-01-30 03:57:17.802509 7f0036a20700 20 bluestore(bluestore.test_temp_dir)
>> _do_read  blob Blob(0x55eb4dc87b80 blob([0x40194000~18000]
>>   csum+shared crc32c/0x1000) ref_map(0x0~12a00=1) SharedBlob(0x55eb4c2f5180
>> sbid 0x3ae0 loaded shared_blob(ref_map(0x40194000~18000=3))))
>>   need 0x0~12a00 cache has 0x[0~12a00]
>> 2017-01-30 03:57:17.802515 7f0036a20700 30 bluestore(bluestore.test_temp_dir)
>> _do_read    use cache 0x7400: 0x0~12a00
>> 2017-01-30 03:57:17.802519 7f0036a20700 20 bluestore(bluestore.test_temp_dir)
>> _do_read  blob Blob(0x55eb5047a460 blob([0x40190000~4000]
>> csum+has_unused+shared crc32c/0x1000 unused=0xff) ref_map(0x3800~800=1)
>> SharedBlob(0x55eb4c2f49f0 sbid 0x3adf loaded
>> shared_blob(ref_map(0x40190000~4000=2)))) need 0x0x6c00:3800~800
>> 2017-01-30 03:57:17.802529 7f0036a20700 20 bluestore(bluestore.test_temp_dir)
>> _do_read    region 0x6c00: 0x3800~800 reading 0x3000~1000
>>
>> I haven't unwind all the clone_range transformations that lead to this state
>> yet. In the example above source object already has the same unaligned extents
>> issue.
>>
>> But anyway it appears that clone_range neither care nor assert on unaligned
>> input offsets...
>>
>> I can share a couple of logs if needed..
>>
>> Any comments?
>>
>> Thanks,
>>
>> Igor
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: clone_range in BlueStore
  2017-01-30 15:04   ` Igor Fedotov
@ 2017-01-30 15:10     ` Sage Weil
  0 siblings, 0 replies; 4+ messages in thread
From: Sage Weil @ 2017-01-30 15:10 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: ceph-devel

On Mon, 30 Jan 2017, Igor Fedotov wrote:
> 
> On 30.01.2017 17:18, Sage Weil wrote:
> > On Mon, 30 Jan 2017, Igor Fedotov wrote:
> > > Hi Sage,
> > > 
> > > It looks like there is some bug somewhere in
> > > BlueStore/store_test/clone_range.
> > > 
> > > I'm occasionally hitting an assert on mismatched data in read result while
> > > performing SyntheticMatrixCsumVsCompression/2 test case.
> > > 
> > > --- buffer mismatch between offset 0x7400 and 0xa200, total 0x19e00
> > > --- expected:
> > > 00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
> > > |................|
> > > *
> > > 
> > > 00006c00  39 35 31 37 32 37 31 34  34 31 38 39 31 33 37 39
> > > |9517271441891379|
> > > 
> > > <skipped>
> > > 
> > > 00007400  30 31 33 32 34 39 35 35  30 38 32 37 39 32 37 31
> > > |0132495508279271|
> > > 
> > > 00007410  37 37 37 31 31 38 31 37  33 36 32 36 33 33 31 34
> > > |7771181736263314|
> > > 
> > > --- actual:
> > > 00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
> > > |................|
> > > *
> > > 00006c00  39 35 31 37 32 37 31 34  34 31 38 39 31 33 37 39
> > > |9517271441891379|
> > > 
> > > <skipped>
> > > 
> > > 00007400  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
> > > |................|
> > > *
> > > 0000a200  32 35 32 32 38 33 31 34  35 38 37 36 34 35 36 33
> > > |2522831458764563|
> > > 
> > > Multiple runs are required to hit that though...
> > > 
> > > I did some analysis and it seems that there are some issues with
> > > clone_range2
> > > stuff.
> > > 
> > > First of all - do we have any limits prerequisites on src/dst offsets in
> > > this
> > > request? E.g. should they be aligned similarly within alloc unit
> > > boundaries? I
> > > recall some discussions on that a while ago.
> > > 
> > > store_test doesn't have any as far as I can see, e.g. (min_alloc_size =
> > > 0x10000)
> > > 
> > >   "ops": [
> > >          {
> > >              "op_num": 0,
> > >              "op_name": "clonerange2",
> > >              "collection": "555.0_head",
> > >              "src_oid":
> > > "#555:3b000000:::OBJ_731aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
> > > aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
> > > aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa:head#",
> > >              "dst_oid": "#555:c7000000:::OBJ_738:7cfc81ab#",
> > >              "src_offset": 107520,
> > > `            "len": 78336,
> > >              "dst_offset": 27648
> > >          }
> > >      ]
> > > 
> > > This results in potentially invalid blobs for the destination objects, see
> > > extent starting at 0x7400 below - it has blob offset = 0 and hence blob
> > > isn't
> > > aligned with min_alloc_size:
> > Oh, right.
> > 
> > Well, the good news is the OSD no longer has any callers for which the src
> > and dst clone_range offsets are different, so we could simply assert that
> > they match.  That's the simplest fix.  It's party a question of whether
> > we expect future cases where we will need to clone between offsets.
> > Perhaps we assert for now but don't clean up the interface in case
> > we need to backtrack later?
> > 
> > Or we could do something more limited.  The problem below is less about
> > min_alloc_size and more that it's not block aligned, I think, right?  We
> Perhaps you're right and that's rather about block aligned blobs/extents. But
> I'm a bit worried about having AU-unaligned blobs. IMO we don't test such
> cases much. I think it's better to produce extents/blobs similarly to the
> mainstream write  path, i.e. AU-alignment. Other approaches are more
> error-prone and much harder to catch due to their rarity. On the other hand we
> have a capability to modify AU-size on the flight and hence we violate
> AU-alignment requirement this way too....

I think we should either adjust the store_test synthetic thing to adjust 
min_alloc_size randomly, or drop the ability to change it at all.  Leaning 
toward the latter.. let's discuss during standup.

> >   could make clone_range fall back to the read/write path if the alignment
> > does not match the block device...
> Yeah, that makes sense. Especially we do R/W for unaligned head/tail only..
> 
> Actually my major concern is a broken store_test for now. Should we force
> aligned-only offsets there at the moment?

Yeah, let's do that!

sage


> > 
> > sage
> > 
> > > 2017-01-30 03:57:17.802440 7f0036a20700 15
> > > bluestore(bluestore.test_temp_dir)
> > > read 555.0_head #555:c7000000:::OBJ_738:7cfc81ab# 0x0~19e0
> > > 0
> > > 2017-01-30 03:57:17.802448 7f0036a20700 30
> > > bluestore.OnodeSpace(0x55eb49789b78
> > > in 0x55eb45dd0620) lookup
> > > 2017-01-30 03:57:17.802450 7f0036a20700 30
> > > bluestore.OnodeSpace(0x55eb49789b78
> > > in 0x55eb45dd0620) lookup #555:c7000000:::OBJ_738:7cfc81a
> > > b# hit 0x55eb49874700
> > > 2017-01-30 03:57:17.802453 7f0036a20700 20
> > > bluestore(bluestore.test_temp_dir)
> > > _do_read 0x0~19e00 size 0x19e00 (105984)
> > > 2017-01-30 03:57:17.802455 7f0036a20700 20 bluestore.onode(0x55eb49874700)
> > > flush done
> > > 2017-01-30 03:57:17.802456 7f0036a20700 30
> > > bluestore.extentmap(0x55eb49874850)
> > > fault_range 0x0~19e00
> > > 2017-01-30 03:57:17.802457 7f0036a20700 30
> > > bluestore(bluestore.test_temp_dir)
> > > _dump_onode 0x55eb49874700 #555:c7000000:::OBJ_738:7cfc81a
> > > b# nid 17377 size 0x19e00 (105984) expected_object_size 2097152
> > > expected_write_size 4096 in 0 shards
> > > 2017-01-30 03:57:17.802461 7f0036a20700 30
> > > bluestore(bluestore.test_temp_dir)
> > > _dump_extent_map  0x6c00~800: 0x3800~800 Blob(0x55eb5047a4
> > > 60 blob([0x40190000~4000] csum+has_unused+shared crc32c/0x1000
> > > unused=0xff)
> > > ref_map(0x3800~800=1) SharedBlob(0x55eb4c2f49f0 sbid 0x3adf
> > > loaded shared_blob(ref_map(0x40190000~4000=2))))
> > > 2017-01-30 03:57:17.802469 7f0036a20700 30
> > > bluestore(bluestore.test_temp_dir)
> > > _dump_extent_map      csum: [0,0,f1e4ed4a,417bbe91]
> > > 2017-01-30 03:57:17.802472 7f0036a20700 30
> > > bluestore(bluestore.test_temp_dir)
> > > _dump_extent_map  0x7400~12a00: 0x0~12a00 Blob(0x55eb4dc87
> > > b80 blob([0x40194000~18000] csum+shared crc32c/0x1000)
> > > ref_map(0x0~12a00=1)
> > > SharedBlob(0x55eb4c2f5180 sbid 0x3ae0 loaded shared_blob(ref
> > > _map(0x40194000~18000=3))))
> > > 2017-01-30 03:57:17.802479 7f0036a20700 30
> > > bluestore(bluestore.test_temp_dir)
> > > _dump_extent_map      csum: [d1f849c5,fbe516b8,518379f8,b8
> > > b944c8,18b7be23,2b6562d5,51de5770,40988db7,bf7fd7f3,14744e41,eddcb459,639b3350,d038700c,80ffc21e,d7f4edb3,a7ae1a9,f123b379,dfb76444,8ac0
> > > 3032,c1cbff33,629e4868,12d9f0ea,5d50ca8c,b7ce671d]
> > > 2017-01-30 03:57:17.802484 7f0036a20700 30
> > > bluestore(bluestore.test_temp_dir)
> > > _dump_extent_map 0x0~18000 buffer(0x55eb45deb020 spa
> > > ce 0x55eb4c2f51d8 0x0~18000 clean)
> > > 2017-01-30 03:57:17.802487 7f0036a20700 30
> > > bluestore(bluestore.test_temp_dir)
> > > _do_read  hole 0x0~6c00
> > > 2017-01-30 03:57:17.802490 7f0036a20700 20
> > > bluestore(bluestore.test_temp_dir)
> > > _do_read  blob Blob(0x55eb5047a460 blob([0x40190000~4000]
> > > csum+has_unused+shared crc32c/0x1000 unused=0xff) ref_map(0x3800~800=1)
> > > SharedBlob(0x55eb4c2f49f0 sbid 0x3adf loaded shared_blob(ref_map
> > > (0x40190000~4000=2)))) need 0x3800~800 cache has 0x[]
> > > 2017-01-30 03:57:17.802495 7f0036a20700 30
> > > bluestore(bluestore.test_temp_dir)
> > > _do_read    will read 0x6c00: 0x3800~800
> > > 2017-01-30 03:57:17.802509 7f0036a20700 20
> > > bluestore(bluestore.test_temp_dir)
> > > _do_read  blob Blob(0x55eb4dc87b80 blob([0x40194000~18000]
> > >   csum+shared crc32c/0x1000) ref_map(0x0~12a00=1)
> > > SharedBlob(0x55eb4c2f5180
> > > sbid 0x3ae0 loaded shared_blob(ref_map(0x40194000~18000=3))))
> > >   need 0x0~12a00 cache has 0x[0~12a00]
> > > 2017-01-30 03:57:17.802515 7f0036a20700 30
> > > bluestore(bluestore.test_temp_dir)
> > > _do_read    use cache 0x7400: 0x0~12a00
> > > 2017-01-30 03:57:17.802519 7f0036a20700 20
> > > bluestore(bluestore.test_temp_dir)
> > > _do_read  blob Blob(0x55eb5047a460 blob([0x40190000~4000]
> > > csum+has_unused+shared crc32c/0x1000 unused=0xff) ref_map(0x3800~800=1)
> > > SharedBlob(0x55eb4c2f49f0 sbid 0x3adf loaded
> > > shared_blob(ref_map(0x40190000~4000=2)))) need 0x0x6c00:3800~800
> > > 2017-01-30 03:57:17.802529 7f0036a20700 20
> > > bluestore(bluestore.test_temp_dir)
> > > _do_read    region 0x6c00: 0x3800~800 reading 0x3000~1000
> > > 
> > > I haven't unwind all the clone_range transformations that lead to this
> > > state
> > > yet. In the example above source object already has the same unaligned
> > > extents
> > > issue.
> > > 
> > > But anyway it appears that clone_range neither care nor assert on
> > > unaligned
> > > input offsets...
> > > 
> > > I can share a couple of logs if needed..
> > > 
> > > Any comments?
> > > 
> > > Thanks,
> > > 
> > > Igor
> > > 
> > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > 
> > > 
> 
> 

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2017-01-30 15:14 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-01-30 13:49 clone_range in BlueStore Igor Fedotov
2017-01-30 14:18 ` Sage Weil
2017-01-30 15:04   ` Igor Fedotov
2017-01-30 15:10     ` Sage Weil

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.