* clone_range in BlueStore
@ 2017-01-30 13:49 Igor Fedotov
2017-01-30 14:18 ` Sage Weil
0 siblings, 1 reply; 4+ messages in thread
From: Igor Fedotov @ 2017-01-30 13:49 UTC (permalink / raw)
To: Sage Weil, ceph-devel
Hi Sage,
It looks like there is some bug somewhere in
BlueStore/store_test/clone_range.
I'm occasionally hitting an assert on mismatched data in read result
while performing SyntheticMatrixCsumVsCompression/2 test case.
--- buffer mismatch between offset 0x7400 and 0xa200, total 0x19e00
--- expected:
00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
|................|
*
00006c00 39 35 31 37 32 37 31 34 34 31 38 39 31 33 37 39
|9517271441891379|
<skipped>
00007400 30 31 33 32 34 39 35 35 30 38 32 37 39 32 37 31
|0132495508279271|
00007410 37 37 37 31 31 38 31 37 33 36 32 36 33 33 31 34
|7771181736263314|
--- actual:
00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
|................|
*
00006c00 39 35 31 37 32 37 31 34 34 31 38 39 31 33 37 39
|9517271441891379|
<skipped>
00007400 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
|................|
*
0000a200 32 35 32 32 38 33 31 34 35 38 37 36 34 35 36 33
|2522831458764563|
Multiple runs are required to hit that though...
I did some analysis and it seems that there are some issues with
clone_range2 stuff.
First of all - do we have any limits prerequisites on src/dst offsets in
this request? E.g. should they be aligned similarly within alloc unit
boundaries? I recall some discussions on that
a while ago.
store_test doesn't have any as far as I can see, e.g. (min_alloc_size =
0x10000)
"ops": [
{
"op_num": 0,
"op_name": "clonerange2",
"collection": "555.0_head",
"src_oid":
"#555:3b000000:::OBJ_731aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa:head#",
"dst_oid": "#555:c7000000:::OBJ_738:7cfc81ab#",
"src_offset": 107520,
` "len": 78336,
"dst_offset": 27648
}
]
This results in potentially invalid blobs for the destination objects,
see extent starting at 0x7400 below - it has blob offset = 0 and hence
blob isn't aligned with min_alloc_size:
2017-01-30 03:57:17.802440 7f0036a20700 15
bluestore(bluestore.test_temp_dir) read 555.0_head
#555:c7000000:::OBJ_738:7cfc81ab# 0x0~19e0
0
2017-01-30 03:57:17.802448 7f0036a20700 30
bluestore.OnodeSpace(0x55eb49789b78 in 0x55eb45dd0620) lookup
2017-01-30 03:57:17.802450 7f0036a20700 30
bluestore.OnodeSpace(0x55eb49789b78 in 0x55eb45dd0620) lookup
#555:c7000000:::OBJ_738:7cfc81a
b# hit 0x55eb49874700
2017-01-30 03:57:17.802453 7f0036a20700 20
bluestore(bluestore.test_temp_dir) _do_read 0x0~19e00 size 0x19e00 (105984)
2017-01-30 03:57:17.802455 7f0036a20700 20
bluestore.onode(0x55eb49874700) flush done
2017-01-30 03:57:17.802456 7f0036a20700 30
bluestore.extentmap(0x55eb49874850) fault_range 0x0~19e00
2017-01-30 03:57:17.802457 7f0036a20700 30
bluestore(bluestore.test_temp_dir) _dump_onode 0x55eb49874700
#555:c7000000:::OBJ_738:7cfc81a
b# nid 17377 size 0x19e00 (105984) expected_object_size 2097152
expected_write_size 4096 in 0 shards
2017-01-30 03:57:17.802461 7f0036a20700 30
bluestore(bluestore.test_temp_dir) _dump_extent_map 0x6c00~800:
0x3800~800 Blob(0x55eb5047a4
60 blob([0x40190000~4000] csum+has_unused+shared crc32c/0x1000
unused=0xff) ref_map(0x3800~800=1) SharedBlob(0x55eb4c2f49f0 sbid 0x3adf
loaded shared_blob(ref_map(0x40190000~4000=2))))
2017-01-30 03:57:17.802469 7f0036a20700 30
bluestore(bluestore.test_temp_dir) _dump_extent_map csum:
[0,0,f1e4ed4a,417bbe91]
2017-01-30 03:57:17.802472 7f0036a20700 30
bluestore(bluestore.test_temp_dir) _dump_extent_map 0x7400~12a00:
0x0~12a00 Blob(0x55eb4dc87
b80 blob([0x40194000~18000] csum+shared crc32c/0x1000)
ref_map(0x0~12a00=1) SharedBlob(0x55eb4c2f5180 sbid 0x3ae0 loaded
shared_blob(ref
_map(0x40194000~18000=3))))
2017-01-30 03:57:17.802479 7f0036a20700 30
bluestore(bluestore.test_temp_dir) _dump_extent_map csum:
[d1f849c5,fbe516b8,518379f8,b8
b944c8,18b7be23,2b6562d5,51de5770,40988db7,bf7fd7f3,14744e41,eddcb459,639b3350,d038700c,80ffc21e,d7f4edb3,a7ae1a9,f123b379,dfb76444,8ac0
3032,c1cbff33,629e4868,12d9f0ea,5d50ca8c,b7ce671d]
2017-01-30 03:57:17.802484 7f0036a20700 30
bluestore(bluestore.test_temp_dir) _dump_extent_map 0x0~18000
buffer(0x55eb45deb020 spa
ce 0x55eb4c2f51d8 0x0~18000 clean)
2017-01-30 03:57:17.802487 7f0036a20700 30
bluestore(bluestore.test_temp_dir) _do_read hole 0x0~6c00
2017-01-30 03:57:17.802490 7f0036a20700 20
bluestore(bluestore.test_temp_dir) _do_read blob Blob(0x55eb5047a460
blob([0x40190000~4000]
csum+has_unused+shared crc32c/0x1000 unused=0xff) ref_map(0x3800~800=1)
SharedBlob(0x55eb4c2f49f0 sbid 0x3adf loaded shared_blob(ref_map
(0x40190000~4000=2)))) need 0x3800~800 cache has 0x[]
2017-01-30 03:57:17.802495 7f0036a20700 30
bluestore(bluestore.test_temp_dir) _do_read will read 0x6c00: 0x3800~800
2017-01-30 03:57:17.802509 7f0036a20700 20
bluestore(bluestore.test_temp_dir) _do_read blob Blob(0x55eb4dc87b80
blob([0x40194000~18000]
csum+shared crc32c/0x1000) ref_map(0x0~12a00=1)
SharedBlob(0x55eb4c2f5180 sbid 0x3ae0 loaded
shared_blob(ref_map(0x40194000~18000=3))))
need 0x0~12a00 cache has 0x[0~12a00]
2017-01-30 03:57:17.802515 7f0036a20700 30
bluestore(bluestore.test_temp_dir) _do_read use cache 0x7400: 0x0~12a00
2017-01-30 03:57:17.802519 7f0036a20700 20
bluestore(bluestore.test_temp_dir) _do_read blob Blob(0x55eb5047a460
blob([0x40190000~4000] csum+has_unused+shared crc32c/0x1000 unused=0xff)
ref_map(0x3800~800=1) SharedBlob(0x55eb4c2f49f0 sbid 0x3adf loaded
shared_blob(ref_map(0x40190000~4000=2)))) need 0x0x6c00:3800~800
2017-01-30 03:57:17.802529 7f0036a20700 20
bluestore(bluestore.test_temp_dir) _do_read region 0x6c00: 0x3800~800
reading 0x3000~1000
I haven't unwind all the clone_range transformations that lead to this
state yet. In the example above source object already has the same
unaligned extents issue.
But anyway it appears that clone_range neither care nor assert on
unaligned input offsets...
I can share a couple of logs if needed..
Any comments?
Thanks,
Igor
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: clone_range in BlueStore
2017-01-30 13:49 clone_range in BlueStore Igor Fedotov
@ 2017-01-30 14:18 ` Sage Weil
2017-01-30 15:04 ` Igor Fedotov
0 siblings, 1 reply; 4+ messages in thread
From: Sage Weil @ 2017-01-30 14:18 UTC (permalink / raw)
To: Igor Fedotov; +Cc: ceph-devel
On Mon, 30 Jan 2017, Igor Fedotov wrote:
> Hi Sage,
>
> It looks like there is some bug somewhere in BlueStore/store_test/clone_range.
>
> I'm occasionally hitting an assert on mismatched data in read result while
> performing SyntheticMatrixCsumVsCompression/2 test case.
>
> --- buffer mismatch between offset 0x7400 and 0xa200, total 0x19e00
> --- expected:
> 00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> *
>
> 00006c00 39 35 31 37 32 37 31 34 34 31 38 39 31 33 37 39 |9517271441891379|
>
> <skipped>
>
> 00007400 30 31 33 32 34 39 35 35 30 38 32 37 39 32 37 31 |0132495508279271|
>
> 00007410 37 37 37 31 31 38 31 37 33 36 32 36 33 33 31 34 |7771181736263314|
>
> --- actual:
> 00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> *
> 00006c00 39 35 31 37 32 37 31 34 34 31 38 39 31 33 37 39 |9517271441891379|
>
> <skipped>
>
> 00007400 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> *
> 0000a200 32 35 32 32 38 33 31 34 35 38 37 36 34 35 36 33 |2522831458764563|
>
> Multiple runs are required to hit that though...
>
> I did some analysis and it seems that there are some issues with clone_range2
> stuff.
>
> First of all - do we have any limits prerequisites on src/dst offsets in this
> request? E.g. should they be aligned similarly within alloc unit boundaries? I
> recall some discussions on that a while ago.
>
> store_test doesn't have any as far as I can see, e.g. (min_alloc_size =
> 0x10000)
>
> "ops": [
> {
> "op_num": 0,
> "op_name": "clonerange2",
> "collection": "555.0_head",
> "src_oid":
> "#555:3b000000:::OBJ_731aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
> aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
> aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa:head#",
> "dst_oid": "#555:c7000000:::OBJ_738:7cfc81ab#",
> "src_offset": 107520,
> ` "len": 78336,
> "dst_offset": 27648
> }
> ]
>
> This results in potentially invalid blobs for the destination objects, see
> extent starting at 0x7400 below - it has blob offset = 0 and hence blob isn't
> aligned with min_alloc_size:
Oh, right.
Well, the good news is the OSD no longer has any callers for which the src
and dst clone_range offsets are different, so we could simply assert that
they match. That's the simplest fix. It's party a question of whether
we expect future cases where we will need to clone between offsets.
Perhaps we assert for now but don't clean up the interface in case
we need to backtrack later?
Or we could do something more limited. The problem below is less about
min_alloc_size and more that it's not block aligned, I think, right? We
could make clone_range fall back to the read/write path if the alignment
does not match the block device...
sage
>
> 2017-01-30 03:57:17.802440 7f0036a20700 15 bluestore(bluestore.test_temp_dir)
> read 555.0_head #555:c7000000:::OBJ_738:7cfc81ab# 0x0~19e0
> 0
> 2017-01-30 03:57:17.802448 7f0036a20700 30 bluestore.OnodeSpace(0x55eb49789b78
> in 0x55eb45dd0620) lookup
> 2017-01-30 03:57:17.802450 7f0036a20700 30 bluestore.OnodeSpace(0x55eb49789b78
> in 0x55eb45dd0620) lookup #555:c7000000:::OBJ_738:7cfc81a
> b# hit 0x55eb49874700
> 2017-01-30 03:57:17.802453 7f0036a20700 20 bluestore(bluestore.test_temp_dir)
> _do_read 0x0~19e00 size 0x19e00 (105984)
> 2017-01-30 03:57:17.802455 7f0036a20700 20 bluestore.onode(0x55eb49874700)
> flush done
> 2017-01-30 03:57:17.802456 7f0036a20700 30 bluestore.extentmap(0x55eb49874850)
> fault_range 0x0~19e00
> 2017-01-30 03:57:17.802457 7f0036a20700 30 bluestore(bluestore.test_temp_dir)
> _dump_onode 0x55eb49874700 #555:c7000000:::OBJ_738:7cfc81a
> b# nid 17377 size 0x19e00 (105984) expected_object_size 2097152
> expected_write_size 4096 in 0 shards
> 2017-01-30 03:57:17.802461 7f0036a20700 30 bluestore(bluestore.test_temp_dir)
> _dump_extent_map 0x6c00~800: 0x3800~800 Blob(0x55eb5047a4
> 60 blob([0x40190000~4000] csum+has_unused+shared crc32c/0x1000 unused=0xff)
> ref_map(0x3800~800=1) SharedBlob(0x55eb4c2f49f0 sbid 0x3adf
> loaded shared_blob(ref_map(0x40190000~4000=2))))
> 2017-01-30 03:57:17.802469 7f0036a20700 30 bluestore(bluestore.test_temp_dir)
> _dump_extent_map csum: [0,0,f1e4ed4a,417bbe91]
> 2017-01-30 03:57:17.802472 7f0036a20700 30 bluestore(bluestore.test_temp_dir)
> _dump_extent_map 0x7400~12a00: 0x0~12a00 Blob(0x55eb4dc87
> b80 blob([0x40194000~18000] csum+shared crc32c/0x1000) ref_map(0x0~12a00=1)
> SharedBlob(0x55eb4c2f5180 sbid 0x3ae0 loaded shared_blob(ref
> _map(0x40194000~18000=3))))
> 2017-01-30 03:57:17.802479 7f0036a20700 30 bluestore(bluestore.test_temp_dir)
> _dump_extent_map csum: [d1f849c5,fbe516b8,518379f8,b8
> b944c8,18b7be23,2b6562d5,51de5770,40988db7,bf7fd7f3,14744e41,eddcb459,639b3350,d038700c,80ffc21e,d7f4edb3,a7ae1a9,f123b379,dfb76444,8ac0
> 3032,c1cbff33,629e4868,12d9f0ea,5d50ca8c,b7ce671d]
> 2017-01-30 03:57:17.802484 7f0036a20700 30 bluestore(bluestore.test_temp_dir)
> _dump_extent_map 0x0~18000 buffer(0x55eb45deb020 spa
> ce 0x55eb4c2f51d8 0x0~18000 clean)
> 2017-01-30 03:57:17.802487 7f0036a20700 30 bluestore(bluestore.test_temp_dir)
> _do_read hole 0x0~6c00
> 2017-01-30 03:57:17.802490 7f0036a20700 20 bluestore(bluestore.test_temp_dir)
> _do_read blob Blob(0x55eb5047a460 blob([0x40190000~4000]
> csum+has_unused+shared crc32c/0x1000 unused=0xff) ref_map(0x3800~800=1)
> SharedBlob(0x55eb4c2f49f0 sbid 0x3adf loaded shared_blob(ref_map
> (0x40190000~4000=2)))) need 0x3800~800 cache has 0x[]
> 2017-01-30 03:57:17.802495 7f0036a20700 30 bluestore(bluestore.test_temp_dir)
> _do_read will read 0x6c00: 0x3800~800
> 2017-01-30 03:57:17.802509 7f0036a20700 20 bluestore(bluestore.test_temp_dir)
> _do_read blob Blob(0x55eb4dc87b80 blob([0x40194000~18000]
> csum+shared crc32c/0x1000) ref_map(0x0~12a00=1) SharedBlob(0x55eb4c2f5180
> sbid 0x3ae0 loaded shared_blob(ref_map(0x40194000~18000=3))))
> need 0x0~12a00 cache has 0x[0~12a00]
> 2017-01-30 03:57:17.802515 7f0036a20700 30 bluestore(bluestore.test_temp_dir)
> _do_read use cache 0x7400: 0x0~12a00
> 2017-01-30 03:57:17.802519 7f0036a20700 20 bluestore(bluestore.test_temp_dir)
> _do_read blob Blob(0x55eb5047a460 blob([0x40190000~4000]
> csum+has_unused+shared crc32c/0x1000 unused=0xff) ref_map(0x3800~800=1)
> SharedBlob(0x55eb4c2f49f0 sbid 0x3adf loaded
> shared_blob(ref_map(0x40190000~4000=2)))) need 0x0x6c00:3800~800
> 2017-01-30 03:57:17.802529 7f0036a20700 20 bluestore(bluestore.test_temp_dir)
> _do_read region 0x6c00: 0x3800~800 reading 0x3000~1000
>
> I haven't unwind all the clone_range transformations that lead to this state
> yet. In the example above source object already has the same unaligned extents
> issue.
>
> But anyway it appears that clone_range neither care nor assert on unaligned
> input offsets...
>
> I can share a couple of logs if needed..
>
> Any comments?
>
> Thanks,
>
> Igor
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: clone_range in BlueStore
2017-01-30 14:18 ` Sage Weil
@ 2017-01-30 15:04 ` Igor Fedotov
2017-01-30 15:10 ` Sage Weil
0 siblings, 1 reply; 4+ messages in thread
From: Igor Fedotov @ 2017-01-30 15:04 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel
On 30.01.2017 17:18, Sage Weil wrote:
> On Mon, 30 Jan 2017, Igor Fedotov wrote:
>> Hi Sage,
>>
>> It looks like there is some bug somewhere in BlueStore/store_test/clone_range.
>>
>> I'm occasionally hitting an assert on mismatched data in read result while
>> performing SyntheticMatrixCsumVsCompression/2 test case.
>>
>> --- buffer mismatch between offset 0x7400 and 0xa200, total 0x19e00
>> --- expected:
>> 00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
>> *
>>
>> 00006c00 39 35 31 37 32 37 31 34 34 31 38 39 31 33 37 39 |9517271441891379|
>>
>> <skipped>
>>
>> 00007400 30 31 33 32 34 39 35 35 30 38 32 37 39 32 37 31 |0132495508279271|
>>
>> 00007410 37 37 37 31 31 38 31 37 33 36 32 36 33 33 31 34 |7771181736263314|
>>
>> --- actual:
>> 00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
>> *
>> 00006c00 39 35 31 37 32 37 31 34 34 31 38 39 31 33 37 39 |9517271441891379|
>>
>> <skipped>
>>
>> 00007400 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
>> *
>> 0000a200 32 35 32 32 38 33 31 34 35 38 37 36 34 35 36 33 |2522831458764563|
>>
>> Multiple runs are required to hit that though...
>>
>> I did some analysis and it seems that there are some issues with clone_range2
>> stuff.
>>
>> First of all - do we have any limits prerequisites on src/dst offsets in this
>> request? E.g. should they be aligned similarly within alloc unit boundaries? I
>> recall some discussions on that a while ago.
>>
>> store_test doesn't have any as far as I can see, e.g. (min_alloc_size =
>> 0x10000)
>>
>> "ops": [
>> {
>> "op_num": 0,
>> "op_name": "clonerange2",
>> "collection": "555.0_head",
>> "src_oid":
>> "#555:3b000000:::OBJ_731aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
>> aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
>> aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa:head#",
>> "dst_oid": "#555:c7000000:::OBJ_738:7cfc81ab#",
>> "src_offset": 107520,
>> ` "len": 78336,
>> "dst_offset": 27648
>> }
>> ]
>>
>> This results in potentially invalid blobs for the destination objects, see
>> extent starting at 0x7400 below - it has blob offset = 0 and hence blob isn't
>> aligned with min_alloc_size:
> Oh, right.
>
> Well, the good news is the OSD no longer has any callers for which the src
> and dst clone_range offsets are different, so we could simply assert that
> they match. That's the simplest fix. It's party a question of whether
> we expect future cases where we will need to clone between offsets.
> Perhaps we assert for now but don't clean up the interface in case
> we need to backtrack later?
>
> Or we could do something more limited. The problem below is less about
> min_alloc_size and more that it's not block aligned, I think, right? We
Perhaps you're right and that's rather about block aligned
blobs/extents. But I'm a bit worried about having AU-unaligned blobs.
IMO we don't test such cases much. I think it's better to produce
extents/blobs similarly to the mainstream write path, i.e.
AU-alignment. Other approaches are more error-prone and much harder to
catch due to their rarity. On the other hand we have a capability to
modify AU-size on the flight and hence we violate AU-alignment
requirement this way too....
>
> could make clone_range fall back to the read/write path if the alignment
> does not match the block device...
Yeah, that makes sense. Especially we do R/W for unaligned head/tail only..
Actually my major concern is a broken store_test for now. Should we
force aligned-only offsets there at the moment?
>
> sage
>
>> 2017-01-30 03:57:17.802440 7f0036a20700 15 bluestore(bluestore.test_temp_dir)
>> read 555.0_head #555:c7000000:::OBJ_738:7cfc81ab# 0x0~19e0
>> 0
>> 2017-01-30 03:57:17.802448 7f0036a20700 30 bluestore.OnodeSpace(0x55eb49789b78
>> in 0x55eb45dd0620) lookup
>> 2017-01-30 03:57:17.802450 7f0036a20700 30 bluestore.OnodeSpace(0x55eb49789b78
>> in 0x55eb45dd0620) lookup #555:c7000000:::OBJ_738:7cfc81a
>> b# hit 0x55eb49874700
>> 2017-01-30 03:57:17.802453 7f0036a20700 20 bluestore(bluestore.test_temp_dir)
>> _do_read 0x0~19e00 size 0x19e00 (105984)
>> 2017-01-30 03:57:17.802455 7f0036a20700 20 bluestore.onode(0x55eb49874700)
>> flush done
>> 2017-01-30 03:57:17.802456 7f0036a20700 30 bluestore.extentmap(0x55eb49874850)
>> fault_range 0x0~19e00
>> 2017-01-30 03:57:17.802457 7f0036a20700 30 bluestore(bluestore.test_temp_dir)
>> _dump_onode 0x55eb49874700 #555:c7000000:::OBJ_738:7cfc81a
>> b# nid 17377 size 0x19e00 (105984) expected_object_size 2097152
>> expected_write_size 4096 in 0 shards
>> 2017-01-30 03:57:17.802461 7f0036a20700 30 bluestore(bluestore.test_temp_dir)
>> _dump_extent_map 0x6c00~800: 0x3800~800 Blob(0x55eb5047a4
>> 60 blob([0x40190000~4000] csum+has_unused+shared crc32c/0x1000 unused=0xff)
>> ref_map(0x3800~800=1) SharedBlob(0x55eb4c2f49f0 sbid 0x3adf
>> loaded shared_blob(ref_map(0x40190000~4000=2))))
>> 2017-01-30 03:57:17.802469 7f0036a20700 30 bluestore(bluestore.test_temp_dir)
>> _dump_extent_map csum: [0,0,f1e4ed4a,417bbe91]
>> 2017-01-30 03:57:17.802472 7f0036a20700 30 bluestore(bluestore.test_temp_dir)
>> _dump_extent_map 0x7400~12a00: 0x0~12a00 Blob(0x55eb4dc87
>> b80 blob([0x40194000~18000] csum+shared crc32c/0x1000) ref_map(0x0~12a00=1)
>> SharedBlob(0x55eb4c2f5180 sbid 0x3ae0 loaded shared_blob(ref
>> _map(0x40194000~18000=3))))
>> 2017-01-30 03:57:17.802479 7f0036a20700 30 bluestore(bluestore.test_temp_dir)
>> _dump_extent_map csum: [d1f849c5,fbe516b8,518379f8,b8
>> b944c8,18b7be23,2b6562d5,51de5770,40988db7,bf7fd7f3,14744e41,eddcb459,639b3350,d038700c,80ffc21e,d7f4edb3,a7ae1a9,f123b379,dfb76444,8ac0
>> 3032,c1cbff33,629e4868,12d9f0ea,5d50ca8c,b7ce671d]
>> 2017-01-30 03:57:17.802484 7f0036a20700 30 bluestore(bluestore.test_temp_dir)
>> _dump_extent_map 0x0~18000 buffer(0x55eb45deb020 spa
>> ce 0x55eb4c2f51d8 0x0~18000 clean)
>> 2017-01-30 03:57:17.802487 7f0036a20700 30 bluestore(bluestore.test_temp_dir)
>> _do_read hole 0x0~6c00
>> 2017-01-30 03:57:17.802490 7f0036a20700 20 bluestore(bluestore.test_temp_dir)
>> _do_read blob Blob(0x55eb5047a460 blob([0x40190000~4000]
>> csum+has_unused+shared crc32c/0x1000 unused=0xff) ref_map(0x3800~800=1)
>> SharedBlob(0x55eb4c2f49f0 sbid 0x3adf loaded shared_blob(ref_map
>> (0x40190000~4000=2)))) need 0x3800~800 cache has 0x[]
>> 2017-01-30 03:57:17.802495 7f0036a20700 30 bluestore(bluestore.test_temp_dir)
>> _do_read will read 0x6c00: 0x3800~800
>> 2017-01-30 03:57:17.802509 7f0036a20700 20 bluestore(bluestore.test_temp_dir)
>> _do_read blob Blob(0x55eb4dc87b80 blob([0x40194000~18000]
>> csum+shared crc32c/0x1000) ref_map(0x0~12a00=1) SharedBlob(0x55eb4c2f5180
>> sbid 0x3ae0 loaded shared_blob(ref_map(0x40194000~18000=3))))
>> need 0x0~12a00 cache has 0x[0~12a00]
>> 2017-01-30 03:57:17.802515 7f0036a20700 30 bluestore(bluestore.test_temp_dir)
>> _do_read use cache 0x7400: 0x0~12a00
>> 2017-01-30 03:57:17.802519 7f0036a20700 20 bluestore(bluestore.test_temp_dir)
>> _do_read blob Blob(0x55eb5047a460 blob([0x40190000~4000]
>> csum+has_unused+shared crc32c/0x1000 unused=0xff) ref_map(0x3800~800=1)
>> SharedBlob(0x55eb4c2f49f0 sbid 0x3adf loaded
>> shared_blob(ref_map(0x40190000~4000=2)))) need 0x0x6c00:3800~800
>> 2017-01-30 03:57:17.802529 7f0036a20700 20 bluestore(bluestore.test_temp_dir)
>> _do_read region 0x6c00: 0x3800~800 reading 0x3000~1000
>>
>> I haven't unwind all the clone_range transformations that lead to this state
>> yet. In the example above source object already has the same unaligned extents
>> issue.
>>
>> But anyway it appears that clone_range neither care nor assert on unaligned
>> input offsets...
>>
>> I can share a couple of logs if needed..
>>
>> Any comments?
>>
>> Thanks,
>>
>> Igor
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>>
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: clone_range in BlueStore
2017-01-30 15:04 ` Igor Fedotov
@ 2017-01-30 15:10 ` Sage Weil
0 siblings, 0 replies; 4+ messages in thread
From: Sage Weil @ 2017-01-30 15:10 UTC (permalink / raw)
To: Igor Fedotov; +Cc: ceph-devel
On Mon, 30 Jan 2017, Igor Fedotov wrote:
>
> On 30.01.2017 17:18, Sage Weil wrote:
> > On Mon, 30 Jan 2017, Igor Fedotov wrote:
> > > Hi Sage,
> > >
> > > It looks like there is some bug somewhere in
> > > BlueStore/store_test/clone_range.
> > >
> > > I'm occasionally hitting an assert on mismatched data in read result while
> > > performing SyntheticMatrixCsumVsCompression/2 test case.
> > >
> > > --- buffer mismatch between offset 0x7400 and 0xa200, total 0x19e00
> > > --- expected:
> > > 00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > |................|
> > > *
> > >
> > > 00006c00 39 35 31 37 32 37 31 34 34 31 38 39 31 33 37 39
> > > |9517271441891379|
> > >
> > > <skipped>
> > >
> > > 00007400 30 31 33 32 34 39 35 35 30 38 32 37 39 32 37 31
> > > |0132495508279271|
> > >
> > > 00007410 37 37 37 31 31 38 31 37 33 36 32 36 33 33 31 34
> > > |7771181736263314|
> > >
> > > --- actual:
> > > 00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > |................|
> > > *
> > > 00006c00 39 35 31 37 32 37 31 34 34 31 38 39 31 33 37 39
> > > |9517271441891379|
> > >
> > > <skipped>
> > >
> > > 00007400 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > |................|
> > > *
> > > 0000a200 32 35 32 32 38 33 31 34 35 38 37 36 34 35 36 33
> > > |2522831458764563|
> > >
> > > Multiple runs are required to hit that though...
> > >
> > > I did some analysis and it seems that there are some issues with
> > > clone_range2
> > > stuff.
> > >
> > > First of all - do we have any limits prerequisites on src/dst offsets in
> > > this
> > > request? E.g. should they be aligned similarly within alloc unit
> > > boundaries? I
> > > recall some discussions on that a while ago.
> > >
> > > store_test doesn't have any as far as I can see, e.g. (min_alloc_size =
> > > 0x10000)
> > >
> > > "ops": [
> > > {
> > > "op_num": 0,
> > > "op_name": "clonerange2",
> > > "collection": "555.0_head",
> > > "src_oid":
> > > "#555:3b000000:::OBJ_731aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
> > > aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
> > > aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa:head#",
> > > "dst_oid": "#555:c7000000:::OBJ_738:7cfc81ab#",
> > > "src_offset": 107520,
> > > ` "len": 78336,
> > > "dst_offset": 27648
> > > }
> > > ]
> > >
> > > This results in potentially invalid blobs for the destination objects, see
> > > extent starting at 0x7400 below - it has blob offset = 0 and hence blob
> > > isn't
> > > aligned with min_alloc_size:
> > Oh, right.
> >
> > Well, the good news is the OSD no longer has any callers for which the src
> > and dst clone_range offsets are different, so we could simply assert that
> > they match. That's the simplest fix. It's party a question of whether
> > we expect future cases where we will need to clone between offsets.
> > Perhaps we assert for now but don't clean up the interface in case
> > we need to backtrack later?
> >
> > Or we could do something more limited. The problem below is less about
> > min_alloc_size and more that it's not block aligned, I think, right? We
> Perhaps you're right and that's rather about block aligned blobs/extents. But
> I'm a bit worried about having AU-unaligned blobs. IMO we don't test such
> cases much. I think it's better to produce extents/blobs similarly to the
> mainstream write path, i.e. AU-alignment. Other approaches are more
> error-prone and much harder to catch due to their rarity. On the other hand we
> have a capability to modify AU-size on the flight and hence we violate
> AU-alignment requirement this way too....
I think we should either adjust the store_test synthetic thing to adjust
min_alloc_size randomly, or drop the ability to change it at all. Leaning
toward the latter.. let's discuss during standup.
> > could make clone_range fall back to the read/write path if the alignment
> > does not match the block device...
> Yeah, that makes sense. Especially we do R/W for unaligned head/tail only..
>
> Actually my major concern is a broken store_test for now. Should we force
> aligned-only offsets there at the moment?
Yeah, let's do that!
sage
> >
> > sage
> >
> > > 2017-01-30 03:57:17.802440 7f0036a20700 15
> > > bluestore(bluestore.test_temp_dir)
> > > read 555.0_head #555:c7000000:::OBJ_738:7cfc81ab# 0x0~19e0
> > > 0
> > > 2017-01-30 03:57:17.802448 7f0036a20700 30
> > > bluestore.OnodeSpace(0x55eb49789b78
> > > in 0x55eb45dd0620) lookup
> > > 2017-01-30 03:57:17.802450 7f0036a20700 30
> > > bluestore.OnodeSpace(0x55eb49789b78
> > > in 0x55eb45dd0620) lookup #555:c7000000:::OBJ_738:7cfc81a
> > > b# hit 0x55eb49874700
> > > 2017-01-30 03:57:17.802453 7f0036a20700 20
> > > bluestore(bluestore.test_temp_dir)
> > > _do_read 0x0~19e00 size 0x19e00 (105984)
> > > 2017-01-30 03:57:17.802455 7f0036a20700 20 bluestore.onode(0x55eb49874700)
> > > flush done
> > > 2017-01-30 03:57:17.802456 7f0036a20700 30
> > > bluestore.extentmap(0x55eb49874850)
> > > fault_range 0x0~19e00
> > > 2017-01-30 03:57:17.802457 7f0036a20700 30
> > > bluestore(bluestore.test_temp_dir)
> > > _dump_onode 0x55eb49874700 #555:c7000000:::OBJ_738:7cfc81a
> > > b# nid 17377 size 0x19e00 (105984) expected_object_size 2097152
> > > expected_write_size 4096 in 0 shards
> > > 2017-01-30 03:57:17.802461 7f0036a20700 30
> > > bluestore(bluestore.test_temp_dir)
> > > _dump_extent_map 0x6c00~800: 0x3800~800 Blob(0x55eb5047a4
> > > 60 blob([0x40190000~4000] csum+has_unused+shared crc32c/0x1000
> > > unused=0xff)
> > > ref_map(0x3800~800=1) SharedBlob(0x55eb4c2f49f0 sbid 0x3adf
> > > loaded shared_blob(ref_map(0x40190000~4000=2))))
> > > 2017-01-30 03:57:17.802469 7f0036a20700 30
> > > bluestore(bluestore.test_temp_dir)
> > > _dump_extent_map csum: [0,0,f1e4ed4a,417bbe91]
> > > 2017-01-30 03:57:17.802472 7f0036a20700 30
> > > bluestore(bluestore.test_temp_dir)
> > > _dump_extent_map 0x7400~12a00: 0x0~12a00 Blob(0x55eb4dc87
> > > b80 blob([0x40194000~18000] csum+shared crc32c/0x1000)
> > > ref_map(0x0~12a00=1)
> > > SharedBlob(0x55eb4c2f5180 sbid 0x3ae0 loaded shared_blob(ref
> > > _map(0x40194000~18000=3))))
> > > 2017-01-30 03:57:17.802479 7f0036a20700 30
> > > bluestore(bluestore.test_temp_dir)
> > > _dump_extent_map csum: [d1f849c5,fbe516b8,518379f8,b8
> > > b944c8,18b7be23,2b6562d5,51de5770,40988db7,bf7fd7f3,14744e41,eddcb459,639b3350,d038700c,80ffc21e,d7f4edb3,a7ae1a9,f123b379,dfb76444,8ac0
> > > 3032,c1cbff33,629e4868,12d9f0ea,5d50ca8c,b7ce671d]
> > > 2017-01-30 03:57:17.802484 7f0036a20700 30
> > > bluestore(bluestore.test_temp_dir)
> > > _dump_extent_map 0x0~18000 buffer(0x55eb45deb020 spa
> > > ce 0x55eb4c2f51d8 0x0~18000 clean)
> > > 2017-01-30 03:57:17.802487 7f0036a20700 30
> > > bluestore(bluestore.test_temp_dir)
> > > _do_read hole 0x0~6c00
> > > 2017-01-30 03:57:17.802490 7f0036a20700 20
> > > bluestore(bluestore.test_temp_dir)
> > > _do_read blob Blob(0x55eb5047a460 blob([0x40190000~4000]
> > > csum+has_unused+shared crc32c/0x1000 unused=0xff) ref_map(0x3800~800=1)
> > > SharedBlob(0x55eb4c2f49f0 sbid 0x3adf loaded shared_blob(ref_map
> > > (0x40190000~4000=2)))) need 0x3800~800 cache has 0x[]
> > > 2017-01-30 03:57:17.802495 7f0036a20700 30
> > > bluestore(bluestore.test_temp_dir)
> > > _do_read will read 0x6c00: 0x3800~800
> > > 2017-01-30 03:57:17.802509 7f0036a20700 20
> > > bluestore(bluestore.test_temp_dir)
> > > _do_read blob Blob(0x55eb4dc87b80 blob([0x40194000~18000]
> > > csum+shared crc32c/0x1000) ref_map(0x0~12a00=1)
> > > SharedBlob(0x55eb4c2f5180
> > > sbid 0x3ae0 loaded shared_blob(ref_map(0x40194000~18000=3))))
> > > need 0x0~12a00 cache has 0x[0~12a00]
> > > 2017-01-30 03:57:17.802515 7f0036a20700 30
> > > bluestore(bluestore.test_temp_dir)
> > > _do_read use cache 0x7400: 0x0~12a00
> > > 2017-01-30 03:57:17.802519 7f0036a20700 20
> > > bluestore(bluestore.test_temp_dir)
> > > _do_read blob Blob(0x55eb5047a460 blob([0x40190000~4000]
> > > csum+has_unused+shared crc32c/0x1000 unused=0xff) ref_map(0x3800~800=1)
> > > SharedBlob(0x55eb4c2f49f0 sbid 0x3adf loaded
> > > shared_blob(ref_map(0x40190000~4000=2)))) need 0x0x6c00:3800~800
> > > 2017-01-30 03:57:17.802529 7f0036a20700 20
> > > bluestore(bluestore.test_temp_dir)
> > > _do_read region 0x6c00: 0x3800~800 reading 0x3000~1000
> > >
> > > I haven't unwind all the clone_range transformations that lead to this
> > > state
> > > yet. In the example above source object already has the same unaligned
> > > extents
> > > issue.
> > >
> > > But anyway it appears that clone_range neither care nor assert on
> > > unaligned
> > > input offsets...
> > >
> > > I can share a couple of logs if needed..
> > >
> > > Any comments?
> > >
> > > Thanks,
> > >
> > > Igor
> > >
> > >
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > >
> > >
>
>
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2017-01-30 15:14 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-01-30 13:49 clone_range in BlueStore Igor Fedotov
2017-01-30 14:18 ` Sage Weil
2017-01-30 15:04 ` Igor Fedotov
2017-01-30 15:10 ` Sage Weil
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.