* trying to get osd-scrub-repair to work on my jenkins builder
@ 2016-11-04 17:09 Willem Jan Withagen
  2016-11-04 19:39 ` David Zafman
  2016-11-07 12:40 ` Willem Jan Withagen
  0 siblings, 2 replies; 10+ messages in thread
From: Willem Jan Withagen @ 2016-11-04 17:09 UTC (permalink / raw)
  To: Ceph Development

Hi,

On my workstation I have this test completing just fine. But on my
Jenkins builder it keeps running into a state where it does not make
any progress.
Any particulars I should look for? I can let this run for an hour, but
pg 2.0 stays active+degraded, and the script requires it to be clean,
and the pgmap version is steadily incrementing.

What should be in the log files that points me to the problem?

Thanx,
--WjW


    cluster 667960a1-a2ae-11e6-a834-69c386980813
     health HEALTH_WARN
            1 pgs degraded
            1 pgs stuck degraded
            1 pgs stuck unclean
            recovery 2/6 objects degraded (33.333%)
            too few PGs per OSD (1 < min 30)
            noscrub,nodeep-scrub,sortbitwise,require_jewel_osds,require_kraken_osds flag(s) set
     monmap e1: 1 mons at {a=127.0.0.1:7107/0}
            election epoch 3, quorum 0 a
        mgr no daemons active
     osdmap e117: 10 osds: 10 up, 10 in
            flags noscrub,nodeep-scrub,sortbitwise,require_jewel_osds,require_kraken_osds
      pgmap v674: 5 pgs, 2 pools, 7 bytes data, 1 objects
            60710 MB used, 2314 GB / 2374 GB avail
            2/6 objects degraded (33.333%)
                   4 active+clean
                   1 active+degraded

PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES LOG DISK_LOG STATE STATE_STAMP VERSION REPORTED UP UP_PRIMARY ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB DEEP_SCRUB_STAMP
122: 2.0 1 0 2 0 0 7 1 1 active+degraded 2016-11-04 17:55:00.190833 72'1 117:139 [3,1,9,0,6,2] 3 [3,1,9,0,6,2] 3 72'1 2016-11-04 17:54:37.943943 72'1 2016-11-04 17:54:37.943943
122: 1.3 0 0 0 0 0 0 0 0 active+clean 2016-11-04 17:55:00.321568 0'0 117:139 [4,1,5] 4 [4,1,5] 4 0'0 2016-11-04 17:52:24.704377 0'0 2016-11-04 17:52:24.704377
122: 1.2 0 0 0 0 0 0 0 0 active+clean 2016-11-04 17:55:00.249497 0'0 117:188 [0,5,9] 0 [0,5,9] 0 0'0 2016-11-04 17:52:24.704324 0'0 2016-11-04 17:52:24.704324
122: 1.1 0 0 0 0 0 0 0 0 active+clean 2016-11-04 17:55:00.133525 0'0 116:7 [7,3,8] 7 [7,3,8] 7 0'0 2016-11-04 17:52:24.704269 0'0 2016-11-04 17:52:24.704269
122: 1.0 0 0 0 0 0 0 0 0 active+clean 2016-11-04 17:53:18.409300 0'0 60:7 [8,0,2] 8 [8,0,2] 8 0'0 2016-11-04 17:52:24.704159 0'0 2016-11-04 17:52:24.704159
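As a sanity check on the numbers above: the 2/6 presumably counts shard copies, not objects -- the one object in the six-shard EC pool has six copies, two of them degraded. A quick recomputation of the percentage by parsing the pasted status line (plain string parsing, not a ceph API):

```python
import re

def degraded_ratio(status_line):
    """Parse 'N/M objects degraded' from ceph status text and recompute the percentage."""
    m = re.search(r"(\d+)/(\d+) objects degraded", status_line)
    degraded, total = int(m.group(1)), int(m.group(2))
    return 100.0 * degraded / total

# round(degraded_ratio("recovery 2/6 objects degraded (33.333%)"), 3) -> 33.333
```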




* Re: trying to get osd-scrub-repair to work on my jenkins builder
  2016-11-04 17:09 trying to get osd-scrub-repair to work on my jenkins builder Willem Jan Withagen
@ 2016-11-04 19:39 ` David Zafman
  2016-11-04 20:09   ` Willem Jan Withagen
  2016-11-07 12:40 ` Willem Jan Withagen
  1 sibling, 1 reply; 10+ messages in thread
From: David Zafman @ 2016-11-04 19:39 UTC (permalink / raw)
  To: Willem Jan Withagen, Ceph Development


I'm working a similar problem with the test-erasure-eio.sh test, which
only fails on Jenkins.  I have a pg that is active+degraded and then
active+recovery_wait+degraded.  In this case the hinfo is missing after
osds are brought up and down in order to use the ceph-objectstore-tool
to implement the test cases.  On Jenkins the osd restarting causes
different pg mappings than on my build machine and recovery can't make
progress.

See if you see these messages in the osd log of the primary of the pg (in
your example below that would be osd.3):

handle_sub_read_reply shard=1(1) error=-5

_failed_push: canceling recovery op for obj ...
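To pull just those lines out of a large osd log, something like this sketch works (the log filename is a placeholder for whatever your test run writes):

```python
import re

# The two diagnostics above, as regexes.
PATTERNS = [
    re.compile(r"handle_sub_read_reply .*error=-?\d+"),
    re.compile(r"_failed_push: canceling recovery op"),
]

def scan(lines):
    """Yield log lines matching either diagnostic, so you can see which shard fails."""
    for line in lines:
        if any(p.search(line) for p in PATTERNS):
            yield line.rstrip()

# for hit in scan(open("osd.3.log")): print(hit)
```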

David


On 11/4/16 10:09 AM, Willem Jan Withagen wrote:
> Hi,
>
> On my workstation I have this test completing just fine. But on my
> Jenkins builder it keeps running into a state where it does not make
> any progress.
> Any particulars I should look for? I can let this run for an hour, but
> pg 2.0 stays active+degraded, and the script requires it to be clean,
> and the pgmap version is steadily incrementing.
>
> What should be in the log files that points me to the problem?
>
> Thanx,
> --WjW
>
>
>      cluster 667960a1-a2ae-11e6-a834-69c386980813
>       health HEALTH_WARN
>              1 pgs degraded
>              1 pgs stuck degraded
>              1 pgs stuck unclean
>              recovery 2/6 objects degraded (33.333%)
>              too few PGs per OSD (1 < min 30)
>             
> noscrub,nodeep-scrub,sortbitwise,require_jewel_osds,require_kraken_osds
> flag(s) set
>       monmap e1: 1 mons at {a=127.0.0.1:7107/0}
>              election epoch 3, quorum 0 a
>          mgr no daemons active
>       osdmap e117: 10 osds: 10 up, 10 in
>              flags
> noscrub,nodeep-scrub,sortbitwise,require_jewel_osds,require_kraken_osds
>        pgmap v674: 5 pgs, 2 pools, 7 bytes data, 1 objects
>              60710 MB used, 2314 GB / 2374 GB avail
>              2/6 objects degraded (33.333%)
>                     4 active+clean
>                     1 active+degraded
>
> PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES LOG
> DISK_LOG STATE           STATE_STAMP                VERSION REPORTED
> UP            UP_PRIMARY ACTING        ACTING_PRIMARY LAST_SCRUB
> SCRUB_STAMP                LAST_DEEP_SCRUB DEEP_SCRUB_STAMP
> 122: 2.0           1                  0        2         0       0
> 7   1        1 active+degraded 2016-11-04 17:55:00.190833    72'1
> 117:139 [3,1,9,0,6,2]          3 [3,1,9,0,6,2]              3       72'1
> 2016-11-04 17:54:37.943943            72'1 2016-11-04 17:54:37.943943
> 122: 1.3           0                  0        0         0       0
> 0   0        0    active+clean 2016-11-04 17:55:00.321568     0'0
> 117:139       [4,1,5]          4       [4,1,5]              4        0'0
> 2016-11-04 17:52:24.704377             0'0 2016-11-04 17:52:24.704377
> 122: 1.2           0                  0        0         0       0
> 0   0        0    active+clean 2016-11-04 17:55:00.249497     0'0
> 117:188       [0,5,9]          0       [0,5,9]              0        0'0
> 2016-11-04 17:52:24.704324             0'0 2016-11-04 17:52:24.704324
> 122: 1.1           0                  0        0         0       0
> 0   0        0    active+clean 2016-11-04 17:55:00.133525     0'0
> 116:7       [7,3,8]          7       [7,3,8]              7        0'0
> 2016-11-04 17:52:24.704269             0'0 2016-11-04 17:52:24.704269
> 122: 1.0           0                  0        0         0       0
> 0   0        0    active+clean 2016-11-04 17:53:18.409300     0'0
> 60:7       [8,0,2]          8       [8,0,2]              8        0'0
> 2016-11-04 17:52:24.704159             0'0 2016-11-04 17:52:24.704159
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



* Re: trying to get osd-scrub-repair to work on my jenkins builder
  2016-11-04 19:39 ` David Zafman
@ 2016-11-04 20:09   ` Willem Jan Withagen
  2016-11-04 20:52     ` Willem Jan Withagen
  0 siblings, 1 reply; 10+ messages in thread
From: Willem Jan Withagen @ 2016-11-04 20:09 UTC (permalink / raw)
  To: David Zafman, Ceph Development

On 4-11-2016 20:39, David Zafman wrote:
> 
> I'm working a similar problem with the test-erasure-eio.sh test which
> only fails on Jenkins.  I have a pg that is active+degraded and then
> active+recovery_wait+degraded.  In this case the hinfo is missing after
> osds are brought up and down in order to use the ceph-objectstore-tool
> to implement the test cases.  On Jenkins the osd restarting causes
> different pg mappings than on my build machine and recovery can't make
> progress.
> 
> See if you see these messages in the osd log of the primary of the pg (in
> your example below that would be osd.3):
> 
> handle_sub_read_reply shard=1(1) error=-5
> 
> _failed_push: canceling recovery op for obj ...

I have something like 250,000(!) lines looking like:

2016-11-04 18:09:52.769091 ba9c480 10 osd.3 pg_epoch: 117 pg[2.0s0( v
72'1 (0'0,72'1] local-les=116 n=1 ec=69 les/c/f 116/105/0 113/115/84)
[3,1,9,0,6,2] r=0 lpr=115 pi=103-114/5 rops=1 crt=72'1 lcod 0'0 mlcod
0'0 active+recovering+degraded] _failed_push: canceling recovery op for
obj 2:eb822e21:::SOMETHING:head

And another fragment is:
2016-11-04 17:55:00.233049 ba6ad00 10 osd.3 pg_epoch: 117 pg[2.0s0( v
72'1 (0'0,72'1] local-les=116 n=1 ec=69 les/c/f 116/105/0 113/115/84)
[3,1,9,0,6,2] r=0 lpr=115 pi=103-114/5 rops=1 crt=72'1 lcod 0'0 mlcod
0'0 active+recovering+degraded] handle_message:
MOSDECSubOpReadReply(2.0s0 117 ECSubReadReply(tid=14, attrs_read=0)) v1
2016-11-04 17:55:00.233099 ba6ad00 10 osd.3 pg_epoch: 117 pg[2.0s0( v
72'1 (0'0,72'1] local-les=116 n=1 ec=69 les/c/f 116/105/0 113/115/84)
[3,1,9,0,6,2] r=0 lpr=115 pi=103-114/5 rops=1 crt=72'1 lcod 0'0 mlcod
0'0 active+recovering+degraded] handle_sub_read_reply: reply
ECSubReadReply(tid=14, attrs_read=0)
2016-11-04 17:55:00.233150 ba6ad00 20 osd.3 pg_epoch: 117 pg[2.0s0( v
72'1 (0'0,72'1] local-les=116 n=1 ec=69 les/c/f 116/105/0 113/115/84)
[3,1,9,0,6,2] r=0 lpr=115 pi=103-114/5 rops=1 crt=72'1 lcod 0'0 mlcod
0'0 active+recovering+degraded] handle_sub_read_reply shard=9(2) error=-2
2016-11-04 17:55:00.233200 ba6ad00 10 osd.3 pg_epoch: 117 pg[2.0s0( v
72'1 (0'0,72'1] local-les=116 n=1 ec=69 les/c/f 116/105/0 113/115/84)
[3,1,9,0,6,2] r=0 lpr=115 pi=103-114/5 rops=1 crt=72'1 lcod 0'0 mlcod
0'0 active+recovering+degraded] handle_sub_read_reply readop not
complete: ReadOp(tid=14,
to_read={2:eb822e21:::SOMETHING:head=read_request_t(to_read=[0,8388608,0],
need=0(3),3(0),6(4),9(2), want_attrs=1)},
complete={2:eb822e21:::SOMETHING:head=read_result_t(r=0,
errors={9(2)=-2}, noattrs, returned=(0, 8388608, [3(0),1024]))},
priority=3,
obj_to_source={2:eb822e21:::SOMETHING:head=0(3),3(0),6(4),9(2)},
source_to_obj={0(3)=2:eb822e21:::SOMETHING:head,3(0)=2:eb822e21:::SOMETHING:head,6(4)=2:eb822e21:::SOMETHING:head,9(2)=2:eb822e21:::SOMETHING:head},
in_progress=0(3),6(4))

I can post (a part of) the 7 GB logfile for you to have a look at.
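One way to cut the 7 GB file down to the relevant lines before posting -- a sketch; the key strings are the object/PG from this thread and the file paths are placeholders:

```python
def trim_log(src, dst, keys=("SOMETHING", "pg[2.0")):
    """Copy only the lines mentioning the problem object or PG; return how many were kept."""
    kept = 0
    for line in src:
        if any(k in line for k in keys):
            dst.write(line)
            kept += 1
    return kept

# with open("osd.3.log") as src, open("osd.3.trimmed.log", "w") as dst:
#     print(trim_log(src, dst), "lines kept")
```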

--WjW




* Re: trying to get osd-scrub-repair to work on my jenkins builder
  2016-11-04 20:09   ` Willem Jan Withagen
@ 2016-11-04 20:52     ` Willem Jan Withagen
  2016-11-04 21:07       ` Sage Weil
  0 siblings, 1 reply; 10+ messages in thread
From: Willem Jan Withagen @ 2016-11-04 20:52 UTC (permalink / raw)
  To: David Zafman, Ceph Development

On 4-11-2016 21:09, Willem Jan Withagen wrote:
> On 4-11-2016 20:39, David Zafman wrote:
>>
>> I'm working a similar problem with the test-erasure-eio.sh test which
>> only fails on Jenkins.  I have a pg that is active+degraded and then
>> active+recovery_wait+degraded.  In this case the hinfo is missing after
>> osds are brought up and down in order to use the ceph-objectstore-tool
>> to implement the test cases.  On Jenkins the osd restarting causes
>> different pg mappings than on my build machine and recovery can't make
>> progress.
>>
>> See if you see these messages in the osd log of the primary of the pg (in
>> your example below that would be osd.3):
>>
>> handle_sub_read_reply shard=1(1) error=-5
>>
>> _failed_push: canceling recovery op for obj ...
> 
> I have something like 250,000(!) lines looking like:
> 
> 2016-11-04 18:09:52.769091 ba9c480 10 osd.3 pg_epoch: 117 pg[2.0s0( v
> 72'1 (0'0,72'1] local-les=116 n=1 ec=69 les/c/f 116/105/0 113/115/84)
> [3,1,9,0,6,2] r=0 lpr=115 pi=103-114/5 rops=1 crt=72'1 lcod 0'0 mlcod
> 0'0 active+recovering+degraded] _failed_push: canceling recovery op for
> obj 2:eb822e21:::SOMETHING:head
> 
> And another fragment is:
> 2016-11-04 17:55:00.233049 ba6ad00 10 osd.3 pg_epoch: 117 pg[2.0s0( v
> 72'1 (0'0,72'1] local-les=116 n=1 ec=69 les/c/f 116/105/0 113/115/84)
> [3,1,9,0,6,2] r=0 lpr=115 pi=103-114/5 rops=1 crt=72'1 lcod 0'0 mlcod
> 0'0 active+recovering+degraded] handle_message:
> MOSDECSubOpReadReply(2.0s0 117 ECSubReadReply(tid=14, attrs_read=0)) v1
> 2016-11-04 17:55:00.233099 ba6ad00 10 osd.3 pg_epoch: 117 pg[2.0s0( v
> 72'1 (0'0,72'1] local-les=116 n=1 ec=69 les/c/f 116/105/0 113/115/84)
> [3,1,9,0,6,2] r=0 lpr=115 pi=103-114/5 rops=1 crt=72'1 lcod 0'0 mlcod
> 0'0 active+recovering+degraded] handle_sub_read_reply: reply
> ECSubReadReply(tid=14, attrs_read=0)
> 2016-11-04 17:55:00.233150 ba6ad00 20 osd.3 pg_epoch: 117 pg[2.0s0( v
> 72'1 (0'0,72'1] local-les=116 n=1 ec=69 les/c/f 116/105/0 113/115/84)
> [3,1,9,0,6,2] r=0 lpr=115 pi=103-114/5 rops=1 crt=72'1 lcod 0'0 mlcod
> 0'0 active+recovering+degraded] handle_sub_read_reply shard=9(2) error=-2
> 2016-11-04 17:55:00.233200 ba6ad00 10 osd.3 pg_epoch: 117 pg[2.0s0( v
> 72'1 (0'0,72'1] local-les=116 n=1 ec=69 les/c/f 116/105/0 113/115/84)
> [3,1,9,0,6,2] r=0 lpr=115 pi=103-114/5 rops=1 crt=72'1 lcod 0'0 mlcod
> 0'0 active+recovering+degraded] handle_sub_read_reply readop not
> complete: ReadOp(tid=14,
> to_read={2:eb822e21:::SOMETHING:head=read_request_t(to_read=[0,8388608,0],
> need=0(3),3(0),6(4),9(2), want_attrs=1)},
> complete={2:eb822e21:::SOMETHING:head=read_result_t(r=0,
> errors={9(2)=-2}, noattrs, returned=(0, 8388608, [3(0),1024]))},
> priority=3,
> obj_to_source={2:eb822e21:::SOMETHING:head=0(3),3(0),6(4),9(2)},
> source_to_obj={0(3)=2:eb822e21:::SOMETHING:head,3(0)=2:eb822e21:::SOMETHING:head,6(4)=2:eb822e21:::SOMETHING:head,9(2)=2:eb822e21:::SOMETHING:head},
> in_progress=0(3),6(4))

And the log of osd.9 is also full of the following sequence:
2016-11-04 18:09:51.995641 bab2000 20 osd.9 117 share_map_peer 0xbb0c000
already has epoch 117
2016-11-04 18:09:51.995848 bab2000 10 osd.9 117 dequeue_op 0xbf59180 finish
2016-11-04 18:09:52.005020 b8b3200 10 osd.9 117 handle_replica_op
MOSDECSubOpRead(2.0s2 117 ECSubRead(tid=86489, to_read={2:eb822e21
:::SOMETHING:head=0,2097152,0}, attrs_to_read=)) v2 epoch 117
2016-11-04 18:09:52.005149 b8b3200 20 osd.9 117 should_share_map osd.3
127.0.0.1:6813/67592 117
2016-11-04 18:09:52.005229 b8b3200 15 osd.9 117 enqueue_op 0xbf5ae40
prio 3 cost 0 latency 0.000378 MOSDECSubOpRead(2.0s2 117 ECSubR
ead(tid=86489, to_read={2:eb822e21:::SOMETHING:head=0,2097152,0},
attrs_to_read=)) v2
2016-11-04 18:09:52.005408 bab2000 10 osd.9 117 dequeue_op 0xbf5ae40
prio 3 cost 0 latency 0.000558 MOSDECSubOpRead(2.0s2 117 ECSubR
ead(tid=86489, to_read={2:eb822e21:::SOMETHING:head=0,2097152,0},
attrs_to_read=)) v2 pg pg[2.0s2( v 72'1 (0'0,72'1] local-les=116 n
=1 ec=69 les/c/f 116/105/0 113/115/84) [3,1,9,0,6,2] r=2 lpr=115
pi=69-114/22 luod=0'0 crt=0'0 lcod 0'0 active]
2016-11-04 18:09:52.005539 bab2000 10 osd.9 pg_epoch: 117 pg[2.0s2( v
72'1 (0'0,72'1] local-les=116 n=1 ec=69 les/c/f 116/105/0 113/
115/84) [3,1,9,0,6,2] r=2 lpr=115 pi=69-114/22 luod=0'0 crt=0'0 lcod 0'0
active] handle_message: MOSDECSubOpRead(2.0s2 117 ECSubRead
(tid=86489, to_read={2:eb822e21:::SOMETHING:head=0,2097152,0},
attrs_to_read=)) v2
2016-11-04 18:09:52.005595 bab2000 10 osd.9 pg_epoch: 117 pg[2.0s2( v
72'1 (0'0,72'1] local-les=116 n=1 ec=69 les/c/f 116/105/0 113/
115/84) [3,1,9,0,6,2] r=2 lpr=115 pi=69-114/22 luod=0'0 crt=0'0 lcod 0'0
active] get_hash_info: Getting attr on 2:eb822e21:::SOMETHI
NG:head
2016-11-04 18:09:52.005641 bab2000 10 osd.9 pg_epoch: 117 pg[2.0s2( v
72'1 (0'0,72'1] local-les=116 n=1 ec=69 les/c/f 116/105/0 113/
115/84) [3,1,9,0,6,2] r=2 lpr=115 pi=69-114/22 luod=0'0 crt=0'0 lcod 0'0
active] get_hash_info: not in cache 2:eb822e21:::SOMETHING:
head
2016-11-04 18:09:52.005853 bab2000 -1 log_channel(cluster) log [ERR] :
handle_sub_read: Error -2 reading 2:eb822e21:::SOMETHING:head
2016-11-04 18:09:52.005880 bab2000  5 osd.9 pg_epoch: 117 pg[2.0s2( v
72'1 (0'0,72'1] local-les=116 n=1 ec=69 les/c/f 116/105/0 113/
115/84) [3,1,9,0,6,2] r=2 lpr=115 pi=69-114/22 luod=0'0 crt=0'0 lcod 0'0
active] handle_sub_read: Error -2 reading 2:eb822e21:::SOME
THING:head
2016-11-04 18:09:52.005940 bab2000 20 osd.9 117 share_map_peer 0xbb0c000
already has epoch 117

And error -2 translates to:
#define ENOENT          2               /* No such file or directory */
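The osd log reports errors as negated errno values, so the numbers can be decoded with the Python standard library instead of grepping errno.h:

```python
import errno
import os

def decode_osd_error(err):
    """Map an 'error=-N' value from the osd log to its errno name and description."""
    code = abs(err)
    return errno.errorcode.get(code, "unknown"), os.strerror(code)

# decode_osd_error(-2) -> ('ENOENT', 'No such file or directory')
# decode_osd_error(-5) -> ('EIO', 'Input/output error')
```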

And these are the SOMETHING__head files in the test:
1 -rw-r--r--  1 jenkins  jenkins  1024 Nov  4 17:53 ./0/current/2.0s3_head/SOMETHING__head_847441D7__2_ffffffffffffffff_3
1 -rw-r--r--  1 jenkins  jenkins  1024 Nov  4 17:53 ./2/current/2.0s5_head/SOMETHING__head_847441D7__2_ffffffffffffffff_5
1 -rw-r--r--  1 jenkins  jenkins  1024 Nov  4 17:54 ./3/current/2.0s0_head/SOMETHING__head_847441D7__2_ffffffffffffffff_0
1 -rw-r--r--  1 jenkins  jenkins  1024 Nov  4 17:53 ./6/current/2.0s4_head/SOMETHING__head_847441D7__2_ffffffffffffffff_4

Perhaps this 847441D7 should read EB822E21?
And is this the reason for the read error?
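For reference, a sketch that splits such a filename into fields; the layout assumed here (name, snapshot, hash, pool, generation, shard) is inferred from the listing above, not taken from the FileStore source:

```python
import re

# Assumed layout: <name>__<snap>_<hash>__<pool>_<generation>_<shard>
FNAME_RE = re.compile(
    r"^(?P<name>.+)__(?P<snap>[^_]+)_(?P<hash>[0-9A-Fa-f]{8})__"
    r"(?P<pool>[0-9a-fA-F]+)_(?P<gen>[0-9a-fA-F]+)_(?P<shard>[0-9a-fA-F]+)$")

def parse_filestore_name(fname):
    """Return the filename's fields as a dict, or None if it doesn't match."""
    m = FNAME_RE.match(fname)
    return m.groupdict() if m else None

# parse_filestore_name("SOMETHING__head_847441D7__2_ffffffffffffffff_3")["hash"] -> '847441D7'
```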

--WjW


* Re: trying to get osd-scrub-repair to work on my jenkins builder
  2016-11-04 20:52     ` Willem Jan Withagen
@ 2016-11-04 21:07       ` Sage Weil
  2016-11-04 21:13         ` Willem Jan Withagen
  0 siblings, 1 reply; 10+ messages in thread
From: Sage Weil @ 2016-11-04 21:07 UTC (permalink / raw)
  To: Willem Jan Withagen; +Cc: David Zafman, Ceph Development

On Fri, 4 Nov 2016, Willem Jan Withagen wrote:
> [...]
> And these are the SOMETHING__head files in the test:
> 1 -rw-r--r--  1 jenkins  jenkins  1024 Nov  4 17:53 ./0/current/2.0s3_head/SOMETHING__head_847441D7__2_ffffffffffffffff_3
> 1 -rw-r--r--  1 jenkins  jenkins  1024 Nov  4 17:53 ./2/current/2.0s5_head/SOMETHING__head_847441D7__2_ffffffffffffffff_5
> 1 -rw-r--r--  1 jenkins  jenkins  1024 Nov  4 17:54 ./3/current/2.0s0_head/SOMETHING__head_847441D7__2_ffffffffffffffff_0
> 1 -rw-r--r--  1 jenkins  jenkins  1024 Nov  4 17:53 ./6/current/2.0s4_head/SOMETHING__head_847441D7__2_ffffffffffffffff_4
> 
> Perhaps this 847441D7 should read EB822E21?
> And is this the reason for the read error?

Could be!  Note though that the ghobject_t operator<< string you see in
the log (2:eb822e21:::SOMETHING:head) prints the hex in reversed bitwise
order, while the string in the FileStore filename is in reversed nibble
form.  You can convert between the two by reversing the bits in each hex
digit/nibble, though, and those don't appear to match.
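For comparing the two printed forms, here is a sketch of a full 32-bit bit reversal (whether this is exactly the transform involved here is something to verify against the actual files):

```python
def reverse_bits32(x):
    """Reverse the bit order of a 32-bit value; applying it twice is a no-op."""
    out = 0
    for _ in range(32):
        out = (out << 1) | (x & 1)
        x >>= 1
    return out

# Example with the two hashes from this thread:
# hex(reverse_bits32(0xEB822E21)) -> '0x847441d7'
```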

In any case, can you post the log?  ceph-post-file <filename> then send 
the uuid.

Thanks!
sage


* Re: trying to get osd-scrub-repair to work on my jenkins builder
  2016-11-04 21:07       ` Sage Weil
@ 2016-11-04 21:13         ` Willem Jan Withagen
  2016-11-04 21:19           ` Sage Weil
  0 siblings, 1 reply; 10+ messages in thread
From: Willem Jan Withagen @ 2016-11-04 21:13 UTC (permalink / raw)
  To: Sage Weil; +Cc: David Zafman, Ceph Development

On 4-11-2016 22:07, Sage Weil wrote:
> On Fri, 4 Nov 2016, Willem Jan Withagen wrote:
>> On 4-11-2016 21:09, Willem Jan Withagen wrote:
>>> On 4-11-2016 20:39, David Zafman wrote:
>>>>
>>>> I'm working a similar problem with the test-erasure-eio.sh test which
>>>> only fails on Jenkins.  I have a pg that is active+degraded and then
>>>> active+recovery_wait+degraded.  In this case the hinfo is missing after
>>>> osds are brought up and down in order to use the ceph-objectstore-tool
>>>> to implement the test cases.  On Jenkins the osd restarting causes
>>>> different pg mappings than on my build machine and recovery can't make
>>>> progress.
>>>>
>>>> See if you see these messages in the osd log of the primary of the pg (in
>>>> your example below that would be osd.3):
>>>>
>>>> handle_sub_read_reply shard=1(1) error=-5
>>>>
>>>> _failed_push: canceling recovery op for obj ...
>>>
>>> I have something like 250,000(!) lines looking like:
>>>
>>> 2016-11-04 18:09:52.769091 ba9c480 10 osd.3 pg_epoch: 117 pg[2.0s0( v
>>> 72'1 (0'0,72'1] local-les=116 n=1 ec=69 les/c/f 116/105/0 113/115/84)
>>> [3,1,9,0,6,2] r=0 lpr=115 pi=103-114/5 rops=1 crt=72'1 lcod 0'0 mlcod
>>> 0'0 active+recovering+degraded] _failed_push: canceling recovery op for
>>> obj 2:eb822e21:::SOMETHING:head
>>>
>>> And another fragment is:
>>> 2016-11-04 17:55:00.233049 ba6ad00 10 osd.3 pg_epoch: 117 pg[2.0s0( v
>>> 72'1 (0'0,72'1] local-les=116 n=1 ec=69 les/c/f 116/105/0 113/115/84)
>>> [3,1,9,0,6,2] r=0 lpr=115 pi=103-114/5 rops=1 crt=72'1 lcod 0'0 mlcod
>>> 0'0 active+recovering+degraded] handle_message:
>>> MOSDECSubOpReadReply(2.0s0 117 ECSubReadReply(tid=14, attrs_read=0)) v1
>>> 2016-11-04 17:55:00.233099 ba6ad00 10 osd.3 pg_epoch: 117 pg[2.0s0( v
>>> 72'1 (0'0,72'1] local-les=116 n=1 ec=69 les/c/f 116/105/0 113/115/84)
>>> [3,1,9,0,6,2] r=0 lpr=115 pi=103-114/5 rops=1 crt=72'1 lcod 0'0 mlcod
>>> 0'0 active+recovering+degraded] handle_sub_read_reply: reply
>>> ECSubReadReply(tid=14, attrs_read=0)
>>> 2016-11-04 17:55:00.233150 ba6ad00 20 osd.3 pg_epoch: 117 pg[2.0s0( v
>>> 72'1 (0'0,72'1] local-les=116 n=1 ec=69 les/c/f 116/105/0 113/115/84)
>>> [3,1,9,0,6,2] r=0 lpr=115 pi=103-114/5 rops=1 crt=72'1 lcod 0'0 mlcod
>>> 0'0 active+recovering+degraded] handle_sub_read_reply shard=9(2) error=-2
>>> 2016-11-04 17:55:00.233200 ba6ad00 10 osd.3 pg_epoch: 117 pg[2.0s0( v
>>> 72'1 (0'0,72'1] local-les=116 n=1 ec=69 les/c/f 116/105/0 113/115/84)
>>> [3,1,9,0,6,2] r=0 lpr=115 pi=103-114/5 rops=1 crt=72'1 lcod 0'0 mlcod
>>> 0'0 active+recovering+degraded] handle_sub_read_reply readop not
>>> complete: ReadOp(tid=14,
>>> to_read={2:eb822e21:::SOMETHING:head=read_request_t(to_read=[0,8388608,0],
>>> need=0(3),3(0),6(4),9(2), want_attrs=1)},
>>> complete={2:eb822e21:::SOMETHING:head=read_result_t(r=0,
>>> errors={9(2)=-2}, noattrs, returned=(0, 8388608, [3(0),1024]))},
>>> priority=3,
>>> obj_to_source={2:eb822e21:::SOMETHING:head=0(3),3(0),6(4),9(2)},
>>> source_to_obj={0(3)=2:eb822e21:::SOMETHING:head,3(0)=2:eb822e21:::SOMETHING:head,6(4)=2:eb822e21:::SOMETHING:head,9(2)=2:eb822e21:::SOMETHING:head},
>>> in_progress=0(3),6(4))
>>
>> And the log of osd.9 is also full of repetitions of the following sequence:
>> 2016-11-04 18:09:51.995641 bab2000 20 osd.9 117 share_map_peer 0xbb0c000
>> already has epoch 117
>> 2016-11-04 18:09:51.995848 bab2000 10 osd.9 117 dequeue_op 0xbf59180 finish
>> 2016-11-04 18:09:52.005020 b8b3200 10 osd.9 117 handle_replica_op
>> MOSDECSubOpRead(2.0s2 117 ECSubRead(tid=86489, to_read={2:eb822e21
>> :::SOMETHING:head=0,2097152,0}, attrs_to_read=)) v2 epoch 117
>> 2016-11-04 18:09:52.005149 b8b3200 20 osd.9 117 should_share_map osd.3
>> 127.0.0.1:6813/67592 117
>> 2016-11-04 18:09:52.005229 b8b3200 15 osd.9 117 enqueue_op 0xbf5ae40
>> prio 3 cost 0 latency 0.000378 MOSDECSubOpRead(2.0s2 117 ECSubR
>> ead(tid=86489, to_read={2:eb822e21:::SOMETHING:head=0,2097152,0},
>> attrs_to_read=)) v2
>> 2016-11-04 18:09:52.005408 bab2000 10 osd.9 117 dequeue_op 0xbf5ae40
>> prio 3 cost 0 latency 0.000558 MOSDECSubOpRead(2.0s2 117 ECSubR
>> ead(tid=86489, to_read={2:eb822e21:::SOMETHING:head=0,2097152,0},
>> attrs_to_read=)) v2 pg pg[2.0s2( v 72'1 (0'0,72'1] local-les=116 n
>> =1 ec=69 les/c/f 116/105/0 113/115/84) [3,1,9,0,6,2] r=2 lpr=115
>> pi=69-114/22 luod=0'0 crt=0'0 lcod 0'0 active]
>> 2016-11-04 18:09:52.005539 bab2000 10 osd.9 pg_epoch: 117 pg[2.0s2( v
>> 72'1 (0'0,72'1] local-les=116 n=1 ec=69 les/c/f 116/105/0 113/
>> 115/84) [3,1,9,0,6,2] r=2 lpr=115 pi=69-114/22 luod=0'0 crt=0'0 lcod 0'0
>> active] handle_message: MOSDECSubOpRead(2.0s2 117 ECSubRead
>> (tid=86489, to_read={2:eb822e21:::SOMETHING:head=0,2097152,0},
>> attrs_to_read=)) v2
>> 2016-11-04 18:09:52.005595 bab2000 10 osd.9 pg_epoch: 117 pg[2.0s2( v
>> 72'1 (0'0,72'1] local-les=116 n=1 ec=69 les/c/f 116/105/0 113/
>> 115/84) [3,1,9,0,6,2] r=2 lpr=115 pi=69-114/22 luod=0'0 crt=0'0 lcod 0'0
>> active] get_hash_info: Getting attr on 2:eb822e21:::SOMETHI
>> NG:head
>> 2016-11-04 18:09:52.005641 bab2000 10 osd.9 pg_epoch: 117 pg[2.0s2( v
>> 72'1 (0'0,72'1] local-les=116 n=1 ec=69 les/c/f 116/105/0 113/
>> 115/84) [3,1,9,0,6,2] r=2 lpr=115 pi=69-114/22 luod=0'0 crt=0'0 lcod 0'0
>> active] get_hash_info: not in cache 2:eb822e21:::SOMETHING:
>> head
>> 2016-11-04 18:09:52.005853 bab2000 -1 log_channel(cluster) log [ERR] :
>> handle_sub_read: Error -2 reading 2:eb822e21:::SOMETHING:head
>> 2016-11-04 18:09:52.005880 bab2000  5 osd.9 pg_epoch: 117 pg[2.0s2( v
>> 72'1 (0'0,72'1] local-les=116 n=1 ec=69 les/c/f 116/105/0 113/
>> 115/84) [3,1,9,0,6,2] r=2 lpr=115 pi=69-114/22 luod=0'0 crt=0'0 lcod 0'0
>> active] handle_sub_read: Error -2 reading 2:eb822e21:::SOME
>> THING:head
>> 2016-11-04 18:09:52.005940 bab2000 20 osd.9 117 share_map_peer 0xbb0c000
>> already has epoch 117
>>
>> And error -2 would read more or less:
>> #define ENOENT          2               /* No such file or directory */
>>
>> And these are the SOMETHING__head files in the test:
>> 1 -rw-r--r--  1 jenkins  jenkins  1024 Nov  4 17:53
>> ./0/current/2.0s3_head/SOMETHING__head_847441D7__2_ffffffffffffffff_3
>> 1 -rw-r--r--  1 jenkins  jenkins  1024 Nov  4 17:53
>> ./2/current/2.0s5_head/SOMETHING__head_847441D7__2_ffffffffffffffff_5
>> 1 -rw-r--r--  1 jenkins  jenkins  1024 Nov  4 17:54
>> ./3/current/2.0s0_head/SOMETHING__head_847441D7__2_ffffffffffffffff_0
>> 1 -rw-r--r--  1 jenkins  jenkins  1024 Nov  4 17:53
>> ./6/current/2.0s4_head/SOMETHING__head_847441D7__2_ffffffffffffffff_4
>>
>> Perhaps this 847441D7 should read EB822E21 ??
>> And is this the reason of the read error?
> 
> Could be!  Note though that the ghobject_t operator<< string you see in 
> the log (2:eb822e21:::SOMETHING:head) prints the hex in reversed bitwise 
> order, while the string in the FileStore filename is reversed nibble form.  
> You can convert between the two bit reversing the bits in each hex 
> digit/nibble, though, and those don't appear to match.

Ehhh, that calls for a proper print operator ... and for adding that form
as extra info with this error. It would make comparing and debugging a lot
easier, and match what is actually stored in the FS.
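
For what it's worth, the two forms are easy to compare with a short
script. This is just my own sketch (assuming operator<< prints the full
32-bit hash bit-reversed, and the FileStore name carries the raw hash in
plain hex):

```python
# Compare the hash as printed in the log ("2:eb822e21:::SOMETHING:head",
# bit-reversed by ghobject_t's operator<<) with the hash embedded in the
# FileStore filename ("SOMETHING__head_847441D7__2_...").
def bit_reverse32(x):
    """Reverse the order of all 32 bits of x."""
    r = 0
    for _ in range(32):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r

log_hash = 0xEB822E21   # hash as shown by operator<<
file_hash = 0x847441D7  # hash as shown in the FileStore filename

# Reversing all 32 bits of the filename hash yields the logged hash,
# so the two representations would refer to the same object after all.
print(hex(bit_reverse32(file_hash)))  # 0xeb822e21
```

If that holds, the names refer to the same object, and the ENOENT would
point at the shard file simply being absent on osd.9 (there is no
SOMETHING__head file under ./9/ in the listing above) rather than at a
naming mismatch.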

> In any case, can you post the log?  ceph-post-file <filename> then send 
> the uuid.

Which logs would you like to have?
osd.3.log is 7G and osd.9.log is 600Mb. I do have the bandwidth to
upload, but especially 7G is big.

--WjW

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: trying to get osd-scrub-repair to work on my jenkins builder
  2016-11-04 21:13         ` Willem Jan Withagen
@ 2016-11-04 21:19           ` Sage Weil
  0 siblings, 0 replies; 10+ messages in thread
From: Sage Weil @ 2016-11-04 21:19 UTC (permalink / raw)
  To: Willem Jan Withagen; +Cc: David Zafman, Ceph Development

On Fri, 4 Nov 2016, Willem Jan Withagen wrote:
> On 4-11-2016 22:07, Sage Weil wrote:
> > On Fri, 4 Nov 2016, Willem Jan Withagen wrote:
> >> On 4-11-2016 21:09, Willem Jan Withagen wrote:
> >>> On 4-11-2016 20:39, David Zafman wrote:
> >>>>
> >>>> I'm working a similar problem with the test-erasure-eio.sh test which
> >>>> only fails on Jenkins.  I have a pg that is active+degraded and then
> >>>> active+recovery_wait+degraded.  In this case the hinfo is missing after
> >>>> osds are brought up and down in order to use the ceph-objectstore-tool
> >>>> to implement the test cases.  On Jenkins the osd restarting causes
> >>>> different pg mappings then on my build machine and recovery can't make
> >>>> progress.
> >>>>
> >>>> See if you see these message in the osd log of the primary of the pg (in
> >>>> your example below that would be osd.3:
> >>>>
> >>>> handle_sub_read_reply shard=1(1) error=-5
> >>>>
> >>>> _failed_push: canceling recovery op for obj ...
> >>>
> >>> I have something like 250.000 !!! lines looking like:
> >>>
> >>> 2016-11-04 18:09:52.769091 ba9c480 10 osd.3 pg_epoch: 117 pg[2.0s0( v
> >>> 72'1 (0'0,72'1] local-les=116 n=1 ec=69 les/c/f 116/105/0 113/115/84)
> >>> [3,1,9,0,6,2] r=0 lpr=115 pi=103-114/5 rops=1 crt=72'1 lcod 0'0 mlcod
> >>> 0'0 active+recovering+degraded] _failed_push: canceling recovery op for
> >>> obj 2:eb822e21:::SOMETHING:head
> >>>
> >>> And another fragment is:
> >>> 2016-11-04 17:55:00.233049 ba6ad00 10 osd.3 pg_epoch: 117 pg[2.0s0( v
> >>> 72'1 (0'0,72'1] local-les=116 n=1 ec=69 les/c/f 116/105/0 113/115/84)
> >>> [3,1,9,0,6,2] r=0 lpr=115 pi=103-114/5 rops=1 crt=72'1 lcod 0'0 mlcod
> >>> 0'0 active+recovering+degraded] handle_message:
> >>> MOSDECSubOpReadReply(2.0s0 117 ECSubReadReply(tid=14, attrs_read=0)) v1
> >>> 2016-11-04 17:55:00.233099 ba6ad00 10 osd.3 pg_epoch: 117 pg[2.0s0( v
> >>> 72'1 (0'0,72'1] local-les=116 n=1 ec=69 les/c/f 116/105/0 113/115/84)
> >>> [3,1,9,0,6,2] r=0 lpr=115 pi=103-114/5 rops=1 crt=72'1 lcod 0'0 mlcod
> >>> 0'0 active+recovering+degraded] handle_sub_read_reply: reply
> >>> ECSubReadReply(tid=14, attrs_read=0)
> >>> 2016-11-04 17:55:00.233150 ba6ad00 20 osd.3 pg_epoch: 117 pg[2.0s0( v
> >>> 72'1 (0'0,72'1] local-les=116 n=1 ec=69 les/c/f 116/105/0 113/115/84)
> >>> [3,1,9,0,6,2] r=0 lpr=115 pi=103-114/5 rops=1 crt=72'1 lcod 0'0 mlcod
> >>> 0'0 active+recovering+degraded] handle_sub_read_reply shard=9(2) error=-2
> >>> 2016-11-04 17:55:00.233200 ba6ad00 10 osd.3 pg_epoch: 117 pg[2.0s0( v
> >>> 72'1 (0'0,72'1] local-les=116 n=1 ec=69 les/c/f 116/105/0 113/115/84)
> >>> [3,1,9,0,6,2] r=0 lpr=115 pi=103-114/5 rops=1 crt=72'1 lcod 0'0 mlcod
> >>> 0'0 active+recovering+degraded] handle_sub_read_reply readop not
> >>> complete: ReadOp(tid=14,
> >>> to_read={2:eb822e21:::SOMETHING:head=read_request_t(to_read=[0,8388608,0],
> >>> need=0(3),3(0),6(4),9(2), want_attrs=1)},
> >>> complete={2:eb822e21:::SOMETHING:head=read_result_t(r=0,
> >>> errors={9(2)=-2}, noattrs, returned=(0, 8388608, [3(0),1024]))},
> >>> priority=3,
> >>> obj_to_source={2:eb822e21:::SOMETHING:head=0(3),3(0),6(4),9(2)},
> >>> source_to_obj={0(3)=2:eb822e21:::SOMETHING:head,3(0)=2:eb822e21:::SOMETHING:head,6(4)=2:eb822e21:::SOMETHING:head,9(2)=2:eb822e21:::SOMETHING:head},
> >>> in_progress=0(3),6(4))
> >>
> >> And the log of osd.9 is also full of repetitions of the following sequence:
> >> 2016-11-04 18:09:51.995641 bab2000 20 osd.9 117 share_map_peer 0xbb0c000
> >> already has epoch 117
> >> 2016-11-04 18:09:51.995848 bab2000 10 osd.9 117 dequeue_op 0xbf59180 finish
> >> 2016-11-04 18:09:52.005020 b8b3200 10 osd.9 117 handle_replica_op
> >> MOSDECSubOpRead(2.0s2 117 ECSubRead(tid=86489, to_read={2:eb822e21
> >> :::SOMETHING:head=0,2097152,0}, attrs_to_read=)) v2 epoch 117
> >> 2016-11-04 18:09:52.005149 b8b3200 20 osd.9 117 should_share_map osd.3
> >> 127.0.0.1:6813/67592 117
> >> 2016-11-04 18:09:52.005229 b8b3200 15 osd.9 117 enqueue_op 0xbf5ae40
> >> prio 3 cost 0 latency 0.000378 MOSDECSubOpRead(2.0s2 117 ECSubR
> >> ead(tid=86489, to_read={2:eb822e21:::SOMETHING:head=0,2097152,0},
> >> attrs_to_read=)) v2
> >> 2016-11-04 18:09:52.005408 bab2000 10 osd.9 117 dequeue_op 0xbf5ae40
> >> prio 3 cost 0 latency 0.000558 MOSDECSubOpRead(2.0s2 117 ECSubR
> >> ead(tid=86489, to_read={2:eb822e21:::SOMETHING:head=0,2097152,0},
> >> attrs_to_read=)) v2 pg pg[2.0s2( v 72'1 (0'0,72'1] local-les=116 n
> >> =1 ec=69 les/c/f 116/105/0 113/115/84) [3,1,9,0,6,2] r=2 lpr=115
> >> pi=69-114/22 luod=0'0 crt=0'0 lcod 0'0 active]
> >> 2016-11-04 18:09:52.005539 bab2000 10 osd.9 pg_epoch: 117 pg[2.0s2( v
> >> 72'1 (0'0,72'1] local-les=116 n=1 ec=69 les/c/f 116/105/0 113/
> >> 115/84) [3,1,9,0,6,2] r=2 lpr=115 pi=69-114/22 luod=0'0 crt=0'0 lcod 0'0
> >> active] handle_message: MOSDECSubOpRead(2.0s2 117 ECSubRead
> >> (tid=86489, to_read={2:eb822e21:::SOMETHING:head=0,2097152,0},
> >> attrs_to_read=)) v2
> >> 2016-11-04 18:09:52.005595 bab2000 10 osd.9 pg_epoch: 117 pg[2.0s2( v
> >> 72'1 (0'0,72'1] local-les=116 n=1 ec=69 les/c/f 116/105/0 113/
> >> 115/84) [3,1,9,0,6,2] r=2 lpr=115 pi=69-114/22 luod=0'0 crt=0'0 lcod 0'0
> >> active] get_hash_info: Getting attr on 2:eb822e21:::SOMETHI
> >> NG:head
> >> 2016-11-04 18:09:52.005641 bab2000 10 osd.9 pg_epoch: 117 pg[2.0s2( v
> >> 72'1 (0'0,72'1] local-les=116 n=1 ec=69 les/c/f 116/105/0 113/
> >> 115/84) [3,1,9,0,6,2] r=2 lpr=115 pi=69-114/22 luod=0'0 crt=0'0 lcod 0'0
> >> active] get_hash_info: not in cache 2:eb822e21:::SOMETHING:
> >> head
> >> 2016-11-04 18:09:52.005853 bab2000 -1 log_channel(cluster) log [ERR] :
> >> handle_sub_read: Error -2 reading 2:eb822e21:::SOMETHING:head
> >> 2016-11-04 18:09:52.005880 bab2000  5 osd.9 pg_epoch: 117 pg[2.0s2( v
> >> 72'1 (0'0,72'1] local-les=116 n=1 ec=69 les/c/f 116/105/0 113/
> >> 115/84) [3,1,9,0,6,2] r=2 lpr=115 pi=69-114/22 luod=0'0 crt=0'0 lcod 0'0
> >> active] handle_sub_read: Error -2 reading 2:eb822e21:::SOME
> >> THING:head
> >> 2016-11-04 18:09:52.005940 bab2000 20 osd.9 117 share_map_peer 0xbb0c000
> >> already has epoch 117
> >>
> >> And error -2 would read more or less:
> >> #define ENOENT          2               /* No such file or directory */
> >>
> >> And these are the SOMETHING__head files in the test:
> >> 1 -rw-r--r--  1 jenkins  jenkins  1024 Nov  4 17:53
> >> ./0/current/2.0s3_head/SOMETHING__head_847441D7__2_ffffffffffffffff_3
> >> 1 -rw-r--r--  1 jenkins  jenkins  1024 Nov  4 17:53
> >> ./2/current/2.0s5_head/SOMETHING__head_847441D7__2_ffffffffffffffff_5
> >> 1 -rw-r--r--  1 jenkins  jenkins  1024 Nov  4 17:54
> >> ./3/current/2.0s0_head/SOMETHING__head_847441D7__2_ffffffffffffffff_0
> >> 1 -rw-r--r--  1 jenkins  jenkins  1024 Nov  4 17:53
> >> ./6/current/2.0s4_head/SOMETHING__head_847441D7__2_ffffffffffffffff_4
> >>
> >> Perhaps this 847441D7 should read EB822E21 ??
> >> And is this the reason of the read error?
> > 
> > Could be!  Note though that the ghobject_t operator<< string you see in 
> > the log (2:eb822e21:::SOMETHING:head) prints the hex in reversed bitwise 
> > order, while the string in the FileStore filename is reversed nibble form.  
> > You can convert between the two bit reversing the bits in each hex 
> > digit/nibble, though, and those don't appear to match.
> 
> Ehhh, that calls for a proper print operator ... and for adding that form
> as extra info with this error. It would make comparing and debugging a lot
> easier, and match what is actually stored in the FS.

The bitwise view is how they sort logically, which is why operator<< was 
changed (and what the 'sortbitwise' cluster flag is all about).  But the 
FileStore file naming is legacy and can't easily be changed, 
unfortunately.
 
> > In any case, can you post the log?  ceph-post-file <filename> then send 
> > the uuid.
> 
> Which logs would you like to have?
> osd.3.log is 7G and osd.9.log is 600Mb. I do have the bandwidth to
> upload, but especially 7G is big.

Let's start with osd.9

s


* Re: trying to get osd-scrub-repair to work on my jenkins builder
  2016-11-04 17:09 trying to get osd-scrub-repair to work on my jenkins builder Willem Jan Withagen
  2016-11-04 19:39 ` David Zafman
@ 2016-11-07 12:40 ` Willem Jan Withagen
  2016-11-08 23:01   ` David Zafman
  1 sibling, 1 reply; 10+ messages in thread
From: Willem Jan Withagen @ 2016-11-07 12:40 UTC (permalink / raw)
  To: Ceph Development

On 4-11-2016 18:09, Willem Jan Withagen wrote:
> Hi,
> 
> On my workstation I have this test completing just fine. But on my
> Jenkins-builder it keeps running into this state where it does not make
> any progress.
> Any particulars I should look for? I can let this run for an hour, but
> pg 2.0 stays active+degraded, and the script requires it to be clean,
> and the pgmap version is steadily incrementing.

And as a side-track to this: it is not where things go wrong, but it is
not right either...

ceph -c ./build/src/test/osd/testdir/osd-scrub-repair/ceph.conf --format
xml status 2>/dev/null | \
        xml sel \
        -t -m "//pgmap/recovering_keys_per_sec" -v . -o ' ' \
        -t -m "//pgmap/recovering_bytes_per_sec" -v . -o ' ' \
        -t -m "//pgmap/recovering_objects_per_sec" -v .)
    test -n "$progress"

Does not seem to work because the output does not contain any of the xml
fields...
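
To illustrate what the selection does, here is a small mock-up (the
sample XML documents are invented; only the field names come from the
XPath expressions above). When no recovery is active, the recovering_*
fields are simply absent from the pgmap, so the concatenated selection
comes out empty and `test -n "$progress"` fails:

```python
import xml.etree.ElementTree as ET

# Stand-ins for `ceph status --format xml` output: the recovering_*
# rate fields only appear in the pgmap while recovery is in progress.
idle = "<status><pgmap><num_pgs>5</num_pgs></pgmap></status>"
busy = ("<status><pgmap>"
        "<recovering_objects_per_sec>12</recovering_objects_per_sec>"
        "</pgmap></status>")

FIELDS = ("recovering_keys_per_sec",
          "recovering_bytes_per_sec",
          "recovering_objects_per_sec")

def progress_fields(xml_text):
    # Mirror the `xml sel -t -m //pgmap/... -v .` selection: take the
    # text of each field that is present, skip the ones that are not.
    root = ET.fromstring(xml_text)
    return [elem.text for f in FIELDS
            for elem in root.findall(f"./pgmap/{f}")]

print(progress_fields(idle))  # []
print(progress_fields(busy))  # ['12']
```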
Now I've been beating myself over the head, but I cannot find the
request that does have these fields.

So a suggestion is welcome, and I'll submit a patch.

thanx,
--WjW



* Re: trying to get osd-scrub-repair to work on my jenkins builder
  2016-11-07 12:40 ` Willem Jan Withagen
@ 2016-11-08 23:01   ` David Zafman
  2016-11-08 23:21     ` Willem Jan Withagen
  0 siblings, 1 reply; 10+ messages in thread
From: David Zafman @ 2016-11-08 23:01 UTC (permalink / raw)
  To: Willem Jan Withagen, Ceph Development


Willem,

At first I thought the fields were missing but I checked the xml output 
right there in get_is_making_recovery_progress() and the "xml sel" did 
see those fields while recovery was in progress.

David


On 11/7/16 4:40 AM, Willem Jan Withagen wrote:
> On 4-11-2016 18:09, Willem Jan Withagen wrote:
>> Hi,
>>
>> On my workstation I have this test completing just fine. But on my
>> Jenkins-builder it keeps running into this state where it does not make
>> any progress.
>> Any particulars I should look for? I can let this run for an hour, but
>> pg 2.0 stays active+degraded, and the script requires it to be clean,
>> and the pgmap version is steadily incrementing.
> And as a side-track to this: it is not where things go wrong, but it is
> not right either...
>
> ceph -c ./build/src/test/osd/testdir/osd-scrub-repair/ceph.conf --format
> xml status 2>/dev/null | \
>          xml sel \
>          -t -m "//pgmap/recovering_keys_per_sec" -v . -o ' ' \
>          -t -m "//pgmap/recovering_bytes_per_sec" -v . -o ' ' \
>          -t -m "//pgmap/recovering_objects_per_sec" -v .)
>      test -n "$progress"
>
> Does not seem to work because the output does not contain any of the xml
> fields...
> Now I've been beating myself over the head, but I cannot find the
> request that does have these fields.
>
> So a suggestion is welcome, and I'll submit a patch.
>
> thanx,
> --WjW
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



* Re: trying to get osd-scrub-repair to work on my jenkins builder
  2016-11-08 23:01   ` David Zafman
@ 2016-11-08 23:21     ` Willem Jan Withagen
  0 siblings, 0 replies; 10+ messages in thread
From: Willem Jan Withagen @ 2016-11-08 23:21 UTC (permalink / raw)
  To: David Zafman, Ceph Development

On 9-11-2016 00:01, David Zafman wrote:
> 
> Willem,
> 
> At first I thought the fields were missing but I checked the xml output
> right there in get_is_making_recovery_progress() and the "xml sel" did
> see those fields while recovery was in progress.

Okay, ...

That leaves a challenge: I'll have to review the output on FreeBSD
again and see where, what, and how it works there.

--WjW

> 
> David
> 
> 
> On 11/7/16 4:40 AM, Willem Jan Withagen wrote:
>> On 4-11-2016 18:09, Willem Jan Withagen wrote:
>>> Hi,
>>>
>>> On my workstation I have this test completing just fine. But on my
>>> Jenkins-builder it keeps running into this state where it does not make
>>> any progress.
>>> Any particulars I should look for? I can let this run for an hour, but
>>> pg 2.0 stays active+degraded, and the script requires it to be clean,
>>> and the pgmap version is steadily incrementing.
>> And as a side-track to this: it is not where things go wrong, but it is
>> not right either...
>>
>> ceph -c ./build/src/test/osd/testdir/osd-scrub-repair/ceph.conf --format
>> xml status 2>/dev/null | \
>>          xml sel \
>>          -t -m "//pgmap/recovering_keys_per_sec" -v . -o ' ' \
>>          -t -m "//pgmap/recovering_bytes_per_sec" -v . -o ' ' \
>>          -t -m "//pgmap/recovering_objects_per_sec" -v .)
>>      test -n "$progress"
>>
>> Does not seem to work because the output does not contain any of the xml
>> fields...
>> Now I've been beating myself over the head, but I cannot find the
>> request that does have these fields.
>>
>> So a suggestion is welcome, and I'll submit a patch.
>>
>> thanx,
>> --WjW
>>
> 



end of thread (newest: 2016-11-08 23:21 UTC)

Thread overview: 10+ messages
2016-11-04 17:09 trying to get osd-scrub-repair to work on my jenkins builder Willem Jan Withagen
2016-11-04 19:39 ` David Zafman
2016-11-04 20:09   ` Willem Jan Withagen
2016-11-04 20:52     ` Willem Jan Withagen
2016-11-04 21:07       ` Sage Weil
2016-11-04 21:13         ` Willem Jan Withagen
2016-11-04 21:19           ` Sage Weil
2016-11-07 12:40 ` Willem Jan Withagen
2016-11-08 23:01   ` David Zafman
2016-11-08 23:21     ` Willem Jan Withagen
