Upgrade from Jewel to Luminous. REQUIRE_JEWEL OSDMap

* Upgrade from Jewel to Luminous. REQUIRE_JEWEL OSDMap
@ 2017-11-28  2:46 Cary
  2017-11-28  3:09 ` Sage Weil
  0 siblings, 1 reply; 12+ messages in thread
From: Cary @ 2017-11-28  2:46 UTC (permalink / raw)
  To: ceph-devel

Hello,

 Could someone please help me complete my botched upgrade from Jewel
10.2.3-r1 to Luminous 12.2.1. I have 9 Gentoo servers, 4 of which have
2 OSDs each.

 My OSD servers were accidentally rebooted before the monitor servers
causing them to be running Luminous before the monitors. All services
have been restarted and running ceph versions gives the following:

# ceph versions
2017-11-27 21:27:24.356940 7fed67efe700 -1 WARNING: the following
dangerous and experimental features are enabled: btrfs
2017-11-27 21:27:24.368469 7fed67efe700 -1 WARNING: the following
dangerous and experimental features are enabled: btrfs

    "mon": {
        "ceph version 12.2.1
(3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)": 4
    },
    "mgr": {
        "ceph version 12.2.1
(3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)": 3
    },
    "osd": {},
    "mds": {
        "ceph version 12.2.1
(3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)": 1
    },
    "overall": {
        "ceph version 12.2.1
(3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)": 8

For some reason the OSDs do not show what version they are running,
and a ceph osd tree shows all of the OSD as being down.

 # ceph osd tree
2017-11-27 21:32:51.969335 7f483d9c2700 -1 WARNING: the following
dangerous and experimental features are enabled: btrfs
2017-11-27 21:32:51.980976 7f483d9c2700 -1 WARNING: the following
dangerous and experimental features are enabled: btrfs
ID CLASS WEIGHT   TYPE NAME              STATUS REWEIGHT PRI-AFF
-1       27.77998 root default
-3       27.77998     datacenter DC1
-6       27.77998         rack 1B06
-5        6.48000             host ceph3
 1        1.84000                 osd.1    down        0 1.00000
 3        4.64000                 osd.3    down        0 1.00000
-2        5.53999             host ceph4
 5        4.64000                 osd.5    down        0 1.00000
 8        0.89999                 osd.8    down        0 1.00000
-4        9.28000             host ceph6
 0        4.64000                 osd.0    down        0 1.00000
 2        4.64000                 osd.2    down        0 1.00000
-7        6.48000             host ceph7
 6        4.64000                 osd.6    down        0 1.00000
 7        1.84000                 osd.7    down        0 1.00000

The OSD logs all have this message:

20235 osdmap REQUIRE_JEWEL OSDMap flag is NOT set; please set it.

When I try to set it with "ceph osd set require_jewel_osds" I get this error:

Error EPERM: not all up OSDs have CEPH_FEATURE_SERVER_JEWEL feature

A "ceph features" returns:

    "mon": {
        "group": {
            "features": "0x1ffddff8eea4fffb",
            "release": "luminous",
            "num": 4
        }
    },
    "mds": {
        "group": {
            "features": "0x1ffddff8eea4fffb",
            "release": "luminous",
            "num": 1
        }
    },
    "osd": {
        "group": {
            "features": "0x1ffddff8eea4fffb",
            "release": "luminous",
            "num": 8
        }
    },
    "client": {
        "group": {
            "features": "0x1ffddff8eea4fffb",
            "release": "luminous",
            "num": 3

 # ceph tell osd.* versions
2017-11-28 02:29:28.565943 7f99c6aee700 -1 WARNING: the following
dangerous and experimental features are enabled: btrfs
2017-11-28 02:29:28.578956 7f99c6aee700 -1 WARNING: the following
dangerous and experimental features are enabled: btrfs
Error ENXIO: problem getting command descriptions from osd.0
osd.0: problem getting command descriptions from osd.0
Error ENXIO: problem getting command descriptions from osd.1
osd.1: problem getting command descriptions from osd.1
Error ENXIO: problem getting command descriptions from osd.2
osd.2: problem getting command descriptions from osd.2
Error ENXIO: problem getting command descriptions from osd.3
osd.3: problem getting command descriptions from osd.3
Error ENXIO: problem getting command descriptions from osd.5
osd.5: problem getting command descriptions from osd.5
Error ENXIO: problem getting command descriptions from osd.6
osd.6: problem getting command descriptions from osd.6
Error ENXIO: problem getting command descriptions from osd.7
osd.7: problem getting command descriptions from osd.7
Error ENXIO: problem getting command descriptions from osd.8
osd.8: problem getting command descriptions from osd.8

 # ceph daemon osd.1 status

    "cluster_fsid": "CENSORED",
    "osd_fsid": "CENSORED",
    "whoami": 1,
    "state": "preboot",
    "oldest_map": 19482,
    "newest_map": 20235,
    "num_pgs": 141

 # ceph -s
2017-11-27 22:04:10.372471 7f89a3935700 -1 WARNING: the following
dangerous and experimental features are enabled: btrfs
2017-11-27 22:04:10.375709 7f89a3935700 -1 WARNING: the following
dangerous and experimental features are enabled: btrfs
  cluster:
    id:     CENSORED
    health: HEALTH_ERR
            513 pgs are stuck inactive for more than 60 seconds
            126 pgs backfill_wait
            52 pgs backfilling
            435 pgs degraded
            513 pgs stale
            435 pgs stuck degraded
            513 pgs stuck stale
            435 pgs stuck unclean
            435 pgs stuck undersized
            435 pgs undersized
            recovery 854719/3688140 objects degraded (23.175%)
            recovery 838607/3688140 objects misplaced (22.738%)
            mds cluster is degraded
            crush map has straw_calc_version=0

  services:
    mon: 4 daemons, quorum 0,1,3,2
    mgr: 0(active), standbys: 1, 5
    mds: cephfs-1/1/1 up  {0=a=up:replay}, 1 up:standby
    osd: 8 osds: 0 up, 0 in

  data:
    pools:   7 pools, 513 pgs
    objects: 1199k objects, 4510 GB
    usage:   13669 GB used, 15150 GB / 28876 GB avail
    pgs:     854719/3688140 objects degraded (23.175%)
             838607/3688140 objects misplaced (22.738%)
             257 stale+active+undersized+degraded
             126 stale+active+undersized+degraded+remapped+backfill_wait
             78  stale+active+clean
             52  stale+active+undersized+degraded+remapped+backfilling

I ran "ceph auth list", and client.admin has the following permissions.
auid: 0
caps: [mds] allow
caps: [mgr] allow *
caps: [mon] allow *
caps: [osd] allow *

Thank you for your time.

Is there any way I can get these OSDs to join the cluster now, or
recover my data?

Cary
-Dynamic

^ permalink raw reply	[flat|nested] 12+ messages in thread