* Assembling journaled array fails
@ 2020-05-09 10:54 Michal Soltys
  2020-05-09 11:32 ` Michal Soltys
                   ` (2 more replies)
  0 siblings, 3 replies; 46+ messages in thread
From: Michal Soltys @ 2020-05-09 10:54 UTC (permalink / raw)
  To: linux-raid

It looks like I now have another issue with a journaled array, which now 
fails to assemble - whatever command is used to assemble it (whether 
mdadm --incremental called by udev, or mdadm run manually from the 
command line) ends up hung and unkillable.

The array was previously running with a different journal, which was 
later replaced with better disks. The journal device itself is healthy 
and assembles fine. Below is the state during assembly (the dump was 
taken while mdadm, invoked via udev, was doing it):


mdadm -D /dev/md126
/dev/md126:
            Version : 1.1
      Creation Time : Tue Mar  5 19:28:58 2019
         Raid Level : raid5
      Used Dev Size : 18446744073709551615
       Raid Devices : 4
      Total Devices : 5
        Persistence : Superblock is persistent

        Update Time : Sat May  9 09:34:58 2020
              State : active, FAILED, Not Started
     Active Devices : 4
    Working Devices : 5
     Failed Devices : 0
      Spare Devices : 0

             Layout : left-symmetric
         Chunk Size : 512K

Consistency Policy : journal

               Name : xs22:r5_big  (local to host xs22)
               UUID : d5995d76:67d7fabd:05392f87:25a91a97
             Events : 56283

     Number   Major   Minor   RaidDevice State
        -       0        0        0      removed
        -       0        0        1      removed
        -       0        0        2      removed
        -       0        0        3      removed

        -       8      145        3      sync   /dev/sdj1
        -       8      129        2      sync   /dev/sdi1
        -       9      127        -      spare   /dev/md/xs22:r1_journal_big
        -       8      113        1      sync   /dev/sdh1
        -       8       97        0      sync   /dev/sdg1



mdadm -Esv:

ARRAY /dev/md/r1_journal_big  level=raid1 metadata=1.1 num-devices=2 
UUID=c18b3996:49b21f49:3ee18e92:a8240f7f name=xs22:r1_journal_big
    devices=/dev/sdv1,/dev/sdu1
ARRAY /dev/md/r5_big  level=raid5 metadata=1.1 num-devices=4 
UUID=d5995d76:67d7fabd:05392f87:25a91a97 name=xs22:r5_big
    spares=1 
devices=/dev/md/xs22:r1_journal_big,/dev/sdg1,/dev/sdh1,/dev/sdj1,/dev/sdi1


When I was changing the journal some time ago (successfully - everything 
was working fine afterwards), this is what I did:

echo write-through > /sys/block/md126/md/journal_mode
mdadm /dev/md126 --fail /dev/md/r1_journal_big --remove /dev/md/r1_journal_big
echo resync > /sys/block/md126/md/consistency_policy

# here I created the new journal from the new disks, under the same name

mdadm --readonly /dev/md126
mdadm /dev/md126 --add-journal /dev/md/r1_journal_big
echo write-back > /sys/block/md126/md/journal_mode
# At this point mdadm confirmed everything went fine
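
For completeness, this is how the result can be double-checked afterwards 
- a sketch only, assuming the usual md sysfs layout under 
/sys/block/md126/md:

cat /sys/block/md126/md/consistency_policy   # should show: journal
cat /sys/block/md126/md/journal_mode         # e.g.: write-through [write-back]
mdadm -D /dev/md126 | grep 'Consistency Policy'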

After the reboot I'm in the situation outlined above. Any suggestions on 
how to assemble this array now?


* Re: Assembling journaled array fails
  2020-05-09 10:54 Assembling journaled array fails Michal Soltys
@ 2020-05-09 11:32 ` Michal Soltys
  2020-05-09 13:15 ` antlists
  2020-05-09 23:57 ` Michal Soltys
  2 siblings, 0 replies; 46+ messages in thread
From: Michal Soltys @ 2020-05-09 11:32 UTC (permalink / raw)
  To: linux-raid

On 20/05/09 12:54, Michal Soltys wrote:
> # At this point mdadm confirmed everything went fine
> 
> After the reboot I'm in the situation outlined above. Any suggestions 
> on how to assemble this array now?

To expand on this: "after reboot" means a few days later - the array was 
operating fine during that time.

Attempting to assemble the array manually from early initramfs (before 
any udev etc., with manually loaded modules), e.g.:

mdadm -A /dev/md/r5_big /dev/sd[efgh]1 /dev/md/r1_journal_big

shows:

md/raid:md126: device sde1 operational as raid disk 0
md/raid:md126: device sdh1 operational as raid disk 3
md/raid:md126: device sdg1 operational as raid disk 2
md/raid:md126: device sdf1 operational as raid disk 1
md/raid:md126: raid level 5 active with 4 out of 4 devices, algorithm 2

But then mdadm is left hanging for good.


* Re: Assembling journaled array fails
  2020-05-09 10:54 Assembling journaled array fails Michal Soltys
  2020-05-09 11:32 ` Michal Soltys
@ 2020-05-09 13:15 ` antlists
  2020-05-09 13:20   ` Michal Soltys
  2020-05-09 23:57 ` Michal Soltys
  2 siblings, 1 reply; 46+ messages in thread
From: antlists @ 2020-05-09 13:15 UTC (permalink / raw)
  To: Michal Soltys, linux-raid

On 09/05/2020 11:54, Michal Soltys wrote:
> After the reboot I'm in the situation outlined above. Any suggestions 
> on how to assemble this array now?

mdadm -v ?

uname -a ?

https://raid.wiki.kernel.org/index.php/Asking_for_help

Dunno what the problem is, and I'm hesitant to recommend it, but 
re-assembling the array without a journal might work.

You have made sure that bitmaps aren't enabled? You can't have both, but 
for historical reasons you can end up with both enabled and then the 
resulting array won't start.

Cheers,
Wol


* Re: Assembling journaled array fails
  2020-05-09 13:15 ` antlists
@ 2020-05-09 13:20   ` Michal Soltys
  2020-05-09 13:32     ` Michal Soltys
  0 siblings, 1 reply; 46+ messages in thread
From: Michal Soltys @ 2020-05-09 13:20 UTC (permalink / raw)
  To: antlists, linux-raid

On 20/05/09 15:15, antlists wrote:
> On 09/05/2020 11:54, Michal Soltys wrote:
>> After the reboot I'm in the situation outlined above. Any 
>> suggestions on how to assemble this array now?
> 
> mdadm -v ?

mdadm - v4.1 - 2018-10-01

> 
> uname -a ?

Linux xs22 5.4.0-0.bpo.4-amd64 #1 SMP Debian 5.4.19-1~bpo10+1 
(2020-03-09) x86_64 GNU/Linux

> 
> https://raid.wiki.kernel.org/index.php/Asking_for_help
> 
> Dunno what the problem is, and I'm hesitant to recommend it, but 
> re-assembling the array without a journal might work.

I would - if possible - want to avoid this, as the journal is 
write-back. So as a last resort only.

> 
> You have made sure that bitmaps aren't enabled? You can't have both, but 
> for historical reasons you can end up with both enabled and then the 
> resulting array won't start.
> 

Yea, I know - the bitmap is not enabled on this array.


* Re: Assembling journaled array fails
  2020-05-09 13:20   ` Michal Soltys
@ 2020-05-09 13:32     ` Michal Soltys
  0 siblings, 0 replies; 46+ messages in thread
From: Michal Soltys @ 2020-05-09 13:32 UTC (permalink / raw)
  To: antlists, linux-raid

On 20/05/09 15:20, Michal Soltys wrote:
> On 20/05/09 15:15, antlists wrote:
>> On 09/05/2020 11:54, Michal Soltys wrote:
>>> After the reboot I'm in the situation outlined above. Any 
>>> suggestions on how to assemble this array now?
>>
>> mdadm -v ?
> 
> mdadm - v4.1 - 2018-10-01
> 
>>
>> uname -a ?
> 
> Linux xs22 5.4.0-0.bpo.4-amd64 #1 SMP Debian 5.4.19-1~bpo10+1 
> (2020-03-09) x86_64 GNU/Linux
> 
>>
>> https://raid.wiki.kernel.org/index.php/Asking_for_help
>>
>> Dunno what the problem is, and I'm hesitant to recommend it, but 
>> re-assembling the array without a journal might work.
> 
> I would - if possible - want to avoid this, as the journal is 
> write-back. So as a last resort only.
> 

Looking at the logs from around the reboot, the machine might have 
suffered a hard power loss (despite all the UPS processes apparently 
working - but I'll be checking that later).

The things that are really weird so far:

1) the size reported via mdadm -D:

mdadm -D /dev/md126
/dev/md126:
            Version : 1.1
      Creation Time : Tue Mar  5 19:28:58 2019
         Raid Level : raid5
      Used Dev Size : 18446744073709551615

This is clearly incorrect.

2) the mdadm --incremental process is in R state eating a full CPU, but 
otherwise cannot be terminated or killed (so it's sitting on some lock?)

root       403     1  0  2744  3392  33 15:05 ?        S      0:01 
/lib/systemd/systemd-udevd --daemon --resolve-names=never
root       571   403 98   740  1976  13 15:05 ?        R     19:07  \_ 
/sbin/mdadm --incremental --export /dev/md125 --offroot 
/dev/disk/by-id/md-name-xs22:r1_journal_big 
/dev/disk/by-id/md-uuid-c18b3996:49b21f49:3ee18e92:a8240f7f 
/dev/md/r1_journal_big

This is probably the same situation as with the manual mdadm -A that I 
tried from initramfs before udevd.

3) The 4 removed lines in mdadm -D output

     Number   Major   Minor   RaidDevice State
        -       0        0        0      removed
        -       0        0        1      removed
        -       0        0        2      removed
        -       0        0        3      removed

No device was ever removed from this array besides the old journal, as I 
wrote in the first mail.

4) The journal device is listed as a spare instead of a journal

        -       8      145        3      sync   /dev/sdj1
        -       8      129        2      sync   /dev/sdi1
        -       8      113        1      sync   /dev/sdh1
        -       8       97        0      sync   /dev/sdg1
        -       9      125        -      spare   /dev/md/r1_journal_big


All the mdadm -Ev output of each component device looks fine though:

/dev/sdj1:
           Magic : a92b4efc
         Version : 1.1
     Feature Map : 0x200
      Array UUID : d5995d76:67d7fabd:05392f87:25a91a97
            Name : xs22:r5_big  (local to host xs22)
   Creation Time : Tue Mar  5 19:28:58 2019
      Raid Level : raid5
    Raid Devices : 4

  Avail Dev Size : 7812237856 (3725.17 GiB 3999.87 GB)
      Array Size : 11718355968 (11175.50 GiB 11999.60 GB)
   Used Dev Size : 7812237312 (3725.17 GiB 3999.87 GB)
     Data Offset : 262144 sectors
    Super Offset : 0 sectors
    Unused Space : before=262064 sectors, after=544 sectors
           State : clean
     Device UUID : 5a73a457:b16dd591:90eceb56:2f8e1cb3

     Update Time : Sat May  9 15:05:22 2020
   Bad Block Log : 512 entries available at offset 72 sectors
        Checksum : dfa92cad - correct
          Events : 56289

          Layout : left-symmetric
      Chunk Size : 512K

    Device Role : Active device 3
    Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)

/dev/sdi1:
           Magic : a92b4efc
         Version : 1.1
     Feature Map : 0x200
      Array UUID : d5995d76:67d7fabd:05392f87:25a91a97
            Name : xs22:r5_big  (local to host xs22)
   Creation Time : Tue Mar  5 19:28:58 2019
      Raid Level : raid5
    Raid Devices : 4

  Avail Dev Size : 7812237856 (3725.17 GiB 3999.87 GB)
      Array Size : 11718355968 (11175.50 GiB 11999.60 GB)
   Used Dev Size : 7812237312 (3725.17 GiB 3999.87 GB)
     Data Offset : 262144 sectors
    Super Offset : 0 sectors
    Unused Space : before=262064 sectors, after=544 sectors
           State : clean
     Device UUID : f933e8b6:5411d560:0467cb10:95971e4b

     Update Time : Sat May  9 15:05:22 2020
   Bad Block Log : 512 entries available at offset 72 sectors
        Checksum : 5fce14c7 - correct
          Events : 56289

          Layout : left-symmetric
      Chunk Size : 512K

    Device Role : Active device 2
    Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)

/dev/sdh1:
           Magic : a92b4efc
         Version : 1.1
     Feature Map : 0x200
      Array UUID : d5995d76:67d7fabd:05392f87:25a91a97
            Name : xs22:r5_big  (local to host xs22)
   Creation Time : Tue Mar  5 19:28:58 2019
      Raid Level : raid5
    Raid Devices : 4

  Avail Dev Size : 7812237856 (3725.17 GiB 3999.87 GB)
      Array Size : 11718355968 (11175.50 GiB 11999.60 GB)
   Used Dev Size : 7812237312 (3725.17 GiB 3999.87 GB)
     Data Offset : 262144 sectors
    Super Offset : 0 sectors
    Unused Space : before=262064 sectors, after=544 sectors
           State : clean
     Device UUID : 8ea58384:43683121:d1017eed:1f565786

     Update Time : Sat May  9 15:05:22 2020
   Bad Block Log : 512 entries available at offset 72 sectors
        Checksum : 5b136a2 - correct
          Events : 56289

          Layout : left-symmetric
      Chunk Size : 512K

    Device Role : Active device 1
    Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)

/dev/sdg1:
           Magic : a92b4efc
         Version : 1.1
     Feature Map : 0x200
      Array UUID : d5995d76:67d7fabd:05392f87:25a91a97
            Name : xs22:r5_big  (local to host xs22)
   Creation Time : Tue Mar  5 19:28:58 2019
      Raid Level : raid5
    Raid Devices : 4

  Avail Dev Size : 7812237856 (3725.17 GiB 3999.87 GB)
      Array Size : 11718355968 (11175.50 GiB 11999.60 GB)
   Used Dev Size : 7812237312 (3725.17 GiB 3999.87 GB)
     Data Offset : 262144 sectors
    Super Offset : 0 sectors
    Unused Space : before=262064 sectors, after=544 sectors
           State : clean
     Device UUID : e38f2cd0:b9716e4e:b7fc0631:0a4e352d

     Update Time : Sat May  9 15:05:22 2020
   Bad Block Log : 512 entries available at offset 72 sectors
        Checksum : 68fe1d3c - correct
          Events : 56289

          Layout : left-symmetric
      Chunk Size : 512K

    Device Role : Active device 0
    Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)


And finally the journal seems fine as well:

/dev/md/r1_journal_big:
           Magic : a92b4efc
         Version : 1.1
     Feature Map : 0x200
      Array UUID : d5995d76:67d7fabd:05392f87:25a91a97
            Name : xs22:r5_big  (local to host xs22)
   Creation Time : Tue Mar  5 19:28:58 2019
      Raid Level : raid5
    Raid Devices : 4

  Avail Dev Size : 536344576 (255.75 GiB 274.61 GB)
      Array Size : 11718355968 (11175.50 GiB 11999.60 GB)
   Used Dev Size : 7812237312 (3725.17 GiB 3999.87 GB)
     Data Offset : 262144 sectors
    Super Offset : 0 sectors
    Unused Space : before=261872 sectors, after=0 sectors
           State : clean
     Device UUID : c3a6f2f6:7dd26b0c:08a31ad7:cc8ed2a9

     Update Time : Sat May  9 15:05:22 2020
   Bad Block Log : 512 entries available at offset 264 sectors
        Checksum : c854904f - correct
          Events : 56289

          Layout : left-symmetric
      Chunk Size : 512K

    Device Role : Journal
    Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)


* Re: Assembling journaled array fails
  2020-05-09 10:54 Assembling journaled array fails Michal Soltys
  2020-05-09 11:32 ` Michal Soltys
  2020-05-09 13:15 ` antlists
@ 2020-05-09 23:57 ` Michal Soltys
  2020-05-11 11:11   ` Michal Soltys
  2 siblings, 1 reply; 46+ messages in thread
From: Michal Soltys @ 2020-05-09 23:57 UTC (permalink / raw)
  To: linux-raid

Anyway, I did some tests with manually snapshotted component devices 
(using dm snapshot target to not touch underlying devices).
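
For reference, the snapshots were set up roughly like this - a sketch 
only; the COW file sizes and exact paths here are assumptions rather than 
the literal commands used:

# writable dm snapshot over a component; all writes land in a throwaway
# COW file instead of the real disk
truncate -s 40G /var/tmp/cow_sdg1.img
cow=$(losetup -f --show /var/tmp/cow_sdg1.img)
size=$(blockdev --getsz /dev/sdg1)
# table: <start> <length> snapshot <origin> <cow dev> <N = non-persistent> <chunk sectors>
dmsetup create snap_sdg1 --table "0 $size snapshot /dev/sdg1 $cow N 8"
# ...repeated for the other components and the journal, then assembling
# against /dev/mapper/snap_*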

The raid manages to force assemble in read-only mode with missing 
journal device, so we probably will be able to recover most data 
underneath this way (as a last resort).
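
Roughly, i.e. something along these lines (the exact options and device 
list are partly assumed here):

mdadm -A --force --readonly /dev/md/r5_big \
      /dev/mapper/snap_sdg1 /dev/mapper/snap_sdh1 \
      /dev/mapper/snap_sdi1 /dev/mapper/snap_sdj1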

The situation I'm in now is likely from an unclean shutdown after all (why 
the machine failed to react to ups properly is another subject).

I'd still want to find out why the journal device is - apparently - the 
one giving issues (contrary to what I'd expect it to do ...), in particular:

1) mdadm hangs (unkillable, so I presume in kernel somewhere) and eats 1 
cpu when trying to assemble the raid with journal device present; once 
it happens I can't do anything with the array (stop, run, etc.) and can 
only reboot the server to "fix" that

2) mdadm -D shows nonsensical device size after assembly attempt (Used 
Dev Size : 18446744073709551615)

3) the journal device (which itself is md raid1 consisting of 2 ssds) 
assembles, checks (0 mismatch_cnt) fine - and overall looks ok.
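
The check in 3) was essentially this - a sketch, with md125 assumed to be 
the raid1 behind /dev/md/r1_journal_big:

echo check > /sys/block/md125/md/sync_action
cat /proc/mdstat                       # wait for the check to finish
cat /sys/block/md125/md/mismatch_cnt   # reads 0 here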


 From other interesting things, I also attempted to assemble the raid 
with snapshotted journal. From what I can see it does attempt to do 
something, judging from:

dmsetup status:

snap_jo2: 0 536870912 snapshot 40/33554432 16
snap_sdi1: 0 7812500000 snapshot 25768/83886080 112
snap_jo1: 0 536870912 snapshot 40/33554432 16
snap_sdg1: 0 7812500000 snapshot 25456/83886080 112
snap_sdj1: 0 7812500000 snapshot 25928/83886080 112
snap_sdh1: 0 7812500000 snapshot 25352/83886080 112

But it doesn't move from those values (with mdadm doing nothing eating 
100% cpu as mentioned earlier).


Any suggestions on how to proceed would be very appreciated.


* Re: Assembling journaled array fails
  2020-05-09 23:57 ` Michal Soltys
@ 2020-05-11 11:11   ` Michal Soltys
  2020-05-12  1:27     ` Song Liu
  0 siblings, 1 reply; 46+ messages in thread
From: Michal Soltys @ 2020-05-11 11:11 UTC (permalink / raw)
  To: linux-raid; +Cc: song

On 5/10/20 1:57 AM, Michal Soltys wrote:
> Anyway, I did some tests with manually snapshotted component devices 
> (using dm snapshot target to not touch underlying devices).
> 
> The raid manages to force assemble in read-only mode with missing 
> journal device, so we probably will be able to recover most data 
> underneath this way (as a last resort).
> 
> The situation I'm in now is likely from an unclean shutdown after all (why 
> the machine failed to react to ups properly is another subject).
> 
> I'd still want to find out why the journal device is - apparently - the 
> one giving issues (contrary to what I'd expect it to do ...), in particular:
> 
> 1) mdadm hangs (unkillable, so I presume in kernel somewhere) and eats 1 
> cpu when trying to assemble the raid with journal device present; once 
> it happens I can't do anything with the array (stop, run, etc.) and can 
> only reboot the server to "fix" that
> 
> 2) mdadm -D shows nonsensical device size after assembly attempt (Used 
> Dev Size : 18446744073709551615)
> 
> 3) the journal device (which itself is md raid1 consisting of 2 ssds) 
> assembles, checks (0 mismatch_cnt) fine - and overall looks ok.
> 
> 
>  From other interesting things, I also attempted to assemble the raid 
> with snapshotted journal. From what I can see it does attempt to do 
> something, judging from:
> 
> dmsetup status:
> 
> snap_jo2: 0 536870912 snapshot 40/33554432 16
> snap_sdi1: 0 7812500000 snapshot 25768/83886080 112
> snap_jo1: 0 536870912 snapshot 40/33554432 16
> snap_sdg1: 0 7812500000 snapshot 25456/83886080 112
> snap_sdj1: 0 7812500000 snapshot 25928/83886080 112
> snap_sdh1: 0 7812500000 snapshot 25352/83886080 112
> 
> But it doesn't move from those values (with mdadm doing nothing eating 
> 100% cpu as mentioned earlier).
> 
> 
> Any suggestions on how to proceed would be very appreciated.


I've added Song to the CC, in case you have any suggestions on how to 
proceed/debug this (mdadm is stuck somewhere in the kernel as far as I 
can see, while attempting to assemble the array).

For the record, I can assemble the raid successfully w/o journal (using 
snapshotted component devices as above), and we did recover some stuff 
this way from some filesystems - but for some other ones I'd like to 
keep that option as the very last resort.


* Re: Assembling journaled array fails
  2020-05-11 11:11   ` Michal Soltys
@ 2020-05-12  1:27     ` Song Liu
  2020-05-13 10:58       ` Michal Soltys
  0 siblings, 1 reply; 46+ messages in thread
From: Song Liu @ 2020-05-12  1:27 UTC (permalink / raw)
  To: Michal Soltys; +Cc: linux-raid

On Mon, May 11, 2020 at 4:13 AM Michal Soltys <msoltyspl@yandex.pl> wrote:
>
> On 5/10/20 1:57 AM, Michal Soltys wrote:
> > Anyway, I did some tests with manually snapshotted component devices
> > (using dm snapshot target to not touch underlying devices).
> >
> > The raid manages to force assemble in read-only mode with missing
> > journal device, so we probably will be able to recover most data
> > underneath this way (as a last resort).
> >
> > The situation I'm in now is likely from an unclean shutdown after all (why
> > the machine failed to react to ups properly is another subject).
> >
> > I'd still want to find out why the journal device is - apparently - the
> > one giving issues (contrary to what I'd expect it to do ...), in particular:
> >
> > 1) mdadm hangs (unkillable, so I presume in kernel somewhere) and eats 1
> > cpu when trying to assemble the raid with journal device present; once
> > it happens I can't do anything with the array (stop, run, etc.) and can
> > only reboot the server to "fix" that
> >
> > 2) mdadm -D shows nonsensical device size after assembly attempt (Used
> > Dev Size : 18446744073709551615)
> >
> > 3) the journal device (which itself is md raid1 consisting of 2 ssds)
> > assembles, checks (0 mismatch_cnt) fine - and overall looks ok.
> >
> >
> >  From other interesting things, I also attempted to assemble the raid
> > with snapshotted journal. From what I can see it does attempt to do
> > something, judging from:
> >
> > dmsetup status:
> >
> > snap_jo2: 0 536870912 snapshot 40/33554432 16
> > snap_sdi1: 0 7812500000 snapshot 25768/83886080 112
> > snap_jo1: 0 536870912 snapshot 40/33554432 16
> > snap_sdg1: 0 7812500000 snapshot 25456/83886080 112
> > snap_sdj1: 0 7812500000 snapshot 25928/83886080 112
> > snap_sdh1: 0 7812500000 snapshot 25352/83886080 112
> >
> > But it doesn't move from those values (with mdadm doing nothing eating
> > 100% cpu as mentioned earlier).
> >
> >
> > Any suggestions on how to proceed would be very appreciated.
>
>
> I've added Song to the CC, in case you have any suggestions on how to
> proceed/debug this (mdadm is stuck somewhere in the kernel as far as I
> can see, while attempting to assemble the array).
>
> For the record, I can assemble the raid successfully w/o journal (using
> snapshotted component devices as above), and we did recover some stuff
> this way from some filesystems - but for some other ones I'd like to
> keep that option as the very last resort.

Sorry for delayed response.

A few questions.

For these two outputs:
#1
               Name : xs22:r5_big  (local to host xs22)
               UUID : d5995d76:67d7fabd:05392f87:25a91a97
             Events : 56283

     Number   Major   Minor   RaidDevice State
        -       0        0        0      removed
        -       0        0        1      removed
        -       0        0        2      removed
        -       0        0        3      removed

        -       8      145        3      sync   /dev/sdj1
        -       8      129        2      sync   /dev/sdi1
        -       9      127        -      spare   /dev/md/xs22:r1_journal_big
        -       8      113        1      sync   /dev/sdh1
        -       8       97        0      sync   /dev/sdg1

#2
/dev/md/r1_journal_big:
           Magic : a92b4efc
         Version : 1.1
     Feature Map : 0x200
      Array UUID : d5995d76:67d7fabd:05392f87:25a91a97
            Name : xs22:r5_big  (local to host xs22)
   Creation Time : Tue Mar  5 19:28:58 2019
      Raid Level : raid5
    Raid Devices : 4

  Avail Dev Size : 536344576 (255.75 GiB 274.61 GB)
      Array Size : 11718355968 (11175.50 GiB 11999.60 GB)
   Used Dev Size : 7812237312 (3725.17 GiB 3999.87 GB)
     Data Offset : 262144 sectors
    Super Offset : 0 sectors
    Unused Space : before=261872 sectors, after=0 sectors
           State : clean
     Device UUID : c3a6f2f6:7dd26b0c:08a31ad7:cc8ed2a9

     Update Time : Sat May  9 15:05:22 2020
   Bad Block Log : 512 entries available at offset 264 sectors
        Checksum : c854904f - correct
          Events : 56289

          Layout : left-symmetric
      Chunk Size : 512K

    Device Role : Journal
    Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)


Are these captured back to back? I am asking because they showed different
"Events" number.

Also, when mdadm -A hangs, could you please capture /proc/$(pidof mdadm)/stack ?

18446744073709551615 is 0xffffffffffffffffL, so it is not initialized
by data from the disk.
I suspect we hang somewhere before this value is initialized.

Thanks,
Song


* Re: Assembling journaled array fails
  2020-05-12  1:27     ` Song Liu
@ 2020-05-13 10:58       ` Michal Soltys
  2020-05-13 16:17         ` Song Liu
  0 siblings, 1 reply; 46+ messages in thread
From: Michal Soltys @ 2020-05-13 10:58 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-raid

On 5/12/20 3:27 AM, Song Liu wrote:
> On Mon, May 11, 2020 at 4:13 AM Michal Soltys <msoltyspl@yandex.pl> wrote:
>>
>> On 5/10/20 1:57 AM, Michal Soltys wrote:
>>> Anyway, I did some tests with manually snapshotted component devices
>>> (using dm snapshot target to not touch underlying devices).
>>>
>>> The raid manages to force assemble in read-only mode with missing
>>> journal device, so we probably will be able to recover most data
>>> underneath this way (as a last resort).
>>>
>>> The situation I'm in now is likely from an unclean shutdown after all (why
>>> the machine failed to react to ups properly is another subject).
>>>
>>> I'd still want to find out why the journal device is - apparently - the
>>> one giving issues (contrary to what I'd expect it to do ...), in particular:
>>>
>>> 1) mdadm hangs (unkillable, so I presume in kernel somewhere) and eats 1
>>> cpu when trying to assemble the raid with journal device present; once
>>> it happens I can't do anything with the array (stop, run, etc.) and can
>>> only reboot the server to "fix" that
>>>
>>> 2) mdadm -D shows nonsensical device size after assembly attempt (Used
>>> Dev Size : 18446744073709551615)
>>>
>>> 3) the journal device (which itself is md raid1 consisting of 2 ssds)
>>> assembles, checks (0 mismatch_cnt) fine - and overall looks ok.
>>>
>>>
>>>   From other interesting things, I also attempted to assemble the raid
>>> with snapshotted journal. From what I can see it does attempt to do
>>> something, judging from:
>>>
>>> dmsetup status:
>>>
>>> snap_jo2: 0 536870912 snapshot 40/33554432 16
>>> snap_sdi1: 0 7812500000 snapshot 25768/83886080 112
>>> snap_jo1: 0 536870912 snapshot 40/33554432 16
>>> snap_sdg1: 0 7812500000 snapshot 25456/83886080 112
>>> snap_sdj1: 0 7812500000 snapshot 25928/83886080 112
>>> snap_sdh1: 0 7812500000 snapshot 25352/83886080 112
>>>
>>> But it doesn't move from those values (with mdadm doing nothing eating
>>> 100% cpu as mentioned earlier).
>>>
>>>
>>> Any suggestions on how to proceed would be very appreciated.
>>
>>
>> I've added Song to the CC, in case you have any suggestions on how to
>> proceed/debug this (mdadm is stuck somewhere in the kernel as far as I
>> can see, while attempting to assemble the array).
>>
>> For the record, I can assemble the raid successfully w/o journal (using
>> snapshotted component devices as above), and we did recover some stuff
>> this way from some filesystems - but for some other ones I'd like to
>> keep that option as the very last resort.
> 
> Sorry for delayed response.
> 
> A few questions.
> 
> For these two outputs:
> #1
>                 Name : xs22:r5_big  (local to host xs22)
>                 UUID : d5995d76:67d7fabd:05392f87:25a91a97
>               Events : 56283
> 
>       Number   Major   Minor   RaidDevice State
>          -       0        0        0      removed
>          -       0        0        1      removed
>          -       0        0        2      removed
>          -       0        0        3      removed
> 
>          -       8      145        3      sync   /dev/sdj1
>          -       8      129        2      sync   /dev/sdi1
>          -       9      127        -      spare   /dev/md/xs22:r1_journal_big
>          -       8      113        1      sync   /dev/sdh1
>          -       8       97        0      sync   /dev/sdg1
> 
> #2
> /dev/md/r1_journal_big:
>             Magic : a92b4efc
>           Version : 1.1
>       Feature Map : 0x200
>        Array UUID : d5995d76:67d7fabd:05392f87:25a91a97
>              Name : xs22:r5_big  (local to host xs22)
>     Creation Time : Tue Mar  5 19:28:58 2019
>        Raid Level : raid5
>      Raid Devices : 4
> 
>    Avail Dev Size : 536344576 (255.75 GiB 274.61 GB)
>        Array Size : 11718355968 (11175.50 GiB 11999.60 GB)
>     Used Dev Size : 7812237312 (3725.17 GiB 3999.87 GB)
>       Data Offset : 262144 sectors
>      Super Offset : 0 sectors
>      Unused Space : before=261872 sectors, after=0 sectors
>             State : clean
>       Device UUID : c3a6f2f6:7dd26b0c:08a31ad7:cc8ed2a9
> 
>       Update Time : Sat May  9 15:05:22 2020
>     Bad Block Log : 512 entries available at offset 264 sectors
>          Checksum : c854904f - correct
>            Events : 56289
> 
>            Layout : left-symmetric
>        Chunk Size : 512K
> 
>      Device Role : Journal
>      Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)
> 
> 
> Are these captured back to back? I am asking because they showed different
> "Events" number.

Nah, they were captured between reboots. Back to back all event fields show identical value (at 56291 now).

> 
> Also, when mdadm -A hangs, could you please capture /proc/$(pidof mdadm)/stack ?
> 

The output is empty:

xs22:/☠ ps -eF fww | grep mdadm
root     10332  9362 97   740  1884  25 12:47 pts/1    R+     6:59  |   \_ mdadm -A /dev/md/r5_big /dev/md/r1_journal_big /dev/sdi1 /dev/sdg1 /dev/sdj1 /dev/sdh1
xs22:/☠ cd /proc/10332
xs22:/proc/10332☠ cat stack
xs22:/proc/10332☠ 


> 18446744073709551615 is 0xffffffffffffffffL, so it is not initialized
> by data from the disk.
> I suspect we hang somewhere before this value is initialized.
> 


* Re: Assembling journaled array fails
  2020-05-13 10:58       ` Michal Soltys
@ 2020-05-13 16:17         ` Song Liu
  2020-05-14 11:07           ` Michal Soltys
  0 siblings, 1 reply; 46+ messages in thread
From: Song Liu @ 2020-05-13 16:17 UTC (permalink / raw)
  To: Michal Soltys; +Cc: linux-raid

> >
> > Are these captured back to back? I am asking because they showed different
> > "Events" number.
>
> Nah, they were captured between reboots. Back to back all event fields show identical value (at 56291 now).
>
> >
> > Also, when mdadm -A hangs, could you please capture /proc/$(pidof mdadm)/stack ?
> >
>
> The output is empty:
>
> xs22:/☠ ps -eF fww | grep mdadm
> root     10332  9362 97   740  1884  25 12:47 pts/1    R+     6:59  |   \_ mdadm -A /dev/md/r5_big /dev/md/r1_journal_big /dev/sdi1 /dev/sdg1 /dev/sdj1 /dev/sdh1
> xs22:/☠ cd /proc/10332
> xs22:/proc/10332☠ cat stack
> xs22:/proc/10332☠

Hmm... Could you please share the strace output of "mdadm -A" command? Like

strace mdadm -A /dev/md/r5_big /dev/md/r1_journal_big /dev/xxx ...

Thanks,
Song


* Re: Assembling journaled array fails
  2020-05-13 16:17         ` Song Liu
@ 2020-05-14 11:07           ` Michal Soltys
  2020-05-16  0:15             ` Song Liu
  2020-06-26  8:28             ` Guoqing Jiang
  0 siblings, 2 replies; 46+ messages in thread
From: Michal Soltys @ 2020-05-14 11:07 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 2389 bytes --]

On 5/13/20 6:17 PM, Song Liu wrote:
>>>
>>> Are these captured back to back? I am asking because they showed different
>>> "Events" number.
>>
>> Nah, they were captured between reboots. Back to back all event fields show identical value (at 56291 now).
>>
>>>
>>> Also, when mdadm -A hangs, could you please capture /proc/$(pidof mdadm)/stack ?
>>>
>>
>> The output is empty:
>>
>> xs22:/☠ ps -eF fww | grep mdadm
>> root     10332  9362 97   740  1884  25 12:47 pts/1    R+     6:59  |   \_ mdadm -A /dev/md/r5_big /dev/md/r1_journal_big /dev/sdi1 /dev/sdg1 /dev/sdj1 /dev/sdh1
>> xs22:/☠ cd /proc/10332
>> xs22:/proc/10332☠ cat stack
>> xs22:/proc/10332☠
> 
> Hmm... Could you please share the strace output of "mdadm -A" command? Like
> 
> strace mdadm -A /dev/md/r5_big /dev/md/r1_journal_big /dev/xxx ...
> 
> Thanks,
> Song

I did one more thing - I ran cat on that stack file in another terminal 
every 0.1 seconds, and this is what it caught:

cat: /proc//stack: No such file or directory
cat: /proc//stack: No such file or directory
cat: /proc//stack: No such file or directory
[<0>] submit_bio_wait+0x5b/0x80
[<0>] r5l_recovery_read_page+0x1b6/0x200 [raid456]
[<0>] r5l_recovery_verify_data_checksum+0x19/0x70 [raid456]
[<0>] r5l_start+0x99e/0xe70 [raid456]
[<0>] md_start.part.48+0x2e/0x50 [md_mod]
[<0>] do_md_run+0x64/0x100 [md_mod]
[<0>] md_ioctl+0xe7d/0x17d0 [md_mod]
[<0>] blkdev_ioctl+0x4d0/0xa00
[<0>] block_ioctl+0x39/0x40
[<0>] do_vfs_ioctl+0xa4/0x630
[<0>] ksys_ioctl+0x60/0x90
[<0>] __x64_sys_ioctl+0x16/0x20
[<0>] do_syscall_64+0x52/0x160
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
cat: /proc//stack: No such file or directory
[<0>] submit_bio_wait+0x5b/0x80
[<0>] r5l_recovery_read_page+0x1b6/0x200 [raid456]
[<0>] r5l_recovery_verify_data_checksum+0x19/0x70 [raid456]
[<0>] r5l_start+0x99e/0xe70 [raid456]
[<0>] md_start.part.48+0x2e/0x50 [md_mod]
[<0>] do_md_run+0x64/0x100 [md_mod]
[<0>] md_ioctl+0xe7d/0x17d0 [md_mod]
[<0>] blkdev_ioctl+0x4d0/0xa00
[<0>] block_ioctl+0x39/0x40
[<0>] do_vfs_ioctl+0xa4/0x630
[<0>] ksys_ioctl+0x60/0x90
[<0>] __x64_sys_ioctl+0x16/0x20
[<0>] do_syscall_64+0x52/0x160
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
cat: /proc//stack: No such file or directory
cat: /proc//stack: No such file or directory
cat: /proc//stack: No such file or directory
cat: /proc//stack: No such file or directory
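
For the record, the capture loop was nothing fancy - roughly the 
following (a sketch; the exact interval and PID lookup are from memory):

while true; do
    cat /proc/$(pidof mdadm)/stack
    sleep 0.1
done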


Strace output attached.

Thanks.

[-- Attachment #2: mdadm.strace.log --]
[-- Type: text/x-log, Size: 46654 bytes --]

3650  execve("/sbin/mdadm", ["mdadm", "-A", "/dev/md/r5_big", "/dev/md/r1_journal_big", "/dev/sdi1", "/dev/sdg1", "/dev/sdj1", "/dev/sdh1"], 0x7ffea6946570 /* 39 vars */) = 0
3650  brk(NULL)                         = 0x55885c8e2000
3650  access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
3650  openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
3650  fstat(3, {st_mode=S_IFREG|0644, st_size=63242, ...}) = 0
3650  mmap(NULL, 63242, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fbb17321000
3650  close(3)                          = 0
3650  openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3
3650  read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0000\21\0\0\0\0\0\0"..., 832) = 832
3650  fstat(3, {st_mode=S_IFREG|0644, st_size=14592, ...}) = 0
3650  mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbb1731f000
3650  mmap(NULL, 16656, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fbb1731a000
3650  mmap(0x7fbb1731b000, 4096, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1000) = 0x7fbb1731b000
3650  mmap(0x7fbb1731c000, 4096, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2000) = 0x7fbb1731c000
3650  mmap(0x7fbb1731d000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2000) = 0x7fbb1731d000
3650  close(3)                          = 0
3650  openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
3650  read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\260A\2\0\0\0\0\0"..., 832) = 832
3650  fstat(3, {st_mode=S_IFREG|0755, st_size=1824496, ...}) = 0
3650  mmap(NULL, 1837056, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fbb17159000
3650  mprotect(0x7fbb1717b000, 1658880, PROT_NONE) = 0
3650  mmap(0x7fbb1717b000, 1343488, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x22000) = 0x7fbb1717b000
3650  mmap(0x7fbb172c3000, 311296, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x16a000) = 0x7fbb172c3000
3650  mmap(0x7fbb17310000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1b6000) = 0x7fbb17310000
3650  mmap(0x7fbb17316000, 14336, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fbb17316000
3650  close(3)                          = 0
3650  mmap(NULL, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbb17156000
3650  arch_prctl(ARCH_SET_FS, 0x7fbb17156740) = 0
3650  mprotect(0x7fbb17310000, 16384, PROT_READ) = 0
3650  mprotect(0x7fbb1731d000, 4096, PROT_READ) = 0
3650  mprotect(0x55885b2e6000, 4096, PROT_READ) = 0
3650  mprotect(0x7fbb17358000, 4096, PROT_READ) = 0
3650  munmap(0x7fbb17321000, 63242)     = 0
3650  getpid()                          = 3650
3650  uname({sysname="Linux", nodename="xs22", ...}) = 0
3650  brk(NULL)                         = 0x55885c8e2000
3650  brk(0x55885c903000)               = 0x55885c903000
3650  openat(AT_FDCWD, "/dev/md/r5_big", O_RDONLY) = -1 ENOENT (No such file or directory)
3650  openat(AT_FDCWD, "/etc/mdadm/mdadm.conf", O_RDONLY) = 3
3650  fstat(3, {st_mode=S_IFREG|0644, st_size=1729, ...}) = 0
3650  fstat(3, {st_mode=S_IFREG|0644, st_size=1729, ...}) = 0
3650  read(3, "# mdadm.conf\n#\n# Please refer to"..., 4096) = 1729
3650  read(3, "", 4096)                 = 0
3650  fstat(3, {st_mode=S_IFREG|0644, st_size=1729, ...}) = 0
3650  close(3)                          = 0
3650  openat(AT_FDCWD, "/etc/mdadm/mdadm.conf.d", O_RDONLY) = -1 ENOENT (No such file or directory)
3650  uname({sysname="Linux", nodename="xs22", ...}) = 0
3650  openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
3650  fstat(3, {st_mode=S_IFREG|0644, st_size=63242, ...}) = 0
3650  mmap(NULL, 63242, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fbb17321000
3650  close(3)                          = 0
3650  openat(AT_FDCWD, "/lib/x86_64-linux-gnu/tls/x86_64/x86_64/libdlm_lt.so.3", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
3650  stat("/lib/x86_64-linux-gnu/tls/x86_64/x86_64", 0x7ffc70457820) = -1 ENOENT (No such file or directory)
3650  openat(AT_FDCWD, "/lib/x86_64-linux-gnu/tls/x86_64/libdlm_lt.so.3", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
3650  stat("/lib/x86_64-linux-gnu/tls/x86_64", 0x7ffc70457820) = -1 ENOENT (No such file or directory)
3650  openat(AT_FDCWD, "/lib/x86_64-linux-gnu/tls/x86_64/libdlm_lt.so.3", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
3650  stat("/lib/x86_64-linux-gnu/tls/x86_64", 0x7ffc70457820) = -1 ENOENT (No such file or directory)
3650  openat(AT_FDCWD, "/lib/x86_64-linux-gnu/tls/libdlm_lt.so.3", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
3650  stat("/lib/x86_64-linux-gnu/tls", 0x7ffc70457820) = -1 ENOENT (No such file or directory)
3650  openat(AT_FDCWD, "/lib/x86_64-linux-gnu/x86_64/x86_64/libdlm_lt.so.3", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
3650  stat("/lib/x86_64-linux-gnu/x86_64/x86_64", 0x7ffc70457820) = -1 ENOENT (No such file or directory)
3650  openat(AT_FDCWD, "/lib/x86_64-linux-gnu/x86_64/libdlm_lt.so.3", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
3650  stat("/lib/x86_64-linux-gnu/x86_64", 0x7ffc70457820) = -1 ENOENT (No such file or directory)
3650  openat(AT_FDCWD, "/lib/x86_64-linux-gnu/x86_64/libdlm_lt.so.3", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
3650  stat("/lib/x86_64-linux-gnu/x86_64", 0x7ffc70457820) = -1 ENOENT (No such file or directory)
3650  openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libdlm_lt.so.3", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
3650  stat("/lib/x86_64-linux-gnu", {st_mode=S_IFDIR|0755, st_size=6218, ...}) = 0
3650  openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/tls/x86_64/x86_64/libdlm_lt.so.3", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
3650  stat("/usr/lib/x86_64-linux-gnu/tls/x86_64/x86_64", 0x7ffc70457820) = -1 ENOENT (No such file or directory)
3650  openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/tls/x86_64/libdlm_lt.so.3", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
3650  stat("/usr/lib/x86_64-linux-gnu/tls/x86_64", 0x7ffc70457820) = -1 ENOENT (No such file or directory)
3650  openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/tls/x86_64/libdlm_lt.so.3", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
3650  stat("/usr/lib/x86_64-linux-gnu/tls/x86_64", 0x7ffc70457820) = -1 ENOENT (No such file or directory)
3650  openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/tls/libdlm_lt.so.3", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
3650  stat("/usr/lib/x86_64-linux-gnu/tls", 0x7ffc70457820) = -1 ENOENT (No such file or directory)
3650  openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/x86_64/x86_64/libdlm_lt.so.3", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
3650  stat("/usr/lib/x86_64-linux-gnu/x86_64/x86_64", 0x7ffc70457820) = -1 ENOENT (No such file or directory)
3650  openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/x86_64/libdlm_lt.so.3", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
3650  stat("/usr/lib/x86_64-linux-gnu/x86_64", 0x7ffc70457820) = -1 ENOENT (No such file or directory)
3650  openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/x86_64/libdlm_lt.so.3", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
3650  stat("/usr/lib/x86_64-linux-gnu/x86_64", 0x7ffc70457820) = -1 ENOENT (No such file or directory)
3650  openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/libdlm_lt.so.3", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
3650  stat("/usr/lib/x86_64-linux-gnu", {st_mode=S_IFDIR|0755, st_size=38826, ...}) = 0
3650  openat(AT_FDCWD, "/lib/tls/x86_64/x86_64/libdlm_lt.so.3", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
3650  stat("/lib/tls/x86_64/x86_64", 0x7ffc70457820) = -1 ENOENT (No such file or directory)
3650  openat(AT_FDCWD, "/lib/tls/x86_64/libdlm_lt.so.3", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
3650  stat("/lib/tls/x86_64", 0x7ffc70457820) = -1 ENOENT (No such file or directory)
3650  openat(AT_FDCWD, "/lib/tls/x86_64/libdlm_lt.so.3", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
3650  stat("/lib/tls/x86_64", 0x7ffc70457820) = -1 ENOENT (No such file or directory)
3650  openat(AT_FDCWD, "/lib/tls/libdlm_lt.so.3", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
3650  stat("/lib/tls", 0x7ffc70457820)  = -1 ENOENT (No such file or directory)
3650  openat(AT_FDCWD, "/lib/x86_64/x86_64/libdlm_lt.so.3", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
3650  stat("/lib/x86_64/x86_64", 0x7ffc70457820) = -1 ENOENT (No such file or directory)
3650  openat(AT_FDCWD, "/lib/x86_64/libdlm_lt.so.3", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
3650  stat("/lib/x86_64", 0x7ffc70457820) = -1 ENOENT (No such file or directory)
3650  openat(AT_FDCWD, "/lib/x86_64/libdlm_lt.so.3", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
3650  stat("/lib/x86_64", 0x7ffc70457820) = -1 ENOENT (No such file or directory)
3650  openat(AT_FDCWD, "/lib/libdlm_lt.so.3", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
3650  stat("/lib", {st_mode=S_IFDIR|0755, st_size=326, ...}) = 0
3650  openat(AT_FDCWD, "/usr/lib/tls/x86_64/x86_64/libdlm_lt.so.3", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
3650  stat("/usr/lib/tls/x86_64/x86_64", 0x7ffc70457820) = -1 ENOENT (No such file or directory)
3650  openat(AT_FDCWD, "/usr/lib/tls/x86_64/libdlm_lt.so.3", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
3650  stat("/usr/lib/tls/x86_64", 0x7ffc70457820) = -1 ENOENT (No such file or directory)
3650  openat(AT_FDCWD, "/usr/lib/tls/x86_64/libdlm_lt.so.3", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
3650  stat("/usr/lib/tls/x86_64", 0x7ffc70457820) = -1 ENOENT (No such file or directory)
3650  openat(AT_FDCWD, "/usr/lib/tls/libdlm_lt.so.3", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
3650  stat("/usr/lib/tls", 0x7ffc70457820) = -1 ENOENT (No such file or directory)
3650  openat(AT_FDCWD, "/usr/lib/x86_64/x86_64/libdlm_lt.so.3", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
3650  stat("/usr/lib/x86_64/x86_64", 0x7ffc70457820) = -1 ENOENT (No such file or directory)
3650  openat(AT_FDCWD, "/usr/lib/x86_64/libdlm_lt.so.3", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
3650  stat("/usr/lib/x86_64", 0x7ffc70457820) = -1 ENOENT (No such file or directory)
3650  openat(AT_FDCWD, "/usr/lib/x86_64/libdlm_lt.so.3", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
3650  stat("/usr/lib/x86_64", 0x7ffc70457820) = -1 ENOENT (No such file or directory)
3650  openat(AT_FDCWD, "/usr/lib/libdlm_lt.so.3", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
3650  stat("/usr/lib", {st_mode=S_IFDIR|0755, st_size=1262, ...}) = 0
3650  munmap(0x7fbb17321000, 63242)     = 0
3650  openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
3650  fstat(3, {st_mode=S_IFREG|0644, st_size=63242, ...}) = 0
3650  mmap(NULL, 63242, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fbb17321000
3650  close(3)                          = 0
3650  openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libcmap.so.4", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
3650  openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/libcmap.so.4", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
3650  openat(AT_FDCWD, "/lib/libcmap.so.4", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
3650  openat(AT_FDCWD, "/usr/lib/libcmap.so.4", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
3650  munmap(0x7fbb17321000, 63242)     = 0
3650  geteuid()                         = 0
3650  openat(AT_FDCWD, "/dev/md/r1_journal_big", O_RDONLY|O_DIRECT) = 3
3650  fstat(3, {st_mode=S_IFBLK|0660, st_rdev=makedev(0x9, 0x7d), ...}) = 0
3650  fstat(3, {st_mode=S_IFBLK|0660, st_rdev=makedev(0x9, 0x7d), ...}) = 0
3650  readlink("/sys/dev/block/9:125", "../../devices/virtual/block/md12"..., 199) = 33
3650  stat("/sys/block/md125/md", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
3650  openat(AT_FDCWD, "/sys/block/md125/md/metadata_version", O_RDONLY) = 4
3650  read(4, "1.1\n", 4096)            = 4
3650  close(4)                          = 0
3650  fstat(3, {st_mode=S_IFBLK|0660, st_rdev=makedev(0x9, 0x7d), ...}) = 0
3650  ioctl(3, BLKGETSIZE64, [274742640640]) = 0
3650  fstat(3, {st_mode=S_IFBLK|0660, st_rdev=makedev(0x9, 0x7d), ...}) = 0
3650  ioctl(3, BLKGETSIZE64, [274742640640]) = 0
3650  lseek(3, 274742575104, SEEK_SET)  = 274742575104
3650  read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
3650  ioctl(3, BLKSSZGET, [512])        = 0
3650  ioctl(3, BLKSSZGET, [512])        = 0
3650  fstat(3, {st_mode=S_IFBLK|0660, st_rdev=makedev(0x9, 0x7d), ...}) = 0
3650  ioctl(3, BLKGETSIZE64, [274742640640]) = 0
3650  lseek(3, 274742632448, SEEK_SET)  = 274742632448
3650  read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
3650  lseek(3, 0, SEEK_CUR)             = 274742636544
3650  ioctl(3, BLKSSZGET, [512])        = 0
3650  fstat(3, {st_mode=S_IFBLK|0660, st_rdev=makedev(0x9, 0x7d), ...}) = 0
3650  ioctl(3, BLKGETSIZE64, [274742640640]) = 0
3650  lseek(3, 0, SEEK_SET)             = 0
3650  read(3, "\374N+\251\1\0\0\0\0\2\0\0\0\0\0\0\325\231]vg\327\372\275\59/\207%\251\32\227"..., 4096) = 4096
3650  lseek(3, 0, SEEK_CUR)             = 4096
3650  ioctl(3, BLKSSZGET, [512])        = 0
3650  fstat(3, {st_mode=S_IFBLK|0660, st_rdev=makedev(0x9, 0x7d), ...}) = 0
3650  ioctl(3, BLKGETSIZE64, [274742640640]) = 0
3650  lseek(3, 4096, SEEK_SET)          = 4096
3650  read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
3650  lseek(3, 0, SEEK_CUR)             = 8192
3650  ioctl(3, BLKSSZGET, [512])        = 0
3650  fstat(3, {st_mode=S_IFBLK|0660, st_rdev=makedev(0x9, 0x7d), ...}) = 0
3650  ioctl(3, BLKGETSIZE64, [274742640640]) = 0
3650  lseek(3, 0, SEEK_SET)             = 0
3650  read(3, "\374N+\251\1\0\0\0\0\2\0\0\0\0\0\0\325\231]vg\327\372\275\59/\207%\251\32\227"..., 4096) = 4096
3650  lseek(3, 0, SEEK_CUR)             = 4096
3650  fstat(3, {st_mode=S_IFBLK|0660, st_rdev=makedev(0x9, 0x7d), ...}) = 0
3650  ioctl(3, BLKGETSIZE64, [274742640640]) = 0
3650  ioctl(3, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=1073741824, devname="", volname=""}}) = -1 ENXIO (No such device or address)
3650  fstat(3, {st_mode=S_IFBLK|0660, st_rdev=makedev(0x9, 0x7d), ...}) = 0
3650  ioctl(3, BLKGETSIZE64, [274742640640]) = 0
3650  lseek(3, 274742640128, SEEK_SET)  = 274742640128
3650  read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512) = 512
3650  ioctl(3, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=1073741824, devname="", volname=""}}) = -1 ENXIO (No such device or address)
3650  ioctl(3, BLKSSZGET, [512])        = 0
3650  ioctl(3, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=1073741824, devname="", volname=""}}) = -1 ENXIO (No such device or address)
3650  openat(AT_FDCWD, "/sys/bus/pci/drivers/isci", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = -1 ENOENT (No such file or directory)
3650  openat(AT_FDCWD, "/sys/bus/pci/drivers/ahci", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 4
3650  fstat(4, {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
3650  getdents64(4, /* 9 entries */, 32768) = 264
3650  readlink("/sys/bus/pci/drivers/ahci/./subsystem", 0x7ffc70456530, 4096) = -1 ENOENT (No such file or directory)
3650  readlink("/sys/bus/pci/drivers/ahci/../subsystem", 0x7ffc70456530, 4096) = -1 ENOENT (No such file or directory)
3650  readlink("/sys/bus/pci/drivers/ahci/uevent/subsystem", 0x7ffc70456530, 4096) = -1 ENOTDIR (Not a directory)
3650  readlink("/sys/bus/pci/drivers/ahci/0000:00:1f.2/subsystem", "../../../bus/pci", 4096) = 16
3650  openat(AT_FDCWD, "/sys/bus/pci/drivers/ahci/0000:00:1f.2/vendor", O_RDONLY) = 5
3650  read(5, "0x8086\n", 7)            = 7
3650  close(5)                          = 0
3650  openat(AT_FDCWD, "/sys/bus/pci/drivers/ahci/0000:00:1f.2/device", O_RDONLY) = 5
3650  lseek(5, 0, SEEK_SET)             = 0
3650  read(5, "0x1d02\n", 50)           = 7
3650  close(5)                          = 0
3650  openat(AT_FDCWD, "/sys/bus/pci/drivers/ahci/0000:00:1f.2/class", O_RDONLY) = 5
3650  lseek(5, 0, SEEK_SET)             = 0
3650  read(5, "0x010601\n", 50)         = 9
3650  close(5)                          = 0
3650  lstat("/sys", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
3650  lstat("/sys/bus", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
3650  lstat("/sys/bus/pci", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
3650  lstat("/sys/bus/pci/drivers", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
3650  lstat("/sys/bus/pci/drivers/ahci", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
3650  lstat("/sys/bus/pci/drivers/ahci/0000:00:1f.2", {st_mode=S_IFLNK|0777, st_size=0, ...}) = 0
3650  readlink("/sys/bus/pci/drivers/ahci/0000:00:1f.2", "../../../../devices/pci0000:00/0"..., 4095) = 43
3650  lstat("/sys/devices", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
3650  lstat("/sys/devices/pci0000:00", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
3650  lstat("/sys/devices/pci0000:00/0000:00:1f.2", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
3650  readlink("/sys/bus/pci/drivers/ahci/bind/subsystem", 0x7ffc70456530, 4096) = -1 ENOTDIR (Not a directory)
3650  readlink("/sys/bus/pci/drivers/ahci/new_id/subsystem", 0x7ffc70456530, 4096) = -1 ENOTDIR (Not a directory)
3650  readlink("/sys/bus/pci/drivers/ahci/remove_id/subsystem", 0x7ffc70456530, 4096) = -1 ENOTDIR (Not a directory)
3650  readlink("/sys/bus/pci/drivers/ahci/unbind/subsystem", 0x7ffc70456530, 4096) = -1 ENOTDIR (Not a directory)
3650  readlink("/sys/bus/pci/drivers/ahci/module/subsystem", 0x7ffc70456530, 4096) = -1 ENOENT (No such file or directory)
3650  getdents64(4, /* 0 entries */, 32768) = 0
3650  close(4)                          = 0
3650  openat(AT_FDCWD, "/sys/bus/pci/drivers/vmd", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = -1 ENOENT (No such file or directory)
3650  openat(AT_FDCWD, "/sys/bus/pci/drivers/nvme", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = -1 ENOENT (No such file or directory)
3650  fstat(3, {st_mode=S_IFBLK|0660, st_rdev=makedev(0x9, 0x7d), ...}) = 0
3650  lstat("/sys", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
3650  lstat("/sys/dev", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
3650  lstat("/sys/dev/block", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
3650  lstat("/sys/dev/block/9:125", {st_mode=S_IFLNK|0777, st_size=0, ...}) = 0
3650  readlink("/sys/dev/block/9:125", "../../devices/virtual/block/md12"..., 4095) = 33
3650  lstat("/sys/devices", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
3650  lstat("/sys/devices/virtual", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
3650  lstat("/sys/devices/virtual/block", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
3650  lstat("/sys/devices/virtual/block/md125", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
3650  lstat("/sys/devices/virtual/block/md125/device", 0x7ffc704573c0) = -1 ENOENT (No such file or directory)
3650  fstat(3, {st_mode=S_IFBLK|0660, st_rdev=makedev(0x9, 0x7d), ...}) = 0
3650  ioctl(3, BLKGETSIZE64, [274742640640]) = 0
3650  lseek(3, 274742639616, SEEK_SET)  = 274742639616
3650  read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512) = 512
3650  lseek(3, 0, SEEK_SET)             = 0
3650  read(3, "\374N+\251\1\0\0\0\0\2\0\0\0\0\0\0\325\231]vg\327\372\275\59/\207%\251\32\227"..., 512) = 512
3650  ioctl(3, BLKSSZGET, [512])        = 0
3650  lseek(3, 0, SEEK_SET)             = 0
3650  read(3, "\374N+\251\1\0\0\0\0\2\0\0\0\0\0\0\325\231]vg\327\372\275\59/\207%\251\32\227"..., 512) = 512
3650  ioctl(3, BLKSSZGET, [512])        = 0
3650  ioctl(3, BLKSSZGET, [512])        = 0
3650  fstat(3, {st_mode=S_IFBLK|0660, st_rdev=makedev(0x9, 0x7d), ...}) = 0
3650  ioctl(3, BLKGETSIZE64, [274742640640]) = 0
3650  lseek(3, 274742632448, SEEK_SET)  = 274742632448
3650  read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
3650  lseek(3, 0, SEEK_CUR)             = 274742636544
3650  ioctl(3, BLKSSZGET, [512])        = 0
3650  fstat(3, {st_mode=S_IFBLK|0660, st_rdev=makedev(0x9, 0x7d), ...}) = 0
3650  ioctl(3, BLKGETSIZE64, [274742640640]) = 0
3650  lseek(3, 0, SEEK_SET)             = 0
3650  read(3, "\374N+\251\1\0\0\0\0\2\0\0\0\0\0\0\325\231]vg\327\372\275\59/\207%\251\32\227"..., 4096) = 4096
3650  lseek(3, 0, SEEK_CUR)             = 4096
3650  ioctl(3, BLKSSZGET, [512])        = 0
3650  fstat(3, {st_mode=S_IFBLK|0660, st_rdev=makedev(0x9, 0x7d), ...}) = 0
3650  ioctl(3, BLKGETSIZE64, [274742640640]) = 0
3650  lseek(3, 4096, SEEK_SET)          = 4096
3650  read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
3650  lseek(3, 0, SEEK_CUR)             = 8192
3650  ioctl(3, BLKSSZGET, [512])        = 0
3650  fstat(3, {st_mode=S_IFBLK|0660, st_rdev=makedev(0x9, 0x7d), ...}) = 0
3650  ioctl(3, BLKGETSIZE64, [274742640640]) = 0
3650  lseek(3, 0, SEEK_SET)             = 0
3650  read(3, "\374N+\251\1\0\0\0\0\2\0\0\0\0\0\0\325\231]vg\327\372\275\59/\207%\251\32\227"..., 4096) = 4096
3650  lseek(3, 0, SEEK_CUR)             = 4096
3650  ioctl(3, BLKSSZGET, [512])        = 0
3650  fstat(3, {st_mode=S_IFBLK|0660, st_rdev=makedev(0x9, 0x7d), ...}) = 0
3650  ioctl(3, BLKGETSIZE64, [274742640640]) = 0
3650  lseek(3, 0, SEEK_SET)             = 0
3650  read(3, "\374N+\251\1\0\0\0\0\2\0\0\0\0\0\0\325\231]vg\327\372\275\59/\207%\251\32\227"..., 4096) = 4096
3650  lseek(3, 0, SEEK_CUR)             = 4096
3650  close(3)                          = 0
3650  openat(AT_FDCWD, "/dev/md/r1_journal_big", O_RDONLY|O_EXCL|O_DIRECT) = 3
3650  close(3)                          = 0
3650  stat("/sys/dev/block/9:125/partition", 0x7ffc704575c0) = -1 ENOENT (No such file or directory)
3650  openat(AT_FDCWD, "/dev/sdi1", O_RDONLY|O_DIRECT) = 3
3650  fstat(3, {st_mode=S_IFBLK|0660, st_rdev=makedev(0x8, 0x81), ...}) = 0
3650  fstat(3, {st_mode=S_IFBLK|0660, st_rdev=makedev(0x8, 0x81), ...}) = 0
3650  readlink("/sys/dev/block/8:129", "../../devices/pci0000:00/0000:00"..., 199) = 141
3650  stat("/sys/block/sdi/md", 0x7ffc704556d0) = -1 ENOENT (No such file or directory)
3650  ioctl(3, BLKSSZGET, [512])        = 0
3650  fstat(3, {st_mode=S_IFBLK|0660, st_rdev=makedev(0x8, 0x81), ...}) = 0
3650  ioctl(3, BLKGETSIZE64, [4000000000000]) = 0
3650  lseek(3, 0, SEEK_SET)             = 0
3650  read(3, "\374N+\251\1\0\0\0\0\2\0\0\0\0\0\0\325\231]vg\327\372\275\59/\207%\251\32\227"..., 4096) = 4096
3650  lseek(3, 0, SEEK_CUR)             = 4096
3650  close(3)                          = 0
3650  openat(AT_FDCWD, "/dev/sdi1", O_RDONLY|O_EXCL|O_DIRECT) = 3
3650  close(3)                          = 0
3650  stat("/sys/dev/block/8:129/partition", {st_mode=S_IFREG|0444, st_size=4096, ...}) = 0
3650  openat(AT_FDCWD, "/dev/sdg1", O_RDONLY|O_DIRECT) = 3
3650  fstat(3, {st_mode=S_IFBLK|0660, st_rdev=makedev(0x8, 0x61), ...}) = 0
3650  fstat(3, {st_mode=S_IFBLK|0660, st_rdev=makedev(0x8, 0x61), ...}) = 0
3650  readlink("/sys/dev/block/8:97", "../../devices/pci0000:00/0000:00"..., 199) = 141
3650  stat("/sys/block/sdg/md", 0x7ffc704556d0) = -1 ENOENT (No such file or directory)
3650  ioctl(3, BLKSSZGET, [512])        = 0
3650  fstat(3, {st_mode=S_IFBLK|0660, st_rdev=makedev(0x8, 0x61), ...}) = 0
3650  ioctl(3, BLKGETSIZE64, [4000000000000]) = 0
3650  lseek(3, 0, SEEK_SET)             = 0
3650  read(3, "\374N+\251\1\0\0\0\0\2\0\0\0\0\0\0\325\231]vg\327\372\275\59/\207%\251\32\227"..., 4096) = 4096
3650  lseek(3, 0, SEEK_CUR)             = 4096
3650  close(3)                          = 0
3650  openat(AT_FDCWD, "/dev/sdg1", O_RDONLY|O_EXCL|O_DIRECT) = 3
3650  close(3)                          = 0
3650  stat("/sys/dev/block/8:97/partition", {st_mode=S_IFREG|0444, st_size=4096, ...}) = 0
3650  openat(AT_FDCWD, "/dev/sdj1", O_RDONLY|O_DIRECT) = 3
3650  fstat(3, {st_mode=S_IFBLK|0660, st_rdev=makedev(0x8, 0x91), ...}) = 0
3650  fstat(3, {st_mode=S_IFBLK|0660, st_rdev=makedev(0x8, 0x91), ...}) = 0
3650  readlink("/sys/dev/block/8:145", "../../devices/pci0000:00/0000:00"..., 199) = 141
3650  stat("/sys/block/sdj/md", 0x7ffc704556d0) = -1 ENOENT (No such file or directory)
3650  ioctl(3, BLKSSZGET, [512])        = 0
3650  fstat(3, {st_mode=S_IFBLK|0660, st_rdev=makedev(0x8, 0x91), ...}) = 0
3650  ioctl(3, BLKGETSIZE64, [4000000000000]) = 0
3650  lseek(3, 0, SEEK_SET)             = 0
3650  read(3, "\374N+\251\1\0\0\0\0\2\0\0\0\0\0\0\325\231]vg\327\372\275\59/\207%\251\32\227"..., 4096) = 4096
3650  lseek(3, 0, SEEK_CUR)             = 4096
3650  close(3)                          = 0
3650  openat(AT_FDCWD, "/dev/sdj1", O_RDONLY|O_EXCL|O_DIRECT) = 3
3650  close(3)                          = 0
3650  stat("/sys/dev/block/8:145/partition", {st_mode=S_IFREG|0444, st_size=4096, ...}) = 0
3650  openat(AT_FDCWD, "/dev/sdh1", O_RDONLY|O_DIRECT) = 3
3650  fstat(3, {st_mode=S_IFBLK|0660, st_rdev=makedev(0x8, 0x71), ...}) = 0
3650  fstat(3, {st_mode=S_IFBLK|0660, st_rdev=makedev(0x8, 0x71), ...}) = 0
3650  readlink("/sys/dev/block/8:113", "../../devices/pci0000:00/0000:00"..., 199) = 141
3650  stat("/sys/block/sdh/md", 0x7ffc704556d0) = -1 ENOENT (No such file or directory)
3650  ioctl(3, BLKSSZGET, [512])        = 0
3650  fstat(3, {st_mode=S_IFBLK|0660, st_rdev=makedev(0x8, 0x71), ...}) = 0
3650  ioctl(3, BLKGETSIZE64, [4000000000000]) = 0
3650  lseek(3, 0, SEEK_SET)             = 0
3650  read(3, "\374N+\251\1\0\0\0\0\2\0\0\0\0\0\0\325\231]vg\327\372\275\59/\207%\251\32\227"..., 4096) = 4096
3650  lseek(3, 0, SEEK_CUR)             = 4096
3650  close(3)                          = 0
3650  openat(AT_FDCWD, "/dev/sdh1", O_RDONLY|O_EXCL|O_DIRECT) = 3
3650  close(3)                          = 0
3650  stat("/sys/dev/block/8:113/partition", {st_mode=S_IFREG|0444, st_size=4096, ...}) = 0
3650  mkdir("/run/mdadm", 0755)         = -1 EEXIST (File exists)
3650  openat(AT_FDCWD, "/run/mdadm/map.lock", O_RDWR|O_CREAT|O_TRUNC, 0600) = 3
3650  fcntl(3, F_GETFL)                 = 0x8002 (flags O_RDWR|O_LARGEFILE)
3650  flock(3, LOCK_EX)                 = 0
3650  fstat(3, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
3650  openat(AT_FDCWD, "/run/mdadm/map", O_RDONLY) = 4
3650  fcntl(4, F_GETFL)                 = 0x8000 (flags O_RDONLY|O_LARGEFILE)
3650  fstat(4, {st_mode=S_IFREG|0600, st_size=191, ...}) = 0
3650  read(4, "md125 1.1 96398bc1:491fb249:928e"..., 4096) = 191
3650  read(4, "", 4096)                 = 0
3650  close(4)                          = 0
3650  stat("/dev/.udev", 0x7ffc704574d0) = -1 ENOENT (No such file or directory)
3650  stat("/run/udev", {st_mode=S_IFDIR|0755, st_size=160, ...}) = 0
3650  openat(AT_FDCWD, "/proc/mdstat", O_RDONLY) = 4
3650  fcntl(4, F_SETFD, FD_CLOEXEC)     = 0
3650  fstat(4, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
3650  read(4, "Personalities : [linear] [multip"..., 1024) = 654
3650  read(4, "", 1024)                 = 0
3650  close(4)                          = 0
3650  openat(AT_FDCWD, "/proc/mdstat", O_RDONLY) = 4
3650  fcntl(4, F_SETFD, FD_CLOEXEC)     = 0
3650  fstat(4, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
3650  read(4, "Personalities : [linear] [multip"..., 1024) = 654
3650  read(4, "", 1024)                 = 0
3650  close(4)                          = 0
3650  openat(AT_FDCWD, "/proc/mdstat", O_RDONLY) = 4
3650  fcntl(4, F_SETFD, FD_CLOEXEC)     = 0
3650  fstat(4, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
3650  read(4, "Personalities : [linear] [multip"..., 1024) = 654
3650  read(4, "", 1024)                 = 0
3650  close(4)                          = 0
3650  openat(AT_FDCWD, "/proc/mdstat", O_RDONLY) = 4
3650  fcntl(4, F_SETFD, FD_CLOEXEC)     = 0
3650  fstat(4, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
3650  read(4, "Personalities : [linear] [multip"..., 1024) = 654
3650  read(4, "", 1024)                 = 0
3650  close(4)                          = 0
3650  openat(AT_FDCWD, "/sys/block/md124/dev", O_RDONLY) = -1 ENOENT (No such file or directory)
3650  getpid()                          = 3650
3650  mknod("/dev/.tmp.md.3650:9:124", S_IFBLK|0600, makedev(0x9, 0x7c)) = 0
3650  openat(AT_FDCWD, "/dev/.tmp.md.3650:9:124", O_RDWR|O_EXCL|O_DIRECT) = 4
3650  unlink("/dev/.tmp.md.3650:9:124") = 0
3650  fstat(4, {st_mode=S_IFBLK|0600, st_rdev=makedev(0x9, 0x7c), ...}) = 0
3650  readlink("/sys/dev/block/9:124", "../../devices/virtual/block/md12"..., 199) = 33
3650  openat(AT_FDCWD, "/proc/mdstat", O_RDONLY) = 5
3650  fcntl(5, F_SETFD, FD_CLOEXEC)     = 0
3650  fstat(5, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
3650  read(5, "Personalities : [linear] [multip"..., 1024) = 654
3650  read(5, "", 1024)                 = 0
3650  close(5)                          = 0
3650  ioctl(4, STOP_ARRAY, 0)           = 0
3650  openat(AT_FDCWD, "/dev/md/r1_journal_big", O_RDWR|O_EXCL|O_DIRECT) = 5
3650  ioctl(5, BLKSSZGET, [512])        = 0
3650  fstat(5, {st_mode=S_IFBLK|0660, st_rdev=makedev(0x9, 0x7d), ...}) = 0
3650  ioctl(5, BLKGETSIZE64, [274742640640]) = 0
3650  lseek(5, 0, SEEK_SET)             = 0
3650  read(5, "\374N+\251\1\0\0\0\0\2\0\0\0\0\0\0\325\231]vg\327\372\275\59/\207%\251\32\227"..., 4096) = 4096
3650  lseek(5, 0, SEEK_CUR)             = 4096
3650  fstat(5, {st_mode=S_IFBLK|0660, st_rdev=makedev(0x9, 0x7d), ...}) = 0
3650  close(5)                          = 0
3650  openat(AT_FDCWD, "/dev/sdi1", O_RDWR|O_EXCL|O_DIRECT) = 5
3650  ioctl(5, BLKSSZGET, [512])        = 0
3650  fstat(5, {st_mode=S_IFBLK|0660, st_rdev=makedev(0x8, 0x81), ...}) = 0
3650  ioctl(5, BLKGETSIZE64, [4000000000000]) = 0
3650  lseek(5, 0, SEEK_SET)             = 0
3650  read(5, "\374N+\251\1\0\0\0\0\2\0\0\0\0\0\0\325\231]vg\327\372\275\59/\207%\251\32\227"..., 4096) = 4096
3650  lseek(5, 0, SEEK_CUR)             = 4096
3650  fstat(5, {st_mode=S_IFBLK|0660, st_rdev=makedev(0x8, 0x81), ...}) = 0
3650  close(5)                          = 0
3650  openat(AT_FDCWD, "/dev/sdg1", O_RDWR|O_EXCL|O_DIRECT) = 5
3650  ioctl(5, BLKSSZGET, [512])        = 0
3650  fstat(5, {st_mode=S_IFBLK|0660, st_rdev=makedev(0x8, 0x61), ...}) = 0
3650  ioctl(5, BLKGETSIZE64, [4000000000000]) = 0
3650  lseek(5, 0, SEEK_SET)             = 0
3650  read(5, "\374N+\251\1\0\0\0\0\2\0\0\0\0\0\0\325\231]vg\327\372\275\59/\207%\251\32\227"..., 4096) = 4096
3650  lseek(5, 0, SEEK_CUR)             = 4096
3650  fstat(5, {st_mode=S_IFBLK|0660, st_rdev=makedev(0x8, 0x61), ...}) = 0
3650  close(5)                          = 0
3650  openat(AT_FDCWD, "/dev/sdj1", O_RDWR|O_EXCL|O_DIRECT) = 5
3650  ioctl(5, BLKSSZGET, [512])        = 0
3650  fstat(5, {st_mode=S_IFBLK|0660, st_rdev=makedev(0x8, 0x91), ...}) = 0
3650  ioctl(5, BLKGETSIZE64, [4000000000000]) = 0
3650  lseek(5, 0, SEEK_SET)             = 0
3650  read(5, "\374N+\251\1\0\0\0\0\2\0\0\0\0\0\0\325\231]vg\327\372\275\59/\207%\251\32\227"..., 4096) = 4096
3650  lseek(5, 0, SEEK_CUR)             = 4096
3650  fstat(5, {st_mode=S_IFBLK|0660, st_rdev=makedev(0x8, 0x91), ...}) = 0
3650  close(5)                          = 0
3650  openat(AT_FDCWD, "/dev/sdh1", O_RDWR|O_EXCL|O_DIRECT) = 5
3650  ioctl(5, BLKSSZGET, [512])        = 0
3650  fstat(5, {st_mode=S_IFBLK|0660, st_rdev=makedev(0x8, 0x71), ...}) = 0
3650  ioctl(5, BLKGETSIZE64, [4000000000000]) = 0
3650  lseek(5, 0, SEEK_SET)             = 0
3650  read(5, "\374N+\251\1\0\0\0\0\2\0\0\0\0\0\0\325\231]vg\327\372\275\59/\207%\251\32\227"..., 4096) = 4096
3650  lseek(5, 0, SEEK_CUR)             = 4096
3650  fstat(5, {st_mode=S_IFBLK|0660, st_rdev=makedev(0x8, 0x71), ...}) = 0
3650  close(5)                          = 0
3650  openat(AT_FDCWD, "/dev/sdg1", O_RDONLY|O_EXCL|O_DIRECT) = 5
3650  ioctl(5, BLKSSZGET, [512])        = 0
3650  fstat(5, {st_mode=S_IFBLK|0660, st_rdev=makedev(0x8, 0x61), ...}) = 0
3650  ioctl(5, BLKGETSIZE64, [4000000000000]) = 0
3650  lseek(5, 0, SEEK_SET)             = 0
3650  read(5, "\374N+\251\1\0\0\0\0\2\0\0\0\0\0\0\325\231]vg\327\372\275\59/\207%\251\32\227"..., 4096) = 4096
3650  lseek(5, 0, SEEK_CUR)             = 4096
3650  close(5)                          = 0
3650  fstat(4, {st_mode=S_IFBLK|0600, st_rdev=makedev(0x9, 0x7c), ...}) = 0
3650  readlink("/sys/dev/block/9:124", "../../devices/virtual/block/md12"..., 199) = 33
3650  stat("/sys/block/md124/md", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
3650  fstat(4, {st_mode=S_IFBLK|0600, st_rdev=makedev(0x9, 0x7c), ...}) = 0
3650  readlink("/sys/dev/block/9:124", "../../devices/virtual/block/md12"..., 199) = 33
3650  mkdir("/run/mdadm", 0755)         = -1 EEXIST (File exists)
3650  openat(AT_FDCWD, "/run/mdadm/map.new", O_RDWR|O_CREAT, 0600) = 5
3650  fcntl(5, F_GETFL)                 = 0x8002 (flags O_RDWR|O_LARGEFILE)
3650  fstat(5, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
3650  write(5, "md124 1.1 765d99d5:bdfad767:872f"..., 252) = 252
3650  close(5)                          = 0
3650  rename("/run/mdadm/map.new", "/run/mdadm/map") = 0
3650  ioctl(4, SET_ARRAY_INFO, 0x7ffc70457950) = 0
3650  openat(AT_FDCWD, "/dev/sdh1", O_RDWR|O_EXCL|O_DIRECT) = 5
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=0, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=1, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=2, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=3, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=4, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=5, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=6, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=7, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=8, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=9, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=10, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=11, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=12, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=13, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=14, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=15, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  close(5)                          = 0
3650  ioctl(4, ADD_NEW_DISK, 0x55885c8e7cc8) = 0
3650  openat(AT_FDCWD, "/dev/sdi1", O_RDWR|O_EXCL|O_DIRECT) = 5
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=0, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=1, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=2, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=3, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=4, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=5, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=6, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=7, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=8, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=9, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=10, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=11, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=12, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=13, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=14, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=15, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  close(5)                          = 0
3650  ioctl(4, ADD_NEW_DISK, 0x55885c8e7710) = 0
3650  openat(AT_FDCWD, "/dev/sdj1", O_RDWR|O_EXCL|O_DIRECT) = 5
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=0, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=1, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=2, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=3, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=4, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=5, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=6, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=7, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=8, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=9, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=10, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=11, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=12, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=13, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=14, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=15, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  close(5)                          = 0
3650  ioctl(4, ADD_NEW_DISK, 0x55885c8e7ae0) = 0
3650  openat(AT_FDCWD, "/dev/md/r1_journal_big", O_RDWR|O_EXCL|O_DIRECT) = 5
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=0, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=1, devname="", volname=""}}) = -1 ENXIO (No such device or address)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=2, devname="", volname=""}}) = -1 ENXIO (No such device or address)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=3, devname="", volname=""}}) = -1 ENXIO (No such device or address)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=4, devname="", volname=""}}) = -1 ENXIO (No such device or address)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=5, devname="", volname=""}}) = -1 ENXIO (No such device or address)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=6, devname="", volname=""}}) = -1 ENXIO (No such device or address)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=7, devname="", volname=""}}) = -1 ENXIO (No such device or address)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=8, devname="", volname=""}}) = -1 ENXIO (No such device or address)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=9, devname="", volname=""}}) = -1 ENXIO (No such device or address)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=10, devname="", volname=""}}) = -1 ENXIO (No such device or address)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=11, devname="", volname=""}}) = -1 ENXIO (No such device or address)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=12, devname="", volname=""}}) = -1 ENXIO (No such device or address)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=13, devname="", volname=""}}) = -1 ENXIO (No such device or address)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=14, devname="", volname=""}}) = -1 ENXIO (No such device or address)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=15, devname="", volname=""}}) = -1 ENXIO (No such device or address)
3650  close(5)                          = 0
3650  ioctl(4, ADD_NEW_DISK, 0x55885c8e7528) = 0
3650  openat(AT_FDCWD, "/dev/sdg1", O_RDWR|O_EXCL|O_DIRECT) = 5
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=0, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=1, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=2, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=3, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=4, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=5, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=6, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=7, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=8, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=9, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=10, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=11, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=12, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=13, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=14, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  ioctl(5, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, length=0, pno=15, devname="", volname=""}}) = -1 EINVAL (Invalid argument)
3650  close(5)                          = 0
3650  ioctl(4, ADD_NEW_DISK, 0x55885c8e78f8) = 0
3650  ioctl(4, RUN_ARRAY

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Assemblin journaled array fails
  2020-05-14 11:07           ` Michal Soltys
@ 2020-05-16  0:15             ` Song Liu
       [not found]               ` <d0340d7b-6b3a-4fd3-e446-5f0967132ef6@yandex.pl>
  2020-06-26  8:28             ` Guoqing Jiang
  1 sibling, 1 reply; 46+ messages in thread
From: Song Liu @ 2020-05-16  0:15 UTC (permalink / raw)
  To: Michal Soltys; +Cc: linux-raid

On Thu, May 14, 2020 at 4:07 AM Michal Soltys <msoltyspl@yandex.pl> wrote:
>
> On 5/13/20 6:17 PM, Song Liu wrote:
> >>>
> >>> Are these captured back to back? I am asking because they showed different
> >>> "Events" number.
> >>
> >> Nah, they were captured between reboots. Back to back all event fields show identical value (at 56291 now).
> >>
> >>>
> >>> Also, when mdadm -A hangs, could you please capture /proc/$(pidof mdadm)/stack ?
> >>>
> >>
> >> The output is empty:
> >>
> >> xs22:/☠ ps -eF fww | grep mdadm
> >> root     10332  9362 97   740  1884  25 12:47 pts/1    R+     6:59  |   \_ mdadm -A /dev/md/r5_big /dev/md/r1_journal_big /dev/sdi1 /dev/sdg1 /dev/sdj1 /dev/sdh1
> >> xs22:/☠ cd /proc/10332
> >> xs22:/proc/10332☠ cat stack
> >> xs22:/proc/10332☠
> >
> > Hmm... Could you please share the strace output of "mdadm -A" command? Like
> >
> > strace mdadm -A /dev/md/r5_big /dev/md/r1_journal_big /dev/xxx ...
> >
> > Thanks,
> > Song
>
> I did one more thing - ran cat on that stack in another terminal every
> 0.1 seconds, this is what was found:
>
> cat: /proc//stack: No such file or directory
> cat: /proc//stack: No such file or directory
> cat: /proc//stack: No such file or directory
> [<0>] submit_bio_wait+0x5b/0x80
> [<0>] r5l_recovery_read_page+0x1b6/0x200 [raid456]
> [<0>] r5l_recovery_verify_data_checksum+0x19/0x70 [raid456]
> [<0>] r5l_start+0x99e/0xe70 [raid456]
> [<0>] md_start.part.48+0x2e/0x50 [md_mod]
> [<0>] do_md_run+0x64/0x100 [md_mod]
> [<0>] md_ioctl+0xe7d/0x17d0 [md_mod]
> [<0>] blkdev_ioctl+0x4d0/0xa00
> [<0>] block_ioctl+0x39/0x40
> [<0>] do_vfs_ioctl+0xa4/0x630
> [<0>] ksys_ioctl+0x60/0x90
> [<0>] __x64_sys_ioctl+0x16/0x20
> [<0>] do_syscall_64+0x52/0x160
> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> cat: /proc//stack: No such file or directory
> [<0>] submit_bio_wait+0x5b/0x80
> [<0>] r5l_recovery_read_page+0x1b6/0x200 [raid456]
> [<0>] r5l_recovery_verify_data_checksum+0x19/0x70 [raid456]
> [<0>] r5l_start+0x99e/0xe70 [raid456]
> [<0>] md_start.part.48+0x2e/0x50 [md_mod]
> [<0>] do_md_run+0x64/0x100 [md_mod]
> [<0>] md_ioctl+0xe7d/0x17d0 [md_mod]
> [<0>] blkdev_ioctl+0x4d0/0xa00
> [<0>] block_ioctl+0x39/0x40
> [<0>] do_vfs_ioctl+0xa4/0x630
> [<0>] ksys_ioctl+0x60/0x90
> [<0>] __x64_sys_ioctl+0x16/0x20
> [<0>] do_syscall_64+0x52/0x160
> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> cat: /proc//stack: No such file or directory
> cat: /proc//stack: No such file or directory
> cat: /proc//stack: No such file or directory
> cat: /proc//stack: No such file or directory
>
>
> Strace output attached.

So mdadm hangs at "ioctl(4, RUN_ARRAY", right?

Does /proc/<pid of the hanging mdadm>/stack always show the same output above?
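
(A trivial way to keep polling it - hypothetical one-liner, assuming a
single mdadm process; adjust the pid lookup otherwise:)

    while sleep 0.1; do cat /proc/"$(pidof -s mdadm)"/stack; done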

Thanks,
Song

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Assemblin journaled array fails
       [not found]               ` <d0340d7b-6b3a-4fd3-e446-5f0967132ef6@yandex.pl>
@ 2020-05-18 23:55                 ` Song Liu
  2020-05-20 12:08                   ` Michal Soltys
  2020-05-25 16:23                   ` Michal Soltys
  0 siblings, 2 replies; 46+ messages in thread
From: Song Liu @ 2020-05-18 23:55 UTC (permalink / raw)
  To: Michal Soltys; +Cc: linux-raid

On Mon, May 18, 2020 at 4:43 AM Michal Soltys <msoltyspl@yandex.pl> wrote:
>
> On 5/16/20 2:15 AM, Song Liu wrote:
> > On Thu, May 14, 2020 at 4:07 AM Michal Soltys <msoltyspl@yandex.pl> wrote:
> >>
> >> On 5/13/20 6:17 PM, Song Liu wrote:
> >>>>>
> >>>>> Are these captured back to back? I am asking because they showed different
> >>>>> "Events" number.
> >>>>
> >>>> Nah, they were captured between reboots. Back to back all event fields show identical value (at 56291 now).
> >>>>
> >>>>>
> >>>>> Also, when mdadm -A hangs, could you please capture /proc/$(pidof mdadm)/stack ?
> >>>>>
> >>>>
> >>>> The output is empty:
> >>>>
> >>>> xs22:/☠ ps -eF fww | grep mdadm
> >>>> root     10332  9362 97   740  1884  25 12:47 pts/1    R+     6:59  |   \_ mdadm -A /dev/md/r5_big /dev/md/r1_journal_big /dev/sdi1 /dev/sdg1 /dev/sdj1 /dev/sdh1
> >>>> xs22:/☠ cd /proc/10332
> >>>> xs22:/proc/10332☠ cat stack
> >>>> xs22:/proc/10332☠
> >>>
> >>> Hmm... Could you please share the strace output of "mdadm -A" command? Like
> >>>
> >>> strace mdadm -A /dev/md/r5_big /dev/md/r1_journal_big /dev/xxx ...
> >>>
> >>> Thanks,
> >>> Song
> >>
> >> I did one more thing - ran cat on that stack in another terminal every
> >> 0.1 seconds, this is what was found:
> >>
> >> cat: /proc//stack: No such file or directory
> >> cat: /proc//stack: No such file or directory
> >> cat: /proc//stack: No such file or directory
> >> [<0>] submit_bio_wait+0x5b/0x80
> >> [<0>] r5l_recovery_read_page+0x1b6/0x200 [raid456]
> >> [<0>] r5l_recovery_verify_data_checksum+0x19/0x70 [raid456]
> >> [<0>] r5l_start+0x99e/0xe70 [raid456]
> >> [<0>] md_start.part.48+0x2e/0x50 [md_mod]
> >> [<0>] do_md_run+0x64/0x100 [md_mod]
> >> [<0>] md_ioctl+0xe7d/0x17d0 [md_mod]
> >> [<0>] blkdev_ioctl+0x4d0/0xa00
> >> [<0>] block_ioctl+0x39/0x40
> >> [<0>] do_vfs_ioctl+0xa4/0x630
> >> [<0>] ksys_ioctl+0x60/0x90
> >> [<0>] __x64_sys_ioctl+0x16/0x20
> >> [<0>] do_syscall_64+0x52/0x160
> >> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >> cat: /proc//stack: No such file or directory
> >> [<0>] submit_bio_wait+0x5b/0x80
> >> [<0>] r5l_recovery_read_page+0x1b6/0x200 [raid456]
> >> [<0>] r5l_recovery_verify_data_checksum+0x19/0x70 [raid456]
> >> [<0>] r5l_start+0x99e/0xe70 [raid456]
> >> [<0>] md_start.part.48+0x2e/0x50 [md_mod]
> >> [<0>] do_md_run+0x64/0x100 [md_mod]
> >> [<0>] md_ioctl+0xe7d/0x17d0 [md_mod]
> >> [<0>] blkdev_ioctl+0x4d0/0xa00
> >> [<0>] block_ioctl+0x39/0x40
> >> [<0>] do_vfs_ioctl+0xa4/0x630
> >> [<0>] ksys_ioctl+0x60/0x90
> >> [<0>] __x64_sys_ioctl+0x16/0x20
> >> [<0>] do_syscall_64+0x52/0x160
> >> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >> cat: /proc//stack: No such file or directory
> >> cat: /proc//stack: No such file or directory
> >> cat: /proc//stack: No such file or directory
> >> cat: /proc//stack: No such file or directory
> >>
> >>
> >> Strace output attached.
> >
> > So mdadm hangs at "ioctl(4, RUN_ARRAY", right?
> >
>
> Yes, it's consistent across attempts.
>
> > Does /proc/<pid of hanging mdadm>/stack always shows the same output above?
>
> Having peeked closer - it actually tries to do something somehow -
> attached the stack file dumped every ~0.01 seconds.
>
> After a while I can hit non-empty cat /proc/<pid>/stack every 5-15
> seconds this way.
>
> At one point "hanged" mdadm has been running this way for 24hr+ without
> any practical effect in the end.
>

Thanks for this information.

It looks like the kernel ran into some issue during the journal
recovery. Could you please try the following:

1. Check dmesg; we should see one of the following at some point:
    pr_info("md/raid:%s: starting from clean shutdown\n",
or
    pr_info("md/raid:%s: recovering %d data-only stripes and %d data-parity stripes\n",

2. Try using bcc/bpftrace to trace r5l_recovery_read_page(),
specifically its 4th argument. With bcc, it is something like:

    trace.py -M 100 'r5l_recovery_read_page() "%llx", arg4'

-M above limits the number of outputs to 100 lines; we may need to
increase the limit or remove the constraint. If the system doesn't have
bcc/bpftrace, you can also try with a kprobe - see the sketch below.
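
A rough kprobe sketch for that (untested; assumes tracefs is mounted at
/sys/kernel/debug/tracing, the raid456 module is loaded, and the kernel
supports the $argN fetch syntax, which 5.4 should):

    # confirm which recovery path dmesg reports (point 1 above)
    dmesg | grep -E 'md/raid:.*(clean shutdown|data-only stripes)'

    # trace the 4th argument (journal sector offset) of r5l_recovery_read_page
    cd /sys/kernel/debug/tracing
    echo 'p:r5l_read r5l_recovery_read_page pos=$arg4:x64' > kprobe_events
    echo 1 > events/kprobes/r5l_read/enable
    cat trace_pipe          # offsets should keep scrolling while mdadm hangs
    # clean up afterwards
    echo 0 > events/kprobes/r5l_read/enable
    echo > kprobe_events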

Thanks,
Song

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Assemblin journaled array fails
  2020-05-18 23:55                 ` Song Liu
@ 2020-05-20 12:08                   ` Michal Soltys
  2020-05-25 16:23                   ` Michal Soltys
  1 sibling, 0 replies; 46+ messages in thread
From: Michal Soltys @ 2020-05-20 12:08 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-raid

On 5/19/20 1:55 AM, Song Liu wrote:
> 
> Thanks for these information.
> 
> It looks like the kernel had some issue in recovery. Could you please
> try the following:
> 
> 1. check dmesg, we should see one of the following at some point:
>      pr_info("md/raid:%s: starting from clean shutdown\n",
> or
>      pr_info("md/raid:%s: recovering %d data-only stripes and %d
> data-parity stripes\n",
> 
> 2. try use bcc/bpftrace to trace r5l_recovery_read_page(),
> specifically, the 4th argument.
> With bcc, it is something like:
> 
>      trace.py -M 100 'r5l_recovery_read_page() "%llx", arg4'
> 
> -M above limits the number of outputs to 100 lines. We may need to
> increase the limit or
> remove the constraint. If the system doesn't have bcc/bpftrace. You
> can also try with
> kprobe.
> 
> Thanks,
> Song
> 

Will do the above tests in the coming days.

As for dmesg, it was always clean right after mdadm's invocation (that's
why I never noticed it), but today I found on the machine that the
following popped up around *19* hours after the mdadm assembly was
started:

Note - since the last dmesg entry yesterday, nothing new has shown up in
dmesg, and mdadm is still running.

ps a -o command,lstart | grep mdadm
strace -f -o mdadm.strace.l Mon May 18 13:25:45 2020
mdadm -A /dev/md/r5_big /de Mon May 18 13:25:45 2020



[May19 09:53] md/raid:md124: recovering 24667 data-only stripes and 
50212 data-parity stripes
[May19 09:57] INFO: task mdadm:4698 blocked for more than 120 seconds.
[  +0.007114]       Not tainted 5.4.0-0.bpo.4-amd64 #1 Debian 
5.4.19-1~bpo10+1
[  +0.007881] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[  +0.008754] mdadm           D    0  4698   4695 0x00004003
[  +0.006128] Call Trace:
[  +0.002599]  ? __schedule+0x2e6/0x6f0
[  +0.004222]  schedule+0x2f/0xa0
[  +0.003519]  r5l_start.cold.43+0x544/0x549 [raid456]
[  +0.005560]  ? finish_wait+0x80/0x80
[  +0.003901]  md_start.part.48+0x2e/0x50 [md_mod]
[  +0.005175]  do_md_run+0x64/0x100 [md_mod]
[  +0.004582]  md_ioctl+0xe7d/0x17d0 [md_mod]
[  +0.004675]  blkdev_ioctl+0x4d0/0xa00
[  +0.004104]  block_ioctl+0x39/0x40
[  +0.003806]  do_vfs_ioctl+0xa4/0x630
[  +0.004004]  ksys_ioctl+0x60/0x90
[  +0.003697]  __x64_sys_ioctl+0x16/0x20
[  +0.003966]  do_syscall_64+0x52/0x160
[  +0.004100]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.005658] RIP: 0033:0x7efcb9969427
[  +0.004118] Code: Bad RIP value.
[  +0.003611] RSP: 002b:00007ffd902f2508 EFLAGS: 00000246 ORIG_RAX: 
0000000000000010
[  +0.008461] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 
00007efcb9969427
[  +0.007964] RDX: 0000000000000000 RSI: 00000000400c0930 RDI: 
0000000000000004
[  +0.007965] RBP: 00007ffd902f2680 R08: 0000561686dc6f40 R09: 
00005616858b9b60
[  +0.007968] R10: 0000000000000001 R11: 0000000000000246 R12: 
0000000000000002
[  +0.008034] R13: 0000000000000000 R14: 00007ffd902f2e30 R15: 
0000000000000004
[May19 09:59] INFO: task mdadm:4698 blocked for more than 241 seconds.
[  +0.007113]       Not tainted 5.4.0-0.bpo.4-amd64 #1 Debian 
5.4.19-1~bpo10+1
[  +0.031011] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[  +0.032417] mdadm           D    0  4698   4695 0x00004003
[  +0.030883] Call Trace:
[  +0.027037]  ? __schedule+0x2e6/0x6f0
[  +0.027109]  schedule+0x2f/0xa0
[  +0.026651]  r5l_start.cold.43+0x544/0x549 [raid456]
[  +0.028842]  ? finish_wait+0x80/0x80
[  +0.026646]  md_start.part.48+0x2e/0x50 [md_mod]
[  +0.027846]  do_md_run+0x64/0x100 [md_mod]
[  +0.027423]  md_ioctl+0xe7d/0x17d0 [md_mod]
[  +0.027029]  blkdev_ioctl+0x4d0/0xa00
[  +0.027716]  block_ioctl+0x39/0x40
[  +0.028195]  do_vfs_ioctl+0xa4/0x630
[  +0.028868]  ksys_ioctl+0x60/0x90
[  +0.028845]  __x64_sys_ioctl+0x16/0x20
[  +0.030159]  do_syscall_64+0x52/0x160
[  +0.029944]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.029988] RIP: 0033:0x7efcb9969427
[  +0.027728] Code: Bad RIP value.
[  +0.025732] RSP: 002b:00007ffd902f2508 EFLAGS: 00000246 ORIG_RAX: 
0000000000000010
[  +0.028386] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 
00007efcb9969427
[  +0.028462] RDX: 0000000000000000 RSI: 00000000400c0930 RDI: 
0000000000000004
[  +0.028854] RBP: 00007ffd902f2680 R08: 0000561686dc6f40 R09: 
00005616858b9b60
[  +0.027692] R10: 0000000000000001 R11: 0000000000000246 R12: 
0000000000000002
[  +0.028838] R13: 0000000000000000 R14: 00007ffd902f2e30 R15: 
0000000000000004
[May19 10:01] INFO: task mdadm:4698 blocked for more than 362 seconds.
[  +0.027842]       Not tainted 5.4.0-0.bpo.4-amd64 #1 Debian 
5.4.19-1~bpo10+1
[  +0.028293] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[  +0.029933] mdadm           D    0  4698   4695 0x00004003
[  +0.027836] Call Trace:
[  +0.025804]  ? __schedule+0x2e6/0x6f0
[  +0.025857]  schedule+0x2f/0xa0
[  +0.025504]  r5l_start.cold.43+0x544/0x549 [raid456]
[  +0.026469]  ? finish_wait+0x80/0x80
[  +0.026253]  md_start.part.48+0x2e/0x50 [md_mod]
[  +0.026693]  do_md_run+0x64/0x100 [md_mod]
[  +0.027043]  md_ioctl+0xe7d/0x17d0 [md_mod]
[  +0.027063]  blkdev_ioctl+0x4d0/0xa00
[  +0.025390]  block_ioctl+0x39/0x40
[  +0.026334]  do_vfs_ioctl+0xa4/0x630
[  +0.025508]  ksys_ioctl+0x60/0x90
[  +0.024969]  __x64_sys_ioctl+0x16/0x20
[  +0.024461]  do_syscall_64+0x52/0x160
[  +0.025266]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.025898] RIP: 0033:0x7efcb9969427
[  +0.025336] Code: Bad RIP value.
[  +0.023623] RSP: 002b:00007ffd902f2508 EFLAGS: 00000246 ORIG_RAX: 
0000000000000010
[  +0.029437] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 
00007efcb9969427
[  +0.028688] RDX: 0000000000000000 RSI: 00000000400c0930 RDI: 
0000000000000004
[  +0.028993] RBP: 00007ffd902f2680 R08: 0000561686dc6f40 R09: 
00005616858b9b60
[  +0.028810] R10: 0000000000000001 R11: 0000000000000246 R12: 
0000000000000002
[  +0.028715] R13: 0000000000000000 R14: 00007ffd902f2e30 R15: 
0000000000000004
[May19 10:03] INFO: task mdadm:4698 blocked for more than 483 seconds.
[  +0.028312]       Not tainted 5.4.0-0.bpo.4-amd64 #1 Debian 
5.4.19-1~bpo10+1
[  +0.029922] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[  +0.029811] mdadm           D    0  4698   4695 0x00004003
[  +0.028094] Call Trace:
[  +0.024963]  ? __schedule+0x2e6/0x6f0
[  +0.025823]  schedule+0x2f/0xa0
[  +0.029291]  r5l_start.cold.43+0x544/0x549 [raid456]
[  +0.031421]  ? finish_wait+0x80/0x80
[  +0.028282]  md_start.part.48+0x2e/0x50 [md_mod]
[  +0.026902]  do_md_run+0x64/0x100 [md_mod]
[  +0.026903]  md_ioctl+0xe7d/0x17d0 [md_mod]
[  +0.026249]  blkdev_ioctl+0x4d0/0xa00
[  +0.026441]  block_ioctl+0x39/0x40
[  +0.026720]  do_vfs_ioctl+0xa4/0x630
[  +0.025734]  ksys_ioctl+0x60/0x90
[  +0.025040]  __x64_sys_ioctl+0x16/0x20
[  +0.025255]  do_syscall_64+0x52/0x160
[  +0.024974]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.026676] RIP: 0033:0x7efcb9969427
[  +0.025491] Code: Bad RIP value.
[  +0.023661] RSP: 002b:00007ffd902f2508 EFLAGS: 00000246 ORIG_RAX: 
0000000000000010
[  +0.029253] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 
00007efcb9969427
[  +0.027865] RDX: 0000000000000000 RSI: 00000000400c0930 RDI: 
0000000000000004
[  +0.029299] RBP: 00007ffd902f2680 R08: 0000561686dc6f40 R09: 
00005616858b9b60
[  +0.028662] R10: 0000000000000001 R11: 0000000000000246 R12: 
0000000000000002
[  +0.028661] R13: 0000000000000000 R14: 00007ffd902f2e30 R15: 
0000000000000004
[May19 10:05] INFO: task mdadm:4698 blocked for more than 604 seconds.
[  +0.028120]       Not tainted 5.4.0-0.bpo.4-amd64 #1 Debian 
5.4.19-1~bpo10+1
[  +0.028921] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[  +0.030050] mdadm           D    0  4698   4695 0x00004003
[  +0.028201] Call Trace:
[  +0.025739]  ? __schedule+0x2e6/0x6f0
[  +0.025130]  schedule+0x2f/0xa0
[  +0.025862]  r5l_start.cold.43+0x544/0x549 [raid456]
[  +0.026863]  ? finish_wait+0x80/0x80
[  +0.026362]  md_start.part.48+0x2e/0x50 [md_mod]
[  +0.028033]  do_md_run+0x64/0x100 [md_mod]
[  +0.026302]  md_ioctl+0xe7d/0x17d0 [md_mod]
[  +0.027130]  blkdev_ioctl+0x4d0/0xa00
[  +0.025276]  block_ioctl+0x39/0x40
[  +0.025804]  do_vfs_ioctl+0xa4/0x630
[  +0.025543]  ksys_ioctl+0x60/0x90
[  +0.025133]  __x64_sys_ioctl+0x16/0x20
[  +0.025289]  do_syscall_64+0x52/0x160
[  +0.025019]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.027423] RIP: 0033:0x7efcb9969427
[  +0.025241] Code: Bad RIP value.
[  +0.024188] RSP: 002b:00007ffd902f2508 EFLAGS: 00000246 ORIG_RAX: 
0000000000000010
[  +0.029330] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 
00007efcb9969427
[  +0.028777] RDX: 0000000000000000 RSI: 00000000400c0930 RDI: 
0000000000000004
[  +0.029321] RBP: 00007ffd902f2680 R08: 0000561686dc6f40 R09: 
00005616858b9b60
[  +0.030312] R10: 0000000000000001 R11: 0000000000000246 R12: 
0000000000000002
[  +0.028756] R13: 0000000000000000 R14: 00007ffd902f2e30 R15: 
0000000000000004
[May19 10:07] INFO: task mdadm:4698 blocked for more than 724 seconds.
[  +0.028161]       Not tainted 5.4.0-0.bpo.4-amd64 #1 Debian 
5.4.19-1~bpo10+1
[  +0.029025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[  +0.029438] mdadm           D    0  4698   4695 0x00004003
[  +0.027906] Call Trace:
[  +0.024749]  ? __schedule+0x2e6/0x6f0
[  +0.025400]  schedule+0x2f/0xa0
[  +0.027335]  r5l_start.cold.43+0x544/0x549 [raid456]
[  +0.030223]  ? finish_wait+0x80/0x80
[  +0.028959]  md_start.part.48+0x2e/0x50 [md_mod]
[  +0.029618]  do_md_run+0x64/0x100 [md_mod]
[  +0.026927]  md_ioctl+0xe7d/0x17d0 [md_mod]
[  +0.026187]  blkdev_ioctl+0x4d0/0xa00
[  +0.026722]  block_ioctl+0x39/0x40
[  +0.026070]  do_vfs_ioctl+0xa4/0x630
[  +0.026268]  ksys_ioctl+0x60/0x90
[  +0.025397]  __x64_sys_ioctl+0x16/0x20
[  +0.025492]  do_syscall_64+0x52/0x160
[  +0.025293]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.026295] RIP: 0033:0x7efcb9969427
[  +0.025152] Code: Bad RIP value.
[  +0.023763] RSP: 002b:00007ffd902f2508 EFLAGS: 00000246 ORIG_RAX: 
0000000000000010
[  +0.029591] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 
00007efcb9969427
[  +0.027850] RDX: 0000000000000000 RSI: 00000000400c0930 RDI: 
0000000000000004
[  +0.028603] RBP: 00007ffd902f2680 R08: 0000561686dc6f40 R09: 
00005616858b9b60
[  +0.028714] R10: 0000000000000001 R11: 0000000000000246 R12: 
0000000000000002
[  +0.028870] R13: 0000000000000000 R14: 00007ffd902f2e30 R15: 
0000000000000004
[May19 10:09] INFO: task mdadm:4698 blocked for more than 845 seconds.
[  +0.028077]       Not tainted 5.4.0-0.bpo.4-amd64 #1 Debian 
5.4.19-1~bpo10+1
[  +0.029226] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[  +0.029321] mdadm           D    0  4698   4695 0x00004003
[May19 10:10] Call Trace:
[  +0.025422]  ? __schedule+0x2e6/0x6f0
[  +0.025490]  schedule+0x2f/0xa0
[  +0.026722]  r5l_start.cold.43+0x544/0x549 [raid456]
[  +0.026858]  ? finish_wait+0x80/0x80
[  +0.026399]  md_start.part.48+0x2e/0x50 [md_mod]
[  +0.027444]  do_md_run+0x64/0x100 [md_mod]
[  +0.026228]  md_ioctl+0xe7d/0x17d0 [md_mod]
[  +0.027008]  blkdev_ioctl+0x4d0/0xa00
[  +0.025656]  block_ioctl+0x39/0x40
[  +0.026068]  do_vfs_ioctl+0xa4/0x630
[  +0.025899]  ksys_ioctl+0x60/0x90
[  +0.025539]  __x64_sys_ioctl+0x16/0x20
[  +0.025885]  do_syscall_64+0x52/0x160
[  +0.025075]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.026798] RIP: 0033:0x7efcb9969427
[  +0.024981] Code: Bad RIP value.
[  +0.024604] RSP: 002b:00007ffd902f2508 EFLAGS: 00000246 ORIG_RAX: 
0000000000000010
[  +0.028628] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 
00007efcb9969427
[  +0.028698] RDX: 0000000000000000 RSI: 00000000400c0930 RDI: 
0000000000000004
[  +0.028732] RBP: 00007ffd902f2680 R08: 0000561686dc6f40 R09: 
00005616858b9b60
[  +0.027649] R10: 0000000000000001 R11: 0000000000000246 R12: 
0000000000000002
[  +0.029800] R13: 0000000000000000 R14: 00007ffd902f2e30 R15: 
0000000000000004
[May19 10:12] INFO: task mdadm:4698 blocked for more than 966 seconds.
[  +0.030285]       Not tainted 5.4.0-0.bpo.4-amd64 #1 Debian 
5.4.19-1~bpo10+1
[  +0.032412] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[  +0.033870] mdadm           D    0  4698   4695 0x00004003
[  +0.031620] Call Trace:
[  +0.028343]  ? __schedule+0x2e6/0x6f0
[  +0.028456]  schedule+0x2f/0xa0
[  +0.025514]  r5l_start.cold.43+0x544/0x549 [raid456]
[  +0.028027]  ? finish_wait+0x80/0x80
[  +0.025278]  md_start.part.48+0x2e/0x50 [md_mod]
[  +0.028173]  do_md_run+0x64/0x100 [md_mod]
[  +0.027243]  md_ioctl+0xe7d/0x17d0 [md_mod]
[  +0.026750]  blkdev_ioctl+0x4d0/0xa00
[  +0.026450]  block_ioctl+0x39/0x40
[  +0.025187]  do_vfs_ioctl+0xa4/0x630
[  +0.025689]  ksys_ioctl+0x60/0x90
[  +0.024076]  __x64_sys_ioctl+0x16/0x20
[  +0.025253]  do_syscall_64+0x52/0x160
[  +0.024310]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.027527] RIP: 0033:0x7efcb9969427
[  +0.024193] Code: Bad RIP value.
[  +0.024483] RSP: 002b:00007ffd902f2508 EFLAGS: 00000246 ORIG_RAX: 
0000000000000010
[  +0.028585] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 
00007efcb9969427
[  +0.029009] RDX: 0000000000000000 RSI: 00000000400c0930 RDI: 
0000000000000004
[  +0.028403] RBP: 00007ffd902f2680 R08: 0000561686dc6f40 R09: 
00005616858b9b60
[  +0.027510] R10: 0000000000000001 R11: 0000000000000246 R12: 
0000000000000002
[  +0.028629] R13: 0000000000000000 R14: 00007ffd902f2e30 R15: 
0000000000000004
[May19 10:14] INFO: task mdadm:4698 blocked for more than 1087 seconds.
[  +0.028117]       Not tainted 5.4.0-0.bpo.4-amd64 #1 Debian 
5.4.19-1~bpo10+1
[  +0.029008] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[  +0.030891] mdadm           D    0  4698   4695 0x00004003
[  +0.027018] Call Trace:
[  +0.024698]  ? __schedule+0x2e6/0x6f0
[  +0.025217]  schedule+0x2f/0xa0
[  +0.025613]  r5l_start.cold.43+0x544/0x549 [raid456]
[  +0.027065]  ? finish_wait+0x80/0x80
[  +0.026013]  md_start.part.48+0x2e/0x50 [md_mod]
[  +0.027664]  do_md_run+0x64/0x100 [md_mod]
[  +0.026160]  md_ioctl+0xe7d/0x17d0 [md_mod]
[  +0.027902]  blkdev_ioctl+0x4d0/0xa00
[  +0.026085]  block_ioctl+0x39/0x40
[  +0.026057]  do_vfs_ioctl+0xa4/0x630
[  +0.025886]  ksys_ioctl+0x60/0x90
[  +0.024539]  __x64_sys_ioctl+0x16/0x20
[  +0.025231]  do_syscall_64+0x52/0x160
[  +0.025571]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.026934] RIP: 0033:0x7efcb9969427
[  +0.024585] Code: Bad RIP value.
[  +0.024635] RSP: 002b:00007ffd902f2508 EFLAGS: 00000246 ORIG_RAX: 
0000000000000010
[  +0.029237] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 
00007efcb9969427
[  +0.031792] RDX: 0000000000000000 RSI: 00000000400c0930 RDI: 
0000000000000004
[  +0.031758] RBP: 00007ffd902f2680 R08: 0000561686dc6f40 R09: 
00005616858b9b60
[  +0.031567] R10: 0000000000000001 R11: 0000000000000246 R12: 
0000000000000002
[  +0.028862] R13: 0000000000000000 R14: 00007ffd902f2e30 R15: 
0000000000000004
[May19 10:16] INFO: task mdadm:4698 blocked for more than 1208 seconds.
[  +0.028273]       Not tainted 5.4.0-0.bpo.4-amd64 #1 Debian 
5.4.19-1~bpo10+1
[  +0.029463] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[  +0.031789] mdadm           D    0  4698   4695 0x00004003
[  +0.028022] Call Trace:
[  +0.024606]  ? __schedule+0x2e6/0x6f0
[  +0.026123]  schedule+0x2f/0xa0
[  +0.024892]  r5l_start.cold.43+0x544/0x549 [raid456]
[  +0.027620]  ? finish_wait+0x80/0x80
[  +0.025492]  md_start.part.48+0x2e/0x50 [md_mod]
[  +0.027706]  do_md_run+0x64/0x100 [md_mod]
[  +0.026976]  md_ioctl+0xe7d/0x17d0 [md_mod]
[  +0.026705]  blkdev_ioctl+0x4d0/0xa00
[  +0.027531]  block_ioctl+0x39/0x40
[  +0.024963]  do_vfs_ioctl+0xa4/0x630
[  +0.025568]  ksys_ioctl+0x60/0x90
[  +0.024137]  __x64_sys_ioctl+0x16/0x20
[  +0.025538]  do_syscall_64+0x52/0x160
[  +0.024200]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.026633] RIP: 0033:0x7efcb9969427
[  +0.024619] Code: Bad RIP value.
[  +0.024698] RSP: 002b:00007ffd902f2508 EFLAGS: 00000246 ORIG_RAX: 
0000000000000010
[  +0.029381] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 
00007efcb9969427
[  +0.028794] RDX: 0000000000000000 RSI: 00000000400c0930 RDI: 
0000000000000004
[  +0.028724] RBP: 00007ffd902f2680 R08: 0000561686dc6f40 R09: 
00005616858b9b60
[  +0.028828] R10: 0000000000000001 R11: 0000000000000246 R12: 
0000000000000002
[  +0.028030] R13: 0000000000000000 R14: 00007ffd902f2e30 R15: 
0000000000000004

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Assemblin journaled array fails
  2020-05-18 23:55                 ` Song Liu
  2020-05-20 12:08                   ` Michal Soltys
@ 2020-05-25 16:23                   ` Michal Soltys
  2020-05-26 23:37                     ` Song Liu
  1 sibling, 1 reply; 46+ messages in thread
From: Michal Soltys @ 2020-05-25 16:23 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-raid

On 5/19/20 1:55 AM, Song Liu wrote:
> 
> 2. try use bcc/bpftrace to trace r5l_recovery_read_page(),
> specifically, the 4th argument.
> With bcc, it is something like:
> 
>      trace.py -M 100 'r5l_recovery_read_page() "%llx", arg4'
> 
> -M above limits the number of outputs to 100 lines. We may need to
> increase the limit or
> remove the constraint. If the system doesn't have bcc/bpftrace. You
> can also try with
> kprobe.
> 


The trace keeps outputting the following data (with a steadily growing
4th argument):

PID     TID     COMM            FUNC             -
3456    3456    mdadm           r5l_recovery_read_page 98f65b8
3456    3456    mdadm           r5l_recovery_read_page 98f65c0
3456    3456    mdadm           r5l_recovery_read_page 98f65c8
3456    3456    mdadm           r5l_recovery_read_page 98f65d0
3456    3456    mdadm           r5l_recovery_read_page 98f65d8
3456    3456    mdadm           r5l_recovery_read_page 98f65e0
3456    3456    mdadm           r5l_recovery_read_page 98f65e8
3456    3456    mdadm           r5l_recovery_read_page 98f65f0
3456    3456    mdadm           r5l_recovery_read_page 98f65f8
3456    3456    mdadm           r5l_recovery_read_page 98f6600
3456    3456    mdadm           r5l_recovery_read_page 98f65c0
3456    3456    mdadm           r5l_recovery_read_page 98f65c8
3456    3456    mdadm           r5l_recovery_read_page 98f65d0
3456    3456    mdadm           r5l_recovery_read_page 98f65d8
3456    3456    mdadm           r5l_recovery_read_page 98f65e0
3456    3456    mdadm           r5l_recovery_read_page 98f65e8
3456    3456    mdadm           r5l_recovery_read_page 98f65f0
3456    3456    mdadm           r5l_recovery_read_page 98f65f8
3456    3456    mdadm           r5l_recovery_read_page 98f6600
3456    3456    mdadm           r5l_recovery_read_page 98f6608
3456    3456    mdadm           r5l_recovery_read_page 98f6610

... a few minutes later ...

PID     TID     COMM            FUNC             -
3456    3456    mdadm           r5l_recovery_read_page 9b69b60
3456    3456    mdadm           r5l_recovery_read_page 9b69b68
3456    3456    mdadm           r5l_recovery_read_page 9b69b70
3456    3456    mdadm           r5l_recovery_read_page 9b69b78
3456    3456    mdadm           r5l_recovery_read_page 9b69b80
3456    3456    mdadm           r5l_recovery_read_page 9b69b88
3456    3456    mdadm           r5l_recovery_read_page 9b69b90
3456    3456    mdadm           r5l_recovery_read_page 9b69b98
3456    3456    mdadm           r5l_recovery_read_page 9b69ba0
3456    3456    mdadm           r5l_recovery_read_page 9b69ba8
3456    3456    mdadm           r5l_recovery_read_page 9b69bb0
3456    3456    mdadm           r5l_recovery_read_page 9b69bb8
3456    3456    mdadm           r5l_recovery_read_page 9b69bc0
3456    3456    mdadm           r5l_recovery_read_page 9b69bc8
3456    3456    mdadm           r5l_recovery_read_page 9b69b90
3456    3456    mdadm           r5l_recovery_read_page 9b69b98
3456    3456    mdadm           r5l_recovery_read_page 9b69ba0
3456    3456    mdadm           r5l_recovery_read_page 9b69ba8
3456    3456    mdadm           r5l_recovery_read_page 9b69bb0
3456    3456    mdadm           r5l_recovery_read_page 9b69bb8

... and so on

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Assemblin journaled array fails
  2020-05-25 16:23                   ` Michal Soltys
@ 2020-05-26 23:37                     ` Song Liu
  2020-05-27 13:36                       ` Michal Soltys
  0 siblings, 1 reply; 46+ messages in thread
From: Song Liu @ 2020-05-26 23:37 UTC (permalink / raw)
  To: Michal Soltys; +Cc: linux-raid

On Mon, May 25, 2020 at 9:23 AM Michal Soltys <msoltyspl@yandex.pl> wrote:
>
> On 5/19/20 1:55 AM, Song Liu wrote:
> >
> > 2. try use bcc/bpftrace to trace r5l_recovery_read_page(),
> > specifically, the 4th argument.
> > With bcc, it is something like:
> >
> >      trace.py -M 100 'r5l_recovery_read_page() "%llx", arg4'
> >
> > -M above limits the number of outputs to 100 lines. We may need to
> > increase the limit or
> > remove the constraint. If the system doesn't have bcc/bpftrace. You
> > can also try with
> > kprobe.
> >
>
>
> Trace keeps outputting the following data (with steadily growing 4th
> argument):
>
> PID     TID     COMM            FUNC             -
> 3456    3456    mdadm           r5l_recovery_read_page 98f65b8
> 3456    3456    mdadm           r5l_recovery_read_page 98f65c0
> 3456    3456    mdadm           r5l_recovery_read_page 98f65c8
> 3456    3456    mdadm           r5l_recovery_read_page 98f65d0
> 3456    3456    mdadm           r5l_recovery_read_page 98f65d8
> 3456    3456    mdadm           r5l_recovery_read_page 98f65e0
> 3456    3456    mdadm           r5l_recovery_read_page 98f65e8
> 3456    3456    mdadm           r5l_recovery_read_page 98f65f0
> 3456    3456    mdadm           r5l_recovery_read_page 98f65f8
> 3456    3456    mdadm           r5l_recovery_read_page 98f6600
> 3456    3456    mdadm           r5l_recovery_read_page 98f65c0
> 3456    3456    mdadm           r5l_recovery_read_page 98f65c8
> 3456    3456    mdadm           r5l_recovery_read_page 98f65d0
> 3456    3456    mdadm           r5l_recovery_read_page 98f65d8
> 3456    3456    mdadm           r5l_recovery_read_page 98f65e0
> 3456    3456    mdadm           r5l_recovery_read_page 98f65e8
> 3456    3456    mdadm           r5l_recovery_read_page 98f65f0
> 3456    3456    mdadm           r5l_recovery_read_page 98f65f8
> 3456    3456    mdadm           r5l_recovery_read_page 98f6600
> 3456    3456    mdadm           r5l_recovery_read_page 98f6608
> 3456    3456    mdadm           r5l_recovery_read_page 98f6610
>
> ... a few minutes later ...
>
> PID     TID     COMM            FUNC             -
> 3456    3456    mdadm           r5l_recovery_read_page 9b69b60
> 3456    3456    mdadm           r5l_recovery_read_page 9b69b68
>
> 3456    3456    mdadm           r5l_recovery_read_page 9b69b70
> 3456    3456    mdadm           r5l_recovery_read_page 9b69b78
>
> 3456    3456    mdadm           r5l_recovery_read_page 9b69b80
> 3456    3456    mdadm           r5l_recovery_read_page 9b69b88
> 3456    3456    mdadm           r5l_recovery_read_page 9b69b90
>
>
> 3456    3456    mdadm           r5l_recovery_read_page 9b69b98
> 3456    3456    mdadm           r5l_recovery_read_page 9b69ba0
> 3456    3456    mdadm           r5l_recovery_read_page 9b69ba8
>
>
> 3456    3456    mdadm           r5l_recovery_read_page 9b69bb0
>
>
> 3456    3456    mdadm           r5l_recovery_read_page 9b69bb8
> 3456    3456    mdadm           r5l_recovery_read_page 9b69bc0
>
> 3456    3456    mdadm           r5l_recovery_read_page 9b69bc8
> 3456    3456    mdadm           r5l_recovery_read_page 9b69b90
>
>
> 3456    3456    mdadm           r5l_recovery_read_page 9b69b98
> 3456    3456    mdadm           r5l_recovery_read_page 9b69ba0
> 3456    3456    mdadm           r5l_recovery_read_page 9b69ba8
> 3456    3456    mdadm           r5l_recovery_read_page 9b69bb0
> 3456    3456    mdadm           r5l_recovery_read_page 9b69bb8
>
> ... and so on

Looks like the kernel has processed about 1.2GB of journal (9b69bb8 -
98f65b8 sectors), and the limit is min(1/4 of the disk size, 10GB).
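
(Back-of-the-envelope, assuming those values are 512-byte sector offsets
into the journal, as in the trace output above:)

    printf '%d\n' $(( (0x9b69bb8 - 0x98f65b8) * 512 ))   # 1315700736 bytes, ~1.2 GiB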

I just checked the code; it should stop once it hits a checksum
mismatch. Does it keep going after half an hour or so?

Thanks,
Song

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Assemblin journaled array fails
  2020-05-26 23:37                     ` Song Liu
@ 2020-05-27 13:36                       ` Michal Soltys
  2020-05-28 10:18                         ` Michal Soltys
  0 siblings, 1 reply; 46+ messages in thread
From: Michal Soltys @ 2020-05-27 13:36 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-raid

On 5/27/20 1:37 AM, Song Liu wrote:
> On Mon, May 25, 2020 at 9:23 AM Michal Soltys <msoltyspl@yandex.pl> wrote:
>>
>> On 5/19/20 1:55 AM, Song Liu wrote:
>>>
>>> 2. try use bcc/bpftrace to trace r5l_recovery_read_page(),
>>> specifically, the 4th argument.
>>> With bcc, it is something like:
>>>
>>>       trace.py -M 100 'r5l_recovery_read_page() "%llx", arg4'
>>>
>>> -M above limits the number of outputs to 100 lines. We may need to
>>> increase the limit or
>>> remove the constraint. If the system doesn't have bcc/bpftrace. You
>>> can also try with
>>> kprobe.
>>>
>>
>>
> 
> Looks like the kernel has processed about 1.2GB of journal (9b69bb8 -
> 98f65b8 sectors).
> And the limit is min(1/4 disk size, 10GB).
> 
> I just checked the code, it should stop once it hits checksum
> mismatch. Does it keep going
> after half hour or so?

It keeps going so far (mdadm was started a few hours ago):

xs22:/home/msl☠ head -n 4 trace.out
TIME     PID     TID     COMM            FUNC             -
12:58:21 40739   40739   mdadm           r5l_recovery_read_page 98f65b8
12:58:21 40739   40739   mdadm           r5l_recovery_read_page 98f65c0
12:58:21 40739   40739   mdadm           r5l_recovery_read_page 98f65c8
xs22:/home/msl☠ tail -n 4 trace.out
15:34:50 40739   40739   mdadm           r5l_recovery_read_page bc04e88
15:34:50 40739   40739   mdadm           r5l_recovery_read_page bc04e90
15:34:50 40739   40739   mdadm           r5l_recovery_read_page bc04e98

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Assemblin journaled array fails
  2020-05-27 13:36                       ` Michal Soltys
@ 2020-05-28 10:18                         ` Michal Soltys
  2020-05-28 15:06                           ` Song Liu
  0 siblings, 1 reply; 46+ messages in thread
From: Michal Soltys @ 2020-05-28 10:18 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-raid

On 5/27/20 3:36 PM, Michal Soltys wrote:
> On 5/27/20 1:37 AM, Song Liu wrote:
>>
>> Looks like the kernel has processed about 1.2GB of journal (9b69bb8 -
>> 98f65b8 sectors).
>> And the limit is min(1/4 disk size, 10GB).
>>
>> I just checked the code, it should stop once it hits checksum
>> mismatch. Does it keep going
>> after half hour or so?
> 
> It keeps going so far (mdadm started few hours ago):
> 
> xs22:/home/msl☠ head -n 4 trace.out
> TIME     PID     TID     COMM            FUNC             -
> 12:58:21 40739   40739   mdadm           r5l_recovery_read_page 98f65b8
> 12:58:21 40739   40739   mdadm           r5l_recovery_read_page 98f65c0
> 12:58:21 40739   40739   mdadm           r5l_recovery_read_page 98f65c8
> xs22:/home/msl☠ tail -n 4 trace.out
> 15:34:50 40739   40739   mdadm           r5l_recovery_read_page bc04e88
> 15:34:50 40739   40739   mdadm           r5l_recovery_read_page bc04e90
> 15:34:50 40739   40739   mdadm           r5l_recovery_read_page bc04e98
> 

To expand on the above (mdadm was started yesterday):

- it went on for ~19+ hours - like previously
- the last piece in trace output:

xs22:/home/msl☠ tail -n 4 trace.out ; echo
09:08:12 40739   40739   mdadm           r5l_recovery_read_page 1852d8b0
09:08:12 40739   40739   mdadm           r5l_recovery_read_page 1852d8b8
09:08:12 40739   40739   mdadm           r5l_recovery_read_page 1852d8c0
09:08:12 40739   40739   mdadm           r5l_recovery_read_page 1852d8c8

So it went through ~120 GB of journal (the journal device is 256 GB).
The non-assembling raid issue aside (assuming it's unrelated) - why does it
take such an enormous amount of time? That works out to an average of
roughly 1.7 MB/s.

Around 09:08 today, the recovery line popped in the dmesg as previously:

[May27 12:58] md: md124 stopped.
[  +0.006779] md/raid:md124: device sdf1 operational as raid disk 0
[  +0.006929] md/raid:md124: device sdj1 operational as raid disk 3
[  +0.006903] md/raid:md124: device sdi1 operational as raid disk 2
[  +0.006914] md/raid:md124: device sdh1 operational as raid disk 1
[  +0.007426] md/raid:md124: raid level 5 active with 4 out of 4 
devices, algorithm 2
[May28 09:08] md/raid:md124: recovering 24667 data-only stripes and 
50212 data-parity stripes
[May28 09:11] INFO: task mdadm:40739 blocked for more than 120 seconds.
[  +0.007304]       Not tainted 5.4.0-0.bpo.4-amd64 #1 Debian 
5.4.19-1~bpo10+1
[  +0.008008] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[  +0.008858] mdadm           D    0 40739  40073 0x00004002
[  +0.006222] Call Trace:
[  +0.002870]  ? __schedule+0x2e6/0x6f0
[  +0.004194]  schedule+0x2f/0xa0
[  +0.003631]  r5l_start.cold.43+0x544/0x549 [raid456]
[  +0.005669]  ? finish_wait+0x80/0x80
[  +0.004108]  md_start.part.48+0x2e/0x50 [md_mod]
[  +0.005271]  do_md_run+0x64/0x100 [md_mod]
[  +0.004696]  md_ioctl+0xe7d/0x17d0 [md_mod]
[  +0.004783]  blkdev_ioctl+0x4d0/0xa00
[  +0.004211]  block_ioctl+0x39/0x40
[  +0.003896]  do_vfs_ioctl+0xa4/0x630
[  +0.004119]  ksys_ioctl+0x60/0x90
[  +0.003824]  __x64_sys_ioctl+0x16/0x20
[  +0.004288]  do_syscall_64+0x52/0x160
[  +0.004202]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.005765] RIP: 0033:0x7f5e2e777427
[  +0.004111] Code: Bad RIP value.
[  +0.003708] RSP: 002b:00007ffd39152d78 EFLAGS: 00000246 ORIG_RAX: 
0000000000000010
[  +0.008562] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 
00007f5e2e777427
[  +0.008083] RDX: 0000000000000000 RSI: 00000000400c0930 RDI: 
0000000000000004
[  +0.008081] RBP: 00007ffd39152ef0 R08: 000055641fb6ff40 R09: 
000055641de6ab60
[  +0.008301] R10: 0000000000000001 R11: 0000000000000246 R12: 
0000000000000004
[  +0.008085] R13: 0000000000000000 R14: 00007ffd391536a0 R15: 
0000000000000004
[May28 09:13] INFO: task mdadm:40739 blocked for more than 241 seconds.

The last dmesg call trace is from few hours ago:

[May28 09:29] INFO: task mdadm:40739 blocked for more than 1208 seconds.
[  +0.030944]       Not tainted 5.4.0-0.bpo.4-amd64 #1 Debian 
5.4.19-1~bpo10+1
[  +0.031885] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[  +0.031088] mdadm           D    0 40739  40073 0x00004002
[  +0.028702] Call Trace:
[  +0.024943]  ? __schedule+0x2e6/0x6f0
[  +0.025448]  schedule+0x2f/0xa0
[  +0.025868]  r5l_start.cold.43+0x544/0x549 [raid456]
[  +0.027650]  ? finish_wait+0x80/0x80
[  +0.026630]  md_start.part.48+0x2e/0x50 [md_mod]
[  +0.027959]  do_md_run+0x64/0x100 [md_mod]
[  +0.026500]  md_ioctl+0xe7d/0x17d0 [md_mod]
[  +0.027326]  blkdev_ioctl+0x4d0/0xa00
[  +0.025951]  block_ioctl+0x39/0x40
[  +0.026311]  do_vfs_ioctl+0xa4/0x630
[  +0.026721]  ksys_ioctl+0x60/0x90
[  +0.025582]  __x64_sys_ioctl+0x16/0x20
[  +0.025665]  do_syscall_64+0x52/0x160
[  +0.024993]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.027166] RIP: 0033:0x7f5e2e777427
[  +0.024500] Code: Bad RIP value.
[  +0.024675] RSP: 002b:00007ffd39152d78 EFLAGS: 00000246 ORIG_RAX: 
0000000000000010
[  +0.028617] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 
00007f5e2e777427
[  +0.029175] RDX: 0000000000000000 RSI: 00000000400c0930 RDI: 
0000000000000004
[  +0.028890] RBP: 00007ffd39152ef0 R08: 000055641fb6ff40 R09: 
000055641de6ab60
[  +0.027863] R10: 0000000000000001 R11: 0000000000000246 R12: 
0000000000000004
[  +0.028962] R13: 0000000000000000 R14: 00007ffd391536a0 R15: 
0000000000000004

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Assemblin journaled array fails
  2020-05-28 10:18                         ` Michal Soltys
@ 2020-05-28 15:06                           ` Song Liu
  2020-05-28 15:36                             ` Michal Soltys
  2020-05-29 12:09                             ` Michal Soltys
  0 siblings, 2 replies; 46+ messages in thread
From: Song Liu @ 2020-05-28 15:06 UTC (permalink / raw)
  To: Michal Soltys; +Cc: linux-raid

On Thu, May 28, 2020 at 3:18 AM Michal Soltys <msoltyspl@yandex.pl> wrote:
>
> On 5/27/20 3:36 PM, Michal Soltys wrote:
> > On 5/27/20 1:37 AM, Song Liu wrote:
> >>
[...]

> To expand on the above (mdadm was started yesterday):
>
> - it went on for ~19+ hours - like previously
> - the last piece in trace output:
>
> xs22:/home/msl☠ tail -n 4 trace.out ; echo
> 09:08:12 40739   40739   mdadm           r5l_recovery_read_page 1852d8b0
> 09:08:12 40739   40739   mdadm           r5l_recovery_read_page 1852d8b8
> 09:08:12 40739   40739   mdadm           r5l_recovery_read_page 1852d8c0
> 09:08:12 40739   40739   mdadm           r5l_recovery_read_page 1852d8c8
>
> So it went through ~120gb of journal (the journal device is 256gb).

This (processing more than 10GB) should not happen. So there must be
something wrong...

> Non-assembling raid issue aside (assuming it's unrelated) - why does it
> take such enormous amount of time ? That's average of like 1.7mb/s

This is slow because we are processing 4kB at a time. But it is slower than
I would expect. What speeds were measured at boot? Something like the
following in dmesg:

[    3.851064] raid6: avx2x4   gen() 17328 MB/s
[    3.877404] raid6: avx2x4   xor() 11033 MB/s
[    3.903410] raid6: avx2x2   gen() 14710 MB/s
[    3.929410] raid6: avx2x2   xor()  8921 MB/s
[    3.955408] raid6: avx2x1   gen() 12687 MB/s
[    3.981405] raid6: avx2x1   xor()  8414 MB/s
[    4.007409] raid6: sse2x4   gen()  9105 MB/s
[    4.033403] raid6: sse2x4   xor()  5724 MB/s
[    4.059408] raid6: sse2x2   gen()  7570 MB/s
[    4.085410] raid6: sse2x2   xor()  4880 MB/s
[    4.111409] raid6: sse2x1   gen()  6484 MB/s
[    4.137402] raid6: sse2x1   xor()  4412 MB/s
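
As a point of comparison, it might also help to measure how the journal
device itself handles small synchronous reads - just a rough sketch, with
the journal mirror's device node assumed (adjust the path as needed):

   dd if=/dev/md/r1_journal_big of=/dev/null bs=4k count=262144 iflag=direct

That is roughly the same 4kB-at-a-time read pattern, so a large gap between
this number and the ~1.7 MB/s you measured would point at overhead in the
recovery path rather than at the device.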

For the next step, could you please try the following at the beginning of
the --assemble run?

   trace.py -M 10 'r::r5l_recovery_verify_data_checksum()(retval != 0)'
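
If bcc is not handy on that machine, roughly the same probe can be set up
with a raw tracefs kretprobe - an untested sketch (the "r5l_verify" event
name is arbitrary; tracefs may also be mounted at /sys/kernel/tracing):

   cd /sys/kernel/debug/tracing
   # kretprobe on the checksum verify function, capturing its return value
   echo 'r:r5l_verify r5l_recovery_verify_data_checksum ret=$retval' >> kprobe_events
   # only report non-zero (i.e. failing) returns
   echo 'ret != 0' > events/kprobes/r5l_verify/filter
   echo 1 > events/kprobes/r5l_verify/enable
   cat trace_pipe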

Thanks,
Song

PS: Some key functions here are inlined and thus not easily traceable. We will
probably need to patch the kernel for further debugging.

>
> Around 09:08 today, the recovery line popped in the dmesg as previously:
>
> [May27 12:58] md: md124 stopped.
> [  +0.006779] md/raid:md124: device sdf1 operational as raid disk 0
> [  +0.006929] md/raid:md124: device sdj1 operational as raid disk 3
> [  +0.006903] md/raid:md124: device sdi1 operational as raid disk 2
> [  +0.006914] md/raid:md124: device sdh1 operational as raid disk 1
> [  +0.007426] md/raid:md124: raid level 5 active with 4 out of 4
> devices, algorithm 2
> [May28 09:08] md/raid:md124: recovering 24667 data-only stripes and
> 50212 data-parity stripes

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Assemblin journaled array fails
  2020-05-28 15:06                           ` Song Liu
@ 2020-05-28 15:36                             ` Michal Soltys
  2020-05-29 12:09                             ` Michal Soltys
  1 sibling, 0 replies; 46+ messages in thread
From: Michal Soltys @ 2020-05-28 15:36 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-raid

On 5/28/20 5:06 PM, Song Liu wrote:
> On Thu, May 28, 2020 at 3:18 AM Michal Soltys <msoltyspl@yandex.pl> wrote:
> 
> This is slow because we are processing 4kB at a time. But it is slower than
> my expectation. What are the speeds measured at boot? Something like the
> following in dmesg:
> 
> [    3.851064] raid6: avx2x4   gen() 17328 MB/s
> [    3.877404] raid6: avx2x4   xor() 11033 MB/s
> [    3.903410] raid6: avx2x2   gen() 14710 MB/s
> [    3.929410] raid6: avx2x2   xor()  8921 MB/s
> [    3.955408] raid6: avx2x1   gen() 12687 MB/s
> [    3.981405] raid6: avx2x1   xor()  8414 MB/s
> [    4.007409] raid6: sse2x4   gen()  9105 MB/s
> [    4.033403] raid6: sse2x4   xor()  5724 MB/s
> [    4.059408] raid6: sse2x2   gen()  7570 MB/s
> [    4.085410] raid6: sse2x2   xor()  4880 MB/s
> [    4.111409] raid6: sse2x1   gen()  6484 MB/s
> [    4.137402] raid6: sse2x1   xor()  4412 MB/s
> 

These are results for this machine:

[   17.707948] raid6: sse2x4   gen() 12196 MB/s
[   17.871946] raid6: sse2x4   xor()  8169 MB/s
[   18.039947] raid6: sse2x2   gen() 10035 MB/s
[   18.195950] raid6: sse2x2   xor()  6954 MB/s
[   18.343951] raid6: sse2x1   gen()  7983 MB/s
[   18.495948] raid6: sse2x1   xor()  6089 MB/s
[   18.577606] raid6: using algorithm sse2x4 gen() 12196 MB/s
[   18.675268] raid6: .... xor() 8169 MB/s, rmw enabled
[   18.774428] raid6: using ssse3x2 recovery algorithm

> For the next step, could you please try the following the beginning of
> --assemble
> run?
> 
>     trace.py -M 10 'r::r5l_recovery_verify_data_checksum()(retval != 0)'
> 
> Thanks,
> Song

Will do and report.

> 
> PS: Some key functions here are inlined and thus not easily traceable. We will
> probably need to patch the kernel for further debugging.
> 

Won't be a problem.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Assemblin journaled array fails
  2020-05-28 15:06                           ` Song Liu
  2020-05-28 15:36                             ` Michal Soltys
@ 2020-05-29 12:09                             ` Michal Soltys
  2020-05-29 21:03                               ` Song Liu
  1 sibling, 1 reply; 46+ messages in thread
From: Michal Soltys @ 2020-05-29 12:09 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-raid

On 5/28/20 5:06 PM, Song Liu wrote:
> 
> For the next step, could you please try the following the beginning of
> --assemble
> run?
> 
>     trace.py -M 10 'r::r5l_recovery_verify_data_checksum()(retval != 0)'
> 
> Thanks,
> Song
> 

How long should I keep this one running? So far - after ~1 hr - nothing
has been reported.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Assemblin journaled array fails
  2020-05-29 12:09                             ` Michal Soltys
@ 2020-05-29 21:03                               ` Song Liu
  2020-05-29 21:57                                 ` Song Liu
  0 siblings, 1 reply; 46+ messages in thread
From: Song Liu @ 2020-05-29 21:03 UTC (permalink / raw)
  To: Michal Soltys; +Cc: linux-raid

On Fri, May 29, 2020 at 5:09 AM Michal Soltys <msoltyspl@yandex.pl> wrote:
>
> On 5/28/20 5:06 PM, Song Liu wrote:
> >
> > For the next step, could you please try the following the beginning of
> > --assemble
> > run?
> >
> >     trace.py -M 10 'r::r5l_recovery_verify_data_checksum()(retval != 0)'
> >
> > Thanks,
> > Song
> >
>
> How long should I keep this one running ? So far - after ~1hr - nothing
> got reported.

We can stop it. It didn't really hit any data checksum error in the early
phase of the recovery. So it did find a very long journal to recover.

Thanks,
Song

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Assemblin journaled array fails
  2020-05-29 21:03                               ` Song Liu
@ 2020-05-29 21:57                                 ` Song Liu
  2020-06-03 10:12                                   ` Michal Soltys
  0 siblings, 1 reply; 46+ messages in thread
From: Song Liu @ 2020-05-29 21:57 UTC (permalink / raw)
  To: Michal Soltys; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 747 bytes --]

On Fri, May 29, 2020 at 2:03 PM Song Liu <song@kernel.org> wrote:
>
> On Fri, May 29, 2020 at 5:09 AM Michal Soltys <msoltyspl@yandex.pl> wrote:
> >
> > On 5/28/20 5:06 PM, Song Liu wrote:
> > >
> > > For the next step, could you please try the following the beginning of
> > > --assemble
> > > run?
> > >
> > >     trace.py -M 10 'r::r5l_recovery_verify_data_checksum()(retval != 0)'
> > >
> > > Thanks,
> > > Song
> > >
> >
> > How long should I keep this one running ? So far - after ~1hr - nothing
> > got reported.
>
> We can stop it. It didn't really hit any data checksum error in early phase
> of the recovery. So it did found a long long journal to recover.

Could you please try the assemble again with the attached patch?

Thanks,
Song

[-- Attachment #2: 0001-debug-long-running-journal-recovery.patch --]
[-- Type: application/octet-stream, Size: 3266 bytes --]

From 3160568a69f36abadf43be73ecc79d37c717956b Mon Sep 17 00:00:00 2001
From: Song Liu <songliubraving@fb.com>
Date: Fri, 29 May 2020 14:53:28 -0700
Subject: [PATCH] debug long running journal recovery

---
 drivers/md/raid5-cache.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 9b6da759dca2..bf49c4bdc4c8 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -2253,6 +2253,9 @@ static int r5c_recovery_flush_log(struct r5l_log *log,
 
 		ret = r5c_recovery_analyze_meta_block(log, ctx,
 						      &ctx->cached_list);
+		if (ret)
+			pr_info("r5c_recovery_analyze_meta_block() returns %d at seq %llu\n",
+				ret, ctx->seq);
 		/*
 		 * -EAGAIN means mismatch in data block, in this case, we still
 		 * try scan the next metablock
@@ -2260,6 +2263,9 @@ static int r5c_recovery_flush_log(struct r5l_log *log,
 		if (ret && ret != -EAGAIN)
 			break;   /* ret == -EINVAL or -ENOMEM */
 		ctx->seq++;
+
+		if (ctx->seq % 100000 == 0)
+			pr_info("%s processing ctx->seq %llu\n", __func__, ctx->seq);
 		ctx->pos = r5l_ring_add(log, ctx->pos, ctx->meta_total_blocks);
 	}
 
@@ -2358,6 +2364,7 @@ r5c_recovery_rewrite_data_only_stripes(struct r5l_log *log,
 	struct mddev *mddev = log->rdev->mddev;
 	struct page *page;
 	sector_t next_checkpoint = MaxSector;
+	int total_sh_written = 0;
 
 	page = alloc_page(GFP_KERNEL);
 	if (!page) {
@@ -2418,9 +2425,13 @@ r5c_recovery_rewrite_data_only_stripes(struct r5l_log *log,
 		ctx->pos = write_pos;
 		ctx->seq += 1;
 		next_checkpoint = sh->log_start;
+		if (total_sh_written++ % 1000 == 0)
+			pr_info("%s rewritten %d stripes to the journal, current ctx->pos %llu ctx->seq %llu\n",
+				__func__, total_sh_written, ctx->pos, ctx->seq);
 	}
 	log->next_checkpoint = next_checkpoint;
 	__free_page(page);
+	pr_info("%s done\n", __func__);
 	return 0;
 }
 
@@ -2431,6 +2442,7 @@ static void r5c_recovery_flush_data_only_stripes(struct r5l_log *log,
 	struct r5conf *conf = mddev->private;
 	struct stripe_head *sh, *next;
 
+	pr_info("%s enter\n", __func__);
 	if (ctx->data_only_stripes == 0)
 		return;
 
@@ -2443,9 +2455,11 @@ static void r5c_recovery_flush_data_only_stripes(struct r5l_log *log,
 		raid5_release_stripe(sh);
 	}
 
+	pr_info("%s before wait_event\n", __func__);
 	/* reuse conf->wait_for_quiescent in recovery */
 	wait_event(conf->wait_for_quiescent,
 		   atomic_read(&conf->active_stripes) == 0);
+	pr_info("%s after wait_event\n", __func__);
 
 	log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
 }
@@ -2463,6 +2477,8 @@ static int r5l_recovery_log(struct r5l_log *log)
 
 	ctx->pos = log->last_checkpoint;
 	ctx->seq = log->last_cp_seq;
+	pr_info("%s starting from ctx->pos %llu ctx->seq %llu",
+		__func__, ctx->pos, ctx->seq);
 	INIT_LIST_HEAD(&ctx->cached_list);
 	ctx->meta_page = alloc_page(GFP_KERNEL);
 
@@ -2484,6 +2500,9 @@ static int r5l_recovery_log(struct r5l_log *log)
 	pos = ctx->pos;
 	ctx->seq += 10000;
 
+	pr_info("%s finished scanning with ctx->pos %llu ctx->seq %llu",
+		__func__, ctx->pos, ctx->seq);
+
 	if ((ctx->data_only_stripes == 0) && (ctx->data_parity_stripes == 0))
 		pr_info("md/raid:%s: starting from clean shutdown\n",
 			 mdname(mddev));
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: Assemblin journaled array fails
  2020-05-29 21:57                                 ` Song Liu
@ 2020-06-03 10:12                                   ` Michal Soltys
  2020-06-03 22:07                                     ` Song Liu
  0 siblings, 1 reply; 46+ messages in thread
From: Michal Soltys @ 2020-06-03 10:12 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-raid

On 5/29/20 11:57 PM, Song Liu wrote:
>>
>> We can stop it. It didn't really hit any data checksum error in early phase
>> of the recovery. So it did found a long long journal to recover.
> 
> Could you please try the assemble again with the attached patch?
> 
> Thanks,
> Song
> 

[Jun 2 16:08] md: md124 stopped.
[  +0.008974] md/raid:md124: device sdf1 operational as raid disk 0
[  +0.006875] md/raid:md124: device sdj1 operational as raid disk 3
[  +0.006826] md/raid:md124: device sdh1 operational as raid disk 2
[  +0.006814] md/raid:md124: device sdg1 operational as raid disk 1
[  +0.007398] md/raid:md124: raid level 5 active with 4 out of 4 devices, algorithm 2
[  +0.008562] r5l_recovery_log starting from ctx->pos 160392632 ctx->seq 860687132
[ +11.215777] r5c_recovery_flush_log processing ctx->seq 860700000
[Jun 2 16:31] r5c_recovery_flush_log processing ctx->seq 860800000
[Jun 2 16:57] r5c_recovery_flush_log processing ctx->seq 860900000
[Jun 2 17:11] r5c_recovery_flush_log processing ctx->seq 861000000
[Jun 2 17:24] r5c_recovery_flush_log processing ctx->seq 861100000
[Jun 2 17:39] r5c_recovery_flush_log processing ctx->seq 861200000
[Jun 2 17:51] r5c_recovery_flush_log processing ctx->seq 861300000
[Jun 2 18:02] r5c_recovery_flush_log processing ctx->seq 861400000
[Jun 2 18:12] r5c_recovery_flush_log processing ctx->seq 861500000
[Jun 2 18:25] r5c_recovery_flush_log processing ctx->seq 861600000
[Jun 2 18:36] r5c_recovery_flush_log processing ctx->seq 861700000
[Jun 2 18:49] r5c_recovery_flush_log processing ctx->seq 861800000
[Jun 2 19:02] r5c_recovery_flush_log processing ctx->seq 861900000
[Jun 2 19:14] r5c_recovery_flush_log processing ctx->seq 862000000
[Jun 2 19:26] r5c_recovery_flush_log processing ctx->seq 862100000
[Jun 2 19:53] r5c_recovery_flush_log processing ctx->seq 862200000
[Jun 2 20:10] r5c_recovery_flush_log processing ctx->seq 862300000
[Jun 2 20:22] r5c_recovery_flush_log processing ctx->seq 862400000
[Jun 2 20:35] r5c_recovery_flush_log processing ctx->seq 862500000
[Jun 2 20:47] r5c_recovery_flush_log processing ctx->seq 862600000
[Jun 2 21:02] r5c_recovery_flush_log processing ctx->seq 862700000
[Jun 2 21:17] r5c_recovery_flush_log processing ctx->seq 862800000
[Jun 2 21:28] r5c_recovery_flush_log processing ctx->seq 862900000
[Jun 2 21:42] r5c_recovery_flush_log processing ctx->seq 863000000
[Jun 2 21:53] r5c_recovery_flush_log processing ctx->seq 863100000
[Jun 2 22:03] r5c_recovery_flush_log processing ctx->seq 863200000
[Jun 2 22:16] r5c_recovery_flush_log processing ctx->seq 863300000
[Jun 2 22:29] r5c_recovery_flush_log processing ctx->seq 863400000
[Jun 2 22:43] r5c_recovery_flush_log processing ctx->seq 863500000
[Jun 2 22:56] r5c_recovery_flush_log processing ctx->seq 863600000
[Jun 2 23:09] r5c_recovery_flush_log processing ctx->seq 863700000
[Jun 2 23:21] r5c_recovery_flush_log processing ctx->seq 863800000
[Jun 2 23:33] r5c_recovery_flush_log processing ctx->seq 863900000
[Jun 3 00:02] r5c_recovery_flush_log processing ctx->seq 864000000
[Jun 3 00:22] r5c_recovery_flush_log processing ctx->seq 864100000
[Jun 3 00:55] r5c_recovery_flush_log processing ctx->seq 864200000
[Jun 3 01:28] r5c_recovery_flush_log processing ctx->seq 864300000
[Jun 3 02:01] r5c_recovery_flush_log processing ctx->seq 864400000
[Jun 3 02:27] r5c_recovery_flush_log processing ctx->seq 864500000
[Jun 3 02:45] r5c_recovery_flush_log processing ctx->seq 864600000
[Jun 3 02:57] r5c_recovery_flush_log processing ctx->seq 864700000
[Jun 3 03:12] r5c_recovery_flush_log processing ctx->seq 864800000
[Jun 3 03:33] r5c_recovery_flush_log processing ctx->seq 864900000
[Jun 3 03:51] r5c_recovery_flush_log processing ctx->seq 865000000
[Jun 3 04:11] r5c_recovery_flush_log processing ctx->seq 865100000
[Jun 3 04:28] r5c_recovery_flush_log processing ctx->seq 865200000
[Jun 3 04:48] r5c_recovery_flush_log processing ctx->seq 865300000
[Jun 3 05:07] r5c_recovery_flush_log processing ctx->seq 865400000
[Jun 3 05:24] r5c_recovery_flush_log processing ctx->seq 865500000
[Jun 3 05:41] r5c_recovery_flush_log processing ctx->seq 865600000
[Jun 3 06:02] r5c_recovery_flush_log processing ctx->seq 865700000
[Jun 3 06:25] r5c_recovery_flush_log processing ctx->seq 865800000
[Jun 3 06:43] r5c_recovery_flush_log processing ctx->seq 865900000
[Jun 3 07:04] r5c_recovery_flush_log processing ctx->seq 866000000
[Jun 3 07:21] r5c_recovery_flush_log processing ctx->seq 866100000
[Jun 3 07:42] r5c_recovery_flush_log processing ctx->seq 866200000
[Jun 3 08:06] r5c_recovery_flush_log processing ctx->seq 866300000
[Jun 3 08:24] r5c_recovery_flush_log processing ctx->seq 866400000
[Jun 3 08:44] r5c_recovery_flush_log processing ctx->seq 866500000
[Jun 3 08:58] r5l_recovery_log finished scanning with ctx->pos 408082848 ctx->seq 866583360
[  +0.000003] md/raid:md124: recovering 24667 data-only stripes and 50212 data-parity stripes
[  +0.037320] r5c_recovery_rewrite_data_only_stripes rewritten 1 stripes to the journal, current ctx->pos 408082864 ctx->seq 866583361
[  +0.182558] r5c_recovery_rewrite_data_only_stripes rewritten 1001 stripes to the journal, current ctx->pos 408099592 ctx->seq 866584361
[  +0.180980] r5c_recovery_rewrite_data_only_stripes rewritten 2001 stripes to the journal, current ctx->pos 408115904 ctx->seq 866585361
[  +0.196148] r5c_recovery_rewrite_data_only_stripes rewritten 3001 stripes to the journal, current ctx->pos 408133600 ctx->seq 866586361
[  +0.221433] r5c_recovery_rewrite_data_only_stripes rewritten 4001 stripes to the journal, current ctx->pos 408153680 ctx->seq 866587361
[  +0.223732] r5c_recovery_rewrite_data_only_stripes rewritten 5001 stripes to the journal, current ctx->pos 408173152 ctx->seq 866588361
[  +0.228663] r5c_recovery_rewrite_data_only_stripes rewritten 6001 stripes to the journal, current ctx->pos 408192808 ctx->seq 866589361
[  +0.234246] r5c_recovery_rewrite_data_only_stripes rewritten 7001 stripes to the journal, current ctx->pos 408212760 ctx->seq 866590361
[  +0.242665] r5c_recovery_rewrite_data_only_stripes rewritten 8001 stripes to the journal, current ctx->pos 408233176 ctx->seq 866591361
[  +0.231829] r5c_recovery_rewrite_data_only_stripes rewritten 9001 stripes to the journal, current ctx->pos 408251696 ctx->seq 866592361
[  +0.250124] r5c_recovery_rewrite_data_only_stripes rewritten 10001 stripes to the journal, current ctx->pos 408270968 ctx->seq 866593361
[  +0.240402] r5c_recovery_rewrite_data_only_stripes rewritten 11001 stripes to the journal, current ctx->pos 408289976 ctx->seq 866594361
[  +0.250681] r5c_recovery_rewrite_data_only_stripes rewritten 12001 stripes to the journal, current ctx->pos 408309784 ctx->seq 866595361
[  +0.258173] r5c_recovery_rewrite_data_only_stripes rewritten 13001 stripes to the journal, current ctx->pos 408329888 ctx->seq 866596361
[  +0.235759] r5c_recovery_rewrite_data_only_stripes rewritten 14001 stripes to the journal, current ctx->pos 408349112 ctx->seq 866597361
[  +0.255747] r5c_recovery_rewrite_data_only_stripes rewritten 15001 stripes to the journal, current ctx->pos 408368984 ctx->seq 866598361
[  +0.255252] r5c_recovery_rewrite_data_only_stripes rewritten 16001 stripes to the journal, current ctx->pos 408386768 ctx->seq 866599361
[  +0.264333] r5c_recovery_rewrite_data_only_stripes rewritten 17001 stripes to the journal, current ctx->pos 408407072 ctx->seq 866600361
[  +0.261822] r5c_recovery_rewrite_data_only_stripes rewritten 18001 stripes to the journal, current ctx->pos 408425840 ctx->seq 866601361
[  +0.270390] r5c_recovery_rewrite_data_only_stripes rewritten 19001 stripes to the journal, current ctx->pos 408443976 ctx->seq 866602361
[  +0.266591] r5c_recovery_rewrite_data_only_stripes rewritten 20001 stripes to the journal, current ctx->pos 408461384 ctx->seq 866603361
[  +0.276480] r5c_recovery_rewrite_data_only_stripes rewritten 21001 stripes to the journal, current ctx->pos 408479568 ctx->seq 866604361
[  +0.272757] r5c_recovery_rewrite_data_only_stripes rewritten 22001 stripes to the journal, current ctx->pos 408496600 ctx->seq 866605361
[  +0.290148] r5c_recovery_rewrite_data_only_stripes rewritten 23001 stripes to the journal, current ctx->pos 408515472 ctx->seq 866606361
[  +0.274369] r5c_recovery_rewrite_data_only_stripes rewritten 24001 stripes to the journal, current ctx->pos 408532112 ctx->seq 866607361
[  +0.237468] r5c_recovery_rewrite_data_only_stripes done
[  +0.062124] r5c_recovery_flush_data_only_stripes enter
[  +0.063396] r5c_recovery_flush_data_only_stripes before wait_event
[Jun 3 09:02] INFO: task mdadm:2858 blocked for more than 120 seconds.
[  +0.060545]       Tainted: G            E     5.4.19-msl-00001-gbf39596faf12 #2
[  +0.062932] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0.068255] mdadm           D    0  2858   2751 0x00004002
[  +0.065247] Call Trace:
[  +0.056200]  ? __schedule+0x2e6/0x6f0
[  +0.054586]  schedule+0x2f/0xa0
[  +0.052803]  r5l_start.cold.43+0xb10/0xfca [raid456]
[  +0.053892]  ? finish_wait+0x80/0x80
[  +0.051771]  md_start.part.48+0x2e/0x50 [md_mod]
[  +0.051705]  do_md_run+0x64/0x100 [md_mod]
[  +0.051111]  md_ioctl+0xe7d/0x17d0 [md_mod]
[  +0.049628]  blkdev_ioctl+0x4d0/0xa00
[  +0.048213]  block_ioctl+0x39/0x40
[  +0.047055]  do_vfs_ioctl+0xa4/0x630
[  +0.046446]  ksys_ioctl+0x60/0x90
[  +0.046983]  __x64_sys_ioctl+0x16/0x20
[  +0.045661]  do_syscall_64+0x52/0x160
[  +0.043602]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.045892] RIP: 0033:0x7f72920c3427
[  +0.043231] Code: Bad RIP value.
[  +0.040553] RSP: 002b:00007ffc2d2045f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  +0.046595] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f72920c3427
[  +0.044504] RDX: 0000000000000000 RSI: 00000000400c0930 RDI: 0000000000000004
[  +0.044056] RBP: 00007ffc2d204770 R08: 000055f715c6ef40 R09: 000055f715a7ab60
[  +0.043743] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000002
[  +0.042295] R13: 0000000000000000 R14: 00007ffc2d204f20 R15: 0000000000000004
[Jun 3 09:04] INFO: task mdadm:2858 blocked for more than 243 seconds.
[  +0.040957]       Tainted: G            E     5.4.19-msl-00001-gbf39596faf12 #2
[  +0.040821] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0.041663] mdadm           D    0  2858   2751 0x00004002
[  +0.038191] Call Trace:
[  +0.033593]  ? __schedule+0x2e6/0x6f0
[  +0.034067]  schedule+0x2f/0xa0
[  +0.032670]  r5l_start.cold.43+0xb10/0xfca [raid456]
[  +0.033911]  ? finish_wait+0x80/0x80
[  +0.031274]  md_start.part.48+0x2e/0x50 [md_mod]
[  +0.032020]  do_md_run+0x64/0x100 [md_mod]
[  +0.031191]  md_ioctl+0xe7d/0x17d0 [md_mod]
[  +0.030522]  blkdev_ioctl+0x4d0/0xa00
[  +0.028305]  block_ioctl+0x39/0x40
[  +0.026702]  do_vfs_ioctl+0xa4/0x630
[  +0.027148]  ksys_ioctl+0x60/0x90
[  +0.024758]  __x64_sys_ioctl+0x16/0x20
[  +0.025464]  do_syscall_64+0x52/0x160
[  +0.023894]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.026684] RIP: 0033:0x7f72920c3427
[  +0.024336] Code: Bad RIP value.
[  +0.024689] RSP: 002b:00007ffc2d2045f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  +0.028442] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f72920c3427
[  +0.028941] RDX: 0000000000000000 RSI: 00000000400c0930 RDI: 0000000000000004
[  +0.028573] RBP: 00007ffc2d204770 R08: 000055f715c6ef40 R09: 000055f715a7ab60
[  +0.028022] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000002
[  +0.028933] R13: 0000000000000000 R14: 00007ffc2d204f20 R15: 0000000000000004
[Jun 3 09:06] INFO: task mdadm:2858 blocked for more than 364 seconds.
[  +0.027274]       Tainted: G            E     5.4.19-msl-00001-gbf39596faf12 #2
[  +0.029471] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0.030418] mdadm           D    0  2858   2751 0x00004002
[  +0.028200] Call Trace:
[  +0.024432]  ? __schedule+0x2e6/0x6f0
[  +0.026054]  schedule+0x2f/0xa0
[  +0.024534]  r5l_start.cold.43+0xb10/0xfca [raid456]
[  +0.029208]  ? finish_wait+0x80/0x80
[  +0.025526]  md_start.part.48+0x2e/0x50 [md_mod]
[  +0.028478]  do_md_run+0x64/0x100 [md_mod]
[  +0.027473]  md_ioctl+0xe7d/0x17d0 [md_mod]
[  +0.026305]  blkdev_ioctl+0x4d0/0xa00
[  +0.026246]  block_ioctl+0x39/0x40
[  +0.025604]  do_vfs_ioctl+0xa4/0x630
[  +0.025801]  ksys_ioctl+0x60/0x90
[  +0.025177]  __x64_sys_ioctl+0x16/0x20
[  +0.025306]  do_syscall_64+0x52/0x160
[  +0.025793]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.026747] RIP: 0033:0x7f72920c3427
[  +0.025422] Code: Bad RIP value.
[  +0.023953] RSP: 002b:00007ffc2d2045f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  +0.029392] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f72920c3427
[  +0.027915] RDX: 0000000000000000 RSI: 00000000400c0930 RDI: 0000000000000004
[  +0.028305] RBP: 00007ffc2d204770 R08: 000055f715c6ef40 R09: 000055f715a7ab60
[  +0.028820] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000002
[  +0.029250] R13: 0000000000000000 R14: 00007ffc2d204f20 R15: 0000000000000004
[Jun 3 09:08] INFO: task mdadm:2858 blocked for more than 485 seconds.
[  +0.026953]       Tainted: G            E     5.4.19-msl-00001-gbf39596faf12 #2
[  +0.030072] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0.030385] mdadm           D    0  2858   2751 0x00004002
[  +0.027879] Call Trace:
[  +0.023949]  ? __schedule+0x2e6/0x6f0
[  +0.025820]  schedule+0x2f/0xa0
[  +0.024656]  r5l_start.cold.43+0xb10/0xfca [raid456]
[  +0.027525]  ? finish_wait+0x80/0x80
[  +0.025534]  md_start.part.48+0x2e/0x50 [md_mod]
[  +0.027898]  do_md_run+0x64/0x100 [md_mod]
[  +0.027299]  md_ioctl+0xe7d/0x17d0 [md_mod]
[  +0.025926]  blkdev_ioctl+0x4d0/0xa00
[  +0.026269]  block_ioctl+0x39/0x40
[  +0.025262]  do_vfs_ioctl+0xa4/0x630
[  +0.025799]  ksys_ioctl+0x60/0x90
[  +0.024000]  __x64_sys_ioctl+0x16/0x20
[  +0.025950]  do_syscall_64+0x52/0x160
[  +0.024311]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.026696] RIP: 0033:0x7f72920c3427
[  +0.024260] Code: Bad RIP value.
[  +0.024967] RSP: 002b:00007ffc2d2045f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  +0.028541] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f72920c3427
[  +0.028687] RDX: 0000000000000000 RSI: 00000000400c0930 RDI: 0000000000000004
[  +0.028792] RBP: 00007ffc2d204770 R08: 000055f715c6ef40 R09: 000055f715a7ab60
[  +0.028316] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000002
[  +0.028385] R13: 0000000000000000 R14: 00007ffc2d204f20 R15: 0000000000000004
[Jun 3 09:10] INFO: task mdadm:2858 blocked for more than 606 seconds.
[  +0.026999]       Tainted: G            E     5.4.19-msl-00001-gbf39596faf12 #2
[  +0.029364] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0.030553] mdadm           D    0  2858   2751 0x00004002
[  +0.030993] Call Trace:
[  +0.024346]  ? __schedule+0x2e6/0x6f0
[  +0.025899]  schedule+0x2f/0xa0
[  +0.024906]  r5l_start.cold.43+0xb10/0xfca [raid456]
[  +0.027654]  ? finish_wait+0x80/0x80
[  +0.026325]  md_start.part.48+0x2e/0x50 [md_mod]
[  +0.026537]  do_md_run+0x64/0x100 [md_mod]
[  +0.027048]  md_ioctl+0xe7d/0x17d0 [md_mod]
[  +0.027328]  blkdev_ioctl+0x4d0/0xa00
[  +0.026367]  block_ioctl+0x39/0x40
[  +0.025746]  do_vfs_ioctl+0xa4/0x630
[  +0.025654]  ksys_ioctl+0x60/0x90
[  +0.024867]  __x64_sys_ioctl+0x16/0x20
[  +0.025322]  do_syscall_64+0x52/0x160
[  +0.024934]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.026441] RIP: 0033:0x7f72920c3427
[  +0.025195] Code: Bad RIP value.
[  +0.023641] RSP: 002b:00007ffc2d2045f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  +0.029224] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f72920c3427
[  +0.028522] RDX: 0000000000000000 RSI: 00000000400c0930 RDI: 0000000000000004
[  +0.028499] RBP: 00007ffc2d204770 R08: 000055f715c6ef40 R09: 000055f715a7ab60
[  +0.028515] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000002
[  +0.028559] R13: 0000000000000000 R14: 00007ffc2d204f20 R15: 0000000000000004
[Jun 3 09:12] INFO: task mdadm:2858 blocked for more than 727 seconds.
[  +0.027689]       Tainted: G            E     5.4.19-msl-00001-gbf39596faf12 #2
[  +0.029425] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0.030441] mdadm           D    0  2858   2751 0x00004002
[  +0.028318] Call Trace:
[  +0.024240]  ? __schedule+0x2e6/0x6f0
[  +0.026113]  schedule+0x2f/0xa0
[  +0.024594]  r5l_start.cold.43+0xb10/0xfca [raid456]
[  +0.027742]  ? finish_wait+0x80/0x80
[  +0.026482]  md_start.part.48+0x2e/0x50 [md_mod]
[  +0.027073]  do_md_run+0x64/0x100 [md_mod]
[  +0.026928]  md_ioctl+0xe7d/0x17d0 [md_mod]
[  +0.026385]  blkdev_ioctl+0x4d0/0xa00
[  +0.026621]  block_ioctl+0x39/0x40
[  +0.026416]  do_vfs_ioctl+0xa4/0x630
[  +0.026045]  ksys_ioctl+0x60/0x90
[  +0.025177]  __x64_sys_ioctl+0x16/0x20
[  +0.025337]  do_syscall_64+0x52/0x160
[  +0.025306]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.026347] RIP: 0033:0x7f72920c3427
[  +0.025397] Code: Bad RIP value.
[  +0.023684] RSP: 002b:00007ffc2d2045f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  +0.029548] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f72920c3427
[  +0.028548] RDX: 0000000000000000 RSI: 00000000400c0930 RDI: 0000000000000004
[  +0.028609] RBP: 00007ffc2d204770 R08: 000055f715c6ef40 R09: 000055f715a7ab60
[  +0.028464] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000002
[  +0.028697] R13: 0000000000000000 R14: 00007ffc2d204f20 R15: 0000000000000004
[Jun 3 09:14] INFO: task mdadm:2858 blocked for more than 847 seconds.
[  +0.027980]       Tainted: G            E     5.4.19-msl-00001-gbf39596faf12 #2
[  +0.028799] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0.030554] mdadm           D    0  2858   2751 0x00004002
[  +0.028257] Call Trace:
[  +0.023759]  ? __schedule+0x2e6/0x6f0
[  +0.025938]  schedule+0x2f/0xa0
[  +0.025325]  r5l_start.cold.43+0xb10/0xfca [raid456]
[  +0.030832]  ? finish_wait+0x80/0x80
[  +0.026344]  md_start.part.48+0x2e/0x50 [md_mod]
[  +0.026746]  do_md_run+0x64/0x100 [md_mod]
[  +0.027229]  md_ioctl+0xe7d/0x17d0 [md_mod]
[  +0.027051]  blkdev_ioctl+0x4d0/0xa00
[  +0.025505]  block_ioctl+0x39/0x40
[  +0.026085]  do_vfs_ioctl+0xa4/0x630
[  +0.025163]  ksys_ioctl+0x60/0x90
[  +0.029285]  __x64_sys_ioctl+0x16/0x20
[  +0.029185]  do_syscall_64+0x52/0x160
[  +0.027525]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.030066] RIP: 0033:0x7f72920c3427
[  +0.028916] Code: Bad RIP value.
[  +0.026674] RSP: 002b:00007ffc2d2045f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  +0.029693] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f72920c3427
[  +0.028949] RDX: 0000000000000000 RSI: 00000000400c0930 RDI: 0000000000000004
[  +0.028409] RBP: 00007ffc2d204770 R08: 000055f715c6ef40 R09: 000055f715a7ab60
[  +0.028117] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000002
[  +0.028542] R13: 0000000000000000 R14: 00007ffc2d204f20 R15: 0000000000000004
[Jun 3 09:16] INFO: task mdadm:2858 blocked for more than 968 seconds.
[  +0.028070]       Tainted: G            E     5.4.19-msl-00001-gbf39596faf12 #2
[  +0.028488] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0.030356] mdadm           D    0  2858   2751 0x00004002
[  +0.028778] Call Trace:
[  +0.025616]  ? __schedule+0x2e6/0x6f0
[  +0.028651]  schedule+0x2f/0xa0
[  +0.028842]  r5l_start.cold.43+0xb10/0xfca [raid456]
[  +0.029961]  ? finish_wait+0x80/0x80
[  +0.028681]  md_start.part.48+0x2e/0x50 [md_mod]
[  +0.030107]  do_md_run+0x64/0x100 [md_mod]
[  +0.030568]  md_ioctl+0xe7d/0x17d0 [md_mod]
[  +0.027043]  blkdev_ioctl+0x4d0/0xa00
[  +0.026720]  block_ioctl+0x39/0x40
[  +0.025781]  do_vfs_ioctl+0xa4/0x630
[  +0.024701]  ksys_ioctl+0x60/0x90
[  +0.025065]  __x64_sys_ioctl+0x16/0x20
[  +0.024444]  do_syscall_64+0x52/0x160
[  +0.025220]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.025739] RIP: 0033:0x7f72920c3427
[  +0.025300] Code: Bad RIP value.
[  +0.023582] RSP: 002b:00007ffc2d2045f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  +0.029201] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f72920c3427
[  +0.028748] RDX: 0000000000000000 RSI: 00000000400c0930 RDI: 0000000000000004
[  +0.027968] RBP: 00007ffc2d204770 R08: 000055f715c6ef40 R09: 000055f715a7ab60
[  +0.028481] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000002
[  +0.028766] R13: 0000000000000000 R14: 00007ffc2d204f20 R15: 0000000000000004
[Jun 3 09:18] INFO: task mdadm:2858 blocked for more than 1089 seconds.
[  +0.028054]       Tainted: G            E     5.4.19-msl-00001-gbf39596faf12 #2
[  +0.028400] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0.030904] mdadm           D    0  2858   2751 0x00004002
[  +0.028077] Call Trace:
[  +0.024791]  ? __schedule+0x2e6/0x6f0
[  +0.025875]  schedule+0x2f/0xa0
[  +0.025618]  r5l_start.cold.43+0xb10/0xfca [raid456]
[  +0.027710]  ? finish_wait+0x80/0x80
[  +0.026291]  md_start.part.48+0x2e/0x50 [md_mod]
[  +0.026487]  do_md_run+0x64/0x100 [md_mod]
[  +0.028776]  md_ioctl+0xe7d/0x17d0 [md_mod]
[  +0.027002]  blkdev_ioctl+0x4d0/0xa00
[  +0.025169]  block_ioctl+0x39/0x40
[  +0.025991]  do_vfs_ioctl+0xa4/0x630
[  +0.024424]  ksys_ioctl+0x60/0x90
[  +0.025340]  __x64_sys_ioctl+0x16/0x20
[  +0.024386]  do_syscall_64+0x52/0x160
[  +0.025086]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.025950] RIP: 0033:0x7f72920c3427
[  +0.025552] Code: Bad RIP value.
[  +0.023459] RSP: 002b:00007ffc2d2045f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  +0.030089] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f72920c3427
[  +0.028770] RDX: 0000000000000000 RSI: 00000000400c0930 RDI: 0000000000000004
[  +0.027951] RBP: 00007ffc2d204770 R08: 000055f715c6ef40 R09: 000055f715a7ab60
[  +0.028993] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000002
[  +0.028520] R13: 0000000000000000 R14: 00007ffc2d204f20 R15: 0000000000000004
[Jun 3 09:20] INFO: task mdadm:2858 blocked for more than 1210 seconds.
[  +0.028171]       Tainted: G            E     5.4.19-msl-00001-gbf39596faf12 #2
[  +0.028430] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  +0.030316] mdadm           D    0  2858   2751 0x00004002
[  +0.028004] Call Trace:
[  +0.024696]  ? __schedule+0x2e6/0x6f0
[  +0.025941]  schedule+0x2f/0xa0
[  +0.025382]  r5l_start.cold.43+0xb10/0xfca [raid456]
[  +0.027699]  ? finish_wait+0x80/0x80
[  +0.027329]  md_start.part.48+0x2e/0x50 [md_mod]
[  +0.026295]  do_md_run+0x64/0x100 [md_mod]
[  +0.027274]  md_ioctl+0xe7d/0x17d0 [md_mod]
[  +0.026747]  blkdev_ioctl+0x4d0/0xa00
[  +0.025499]  block_ioctl+0x39/0x40
[  +0.025981]  do_vfs_ioctl+0xa4/0x630
[  +0.025238]  ksys_ioctl+0x60/0x90
[  +0.024917]  __x64_sys_ioctl+0x16/0x20
[  +0.025268]  do_syscall_64+0x52/0x160
[  +0.025288]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.026166] RIP: 0033:0x7f72920c3427
[  +0.027251] Code: Bad RIP value.
[  +0.027211] RSP: 002b:00007ffc2d2045f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  +0.031611] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f72920c3427
[  +0.029797] RDX: 0000000000000000 RSI: 00000000400c0930 RDI: 0000000000000004
[  +0.028235] RBP: 00007ffc2d204770 R08: 000055f715c6ef40 R09: 000055f715a7ab60
[  +0.027762] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000002
[  +0.028697] R13: 0000000000000000 R14: 00007ffc2d204f20 R15: 0000000000000004

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Assemblin journaled array fails
  2020-06-03 10:12                                   ` Michal Soltys
@ 2020-06-03 22:07                                     ` Song Liu
  2020-06-04 13:58                                       ` Michal Soltys
  2020-06-05 12:26                                       ` Michal Soltys
  0 siblings, 2 replies; 46+ messages in thread
From: Song Liu @ 2020-06-03 22:07 UTC (permalink / raw)
  To: Michal Soltys; +Cc: linux-raid

Thanks for the trace.

On Wed, Jun 3, 2020 at 3:12 AM Michal Soltys <msoltyspl@yandex.pl> wrote:
>
> On 5/29/20 11:57 PM, Song Liu wrote:
> >>
> >> We can stop it. It didn't really hit any data checksum error in early phase
> >> of the recovery. So it did found a long long journal to recover.
> >
> > Could you please try the assemble again with the attached patch?
> >
> > Thanks,
> > Song
> >

[...]

> [Jun 3 07:42] r5c_recovery_flush_log processing ctx->seq 866200000
> [Jun 3 08:06] r5c_recovery_flush_log processing ctx->seq 866300000
> [Jun 3 08:24] r5c_recovery_flush_log processing ctx->seq 866400000
> [Jun 3 08:44] r5c_recovery_flush_log processing ctx->seq 866500000
> [Jun 3 08:58] r5l_recovery_log finished scanning with ctx->pos 408082848 ctx->seq 866583360
> [  +0.000003] md/raid:md124: recovering 24667 data-only stripes and 50212 data-parity stripes

We are recovering many stripes. I guess this is because we somehow missed a
metadata update before the crash.

> [  +0.037320] r5c_recovery_rewrite_data_only_stripes rewritten 1 stripes to the journal, current ctx->pos 408082864 ctx->seq 866583361
> [  +0.182558] r5c_recovery_rewrite_data_only_stripes rewritten 1001 stripes to the journal, current ctx->pos 408099592 ctx->seq 866584361
> [  +0.180980] r5c_recovery_rewrite_data_only_stripes rewritten 2001 stripes to the journal, current ctx->pos 408115904 ctx->seq 866585361
> [  +0.196148] r5c_recovery_rewrite_data_only_stripes rewritten 3001 stripes to the journal, current ctx->pos 408133600 ctx->seq 866586361
> [  +0.221433] r5c_recovery_rewrite_data_only_stripes rewritten 4001 stripes to the journal, current ctx->pos 408153680 ctx->seq 866587361
> [  +0.223732] r5c_recovery_rewrite_data_only_stripes rewritten 5001 stripes to the journal, current ctx->pos 408173152 ctx->seq 866588361
> [  +0.228663] r5c_recovery_rewrite_data_only_stripes rewritten 6001 stripes to the journal, current ctx->pos 408192808 ctx->seq 866589361
> [  +0.234246] r5c_recovery_rewrite_data_only_stripes rewritten 7001 stripes to the journal, current ctx->pos 408212760 ctx->seq 866590361
> [  +0.242665] r5c_recovery_rewrite_data_only_stripes rewritten 8001 stripes to the journal, current ctx->pos 408233176 ctx->seq 866591361
> [  +0.231829] r5c_recovery_rewrite_data_only_stripes rewritten 9001 stripes to the journal, current ctx->pos 408251696 ctx->seq 866592361
> [  +0.250124] r5c_recovery_rewrite_data_only_stripes rewritten 10001 stripes to the journal, current ctx->pos 408270968 ctx->seq 866593361
> [  +0.240402] r5c_recovery_rewrite_data_only_stripes rewritten 11001 stripes to the journal, current ctx->pos 408289976 ctx->seq 866594361
> [  +0.250681] r5c_recovery_rewrite_data_only_stripes rewritten 12001 stripes to the journal, current ctx->pos 408309784 ctx->seq 866595361
> [  +0.258173] r5c_recovery_rewrite_data_only_stripes rewritten 13001 stripes to the journal, current ctx->pos 408329888 ctx->seq 866596361
> [  +0.235759] r5c_recovery_rewrite_data_only_stripes rewritten 14001 stripes to the journal, current ctx->pos 408349112 ctx->seq 866597361
> [  +0.255747] r5c_recovery_rewrite_data_only_stripes rewritten 15001 stripes to the journal, current ctx->pos 408368984 ctx->seq 866598361
> [  +0.255252] r5c_recovery_rewrite_data_only_stripes rewritten 16001 stripes to the journal, current ctx->pos 408386768 ctx->seq 866599361
> [  +0.264333] r5c_recovery_rewrite_data_only_stripes rewritten 17001 stripes to the journal, current ctx->pos 408407072 ctx->seq 866600361
> [  +0.261822] r5c_recovery_rewrite_data_only_stripes rewritten 18001 stripes to the journal, current ctx->pos 408425840 ctx->seq 866601361
> [  +0.270390] r5c_recovery_rewrite_data_only_stripes rewritten 19001 stripes to the journal, current ctx->pos 408443976 ctx->seq 866602361
> [  +0.266591] r5c_recovery_rewrite_data_only_stripes rewritten 20001 stripes to the journal, current ctx->pos 408461384 ctx->seq 866603361
> [  +0.276480] r5c_recovery_rewrite_data_only_stripes rewritten 21001 stripes to the journal, current ctx->pos 408479568 ctx->seq 866604361
> [  +0.272757] r5c_recovery_rewrite_data_only_stripes rewritten 22001 stripes to the journal, current ctx->pos 408496600 ctx->seq 866605361
> [  +0.290148] r5c_recovery_rewrite_data_only_stripes rewritten 23001 stripes to the journal, current ctx->pos 408515472 ctx->seq 866606361
> [  +0.274369] r5c_recovery_rewrite_data_only_stripes rewritten 24001 stripes to the journal, current ctx->pos 408532112 ctx->seq 866607361
> [  +0.237468] r5c_recovery_rewrite_data_only_stripes done
> [  +0.062124] r5c_recovery_flush_data_only_stripes enter
> [  +0.063396] r5c_recovery_flush_data_only_stripes before wait_event

The hang happens at the expected place.

> [Jun 3 09:02] INFO: task mdadm:2858 blocked for more than 120 seconds.
> [  +0.060545]       Tainted: G            E     5.4.19-msl-00001-gbf39596faf12 #2
> [  +0.062932] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

Could you please try disabling the timeout message with

echo 0 > /proc/sys/kernel/hung_task_timeout_secs

And during this wait (after the message
"r5c_recovery_flush_data_only_stripes before wait_event"),
check whether the raid disks (not the journal disk) are taking IOs
(using tools like iostat).
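
Something along these lines should do it (just a sketch - substitute
whatever names the component drives have on that boot):

iostat -dx sdf sdg sdh sdi sdj 10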

Thanks,
Song

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Assemblin journaled array fails
  2020-06-03 22:07                                     ` Song Liu
@ 2020-06-04 13:58                                       ` Michal Soltys
  2020-06-05 12:26                                       ` Michal Soltys
  1 sibling, 0 replies; 46+ messages in thread
From: Michal Soltys @ 2020-06-04 13:58 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-raid

On 6/4/20 12:07 AM, Song Liu wrote:
> 
> The hang happens at expected place.
> 
>> [Jun 3 09:02] INFO: task mdadm:2858 blocked for more than 120 seconds.
>> [  +0.060545]       Tainted: G            E     5.4.19-msl-00001-gbf39596faf12 #2
>> [  +0.062932] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> 
> Could you please try disable the timeout message with
> 
> echo 0 > /proc/sys/kernel/hung_task_timeout_secs
> 
> And during this wait (after message
> "r5c_recovery_flush_data_only_stripes before wait_event"),
> checks whether the raid disks (not the journal disk) are taking IOs
> (using tools like iostat).
> 

Will report tomorrow (the machine was restarted, so I have to wait 19+ hours
again until the r5c_recovery_flush_log processing completes its part of the
job).

The non-assembling raid issue aside - any idea why it is so inhumanly slow?
It's not really of much use in a production scenario in this state.

Below are every-10-seconds stats from the journal device, taken after the
assembly of the main raid started.

Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
md125             3.00      3072.00         0.00      30720          0
md125             2.80      2867.20         0.00      28672          0
md125             2.10      2150.40         0.00      21504          0
md125             1.90      1945.60         0.00      19456          0
md125             2.00      1920.40         0.00      19204          0
md125             1.30      1331.20         0.00      13312          0
md125             1.50      1536.00         0.00      15360          0

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Assemblin journaled array fails
  2020-06-03 22:07                                     ` Song Liu
  2020-06-04 13:58                                       ` Michal Soltys
@ 2020-06-05 12:26                                       ` Michal Soltys
  2020-06-09  9:36                                         ` Michal Soltys
  1 sibling, 1 reply; 46+ messages in thread
From: Michal Soltys @ 2020-06-05 12:26 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-raid

On 6/4/20 12:07 AM, Song Liu wrote:
> 
> The hang happens at expected place.
> 
>> [Jun 3 09:02] INFO: task mdadm:2858 blocked for more than 120 seconds.
>> [  +0.060545]       Tainted: G            E     5.4.19-msl-00001-gbf39596faf12 #2
>> [  +0.062932] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> 
> Could you please try disable the timeout message with
> 
> echo 0 > /proc/sys/kernel/hung_task_timeout_secs
> 
> And during this wait (after message
> "r5c_recovery_flush_data_only_stripes before wait_event"),
> checks whether the raid disks (not the journal disk) are taking IOs
> (using tools like iostat).
> 

No activity on component drives.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Assemblin journaled array fails
  2020-06-05 12:26                                       ` Michal Soltys
@ 2020-06-09  9:36                                         ` Michal Soltys
  2020-06-09 18:36                                           ` Song Liu
  0 siblings, 1 reply; 46+ messages in thread
From: Michal Soltys @ 2020-06-09  9:36 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-raid

On 6/5/20 2:26 PM, Michal Soltys wrote:
> On 6/4/20 12:07 AM, Song Liu wrote:
>>
>> The hang happens at expected place.
>>
>>> [Jun 3 09:02] INFO: task mdadm:2858 blocked for more than 120 seconds.
>>> [  +0.060545]       Tainted: G            E     
>>> 5.4.19-msl-00001-gbf39596faf12 #2
>>> [  +0.062932] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
>>> disables this message.
>>
>> Could you please try disable the timeout message with
>>
>> echo 0 > /proc/sys/kernel/hung_task_timeout_secs
>>
>> And during this wait (after message
>> "r5c_recovery_flush_data_only_stripes before wait_event"),
>> checks whether the raid disks (not the journal disk) are taking IOs
>> (using tools like iostat).
>>
> 
> No activity on component drives.

To expand on that - while there is no I/O activity whatsoever on the component drives (or on the journal), the CPU is of course still fully loaded (5 days so far):

UID        PID  PPID  C    SZ   RSS PSR STIME TTY          TIME CMD
root      8129  6755 15   740  1904  10 Jun04 pts/2    17:42:34 mdadm -A /dev/md/r5_big /dev/md/r1_journal_big /dev/sdj1 /dev/sdi1 /dev/sdg1 /dev/sdh1
root      8147     2 84     0     0  30 Jun04 ?        4-02:09:47 [md124_raid5]
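
If a snapshot of where that thread is spinning would be useful, I can grab
something like the following (just a sketch, PID as in the ps output above):

cat /proc/8147/stack

or "echo l > /proc/sysrq-trigger" to dump per-CPU backtraces into dmesg.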

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Assemblin journaled array fails
  2020-06-09  9:36                                         ` Michal Soltys
@ 2020-06-09 18:36                                           ` Song Liu
  2020-06-09 21:51                                             ` Michal Soltys
  0 siblings, 1 reply; 46+ messages in thread
From: Song Liu @ 2020-06-09 18:36 UTC (permalink / raw)
  To: Michal Soltys; +Cc: linux-raid

On Tue, Jun 9, 2020 at 2:36 AM Michal Soltys <msoltyspl@yandex.pl> wrote:
>
> On 6/5/20 2:26 PM, Michal Soltys wrote:
> > On 6/4/20 12:07 AM, Song Liu wrote:
> >>
> >> The hang happens at expected place.
> >>
> >>> [Jun 3 09:02] INFO: task mdadm:2858 blocked for more than 120 seconds.
> >>> [  +0.060545]       Tainted: G            E
> >>> 5.4.19-msl-00001-gbf39596faf12 #2
> >>> [  +0.062932] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> >>> disables this message.
> >>
> >> Could you please try disable the timeout message with
> >>
> >> echo 0 > /proc/sys/kernel/hung_task_timeout_secs
> >>
> >> And during this wait (after message
> >> "r5c_recovery_flush_data_only_stripes before wait_event"),
> >> checks whether the raid disks (not the journal disk) are taking IOs
> >> (using tools like iostat).
> >>
> >
> > No activity on component drives.
>
> To expand on that - while there is no i/o activity whatsoever at the component drives (as well as journal), the cpu is of course still fully loaded (5 days so far):
>
> UID        PID  PPID  C    SZ   RSS PSR STIME TTY          TIME CMD
> root      8129  6755 15   740  1904  10 Jun04 pts/2    17:42:34 mdadm -A /dev/md/r5_big /dev/md/r1_journal_big /dev/sdj1 /dev/sdi1 /dev/sdg1 /dev/sdh1
> root      8147     2 84     0     0  30 Jun04 ?        4-02:09:47 [md124_raid5]

I guess the md thread is stuck on some stripe. Does the kernel have
CONFIG_DYNAMIC_DEBUG enabled? If so, could you please try enabling some
pr_debug() calls in handle_stripe()?
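
If it is enabled, something like this should turn them on at runtime (a
sketch, assuming debugfs is mounted at /sys/kernel/debug):

echo 'func handle_stripe +p' > /sys/kernel/debug/dynamic_debug/control

('-p' instead of '+p' turns them back off once we have enough output.)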

Thanks,
Song

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Assemblin journaled array fails
  2020-06-09 18:36                                           ` Song Liu
@ 2020-06-09 21:51                                             ` Michal Soltys
  2020-06-16 11:11                                               ` Michal Soltys
  2020-06-17  0:48                                               ` Song Liu
  0 siblings, 2 replies; 46+ messages in thread
From: Michal Soltys @ 2020-06-09 21:51 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-raid

On 20/06/09 20:36, Song Liu wrote:
> On Tue, Jun 9, 2020 at 2:36 AM Michal Soltys <msoltyspl@yandex.pl> wrote:
>>
>> On 6/5/20 2:26 PM, Michal Soltys wrote:
>> > On 6/4/20 12:07 AM, Song Liu wrote:
>> >>
>> >> The hang happens at expected place.
>> >>
>> >>> [Jun 3 09:02] INFO: task mdadm:2858 blocked for more than 120 seconds.
>> >>> [  +0.060545]       Tainted: G            E
>> >>> 5.4.19-msl-00001-gbf39596faf12 #2
>> >>> [  +0.062932] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> >>> disables this message.
>> >>
>> >> Could you please try disable the timeout message with
>> >>
>> >> echo 0 > /proc/sys/kernel/hung_task_timeout_secs
>> >>
>> >> And during this wait (after message
>> >> "r5c_recovery_flush_data_only_stripes before wait_event"),
>> >> checks whether the raid disks (not the journal disk) are taking IOs
>> >> (using tools like iostat).
>> >>
>> >
>> > No activity on component drives.
>>
>> To expand on that - while there is no i/o activity whatsoever at the component drives (as well as journal), the cpu is of course still fully loaded (5 days so far):
>>
>> UID        PID  PPID  C    SZ   RSS PSR STIME TTY          TIME CMD
>> root      8129  6755 15   740  1904  10 Jun04 pts/2    17:42:34 mdadm -A /dev/md/r5_big /dev/md/r1_journal_big /dev/sdj1 /dev/sdi1 /dev/sdg1 /dev/sdh1
>> root      8147     2 84     0     0  30 Jun04 ?        4-02:09:47 [md124_raid5]
> 
> I guess the md thread stuck at some stripe. Does the kernel have
> CONFIG_DYNAMIC_DEBUG enabled? If so, could you please try enable some pr_debug()
> in function handle_stripe()?
> 
> Thanks,
> Song
> 

Massive spam in dmesg with messages like these:

[464836.603033] handling stripe 1551540328, state=0x41 cnt=1, pd_idx=3, 
qd_idx=-1
                 , check:0, reconstruct:0
[464836.603036] handling stripe 1551540336, state=0x41 cnt=1, pd_idx=3, 
qd_idx=-1
                 , check:0, reconstruct:0
[464836.603038] handling stripe 1551540344, state=0x41 cnt=1, pd_idx=3, 
qd_idx=-1
                 , check:0, reconstruct:0
[464836.603041] handling stripe 1551540352, state=0x41 cnt=1, pd_idx=3, 
qd_idx=-1
                 , check:0, reconstruct:0
[464836.603045] handling stripe 1551540360, state=0x41 cnt=1, pd_idx=3, 
qd_idx=-1
                 , check:0, reconstruct:0
[464836.603049] handling stripe 1551540368, state=0x41 cnt=1, pd_idx=3, 
qd_idx=-1
                 , check:0, reconstruct:0
[464836.603052] handling stripe 1551540376, state=0x41 cnt=1, pd_idx=3, 
qd_idx=-1
                 , check:0, reconstruct:0
[464836.603056] handling stripe 1551540384, state=0x41 cnt=1, pd_idx=3, 
qd_idx=-1
                 , check:0, reconstruct:0
[464836.603059] handling stripe 1551540392, state=0x41 cnt=1, pd_idx=3, 
qd_idx=-1
                 , check:0, reconstruct:0
[464836.603062] handling stripe 1551540400, state=0x41 cnt=1, pd_idx=3, 
qd_idx=-1
                 , check:0, reconstruct:0
[464836.603065] handling stripe 1551540408, state=0x41 cnt=1, pd_idx=3, 
qd_idx=-1
                 , check:0, reconstruct:0
[464836.603068] handling stripe 1551540416, state=0x41 cnt=1, pd_idx=3, 
qd_idx=-1
                 , check:0, reconstruct:0
[464836.603071] handling stripe 1551540424, state=0x41 cnt=1, pd_idx=3, 
qd_idx=-1
                 , check:0, reconstruct:0
[464836.603075] handling stripe 1551540432, state=0x41 cnt=1, pd_idx=3, 
qd_idx=-1
                 , check:0, reconstruct:0
[464836.603077] handling stripe 1551540440, state=0x41 cnt=1, pd_idx=3, 
qd_idx=-1
                 , check:0, reconstruct:0
[464836.603078] handling stripe 1551540448, state=0x41 cnt=1, pd_idx=3, 
qd_idx=-1
                 , check:0, reconstruct:0

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Assemblin journaled array fails
  2020-06-09 21:51                                             ` Michal Soltys
@ 2020-06-16 11:11                                               ` Michal Soltys
  2020-06-17  0:48                                               ` Song Liu
  1 sibling, 0 replies; 46+ messages in thread
From: Michal Soltys @ 2020-06-16 11:11 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-raid

On 6/9/20 11:51 PM, Michal Soltys wrote:
> On 20/06/09 20:36, Song Liu wrote:
>> On Tue, Jun 9, 2020 at 2:36 AM Michal Soltys <msoltyspl@yandex.pl> wrote:
>>>
>>> On 6/5/20 2:26 PM, Michal Soltys wrote:
>>> > On 6/4/20 12:07 AM, Song Liu wrote:
>>> >>
>>> >> The hang happens at expected place.
>>> >>
>>> >>> [Jun 3 09:02] INFO: task mdadm:2858 blocked for more than 120 
>>> seconds.
>>> >>> [  +0.060545]       Tainted: G            E
>>> >>> 5.4.19-msl-00001-gbf39596faf12 #2
>>> >>> [  +0.062932] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>>> >>> disables this message.
>>> >>
>>> >> Could you please try disable the timeout message with
>>> >>
>>> >> echo 0 > /proc/sys/kernel/hung_task_timeout_secs
>>> >>
>>> >> And during this wait (after message
>>> >> "r5c_recovery_flush_data_only_stripes before wait_event"),
>>> >> checks whether the raid disks (not the journal disk) are taking IOs
>>> >> (using tools like iostat).
>>> >>
>>> >
>>> > No activity on component drives.
>>>
>>> To expand on that - while there is no i/o activity whatsoever at the 
>>> component drives (as well as journal), the cpu is of course still 
>>> fully loaded (5 days so far):
>>>
>>> UID        PID  PPID  C    SZ   RSS PSR STIME TTY          TIME CMD
>>> root      8129  6755 15   740  1904  10 Jun04 pts/2    17:42:34 mdadm 
>>> -A /dev/md/r5_big /dev/md/r1_journal_big /dev/sdj1 /dev/sdi1 
>>> /dev/sdg1 /dev/sdh1
>>> root      8147     2 84     0     0  30 Jun04 ?        4-02:09:47 
>>> [md124_raid5]
>>
>> I guess the md thread stuck at some stripe. Does the kernel have
>> CONFIG_DYNAMIC_DEBUG enabled? If so, could you please try enable some 
>> pr_debug()
>> in function handle_stripe()?
>>
>> Thanks,
>> Song
>>
> 
> Massive spam in dmesg with messages like these:
> 
> [464836.603033] handling stripe 1551540328, state=0x41 cnt=1, pd_idx=3, 
> qd_idx=-1
>                  , check:0, reconstruct:0
> [464836.603036] handling stripe 1551540336, state=0x41 cnt=1, pd_idx=3, 
> qd_idx=-1
>                  , check:0, reconstruct:0
> [464836.603038] handling stripe 1551540344, state=0x41 cnt=1, pd_idx=3, 
> qd_idx=-1
>                  , check:0, reconstruct:0
> <cut>

So what should be the next step in debugging/fixing this?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Assemblin journaled array fails
  2020-06-09 21:51                                             ` Michal Soltys
  2020-06-16 11:11                                               ` Michal Soltys
@ 2020-06-17  0:48                                               ` Song Liu
  2020-06-17 12:53                                                 ` Michal Soltys
  1 sibling, 1 reply; 46+ messages in thread
From: Song Liu @ 2020-06-17  0:48 UTC (permalink / raw)
  To: Michal Soltys; +Cc: linux-raid

On Tue, Jun 9, 2020 at 2:51 PM Michal Soltys <msoltyspl@yandex.pl> wrote:
>
> On 20/06/09 20:36, Song Liu wrote:
> > On Tue, Jun 9, 2020 at 2:36 AM Michal Soltys <msoltyspl@yandex.pl> wrote:
> >>
> >> On 6/5/20 2:26 PM, Michal Soltys wrote:
> >> > On 6/4/20 12:07 AM, Song Liu wrote:
> >> >>
> >> >> The hang happens at expected place.
> >> >>
> >> >>> [Jun 3 09:02] INFO: task mdadm:2858 blocked for more than 120 seconds.
> >> >>> [  +0.060545]       Tainted: G            E
> >> >>> 5.4.19-msl-00001-gbf39596faf12 #2
> >> >>> [  +0.062932] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> >> >>> disables this message.
> >> >>
> >> >> Could you please try disable the timeout message with
> >> >>
> >> >> echo 0 > /proc/sys/kernel/hung_task_timeout_secs
> >> >>
> >> >> And during this wait (after message
> >> >> "r5c_recovery_flush_data_only_stripes before wait_event"),
> >> >> checks whether the raid disks (not the journal disk) are taking IOs
> >> >> (using tools like iostat).
> >> >>
> >> >
> >> > No activity on component drives.
> >>
> >> To expand on that - while there is no i/o activity whatsoever at the component drives (as well as journal), the cpu is of course still fully loaded (5 days so far):
> >>
> >> UID        PID  PPID  C    SZ   RSS PSR STIME TTY          TIME CMD
> >> root      8129  6755 15   740  1904  10 Jun04 pts/2    17:42:34 mdadm -A /dev/md/r5_big /dev/md/r1_journal_big /dev/sdj1 /dev/sdi1 /dev/sdg1 /dev/sdh1
> >> root      8147     2 84     0     0  30 Jun04 ?        4-02:09:47 [md124_raid5]
> >
> > I guess the md thread stuck at some stripe. Does the kernel have
> > CONFIG_DYNAMIC_DEBUG enabled? If so, could you please try enable some pr_debug()
> > in function handle_stripe()?
> >
> > Thanks,
> > Song
> >
>
> Massive spam in dmesg with messages like these:
>
> [464836.603033] handling stripe 1551540328, state=0x41 cnt=1, pd_idx=3,
> qd_idx=-1
>                  , check:0, reconstruct:0
> [464836.603036] handling stripe 1551540336, state=0x41 cnt=1, pd_idx=3,
> qd_idx=-1
>                  , check:0, reconstruct:0
> [464836.603038] handling stripe 1551540344, state=0x41 cnt=1, pd_idx=3,
> qd_idx=-1
>                  , check:0, reconstruct:0
> [464836.603041] handling stripe 1551540352, state=0x41 cnt=1, pd_idx=3,
> qd_idx=-1
>                  , check:0, reconstruct:0
> [464836.603045] handling stripe 1551540360, state=0x41 cnt=1, pd_idx=3,
> qd_idx=-1
>                  , check:0, reconstruct:0
> [464836.603049] handling stripe 1551540368, state=0x41 cnt=1, pd_idx=3,
> qd_idx=-1
>                  , check:0, reconstruct:0
> [464836.603052] handling stripe 1551540376, state=0x41 cnt=1, pd_idx=3,
> qd_idx=-1
>                  , check:0, reconstruct:0
> [464836.603056] handling stripe 1551540384, state=0x41 cnt=1, pd_idx=3,
> qd_idx=-1
>                  , check:0, reconstruct:0
> [464836.603059] handling stripe 1551540392, state=0x41 cnt=1, pd_idx=3,
> qd_idx=-1
>                  , check:0, reconstruct:0
> [464836.603062] handling stripe 1551540400, state=0x41 cnt=1, pd_idx=3,
> qd_idx=-1
>                  , check:0, reconstruct:0
> [464836.603065] handling stripe 1551540408, state=0x41 cnt=1, pd_idx=3,
> qd_idx=-1
>                  , check:0, reconstruct:0
> [464836.603068] handling stripe 1551540416, state=0x41 cnt=1, pd_idx=3,
> qd_idx=-1
>                  , check:0, reconstruct:0
> [464836.603071] handling stripe 1551540424, state=0x41 cnt=1, pd_idx=3,
> qd_idx=-1
>                  , check:0, reconstruct:0
> [464836.603075] handling stripe 1551540432, state=0x41 cnt=1, pd_idx=3,
> qd_idx=-1
>                  , check:0, reconstruct:0
> [464836.603077] handling stripe 1551540440, state=0x41 cnt=1, pd_idx=3,
> qd_idx=-1
>                  , check:0, reconstruct:0
> [464836.603078] handling stripe 1551540448, state=0x41 cnt=1, pd_idx=3,
> qd_idx=-1
>                  , check:0, reconstruct:0

I am really sorry for the delay. A few questions:

1. There are two pr_debug() calls in handle_stripe():
       pr_debug("handling stripe %llu, state=%#lx cnt=%d, "
       pr_debug("locked=%d uptodate=%d to_read=%d"

    Did you enable all of them? Or only the first one?

2. In the log above, we have pd_idx = 3 for all these stripes. Is this
   the case for all the stripes? If that's the case, is it a raid4
   instead of raid5?

Thanks,
Song

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Assemblin journaled array fails
  2020-06-17  0:48                                               ` Song Liu
@ 2020-06-17 12:53                                                 ` Michal Soltys
  2020-06-17 17:11                                                   ` Song Liu
  0 siblings, 1 reply; 46+ messages in thread
From: Michal Soltys @ 2020-06-17 12:53 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 2771 bytes --]

On 6/17/20 2:48 AM, Song Liu wrote:
> On Tue, Jun 9, 2020 at 2:51 PM Michal Soltys <msoltyspl@yandex.pl> wrote:
>>
>> On 20/06/09 20:36, Song Liu wrote:
>>> On Tue, Jun 9, 2020 at 2:36 AM Michal Soltys <msoltyspl@yandex.pl> wrote:
>>>>
>>>> On 6/5/20 2:26 PM, Michal Soltys wrote:
>>>>> On 6/4/20 12:07 AM, Song Liu wrote:
>>>>>>
>>>>>> The hang happens at expected place.
>>>>>>
>>>>>>> [Jun 3 09:02] INFO: task mdadm:2858 blocked for more than 120 seconds.
>>>>>>> [  +0.060545]       Tainted: G            E
>>>>>>> 5.4.19-msl-00001-gbf39596faf12 #2
>>>>>>> [  +0.062932] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>>>>>>> disables this message.
>>>>>>
>>>>>> Could you please try disable the timeout message with
>>>>>>
>>>>>> echo 0 > /proc/sys/kernel/hung_task_timeout_secs
>>>>>>
>>>>>> And during this wait (after message
>>>>>> "r5c_recovery_flush_data_only_stripes before wait_event"),
>>>>>> checks whether the raid disks (not the journal disk) are taking IOs
>>>>>> (using tools like iostat).
>>>>>>
>>>>>
>>>>> No activity on component drives.
>>>>
>>>> To expand on that - while there is no i/o activity whatsoever at the component drives (as well as journal), the cpu is of course still fully loaded (5 days so far):
>>>>
>>>> UID        PID  PPID  C    SZ   RSS PSR STIME TTY          TIME CMD
>>>> root      8129  6755 15   740  1904  10 Jun04 pts/2    17:42:34 mdadm -A /dev/md/r5_big /dev/md/r1_journal_big /dev/sdj1 /dev/sdi1 /dev/sdg1 /dev/sdh1
>>>> root      8147     2 84     0     0  30 Jun04 ?        4-02:09:47 [md124_raid5]
>>>
>>> I guess the md thread stuck at some stripe. Does the kernel have
>>> CONFIG_DYNAMIC_DEBUG enabled? If so, could you please try enable some pr_debug()
>>> in function handle_stripe()?
>>>
>>> Thanks,
>>> Song
>>>
>>
>> Massive spam in dmesg with messages like these:
>>
>> [464836.603033] handling stripe 1551540328, state=0x41 cnt=1, pd_idx=3,
>> qd_idx=-1
>>                   , check:0, reconstruct:0
>> <cut>
>>
> 
> I am really sorry for the delay. A few questions:
> 

No worries.

> 1. There are two pr_debug() calls in handle_stripe():
>         pr_debug("handling stripe %llu, state=%#lx cnt=%d, "
>         pr_debug("locked=%d uptodate=%d to_read=%d"
> 
>      Did you enable all of them? Or only the first one?

I enabled all of them (I think), to be precise:

echo -n 'func handle_stripe +p' >/sys/kernel/debug/dynamic_debug/control

Haven't seen any `locked` lines though.

> 
> 2. In the log above, we have pd_idx = 3 for all these stripes. Is this
> the case for
>     all the stripes? If that's the case, is it a raid4 instead of raid5?
> 

Just a coincidence, pd_idx is in the 0..3 range. The rest of the values
are constant from what I can see.
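
To illustrate - a minimal userspace sketch (my own approximation of the
left-symmetric layout math, not the kernel's code), assuming the numbers
printed by handle_stripe() are sh->sector values in 512-byte sectors on
the component devices and the chunk size is 512K (1024 sectors):

#include <stdio.h>

int main(void)
{
	const unsigned long long chunk_sectors = 1024;	/* 512K chunk */
	const unsigned long long raid_disks = 4, data_disks = 3;
	const unsigned long long sectors[] = {
		1551540328ULL,			/* from the dmesg above */
		1551540328ULL + chunk_sectors,	/* next chunk */
		1551540328ULL + 2 * chunk_sectors,
		1551540328ULL + 3 * chunk_sectors,
	};

	for (int i = 0; i < 4; i++) {
		unsigned long long stripe = sectors[i] / chunk_sectors;
		/* left-symmetric: parity walks backwards each stripe */
		unsigned long long pd_idx = data_disks - stripe % raid_disks;

		printf("sector %llu -> pd_idx %llu\n", sectors[i], pd_idx);
	}
	return 0;
}

The stripe_heads in the quoted snippet are only 8 sectors apart, so they
all fall into the same 1024-sector chunk and share pd_idx=3; across chunk
boundaries it cycles 3,2,1,0 (a raid4 would keep the parity disk fixed).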

I've attached a longer piece of dmesg output.

[-- Attachment #2: dmesg.txt.xz --]
[-- Type: application/x-xz, Size: 18160 bytes --]

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Assemblin journaled array fails
  2020-06-17 12:53                                                 ` Michal Soltys
@ 2020-06-17 17:11                                                   ` Song Liu
  2020-06-19 11:35                                                     ` Michal Soltys
  0 siblings, 1 reply; 46+ messages in thread
From: Song Liu @ 2020-06-17 17:11 UTC (permalink / raw)
  To: Michal Soltys; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 2770 bytes --]

On Wed, Jun 17, 2020 at 5:53 AM Michal Soltys <msoltyspl@yandex.pl> wrote:
>
> On 6/17/20 2:48 AM, Song Liu wrote:
> > On Tue, Jun 9, 2020 at 2:51 PM Michal Soltys <msoltyspl@yandex.pl> wrote:
> >>
> >> On 20/06/09 20:36, Song Liu wrote:
> >>> On Tue, Jun 9, 2020 at 2:36 AM Michal Soltys <msoltyspl@yandex.pl> wrote:
> >>>>
> >>>> On 6/5/20 2:26 PM, Michal Soltys wrote:
> >>>>> On 6/4/20 12:07 AM, Song Liu wrote:
> >>>>>>
> >>>>>> The hang happens at expected place.
> >>>>>>
> >>>>>>> [Jun 3 09:02] INFO: task mdadm:2858 blocked for more than 120 seconds.
> >>>>>>> [  +0.060545]       Tainted: G            E
> >>>>>>> 5.4.19-msl-00001-gbf39596faf12 #2
> >>>>>>> [  +0.062932] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> >>>>>>> disables this message.
> >>>>>>
> >>>>>> Could you please try disable the timeout message with
> >>>>>>
> >>>>>> echo 0 > /proc/sys/kernel/hung_task_timeout_secs
> >>>>>>
> >>>>>> And during this wait (after message
> >>>>>> "r5c_recovery_flush_data_only_stripes before wait_event"),
> >>>>>> checks whether the raid disks (not the journal disk) are taking IOs
> >>>>>> (using tools like iostat).
> >>>>>>
> >>>>>
> >>>>> No activity on component drives.
> >>>>
> >>>> To expand on that - while there is no i/o activity whatsoever at the component drives (as well as journal), the cpu is of course still fully loaded (5 days so far):
> >>>>
> >>>> UID        PID  PPID  C    SZ   RSS PSR STIME TTY          TIME CMD
> >>>> root      8129  6755 15   740  1904  10 Jun04 pts/2    17:42:34 mdadm -A /dev/md/r5_big /dev/md/r1_journal_big /dev/sdj1 /dev/sdi1 /dev/sdg1 /dev/sdh1
> >>>> root      8147     2 84     0     0  30 Jun04 ?        4-02:09:47 [md124_raid5]
> >>>
> >>> I guess the md thread stuck at some stripe. Does the kernel have
> >>> CONFIG_DYNAMIC_DEBUG enabled? If so, could you please try enable some pr_debug()
> >>> in function handle_stripe()?
> >>>
> >>> Thanks,
> >>> Song
> >>>
> >>
> >> Massive spam in dmesg with messages like these:
> >>
> >> [464836.603033] handling stripe 1551540328, state=0x41 cnt=1, pd_idx=3,
> >> qd_idx=-1
> >>                   , check:0, reconstruct:0
> >> <cut>
> >>
> >
> > I am really sorry for the delay. A few questions:
> >
>
> No worries.
>
> > 1. There are two pr_debug() calls in handle_stripe():
> >         pr_debug("handling stripe %llu, state=%#lx cnt=%d, "
> >         pr_debug("locked=%d uptodate=%d to_read=%d"
> >
> >      Did you enable all of them? Or only the first one?
>
> I enabled all of them (I think), to be precise:
>
> echo -n 'func handle_stripe +p' >/sys/kernel/debug/dynamic_debug/control
>
> Haven't seen any `locked` lines though.

That's a little weird, and probably explains why we are stuck. Could you
please try the attached patch?

Thanks,
Song

[-- Attachment #2: 0001-Debug-raid5-recovery-hang.patch --]
[-- Type: application/octet-stream, Size: 1726 bytes --]

From 4c60a692fffc1eb51a4811d36eec6076b779c6c6 Mon Sep 17 00:00:00 2001
From: Song Liu <songliubraving@fb.com>
Date: Wed, 17 Jun 2020 09:56:06 -0700
Subject: [PATCH] Debug raid5 recovery hang

---
 drivers/md/raid5.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index ab8067f9ce8c6..ca7476fac3e9d 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4726,6 +4726,20 @@ static void handle_stripe(struct stripe_head *sh)
 	if (test_bit(STRIPE_LOG_TRAPPED, &sh->state))
 		goto finish;
 
+	pr_debug("locked=%d uptodate=%d to_read=%d"
+	       " to_write=%d failed=%d failed_num=%d,%d\n",
+	       s.locked, s.uptodate, s.to_read, s.to_write, s.failed,
+	       s.failed_num[0], s.failed_num[1]);
+
+	pr_debug("handle_bad_blocks=%d MD_SB_CHANGE_PENDING=%d blocked_rdev=%px\n",
+		 s.handle_bad_blocks,
+		 test_bit(MD_SB_CHANGE_PENDING, &conf->mddev->sb_flags),
+		 s.blocked_rdev);
+
+	pr_debug("syncing=%d expanding=%d expanded=%d"
+	       " replacing=%d written=%d\n",
+		 s.syncing, s.expanding, s.expanded, s.replacing, s.written);
+
 	if (s.handle_bad_blocks ||
 	    test_bit(MD_SB_CHANGE_PENDING, &conf->mddev->sb_flags)) {
 		set_bit(STRIPE_HANDLE, &sh->state);
@@ -4748,10 +4762,6 @@ static void handle_stripe(struct stripe_head *sh)
 		set_bit(STRIPE_BIOFILL_RUN, &sh->state);
 	}
 
-	pr_debug("locked=%d uptodate=%d to_read=%d"
-	       " to_write=%d failed=%d failed_num=%d,%d\n",
-	       s.locked, s.uptodate, s.to_read, s.to_write, s.failed,
-	       s.failed_num[0], s.failed_num[1]);
 	/*
 	 * check if the array has lost more than max_degraded devices and,
 	 * if so, some requests might need to be failed.
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: Assemblin journaled array fails
  2020-06-17 17:11                                                   ` Song Liu
@ 2020-06-19 11:35                                                     ` Michal Soltys
  2020-06-20  0:14                                                       ` Song Liu
  0 siblings, 1 reply; 46+ messages in thread
From: Michal Soltys @ 2020-06-19 11:35 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-raid

On 6/17/20 7:11 PM, Song Liu wrote:
>>
>>> 1. There are two pr_debug() calls in handle_stripe():
>>>          pr_debug("handling stripe %llu, state=%#lx cnt=%d, "
>>>          pr_debug("locked=%d uptodate=%d to_read=%d"
>>>
>>>       Did you enable all of them? Or only the first one?
>>
>> I enabled all of them (I think), to be precise:
>>
>> echo -n 'func handle_stripe +p' >/sys/kernel/debug/dynamic_debug/control
>>
>> Haven't seen any `locked` lines though.
> 
> That's a little weird, and probably explains why we stuck. Could you
> please try the attached patch?
> 
> Thanks,
> Song
> 

I've started assembly with the above patch; the output can be inspected
via this file:

https://ufile.io/yx4bbcb4

This is a ~5 MB packed dmesg from the start of boot up to the relevant
parts; it unpacks to a ~150 MB file.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Assemblin journaled array fails
  2020-06-19 11:35                                                     ` Michal Soltys
@ 2020-06-20  0:14                                                       ` Song Liu
  2020-06-22 11:12                                                         ` Michal Soltys
  0 siblings, 1 reply; 46+ messages in thread
From: Song Liu @ 2020-06-20  0:14 UTC (permalink / raw)
  To: Michal Soltys; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 1074 bytes --]

On Fri, Jun 19, 2020 at 4:35 AM Michal Soltys <msoltyspl@yandex.pl> wrote:
>
> On 6/17/20 7:11 PM, Song Liu wrote:
> >>
> >>> 1. There are two pr_debug() calls in handle_stripe():
> >>>          pr_debug("handling stripe %llu, state=%#lx cnt=%d, "
> >>>          pr_debug("locked=%d uptodate=%d to_read=%d"
> >>>
> >>>       Did you enable all of them? Or only the first one?
> >>
> >> I enabled all of them (I think), to be precise:
> >>
> >> echo -n 'func handle_stripe +p' >/sys/kernel/debug/dynamic_debug/control
> >>
> >> Haven't seen any `locked` lines though.
> >
> > That's a little weird, and probably explains why we stuck. Could you
> > please try the attached patch?
> >
> > Thanks,
> > Song
> >
>
> I've started assembly with the above patch, the output can be inspected
> from this file:
>
> https://ufile.io/yx4bbcb4
>
> This is ~5mb packed dmesg from start of the boot to the relevant parts,
> unpacks to ~150mb file.

Thanks for the trace. Looks like we may have some issues with
MD_SB_CHANGE_PENDING.
Could you please try the attached patch?

Thanks,
Song

[-- Attachment #2: 0001-debug-MD_SB_CHANGE_PENDING.patch --]
[-- Type: application/octet-stream, Size: 5015 bytes --]

From 6c3758525c4abadc64839cf18ee3ed7ad40293da Mon Sep 17 00:00:00 2001
From: Song Liu <songliubraving@fb.com>
Date: Fri, 19 Jun 2020 17:12:05 -0700
Subject: [PATCH] debug MD_SB_CHANGE_PENDING

---
 drivers/md/md.c          | 20 ++++++++++++++++++++
 drivers/md/raid5-cache.c |  1 +
 drivers/md/raid5.c       |  2 ++
 3 files changed, 23 insertions(+)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index f567f536b529b..dbbc8a50e2ed2 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2670,6 +2670,8 @@ void md_update_sb(struct mddev *mddev, int force_change)
 			bit_clear_unless(&mddev->sb_flags, BIT(MD_SB_CHANGE_PENDING),
 							 BIT(MD_SB_CHANGE_DEVS) |
 							 BIT(MD_SB_CHANGE_CLEAN));
+			pr_info("%s bit_clear_unless sb_flags %lx\n", __func__,
+				mddev->sb_flags);
 			return;
 		}
 	}
@@ -2697,6 +2699,8 @@ void md_update_sb(struct mddev *mddev, int force_change)
 		clear_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
 		if (!mddev->external) {
 			clear_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
+			pr_info("%s clear MD_SB_CHANGE_PENDING\n", __func__);
+
 			rdev_for_each(rdev, mddev) {
 				if (rdev->badblocks.changed) {
 					rdev->badblocks.changed = 0;
@@ -2815,6 +2819,9 @@ void md_update_sb(struct mddev *mddev, int force_change)
 	if (mddev_is_clustered(mddev) && ret == 0)
 		md_cluster_ops->metadata_update_finish(mddev);
 
+	if (mddev->in_sync != sync_req)
+		pr_info("%s bit_clear_unless sb_flags %lx\n", __func__,
+			mddev->sb_flags);
 	if (mddev->in_sync != sync_req ||
 	    !bit_clear_unless(&mddev->sb_flags, BIT(MD_SB_CHANGE_PENDING),
 			       BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_CLEAN)))
@@ -4389,6 +4396,8 @@ array_state_store(struct mddev *mddev, const char *buf, size_t len)
 		if (st == active) {
 			restart_array(mddev);
 			clear_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
+			pr_info("%s clear MD_SB_CHANGE_PENDING\n", __func__);
+
 			md_wakeup_thread(mddev->thread);
 			wake_up(&mddev->sb_wait);
 		} else /* st == clean */ {
@@ -4463,6 +4472,8 @@ array_state_store(struct mddev *mddev, const char *buf, size_t len)
 			if (err)
 				break;
 			clear_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
+			pr_info("%s clear MD_SB_CHANGE_PENDING\n", __func__);
+
 			wake_up(&mddev->sb_wait);
 			err = 0;
 		} else {
@@ -8469,6 +8480,7 @@ bool md_write_start(struct mddev *mddev, struct bio *bi)
 			mddev->in_sync = 0;
 			set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
 			set_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
+			pr_info("%s set MD_SB_CHANGE_PENDING\n", __func__);
 			md_wakeup_thread(mddev->thread);
 			did_change = 1;
 		}
@@ -8544,6 +8556,8 @@ void md_allow_write(struct mddev *mddev)
 		mddev->in_sync = 0;
 		set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
 		set_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
+		pr_info("%s set MD_SB_CHANGE_PENDING\n", __func__);
+
 		if (mddev->safemode_delay &&
 		    mddev->safemode == 0)
 			mddev->safemode = 1;
@@ -8948,6 +8962,8 @@ void md_do_sync(struct md_thread *thread)
 	set_mask_bits(&mddev->sb_flags, 0,
 		      BIT(MD_SB_CHANGE_PENDING) | BIT(MD_SB_CHANGE_DEVS));
 
+	pr_info("%s set MD_SB_CHANGE_PENDING\n", __func__);
+
 	if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
 			!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
 			mddev->delta_disks > 0 &&
@@ -9197,6 +9213,8 @@ void md_check_recovery(struct mddev *mddev)
 			clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
 			clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
 			clear_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
+			pr_info("%s clear MD_SB_CHANGE_PENDING\n", __func__);
+
 			goto unlock;
 		}
 
@@ -9411,6 +9429,8 @@ int rdev_set_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
 		sysfs_notify_dirent_safe(rdev->sysfs_state);
 		set_mask_bits(&mddev->sb_flags, 0,
 			      BIT(MD_SB_CHANGE_CLEAN) | BIT(MD_SB_CHANGE_PENDING));
+		pr_info("%s set MD_SB_CHANGE_PENDING\n", __func__);
+
 		md_wakeup_thread(rdev->mddev->thread);
 		return 1;
 	} else
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 9b6da759dca25..60fa08f0170bc 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -1337,6 +1337,7 @@ static void r5l_write_super_and_discard_space(struct r5l_log *log,
 	 */
 	set_mask_bits(&mddev->sb_flags, 0,
 		BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_PENDING));
+	pr_info("%s set MD_SB_CHANGE_PENDING\n", __func__);
 	if (!mddev_trylock(mddev))
 		return;
 	md_update_sb(mddev, 1);
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index ca7476fac3e9d..3240dee1da7bf 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2704,6 +2704,8 @@ static void raid5_error(struct mddev *mddev, struct md_rdev *rdev)
 	set_bit(Blocked, &rdev->flags);
 	set_mask_bits(&mddev->sb_flags, 0,
 		      BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_PENDING));
+	pr_info("%s set MD_SB_CHANGE_PENDING\n", __func__);
+
 	pr_crit("md/raid:%s: Disk failure on %s, disabling device.\n"
 		"md/raid:%s: Operation continuing on %d devices.\n",
 		mdname(mddev),
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: Assemblin journaled array fails
  2020-06-20  0:14                                                       ` Song Liu
@ 2020-06-22 11:12                                                         ` Michal Soltys
  2020-06-22 16:37                                                           ` Song Liu
  0 siblings, 1 reply; 46+ messages in thread
From: Michal Soltys @ 2020-06-22 11:12 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-raid

On 6/20/20 2:14 AM, Song Liu wrote:
> On Fri, Jun 19, 2020 at 4:35 AM Michal Soltys <msoltyspl@yandex.pl> wrote:
>>
>> On 6/17/20 7:11 PM, Song Liu wrote:
>>>>
>>>>> 1. There are two pr_debug() calls in handle_stripe():
>>>>>           pr_debug("handling stripe %llu, state=%#lx cnt=%d, "
>>>>>           pr_debug("locked=%d uptodate=%d to_read=%d"
>>>>>
>>>>>        Did you enable all of them? Or only the first one?
>>>>
>>>> I enabled all of them (I think), to be precise:
>>>>
>>>> echo -n 'func handle_stripe +p' >/sys/kernel/debug/dynamic_debug/control
>>>>
>>>> Haven't seen any `locked` lines though.
>>>
>>> That's a little weird, and probably explains why we stuck. Could you
>>> please try the attached patch?
>>>
>>> Thanks,
>>> Song
>>>
>>
>> I've started assembly with the above patch, the output can be inspected
>> from this file:
>>
>> https://ufile.io/yx4bbcb4
>>
>> This is ~5mb packed dmesg from start of the boot to the relevant parts,
>> unpacks to ~150mb file.
> 
> Thanks for the trace. Looks like we may have some issues with
> MD_SB_CHANGE_PENDING.
> Could you please try the attached patch?

Should I run this along with the pr_debug()s from the previous patch enabled?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Assemblin journaled array fails
  2020-06-22 11:12                                                         ` Michal Soltys
@ 2020-06-22 16:37                                                           ` Song Liu
  2020-06-23 13:17                                                             ` Michal Soltys
  0 siblings, 1 reply; 46+ messages in thread
From: Song Liu @ 2020-06-22 16:37 UTC (permalink / raw)
  To: Michal Soltys; +Cc: linux-raid

On Mon, Jun 22, 2020 at 4:12 AM Michal Soltys <msoltyspl@yandex.pl> wrote:
>
> On 6/20/20 2:14 AM, Song Liu wrote:
> > On Fri, Jun 19, 2020 at 4:35 AM Michal Soltys <msoltyspl@yandex.pl> wrote:
> >>
> >> On 6/17/20 7:11 PM, Song Liu wrote:
> >>>>
> >>>>> 1. There are two pr_debug() calls in handle_stripe():
> >>>>>           pr_debug("handling stripe %llu, state=%#lx cnt=%d, "
> >>>>>           pr_debug("locked=%d uptodate=%d to_read=%d"
> >>>>>
> >>>>>        Did you enable all of them? Or only the first one?
> >>>>
> >>>> I enabled all of them (I think), to be precise:
> >>>>
> >>>> echo -n 'func handle_stripe +p' >/sys/kernel/debug/dynamic_debug/control
> >>>>
> >>>> Haven't seen any `locked` lines though.
> >>>
> >>> That's a little weird, and probably explains why we stuck. Could you
> >>> please try the attached patch?
> >>>
> >>> Thanks,
> >>> Song
> >>>
> >>
> >> I've started assembly with the above patch, the output can be inspected
> >> from this file:
> >>
> >> https://ufile.io/yx4bbcb4
> >>
> >> This is ~5mb packed dmesg from start of the boot to the relevant parts,
> >> unpacks to ~150mb file.
> >
> > Thanks for the trace. Looks like we may have some issues with
> > MD_SB_CHANGE_PENDING.
> > Could you please try the attached patch?
>
> Should I run this along with pr_debugs from the previous patch enabled ?

We don't need those pr_debug()s here.

Thanks,
Song

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Assemblin journaled array fails
  2020-06-22 16:37                                                           ` Song Liu
@ 2020-06-23 13:17                                                             ` Michal Soltys
  2020-06-23 23:13                                                               ` Song Liu
  0 siblings, 1 reply; 46+ messages in thread
From: Michal Soltys @ 2020-06-23 13:17 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-raid

On 6/22/20 6:37 PM, Song Liu wrote:
>>>
>>> Thanks for the trace. Looks like we may have some issues with
>>> MD_SB_CHANGE_PENDING.
>>> Could you please try the attached patch?
>>
>> Should I run this along with pr_debugs from the previous patch enabled ?
> 
> We don't need those pr_debug() here.
> 
> Thanks,
> Song
> 

So with this patch applied, there is no extra output whatsoever once it got past this point:

[  +0.371752] r5c_recovery_rewrite_data_only_stripes rewritten 20001 stripes to the journal, current ctx->pos 408461384 ctx->seq 866603361                           
[  +0.395000] r5c_recovery_rewrite_data_only_stripes rewritten 21001 stripes to the journal, current ctx->pos 408479568 ctx->seq 866604361                           
[  +0.371255] r5c_recovery_rewrite_data_only_stripes rewritten 22001 stripes to the journal, current ctx->pos 408496600 ctx->seq 866605361                           
[  +0.401013] r5c_recovery_rewrite_data_only_stripes rewritten 23001 stripes to the journal, current ctx->pos 408515472 ctx->seq 866606361                           
[  +0.370543] r5c_recovery_rewrite_data_only_stripes rewritten 24001 stripes to the journal, current ctx->pos 408532112 ctx->seq 866607361                           
[  +0.319253] r5c_recovery_rewrite_data_only_stripes done
[  +0.061560] r5c_recovery_flush_data_only_stripes enter
[  +0.075697] r5c_recovery_flush_data_only_stripes before wait_event

That is, aside from the 'task <....> blocked for' traces, or unless the pr_debug()s were enabled.

There were a few 'md_write_start set MD_SB_CHANGE_PENDING' messages *before* that (all of them likely related to another raid that is active at the moment, as they were happening during that lengthy r5c_recovery_flush_log() process).

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Assemblin journaled array fails
  2020-06-23 13:17                                                             ` Michal Soltys
@ 2020-06-23 23:13                                                               ` Song Liu
  2020-06-25 16:11                                                                 ` Michal Soltys
  0 siblings, 1 reply; 46+ messages in thread
From: Song Liu @ 2020-06-23 23:13 UTC (permalink / raw)
  To: Michal Soltys; +Cc: linux-raid

On Tue, Jun 23, 2020 at 6:17 AM Michal Soltys <msoltyspl@yandex.pl> wrote:
>
> On 6/22/20 6:37 PM, Song Liu wrote:
> >>>
> >>> Thanks for the trace. Looks like we may have some issues with
> >>> MD_SB_CHANGE_PENDING.
> >>> Could you please try the attached patch?
> >>
> >> Should I run this along with pr_debugs from the previous patch enabled ?
> >
> > We don't need those pr_debug() here.
> >
> > Thanks,
> > Song
> >
>
> So with this patch attached, there is no extra output whatsoever - once it finished getting past this point:
>
> [  +0.371752] r5c_recovery_rewrite_data_only_stripes rewritten 20001 stripes to the journal, current ctx->pos 408461384 ctx->seq 866603361
> [  +0.395000] r5c_recovery_rewrite_data_only_stripes rewritten 21001 stripes to the journal, current ctx->pos 408479568 ctx->seq 866604361
> [  +0.371255] r5c_recovery_rewrite_data_only_stripes rewritten 22001 stripes to the journal, current ctx->pos 408496600 ctx->seq 866605361
> [  +0.401013] r5c_recovery_rewrite_data_only_stripes rewritten 23001 stripes to the journal, current ctx->pos 408515472 ctx->seq 866606361
> [  +0.370543] r5c_recovery_rewrite_data_only_stripes rewritten 24001 stripes to the journal, current ctx->pos 408532112 ctx->seq 866607361
> [  +0.319253] r5c_recovery_rewrite_data_only_stripes done
> [  +0.061560] r5c_recovery_flush_data_only_stripes enter
> [  +0.075697] r5c_recovery_flush_data_only_stripes before wait_event
>
> That is, besides 'task <....> blocked for' traces or unless pr_debug()s were enabled.
>
> There were a few 'md_write_start set MD_SB_CHANGE_PENDING' *before* that (all of them likely related to another raid that is active at the moment, as these were happening during that lengthy r5c_recovery_flush_log() process).

Hmm.. this is weird, as I think I marked every instance of
set_bit(MD_SB_CHANGE_PENDING, ...). Would you mind confirming those go to
the other array, with something like:

diff --git i/drivers/md/md.c w/drivers/md/md.c
index dbbc8a50e2ed2..e91acfdcec032 100644
--- i/drivers/md/md.c
+++ w/drivers/md/md.c
@@ -8480,7 +8480,7 @@ bool md_write_start(struct mddev *mddev, struct bio *bi)
                        mddev->in_sync = 0;
                        set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
                        set_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
-                       pr_info("%s set MD_SB_CHANGE_PENDING\n", __func__);
+                       pr_info("%s: md: %s set MD_SB_CHANGE_PENDING\n", __func__, mdname(mddev));
                        md_wakeup_thread(mddev->thread);
                        did_change = 1;
                }

Thanks,
Song

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: Assemblin journaled array fails
  2020-06-23 23:13                                                               ` Song Liu
@ 2020-06-25 16:11                                                                 ` Michal Soltys
  2020-07-06  8:07                                                                   ` Michal Soltys
  0 siblings, 1 reply; 46+ messages in thread
From: Michal Soltys @ 2020-06-25 16:11 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 1123 bytes --]

On 6/24/20 1:13 AM, Song Liu wrote:
> On Tue, Jun 23, 2020 at 6:17 AM Michal Soltys <msoltyspl@yandex.pl> wrote:
> 
> Hmm.. this is weird, as I think I marked every instance of set_bit
> MD_SB_CHANGE_PENDING.
> Would you mind confirm those are to the other array with something like:
> 
> diff --git i/drivers/md/md.c w/drivers/md/md.c
> index dbbc8a50e2ed2..e91acfdcec032 100644
> --- i/drivers/md/md.c
> +++ w/drivers/md/md.c
> @@ -8480,7 +8480,7 @@ bool md_write_start(struct mddev *mddev, struct bio *bi)
>                          mddev->in_sync = 0;
>                          set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
>                          set_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
> -                       pr_info("%s set MD_SB_CHANGE_PENDING\n", __func__);
> +                       pr_info("%s: md: %s set
> MD_SB_CHANGE_PENDING\n", __func__, mdname(mddev));
>                          md_wakeup_thread(mddev->thread);
>                          did_change = 1;
>                  }
> 
> Thanks,
> Song
> 

dmesg attached
- md127 - journal
- md126 - the other raid
- md125 - the problematic one

[-- Attachment #2: dmesg.txt.xz --]
[-- Type: application/x-xz, Size: 24356 bytes --]

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Assemblin journaled array fails
  2020-05-14 11:07           ` Michal Soltys
  2020-05-16  0:15             ` Song Liu
@ 2020-06-26  8:28             ` Guoqing Jiang
  1 sibling, 0 replies; 46+ messages in thread
From: Guoqing Jiang @ 2020-06-26  8:28 UTC (permalink / raw)
  To: Michal Soltys, Song Liu; +Cc: linux-raid



On 5/14/20 1:07 PM, Michal Soltys wrote:
> On 5/13/20 6:17 PM, Song Liu wrote:
>>>>
>>>> Are these captured back to back? I am asking because they showed 
>>>> different
>>>> "Events" number.
>>>
>>> Nah, they were captured between reboots. Back to back all event 
>>> fields show identical value (at 56291 now).
>>>
>>>>
>>>> Also, when mdadm -A hangs, could you please capture /proc/$(pidof 
>>>> mdadm)/stack ?
>>>>
>>>
>>> The output is empty:
>>>
>>> xs22:/☠ ps -eF fww | grep mdadm
>>> root     10332  9362 97   740  1884  25 12:47 pts/1    R+ 6:59  |   
>>> \_ mdadm -A /dev/md/r5_big /dev/md/r1_journal_big /dev/sdi1 
>>> /dev/sdg1 /dev/sdj1 /dev/sdh1
>>> xs22:/☠ cd /proc/10332
>>> xs22:/proc/10332☠ cat stack
>>> xs22:/proc/10332☠
>>
>> Hmm... Could you please share the strace output of "mdadm -A" 
>> command? Like
>>
>> strace mdadm -A /dev/md/r5_big /dev/md/r1_journal_big /dev/xxx ...
>>
>> Thanks,
>> Song
>
> I did one more thing - ran cat on that stack in another terminal every 
> 0.1 seconds, this is what was found:
>
> cat: /proc//stack: No such file or directory
> cat: /proc//stack: No such file or directory
> cat: /proc//stack: No such file or directory
> [<0>] submit_bio_wait+0x5b/0x80
> [<0>] r5l_recovery_read_page+0x1b6/0x200 [raid456]
> [<0>] r5l_recovery_verify_data_checksum+0x19/0x70 [raid456]
> [<0>] r5l_start+0x99e/0xe70 [raid456]
> [<0>] md_start.part.48+0x2e/0x50 [md_mod]
> [<0>] do_md_run+0x64/0x100 [md_mod]
> [<0>] md_ioctl+0xe7d/0x17d0 [md_mod]
> [<0>] blkdev_ioctl+0x4d0/0xa00
> [<0>] block_ioctl+0x39/0x40
> [<0>] do_vfs_ioctl+0xa4/0x630
> [<0>] ksys_ioctl+0x60/0x90
> [<0>] __x64_sys_ioctl+0x16/0x20
> [<0>] do_syscall_64+0x52/0x160
> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> cat: /proc//stack: No such file or directory
> [<0>] submit_bio_wait+0x5b/0x80

Just FYI. I recalled that long sync IO could fire the warning in 
submit_bio_wait.
Have you applied the commit de6a78b601c5 ("block: Prevent hung_check
firing during long sync IO")?

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Assemblin journaled array fails
  2020-06-25 16:11                                                                 ` Michal Soltys
@ 2020-07-06  8:07                                                                   ` Michal Soltys
  2020-07-06 22:08                                                                     ` Song Liu
  0 siblings, 1 reply; 46+ messages in thread
From: Michal Soltys @ 2020-07-06  8:07 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-raid

On 20/06/25 18:11, Michal Soltys wrote:
> On 6/24/20 1:13 AM, Song Liu wrote:
>> On Tue, Jun 23, 2020 at 6:17 AM Michal Soltys <msoltyspl@yandex.pl> 
>> wrote:
>>
>> Hmm.. this is weird, as I think I marked every instance of set_bit
>> MD_SB_CHANGE_PENDING.
>> Would you mind confirm those are to the other array with something like:
>>
>> diff --git i/drivers/md/md.c w/drivers/md/md.c
>> index dbbc8a50e2ed2..e91acfdcec032 100644
>> --- i/drivers/md/md.c
>> +++ w/drivers/md/md.c
>> @@ -8480,7 +8480,7 @@ bool md_write_start(struct mddev *mddev, struct 
>> bio *bi)
>>                          mddev->in_sync = 0;
>>                          set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
>>                          set_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
>> -                       pr_info("%s set MD_SB_CHANGE_PENDING\n", 
>> __func__);
>> +                       pr_info("%s: md: %s set
>> MD_SB_CHANGE_PENDING\n", __func__, mdname(mddev));
>>                          md_wakeup_thread(mddev->thread);
>>                          did_change = 1;
>>                  }
>>
>> Thanks,
>> Song
>>
> 
> dmesg attached
> - md127 - journal
> - md126 - the other raid
> - md125 - the problematic one

So, what should be the next step after this?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Assemblin journaled array fails
  2020-07-06  8:07                                                                   ` Michal Soltys
@ 2020-07-06 22:08                                                                     ` Song Liu
  2020-07-08 11:29                                                                       ` Michal Soltys
  0 siblings, 1 reply; 46+ messages in thread
From: Song Liu @ 2020-07-06 22:08 UTC (permalink / raw)
  To: Michal Soltys; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 1653 bytes --]

On Mon, Jul 6, 2020 at 1:07 AM Michal Soltys <msoltyspl@yandex.pl> wrote:
>
> On 20/06/25 18:11, Michal Soltys wrote:
> > On 6/24/20 1:13 AM, Song Liu wrote:
> >> On Tue, Jun 23, 2020 at 6:17 AM Michal Soltys <msoltyspl@yandex.pl>
> >> wrote:
> >>
> >> Hmm.. this is weird, as I think I marked every instance of set_bit
> >> MD_SB_CHANGE_PENDING.
> >> Would you mind confirm those are to the other array with something like:
> >>
> >> diff --git i/drivers/md/md.c w/drivers/md/md.c
> >> index dbbc8a50e2ed2..e91acfdcec032 100644
> >> --- i/drivers/md/md.c
> >> +++ w/drivers/md/md.c
> >> @@ -8480,7 +8480,7 @@ bool md_write_start(struct mddev *mddev, struct
> >> bio *bi)
> >>                          mddev->in_sync = 0;
> >>                          set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
> >>                          set_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
> >> -                       pr_info("%s set MD_SB_CHANGE_PENDING\n",
> >> __func__);
> >> +                       pr_info("%s: md: %s set
> >> MD_SB_CHANGE_PENDING\n", __func__, mdname(mddev));
> >>                          md_wakeup_thread(mddev->thread);
> >>                          did_change = 1;
> >>                  }
> >>
> >> Thanks,
> >> Song
> >>
> >
> > dmesg attached
> > - md127 - journal
> > - md126 - the other raid
> > - md125 - the problematic one
>
> So, what kind of next step after this ?

Sorry for the delay. I read the log again, and found that the following
line caused this issue:

[ +16.088243] r5l_write_super_and_discard_space set MD_SB_CHANGE_PENDING

The attached patch should work around this issue. Could you please give it a try?
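
Roughly what seems to happen, as a self-contained userspace sketch (a toy
model of the interaction, not the kernel code): the recovery path waits
for all cached stripes to drain, while handle_stripe() makes no progress
as long as MD_SB_CHANGE_PENDING is set - and nothing clears the bit while
the array is still being assembled. The patch below just clears the bit
around the wait and restores it afterwards.

#include <stdio.h>
#include <stdbool.h>

static int active_stripes = 20001;	/* data-only stripes to flush */
static bool sb_change_pending = true;	/* set by the discard path */

/* one pass of the stripe handler over the active stripes */
static void handle_stripes(void)
{
	if (sb_change_pending)
		return;			/* stripes are only requeued */
	active_stripes = 0;		/* writes complete, stripes drain */
}

static void flush_data_only_stripes(bool apply_fix)
{
	bool cleared = false;

	if (apply_fix && sb_change_pending) {
		sb_change_pending = false;
		cleared = true;
	}

	/* stand-in for wait_event(..., active_stripes == 0) */
	for (long spins = 0; active_stripes != 0; spins++) {
		handle_stripes();
		if (spins > 1000000) {	/* would otherwise spin forever */
			printf("livelock: %d stripes still active\n",
			       active_stripes);
			return;
		}
	}
	printf("flushed, %d stripes active\n", active_stripes);

	if (cleared)			/* restore, as the patch does */
		sb_change_pending = true;
}

int main(void)
{
	flush_data_only_stripes(false);	/* current behaviour */
	active_stripes = 20001;
	flush_data_only_stripes(true);	/* with the workaround */
	return 0;
}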

Thanks,
song

[-- Attachment #2: 0001-md-raid5-cache-clear-MD_SB_CHANGE_PENDING-before-flu.patch --]
[-- Type: application/octet-stream, Size: 1432 bytes --]

From ce310728ea5e8420e9f62f11e3969c2100af7c81 Mon Sep 17 00:00:00 2001
From: Song Liu <songliubraving@fb.com>
Date: Mon, 6 Jul 2020 14:57:32 -0700
Subject: [PATCH] md/raid5-cache: clear MD_SB_CHANGE_PENDING before flushing
 data only stripes

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5-cache.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 9b6da759dca25..0bea21d81697e 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -2430,10 +2430,15 @@ static void r5c_recovery_flush_data_only_stripes(struct r5l_log *log,
 	struct mddev *mddev = log->rdev->mddev;
 	struct r5conf *conf = mddev->private;
 	struct stripe_head *sh, *next;
+	bool cleared_pending = false;
 
 	if (ctx->data_only_stripes == 0)
 		return;
 
+	if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) {
+		cleared_pending = true;
+		clear_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
+	}
 	log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_BACK;
 
 	list_for_each_entry_safe(sh, next, &ctx->cached_list, lru) {
@@ -2448,6 +2453,8 @@ static void r5c_recovery_flush_data_only_stripes(struct r5l_log *log,
 		   atomic_read(&conf->active_stripes) == 0);
 
 	log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
+	if (cleared_pending)
+		set_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
 }
 
 static int r5l_recovery_log(struct r5l_log *log)
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: Assemblin journaled array fails
  2020-07-06 22:08                                                                     ` Song Liu
@ 2020-07-08 11:29                                                                       ` Michal Soltys
  2020-07-08 23:51                                                                         ` Song Liu
  0 siblings, 1 reply; 46+ messages in thread
From: Michal Soltys @ 2020-07-08 11:29 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-raid

On 7/7/20 12:08 AM, Song Liu wrote:
>>
>> So, what kind of next step after this ?
> 
> Sorry for the delay. I read the log again, and found the following
> line caused this issue:
> 
> [ +16.088243] r5l_write_super_and_discard_space set MD_SB_CHANGE_PENDING
> 
> The attached patch should workaround this issue. Could you please give it a try?

Yeah, this solved the issue - the raid assembled correctly (so the patch
is probably a good candidate for LTS kernels).

Thanks for helping with this bug.

The underlying filesystems are mountable/usable as well - although a
read-only fsck (ext4) and btrfs check do find some minor issues; it's
tough to say at this point what the exact culprit was.

In this particular case - imho - one issue remains: the assembly is
slower than a full resync (without a bitmap), which - outside of some
performance gains (writeback journal) and the write-hole fix - largely
defeats the point of having such a resync policy in the first place.

dmesg -H | grep r5c_recovery_flush_log

[ +13.550877] r5c_recovery_flush_log processing ctx->seq 860700000
[Jul 7 15:16] r5c_recovery_flush_log processing ctx->seq 860800000
[Jul 7 15:40] r5c_recovery_flush_log processing ctx->seq 860900000
...
[Jul 8 06:40] r5c_recovery_flush_log processing ctx->seq 866300000
[Jul 8 06:58] r5c_recovery_flush_log processing ctx->seq 866400000
[Jul 8 07:20] r5c_recovery_flush_log processing ctx->seq 866500000

During the periods when I was testing your patches, the machine was
always basically idle - no cpu/io waits or anything else that could
hamper it. The reads from the journal device (SSDs) were averaging
1-4 MB/s.
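
For a rough sense of scale - a back-of-the-envelope sketch of my own,
assuming the replay issues synchronous 4 KiB page reads one at a time (as
the r5l_recovery_read_page -> submit_bio_wait stack earlier in the thread
suggests); the numbers below are just the observed throughput converted,
nothing measured beyond that:

#include <stdio.h>

int main(void)
{
	const double page_kib = 4.0;
	const double rates_mbs[] = { 1.0, 4.0 };	/* observed range */

	for (int i = 0; i < 2; i++) {
		double reads_per_sec = rates_mbs[i] * 1024.0 / page_kib;

		printf("%.0f MB/s ~ %.0f sync 4K reads/s (~%.2f ms each)\n",
		       rates_mbs[i], reads_per_sec, 1000.0 / reads_per_sec);
	}
	return 0;
}

So the journal replay looks bound by serialized per-page round trips (and
whatever work happens between them), not by the SSD's bandwidth.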

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Assemblin journaled array fails
  2020-07-08 11:29                                                                       ` Michal Soltys
@ 2020-07-08 23:51                                                                         ` Song Liu
  0 siblings, 0 replies; 46+ messages in thread
From: Song Liu @ 2020-07-08 23:51 UTC (permalink / raw)
  To: Michal Soltys; +Cc: linux-raid

On Wed, Jul 8, 2020 at 4:29 AM Michal Soltys <msoltyspl@yandex.pl> wrote:
>
> On 7/7/20 12:08 AM, Song Liu wrote:
> >>
> >> So, what kind of next step after this ?
> >
> > Sorry for the delay. I read the log again, and found the following
> > line caused this issue:
> >
> > [ +16.088243] r5l_write_super_and_discard_space set MD_SB_CHANGE_PENDING
> >
> > The attached patch should workaround this issue. Could you please give it a try?
>
> Yea, this solved the issue - the raid assembled correctly (so the patch
> is probably a good candidate for lts kernels).

Thanks for testing the patch. I will work on getting it applied upstream.

>
> Thanks for helping with this bug.
>
> Underlying filesystems are mountable/usable as well - albeit read-only
> fsck (ext4) or btrfs check do find some minor issues; tough to say at
> this point what was the exact culprit.
>
> In this particular case - imho - one issue remains: the assembly is
> slower than full resync (without bitmap), which outside of some
> performance gains (writeback journal) and write-hole fixing - kind of
> completely defeats the point of having such resync policy in the first
> place.

Totally agreed that we need to improve the recovery speed. I will also look
into that.

Thanks again!
Song

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2020-07-08 23:51 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-09 10:54 Assemblin journaled array fails Michal Soltys
2020-05-09 11:32 ` Michal Soltys
2020-05-09 13:15 ` antlists
2020-05-09 13:20   ` Michal Soltys
2020-05-09 13:32     ` Michal Soltys
2020-05-09 23:57 ` Michal Soltys
2020-05-11 11:11   ` Michal Soltys
2020-05-12  1:27     ` Song Liu
2020-05-13 10:58       ` Michal Soltys
2020-05-13 16:17         ` Song Liu
2020-05-14 11:07           ` Michal Soltys
2020-05-16  0:15             ` Song Liu
     [not found]               ` <d0340d7b-6b3a-4fd3-e446-5f0967132ef6@yandex.pl>
2020-05-18 23:55                 ` Song Liu
2020-05-20 12:08                   ` Michal Soltys
2020-05-25 16:23                   ` Michal Soltys
2020-05-26 23:37                     ` Song Liu
2020-05-27 13:36                       ` Michal Soltys
2020-05-28 10:18                         ` Michal Soltys
2020-05-28 15:06                           ` Song Liu
2020-05-28 15:36                             ` Michal Soltys
2020-05-29 12:09                             ` Michal Soltys
2020-05-29 21:03                               ` Song Liu
2020-05-29 21:57                                 ` Song Liu
2020-06-03 10:12                                   ` Michal Soltys
2020-06-03 22:07                                     ` Song Liu
2020-06-04 13:58                                       ` Michal Soltys
2020-06-05 12:26                                       ` Michal Soltys
2020-06-09  9:36                                         ` Michal Soltys
2020-06-09 18:36                                           ` Song Liu
2020-06-09 21:51                                             ` Michal Soltys
2020-06-16 11:11                                               ` Michal Soltys
2020-06-17  0:48                                               ` Song Liu
2020-06-17 12:53                                                 ` Michal Soltys
2020-06-17 17:11                                                   ` Song Liu
2020-06-19 11:35                                                     ` Michal Soltys
2020-06-20  0:14                                                       ` Song Liu
2020-06-22 11:12                                                         ` Michal Soltys
2020-06-22 16:37                                                           ` Song Liu
2020-06-23 13:17                                                             ` Michal Soltys
2020-06-23 23:13                                                               ` Song Liu
2020-06-25 16:11                                                                 ` Michal Soltys
2020-07-06  8:07                                                                   ` Michal Soltys
2020-07-06 22:08                                                                     ` Song Liu
2020-07-08 11:29                                                                       ` Michal Soltys
2020-07-08 23:51                                                                         ` Song Liu
2020-06-26  8:28             ` Guoqing Jiang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.