linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* 💥 PANICKED: Test report for kernel 5.9.0-rc3-020ad03.cki (block)
@ 2020-09-03 17:07 CKI Project
  2020-09-03 17:10 ` Rachel Sibley
  2020-09-04  1:02 ` Ming Lei
  0 siblings, 2 replies; 14+ messages in thread
From: CKI Project @ 2020-09-03 17:07 UTC (permalink / raw)
  To: linux-block, axboe; +Cc: Changhui Zhong


Hello,

We ran automated tests on a recent commit from this kernel tree:

       Kernel repo: https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git
            Commit: 020ad0333b03 - Merge branch 'for-5.10/block' into for-next

The results of these automated tests are provided below.

    Overall result: FAILED (see details below)
             Merge: OK
           Compile: OK
             Tests: PANICKED

All kernel binaries, config files, and logs are available for download here:

  https://cki-artifacts.s3.us-east-2.amazonaws.com/index.html?prefix=datawarehouse/2020/09/02/613166

One or more kernel tests failed:

    ppc64le:
     💥 storage: software RAID testing

    aarch64:
     💥 storage: software RAID testing

    x86_64:
     💥 storage: software RAID testing

We hope that these logs can help you find the problem quickly. For full
details on our testing procedures, please scroll to the bottom of this message.

Please reply to this email if you have any questions about the tests that we
ran or if you have any suggestions on how to make future tests more effective.

        ,-.   ,-.
       ( C ) ( K )  Continuous
        `-',-.`-'   Kernel
          ( I )     Integration
           `-'
______________________________________________________________________________

Compile testing
---------------

We compiled the kernel for 4 architectures:

    aarch64:
      make options: make -j30 INSTALL_MOD_STRIP=1 targz-pkg

    ppc64le:
      make options: make -j30 INSTALL_MOD_STRIP=1 targz-pkg

    s390x:
      make options: make -j30 INSTALL_MOD_STRIP=1 targz-pkg

    x86_64:
      make options: make -j30 INSTALL_MOD_STRIP=1 targz-pkg



Hardware testing
----------------
We booted each kernel and ran the following tests:

  aarch64:
    Host 1:
       ✅ Boot test
       ✅ ACPI table test
       ✅ LTP
       ✅ Loopdev Sanity
       ✅ Memory function: memfd_create
       ✅ AMTU (Abstract Machine Test Utility)
       ✅ Ethernet drivers sanity
       ✅ storage: SCSI VPD
       🚧 ✅ CIFS Connectathon
       🚧 ✅ POSIX pjd-fstest suites

    Host 2:

       ⚡ Internal infrastructure issues prevented one or more tests (marked
       with ⚡⚡⚡) from running on this architecture.
       This is not the fault of the kernel that was tested.

       ⚡⚡⚡ Boot test
       ⚡⚡⚡ xfstests - ext4
       ⚡⚡⚡ xfstests - xfs
       ⚡⚡⚡ storage: software RAID testing
       ⚡⚡⚡ stress: stress-ng
       🚧 ⚡⚡⚡ xfstests - btrfs
       🚧 ⚡⚡⚡ Storage blktests

    Host 3:
       ✅ Boot test
       ✅ xfstests - ext4
       ✅ xfstests - xfs
       💥 storage: software RAID testing
       ⚡⚡⚡ stress: stress-ng
       🚧 ⚡⚡⚡ xfstests - btrfs
       🚧 ⚡⚡⚡ Storage blktests

  ppc64le:
    Host 1:
       ✅ Boot test
       🚧 ✅ kdump - sysrq-c

    Host 2:
       ✅ Boot test
       ✅ xfstests - ext4
       ✅ xfstests - xfs
       💥 storage: software RAID testing
       🚧 ⚡⚡⚡ xfstests - btrfs
       🚧 ⚡⚡⚡ Storage blktests

    Host 3:

       ⚡ Internal infrastructure issues prevented one or more tests (marked
       with ⚡⚡⚡) from running on this architecture.
       This is not the fault of the kernel that was tested.

       ✅ Boot test
       ⚡⚡⚡ LTP
       ⚡⚡⚡ Loopdev Sanity
       ⚡⚡⚡ Memory function: memfd_create
       ⚡⚡⚡ AMTU (Abstract Machine Test Utility)
       ⚡⚡⚡ Ethernet drivers sanity
       🚧 ⚡⚡⚡ CIFS Connectathon
       🚧 ⚡⚡⚡ POSIX pjd-fstest suites

  s390x:
    Host 1:
       ✅ Boot test
       ✅ stress: stress-ng
       🚧 ✅ Storage blktests

    Host 2:
       ✅ Boot test
       ✅ LTP
       ✅ Loopdev Sanity
       ✅ Memory function: memfd_create
       ✅ AMTU (Abstract Machine Test Utility)
       ✅ Ethernet drivers sanity
       🚧 ✅ CIFS Connectathon
       🚧 ✅ POSIX pjd-fstest suites

  x86_64:
    Host 1:
       ✅ Boot test
       ✅ Storage SAN device stress - qedf driver

    Host 2:
       ⏱  Boot test
       ⏱  Storage SAN device stress - mpt3sas_gen1

    Host 3:
       ✅ Boot test
       ✅ xfstests - ext4
       ✅ xfstests - xfs
       💥 storage: software RAID testing
       ⚡⚡⚡ stress: stress-ng
       🚧 ⚡⚡⚡ xfstests - btrfs
       🚧 ⚡⚡⚡ Storage blktests

    Host 4:
       ✅ Boot test
       ✅ Storage SAN device stress - lpfc driver

    Host 5:
       ✅ Boot test
       🚧 ✅ kdump - sysrq-c

    Host 6:
       ✅ Boot test
       ✅ ACPI table test
       ✅ LTP
       ✅ Loopdev Sanity
       ✅ Memory function: memfd_create
       ✅ AMTU (Abstract Machine Test Utility)
       ✅ Ethernet drivers sanity
       ✅ kernel-rt: rt_migrate_test
       ✅ kernel-rt: rteval
       ✅ kernel-rt: sched_deadline
       ✅ kernel-rt: smidetect
       ✅ storage: SCSI VPD
       🚧 ✅ CIFS Connectathon
       🚧 ✅ POSIX pjd-fstest suites

    Host 7:
       ✅ Boot test
       ✅ kdump - sysrq-c - megaraid_sas

    Host 8:
       ✅ Boot test
       ✅ Storage SAN device stress - qla2xxx driver

    Host 9:
       ⏱  Boot test
       ⏱  kdump - sysrq-c - mpt3sas_gen1

  Test sources: https://gitlab.com/cki-project/kernel-tests
    💚 Pull requests are welcome for new tests or improvements to existing tests!

Aborted tests
-------------
Tests that didn't run to completion are marked with ⚡⚡⚡.
If this was caused by an infrastructure issue, we try to mark that
explicitly in the report.

Waived tests
------------
If the test run included waived tests, they are marked with 🚧. Such tests are
executed but their results are not taken into account. Tests are waived when
their results are not reliable enough, e.g. when they're just introduced or are
being fixed.

Testing timeout
---------------
We aim to provide a report within a reasonable timeframe. Tests that haven't
finished running yet are marked with ⏱.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: 💥 PANICKED: Test report for kernel 5.9.0-rc3-020ad03.cki (block)
  2020-09-03 17:07 💥 PANICKED: Test report for kernel 5.9.0-rc3-020ad03.cki (block) CKI Project
@ 2020-09-03 17:10 ` Rachel Sibley
  2020-09-03 17:46   ` Jens Axboe
  2020-09-04  1:02 ` Ming Lei
  1 sibling, 1 reply; 14+ messages in thread
From: Rachel Sibley @ 2020-09-03 17:10 UTC (permalink / raw)
  To: CKI Project, linux-block, axboe; +Cc: Changhui Zhong


On 9/3/20 1:07 PM, CKI Project wrote:
> 
> Hello,
> 
> We ran automated tests on a recent commit from this kernel tree:
> 
>         Kernel repo: https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git
>              Commit: 020ad0333b03 - Merge branch 'for-5.10/block' into for-next
> 
> The results of these automated tests are provided below.
> 
>      Overall result: FAILED (see details below)
>               Merge: OK
>             Compile: OK
>               Tests: PANICKED
> 
> All kernel binaries, config files, and logs are available for download here:
> 
>    https://cki-artifacts.s3.us-east-2.amazonaws.com/index.html?prefix=datawarehouse/2020/09/02/613166
> 
> One or more kernel tests failed:
> 
>      ppc64le:
>       💥 storage: software RAID testing
> 
>      aarch64:
>       💥 storage: software RAID testing
> 
>      x86_64:
>       💥 storage: software RAID testing

Hello,

We're seeing a panic on all non-s390x arches, triggered by the swraid test. It seems to be
reproducible in all subsequent pipelines after this one, and we haven't yet seen it in
mainline or in yesterday's block tree results.

Thank you,
Rachel

https://cki-artifacts.s3.us-east-2.amazonaws.com/datawarehouse/2020/09/02/613166/build_aarch64_redhat%3A968098/tests/8757835_aarch64_3_console.log

[ 8394.609219] Internal error: Oops: 96000004 [#1] SMP
[ 8394.614070] Modules linked in: raid0 loop raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx dm_log_writes dm_flakey 
rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rfkill sunrpc vfat fat xgene_hwmon xgene_enet at803x mdio_xgene xgene_rng 
xgene_edac mailbox_xgene_slimpro drm ip_tables xfs sdhci_of_arasan sdhci_pltfm i2c_xgene_slimpro crct10dif_ce sdhci gpio_dwapb cqhci xhci_plat_hcd 
gpio_xgene_sb gpio_keys aes_neon_bs
[ 8394.654298] CPU: 3 PID: 471427 Comm: kworker/3:2 Kdump: loaded Not tainted 5.9.0-rc3-020ad03.cki #1
[ 8394.663299] Hardware name: AppliedMicro X-Gene Mustang Board/X-Gene Mustang Board, BIOS 3.06.25 Oct 17 2016
[ 8394.672999] Workqueue: md_misc mddev_delayed_delete
[ 8394.677853] pstate: 40400085 (nZcv daIf +PAN -UAO BTYPE=--)
[ 8394.683399] pc : percpu_ref_exit+0x5c/0xc8
[ 8394.687473] lr : percpu_ref_exit+0x20/0xc8
[ 8394.691547] sp : ffff800019f33d00
[ 8394.694843] x29: ffff800019f33d00 x28: 0000000000000000
[ 8394.700129] x27: ffff0003c63ae000 x26: ffff8000120b6228
[ 8394.705414] x25: 0000000000000001 x24: ffff0003d8322a80
[ 8394.710698] x23: 0000000000000000 x22: 0000000000000000
[ 8394.715983] x21: 0000000000000000 x20: ffff8000121d2000
[ 8394.721266] x19: ffff0003d8322af0 x18: 0000000000000000
[ 8394.726550] x17: 0000000000000000 x16: 0000000000000000
[ 8394.731834] x15: 0000000000000007 x14: 0000000000000003
[ 8394.737119] x13: 0000000000000000 x12: ffff0003888a1978
[ 8394.742403] x11: ffff0003888a1918 x10: 0000000000000001
[ 8394.747688] x9 : 0000000000000000 x8 : 0000000000000000
[ 8394.752972] x7 : 0000000000000400 x6 : 0000000000000001
[ 8394.758257] x5 : ffff800010423030 x4 : ffff8000121d2e40
[ 8394.763540] x3 : 0000000000000000 x2 : 0000000000000000
[ 8394.768825] x1 : 0000000000000000 x0 : 0000000000000000
[ 8394.774110] Call trace:
[ 8394.776544]  percpu_ref_exit+0x5c/0xc8
[ 8394.780273]  md_free+0x64/0xa0
[ 8394.783311]  kobject_put+0x7c/0x218
[ 8394.786781]  mddev_delayed_delete+0x3c/0x50
[ 8394.790944]  process_one_work+0x1c4/0x450
[ 8394.794932]  worker_thread+0x164/0x4a8
[ 8394.798662]  kthread+0xf4/0x120
[ 8394.801787]  ret_from_fork+0x10/0x18
[ 8394.805344] Code: 2a0403e0 350002c0 a9400262 52800001 (f9400000)
[ 8394.811407] ---[ end trace 481cab6e1ad73da1 ]---


> [...]


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: 💥 PANICKED: Test report for kernel 5.9.0-rc3-020ad03.cki (block)
  2020-09-03 17:10 ` Rachel Sibley
@ 2020-09-03 17:46   ` Jens Axboe
  2020-09-03 18:59     ` Rachel Sibley
  0 siblings, 1 reply; 14+ messages in thread
From: Jens Axboe @ 2020-09-03 17:46 UTC (permalink / raw)
  To: Rachel Sibley, CKI Project, linux-block; +Cc: Changhui Zhong

On 9/3/20 11:10 AM, Rachel Sibley wrote:
> 
> On 9/3/20 1:07 PM, CKI Project wrote:
>> [...]
> 
> Hello,
> 
> We're seeing a panic for all non s390x arches triggered by swraid test. Seems to be reproducible
> for all succeeding pipelines after this one, and we haven't yet seen it in mainline or yesterday's
> block tree results.
> 
> Thank you,
> Rachel
> 
> https://cki-artifacts.s3.us-east-2.amazonaws.com/datawarehouse/2020/09/02/613166/build_aarch64_redhat%3A968098/tests/8757835_aarch64_3_console.log
> 
> [ 8394.609219] Internal error: Oops: 96000004 [#1] SMP
> [ 8394.614070] Modules linked in: raid0 loop raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx dm_log_writes dm_flakey 
> rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rfkill sunrpc vfat fat xgene_hwmon xgene_enet at803x mdio_xgene xgene_rng 
> xgene_edac mailbox_xgene_slimpro drm ip_tables xfs sdhci_of_arasan sdhci_pltfm i2c_xgene_slimpro crct10dif_ce sdhci gpio_dwapb cqhci xhci_plat_hcd 
> gpio_xgene_sb gpio_keys aes_neon_bs
> [ 8394.654298] CPU: 3 PID: 471427 Comm: kworker/3:2 Kdump: loaded Not tainted 5.9.0-rc3-020ad03.cki #1
> [ 8394.663299] Hardware name: AppliedMicro X-Gene Mustang Board/X-Gene Mustang Board, BIOS 3.06.25 Oct 17 2016
> [ 8394.672999] Workqueue: md_misc mddev_delayed_delete
> [ 8394.677853] pstate: 40400085 (nZcv daIf +PAN -UAO BTYPE=--)
> [ 8394.683399] pc : percpu_ref_exit+0x5c/0xc8
> [ 8394.687473] lr : percpu_ref_exit+0x20/0xc8
> [ 8394.691547] sp : ffff800019f33d00
> [ 8394.694843] x29: ffff800019f33d00 x28: 0000000000000000
> [ 8394.700129] x27: ffff0003c63ae000 x26: ffff8000120b6228
> [ 8394.705414] x25: 0000000000000001 x24: ffff0003d8322a80
> [ 8394.710698] x23: 0000000000000000 x22: 0000000000000000
> [ 8394.715983] x21: 0000000000000000 x20: ffff8000121d2000
> [ 8394.721266] x19: ffff0003d8322af0 x18: 0000000000000000
> [ 8394.726550] x17: 0000000000000000 x16: 0000000000000000
> [ 8394.731834] x15: 0000000000000007 x14: 0000000000000003
> [ 8394.737119] x13: 0000000000000000 x12: ffff0003888a1978
> [ 8394.742403] x11: ffff0003888a1918 x10: 0000000000000001
> [ 8394.747688] x9 : 0000000000000000 x8 : 0000000000000000
> [ 8394.752972] x7 : 0000000000000400 x6 : 0000000000000001
> [ 8394.758257] x5 : ffff800010423030 x4 : ffff8000121d2e40
> [ 8394.763540] x3 : 0000000000000000 x2 : 0000000000000000
> [ 8394.768825] x1 : 0000000000000000 x0 : 0000000000000000
> [ 8394.774110] Call trace:
> [ 8394.776544]  percpu_ref_exit+0x5c/0xc8
> [ 8394.780273]  md_free+0x64/0xa0
> [ 8394.783311]  kobject_put+0x7c/0x218
> [ 8394.786781]  mddev_delayed_delete+0x3c/0x50
> [ 8394.790944]  process_one_work+0x1c4/0x450
> [ 8394.794932]  worker_thread+0x164/0x4a8
> [ 8394.798662]  kthread+0xf4/0x120
> [ 8394.801787]  ret_from_fork+0x10/0x18
> [ 8394.805344] Code: 2a0403e0 350002c0 a9400262 52800001 (f9400000)
> [ 8394.811407] ---[ end trace 481cab6e1ad73da1 ]---

Ming, I wonder if this is:

commit d0c567d60f3730b97050347ea806e1ee06445c78
Author: Ming Lei <ming.lei@redhat.com>
Date:   Wed Sep 2 20:26:42 2020 +0800

    percpu_ref: reduce memory footprint of percpu_ref in fast path

Rachel, any chance you can do a run with that commit reverted?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: 💥 PANICKED: Test report for kernel 5.9.0-rc3-020ad03.cki (block)
  2020-09-03 17:46   ` Jens Axboe
@ 2020-09-03 18:59     ` Rachel Sibley
  2020-09-03 19:58       ` Veronika Kabatova
  0 siblings, 1 reply; 14+ messages in thread
From: Rachel Sibley @ 2020-09-03 18:59 UTC (permalink / raw)
  To: Jens Axboe, CKI Project, linux-block; +Cc: Changhui Zhong



On 9/3/20 1:46 PM, Jens Axboe wrote:
> On 9/3/20 11:10 AM, Rachel Sibley wrote:
>> [...]
> 
> Ming, I wonder if this is:
> 
> commit d0c567d60f3730b97050347ea806e1ee06445c78
> Author: Ming Lei <ming.lei@redhat.com>
> Date:   Wed Sep 2 20:26:42 2020 +0800
> 
>      percpu_ref: reduce memory footprint of percpu_ref in fast path
> 
> Rachel, any chance you can do a run with that commit reverted?

Hi Jens, yes we're working on it and will share our findings as soon as the job finishes.



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: 💥 PANICKED: Test report for kernel 5.9.0-rc3-020ad03.cki (block)
  2020-09-03 18:59     ` Rachel Sibley
@ 2020-09-03 19:58       ` Veronika Kabatova
  2020-09-03 20:53         ` Jens Axboe
  0 siblings, 1 reply; 14+ messages in thread
From: Veronika Kabatova @ 2020-09-03 19:58 UTC (permalink / raw)
  To: Jens Axboe; +Cc: CKI Project, linux-block, Changhui Zhong, Rachel Sibley



----- Original Message -----
> From: "Rachel Sibley" <rasibley@redhat.com>
> To: "Jens Axboe" <axboe@kernel.dk>, "CKI Project" <cki-project@redhat.com>, linux-block@vger.kernel.org
> Cc: "Changhui Zhong" <czhong@redhat.com>
> Sent: Thursday, September 3, 2020 8:59:48 PM
> Subject: Re: 💥 PANICKED: Test report for kernel 5.9.0-rc3-020ad03.cki (block)
> 
> 
> 
> On 9/3/20 1:46 PM, Jens Axboe wrote:
> > On 9/3/20 11:10 AM, Rachel Sibley wrote:
> >> [...]
> > 
> > Ming, I wonder if this is:
> > 
> > commit d0c567d60f3730b97050347ea806e1ee06445c78
> > Author: Ming Lei <ming.lei@redhat.com>
> > Date:   Wed Sep 2 20:26:42 2020 +0800
> > 
> >      percpu_ref: reduce memory footprint of percpu_ref in fast path
> > 
> > Rachel, any chance you can do a run with that commit reverted?
> 
> Hi Jens, yes we're working on it and will share our findings as soon as the
> job finishes.
> 

Hi Jens, we can confirm that there are no panics and the test passes
with the patch reverted.


We also realized that this patch is a likely cause of serious problems
on ppc64le during LTP testing as well, specifically msgstress04. Both
issues started occurring at the same time; we just didn't notice, as the
test was crashing.


[ 5682.999169] msgstress04 invoked oom-killer: gfp_mask=0x40cc0(GFP_KERNEL|__GFP_COMP), order=0, oom_score_adj=0 
[ 5682.999981] CPU: 1 PID: 170909 Comm: msgstress04 Kdump: loaded Not tainted 5.9.0-rc3-020ad03.cki #1 
[ 5683.000048] Call Trace: 
[ 5683.000098] [c00000023de972e0] [c000000000927e00] dump_stack+0xc4/0x114 (unreliable) 
[ 5683.000161] [c00000023de97330] [c000000000386958] dump_header+0x64/0x274 
[ 5683.000205] [c00000023de973c0] [c000000000385534] oom_kill_process+0x284/0x290 
[ 5683.000259] [c00000023de97400] [c0000000003862b0] out_of_memory+0x220/0x790 
[ 5683.000307] [c00000023de974a0] [c000000000408890] __alloc_pages_slowpath.constprop.0+0xd60/0xeb0 
[ 5683.000370] [c00000023de97670] [c000000000408d20] __alloc_pages_nodemask+0x340/0x400 
[ 5683.000426] [c00000023de97700] [c000000000434dec] alloc_pages_current+0xac/0x130 
[ 5683.000479] [c00000023de97750] [c000000000442fc4] allocate_slab+0x584/0x810 
[ 5683.000525] [c00000023de977c0] [c000000000447e7c] ___slab_alloc+0x44c/0xa30 
[ 5683.000571] [c00000023de978b0] [c000000000448494] __slab_alloc+0x34/0x60 
[ 5683.000615] [c00000023de978e0] [c000000000448b48] kmem_cache_alloc+0x688/0x700 
[ 5683.000671] [c00000023de97940] [c0000000003d9c80] __pud_alloc+0x70/0x1e0 
[ 5683.000717] [c00000023de97990] [c0000000003ddbb4] copy_page_range+0x1204/0x1490 
[ 5683.000779] [c00000023de97b20] [c00000000013b7c0] dup_mm+0x370/0x6e0 
[ 5683.000826] [c00000023de97bd0] [c00000000013ce10] copy_process+0xd20/0x1950 
[ 5683.000870] [c00000023de97c90] [c00000000013dc64] _do_fork+0xa4/0x560 
[ 5683.000915] [c00000023de97d00] [c00000000013e24c] __do_sys_clone+0x7c/0xa0 
[ 5683.000965] [c00000023de97dc0] [c00000000002f9a4] system_call_exception+0xe4/0x1c0 
[ 5683.001019] [c00000023de97e20] [c00000000000d140] system_call_common+0xf0/0x27c 

The test then manages to fill the console log with a good 4G of dumps;
this is actually visible in the ppc64le console log from the linked
artifacts (warning, it's a huge file!):

https://cki-artifacts.s3.us-east-2.amazonaws.com/datawarehouse/2020/09/02/613166/build_ppc64le_redhat%3A968099/tests/8757368_ppc64le_3_console.log


There are also more ppc64le traces in the other log (of reasonable size):
https://cki-artifacts.s3.us-east-2.amazonaws.com/datawarehouse/2020/09/02/613166/build_ppc64le_redhat%3A968099/tests/8757337_ppc64le_2_console.log


Veronika

> > 
> 
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: 💥 PANICKED: Test report for kernel 5.9.0-rc3-020ad03.cki (block)
  2020-09-03 19:58       ` Veronika Kabatova
@ 2020-09-03 20:53         ` Jens Axboe
  2020-09-04  3:22           ` Ming Lei
  0 siblings, 1 reply; 14+ messages in thread
From: Jens Axboe @ 2020-09-03 20:53 UTC (permalink / raw)
  To: Veronika Kabatova
  Cc: CKI Project, linux-block, Changhui Zhong, Rachel Sibley, Ming Lei

On 9/3/20 1:58 PM, Veronika Kabatova wrote:
> 
> 
> ----- Original Message -----
>> From: "Rachel Sibley" <rasibley@redhat.com>
>> To: "Jens Axboe" <axboe@kernel.dk>, "CKI Project" <cki-project@redhat.com>, linux-block@vger.kernel.org
>> Cc: "Changhui Zhong" <czhong@redhat.com>
>> Sent: Thursday, September 3, 2020 8:59:48 PM
>> Subject: Re: 💥 PANICKED: Test report for kernel 5.9.0-rc3-020ad03.cki (block)
>>
>>
>>
>> On 9/3/20 1:46 PM, Jens Axboe wrote:
>>> On 9/3/20 11:10 AM, Rachel Sibley wrote:
>>>>
>>>> On 9/3/20 1:07 PM, CKI Project wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> We ran automated tests on a recent commit from this kernel tree:
>>>>>
>>>>>          Kernel repo:
>>>>>          https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git
>>>>>               Commit: 020ad0333b03 - Merge branch 'for-5.10/block' into
>>>>>               for-next
>>>>>
>>>>> The results of these automated tests are provided below.
>>>>>
>>>>>       Overall result: FAILED (see details below)
>>>>>                Merge: OK
>>>>>              Compile: OK
>>>>>                Tests: PANICKED
>>>>>
>>>>> All kernel binaries, config files, and logs are available for download
>>>>> here:
>>>>>
>>>>>     https://cki-artifacts.s3.us-east-2.amazonaws.com/index.html?prefix=datawarehouse/2020/09/02/613166
>>>>>
>>>>> One or more kernel tests failed:
>>>>>
>>>>>       ppc64le:
>>>>>        💥 storage: software RAID testing
>>>>>
>>>>>       aarch64:
>>>>>        💥 storage: software RAID testing
>>>>>
>>>>>       x86_64:
>>>>>        💥 storage: software RAID testing
>>>>
>>>> Hello,
>>>>
>>>> We're seeing a panic on all non-s390x arches, triggered by the swraid
>>>> test. It seems to be reproducible in all subsequent pipelines after
>>>> this one, and we haven't yet seen it in mainline or in yesterday's
>>>> block tree results.
>>>>
>>>> Thank you,
>>>> Rachel
>>>>
>>>> https://cki-artifacts.s3.us-east-2.amazonaws.com/datawarehouse/2020/09/02/613166/build_aarch64_redhat%3A968098/tests/8757835_aarch64_3_console.log
>>>>
>>>> [ 8394.609219] Internal error: Oops: 96000004 [#1] SMP
>>>> [ 8394.614070] Modules linked in: raid0 loop raid456 async_raid6_recov
>>>> async_memcpy async_pq async_xor async_tx dm_log_writes dm_flakey
>>>> rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache
>>>> rfkill sunrpc vfat fat xgene_hwmon xgene_enet at803x mdio_xgene xgene_rng
>>>> xgene_edac mailbox_xgene_slimpro drm ip_tables xfs sdhci_of_arasan
>>>> sdhci_pltfm i2c_xgene_slimpro crct10dif_ce sdhci gpio_dwapb cqhci
>>>> xhci_plat_hcd
>>>> gpio_xgene_sb gpio_keys aes_neon_bs
>>>> [ 8394.654298] CPU: 3 PID: 471427 Comm: kworker/3:2 Kdump: loaded Not
>>>> tainted 5.9.0-rc3-020ad03.cki #1
>>>> [ 8394.663299] Hardware name: AppliedMicro X-Gene Mustang Board/X-Gene
>>>> Mustang Board, BIOS 3.06.25 Oct 17 2016
>>>> [ 8394.672999] Workqueue: md_misc mddev_delayed_delete
>>>> [ 8394.677853] pstate: 40400085 (nZcv daIf +PAN -UAO BTYPE=--)
>>>> [ 8394.683399] pc : percpu_ref_exit+0x5c/0xc8
>>>> [ 8394.687473] lr : percpu_ref_exit+0x20/0xc8
>>>> [ 8394.691547] sp : ffff800019f33d00
>>>> [ 8394.694843] x29: ffff800019f33d00 x28: 0000000000000000
>>>> [ 8394.700129] x27: ffff0003c63ae000 x26: ffff8000120b6228
>>>> [ 8394.705414] x25: 0000000000000001 x24: ffff0003d8322a80
>>>> [ 8394.710698] x23: 0000000000000000 x22: 0000000000000000
>>>> [ 8394.715983] x21: 0000000000000000 x20: ffff8000121d2000
>>>> [ 8394.721266] x19: ffff0003d8322af0 x18: 0000000000000000
>>>> [ 8394.726550] x17: 0000000000000000 x16: 0000000000000000
>>>> [ 8394.731834] x15: 0000000000000007 x14: 0000000000000003
>>>> [ 8394.737119] x13: 0000000000000000 x12: ffff0003888a1978
>>>> [ 8394.742403] x11: ffff0003888a1918 x10: 0000000000000001
>>>> [ 8394.747688] x9 : 0000000000000000 x8 : 0000000000000000
>>>> [ 8394.752972] x7 : 0000000000000400 x6 : 0000000000000001
>>>> [ 8394.758257] x5 : ffff800010423030 x4 : ffff8000121d2e40
>>>> [ 8394.763540] x3 : 0000000000000000 x2 : 0000000000000000
>>>> [ 8394.768825] x1 : 0000000000000000 x0 : 0000000000000000
>>>> [ 8394.774110] Call trace:
>>>> [ 8394.776544]  percpu_ref_exit+0x5c/0xc8
>>>> [ 8394.780273]  md_free+0x64/0xa0
>>>> [ 8394.783311]  kobject_put+0x7c/0x218
>>>> [ 8394.786781]  mddev_delayed_delete+0x3c/0x50
>>>> [ 8394.790944]  process_one_work+0x1c4/0x450
>>>> [ 8394.794932]  worker_thread+0x164/0x4a8
>>>> [ 8394.798662]  kthread+0xf4/0x120
>>>> [ 8394.801787]  ret_from_fork+0x10/0x18
>>>> [ 8394.805344] Code: 2a0403e0 350002c0 a9400262 52800001 (f9400000)
>>>> [ 8394.811407] ---[ end trace 481cab6e1ad73da1 ]---
>>>
>>> Ming, I wonder if this is:
>>>
>>> commit d0c567d60f3730b97050347ea806e1ee06445c78
>>> Author: Ming Lei <ming.lei@redhat.com>
>>> Date:   Wed Sep 2 20:26:42 2020 +0800
>>>
>>>      percpu_ref: reduce memory footprint of percpu_ref in fast path
>>>
>>> Rachel, any chance you can do a run with that commit reverted?
>>
>> Hi Jens, yes we're working on it and will share our findings as soon as the
>> job finishes.
>>
> 
> Hi Jens, we can confirm that there are no panics and the test passes
> with the patch reverted.
> 
> 
> We also realized that this patch is a likely cause of serious problems
> on ppc64le during LTP testing as well, specifically msgstress04. Both
> issues started occurring at the same time; we just didn't notice earlier
> because the test was crashing.
> 
> 
> [ 5682.999169] msgstress04 invoked oom-killer: gfp_mask=0x40cc0(GFP_KERNEL|__GFP_COMP), order=0, oom_score_adj=0 
> [ 5682.999981] CPU: 1 PID: 170909 Comm: msgstress04 Kdump: loaded Not tainted 5.9.0-rc3-020ad03.cki #1 
> [ 5683.000048] Call Trace: 
> [ 5683.000098] [c00000023de972e0] [c000000000927e00] dump_stack+0xc4/0x114 (unreliable) 
> [ 5683.000161] [c00000023de97330] [c000000000386958] dump_header+0x64/0x274 
> [ 5683.000205] [c00000023de973c0] [c000000000385534] oom_kill_process+0x284/0x290 
> [ 5683.000259] [c00000023de97400] [c0000000003862b0] out_of_memory+0x220/0x790 
> [ 5683.000307] [c00000023de974a0] [c000000000408890] __alloc_pages_slowpath.constprop.0+0xd60/0xeb0 
> [ 5683.000370] [c00000023de97670] [c000000000408d20] __alloc_pages_nodemask+0x340/0x400 
> [ 5683.000426] [c00000023de97700] [c000000000434dec] alloc_pages_current+0xac/0x130 
> [ 5683.000479] [c00000023de97750] [c000000000442fc4] allocate_slab+0x584/0x810 
> [ 5683.000525] [c00000023de977c0] [c000000000447e7c] ___slab_alloc+0x44c/0xa30 
> [ 5683.000571] [c00000023de978b0] [c000000000448494] __slab_alloc+0x34/0x60 
> [ 5683.000615] [c00000023de978e0] [c000000000448b48] kmem_cache_alloc+0x688/0x700 
> [ 5683.000671] [c00000023de97940] [c0000000003d9c80] __pud_alloc+0x70/0x1e0 
> [ 5683.000717] [c00000023de97990] [c0000000003ddbb4] copy_page_range+0x1204/0x1490 
> [ 5683.000779] [c00000023de97b20] [c00000000013b7c0] dup_mm+0x370/0x6e0 
> [ 5683.000826] [c00000023de97bd0] [c00000000013ce10] copy_process+0xd20/0x1950 
> [ 5683.000870] [c00000023de97c90] [c00000000013dc64] _do_fork+0xa4/0x560 
> [ 5683.000915] [c00000023de97d00] [c00000000013e24c] __do_sys_clone+0x7c/0xa0 
> [ 5683.000965] [c00000023de97dc0] [c00000000002f9a4] system_call_exception+0xe4/0x1c0 
> [ 5683.001019] [c00000023de97e20] [c00000000000d140] system_call_common+0xf0/0x27c 
> 
> The test then manages to fill the console log with a good 4G of dumps;
> this is actually visible in the ppc64le console log from the linked
> artifacts (warning, it's a huge file!):
> 
> https://cki-artifacts.s3.us-east-2.amazonaws.com/datawarehouse/2020/09/02/613166/build_ppc64le_redhat%3A968099/tests/8757368_ppc64le_3_console.log
> 
> 
> There are also more ppc64le traces in the other log (of reasonable size):
> https://cki-artifacts.s3.us-east-2.amazonaws.com/datawarehouse/2020/09/02/613166/build_ppc64le_redhat%3A968099/tests/8757337_ppc64le_2_console.log

I'll revert this change for now.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: 💥 PANICKED: Test report for kernel 5.9.0-rc3-020ad03.cki (block)
  2020-09-03 17:07 💥 PANICKED: Test report for kernel 5.9.0-rc3-020ad03.cki (block) CKI Project
  2020-09-03 17:10 ` Rachel Sibley
@ 2020-09-04  1:02 ` Ming Lei
  2020-09-04 11:06   ` Veronika Kabatova
  1 sibling, 1 reply; 14+ messages in thread
From: Ming Lei @ 2020-09-04  1:02 UTC (permalink / raw)
  To: CKI Project; +Cc: linux-block, axboe, Changhui Zhong

On Thu, Sep 03, 2020 at 05:07:57PM -0000, CKI Project wrote:
> 
> Hello,
> 
> We ran automated tests on a recent commit from this kernel tree:
> 
>        Kernel repo: https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git
>             Commit: 020ad0333b03 - Merge branch 'for-5.10/block' into for-next
> 
> The results of these automated tests are provided below.
> 
>     Overall result: FAILED (see details below)
>              Merge: OK
>            Compile: OK
>              Tests: PANICKED
> 
> All kernel binaries, config files, and logs are available for download here:
> 
>   https://cki-artifacts.s3.us-east-2.amazonaws.com/index.html?prefix=datawarehouse/2020/09/02/613166
> 
> One or more kernel tests failed:
> 
>     ppc64le:
>      💥 storage: software RAID testing
> 
>     aarch64:
>      💥 storage: software RAID testing
> 
>     x86_64:
>      💥 storage: software RAID testing
> 
> We hope that these logs can help you find the problem quickly. For the full
> detail on our testing procedures, please scroll to the bottom of this message.
> 
> Please reply to this email if you have any questions about the tests that we
> ran or if you have any suggestions on how to make future tests more effective.
> 
>         ,-.   ,-.
>        ( C ) ( K )  Continuous
>         `-',-.`-'   Kernel
>           ( I )     Integration
>            `-'
> ______________________________________________________________________________
> 
> Compile testing
> ---------------
> 
> We compiled the kernel for 4 architectures:
> 
>     aarch64:
>       make options: make -j30 INSTALL_MOD_STRIP=1 targz-pkg
> 
>     ppc64le:
>       make options: make -j30 INSTALL_MOD_STRIP=1 targz-pkg
> 
>     s390x:
>       make options: make -j30 INSTALL_MOD_STRIP=1 targz-pkg
> 
>     x86_64:
>       make options: make -j30 INSTALL_MOD_STRIP=1 targz-pkg
> 
> 
> 
> Hardware testing
> ----------------
> We booted each kernel and ran the following tests:
> 
>   aarch64:
>     Host 1:
>        ✅ Boot test
>        ✅ ACPI table test
>        ✅ LTP
>        ✅ Loopdev Sanity
>        ✅ Memory function: memfd_create
>        ✅ AMTU (Abstract Machine Test Utility)
>        ✅ Ethernet drivers sanity
>        ✅ storage: SCSI VPD
>        🚧 ✅ CIFS Connectathon
>        🚧 ✅ POSIX pjd-fstest suites
> 
>     Host 2:
> 
>        ⚡ Internal infrastructure issues prevented one or more tests (marked
>        with ⚡⚡⚡) from running on this architecture.
>        This is not the fault of the kernel that was tested.
> 
>        ⚡⚡⚡ Boot test
>        ⚡⚡⚡ xfstests - ext4
>        ⚡⚡⚡ xfstests - xfs
>        ⚡⚡⚡ storage: software RAID testing
>        ⚡⚡⚡ stress: stress-ng
>        🚧 ⚡⚡⚡ xfstests - btrfs
>        🚧 ⚡⚡⚡ Storage blktests
> 
>     Host 3:
>        ✅ Boot test
>        ✅ xfstests - ext4
>        ✅ xfstests - xfs
>        💥 storage: software RAID testing
>        ⚡⚡⚡ stress: stress-ng
>        🚧 ⚡⚡⚡ xfstests - btrfs
>        🚧 ⚡⚡⚡ Storage blktests
> 
>   ppc64le:
>     Host 1:
>        ✅ Boot test
>        🚧 ✅ kdump - sysrq-c
> 
>     Host 2:
>        ✅ Boot test
>        ✅ xfstests - ext4
>        ✅ xfstests - xfs
>        💥 storage: software RAID testing
>        🚧 ⚡⚡⚡ xfstests - btrfs
>        🚧 ⚡⚡⚡ Storage blktests
> 
>     Host 3:
> 
>        ⚡ Internal infrastructure issues prevented one or more tests (marked
>        with ⚡⚡⚡) from running on this architecture.
>        This is not the fault of the kernel that was tested.
> 
>        ✅ Boot test
>        ⚡⚡⚡ LTP
>        ⚡⚡⚡ Loopdev Sanity
>        ⚡⚡⚡ Memory function: memfd_create
>        ⚡⚡⚡ AMTU (Abstract Machine Test Utility)
>        ⚡⚡⚡ Ethernet drivers sanity
>        🚧 ⚡⚡⚡ CIFS Connectathon
>        🚧 ⚡⚡⚡ POSIX pjd-fstest suites
> 
>   s390x:
>     Host 1:
>        ✅ Boot test
>        ✅ stress: stress-ng
>        🚧 ✅ Storage blktests
> 
>     Host 2:
>        ✅ Boot test
>        ✅ LTP
>        ✅ Loopdev Sanity
>        ✅ Memory function: memfd_create
>        ✅ AMTU (Abstract Machine Test Utility)
>        ✅ Ethernet drivers sanity
>        🚧 ✅ CIFS Connectathon
>        🚧 ✅ POSIX pjd-fstest suites
> 
>   x86_64:
>     Host 1:
>        ✅ Boot test
>        ✅ Storage SAN device stress - qedf driver
> 
>     Host 2:
>        ⏱  Boot test
>        ⏱  Storage SAN device stress - mpt3sas_gen1
> 
>     Host 3:
>        ✅ Boot test
>        ✅ xfstests - ext4
>        ✅ xfstests - xfs
>        💥 storage: software RAID testing
>        ⚡⚡⚡ stress: stress-ng
>        🚧 ⚡⚡⚡ xfstests - btrfs
>        🚧 ⚡⚡⚡ Storage blktests
> 
>     Host 4:
>        ✅ Boot test
>        ✅ Storage SAN device stress - lpfc driver
> 
>     Host 5:
>        ✅ Boot test
>        🚧 ✅ kdump - sysrq-c
> 
>     Host 6:
>        ✅ Boot test
>        ✅ ACPI table test
>        ✅ LTP
>        ✅ Loopdev Sanity
>        ✅ Memory function: memfd_create
>        ✅ AMTU (Abstract Machine Test Utility)
>        ✅ Ethernet drivers sanity
>        ✅ kernel-rt: rt_migrate_test
>        ✅ kernel-rt: rteval
>        ✅ kernel-rt: sched_deadline
>        ✅ kernel-rt: smidetect
>        ✅ storage: SCSI VPD
>        🚧 ✅ CIFS Connectathon
>        🚧 ✅ POSIX pjd-fstest suites
> 
>     Host 7:
>        ✅ Boot test
>        ✅ kdump - sysrq-c - megaraid_sas
> 
>     Host 8:
>        ✅ Boot test
>        ✅ Storage SAN device stress - qla2xxx driver
> 
>     Host 9:
>        ⏱  Boot test
>        ⏱  kdump - sysrq-c - mpt3sas_gen1
> 
>   Test sources: https://gitlab.com/cki-project/kernel-tests

Hello,

Can you share with us the exact commands for setting up xfstests over
'software RAID testing' from the above tree?

BTW, I can't reproduce it by running xfstests generic/551 on my simple
raid10 setup, including the raid stop/remove steps.

Thanks,
Ming


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: 💥 PANICKED: Test report for kernel 5.9.0-rc3-020ad03.cki (block)
  2020-09-03 20:53         ` Jens Axboe
@ 2020-09-04  3:22           ` Ming Lei
  2020-09-04  3:37             ` Jens Axboe
  0 siblings, 1 reply; 14+ messages in thread
From: Ming Lei @ 2020-09-04  3:22 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Veronika Kabatova, CKI Project, linux-block, Changhui Zhong,
	Rachel Sibley, Song Liu, linux-raid

On Thu, Sep 03, 2020 at 02:53:39PM -0600, Jens Axboe wrote:
> On 9/3/20 1:58 PM, Veronika Kabatova wrote:
> > 
> > 
> > ----- Original Message -----
> >> From: "Rachel Sibley" <rasibley@redhat.com>
> >> To: "Jens Axboe" <axboe@kernel.dk>, "CKI Project" <cki-project@redhat.com>, linux-block@vger.kernel.org
> >> Cc: "Changhui Zhong" <czhong@redhat.com>
> >> Sent: Thursday, September 3, 2020 8:59:48 PM
> >> Subject: Re: 💥 PANICKED: Test report for kernel 5.9.0-rc3-020ad03.cki (block)
> >>
> >>
> >>
> >> On 9/3/20 1:46 PM, Jens Axboe wrote:
> >>> On 9/3/20 11:10 AM, Rachel Sibley wrote:
> >>>>
> >>>> On 9/3/20 1:07 PM, CKI Project wrote:
> >>>>>
> >>>>> Hello,
> >>>>>
> >>>>> We ran automated tests on a recent commit from this kernel tree:
> >>>>>
> >>>>>          Kernel repo:
> >>>>>          https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git
> >>>>>               Commit: 020ad0333b03 - Merge branch 'for-5.10/block' into
> >>>>>               for-next
> >>>>>
> >>>>> The results of these automated tests are provided below.
> >>>>>
> >>>>>       Overall result: FAILED (see details below)
> >>>>>                Merge: OK
> >>>>>              Compile: OK
> >>>>>                Tests: PANICKED
> >>>>>
> >>>>> All kernel binaries, config files, and logs are available for download
> >>>>> here:
> >>>>>
> >>>>>     https://cki-artifacts.s3.us-east-2.amazonaws.com/index.html?prefix=datawarehouse/2020/09/02/613166
> >>>>>
> >>>>> One or more kernel tests failed:
> >>>>>
> >>>>>       ppc64le:
> >>>>>        💥 storage: software RAID testing
> >>>>>
> >>>>>       aarch64:
> >>>>>        💥 storage: software RAID testing
> >>>>>
> >>>>>       x86_64:
> >>>>>        💥 storage: software RAID testing
> >>>>
> >>>> Hello,
> >>>>
> >>>> We're seeing a panic on all non-s390x arches, triggered by the swraid
> >>>> test. It seems to be reproducible in all subsequent pipelines after
> >>>> this one, and we haven't yet seen it in mainline or in yesterday's
> >>>> block tree results.
> >>>>
> >>>> Thank you,
> >>>> Rachel
> >>>>
> >>>> https://cki-artifacts.s3.us-east-2.amazonaws.com/datawarehouse/2020/09/02/613166/build_aarch64_redhat%3A968098/tests/8757835_aarch64_3_console.log
> >>>>
> >>>> [ 8394.609219] Internal error: Oops: 96000004 [#1] SMP
> >>>> [ 8394.614070] Modules linked in: raid0 loop raid456 async_raid6_recov
> >>>> async_memcpy async_pq async_xor async_tx dm_log_writes dm_flakey
> >>>> rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache
> >>>> rfkill sunrpc vfat fat xgene_hwmon xgene_enet at803x mdio_xgene xgene_rng
> >>>> xgene_edac mailbox_xgene_slimpro drm ip_tables xfs sdhci_of_arasan
> >>>> sdhci_pltfm i2c_xgene_slimpro crct10dif_ce sdhci gpio_dwapb cqhci
> >>>> xhci_plat_hcd
> >>>> gpio_xgene_sb gpio_keys aes_neon_bs
> >>>> [ 8394.654298] CPU: 3 PID: 471427 Comm: kworker/3:2 Kdump: loaded Not
> >>>> tainted 5.9.0-rc3-020ad03.cki #1
> >>>> [ 8394.663299] Hardware name: AppliedMicro X-Gene Mustang Board/X-Gene
> >>>> Mustang Board, BIOS 3.06.25 Oct 17 2016
> >>>> [ 8394.672999] Workqueue: md_misc mddev_delayed_delete
> >>>> [ 8394.677853] pstate: 40400085 (nZcv daIf +PAN -UAO BTYPE=--)
> >>>> [ 8394.683399] pc : percpu_ref_exit+0x5c/0xc8
> >>>> [ 8394.687473] lr : percpu_ref_exit+0x20/0xc8
> >>>> [ 8394.691547] sp : ffff800019f33d00
> >>>> [ 8394.694843] x29: ffff800019f33d00 x28: 0000000000000000
> >>>> [ 8394.700129] x27: ffff0003c63ae000 x26: ffff8000120b6228
> >>>> [ 8394.705414] x25: 0000000000000001 x24: ffff0003d8322a80
> >>>> [ 8394.710698] x23: 0000000000000000 x22: 0000000000000000
> >>>> [ 8394.715983] x21: 0000000000000000 x20: ffff8000121d2000
> >>>> [ 8394.721266] x19: ffff0003d8322af0 x18: 0000000000000000
> >>>> [ 8394.726550] x17: 0000000000000000 x16: 0000000000000000
> >>>> [ 8394.731834] x15: 0000000000000007 x14: 0000000000000003
> >>>> [ 8394.737119] x13: 0000000000000000 x12: ffff0003888a1978
> >>>> [ 8394.742403] x11: ffff0003888a1918 x10: 0000000000000001
> >>>> [ 8394.747688] x9 : 0000000000000000 x8 : 0000000000000000
> >>>> [ 8394.752972] x7 : 0000000000000400 x6 : 0000000000000001
> >>>> [ 8394.758257] x5 : ffff800010423030 x4 : ffff8000121d2e40
> >>>> [ 8394.763540] x3 : 0000000000000000 x2 : 0000000000000000
> >>>> [ 8394.768825] x1 : 0000000000000000 x0 : 0000000000000000
> >>>> [ 8394.774110] Call trace:
> >>>> [ 8394.776544]  percpu_ref_exit+0x5c/0xc8
> >>>> [ 8394.780273]  md_free+0x64/0xa0
> >>>> [ 8394.783311]  kobject_put+0x7c/0x218
> >>>> [ 8394.786781]  mddev_delayed_delete+0x3c/0x50
> >>>> [ 8394.790944]  process_one_work+0x1c4/0x450
> >>>> [ 8394.794932]  worker_thread+0x164/0x4a8
> >>>> [ 8394.798662]  kthread+0xf4/0x120
> >>>> [ 8394.801787]  ret_from_fork+0x10/0x18
> >>>> [ 8394.805344] Code: 2a0403e0 350002c0 a9400262 52800001 (f9400000)
> >>>> [ 8394.811407] ---[ end trace 481cab6e1ad73da1 ]---
> >>>
> >>> Ming, I wonder if this is:
> >>>
> >>> commit d0c567d60f3730b97050347ea806e1ee06445c78
> >>> Author: Ming Lei <ming.lei@redhat.com>
> >>> Date:   Wed Sep 2 20:26:42 2020 +0800
> >>>
> >>>      percpu_ref: reduce memory footprint of percpu_ref in fast path
> >>>
> >>> Rachel, any chance you can do a run with that commit reverted?
> >>
> >> Hi Jens, yes we're working on it and will share our findings as soon as the
> >> job finishes.
> >>
> > 
> > Hi Jens, we can confirm that there are no panics and the test passes
> > with the patch reverted.
> > 
> > 
> > We also realized that this patch is a likely cause of serious problems
> > on ppc64le during LTP testing as well, specifically msgstress04. Both
> > issues started occurring at the same time; we just didn't notice earlier
> > because the test was crashing.
> > 
> > 
> > [ 5682.999169] msgstress04 invoked oom-killer: gfp_mask=0x40cc0(GFP_KERNEL|__GFP_COMP), order=0, oom_score_adj=0 
> > [ 5682.999981] CPU: 1 PID: 170909 Comm: msgstress04 Kdump: loaded Not tainted 5.9.0-rc3-020ad03.cki #1 
> > [ 5683.000048] Call Trace: 
> > [ 5683.000098] [c00000023de972e0] [c000000000927e00] dump_stack+0xc4/0x114 (unreliable) 
> > [ 5683.000161] [c00000023de97330] [c000000000386958] dump_header+0x64/0x274 
> > [ 5683.000205] [c00000023de973c0] [c000000000385534] oom_kill_process+0x284/0x290 
> > [ 5683.000259] [c00000023de97400] [c0000000003862b0] out_of_memory+0x220/0x790 
> > [ 5683.000307] [c00000023de974a0] [c000000000408890] __alloc_pages_slowpath.constprop.0+0xd60/0xeb0 
> > [ 5683.000370] [c00000023de97670] [c000000000408d20] __alloc_pages_nodemask+0x340/0x400 
> > [ 5683.000426] [c00000023de97700] [c000000000434dec] alloc_pages_current+0xac/0x130 
> > [ 5683.000479] [c00000023de97750] [c000000000442fc4] allocate_slab+0x584/0x810 
> > [ 5683.000525] [c00000023de977c0] [c000000000447e7c] ___slab_alloc+0x44c/0xa30 
> > [ 5683.000571] [c00000023de978b0] [c000000000448494] __slab_alloc+0x34/0x60 
> > [ 5683.000615] [c00000023de978e0] [c000000000448b48] kmem_cache_alloc+0x688/0x700 
> > [ 5683.000671] [c00000023de97940] [c0000000003d9c80] __pud_alloc+0x70/0x1e0 
> > [ 5683.000717] [c00000023de97990] [c0000000003ddbb4] copy_page_range+0x1204/0x1490 
> > [ 5683.000779] [c00000023de97b20] [c00000000013b7c0] dup_mm+0x370/0x6e0 
> > [ 5683.000826] [c00000023de97bd0] [c00000000013ce10] copy_process+0xd20/0x1950 
> > [ 5683.000870] [c00000023de97c90] [c00000000013dc64] _do_fork+0xa4/0x560 
> > [ 5683.000915] [c00000023de97d00] [c00000000013e24c] __do_sys_clone+0x7c/0xa0 
> > [ 5683.000965] [c00000023de97dc0] [c00000000002f9a4] system_call_exception+0xe4/0x1c0 
> > [ 5683.001019] [c00000023de97e20] [c00000000000d140] system_call_common+0xf0/0x27c 
> > 
> > The test then manages to fill the console log with a good 4G of dumps;
> > this is actually visible in the ppc64le console log from the linked
> > artifacts (warning, it's a huge file!):
> > 
> > https://cki-artifacts.s3.us-east-2.amazonaws.com/datawarehouse/2020/09/02/613166/build_ppc64le_redhat%3A968099/tests/8757368_ppc64le_3_console.log
> > 
> > 
> > There are also more ppc64le traces in the other log (of reasonable size):
> > https://cki-artifacts.s3.us-east-2.amazonaws.com/datawarehouse/2020/09/02/613166/build_ppc64le_redhat%3A968099/tests/8757337_ppc64le_2_console.log
> 
> I'll revert this change for now.

It is an MD bug: percpu_ref_exit() may be called on a ref that was never
initialized via percpu_ref_init(). The following patch can fix the
issue:

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 607278207023..9c55489066d2 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5599,7 +5599,9 @@ static void md_free(struct kobject *ko)
                blk_cleanup_queue(mddev->queue);
        if (mddev->gendisk)
                put_disk(mddev->gendisk);
-       percpu_ref_exit(&mddev->writes_pending);
+
+       if (mddev->writes_pending.percpu_count_ptr)
+               percpu_ref_exit(&mddev->writes_pending);

        bioset_exit(&mddev->bio_set);
        bioset_exit(&mddev->sync_set);


Thanks,
Ming


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: 💥 PANICKED: Test report for kernel 5.9.0-rc3-020ad03.cki (block)
  2020-09-04  3:22           ` Ming Lei
@ 2020-09-04  3:37             ` Jens Axboe
  2020-09-04  4:24               ` Ming Lei
  0 siblings, 1 reply; 14+ messages in thread
From: Jens Axboe @ 2020-09-04  3:37 UTC (permalink / raw)
  To: Ming Lei
  Cc: Veronika Kabatova, CKI Project, linux-block, Changhui Zhong,
	Rachel Sibley, Song Liu, linux-raid

On 9/3/20 9:22 PM, Ming Lei wrote:
> It is an MD bug: percpu_ref_exit() may be called on a ref that was never
> initialized via percpu_ref_init(). The following patch can fix the
> issue:

I really (REALLY) think this should be handled by percpu_ref_exit(), if
it worked before. Otherwise you're just setting yourself up for a world
of pain with other users, and we'll be fixing this fallout for a while.
I don't want to carry that. So let's just make it do the right thing,
needing to do this:

> +       if (mddev->writes_pending.percpu_count_ptr)
> +               percpu_ref_exit(&mddev->writes_pending);

is really nasty.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: 💥 PANICKED: Test report for kernel 5.9.0-rc3-020ad03.cki (block)
  2020-09-04  3:37             ` Jens Axboe
@ 2020-09-04  4:24               ` Ming Lei
  2020-09-04 15:06                 ` Jens Axboe
  0 siblings, 1 reply; 14+ messages in thread
From: Ming Lei @ 2020-09-04  4:24 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Veronika Kabatova, CKI Project, linux-block, Changhui Zhong,
	Rachel Sibley, Song Liu, linux-raid

On Thu, Sep 03, 2020 at 09:37:40PM -0600, Jens Axboe wrote:
> On 9/3/20 9:22 PM, Ming Lei wrote:
> > It is an MD bug: percpu_ref_exit() may be called on a ref that was never
> > initialized via percpu_ref_init(). The following patch can fix the
> > issue:
> 
> I really (REALLY) think this should be handled by percpu_ref_exit(), if

OK, we can do that by returning immediately from percpu_ref_exit() if
percpu_count_ptr(ref) is 0, just like before.

> it worked before. Otherwise you're just setting yourself up for a world
> of pain with other users, and we'll be fixing this fallout for a while.
> I don't want to carry that. So let's just make it do the right thing,
> needing to do this:
> 
> > +       if (mddev->writes_pending.percpu_count_ptr)
> > +               percpu_ref_exit(&mddev->writes_pending);
> 
> is really nasty.

Yeah, it is, as seen in mddev_init_writes_pending():

        if (mddev->writes_pending.percpu_count_ptr)
                return 0;
        if (percpu_ref_init(&mddev->writes_pending, no_op,
                            PERCPU_REF_ALLOW_REINIT, GFP_KERNEL) < 0)
                return -ENOMEM;

thanks,
Ming


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: 💥 PANICKED: Test report for kernel 5.9.0-rc3-020ad03.cki (block)
  2020-09-04  1:02 ` Ming Lei
@ 2020-09-04 11:06   ` Veronika Kabatova
  2020-09-06  3:19     ` 💥 PANICKED: Test report for kernel " Ming Lei
  0 siblings, 1 reply; 14+ messages in thread
From: Veronika Kabatova @ 2020-09-04 11:06 UTC (permalink / raw)
  To: Ming Lei; +Cc: CKI Project, linux-block, Changhui Zhong, axboe



----- Original Message -----
> From: "Ming Lei" <ming.lei@redhat.com>
> To: "CKI Project" <cki-project@redhat.com>
> Cc: linux-block@vger.kernel.org, axboe@kernel.dk, "Changhui Zhong" <czhong@redhat.com>
> Sent: Friday, September 4, 2020 3:02:33 AM
> Subject: Re: 💥 PANICKED: Test report for kernel 5.9.0-rc3-020ad03.cki (block)
> 
> On Thu, Sep 03, 2020 at 05:07:57PM -0000, CKI Project wrote:
> > 
> > Hello,
> > 
> > We ran automated tests on a recent commit from this kernel tree:
> > 
> >        Kernel repo:
> >        https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git
> >             Commit: 020ad0333b03 - Merge branch 'for-5.10/block' into
> >             for-next
> > 
> > The results of these automated tests are provided below.
> > 
> >     Overall result: FAILED (see details below)
> >              Merge: OK
> >            Compile: OK
> >              Tests: PANICKED
> > 
> > All kernel binaries, config files, and logs are available for download
> > here:
> > 
> >   https://cki-artifacts.s3.us-east-2.amazonaws.com/index.html?prefix=datawarehouse/2020/09/02/613166
> > 
> > One or more kernel tests failed:
> > 
> >     ppc64le:
> >      💥 storage: software RAID testing
> > 
> >     aarch64:
> >      💥 storage: software RAID testing
> > 
> >     x86_64:
> >      💥 storage: software RAID testing
> > 
> > We hope that these logs can help you find the problem quickly. For the full
> > detail on our testing procedures, please scroll to the bottom of this
> > message.
> > 
> > Please reply to this email if you have any questions about the tests that
> > we
> > ran or if you have any suggestions on how to make future tests more
> > effective.
> > 
> >         ,-.   ,-.
> >        ( C ) ( K )  Continuous
> >         `-',-.`-'   Kernel
> >           ( I )     Integration
> >            `-'
> > ______________________________________________________________________________
> > 
> > Compile testing
> > ---------------
> > 
> > We compiled the kernel for 4 architectures:
> > 
> >     aarch64:
> >       make options: make -j30 INSTALL_MOD_STRIP=1 targz-pkg
> > 
> >     ppc64le:
> >       make options: make -j30 INSTALL_MOD_STRIP=1 targz-pkg
> > 
> >     s390x:
> >       make options: make -j30 INSTALL_MOD_STRIP=1 targz-pkg
> > 
> >     x86_64:
> >       make options: make -j30 INSTALL_MOD_STRIP=1 targz-pkg
> > 
> > 
> > 
> > Hardware testing
> > ----------------
> > We booted each kernel and ran the following tests:
> > 
> >   aarch64:
> >     Host 1:
> >        ✅ Boot test
> >        ✅ ACPI table test
> >        ✅ LTP
> >        ✅ Loopdev Sanity
> >        ✅ Memory function: memfd_create
> >        ✅ AMTU (Abstract Machine Test Utility)
> >        ✅ Ethernet drivers sanity
> >        ✅ storage: SCSI VPD
> >        🚧 ✅ CIFS Connectathon
> >        🚧 ✅ POSIX pjd-fstest suites
> > 
> >     Host 2:
> > 
> >        ⚡ Internal infrastructure issues prevented one or more tests (marked
> >        with ⚡⚡⚡) from running on this architecture.
> >        This is not the fault of the kernel that was tested.
> > 
> >        ⚡⚡⚡ Boot test
> >        ⚡⚡⚡ xfstests - ext4
> >        ⚡⚡⚡ xfstests - xfs
> >        ⚡⚡⚡ storage: software RAID testing
> >        ⚡⚡⚡ stress: stress-ng
> >        🚧 ⚡⚡⚡ xfstests - btrfs
> >        🚧 ⚡⚡⚡ Storage blktests
> > 
> >     Host 3:
> >        ✅ Boot test
> >        ✅ xfstests - ext4
> >        ✅ xfstests - xfs
> >        💥 storage: software RAID testing
> >        ⚡⚡⚡ stress: stress-ng
> >        🚧 ⚡⚡⚡ xfstests - btrfs
> >        🚧 ⚡⚡⚡ Storage blktests
> > 
> >   ppc64le:
> >     Host 1:
> >        ✅ Boot test
> >        🚧 ✅ kdump - sysrq-c
> > 
> >     Host 2:
> >        ✅ Boot test
> >        ✅ xfstests - ext4
> >        ✅ xfstests - xfs
> >        💥 storage: software RAID testing
> >        🚧 ⚡⚡⚡ xfstests - btrfs
> >        🚧 ⚡⚡⚡ Storage blktests
> > 
> >     Host 3:
> > 
> >        ⚡ Internal infrastructure issues prevented one or more tests (marked
> >        with ⚡⚡⚡) from running on this architecture.
> >        This is not the fault of the kernel that was tested.
> > 
> >        ✅ Boot test
> >        ⚡⚡⚡ LTP
> >        ⚡⚡⚡ Loopdev Sanity
> >        ⚡⚡⚡ Memory function: memfd_create
> >        ⚡⚡⚡ AMTU (Abstract Machine Test Utility)
> >        ⚡⚡⚡ Ethernet drivers sanity
> >        🚧 ⚡⚡⚡ CIFS Connectathon
> >        🚧 ⚡⚡⚡ POSIX pjd-fstest suites
> > 
> >   s390x:
> >     Host 1:
> >        ✅ Boot test
> >        ✅ stress: stress-ng
> >        🚧 ✅ Storage blktests
> > 
> >     Host 2:
> >        ✅ Boot test
> >        ✅ LTP
> >        ✅ Loopdev Sanity
> >        ✅ Memory function: memfd_create
> >        ✅ AMTU (Abstract Machine Test Utility)
> >        ✅ Ethernet drivers sanity
> >        🚧 ✅ CIFS Connectathon
> >        🚧 ✅ POSIX pjd-fstest suites
> > 
> >   x86_64:
> >     Host 1:
> >        ✅ Boot test
> >        ✅ Storage SAN device stress - qedf driver
> > 
> >     Host 2:
> >        ⏱  Boot test
> >        ⏱  Storage SAN device stress - mpt3sas_gen1
> > 
> >     Host 3:
> >        ✅ Boot test
> >        ✅ xfstests - ext4
> >        ✅ xfstests - xfs
> >        💥 storage: software RAID testing
> >        ⚡⚡⚡ stress: stress-ng
> >        🚧 ⚡⚡⚡ xfstests - btrfs
> >        🚧 ⚡⚡⚡ Storage blktests
> > 
> >     Host 4:
> >        ✅ Boot test
> >        ✅ Storage SAN device stress - lpfc driver
> > 
> >     Host 5:
> >        ✅ Boot test
> >        🚧 ✅ kdump - sysrq-c
> > 
> >     Host 6:
> >        ✅ Boot test
> >        ✅ ACPI table test
> >        ✅ LTP
> >        ✅ Loopdev Sanity
> >        ✅ Memory function: memfd_create
> >        ✅ AMTU (Abstract Machine Test Utility)
> >        ✅ Ethernet drivers sanity
> >        ✅ kernel-rt: rt_migrate_test
> >        ✅ kernel-rt: rteval
> >        ✅ kernel-rt: sched_deadline
> >        ✅ kernel-rt: smidetect
> >        ✅ storage: SCSI VPD
> >        🚧 ✅ CIFS Connectathon
> >        🚧 ✅ POSIX pjd-fstest suites
> > 
> >     Host 7:
> >        ✅ Boot test
> >        ✅ kdump - sysrq-c - megaraid_sas
> > 
> >     Host 8:
> >        ✅ Boot test
> >        ✅ Storage SAN device stress - qla2xxx driver
> > 
> >     Host 9:
> >        ⏱  Boot test
> >        ⏱  kdump - sysrq-c - mpt3sas_gen1
> > 
> >   Test sources: https://gitlab.com/cki-project/kernel-tests
> 
> Hello,
> 

Hi Ming,

First, the good news: both issues detected by the LTP and RAID tests are
officially gone after the revert. There's some x86_64 testing still
running, but the results look good so far!

> Can you share us the exact commands for setting up xfstests over
> 'software RAID testing' from the above tree?
> 

It's this test (which, given your @redhat email, you can also trigger
via internal Brew testing if you use the "stor" test set):

https://gitlab.com/cki-project/kernel-tests/-/tree/master/storage/swraid/trim

The important part of the test is:

https://gitlab.com/cki-project/kernel-tests/-/blob/master/storage/swraid/trim/main.sh#L27

The test maintainer (Changhui) is cced on this thread in case you need
any help or have questions about the test.



I'll just quickly mention: please be careful if you're planning on
testing LTP/msgstress04 on ppc64le in Beaker, as the conserver overload
is causing issues for lab owners.


Let us know if we can help you with something else,
Veronika

> BTW I can't reproduce it by running xfstest generic/551 on my simple raid10
> settings, include raid stop/remove steps.
> 
> Thanks,
> Ming
> 
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: 💥 PANICKED: Test report for kernel 5.9.0-rc3-020ad03.cki (block)
  2020-09-04  4:24               ` Ming Lei
@ 2020-09-04 15:06                 ` Jens Axboe
  0 siblings, 0 replies; 14+ messages in thread
From: Jens Axboe @ 2020-09-04 15:06 UTC (permalink / raw)
  To: Ming Lei
  Cc: Veronika Kabatova, CKI Project, linux-block, Changhui Zhong,
	Rachel Sibley, Song Liu, linux-raid

On 9/3/20 10:24 PM, Ming Lei wrote:
> On Thu, Sep 03, 2020 at 09:37:40PM -0600, Jens Axboe wrote:
>> On 9/3/20 9:22 PM, Ming Lei wrote:
>>> It is one MD's bug, and percpu_ref_exit() may be called on one ref not
>>> initialized via percpu_ref_init(), and the following patch can fix the
>>> issue:
>>
>> I really (REALLY) think this should be handled by percpu_ref_exit(), if
> 
> OK, we can do that by return immediately from percpu_ref_exit() if
> percpu_count_ptr(ref) is 0 just like before.

Yep, that's going to be a must; also see the recent syzbot report that
hits the same issue, just in the core block parts instead.

>> it worked before. Otherwise you're just setting yourself up for a world
>> of pain with other users, and we'll be fixing this fallout for a while.
>> I don't want to carry that. So let's just make it do the right thing,
>> needing to do this:
>>
>>> +       if (mddev->writes_pending.percpu_count_ptr)
>>> +               percpu_ref_exit(&mddev->writes_pending);
>>
>> is really nasty.
> 
> Yeah, it is as mddev_init_writes_pending():
> 
>         if (mddev->writes_pending.percpu_count_ptr)
>                 return 0;
>         if (percpu_ref_init(&mddev->writes_pending, no_op,
>                             PERCPU_REF_ALLOW_REINIT, GFP_KERNEL) < 0)
>                 return -ENOMEM;

Indeed, that's another eyesore... No user should need to know about
these internals. Maybe add a percpu_ref_inited() or something to test
for it; at least that'd allow us to clean up these bad use cases after
the fact.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: 💥 PANICKED: Test report for kernel 5.9.0-rc3-020ad03.cki (block)
  2020-09-04 11:06   ` Veronika Kabatova
@ 2020-09-06  3:19     ` Ming Lei
  2020-09-07 18:49       ` 💥 PANICKED: Test report for kernel " Veronika Kabatova
  0 siblings, 1 reply; 14+ messages in thread
From: Ming Lei @ 2020-09-06  3:19 UTC (permalink / raw)
  To: Veronika Kabatova; +Cc: CKI Project, linux-block, Changhui Zhong, axboe

Hi Veronika,

On Fri, Sep 04, 2020 at 07:06:25AM -0400, Veronika Kabatova wrote:
> 
> 
> ----- Original Message -----
> > From: "Ming Lei" <ming.lei@redhat.com>
> > To: "CKI Project" <cki-project@redhat.com>
> > Cc: linux-block@vger.kernel.org, axboe@kernel.dk, "Changhui Zhong" <czhong@redhat.com>
> > Sent: Friday, September 4, 2020 3:02:33 AM
> > Subject: Re: 💥 PANICKED: Test report for	kernel 5.9.0-rc3-020ad03.cki (block)
> > 
> > On Thu, Sep 03, 2020 at 05:07:57PM -0000, CKI Project wrote:
> > > [... full test report trimmed; quoted in full earlier in the thread ...]
> > 
> > Hello,
> > 
> 
> Hi Ming,
> 
> first the good news: Both issues detected by LTP and RAID test are
> officially gone after the revert. There's some x86_64 testing still
> running but the results look good so far!
> 
> > Can you share us the exact commands for setting up xfstests over
> > 'software RAID testing' from the above tree?
> > 
> 
> It's this test (which seeing your @redhat email, you can also trigger
> via internal Brew testing if you use the "stor" test set):
> 
> https://gitlab.com/cki-project/kernel-tests/-/tree/master/storage/swraid/trim
> 
> The important part of the test is:
> 
> https://gitlab.com/cki-project/kernel-tests/-/blob/master/storage/swraid/trim/main.sh#L27
> 
> The test maintainer (Changhui) is cced on this thread in case you need
> any help or have questions about the test.
> 
> 
> 
> I'll just quickly mention, please be careful if you're planning on
> testing LTP/msgstress04 on ppc64le in Beaker, as the conserver overload
> is causing issues to lab owners.
> 
> 
> Let us know if we can help you with something else,

I have verified that the revised patches do fix the kernel oops in the
'software RAID storage test'. However, I can't reproduce the OOM in
LTP/msgstress04.

Could you help check whether LTP/msgstress04 passes with the following
tree (top three patches), which is against the latest for-5.10/block:

	https://github.com/ming1/linux/commits/v5.9-rc-block-test

Thanks,
Ming


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: 💥 PANICKED: Test report for kernel 5.9.0-rc3-020ad03.cki (block)
  2020-09-06  3:19     ` 💥 PANICKED: Test report for?kernel " Ming Lei
@ 2020-09-07 18:49       ` Veronika Kabatova
  0 siblings, 0 replies; 14+ messages in thread
From: Veronika Kabatova @ 2020-09-07 18:49 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-block, axboe, CKI Project, Changhui Zhong



----- Original Message -----
> From: "Ming Lei" <ming.lei@redhat.com>
> To: "Veronika Kabatova" <vkabatov@redhat.com>
> Cc: linux-block@vger.kernel.org, axboe@kernel.dk, "CKI Project" <cki-project@redhat.com>, "Changhui Zhong"
> <czhong@redhat.com>
> Sent: Sunday, September 6, 2020 5:19:08 AM
> Subject: Re: 💥 PANICKED: Test report for kernel 5.9.0-rc3-020ad03.cki (block)
> 
> Hi Veronika,
> 
> On Fri, Sep 04, 2020 at 07:06:25AM -0400, Veronika Kabatova wrote:
> > 
> > 
> > ----- Original Message -----
> > > From: "Ming Lei" <ming.lei@redhat.com>
> > > To: "CKI Project" <cki-project@redhat.com>
> > > Cc: linux-block@vger.kernel.org, axboe@kernel.dk, "Changhui Zhong"
> > > <czhong@redhat.com>
> > > Sent: Friday, September 4, 2020 3:02:33 AM
> > > Subject: Re: 💥 PANICKED: Test report for	kernel 5.9.0-rc3-020ad03.cki
> > > (block)
> > > 
> > > On Thu, Sep 03, 2020 at 05:07:57PM -0000, CKI Project wrote:
> > > > [... full test report trimmed; quoted in full earlier in the thread ...]
> > > 
> > > Hello,
> > > 
> > 
> > [... earlier reply trimmed; quoted in full earlier in the thread ...]
> 
> I have verified that the revised patches do fix the kernel oops in the
> 'software RAID storage test'. However, I can't reproduce the OOM in
> LTP/msgstress04.
> 
> Could you help check whether LTP/msgstress04 passes with the following
> tree (top three patches), which is against the latest for-5.10/block:
> 
> 	https://github.com/ming1/linux/commits/v5.9-rc-block-test
> 

Hi,

I ran the affected ppc64le testing with your new patches, and it gave
the expected results.


We also got in touch with the LTP test maintainers. It looks like there are
some issues with the msgstress tests as well. These got amplified by the
patch and the combination caused the conserver overload. The tests
themselves need to be fixed too.

Veronika

> Thanks,
> Ming
> 
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2020-09-07 18:50 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-03 17:07 💥 PANICKED: Test report for kernel 5.9.0-rc3-020ad03.cki (block) CKI Project
2020-09-03 17:10 ` Rachel Sibley
2020-09-03 17:46   ` Jens Axboe
2020-09-03 18:59     ` Rachel Sibley
2020-09-03 19:58       ` Veronika Kabatova
2020-09-03 20:53         ` Jens Axboe
2020-09-04  3:22           ` Ming Lei
2020-09-04  3:37             ` Jens Axboe
2020-09-04  4:24               ` Ming Lei
2020-09-04 15:06                 ` Jens Axboe
2020-09-04  1:02 ` Ming Lei
2020-09-04 11:06   ` Veronika Kabatova
2020-09-06  3:19     ` 💥 PANICKED: Test report for?kernel " Ming Lei
2020-09-07 18:49       ` 💥 PANICKED: Test report for kernel " Veronika Kabatova

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).