* blktests failures with v6.3
@ 2023-05-01  8:40 Shinichiro Kawasaki
From: Shinichiro Kawasaki @ 2023-05-01  8:40 UTC (permalink / raw)
  To: linux-block, linux-nvme, linux-scsi

Hi all,

I ran the latest blktests (git hash: c1da4a225273) with the v6.3 kernel and
observed test case failures. I call for support to fix them. (I myself am now
working on some of the block/011 failure symptoms.)

List of failures
================
#1: block/011
#2: block/024
#3: nvme/003 (fabrics transport)
#4: nvme/030 or nvme/031 (rdma transport with siw)
#5: nvme/* (fc transport)

This list excludes failures already resolved: scsi/007 and nvme/{044,045}.

Failure description
===================

#1: block/011

   This test case shows three failure symptoms.

   Symptom A:

     The test case fails with NVMe devices due to the lockdep WARNING "possible
     circular locking dependency detected". Reported in Sep/2022 [1], and a
     solution was discussed. A fix is still awaited.

     [1] https://lore.kernel.org/linux-block/20220930001943.zdbvolc3gkekfmcv@shindev/

   Symptom B:

     When this test case passes for an NVMe device, the device is occasionally
     left with zero capacity, which causes the following test cases to fail.
     The capacity change happens due to ENODEV during the test. I'm preparing
     a fix patch for blktests [2].

     [2] https://github.com/kawasaki/blktests/commit/f78ceec0cb95b9b2a9dcbf44999e350b740722e8
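     A blktests-side guard could look roughly like the sketch below;
     _check_capacity_file is a hypothetical helper name for illustration, not
     an existing blktests function, and the actual fix in [2] may differ.

```shell
# Hedged sketch: after block/011 completes, verify the test device did not
# lose its capacity (ENODEV while PCI was being disabled leaves size == 0).
# _check_capacity_file is a hypothetical helper, not a real blktests function.
_check_capacity_file() {
	# $1: path to the device's sysfs size file, e.g. /sys/block/nvme0n1/size
	local size
	size=$(cat "$1" 2>/dev/null || echo 0)
	if [ "$size" -eq 0 ]; then
		echo "test device capacity dropped to zero (ENODEV during test?)"
		return 1
	fi
	return 0
}
```

     In the real test the path would be derived from the configured test
     device; a zero size here means later test cases would run against an
     empty device and fail.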

   Symptom C:

     When the system disk and the test device belong to the same PCI device,
     the system disk disappears during the test case run (e.g., a system with
     two SSDs on the same AHCI controller or HBA). This results in a test
     system hang. I'm preparing patches for blktests to avoid this failure [3].

     [3] https://github.com/kawasaki/blktests/commit/662c33048ba1e949348dc2fca8b4778f940b7cc5
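     One way to guard against this is to compare the PCI function of the
     system disk with that of the test device and skip the test when they
     match. The helper below is a hypothetical sketch; the actual patches in
     [3] may take a different approach.

```shell
# Hedged sketch: extract the PCI function (e.g. 0000:00:1f.2) from a resolved
# sysfs device path, so a test can detect that the test device shares a PCI
# device with the system disk. Hypothetical helper for illustration only.
_pci_addr_of_path() {
	# $1: resolved sysfs path such as
	#     /sys/devices/pci0000:00/0000:00:1f.2/ata1/.../block/sda
	# The deepest match is the PCI function closest to the block device.
	echo "$1" | grep -Eo '[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]' | tail -1
}
```

     With p1=$(_pci_addr_of_path "$(readlink -f /sys/block/sda)") and the same
     for the test device, equal non-empty addresses would mean both disks sit
     behind one controller, so disabling it takes the system disk down too.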

#2: block/024

   Fails on slow machines. Reported in Dec/2022. A test-case-side issue is
   suspected; further investigation is still needed.

   block/024 (do I/O faster than a jiffy and check iostats times) [failed]
    runtime    ...  4.347s
    --- tests/block/024.out     2022-12-06 20:51:41.525066605 +0900
    +++ /home/shin/kts/kernel-test-suite/sets/blktests/log/runlog/nodev/block/024.out.bad       2022-12-07 12:51:03.610924521 +0900
    @@ -6,5 +6,5 @@
     read 1 s
     write 1 s
     read 2 s
    -write 3 s
    +write 4 s
     Test complete
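   For reference, the per-second read/write times this test prints come from
   the tick accounting in /sys/block/<dev>/stat. A minimal sketch of reading
   those counters, assuming the standard field order documented for the stat
   file (this is illustration, not the test's actual implementation):

```shell
# Hedged sketch of the accounting block/024 checks: read ticks and write
# ticks are the 4th and 8th fields of /sys/block/<dev>/stat, in milliseconds.
# If extra latency on a slow machine accumulates, the reported write time can
# round up by a second, which would match the "write 4 s" vs "write 3 s"
# diff above.
_rw_ticks_ms() {
	# $1: path to a stat file; prints "<read_ticks_ms> <write_ticks_ms>"
	awk '{ print $4, $8 }' "$1"
}
```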

#3: nvme/003 (fabrics transport)

   When the nvme test group is run with trtype=rdma, tcp or fc, the test case
   fails due to the lockdep WARNING "possible circular locking dependency
   detected". The detailed kernel message is attached [4].

#4: nvme/030 or nvme/031 (rdma transport with siw)

   When the nvme test group is run with the trtype=rdma and use_siw=1
   configuration, a failure is occasionally observed at nvme/030 or nvme/031.
   The failure cause is "BUG: KASAN: slab-use-after-free in __mutex_lock".
   The detailed kernel message is attached [5].
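   For reproduction, the runs for #3 and #4 can be sketched as below. The
   variable names follow this report (nvme_trtype, use_siw); check your
   blktests version for the exact config knobs, and note that the fc
   transport needs additional setup.

```shell
# Sketch of the reproduction runs for failures #3 and #4 (config fragment;
# variable names as used in this report, may differ in your blktests tree).
for t in rdma tcp fc; do
	nvme_trtype=$t ./check nvme/003        # failure #3 on each fabrics transport
done
nvme_trtype=rdma use_siw=1 ./check nvme/030 nvme/031   # failure #4 (soft-iWARP)
```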

#5: nvme/* (fc transport)

   With the trtype=fc configuration, many of the test cases in the nvme test
   group fail or hang. Daniel Wagner is driving the fix work [6].

   [6] https://lore.kernel.org/linux-nvme/20230418130159.11075-1-dwagner@suse.de/


[4]

[  134.432456] ======================================================
[  134.433289] WARNING: possible circular locking dependency detected
[  134.433806] 6.3.0 #62 Not tainted
[  134.434149] ------------------------------------------------------
[  134.434709] kworker/1:3/455 is trying to acquire lock:
[  134.435141] ffff88813bfd9420 (&id_priv->handler_mutex){+.+.}-{3:3}, at: rdma_destroy_id+0x17/0x20 [rdma_cm]
[  134.435986] 
               but task is already holding lock:
[  134.436554] ffff888131327db0 ((work_completion)(&queue->release_work)){+.+.}-{0:0}, at: process_one_work+0x793/0x1350
[  134.437459] 
               which lock already depends on the new lock.

[  134.438235] 
               the existing dependency chain (in reverse order) is:
[  134.438914] 
               -> #3 ((work_completion)(&queue->release_work)){+.+.}-{0:0}:
[  134.439688]        process_one_work+0x7dc/0x1350
[  134.440084]        worker_thread+0xfc/0x1260
[  134.440489]        kthread+0x29e/0x340
[  134.440857]        ret_from_fork+0x2c/0x50
[  134.441251] 
               -> #2 ((wq_completion)nvmet-wq){+.+.}-{0:0}:
[  134.441885]        __flush_workqueue+0x130/0x12f0
[  134.442328]        nvmet_rdma_cm_handler+0x961/0x39c0 [nvmet_rdma]
[  134.442900]        cma_cm_event_handler+0xb2/0x2f0 [rdma_cm]
[  134.443412]        cma_ib_req_handler+0x1096/0x4a50 [rdma_cm]
[  134.443925]        cm_process_work+0x46/0x3b0 [ib_cm]
[  134.444399]        cm_req_handler+0x20e2/0x61d0 [ib_cm]
[  134.444880]        cm_work_handler+0xc15/0x6ce0 [ib_cm]
[  134.445365]        process_one_work+0x843/0x1350
[  134.445794]        worker_thread+0xfc/0x1260
[  134.446165]        kthread+0x29e/0x340
[  134.447271]        ret_from_fork+0x2c/0x50
[  134.448373] 
               -> #1 (&id_priv->handler_mutex/1){+.+.}-{3:3}:
[  134.450542]        __mutex_lock+0x186/0x18f0
[  134.451682]        cma_ib_req_handler+0xc3c/0x4a50 [rdma_cm]
[  134.452921]        cm_process_work+0x46/0x3b0 [ib_cm]
[  134.454150]        cm_req_handler+0x20e2/0x61d0 [ib_cm]
[  134.455423]        cm_work_handler+0xc15/0x6ce0 [ib_cm]
[  134.456680]        process_one_work+0x843/0x1350
[  134.457865]        worker_thread+0xfc/0x1260
[  134.458985]        kthread+0x29e/0x340
[  134.460097]        ret_from_fork+0x2c/0x50
[  134.461294] 
               -> #0 (&id_priv->handler_mutex){+.+.}-{3:3}:
[  134.463349]        __lock_acquire+0x2fc0/0x60f0
[  134.464430]        lock_acquire+0x1a7/0x4e0
[  134.465448]        __mutex_lock+0x186/0x18f0
[  134.466461]        rdma_destroy_id+0x17/0x20 [rdma_cm]
[  134.467569]        nvmet_rdma_free_queue+0x7a/0x380 [nvmet_rdma]
[  134.468735]        nvmet_rdma_release_queue_work+0x3e/0x90 [nvmet_rdma]
[  134.469904]        process_one_work+0x843/0x1350
[  134.470923]        worker_thread+0xfc/0x1260
[  134.471918]        kthread+0x29e/0x340
[  134.472902]        ret_from_fork+0x2c/0x50
[  134.473884] 
               other info that might help us debug this:

[  134.476264] Chain exists of:
                 &id_priv->handler_mutex --> (wq_completion)nvmet-wq --> (work_completion)(&queue->release_work)

[  134.479201]  Possible unsafe locking scenario:

[  134.480949]        CPU0                    CPU1
[  134.481941]        ----                    ----
[  134.482871]   lock((work_completion)(&queue->release_work));
[  134.483938]                                lock((wq_completion)nvmet-wq);
[  134.485094]                                lock((work_completion)(&queue->release_work));
[  134.486391]   lock(&id_priv->handler_mutex);
[  134.487384] 
                *** DEADLOCK ***

[  134.489712] 2 locks held by kworker/1:3/455:
[  134.490728]  #0: ffff888127b28538 ((wq_completion)nvmet-wq){+.+.}-{0:0}, at: process_one_work+0x766/0x1350
[  134.492084]  #1: ffff888131327db0 ((work_completion)(&queue->release_work)){+.+.}-{0:0}, at: process_one_work+0x793/0x1350
[  134.493538] 
               stack backtrace:
[  134.495147] CPU: 1 PID: 455 Comm: kworker/1:3 Not tainted 6.3.0 #62
[  134.496262] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
[  134.497573] Workqueue: nvmet-wq nvmet_rdma_release_queue_work [nvmet_rdma]
[  134.498769] Call Trace:
[  134.499649]  <TASK>
[  134.500494]  dump_stack_lvl+0x57/0x90
[  134.501455]  check_noncircular+0x27b/0x310
[  134.502444]  ? __pfx_mark_lock+0x10/0x10
[  134.503413]  ? __pfx_check_noncircular+0x10/0x10
[  134.504431]  ? find_held_lock+0x2d/0x110
[  134.505399]  ? lockdep_lock+0xca/0x1c0
[  134.506351]  ? __pfx_lockdep_lock+0x10/0x10
[  134.507333]  ? mark_held_locks+0x9e/0xe0
[  134.508299]  __lock_acquire+0x2fc0/0x60f0
[  134.509272]  ? __pfx___wait_for_common+0x10/0x10
[  134.510292]  ? __pfx___lock_acquire+0x10/0x10
[  134.511296]  ? rxe_sched_task+0x50/0x70 [rdma_rxe]
[  134.512328]  ? rxe_post_recv+0x1d9/0x250 [rdma_rxe]
[  134.513366]  lock_acquire+0x1a7/0x4e0
[  134.514314]  ? rdma_destroy_id+0x17/0x20 [rdma_cm]
[  134.515353]  ? __pfx_lock_acquire+0x10/0x10
[  134.516345]  ? __pfx_ib_drain_qp_done+0x10/0x10 [ib_core]
[  134.517447]  ? __pfx___might_resched+0x10/0x10
[  134.518455]  __mutex_lock+0x186/0x18f0
[  134.519405]  ? rdma_destroy_id+0x17/0x20 [rdma_cm]
[  134.520436]  ? rdma_destroy_id+0x17/0x20 [rdma_cm]
[  134.521463]  ? __pfx___mutex_lock+0x10/0x10
[  134.522447]  ? rcu_is_watching+0x11/0xb0
[  134.523408]  ? rdma_destroy_id+0x17/0x20 [rdma_cm]
[  134.524434]  rdma_destroy_id+0x17/0x20 [rdma_cm]
[  134.525451]  nvmet_rdma_free_queue+0x7a/0x380 [nvmet_rdma]
[  134.526528]  nvmet_rdma_release_queue_work+0x3e/0x90 [nvmet_rdma]
[  134.527643]  process_one_work+0x843/0x1350
[  134.528600]  ? __pfx_lock_acquire+0x10/0x10
[  134.529562]  ? __pfx_process_one_work+0x10/0x10
[  134.530551]  ? __pfx_do_raw_spin_lock+0x10/0x10
[  134.531537]  worker_thread+0xfc/0x1260
[  134.532471]  ? __kthread_parkme+0xc1/0x1f0
[  134.533428]  ? __pfx_worker_thread+0x10/0x10
[  134.534385]  kthread+0x29e/0x340
[  134.535264]  ? __pfx_kthread+0x10/0x10
[  134.536164]  ret_from_fork+0x2c/0x50
[  134.537058]  </TASK>


[5]

May 01 11:19:44 testnode1 kernel: ==================================================================
May 01 11:19:44 testnode1 kernel: BUG: KASAN: slab-use-after-free in __mutex_lock+0x1324/0x18f0
May 01 11:19:44 testnode1 kernel: Read of size 8 at addr ffff888197b37418 by task kworker/u8:0/9
May 01 11:19:44 testnode1 kernel: 
May 01 11:19:44 testnode1 kernel: CPU: 0 PID: 9 Comm: kworker/u8:0 Not tainted 6.3.0 #62
May 01 11:19:44 testnode1 kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
May 01 11:19:44 testnode1 kernel: Workqueue: iw_cm_wq cm_work_handler [iw_cm]
May 01 11:19:44 testnode1 kernel: Call Trace:
May 01 11:19:44 testnode1 kernel:  <TASK>
May 01 11:19:44 testnode1 kernel:  dump_stack_lvl+0x57/0x90
May 01 11:19:44 testnode1 kernel:  print_report+0xcf/0x660
May 01 11:19:44 testnode1 kernel:  ? __mutex_lock+0x1324/0x18f0
May 01 11:19:44 testnode1 kernel:  kasan_report+0xa4/0xe0
May 01 11:19:44 testnode1 kernel:  ? __mutex_lock+0x1324/0x18f0
May 01 11:19:44 testnode1 kernel:  __mutex_lock+0x1324/0x18f0
May 01 11:19:44 testnode1 kernel:  ? cma_iw_handler+0xac/0x4f0 [rdma_cm]
May 01 11:19:44 testnode1 kernel:  ? _raw_spin_unlock_irqrestore+0x30/0x60
May 01 11:19:44 testnode1 kernel:  ? rcu_is_watching+0x11/0xb0
May 01 11:19:44 testnode1 kernel:  ? _raw_spin_unlock_irqrestore+0x30/0x60
May 01 11:19:44 testnode1 kernel:  ? trace_hardirqs_on+0x12/0x100
May 01 11:19:44 testnode1 kernel:  ? __pfx___mutex_lock+0x10/0x10
May 01 11:19:44 testnode1 kernel:  ? __percpu_counter_sum+0x147/0x1e0
May 01 11:19:44 testnode1 kernel:  ? domain_dirty_limits+0x246/0x390
May 01 11:19:44 testnode1 kernel:  ? wb_over_bg_thresh+0x4d5/0x610
May 01 11:19:44 testnode1 kernel:  ? rcu_is_watching+0x11/0xb0
May 01 11:19:44 testnode1 kernel:  ? cma_iw_handler+0xac/0x4f0 [rdma_cm]
May 01 11:19:44 testnode1 kernel:  cma_iw_handler+0xac/0x4f0 [rdma_cm]
May 01 11:19:44 testnode1 kernel:  ? rcu_is_watching+0x11/0xb0
May 01 11:19:44 testnode1 kernel:  ? __pfx_cma_iw_handler+0x10/0x10 [rdma_cm]
May 01 11:19:44 testnode1 kernel:  ? attach_entity_load_avg+0x4e2/0x920
May 01 11:19:44 testnode1 kernel:  ? _raw_spin_unlock_irqrestore+0x30/0x60
May 01 11:19:44 testnode1 kernel:  ? rcu_is_watching+0x11/0xb0
May 01 11:19:44 testnode1 kernel:  cm_work_handler+0x139e/0x1c50 [iw_cm]
May 01 11:19:44 testnode1 kernel:  ? __pfx_cm_work_handler+0x10/0x10 [iw_cm]
May 01 11:19:44 testnode1 kernel:  ? rcu_is_watching+0x11/0xb0
May 01 11:19:44 testnode1 kernel:  ? __pfx_try_to_wake_up+0x10/0x10
May 01 11:19:44 testnode1 kernel:  ? __pfx_do_raw_spin_lock+0x10/0x10
May 01 11:19:44 testnode1 kernel:  ? __pfx___might_resched+0x10/0x10
May 01 11:19:44 testnode1 kernel:  ? _raw_spin_unlock_irq+0x24/0x50
May 01 11:19:44 testnode1 kernel:  process_one_work+0x843/0x1350
May 01 11:19:44 testnode1 kernel:  ? __pfx_lock_acquire+0x10/0x10
May 01 11:19:44 testnode1 kernel:  ? __pfx_process_one_work+0x10/0x10
May 01 11:19:44 testnode1 kernel:  ? __pfx_do_raw_spin_lock+0x10/0x10
May 01 11:19:44 testnode1 kernel:  worker_thread+0xfc/0x1260
May 01 11:19:44 testnode1 kernel:  ? __pfx_worker_thread+0x10/0x10
May 01 11:19:44 testnode1 kernel:  kthread+0x29e/0x340
May 01 11:19:44 testnode1 kernel:  ? __pfx_kthread+0x10/0x10
May 01 11:19:44 testnode1 kernel:  ret_from_fork+0x2c/0x50
May 01 11:19:44 testnode1 kernel:  </TASK>
May 01 11:19:44 testnode1 kernel: 
May 01 11:19:44 testnode1 kernel: Allocated by task 4225:
May 01 11:19:44 testnode1 kernel:  kasan_save_stack+0x2f/0x50
May 01 11:19:44 testnode1 kernel:  kasan_set_track+0x21/0x30
May 01 11:19:44 testnode1 kernel:  __kasan_kmalloc+0xa6/0xb0
May 01 11:19:44 testnode1 kernel:  __rdma_create_id+0x5b/0x5d0 [rdma_cm]
May 01 11:19:44 testnode1 kernel:  __rdma_create_kernel_id+0x12/0x40 [rdma_cm]
May 01 11:19:44 testnode1 kernel:  nvme_rdma_alloc_queue+0x26a/0x5f0 [nvme_rdma]
May 01 11:19:44 testnode1 kernel:  nvme_rdma_setup_ctrl+0xb84/0x1d90 [nvme_rdma]
May 01 11:19:44 testnode1 kernel:  nvme_rdma_create_ctrl+0x7b5/0xd20 [nvme_rdma]
May 01 11:19:44 testnode1 kernel:  nvmf_dev_write+0xddd/0x22b0 [nvme_fabrics]
May 01 11:19:44 testnode1 kernel:  vfs_write+0x211/0xd50
May 01 11:19:44 testnode1 kernel:  ksys_write+0x100/0x1e0
May 01 11:19:44 testnode1 kernel:  do_syscall_64+0x5b/0x80
May 01 11:19:44 testnode1 kernel:  entry_SYSCALL_64_after_hwframe+0x72/0xdc
May 01 11:19:44 testnode1 kernel: 
May 01 11:19:44 testnode1 kernel: Freed by task 4227:
May 01 11:19:44 testnode1 kernel:  kasan_save_stack+0x2f/0x50
May 01 11:19:44 testnode1 kernel:  kasan_set_track+0x21/0x30
May 01 11:19:44 testnode1 kernel:  kasan_save_free_info+0x2a/0x50
May 01 11:19:44 testnode1 kernel:  ____kasan_slab_free+0x169/0x1c0
May 01 11:19:44 testnode1 kernel:  slab_free_freelist_hook+0xdb/0x1b0
May 01 11:19:44 testnode1 kernel:  __kmem_cache_free+0xb8/0x2e0
May 01 11:19:44 testnode1 kernel:  nvme_rdma_free_queue+0x4a/0x70 [nvme_rdma]
May 01 11:19:44 testnode1 kernel:  nvme_rdma_teardown_io_queues.part.0+0x14a/0x1e0 [nvme_rdma]
May 01 11:19:44 testnode1 kernel:  nvme_rdma_delete_ctrl+0x4f/0x100 [nvme_rdma]
May 01 11:19:44 testnode1 kernel:  nvme_do_delete_ctrl+0x14e/0x240 [nvme_core]
May 01 11:19:44 testnode1 kernel:  nvme_sysfs_delete+0xcb/0x100 [nvme_core]
May 01 11:19:44 testnode1 kernel:  kernfs_fop_write_iter+0x359/0x530
May 01 11:19:44 testnode1 kernel:  vfs_write+0x58f/0xd50
May 01 11:19:44 testnode1 kernel:  ksys_write+0x100/0x1e0
May 01 11:19:44 testnode1 kernel:  do_syscall_64+0x5b/0x80
May 01 11:19:44 testnode1 kernel:  entry_SYSCALL_64_after_hwframe+0x72/0xdc
May 01 11:19:44 testnode1 kernel: 
May 01 11:19:44 testnode1 kernel: The buggy address belongs to the object at ffff888197b37000
                                   which belongs to the cache kmalloc-2k of size 2048
May 01 11:19:44 testnode1 kernel: The buggy address is located 1048 bytes inside of
                                   freed 2048-byte region [ffff888197b37000, ffff888197b37800)
May 01 11:19:44 testnode1 kernel: 
May 01 11:19:44 testnode1 kernel: The buggy address belongs to the physical page:
May 01 11:19:44 testnode1 kernel: page:00000000fbe33a6e refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x197b30
May 01 11:19:44 testnode1 kernel: head:00000000fbe33a6e order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
May 01 11:19:44 testnode1 kernel: anon flags: 0x17ffffc0010200(slab|head|node=0|zone=2|lastcpupid=0x1fffff)
May 01 11:19:44 testnode1 kernel: raw: 0017ffffc0010200 ffff888100042f00 0000000000000000 dead000000000001
May 01 11:19:44 testnode1 kernel: raw: 0000000000000000 0000000000080008 00000001ffffffff 0000000000000000
May 01 11:19:44 testnode1 kernel: page dumped because: kasan: bad access detected
May 01 11:19:44 testnode1 kernel: 
May 01 11:19:44 testnode1 kernel: Memory state around the buggy address:
May 01 11:19:44 testnode1 kernel:  ffff888197b37300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
May 01 11:19:44 testnode1 kernel:  ffff888197b37380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
May 01 11:19:44 testnode1 kernel: >ffff888197b37400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
May 01 11:19:44 testnode1 kernel:                             ^
May 01 11:19:44 testnode1 kernel:  ffff888197b37480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
May 01 11:19:44 testnode1 kernel:  ffff888197b37500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
May 01 11:19:44 testnode1 kernel: ==================================================================
