From: Tony Lu <tonylu@linux.alibaba.com>
To: Jan Karcher <jaka@linux.ibm.com>
Cc: "D. Wythe" <alibuda@linux.alibaba.com>,
	kuba@kernel.org, davem@davemloft.net, netdev@vger.kernel.org,
	linux-s390@vger.kernel.org, linux-rdma@vger.kernel.org
Subject: Re: [PATCH net-next v3 00/10] optimize the parallelism of SMC-R connections
Date: Tue, 25 Oct 2022 14:13:32 +0800
Message-ID: <Y1d+jDQiyn4LSKlu@TonyMac-Alibaba>
In-Reply-To: <f8ea7943-4267-8b8d-f8b4-831fea7f3963@linux.ibm.com>

On Mon, Oct 24, 2022 at 03:10:54PM +0200, Jan Karcher wrote:
> Hi D. Wythe,
> 
> I reply with the feedback on your fix to your v4 fix.
> 
> Regarding your questions:
> We are aware of this situation and are currently evaluating how we want
> to deal with SMC-D in the future, because as of right now I can understand
> your frustration regarding the SMC-D testing.
> Please give me some time to reach the right people and collect some
> information to answer your question. I'll let you know as soon as I have an
> answer.

Hi Jan,

We sent an RFC [1] that mocks an SMC-D device for inter-VM communication.
Its original purpose was not testing, but for now it could be useful for
people in the community who want to test without physical devices.

The driver basically works, but I would like to improve it for testing.
Before I do that, what do you think about it?

Also, where should this driver live: in the kernel alongside the SMC code,
or in a separate set of SMC test cases? I haven't made up my mind yet.

[1] https://lore.kernel.org/netdev/20220720170048.20806-1-tonylu@linux.alibaba.com/

Cheers,
Tony Lu
> 
> Thanks
> - Jan
> 
> On 21/10/2022 17:57, D. Wythe wrote:
> > Hi Jan,
> > 
> > Sorry for this bug. It was my fault for not checking the code carefully
> > enough. Here is the problem:
> > 
> > int __init smc_core_init(void)
> > {
> >          int i;
> > 
> >          /* init smc lgr decision maker builder */
> >          for (i = 0; i < SMC_TYPE_D; i++)
> > 
> > 
> > i < SMC_TYPE_D should be changed to i <= SMC_TYPE_D; otherwise the
> > SMC-D related map is never initialized. I think both bugs were caused
> > by this.
> > 
> > 
> > I have reproduced the first problem and verified that this change fixes it.
> > Please help me check whether the SMC-D problem is fixed by this change
> > too, thanks.
> > 
> > By the way, is there any way to simulate an SMC-D device for testing?
> > All of our problems come from not considering SMC-D carefully enough.
> > In fact, we have some SMC-D related work planned for the future, and
> > bothering you every time does not seem like a good way to handle it.
> > 
> > 
> > Best Wishes.
> > D. Wythe
> > 
> > On 10/21/22 7:57 PM, Jan Karcher wrote:
> > > 
> > > 
> > > On 20/10/2022 09:00, D. Wythe wrote:
> > > > 
> > > > Hi Jan,
> > > > 
> > > > Sorry for the long delay. The main purpose of v3 is to make the
> > > > optimizations work on SMC-D as well. Due to my environment,
> > > > I can only test with SMC-R, so please help us verify
> > > > stability and functionality on SMC-D.
> > > > Thanks a lot.
> > > > 
> > > > If you have any problems, please let us know.
> > > > 
> > > > Besides, the bug-fix patches need to be reordered. After the code
> > > > review passes and the SMC-D tests are stable, I will adjust that
> > > > in the next series.
> > > > 
> > > > 
> > > 
> > > Hi D. Wythe,
> > > 
> > > thank you again for your submission. I ran the first tests and here
> > > are my findings:
> > > 
> > > For SMC-R we are facing problems during unloading of the smc module:
> > > 
> > > vvvvvvvvvv
> > > 
> > > [root@testsys10 ~]# dmesg -C
> > > [root@testsys10 ~]# dmesg
> > > [root@testsys10 ~]# rmmod ism
> > > [root@testsys10 ~]# rmmod smc_diag
> > > [root@testsys10 ~]# dmesg
> > > [   51.671365] smc: removing smcd device 1522:00:00.0
> > > [root@testsys10 ~]# rmmod smc
> > > [root@testsys10 ~]# dmesg
> > > [   51.671365] smc: removing smcd device 1522:00:00.0
> > > [   65.378445] NET: Unregistered PF_SMC protocol family
> > > [   65.378463] ------------[ cut here ]------------
> > > [   65.378465] WARNING: CPU: 0 PID: 1155 at kernel/workqueue.c:3066
> > > __flush_work.isra.0+0x28a/0x298
> > > [   65.378476] Modules linked in: nft_fib_inet nft_fib_ipv4
> > > nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6
> > > nft_reject nft_ct nft_chain_nat nf_nat mlx5_ib nf_conntrack
> > > ib_uverbs nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink
> > > mlx5_core smc(-) ib_core vfio_ccw s390_trng mdev vfio_iommu_type1
> > > vfio sch_fq_codel configfs ghash_s390 prng chacha_s390 libchacha
> > > aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390
> > > sha256_s390 sha1_s390 sha_common pkey zcrypt rng_core autofs4 [last
> > > unloaded: smc_diag]
> > > [   65.378509] CPU: 0 PID: 1155 Comm: rmmod Not tainted
> > > 6.1.0-rc1-00035-g9980a965416f #4
> > > [   65.378514] Hardware name: IBM 8561 T01 701 (z/VM 7.2.0)
> > > [   65.378517] Krnl PSW : 0704c00180000000 00000000f9d5f17e
> > > (__flush_work.isra.0+0x28e/0x298)
> > > [   65.378523]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3
> > > CC:0 PM:0 RI:0 EA:3
> > > [   65.380675] Krnl GPRS: 8000000000000001 0000000000000000
> > > 000003ff7fd40270 0000000000000000
> > > [   65.380683]            0000038000c73d70 000e002100000000
> > > 0000000000000000 0000000000000001
> > > [   65.380686]            0000038000c73d70 0000000000000000
> > > 000003ff7fd40270 000003ff7fd40270
> > > [   65.380688]            000000009b8d2100 000003ffe38f98f8
> > > 0000038000c73cd0 0000038000c73c30
> > > [   65.380697] Krnl Code: 00000000f9d5f172: a7780000            lhi %r7,0
> > >                            00000000f9d5f176: a7f4ff7b            brc
> > > 15,00000000f9d5f06c
> > >                           #00000000f9d5f17a: af000000
> > > mc      0,0
> > >                           >00000000f9d5f17e: a7780000            lhi
> > > %r7,0
> > >                            00000000f9d5f182: a7f4ff75            brc
> > > 15,00000000f9d5f06c
> > >                            00000000f9d5f186: 0707                bcr
> > > 0,%r7
> > >                            00000000f9d5f188: c004005daa34       
> > > brcl 0,00000000fa9145f0
> > >                            00000000f9d5f18e: ebaff0680024       
> > > stmg %r10,%r15,104(%r15)
> > > [   65.380773] Call Trace:
> > > [   65.380774]  [<00000000f9d5f17e>] __flush_work.isra.0+0x28e/0x298
> > > [   65.380779]  [<00000000f9d61228>] __cancel_work_timer+0x130/0x1c0
> > > [   65.380782]  [<00000000fa46b1b4>]
> > > rhashtable_free_and_destroy+0x2c/0x170
> > > [   65.380787]  [<000003ff7fd3a08e>] smc_exit+0x3e/0x1b8 [smc]
> > > [   65.380804]  [<00000000f9de946a>] __do_sys_delete_module+0x1a2/0x298
> > > [   65.380809]  [<00000000fa8f85ac>] __do_syscall+0x1d4/0x200
> > > [   65.380814]  [<00000000fa907722>] system_call+0x82/0xb0
> > > [   65.380817] Last Breaking-Event-Address:
> > > [   65.380818]  [<00000000f9d5ef24>] __flush_work.isra.0+0x34/0x298
> > > [   65.380820] ---[ end trace 0000000000000000 ]---
> > > [   65.380828] smc: removing ib device mlx5_0
> > > [   65.380833] smc: removing ib device mlx5_1
> > > 
> > > ^^^^^^^^^^
> > > 
> > > For SMC-D it seems like your decision maker is causing some trouble
> > > (a crash). I have not had time to look into it yet, but here is the
> > > console log anyway; maybe you will spot the problem faster than
> > > me:
> > > 
> > > 
> > > vvvvvvvvvv
> > > 
> > > [  135.528259] smc-tests: test_cs_security started
> > > [  136.397056] illegal operation: 0001 ilc:1 [#1] SMP
> > > [  136.397064] Modules linked in: tcp_diag inet_diag ism mlx5_ib
> > > ib_uverbs mlx5_core smc_diag smc ib_core vmur nft_fib_inet nft_fib_
> > > ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4
> > > nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack
> > > nf_defra
> > > g_ipv6 nf_defrag_ipv4 ip_set nf_tab
> > > [  136.397093] CPU: 0 PID: 9 Comm: kworker/0:1 Not tainted
> > > 6.1.0-rc1-00035-g1c11cab281ca #4
> > > [  136.397098] Hardware name: IBM 8561 T01 701 (z/VM 7.2.0)
> > > [  136.397100] Workqueue: smc_hs_wq smc_listen_work [smc]
> > > [  136.397123] Krnl PSW : 0704e00180000000 0000000000000002 (0x2)
> > > [  136.397128]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3
> > > CC:2 PM:0 RI:0 EA:3
> > > [  136.397133] Krnl GPRS: 0000000000000001 0000000000000000
> > > 00000000a5670600 0000000000000000
> > > 
> > > [  136.398410]            0000000000000000 000003ff7feee620
> > > 00000000000000c8 0000000000000000
> > > [  136.398417]            000003ff7feed2b8 00000000a5670600
> > > 000003ff7feed168 000003ff7fed1628
> > > [  136.398420]            0000000080334200 0000000000000001
> > > 000003ff7fed3ab0 0000037fffb5fa30
> > > [  136.398425] Krnl Code:#0000000000000000: 0000                illegal
> > > [  136.398425]           >0000000000000002: 0000                illegal
> > > [  136.398425]            0000000000000004: 0000                illegal
> > > [  136.398425]            0000000000000006: 0000                illegal
> > > [  136.398425]            0000000000000008: 0000                illegal
> > > [  136.398425]            000000000000000a: 0000                illegal
> > > [  136.398425]            000000000000000c: 0000                illegal
> > > [  136.398425]            000000000000000e: 0000                illegal
> > > [  136.398465] Call Trace:
> > > [  136.398469]  [<0000000000000002>] 0x2
> > > [  136.398472] ([<00000001790fdbde>] release_sock+0x6e/0xd8)
> > > [  136.398482]  [<000003ff7fed746a>] smc_conn_create+0xc2/0x9d8 [smc]
> > > 01: HCPGSP2629I The virtual machine is placed in CP mode due to a
> > > SIGP stop from CPU 01.
> > > 01: HCPGSP2629I The virtual machine is placed in CP mode due to a
> > > SIGP stop from CPU 00.
> > > 
> > > [  136.408436]  [<000003ff7fec8206>]
> > > smc_find_ism_v2_device_serv+0x186/0x288 [smc]
> > > [  136.408444]  [<000003ff7fec8336>]
> > > smc_listen_find_device+0x2e/0x370 [smc]
> > > [  136.408452]  [<000003ff7fecaa8a>] smc_listen_work+0x2ca/0x580 [smc]
> > > [  136.408459]  [<00000001788481e8>] process_one_work+0x200/0x458
> > > [  136.408466]  [<000000017884896e>] worker_thread+0x66/0x480
> > > [  136.408470]  [<0000000178851888>] kthread+0x108/0x110
> > > [  136.408474]  [<00000001787d72cc>] __ret_from_fork+0x3c/0x58
> > > [  136.408478]  [<00000001793ef75a>] ret_from_fork+0xa/0x40
> > > [  136.408484] Last Breaking-Event-Address:
> > > [  136.408486]  [<000003ff7fed3aae>]
> > > smc_get_or_create_lgr_decision_maker.constprop.0+0xe6/0x398 [smc]
> > > [  136.408495] Kernel panic - not syncing: Fatal exception in interrupt
> > > 
> > > ^^^^^^^^^^
> > > 
> > > - Jan
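
As an illustration of the loop-bound fix quoted above (i < SMC_TYPE_D
versus i <= SMC_TYPE_D), here is a minimal standalone sketch. It is not
the kernel code: the demo_* names, the enum values and the "ready" flag
are hypothetical and only model "one map per link group type, set up in
an init loop".

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-ins for the two link group types (values assumed). */
enum demo_type { DEMO_TYPE_R = 0, DEMO_TYPE_D = 1, DEMO_TYPE_COUNT };

/* Hypothetical per-type map; "ready" models "has been initialised". */
struct demo_map {
	bool ready;
};

static struct demo_map demo_maps[DEMO_TYPE_COUNT];

/*
 * Reset all maps, then initialise them in a loop.
 * inclusive == false uses the bound "i < DEMO_TYPE_D" (the reported bug),
 * inclusive == true  uses the bound "i <= DEMO_TYPE_D" (the proposed fix).
 * Returns the number of maps that were initialised.
 */
static int demo_init_maps(bool inclusive)
{
	int i, n = 0;

	memset(demo_maps, 0, sizeof(demo_maps));
	for (i = 0; inclusive ? i <= DEMO_TYPE_D : i < DEMO_TYPE_D; i++) {
		demo_maps[i].ready = true;
		n++;
	}
	return n;
}

/* True if the map for the given type was set up by demo_init_maps(). */
static bool demo_map_usable(enum demo_type t)
{
	return demo_maps[t].ready;
}
```

With the exclusive bound the loop stops before the D entry, so a later
lookup for that type finds a map that was never set up; with the
inclusive bound both entries are ready.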


Thread overview: 19+ messages
2022-10-20  6:43 [PATCH net-next v3 00/10] optimize the parallelism of SMC-R connections D.Wythe
2022-10-20  6:43 ` [PATCH net-next v3 01/10] net/smc: remove locks smc_client_lgr_pending and smc_server_lgr_pending D.Wythe
2022-10-20  6:43 ` [PATCH net-next v3 02/10] net/smc: fix SMC_CLC_DECL_ERR_REGRMB without smc_server_lgr_pending D.Wythe
2022-10-20  6:43 ` [PATCH net-next v3 03/10] net/smc: allow confirm/delete rkey response deliver multiplex D.Wythe
2022-10-20  6:43 ` [PATCH net-next v3 04/10] net/smc: make SMC_LLC_FLOW_RKEY run concurrently D.Wythe
2022-10-20  6:43 ` [PATCH net-next v3 05/10] net/smc: llc_conf_mutex refactor, replace it with rw_semaphore D.Wythe
2022-10-20  6:43 ` [PATCH net-next v3 06/10] net/smc: use read semaphores to reduce unnecessary blocking in smc_buf_create() & smcr_buf_unuse() D.Wythe
2022-10-20  6:43 ` [PATCH net-next v3 07/10] net/smc: reduce unnecessary blocking in smcr_lgr_reg_rmbs() D.Wythe
2022-10-20  6:43 ` [PATCH net-next v3 08/10] net/smc: replace mutex rmbs_lock and sndbufs_lock with rw_semaphore D.Wythe
2022-10-20  6:43 ` [PATCH net-next v3 09/10] net/smc: Fix potential panic dues to unprotected smc_llc_srv_add_link() D.Wythe
2022-10-20  6:43 ` [PATCH net-next v3 10/10] net/smc: fix application data exception D.Wythe
2022-10-20  7:00 ` [PATCH net-next v3 00/10] optimize the parallelism of SMC-R connections D. Wythe
2022-10-20  7:24   ` Jan Karcher
2022-10-21 11:57   ` Jan Karcher
2022-10-21 15:57     ` D. Wythe
2022-10-24 13:10       ` Jan Karcher
2022-10-25  6:13         ` Tony Lu [this message]
2022-10-26 13:12           ` Jan Karcher
2022-10-28  5:29             ` Tony Lu
