[SPDK] Re: Print backtrace in SPDK

* [SPDK] Re: Print backtrace in SPDK
@ 2020-08-30  7:52 Yang, Ziye
  0 siblings, 0 replies; 15+ messages in thread
From: Yang, Ziye @ 2020-08-30  7:52 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 15850 bytes --]

Hi Wenhua,

Thanks. It would be better that you can reproduce your issue in an easy way then submit a issue in github. Then the community can help you.

发自我的iPad

> 在 2020年8月30日，下午2:05，Wenhua Liu <liuw(a)vmware.com> 写道：
> 
> Hi Ziye,
> 
> I tested the patch you provided. It does not help. The problem still exists.
> 
> Thanks,
> -Wenhua
> 
> On 8/26/20, 10:09 PM, "Yang, Ziye" <ziye.yang(a)intel.com> wrote:
> 
>    Hi Wenhua,
> 
>    Thanks for your continuing verification. So there should be some issues with zero copy support in SPDK posix socket implementation in target side.
> 
> 
> 
> 
>    Best Regards
>    Ziye Yang 
> 
>    -----Original Message-----
>    From: Wenhua Liu <liuw(a)vmware.com> 
>    Sent: Thursday, August 27, 2020 1:05 PM
>    To: Storage Performance Development Kit <spdk(a)lists.01.org>
>    Subject: [SPDK] Re: Print backtrace in SPDK
> 
>    Hi Ziye,
> 
>    I have verified after disabling zero copy, the problem is gone. The following is the change I made to disable zero copy.
> 
>    spdk$ git diff module/sock/posix/posix.c diff --git a/module/sock/posix/posix.c b/module/sock/posix/posix.c index 4eb1bf106..7b77289bb 100644
>    --- a/module/sock/posix/posix.c
>    +++ b/module/sock/posix/posix.c
>    @@ -53,9 +53,9 @@
>     #define MIN_SO_SNDBUF_SIZE (2 * 1024 * 1024)  #define IOV_BATCH_SIZE 64
> 
>    -#if defined(SO_ZEROCOPY) && defined(MSG_ZEROCOPY) -#define SPDK_ZEROCOPY -#endif
>    +//#if defined(SO_ZEROCOPY) && defined(MSG_ZEROCOPY) //#define 
>    +SPDK_ZEROCOPY //#endif
> 
>     struct spdk_posix_sock {
>            struct spdk_sock        base;
>    ~/spdk$
> 
>    With this change, I did VM power-on and shutdown 8 times and did not see a single "Connection Reset by Peer" issue. Without the change, I did VM power-on and shutdown 4 times, every time I saw at least one "Connection Reset by Peer" error on every IO queue (4 IO queues in total).
> 
>    Thanks,
>    -Wenhua
> 
>    On 8/25/20, 9:51 PM, "Wenhua Liu" <liuw(a)vmware.com> wrote:
> 
>        I did not check errno. The only thing I knew is _sock_flush returns -1 which is the return value of sendmsg.
> 
>        Thanks,
>        -Wenhua
> 
>        On 8/25/20, 9:31 PM, "Yang, Ziye" <ziye.yang(a)intel.com> wrote:
> 
>            Hi  Wenhua,
> 
>            What's error number when you see that sendmsg function returns -1 when you use posix socket implmentation? 
> 
> 
> 
> 
>            Best Regards
>            Ziye Yang 
> 
>            -----Original Message-----
>            From: Wenhua Liu <liuw(a)vmware.com> 
>            Sent: Wednesday, August 26, 2020 12:27 PM
>            To: Storage Performance Development Kit <spdk(a)lists.01.org>
>            Subject: [SPDK] Re: Print backtrace in SPDK
> 
>            Hi Ziye,
> 
>            Back to April/May, I used SPDK 20.01 (the first release supported FUSED operation) in a VM and ran into this issue once in a while.
> 
>            Recently, in order to test NVMe Abort, I updated the SPDK in that VM to 20.07 and I started seeing this issue consistently. Maybe this is because the change at our side that makes the issue easier to reproduce.
> 
>            I spent a lot time debugging this issue and found in wire data, the TCP/IP FIN flag is set in the TCP packet in response to an NVME READ command. As FIN flag is set when closing TCP connection. With this information, I found it's the function nvmf_tcp_close_qpair close the TCP connection. To figure out how this function is called, I wanted to print stack trace but could not find a way, so I sent an email to the SPDK community asking for a solution. Later I used some other way and figured out the call path which points where the problem happens.
> 
>            I noticed the zero copy thing and tried to disable it but did not help (I can try it again to confirm). I started thinking if my VM itself has problem. I set up another VM with Ubuntu 20.04.1 and SPDK 20.07, but the problem still exists in this new target. As I could not find how sendmsg works and I noticed there is a uring based socket implementation. I wanted to give it a try so I asked you.
> 
>            I will let you know if disabling zero copy will help.
> 
>            Thanks,
>            -Wenhua
> 
>            On 8/25/20, 6:52 PM, "Yang, Ziye" <ziye.yang(a)intel.com> wrote:
> 
>                Hi Wenhua,
> 
>                Did you reproduce the issue you mentioned in last email with same VM environment (OS) and same SPDK version?  You mention that there is no issue with uring, but there is issue with posix on the same SPDK version?  Can you reproduce the issue with latest version in SPDK master branch.
> 
>                I think that the current difference with uring and posix is: For the posix implementation, it uses the zero copy feature. Could you do some experiments to disable the zero copy feature manually in posix.c like the following shows. Then we can firstly eliminate whether there is issue with zero copy feature on the target side. Thanks.
> 
>                #if defined(SO_ZEROCOPY) && defined(MSG_ZEROCOPY)
>                //#define SPDK_ZEROCOPY
>                #endif
> 
> 
> 
> 
>                Best Regards
>                Ziye Yang 
> 
>                -----Original Message-----
>                From: Wenhua Liu <liuw(a)vmware.com> 
>                Sent: Wednesday, August 26, 2020 8:20 AM
>                To: Storage Performance Development Kit <spdk(a)lists.01.org>
>                Subject: [SPDK] Re: Print backtrace in SPDK
> 
>                Hi Ziye,
> 
>                I'm using Ubuntu-20.04.1. The Linux kernel version seems to be 5.4.44 ~spdk$ cat /proc/version_signature Ubuntu 5.4.0-42.46-generic 5.4.44 ~/spdk$
> 
>                I downloaded, buit and installed liburing from source.
>                git clone https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Faxboe%2Fliburing.git&amp;data=02%7C01%7Cliuw%40vmware.com%7C758eeeac6b8041c3010e08d84a4756df%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637341017551671723&amp;sdata=s%2BnqigVqrppruGRJAxYG1mNGI18GqCec8JhStfcKi5g%3D&amp;reserved=0
> 
>                After switching to uring sock implementation,  the "connection reset by peer" problem is gone. I tried to power on and shutdown my testing VM and did not see one single "connection reset by peer" issue. Before this, every time, I powered on my testing VM, there were multiple "connection reset by peer" failures happened.
> 
>                Actually, I had this issue back to April/May. At that time, I could not identify/corelate how the issue happened and did not drill down. This time, the issue happened so frequently. This helped me dig out more information.
> 
>                In summary, it seems the posix sock implementation may have some problem. I'm not sure if this is generic or specific for running SPDK in VM. The issue might also be related to our initiator implementation.
> 
>                Thanks,
>                -Wenhua
> 
> 
>                On 8/24/20, 12:33 AM, "Yang, Ziye" <ziye.yang(a)intel.com> wrote:
> 
>                    Hi Wenhua,
> 
>                    You need to compile spdk with --with-uring option.  And you need to 
>                    1 Download the liburing and install it by yourself.
>                    2 Check your kernel version. Uring socket implementation depends on the kernel (> 5.4.3).
> 
>                    What's you kernel version in the VM?
> 
>                    Thanks.
> 
> 
> 
> 
>                    Best Regards
>                    Ziye Yang 
> 
>                    -----Original Message-----
>                    From: Wenhua Liu <liuw(a)vmware.com> 
>                    Sent: Monday, August 24, 2020 3:19 PM
>                    To: Storage Performance Development Kit <spdk(a)lists.01.org>
>                    Subject: [SPDK] Re: Print backtrace in SPDK
> 
>                    Hi Ziye,
> 
>                    I'm using SPDK NVMe-oF target.
> 
>                    I used some other way and figured out the following call path:
>                    posix_sock_group_impl_poll
>                    -> _sock_flush    <------------------ failed
>                    -> spdk_sock_abort_requests
>                       -> _pdu_write_done
>                          -> nvmf_tcp_qpair_disconnect
>                             -> spdk_nvmf_qpair_disconnect
>                                -> _nvmf_qpair_destroy
>                                   -> spdk_nvmf_poll_group_remove
>                                      -> nvmf_transport_poll_group_remove
>                                         -> nvmf_tcp_poll_group_remove
>                                            -> spdk_sock_group_remove_sock
>                                               -> posix_sock_group_impl_remove_sock
>                                                  -> spdk_sock_abort_requests
>                                   -> _nvmf_ctrlr_free_from_qpair
>                                      -> _nvmf_transport_qpair_fini
>                                         -> nvmf_transport_qpair_fini
>                                            -> nvmf_tcp_close_qpair
>                                               -> spdk_sock_close
> 
>                    The _sock_flush calls sendmsg to write the data to the socket. It's sendmsg failing with return value -1. I captured wire data. In Wireshark, I can see the READ command has been received by the target as a TCP packet. As the response to this TCP packet, a TCP packet with FIN flag set is sent to the initiator. The FIN is to close the socket connection.
> 
>                    I'm running SPDK target inside a VM. My NVMe/TCP initiator runs inside another VM. I'm going to try with another SPDK target which runs on a physical machine.
> 
>                    By the way, I noticed there is a uring based sock implementation,  how do I switch to this sock implementation. It seems the default is posix sock implementation.
> 
>                    Thanks,
>                    -Wenhua 
> 
>                    On 8/23/20, 9:55 PM, "Yang, Ziye" <ziye.yang(a)intel.com> wrote:
> 
>                        Hi Wenhua,
> 
>                        Which applications are you using from SPDK?  
>                        1 SPDK NVMe-oF target in target side?
>                        2  SPDK NVMe perf or others?
> 
>                        For nvmf_tcp_close_qpair will be called in the following possible cases (not all listed) for TCP transport. But it will be called by spdk_nvmf_qpair_disconnect as the entry.
> 
>                        1  qpair is not in polling group
>                        spdk_nvmf_qpair_disconnect
>                            nvmf_transport_qpair_fini
> 
>                        2  spdk_nvmf_qpair_disconnect
>                                ....
>                            _nvmf_qpair_destroy
>                                nvmf_transport_qpair_fini
>                                    ..
>                                    nvmf_tcp_close_qpair
> 
> 
>                        3  spdk_nvmf_qpair_disconnect
>                                ....
>                            _nvmf_qpair_destroy
>                                _nvmf_ctrlr_free_from_qpair    
>                                    _nvmf_transport_qpair_fini
>                                        ..
>                                        nvmf_tcp_close_qpair
> 
> 
>                        spdk_nvmf_qpair_disconnect is called by nvmf_tcp_qpair_disconnect in tcp.c. nvmf_tcp_qpair_disconnect is called in the following cases:
> 
>                        (1) _pdu_write_done (if there is error for write);
>                        (2) nvmf_tcp_qpair_handle_timeout.( No response from initiator in 30s if targets sends c2h_term_req)
>                        (3) nvmf_tcp_capsule_cmd_hdr_handle. (Cannot get tcp req)
>                        (4) nvmf_tcp_sock_cb.   TCP PDU related handling issue. 
> 
> 
>                        Also in lib/nvmf/ctrlr.c Target side has a timer poller: nvmf_ctrlr_keep_alive_poll. If there is no keep alive command sent from host, it will call spdk_nvmf_qpair_disconnect in related polling group assoicated with the controller.
> 
> 
>                        Best Regards
>                        Ziye Yang 
> 
>                        -----Original Message-----
>                        From: Wenhua Liu <liuw(a)vmware.com> 
>                        Sent: Saturday, August 22, 2020 3:15 PM
>                        To: Storage Performance Development Kit <spdk(a)lists.01.org>
>                        Subject: [SPDK] Print backtrace in SPDK
> 
>                        Hi,
> 
>                        Does anyone know if there is a function in SPDK that prints the backtrace?
> 
>                        I run into a “Connection Reset by Peer” issue on host side when testing NVMe/TCP. I identified it’s because some queue pairs are closed unexpectedly by calling nvmf_tcp_close_qpair, but I could not figure out how/why this function is called. I thought if the backtrace can be printed when calling this function, it might be helpful to me to find the root cause.
> 
>                        Thanks,
>                        -Wenhua
>                        _______________________________________________
>                        SPDK mailing list -- spdk(a)lists.01.org
>                        To unsubscribe send an email to spdk-leave(a)lists.01.org
>                        _______________________________________________
>                        SPDK mailing list -- spdk(a)lists.01.org
>                        To unsubscribe send an email to spdk-leave(a)lists.01.org
> 
>                    _______________________________________________
>                    SPDK mailing list -- spdk(a)lists.01.org
>                    To unsubscribe send an email to spdk-leave(a)lists.01.org
>                    _______________________________________________
>                    SPDK mailing list -- spdk(a)lists.01.org
>                    To unsubscribe send an email to spdk-leave(a)lists.01.org
> 
>                _______________________________________________
>                SPDK mailing list -- spdk(a)lists.01.org
>                To unsubscribe send an email to spdk-leave(a)lists.01.org
>                _______________________________________________
>                SPDK mailing list -- spdk(a)lists.01.org
>                To unsubscribe send an email to spdk-leave(a)lists.01.org
> 
>            _______________________________________________
>            SPDK mailing list -- spdk(a)lists.01.org
>            To unsubscribe send an email to spdk-leave(a)lists.01.org
>            _______________________________________________
>            SPDK mailing list -- spdk(a)lists.01.org
>            To unsubscribe send an email to spdk-leave(a)lists.01.org
> 
>        _______________________________________________
>        SPDK mailing list -- spdk(a)lists.01.org
>        To unsubscribe send an email to spdk-leave(a)lists.01.org
> 
>    _______________________________________________
>    SPDK mailing list -- spdk(a)lists.01.org
>    To unsubscribe send an email to spdk-leave(a)lists.01.org
>    _______________________________________________
>    SPDK mailing list -- spdk(a)lists.01.org
>    To unsubscribe send an email to spdk-leave(a)lists.01.org
> 
> _______________________________________________
> SPDK mailing list -- spdk(a)lists.01.org
> To unsubscribe send an email to spdk-leave(a)lists.01.org

^ permalink raw reply	[flat|nested] 15+ messages in thread