All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: [SPDK] CI latent failures
@ 2018-07-20 13:37 Luse, Paul E
  0 siblings, 0 replies; 5+ messages in thread
From: Luse, Paul E @ 2018-07-20 13:37 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 5282 bytes --]

Hi Maciej,

Thanks for doing this! I've added the SPDK dist list to the email as this is the kind of info that everyone can benefit from. Not only do we need more community members stepping up and doing this kind of analysis, but if others are experiencing any of these things we might be able to get some traction in debug.

Are you by any chance able to cross-reference these with GitHub issues (not a deep comparison, at least a scan) to try to identify any that are already reported, and if so, add the CI failure link to the GitHub issue? For those you can't find, maybe enter one issue per item, but wait probably 24 hours to see if anyone out there jumps up and says "I've been looking at that one" (there's always hope)

Thanks again
Paul


Hi,
As we are close to release, we wanted to take a closer look at CI latent failures in case there is something we should address before the release. I went through all reported failures from this quarter and was able to divide them into 11 cases. Some of the failures were caused by a failing test environment, some were caused by other software, and some were probably caused by bugs. Below is a more detailed report for each of the 11 cases. I didn't want to go too deep into debugging, but rather draw conclusions based on my experience, so if you see that something is actually very different from what I assumed, please speak up.
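Since the error strings below recur, triaging a new build log against the known cases can be partly automated. The sketch below is only illustrative: the function name and case labels are ours, not part of any SPDK tooling, and it assumes the quoted error strings stay stable across runs.

```shell
#!/bin/sh
# Hypothetical triage helper: map a build-log excerpt (stdin) to one of
# the latent-failure cases listed below by matching the quoted error
# strings. First matching pattern wins; labels are illustrative only.
classify_failure() {
    log=$(cat)
    case $log in
        *"NBD_SET_SOCK"*)                 echo "case-3: nbd busy" ;;
        *"already registered"*)           echo "case-4: io_device double register" ;;
        *"Module nvme_rdma is in use"*)   echo "case-5: rmmod blocked" ;;
        *"Connection reset by peer"*)     echo "case-10: iscsi conn reset" ;;
        *"failed to create admin qpair"*) echo "case-11: rdma admin qpair" ;;
        *)                                echo "unclassified" ;;
    esac
}
```

Usage would be something like `classify_failure < build.log`, which prints the first matching case or "unclassified".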

Case 1.
Issue: Segmentation fault on NVMf shutdown
Number of failures: 8
Link to latest failure: http://spdk.intel.com/public/spdk/builds/review/97a5813b0bc8fa6b00aabf8b31f8a0b0776e8458.1531262768/fedora-06/build.log
Description: This one has occurred from time to time since around mid-June and is one of the most recent ones. I think this one should be examined more closely as a potential bug.

Case 2.
Issue: sock.c:249:11: runtime error: member access within null pointer of type 'struct spdk_sock' (aka spdk_iscsi_conn_destruct)
Number of failures: 5
Link to latest failure: http://spdk.intel.com/public/spdk/builds/review/0585c2426071f9506dbe520234e2c3ee2f5aee7d.1531468630/fedora-06/build.log
Description: This one occurs from time to time and is one of the most recent ones. I think this one should be examined more closely as a potential bug.

Case 3.
Issue: nbd.c: 879:spdk_nbd_start: *ERROR*: ioctl(NBD_SET_SOCK) failed: Device or resource busy
Number of failures: 2
Link to latest failure: http://spdk.intel.com/public/spdk/builds/review/c3ba7cb2ce0bd01ec61f93e852cd37d110fa9d9d.1531431333/ubuntu17.10/build.log
Description: Looks like an issue with NBD access (perhaps an environment clean-up problem after an earlier failure?).
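For what it's worth, a connected nbd device exposes a pid file under sysfs, so a pre-test guard could detect the stale attachment that makes NBD_SET_SOCK return "Device or resource busy". A sketch under that assumption, with the sysfs root parameterized so it can be exercised without a real /dev/nbd0 (the helper name is ours):

```shell
#!/bin/sh
# Sketch: check whether an nbd device still has a client attached from a
# previous (failed) run. While connected, the kernel nbd driver exposes
# /sys/block/nbdX/pid; if that file exists, a new NBD_SET_SOCK ioctl
# will fail with EBUSY until the device is detached
# (e.g. with nbd-client -d /dev/nbdX).
nbd_is_busy() {
    dev=$1
    sysfs=${2:-/sys/block}   # overridable for testing
    [ -e "$sysfs/$dev/pid" ]
}
```

A test setup script could loop over nbd0..nbd15 and detach any device this reports as busy before starting a new run.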

Case 4.
Issue: thread.c: 343:spdk_io_device_register: *ERROR*: io_device 0x1dd6040 already registered
Number of failures: 4
Link to latest failure: http://spdk.intel.com/public/spdk/builds/review/3aa13c878c94c8925bbf3647ca67b666dfda6c75.1530826421/fedora-02/build.log
Description: This series of 4 failures occurred one after another about two weeks ago and has not happened since. It looks like the underlying bug was fixed. We should monitor whether it happens again.

Case 5.
Issue: rmmod: ERROR: Module nvme_rdma is in use
Number of failures: 2
Link to latest failure: http://spdk.intel.com/public/spdk/builds/review/1939cfb85ee7c2d7707ade0e49c0e8bd01b04961.1529624900/fedora-06/build.log
Description: Looks like an issue with the nvme_rdma module (perhaps an environment clean-up problem after an earlier failure?).
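The rmmod error in Case 5 means something still holds a reference on nvme_rdma (an open device or a dependent module). As a small illustration (the helper name is ours), the third field of /proc/modules is the reference count, so a cleanup script could check it before calling rmmod:

```shell
#!/bin/sh
# Sketch: print a module's reference count, reading /proc/modules-format
# text from stdin (field 1 = name, field 3 = refcount). A nonzero count
# is exactly the condition that produces
# "rmmod: ERROR: Module nvme_rdma is in use"; the holders appear in
# field 4.
module_refcount() {
    awk -v m="$1" '$1 == m { print $3 }'
}
```

For example, `module_refcount nvme_rdma < /proc/modules`; a cleanup script might disconnect initiators and retry rmmod while this stays above zero.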

Case 6.
Issue: initiator.sh: line 43:  9367 Bus error               (core dumped) $rootdir/test/bdev/bdevperf/bdevperf -c $testdir/bdev.conf -q 128 -s 4096 -w verify -t 5 -d 512
Number of failures: 1
Link to latest failure: http://spdk.intel.com/public/spdk/builds/review/5b32f26f10d709f939443f777956574b77955f09.1528385513/fedora-06/build.log
Description: Happened only once at the beginning of June. We should monitor if it happens again.

Case 7.
Issue: VM shutdown issue
Number of failures: 2
Link to latest failure: <sorry, no link here>
Description: VM failed to shut down - test script timed out.

Case 8.
Issue: NVMf segmentation fault on disconnect
Number of failures: 6
Link to latest failure: http://spdk.intel.com/public/spdk/builds/review/9031d3bf2a94d15c2119c2e582a2ce5362cfa461.1526309227/fedora-06/build.log
Description: The failures happened one after another in May. I guess they were related to a bug that has since been fixed.

Case 9.
Issue: ASAN (AddressSanitizer) scan issue
Number of failures: 3
Link to latest failure: <sorry, no link here>
Description: Happened in May within a short period of time. Probably fixed by now.

Case 10.
Issue: conn.c: 741:spdk_iscsi_conn_read_data: *ERROR*: spdk_sock_recv() failed, errno 104: Connection reset by peer
Number of failures: 3
Link to latest failure: http://spdk.intel.com/public/spdk/builds/review/f77339505066367ec6c4279dd230fe1a64626032.1525681774/fedora-06/build.log
Description: The failures happened one after another in May. Looks like an environment failure.

Case 11.
Issue: nvme_rdma.c:1394:nvme_rdma_ctrlr_construct: *ERROR*: failed to create admin qpair
Number of failures: 1
Link to latest failure: http://spdk.intel.com/public/spdk/builds/review/e7713a2746bc8e72b4ec2b17622e102aebdf8c79.1525462134/fedora-03/build.log
Description: Happened only once at the beginning of May. We should monitor if it happens again.



Maciek


[-- Attachment #2: attachment.html --]
[-- Type: text/html, Size: 11625 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [SPDK] CI latent failures
@ 2018-07-23  7:27 Szwed, Maciej
  0 siblings, 0 replies; 5+ messages in thread
From: Szwed, Maciej @ 2018-07-23  7:27 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 6813 bytes --]

That's weird. I've just checked all the links and all of them have the failure at the bottom of the log. As for making them public... I'm not sure I'm able to do that. I believe latent failures are saved only on our internal server. Maybe Seth or Ben will be able to help here?

From: Luse, Paul E
Sent: Friday, July 20, 2018 4:49 PM
To: Storage Performance Development Kit <spdk(a)lists.01.org>; Szwed, Maciej <maciej.szwed(a)intel.com>
Subject: RE: CI latent failures

Ahh, didn't catch that, thanks. Maciej, can you publish links to the patches behind these failures? I looked at the first few and for some reason they didn't seem to have failures in CI, so I don't know, maybe I was looking at the wrong ones.

Thx
Paul


From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Philipp Skadorov
Sent: Friday, July 20, 2018 7:40 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>; Szwed, Maciej <maciej.szwed(a)intel.com<mailto:maciej.szwed(a)intel.com>>
Subject: Re: [SPDK] CI latent failures

Hi,
Those links must be pointing to Intel internal servers.
The review IDs below are missing at https://ci.spdk.io/spdk/builds/review/.

Regards,
Philipp

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Luse, Paul E
Sent: Friday, July 20, 2018 9:37 AM
To: Szwed, Maciej <maciej.szwed(a)intel.com<mailto:maciej.szwed(a)intel.com>>; Storage Performance Development Kit <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: Re: [SPDK] CI latent failures

Hi Maciej,

Thanks for doing this! I've added the SPDK dist list to the email as this is the kind of info that everyone can benefit from. Not only do we need more community members stepping up and doing this kind of analysis, but if others are experiencing any of these things we might be able to get some traction in debug.

Are you by any chance able to cross-reference these with GitHub issues (not a deep comparison, at least a scan) to try to identify any that are already reported, and if so, add the CI failure link to the GitHub issue? For those you can't find, maybe enter one issue per item, but wait probably 24 hours to see if anyone out there jumps up and says "I've been looking at that one" (there's always hope)

Thanks again
Paul


Hi,
As we are close to release, we wanted to take a closer look at CI latent failures in case there is something we should address before the release. I went through all reported failures from this quarter and was able to divide them into 11 cases. Some of the failures were caused by a failing test environment, some were caused by other software, and some were probably caused by bugs. Below is a more detailed report for each of the 11 cases. I didn't want to go too deep into debugging, but rather draw conclusions based on my experience, so if you see that something is actually very different from what I assumed, please speak up.

Case 1.
Issue: Segmentation fault on NVMf shutdown
Number of failures: 8
Link to latest failure: http://spdk.intel.com/public/spdk/builds/review/97a5813b0bc8fa6b00aabf8b31f8a0b0776e8458.1531262768/fedora-06/build.log
Description: This one has occurred from time to time since around mid-June and is one of the most recent ones. I think this one should be examined more closely as a potential bug.

Case 2.
Issue: sock.c:249:11: runtime error: member access within null pointer of type 'struct spdk_sock' (aka spdk_iscsi_conn_destruct)
Number of failures: 5
Link to latest failure: http://spdk.intel.com/public/spdk/builds/review/0585c2426071f9506dbe520234e2c3ee2f5aee7d.1531468630/fedora-06/build.log
Description: This one occurs from time to time and is one of the most recent ones. I think this one should be examined more closely as a potential bug.

Case 3.
Issue: nbd.c: 879:spdk_nbd_start: *ERROR*: ioctl(NBD_SET_SOCK) failed: Device or resource busy
Number of failures: 2
Link to latest failure: http://spdk.intel.com/public/spdk/builds/review/c3ba7cb2ce0bd01ec61f93e852cd37d110fa9d9d.1531431333/ubuntu17.10/build.log
Description: Looks like an issue with NBD access (perhaps an environment clean-up problem after an earlier failure?).

Case 4.
Issue: thread.c: 343:spdk_io_device_register: *ERROR*: io_device 0x1dd6040 already registered
Number of failures: 4
Link to latest failure: http://spdk.intel.com/public/spdk/builds/review/3aa13c878c94c8925bbf3647ca67b666dfda6c75.1530826421/fedora-02/build.log
Description: This series of 4 failures occurred one after another about two weeks ago and has not happened since. It looks like the underlying bug was fixed. We should monitor whether it happens again.

Case 5.
Issue: rmmod: ERROR: Module nvme_rdma is in use
Number of failures: 2
Link to latest failure: http://spdk.intel.com/public/spdk/builds/review/1939cfb85ee7c2d7707ade0e49c0e8bd01b04961.1529624900/fedora-06/build.log
Description: Looks like an issue with the nvme_rdma module (perhaps an environment clean-up problem after an earlier failure?).

Case 6.
Issue: initiator.sh: line 43:  9367 Bus error               (core dumped) $rootdir/test/bdev/bdevperf/bdevperf -c $testdir/bdev.conf -q 128 -s 4096 -w verify -t 5 -d 512
Number of failures: 1
Link to latest failure: http://spdk.intel.com/public/spdk/builds/review/5b32f26f10d709f939443f777956574b77955f09.1528385513/fedora-06/build.log
Description: Happened only once at the beginning of June. We should monitor if it happens again.

Case 7.
Issue: VM shutdown issue
Number of failures: 2
Link to latest failure: <sorry, no link here>
Description: VM failed to shut down - test script timed out.

Case 8.
Issue: NVMf segmentation fault on disconnect
Number of failures: 6
Link to latest failure: http://spdk.intel.com/public/spdk/builds/review/9031d3bf2a94d15c2119c2e582a2ce5362cfa461.1526309227/fedora-06/build.log
Description: The failures happened one after another in May. I guess they were related to a bug that has since been fixed.

Case 9.
Issue: ASAN (AddressSanitizer) scan issue
Number of failures: 3
Link to latest failure: <sorry, no link here>
Description: Happened in May within a short period of time. Probably fixed by now.

Case 10.
Issue: conn.c: 741:spdk_iscsi_conn_read_data: *ERROR*: spdk_sock_recv() failed, errno 104: Connection reset by peer
Number of failures: 3
Link to latest failure: http://spdk.intel.com/public/spdk/builds/review/f77339505066367ec6c4279dd230fe1a64626032.1525681774/fedora-06/build.log
Description: The failures happened one after another in May. Looks like an environment failure.

Case 11.
Issue: nvme_rdma.c:1394:nvme_rdma_ctrlr_construct: *ERROR*: failed to create admin qpair
Number of failures: 1
Link to latest failure: http://spdk.intel.com/public/spdk/builds/review/e7713a2746bc8e72b4ec2b17622e102aebdf8c79.1525462134/fedora-03/build.log
Description: Happened only once at the beginning of May. We should monitor if it happens again.



Maciek


[-- Attachment #2: attachment.html --]
[-- Type: text/html, Size: 18094 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [SPDK] CI latent failures
@ 2018-07-23  5:05 
  0 siblings, 0 replies; 5+ messages in thread
From:  @ 2018-07-23  5:05 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 14191 bytes --]

Hi Maciej, Paul, and All


Thanks for listing the CI failures.

I have submitted a fix for a recently observed CI failure related to the following:


Case 3.

Issue: nbd.c: 879:spdk_nbd_start: *ERROR*: ioctl(NBD_SET_SOCK) failed: Device or resource busy

Number of failures: 2

Link to latest failure: http://spdk.intel.com/public/spdk/builds/review/c3ba7cb2ce0bd01ec61f93e852cd37d110fa9d9d.1531431333/ubuntu17.10/build.log

Description: Looks like an issue with NBD access (perhaps an environment clean-up problem after an earlier failure?).


Thanks,
Shuhei

________________________________
From: SPDK <spdk-bounces(a)lists.01.org> on behalf of Luse, Paul E <paul.e.luse(a)intel.com>
Sent: July 20, 2018 23:49
To: Storage Performance Development Kit; Szwed, Maciej
Subject: [!]Re: [SPDK] CI latent failures


Ahh, didn't catch that, thanks. Maciej, can you publish links to the patches behind these failures? I looked at the first few and for some reason they didn't seem to have failures in CI, so I don't know, maybe I was looking at the wrong ones.



Thx

Paul





From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Philipp Skadorov
Sent: Friday, July 20, 2018 7:40 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org>; Szwed, Maciej <maciej.szwed(a)intel.com>
Subject: Re: [SPDK] CI latent failures



Hi,

Those links must be pointing to Intel internal servers.

The review IDs below are missing at https://ci.spdk.io/spdk/builds/review/.



Regards,

Philipp



From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Luse, Paul E
Sent: Friday, July 20, 2018 9:37 AM
To: Szwed, Maciej <maciej.szwed(a)intel.com>; Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: Re: [SPDK] CI latent failures



Hi Maciej,



Thanks for doing this! I’ve added the SPDK dist list to the email as this is the kind of info that everyone can benefit from. Not only do we need more community members stepping up and doing this kind of analysis, but if others are experiencing any of these things we might be able to get some traction in debug.



Are you by any chance able to cross-reference these with GitHub issues (not a deep comparison, at least a scan) to try to identify any that are already reported, and if so, add the CI failure link to the GitHub issue? For those you can't find, maybe enter one issue per item, but wait probably 24 hours to see if anyone out there jumps up and says "I've been looking at that one" (there's always hope)



Thanks again

Paul





Hi,

As we are close to release, we wanted to take a closer look at CI latent failures in case there is something we should address before the release. I went through all reported failures from this quarter and was able to divide them into 11 cases. Some of the failures were caused by a failing test environment, some were caused by other software, and some were probably caused by bugs. Below is a more detailed report for each of the 11 cases. I didn't want to go too deep into debugging, but rather draw conclusions based on my experience, so if you see that something is actually very different from what I assumed, please speak up.



Case 1.

Issue: Segmentation fault on NVMf shutdown

Number of failures: 8

Link to latest failure: http://spdk.intel.com/public/spdk/builds/review/97a5813b0bc8fa6b00aabf8b31f8a0b0776e8458.1531262768/fedora-06/build.log

Description: This one has occurred from time to time since around mid-June and is one of the most recent ones. I think this one should be examined more closely as a potential bug.



Case 2.

Issue: sock.c:249:11: runtime error: member access within null pointer of type 'struct spdk_sock' (aka spdk_iscsi_conn_destruct)

Number of failures: 5

Link to latest failure: http://spdk.intel.com/public/spdk/builds/review/0585c2426071f9506dbe520234e2c3ee2f5aee7d.1531468630/fedora-06/build.log

Description: This one occurs from time to time and is one of the most recent ones. I think this one should be examined more closely as a potential bug.



Case 3.

Issue: nbd.c: 879:spdk_nbd_start: *ERROR*: ioctl(NBD_SET_SOCK) failed: Device or resource busy

Number of failures: 2

Link to latest failure: http://spdk.intel.com/public/spdk/builds/review/c3ba7cb2ce0bd01ec61f93e852cd37d110fa9d9d.1531431333/ubuntu17.10/build.log

Description: Looks like an issue with NBD access (perhaps an environment clean-up problem after an earlier failure?).



Case 4.

Issue: thread.c: 343:spdk_io_device_register: *ERROR*: io_device 0x1dd6040 already registered

Number of failures: 4

Link to latest failure: http://spdk.intel.com/public/spdk/builds/review/3aa13c878c94c8925bbf3647ca67b666dfda6c75.1530826421/fedora-02/build.log

Description: This series of 4 failures occurred one after another about two weeks ago and has not happened since. It looks like the underlying bug was fixed. We should monitor whether it happens again.



Case 5.

Issue: rmmod: ERROR: Module nvme_rdma is in use

Number of failures: 2

Link to latest failure: http://spdk.intel.com/public/spdk/builds/review/1939cfb85ee7c2d7707ade0e49c0e8bd01b04961.1529624900/fedora-06/build.log

Description: Looks like an issue with the nvme_rdma module (perhaps an environment clean-up problem after an earlier failure?).



Case 6.

Issue: initiator.sh: line 43:  9367 Bus error               (core dumped) $rootdir/test/bdev/bdevperf/bdevperf -c $testdir/bdev.conf -q 128 -s 4096 -w verify -t 5 -d 512

Number of failures: 1

Link to latest failure: http://spdk.intel.com/public/spdk/builds/review/5b32f26f10d709f939443f777956574b77955f09.1528385513/fedora-06/build.log

Description: Happened only once at the beginning of June. We should monitor if it happens again.



Case 7.

Issue: VM shutdown issue

Number of failures: 2

Link to latest failure: <sorry, no link here>

Description: VM failed to shut down - test script timed out.



Case 8.

Issue: NVMf segmentation fault on disconnect

Number of failures: 6

Link to latest failure: http://spdk.intel.com/public/spdk/builds/review/9031d3bf2a94d15c2119c2e582a2ce5362cfa461.1526309227/fedora-06/build.log

Description: The failures happened one after another in May. I guess they were related to a bug that has since been fixed.



Case 9.

Issue: ASAN (AddressSanitizer) scan issue

Number of failures: 3

Link to latest failure: <sorry, no link here>

Description: Happened in May within a short period of time. Probably fixed by now.



Case 10.

Issue: conn.c: 741:spdk_iscsi_conn_read_data: *ERROR*: spdk_sock_recv() failed, errno 104: Connection reset by peer

Number of failures: 3

Link to latest failure: http://spdk.intel.com/public/spdk/builds/review/f77339505066367ec6c4279dd230fe1a64626032.1525681774/fedora-06/build.log

Description: The failures happened one after another in May. Looks like an environment failure.



Case 11.

Issue: nvme_rdma.c:1394:nvme_rdma_ctrlr_construct: *ERROR*: failed to create admin qpair

Number of failures: 1

Link to latest failure: http://spdk.intel.com/public/spdk/builds/review/e7713a2746bc8e72b4ec2b17622e102aebdf8c79.1525462134/fedora-03/build.log

Description: Happened only once at the beginning of May. We should monitor if it happens again.







Maciek



[-- Attachment #2: attachment.html --]
[-- Type: text/html, Size: 21915 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [SPDK] CI latent failures
@ 2018-07-20 14:49 Luse, Paul E
  0 siblings, 0 replies; 5+ messages in thread
From: Luse, Paul E @ 2018-07-20 14:49 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 6219 bytes --]

Ahh, didn't catch that, thanks. Maciej, can you publish links to the patches behind these failures? I looked at the first few and for some reason they didn't seem to have failures in CI, so I don't know, maybe I was looking at the wrong ones.

Thx
Paul


From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Philipp Skadorov
Sent: Friday, July 20, 2018 7:40 AM
To: Storage Performance Development Kit <spdk(a)lists.01.org>; Szwed, Maciej <maciej.szwed(a)intel.com>
Subject: Re: [SPDK] CI latent failures

Hi,
Those links must be pointing to Intel internal servers.
The review IDs below are missing at https://ci.spdk.io/spdk/builds/review/.

Regards,
Philipp

From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Luse, Paul E
Sent: Friday, July 20, 2018 9:37 AM
To: Szwed, Maciej <maciej.szwed(a)intel.com>; Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: Re: [SPDK] CI latent failures

Hi Maciej,

Thanks for doing this! I've added the SPDK dist list to the email as this is the kind of info that everyone can benefit from. Not only do we need more community members stepping up and doing this kind of analysis, but if others are experiencing any of these things we might be able to get some traction in debug.

Are you by any chance able to cross-reference these with GitHub issues (not a deep comparison, at least a scan) to try to identify any that are already reported, and if so, add the CI failure link to the GitHub issue? For those you can't find, maybe enter one issue per item, but wait probably 24 hours to see if anyone out there jumps up and says "I've been looking at that one" (there's always hope)

Thanks again
Paul


Hi,
As we are close to release, we wanted to take a closer look at CI latent failures in case there is something we should address before the release. I went through all reported failures from this quarter and was able to divide them into 11 cases. Some of the failures were caused by a failing test environment, some were caused by other software, and some were probably caused by bugs. Below is a more detailed report for each of the 11 cases. I didn't want to go too deep into debugging, but rather draw conclusions based on my experience, so if you see that something is actually very different from what I assumed, please speak up.

Case 1.
Issue: Segmentation fault on NVMf shutdown
Number of failures: 8
Link to latest failure: http://spdk.intel.com/public/spdk/builds/review/97a5813b0bc8fa6b00aabf8b31f8a0b0776e8458.1531262768/fedora-06/build.log
Description: This one has occurred from time to time since around mid-June and is one of the most recent ones. I think this one should be examined more closely as a potential bug.

Case 2.
Issue: sock.c:249:11: runtime error: member access within null pointer of type 'struct spdk_sock' (aka spdk_iscsi_conn_destruct)
Number of failures: 5
Link to latest failure: http://spdk.intel.com/public/spdk/builds/review/0585c2426071f9506dbe520234e2c3ee2f5aee7d.1531468630/fedora-06/build.log
Description: This one occurs from time to time and is one of the most recent ones. I think this one should be examined more closely as a potential bug.

Case 3.
Issue: nbd.c: 879:spdk_nbd_start: *ERROR*: ioctl(NBD_SET_SOCK) failed: Device or resource busy
Number of failures: 2
Link to latest failure: http://spdk.intel.com/public/spdk/builds/review/c3ba7cb2ce0bd01ec61f93e852cd37d110fa9d9d.1531431333/ubuntu17.10/build.log
Description: Looks like an issue with NBD access (perhaps an environment clean-up problem after an earlier failure?).

Case 4.
Issue: thread.c: 343:spdk_io_device_register: *ERROR*: io_device 0x1dd6040 already registered
Number of failures: 4
Link to latest failure: http://spdk.intel.com/public/spdk/builds/review/3aa13c878c94c8925bbf3647ca67b666dfda6c75.1530826421/fedora-02/build.log
Description: This series of 4 failures occurred one after another about two weeks ago and has not happened since. It looks like the underlying bug was fixed. We should monitor whether it happens again.

Case 5.
Issue: rmmod: ERROR: Module nvme_rdma is in use
Number of failures: 2
Link to latest failure: http://spdk.intel.com/public/spdk/builds/review/1939cfb85ee7c2d7707ade0e49c0e8bd01b04961.1529624900/fedora-06/build.log
Description: Looks like an issue with the nvme_rdma module (perhaps an environment clean-up problem after an earlier failure?).

Case 6.
Issue: initiator.sh: line 43:  9367 Bus error               (core dumped) $rootdir/test/bdev/bdevperf/bdevperf -c $testdir/bdev.conf -q 128 -s 4096 -w verify -t 5 -d 512
Number of failures: 1
Link to latest failure: http://spdk.intel.com/public/spdk/builds/review/5b32f26f10d709f939443f777956574b77955f09.1528385513/fedora-06/build.log
Description: Happened only once at the beginning of June. We should monitor if it happens again.

Case 7.
Issue: VM shutdown issue
Number of failures: 2
Link to latest failure: <sorry, no link here>
Description: VM failed to shut down - test script timed out.
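
For shutdown waits like this, a bounded poll that gives up loudly is friendlier to CI than an open-ended wait, because the log then records a clear timeout instead of the whole job hanging. A sketch (the function name and polling interval are mine; vm_is_down is a hypothetical predicate, not an SPDK helper):

```shell
#!/usr/bin/env bash
# Poll a predicate command until it succeeds or a deadline passes.
# Returns 0 on success, nonzero on timeout.
wait_for() {
    local timeout="$1"; shift
    local deadline=$((SECONDS + timeout))
    while ((SECONDS < deadline)); do
        "$@" && return 0
        sleep 1
    done
    # One final check at the deadline; its status is the result.
    "$@"
}

# Illustrative use:
#   wait_for 60 vm_is_down || echo "VM failed to shut down within 60s" >&2
```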

Case 8.
Issue: NVMf segmentation fault on disconnect
Number of failures: 6
Link to latest failure: http://spdk.intel.com/public/spdk/builds/review/9031d3bf2a94d15c2119c2e582a2ce5362cfa461.1526309227/fedora-06/build.log
Description: These failures occurred in quick succession in May. I suspect they were related to a bug that has since been fixed.

Case 9.
Issue: ASan (AddressSanitizer) scan issue
Number of failures: 3
Link to latest failure: <sorry, no link here>
Description: Happened within a short period of time in May. Probably fixed by now.

Case 10.
Issue: conn.c: 741:spdk_iscsi_conn_read_data: *ERROR*: spdk_sock_recv() failed, errno 104: Connection reset by peer
Number of failures: 3
Link to latest failure: http://spdk.intel.com/public/spdk/builds/review/f77339505066367ec6c4279dd230fe1a64626032.1525681774/fedora-06/build.log
Description: These failures occurred in quick succession in May. Looks like an environment failure.

Case 11.
Issue: nvme_rdma.c:1394:nvme_rdma_ctrlr_construct: *ERROR*: failed to create admin qpair
Number of failures: 1
Link to latest failure: http://spdk.intel.com/public/spdk/builds/review/e7713a2746bc8e72b4ec2b17622e102aebdf8c79.1525462134/fedora-03/build.log
Description: Happened only once, at the beginning of May. We should monitor whether it recurs.



Maciek


[-- Attachment #2: attachment.html --]
[-- Type: text/html, Size: 14378 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [SPDK] CI latent failures
@ 2018-07-20 14:39 Philipp Skadorov
  0 siblings, 0 replies; 5+ messages in thread
From: Philipp Skadorov @ 2018-07-20 14:39 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 5703 bytes --]

Hi,
Those links must be pointing to Intel internal servers.
The review IDs below are missing from https://ci.spdk.io/spdk/builds/review/.

Regards,
Philipp



[-- Attachment #2: attachment.html --]
[-- Type: text/html, Size: 13015 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2018-07-23  7:27 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-20 13:37 [SPDK] CI latent failures Luse, Paul E
2018-07-20 14:39 Philipp Skadorov
2018-07-20 14:49 Luse, Paul E
2018-07-23  5:05 
2018-07-23  7:27 Szwed, Maciej
