From mboxrd@z Thu Jan 1 00:00:00 1970
From: Christian Borntraeger
Date: Wed, 01 Jul 2020 16:01:59 +0000
Subject: Re: linux-next: umh: fix processed error when UMH_WAIT_PROC is used seems to break linux bridge on s390x (bisected)
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
References: <4d8fbcea-a892-3453-091f-d57c03f9aa90@de.ibm.com>
 <1263e370-7cee-24d8-b98c-117bf7c90a83@de.ibm.com>
 <20200626025410.GJ4332@42.do-not-panic.com>
 <20200630175704.GO13911@42.do-not-panic.com>
 <20200701135324.GS4332@42.do-not-panic.com>
 <8d714a23-bac4-7631-e5fc-f97c20a46083@i-love.sakura.ne.jp>
 <20200701153859.GT4332@42.do-not-panic.com>
 <20200701155819.GU4332@42.do-not-panic.com>
In-Reply-To: <20200701155819.GU4332@42.do-not-panic.com>
To: Luis Chamberlain
Cc: Tetsuo Handa, Christoph Hellwig, "Eric W. Biederman", ast@kernel.org,
 axboe@kernel.dk, bfields@fieldses.org, bridge@lists.linux-foundation.org,
 chainsaw@gentoo.org, christian.brauner@ubuntu.com, chuck.lever@oracle.com,
 davem@davemloft.net, dhowells@redhat.com, gregkh@linuxfoundation.org,
 jarkko.sakkinen@linux.intel.com, jmorris@namei.org, josh@joshtriplett.org,
 keescook@chromium.org, keyrings@vger.kernel.org, kuba@kernel.org,
 lars.ellenberg@linbit.com, linux-fsdevel@vger.kernel.org,
 linux-kernel@vger.kernel.org, linux-nfs@vger.kernel.org,
 linux-security-module@vger.kernel.org, nikolay@cumulusnetworks.com,
 philipp.reisner@linbit.com, ravenexp@gmail.com, roopa@cumulusnetworks.com,
 serge@hallyn.com, slyfox@gentoo.org, viro@zeniv.linux.org.uk,
 yangtiezhu@loongson.cn, netdev@vger.kernel.org, markward@linux.ibm.com,
 linux-s390

On 01.07.20 17:58, Luis Chamberlain wrote:
[...]
>>>
>>> Ah, well that would be a different fix required, because again,
>>> br_stp_start() does not untangle the correct error today really.
>>> I also think it would be odd that SIGSEGV or another signal
>>> is what was terminating Christian's bridge stp call, but let's
>>> find out!
>>>
>>> Note we pass 0 to the options to wait so the mistake here could indeed
>>> be that we did not need KWIFSIGNALED(). I was afraid of this prospect...
>>> as it has other implications.
>>>
>>> It means we either *open code* all callers, or we handle this in a
>>> unified way on the umh. And if we do handle this in a unified way, it
>>> then begs the question as to *what* do we pass for the signals case and
>>> continued case. Below we just pass the signal, and treat continued as
>>> OK, but treating continued as OK would also be a *new* change as well.
>>>
>>> For instance (this is just boot tested, but Christian if you can
>>> try this as well that would be appreciated):
>>
>>
>> Does not help, the bridge stays in DOWN state.
>
> OK thanks for testing, that was fast! Does your code go through the
> STP kernel path or userpath? If it is taking the STP kernel path
> then this is not the real culprit to your issue then.

I have no idea and I cannot look into this right now. I can test patches,
as compile, reboot and test is almost no effort.
FWIW, this is just the network of a KVM guest on libvirt's default network
no longer working; maybe you can reproduce this on x86 as well?
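The exit-status versus signal question in the quote can be illustrated in userspace. The following is a minimal sketch under stated assumptions, not the umh patch: it uses the POSIX <sys/wait.h> macros (WIFEXITED, WEXITSTATUS, WIFSIGNALED, WTERMSIG, WIFCONTINUED) instead of the kernel's KW* helpers, and the function names run_child_raw_status(), decode_exit_status_only() and decode_unified() are hypothetical. It reaps a child with waitpid() and options 0, and shows that decoding only the exit code turns a signal-killed child into 0, which a caller that treats 0 as success would misread.

```c
#define _XOPEN_SOURCE 700 /* expose WIFCONTINUED on strict compilers */

#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child that either exits with exit_code
 * (when sig == 0) or is terminated by signal sig, reap it with
 * waitpid() using options == 0, and return the raw wait status.
 * Returns -1 if fork() or waitpid() fails.
 */
static int run_child_raw_status(int exit_code, int sig)
{
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		if (sig)
			raise(sig);	/* default action terminates the child */
		_exit(exit_code);
	}

	int status = 0;

	if (waitpid(pid, &status, 0) < 0)	/* options == 0: no WCONTINUED */
		return -1;
	return status;
}

/*
 * Decodes only the exit code. On Linux/glibc the exit-code bits of a
 * signal-death status are zero, so a killed child looks like success.
 */
static int decode_exit_status_only(int status)
{
	return WEXITSTATUS(status);
}

/*
 * Unified decode in the spirit of the quoted proposal: exit code on a
 * normal exit, signal number on a signal death, 0 ("OK") for continued.
 */
static int decode_unified(int status)
{
	if (WIFEXITED(status))
		return WEXITSTATUS(status);
	if (WIFSIGNALED(status))
		return WTERMSIG(status);
#ifdef WIFCONTINUED
	if (WIFCONTINUED(status))
		return 0;	/* only reported when WCONTINUED was requested */
#endif
	return -1;		/* e.g. stopped; not expected with options == 0 */
}
```

With options 0, waitpid() never reports a continued child (that requires WCONTINUED), so the "continued" branch only matters if a caller opts in. Note also that the unified mapping cannot tell exit code 9 apart from SIGKILL, which is part of the open question of *what* to pass for the signal case.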