From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1754051AbbBKSSr (ORCPT );
        Wed, 11 Feb 2015 13:18:47 -0500
Received: from mail-ie0-f174.google.com ([209.85.223.174]:46569 "EHLO
        mail-ie0-f174.google.com" rhost-flags-OK-OK-OK-OK)
        by vger.kernel.org with ESMTP id S1754012AbbBKSSo (ORCPT );
        Wed, 11 Feb 2015 13:18:44 -0500
MIME-Version: 1.0
In-Reply-To:
References:
Date: Wed, 11 Feb 2015 10:18:43 -0800
X-Google-Sender-Auth: 8wwHs-NI2hOFctvdYxD99GdqoOc
Message-ID:
Subject: Re: smp_call_function_single lockups
From: Linus Torvalds
To: Rafael David Tinoco
Cc: LKML , Thomas Gleixner , Jens Axboe
Content-Type: multipart/mixed; boundary=001a11c3d13a9b0003050ed40888
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

--001a11c3d13a9b0003050ed40888
Content-Type: text/plain; charset=UTF-8

On Wed, Feb 11, 2015 at 5:19 AM, Rafael David Tinoco wrote:
>
> - After applying patch provided by Thomas we were able to cause the
> lockup only after 6 days (also locked inside
> smp_call_function_single). Test performance (even for a nested kvm)
> was reduced substantially with 3.19 + this patch.

I think that just means that the patch from Thomas doesn't change
anything - the reason it takes longer to lock up is just that
performance reduction, so whatever race it is that causes the problem
was just harder to hit, but not fundamentally affected.

I think a more interesting thing to get is the traces from the other
CPUs when this happens. In a virtualized environment, that might be
easier to get than on real hardware, and if you are able to reproduce
this at will - especially with something recent like 3.19, and could
get that, that would be really good.

I'll think about this all, but we couldn't figure anything out last
time we looked at it, so without more clues, don't hold your breath.

That said, it *would* be good if we could get rid of the synchronous
behavior entirely, and make it a rule that if somebody wants to wait
for it, they'll have to do their own waiting. Because I still think
that that CSD_FLAG_WAIT is pure and utter garbage. And I think that
Jens said that it is probably bogus to begin with.

I also don't even see where the CSD_FLAG_WAIT bit would ever be
cleared, so it all looks completely buggy anyway.

Does this (COMPLETELY UNTESTED!) attached patch change anything?

                      Linus

--001a11c3d13a9b0003050ed40888
Content-Type: text/plain; charset=US-ASCII; name="patch.diff"
Content-Disposition: attachment; filename="patch.diff"
Content-Transfer-Encoding: base64
X-Attachment-Id: f_i611j6zh0

IGtlcm5lbC9zbXAuYyB8IDE0ICsrKystLS0tLS0tLS0tCiAxIGZpbGUgY2hhbmdlZCwgNCBpbnNl
cnRpb25zKCspLCAxMCBkZWxldGlvbnMoLSkKCmRpZmYgLS1naXQgYS9rZXJuZWwvc21wLmMgYi9r
ZXJuZWwvc21wLmMKaW5kZXggZjM4YTFlNjkyMjU5Li4xM2E4ZTc1ZTEzNzkgMTAwNjQ0Ci0tLSBh
L2tlcm5lbC9zbXAuYworKysgYi9rZXJuZWwvc21wLmMKQEAgLTE5LDcgKzE5LDYgQEAKIAogZW51
bSB7CiAJQ1NEX0ZMQUdfTE9DSwkJPSAweDAxLAotCUNTRF9GTEFHX1dBSVQJCT0gMHgwMiwKIH07
CiAKIHN0cnVjdCBjYWxsX2Z1bmN0aW9uX2RhdGEgewpAQCAtMTE0LDI2ICsxMTMsMjQgQEAgc3Rh
dGljIHZvaWQgY3NkX2xvY2tfd2FpdChzdHJ1Y3QgY2FsbF9zaW5nbGVfZGF0YSAqY3NkKQogc3Rh
dGljIHZvaWQgY3NkX2xvY2soc3RydWN0IGNhbGxfc2luZ2xlX2RhdGEgKmNzZCkKIHsKIAljc2Rf
bG9ja193YWl0KGNzZCk7Ci0JY3NkLT5mbGFncyB8PSBDU0RfRkxBR19MT0NLOworCWNzZC0+Zmxh
Z3MgPSBDU0RfRkxBR19MT0NLOwogCiAJLyoKIAkgKiBwcmV2ZW50IENQVSBmcm9tIHJlb3JkZXJp
bmcgdGhlIGFib3ZlIGFzc2lnbm1lbnQKIAkgKiB0byAtPmZsYWdzIHdpdGggYW55IHN1YnNlcXVl
bnQgYXNzaWdubWVudHMgdG8gb3RoZXIKIAkgKiBmaWVsZHMgb2YgdGhlIHNwZWNpZmllZCBjYWxs
X3NpbmdsZV9kYXRhIHN0cnVjdHVyZToKIAkgKi8KLQlzbXBfbWIoKTsKKwlzbXBfd21iKCk7CiB9
CiAKIHN0YXRpYyB2b2lkIGNzZF91bmxvY2soc3RydWN0IGNhbGxfc2luZ2xlX2RhdGEgKmNzZCkK
IHsKLQlXQVJOX09OKChjc2QtPmZsYWdzICYgQ1NEX0ZMQUdfV0FJVCkgJiYgIShjc2QtPmZsYWdz
ICYgQ1NEX0ZMQUdfTE9DSykpOworCVdBUk5fT04oIShjc2QtPmZsYWdzICYgQ1NEX0ZMQUdfTE9D
SykpOwogCiAJLyoKIAkgKiBlbnN1cmUgd2UncmUgYWxsIGRvbmUgYmVmb3JlIHJlbGVhc2luZyBk
YXRhOgogCSAqLwotCXNtcF9tYigpOwotCi0JY3NkLT5mbGFncyAmPSB+Q1NEX0ZMQUdfTE9DSzsK
KwlzbXBfc3RvcmVfcmVsZWFzZSgmY3NkLT5mbGFncywgMCk7CiB9CiAKIHN0YXRpYyBERUZJTkVf
UEVSX0NQVV9TSEFSRURfQUxJR05FRChzdHJ1Y3QgY2FsbF9zaW5nbGVfZGF0YSwgY3NkX2RhdGEp
OwpAQCAtMTczLDkgKzE3MCw2IEBAIHN0YXRpYyBpbnQgZ2VuZXJpY19leGVjX3NpbmdsZShpbnQg
Y3B1LCBzdHJ1Y3QgY2FsbF9zaW5nbGVfZGF0YSAqY3NkLAogCWNzZC0+ZnVuYyA9IGZ1bmM7CiAJ
Y3NkLT5pbmZvID0gaW5mbzsKIAotCWlmICh3YWl0KQotCQljc2QtPmZsYWdzIHw9IENTRF9GTEFH
X1dBSVQ7Ci0KIAkvKgogCSAqIFRoZSBsaXN0IGFkZGl0aW9uIHNob3VsZCBiZSB2aXNpYmxlIGJl
Zm9yZSBzZW5kaW5nIHRoZSBJUEkKIAkgKiBoYW5kbGVyIGxvY2tzIHRoZSBsaXN0IHRvIHB1bGwg
dGhlIGVudHJ5IG9mZiBpdCBiZWNhdXNlIG9mCg==
--001a11c3d13a9b0003050ed40888--
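
The attached patch.diff is base64-encoded. Decoded, it removes
CSD_FLAG_WAIT from kernel/smp.c, makes csd_lock() set the flags with a
plain assignment followed by smp_wmb(), and makes csd_unlock() clear
them with smp_store_release(&csd->flags, 0). The resulting lock/unlock
protocol can be sketched in userspace roughly as follows. This is a
minimal sketch, assuming C11 atomics as approximate stand-ins for the
kernel barriers; struct csd_sketch, the csd_sketch_* functions and
count_call are hypothetical names for illustration, not kernel code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Only the lock bit remains once CSD_FLAG_WAIT is removed. */
enum { CSD_FLAG_LOCK = 0x01 };

/* Simplified, hypothetical stand-in for struct call_single_data. */
struct csd_sketch {
	void (*func)(void *info);
	void *info;
	atomic_uint flags;
};

/* Spin until the previous user has released the slot.
 * Assumption: an acquire load, pairing with the release store in
 * csd_sketch_unlock(); the kernel would also call cpu_relax(). */
static void csd_sketch_lock_wait(struct csd_sketch *csd)
{
	while (atomic_load_explicit(&csd->flags, memory_order_acquire) &
	       CSD_FLAG_LOCK)
		;
}

static void csd_sketch_lock(struct csd_sketch *csd)
{
	csd_sketch_lock_wait(csd);
	/* Plain assignment instead of |=: no other bit is left to keep. */
	atomic_store_explicit(&csd->flags, CSD_FLAG_LOCK,
			      memory_order_relaxed);
	/* Rough analog of smp_wmb(): keep the flags store ordered
	 * before later stores to func/info. */
	atomic_thread_fence(memory_order_release);
}

static void csd_sketch_unlock(struct csd_sketch *csd)
{
	/* Stand-in for WARN_ON(!(csd->flags & CSD_FLAG_LOCK)); the
	 * kernel only warns, this sketch aborts. */
	assert(atomic_load_explicit(&csd->flags, memory_order_relaxed) &
	       CSD_FLAG_LOCK);
	/* Rough analog of smp_store_release(&csd->flags, 0). */
	atomic_store_explicit(&csd->flags, 0, memory_order_release);
}

/* Example callback: counts how often it has been invoked. */
static void count_call(void *info)
{
	++*(int *)info;
}
```

A caller would take the lock, fill in func and info (for example with
count_call and a pointer to an int counter), have the target run
csd->func(csd->info), and then release the slot with
csd_sketch_unlock(). Anyone who needs synchronous behavior waits for
the flags to drop back to 0, which is what csd_sketch_lock_wait() does.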