From mboxrd@z Thu Jan 1 00:00:00 1970
From: Akinobu Mita
Date: Sat, 4 May 2019 13:20:50 +0900
Subject: Re: [PATCH 0/4] nvme-pci: support device coredump
To: Christoph Hellwig
Cc: Keith Busch, linux-nvme@lists.infradead.org, LKML, Johannes Berg, Jens Axboe, Sagi Grimberg
In-Reply-To: <20190503122035.GA21501@lst.de>
References: <1556787561-5113-1-git-send-email-akinobu.mita@gmail.com> <20190502125722.GA28470@localhost.localdomain> <20190503121232.GB30013@localhost.localdomain> <20190503122035.GA21501@lst.de>

On Fri, May 3, 2019 at 21:20, Christoph Hellwig wrote:
>
> On Fri, May 03, 2019 at 06:12:32AM -0600, Keith Busch wrote:
> > Could you actually explain how the rest is useful? I personally have
> > never encountered an issue where knowing these values would have helped:
> > every device timeout always needed device specific internal firmware
> > logs in my experience.

I agree that device-specific internal logs such as telemetry are the most
useful. The memory dump of the submission and completion queues is not as
powerful, but it helps to know which commands were submitted before the
controller went wrong (IOW, it's sometimes not enough to know which commands
actually failed), and it can be parsed without vendor-specific knowledge.

If the issue is reproducible, the nvme trace is the most powerful tool for
this kind of information. The queue memory dump is less powerful, but it can
always be enabled by default.

> Yes. Also note that NVMe now has the 'device initiated telemetry'
> feature, which is just a weird name for device coredump. Wiring that
> up so that we can easily provide that data to the device vendor would
> actually be pretty useful.

This version of nvme coredump captures the controller registers and each
queue, so the point just before resetting the controller is a suitable time
to capture them.

If we also capture other log pages with this mechanism, the coredump
procedure will be split into two phases: one before resetting the
controller, and one after the reset, as soon as the admin queue is available.
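
To illustrate the point that queue dumps can be parsed without vendor-specific knowledge, here is a minimal sketch of decoding raw queue memory. It assumes the dump is nothing more than the queue entries laid out back to back (the actual file layout produced by this patch series is not described in this message), and the names `parse_sqe`, `parse_cqe` and `parse_queue` are hypothetical. The field positions follow the common command and completion entry formats in the NVMe base specification.

```python
import struct
from typing import Callable, List, NamedTuple


class SubmissionEntry(NamedTuple):
    opcode: int
    command_id: int
    nsid: int


class CompletionEntry(NamedTuple):
    sq_head: int
    sq_id: int
    command_id: int
    phase: int
    status: int


SQE_SIZE = 64  # bytes per submission queue entry
CQE_SIZE = 16  # bytes per completion queue entry


def parse_sqe(raw: bytes) -> SubmissionEntry:
    # Dword 0: opcode in bits 7:0, command identifier in bits 31:16.
    # Dword 1: namespace identifier.
    cdw0, nsid = struct.unpack_from("<II", raw, 0)
    return SubmissionEntry(cdw0 & 0xFF, cdw0 >> 16, nsid)


def parse_cqe(raw: bytes) -> CompletionEntry:
    # Dword 2: SQ head pointer in bits 15:0, SQ identifier in bits 31:16.
    # Dword 3: command identifier in bits 15:0, phase tag in bit 16,
    #          status field in bits 31:17.
    _dw0, _dw1, dw2, dw3 = struct.unpack_from("<IIII", raw, 0)
    return CompletionEntry(
        dw2 & 0xFFFF, dw2 >> 16, dw3 & 0xFFFF, (dw3 >> 16) & 1, dw3 >> 17
    )


def parse_queue(dump: bytes, entry_size: int, parse: Callable) -> List:
    # Assumption: the dump holds only whole entries, with no header.
    last = len(dump) - entry_size
    return [parse(dump[off:off + entry_size])
            for off in range(0, last + 1, entry_size)]


if __name__ == "__main__":
    # Synthetic entry: opcode 0x02 (NVM Read), command id 7, namespace 1.
    sqe = struct.pack("<II", (7 << 16) | 0x02, 1) + bytes(SQE_SIZE - 8)
    for entry in parse_queue(sqe * 2, SQE_SIZE, parse_sqe):
        print(entry)
```

Matching the command identifiers in the submission entries against those in the completion entries shows which submitted commands never completed, which is the kind of information the queue dump adds beyond knowing which commands failed.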