From: "Torsten Kaiser" <just.for.lkml@googlemail.com>
To: Jeff Garzik <jeff@garzik.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org, linux-ide@vger.kernel.org,
	Kuan Luo <kluo@nvidia.com>, Peer Chen <pchen@nvidia.com>
Subject: Re: 2.6.23-mm1
Date: Sat, 13 Oct 2007 16:32:14 +0200
Message-ID: <64bb37e0710130732p303547e3n54cfa9dac34c53b5@mail.gmail.com>
In-Reply-To: <4710B7C5.5050403@garzik.org>

On 10/13/07, Jeff Garzik <jeff@garzik.org> wrote:
> Torsten Kaiser wrote:
> > On 10/13/07, Jeff Garzik <jeff@garzik.org> wrote:
> >> Torsten Kaiser wrote:
> > I can't follow you on SYNCHRONIZE CACHE.
> > The only command written to the syslog in the errors where
> > 0x60==ATA_CMD_FPDMA_READ and 0xB0 (which is not in
> > include/linux/ata.h, but ATA-6 says that this is SMART related. That
> > makes sense, as smartd is failing).
>
> In the traceback you have "ata_scsi_flush_xlat", which is the function
> that translates a SCSI sync-cache command into an ATA flush-cache command.

Aha. That makes sense.
But on the second error, where the drive was kicked out completely, none
of the three traces had ata_scsi_flush_xlat.
First WARNING:
Oct 13 07:46:48 treogen [   99.850000] Call Trace:
Oct 13 07:46:48 treogen [   99.850000]  [<ffffffff8044431a>] ata_qc_issue+0x4aa/0x540
Oct 13 07:46:48 treogen [   99.850000]  [<ffffffff80432e60>] scsi_done+0x0/0x20
Oct 13 07:46:48 treogen [   99.850000]  [<ffffffff8044ce30>] ata_scsi_pass_thru+0x0/0x2c0
Oct 13 07:46:48 treogen [   99.850000]  [<ffffffff8044a6ea>] ata_scsi_translate+0xfa/0x180
Oct 13 07:46:48 treogen [   99.850000]  [<ffffffff80432e60>] scsi_done+0x0/0x20
...

Second+Third:
Oct 13 07:46:49 treogen [  100.510000]  [<ffffffff804442ef>] ata_qc_issue+0x47f/0x540
Oct 13 07:46:49 treogen [  100.510000]  [<ffffffff80432e60>] scsi_done+0x0/0x20
Oct 13 07:46:49 treogen [  100.510000]  [<ffffffff80432e60>] scsi_done+0x0/0x20
Oct 13 07:46:49 treogen [  100.510000]  [<ffffffff8044a440>] ata_scsi_rw_xlat+0x0/0x1b0
Oct 13 07:46:49 treogen [  100.510000]  [<ffffffff8044a6ea>] ata_scsi_translate+0xfa/0x180
Oct 13 07:46:49 treogen [  100.510000]  [<ffffffff80432e60>] scsi_done+0x0/0x20
...

So the commands that generate the WARNINGs seem to be only later collateral damage.

> The "WARNING: at drivers/ata/libata-core.c:5752 ata_qc_issue()" also
> guides us to the code comment
>
>          /* Make sure only one non-NCQ command is outstanding.  The
>           * check is skipped for old EH because it reuses active qc to
>           * request ATAPI sense.
>           */
>
> which is a check related to NCQ->off and off->NCQ edge cases.
>
> So those are the two bits of information I found interesting.

I very much agree with this. But rather than 'normal' edges caused by
the cache flushes, I would blame the SMART commands from smartd
for triggering the switch.
Both errors happened during the startup of smartd.
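
To make the NCQ->off and off->NCQ edge concrete, here is a simplified
user-space model of the rule that queued and non-queued commands must not
be outstanding at the same time. It is only a sketch and not the libata
code: struct link_model, cmd_is_ncq() and must_defer() are made-up names,
and the state is reduced to a tag bitmask plus one flag. The opcodes are
the ATA values (0x60/0x61 for the FPDMA read/write, 0xB0 for SMART, 0xE7
for FLUSH CACHE).

```c
#include <stdbool.h>
#include <stdint.h>

/* ATA opcodes relevant to this thread. */
enum {
	CMD_FPDMA_READ  = 0x60,	/* NCQ read */
	CMD_FPDMA_WRITE = 0x61,	/* NCQ write */
	CMD_SMART       = 0xB0,	/* SMART, never queued */
	CMD_FLUSH       = 0xE7,	/* FLUSH CACHE, never queued */
};

/* Hypothetical per-link state; NOT the real struct ata_link. */
struct link_model {
	uint32_t ncq_active;	/* bitmask of outstanding NCQ tags */
	bool non_ncq_active;	/* a non-NCQ command is in flight */
};

bool cmd_is_ncq(uint8_t cmd)
{
	return cmd == CMD_FPDMA_READ || cmd == CMD_FPDMA_WRITE;
}

/* Return true if the command has to wait before it may be issued. */
bool must_defer(const struct link_model *l, uint8_t cmd)
{
	/* A non-NCQ command in flight blocks everything else. */
	if (l->non_ncq_active)
		return true;

	/* NCQ commands may join other NCQ commands. */
	if (cmd_is_ncq(cmd))
		return false;

	/* A non-NCQ command must wait until all NCQ tags have drained. */
	return l->ncq_active != 0;
}
```

In this model a SMART command arriving while FPDMA reads are outstanding
has to be deferred; if it is issued anyway, the result is exactly the
"more than one non-NCQ command outstanding" kind of state that the
WARNING in ata_qc_issue() checks for.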

> >> guess that sata_nv is not properly handling non-queued commands.
> >
> > But that still seems correct, as I would not expect that SMART
> > commands get queued. (Thats just a guess, as I did not try to find the
> > code that does this distinction)
> >
> >> This is a patch from libata-dev.git#nv-swncq (via #ALL).
> >
> > Comparing sata_nv.c from 2.6.23-rc8-mm1 and 2.6.23-mm1 I see two
> > changes, that look suspicious:
> >
> > http://git.kernel.org/?p=linux/kernel/git/jgarzik/libata-dev.git;a=commitdiff;h=31cc23b34913bc173680bdc87af79e551bf8cc0d
> >
> > The comment says: "ahci and sata_sil24 are converted to use ata_std_qc_defer()."
> > But the patch also adds ".qc_defer = ata_std_qc_defer," to sata_nv.c

Looking more at this patch, I think the code change is correct and
only the comment is missing sata_nv. (Only ahci, sil24 and nv seem to
use NCQ and so need the logic from qc_defer.)

> > The second change is the removal of the 'lock' spinlock from sata_nv.c
> > that was used in nv_swncq_qc_issue and nv_swncq_host_interrupt.
> >
> > Should I try to revert one or both of these changes?
>
> If you are git-capable, IMO the next steps in problem elimination should be

... I should really take the time to install this, but I don't think git
will help in this special case, because:

> * download latest linux-2.6.git (currently
> 752097cec53eea111d087c545179b421e2bde98a)
> * build and test linux-2.6.git, to establish a new baseline

2.6.23-rc8-mm1 worked.

> * download latest libata-dev.git#nv-swncq (currently
> 3cb664c2d319a4fde5028c3c5dab6221fe70bd2d)

That commit (3cb664c2d319a4fde5028c3c5dab6221fe70bd2d) seems to be the
only commit relevant to swncq, as it adds it completely without any
partial steps that could be bisected.

> * build and test, with sata_nv module option swncq=0
> * build and test, with sata_nv module option swncq=1

I will try this. Currently I have sata_nv.swncq=1 on my kernel
command line, so it's trivial to change that.
But as only 2 out of 3 boots failed, I think I hit another heisenbug.

> My gut feeling is that there is a lingering bug in sata_nv SWNCQ somewhere.

Older versions of SWNCQ already worked for me, so I don't think it's a
general problem.
And as the symptoms would nicely fit into a race condition when
manipulating the NCQ state, the removal of the lock protecting the
private sata_nv defer_queue between 2.6.23-rc8-mm1 and 2.6.23-mm1
looks like the prime suspect.
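
As an illustration of why such a lock matters, here is a user-space sketch
of a deferred-tag queue whose head and tail are only modified under a
lock. It is an assumption-laden model, not the sata_nv code: the struct
layout, the dq_* names and the 32-slot depth are invented, and a pthread
mutex stands in for the kernel spinlock. The point is only that the issue
path and the interrupt path both read and then modify the same indices, so
without mutual exclusion an update can be lost or a slot handed out twice.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define DEFER_SLOTS 32	/* assumed depth: one slot per possible tag */

/* Hypothetical stand-in for a driver-private defer queue. */
struct defer_queue {
	pthread_mutex_t lock;	/* plays the role of the removed spinlock */
	unsigned int head;	/* next slot to pop */
	unsigned int tail;	/* next slot to fill */
	unsigned int tag[DEFER_SLOTS];
};

void dq_init(struct defer_queue *dq)
{
	pthread_mutex_init(&dq->lock, NULL);
	dq->head = 0;
	dq->tail = 0;
}

/* Issue path: remember a tag that cannot be sent to the drive yet. */
bool dq_push(struct defer_queue *dq, unsigned int tag)
{
	bool ok = false;

	pthread_mutex_lock(&dq->lock);
	if (dq->tail - dq->head < DEFER_SLOTS) {
		/* Read-modify-write of tail stays inside the lock. */
		dq->tag[dq->tail % DEFER_SLOTS] = tag;
		dq->tail++;
		ok = true;
	}
	pthread_mutex_unlock(&dq->lock);
	return ok;
}

/* Completion path: fetch the oldest deferred tag, if there is one. */
bool dq_pop(struct defer_queue *dq, unsigned int *tag)
{
	bool ok = false;

	pthread_mutex_lock(&dq->lock);
	if (dq->head != dq->tail) {
		*tag = dq->tag[dq->head % DEFER_SLOTS];
		dq->head++;
		ok = true;
	}
	pthread_mutex_unlock(&dq->lock);
	return ok;
}
```

Whether the real driver is already covered by another lock on these paths
is exactly what the swncq=0 / swncq=1 test and re-adding the lock should
show; the sketch does not answer that.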

So now I will boot with and without swncq, and if swncq=0 works, I will try
to add the lock back...

Torsten
