From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1751943AbZH1OrX (ORCPT ); Fri, 28 Aug 2009 10:47:23 -0400
Received: (majordomo@vger.kernel.org) by vger.kernel.org
        id S1751796AbZH1OrU (ORCPT ); Fri, 28 Aug 2009 10:47:20 -0400
Received: from mail.lang.hm ([64.81.33.126]:38620 "EHLO bifrost.lang.hm"
        rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
        id S1751135AbZH1OrS (ORCPT ); Fri, 28 Aug 2009 10:47:18 -0400
Date: Fri, 28 Aug 2009 07:46:42 -0700 (PDT)
From: david@lang.hm
X-X-Sender: dlang@asgard.lang.hm
To: David Woodhouse
cc: Theodore Tso, Pavel Machek, Ric Wheeler, Florian Weimer,
        Goswin von Brederlow, Rob Landley, kernel list, Andrew Morton,
        mtk.manpages@gmail.com, rdunlap@xenotime.net,
        linux-doc@vger.kernel.org, linux-ext4@vger.kernel.org, corbet@lwn.net
Subject: Re: [patch] ext2/3: document conditions when reliable operation is possible
In-Reply-To: <1251362787.4354.373.camel@macbook.infradead.org>
Message-ID:
References: <20090824093143.GD25591@elf.ucw.cz> <82k50tjw7u.fsf@mid.bfk.de>
        <20090824130125.GG23677@mit.edu> <20090824195159.GD29763@elf.ucw.cz>
        <4A92F6FC.4060907@redhat.com> <20090824205209.GE29763@elf.ucw.cz>
        <4A930160.8060508@redhat.com> <20090824212518.GF29763@elf.ucw.cz>
        <20090824223915.GI17684@mit.edu> <20090824230036.GK29763@elf.ucw.cz>
        <20090825000842.GM17684@mit.edu>
        <1251362787.4354.373.camel@macbook.infradead.org>
User-Agent: Alpine 2.00 (DEB 1167 2008-08-23)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; format=flowed; charset=US-ASCII
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, 27 Aug 2009, David Woodhouse wrote:

> On Mon, 2009-08-24 at 20:08 -0400, Theodore Tso wrote:
>>
>> (It's worse with people using Digital SLR's shooting in raw mode,
>> since it can take upwards of 30 seconds or more to write out a 12-30MB
>> raw image, and if you eject at the wrong time, you can trash the
>> contents of
>> the entire CF card; in the worst case, the Flash
>> Translation Layer data can get corrupted, and the card is completely
>> ruined; you can't even reformat it at the filesystem level, but have
>> to get a special Windows program from the CF manufacturer to --maybe--
>> reset the FTL layer.
>
> This just goes to show why having this "translation layer" done in
> firmware on the device itself is a _bad_ idea. We're much better off
> when we have full access to the underlying flash and the OS can actually
> see what's going on. That way, we can actually debug, fix and recover
> from such problems.
>
>> Early CF cards were especially vulnerable to
>> this; more recent CF cards are better, but it's a known failure mode
>> of CF cards.)
>
> It's a known failure mode of _everything_ that uses flash to pretend to
> be a block device. As I see it, there are no SSD devices which don't
> lose data; there are only SSD devices which haven't lost your data
> _yet_.
>
> There's no fundamental reason why it should be this way; it just is.
>
> (I'm kind of hoping that the shiny new expensive ones that everyone's
> talking about right now, that I shouldn't really be slagging off, are
> actually OK. But they're still new, and I'm certainly not trusting them
> with my own data _quite_ yet.)

so what sort of test would be needed to identify whether a device has this
problem?

people can do ad-hoc tests by pulling devices while they are in use and then
checking the entire device, but something better should be available.

it seems to me that two things are needed to define the tests:

1. a predictable write load, so that it's easy to detect data getting lost

2. some statistical analysis to decide how many device pulls are needed
   (under the write load defined in #1) to make the odds high that the
   problem will be revealed.
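
to make the two items above concrete, here is a rough sketch, not a finished test tool. everything named in it is my own assumption for illustration: the 4096-byte block size, the idea of tagging each block with its index and a write "generation", and the helper names block_payload, classify_block and pulls_needed. the pull-count formula also assumes that each pull exposes the problem independently with the same probability p, which a real device may not honor.

```python
import hashlib
import math

BLOCK_SIZE = 4096  # assumed block size; the thread does not specify one


def block_payload(seed: bytes, index: int, generation: int) -> bytes:
    """Deterministic contents for block `index` on write pass `generation`.

    Because the contents depend only on (seed, index, generation), a checker
    can recompute what every block should hold after a pull.
    """
    header = b"%d:%d:" % (index, generation)
    out = bytearray()
    counter = 0
    while len(out) < BLOCK_SIZE:
        out += hashlib.sha256(seed + header + b"%d" % counter).digest()
        counter += 1
    return bytes(out[:BLOCK_SIZE])


def classify_block(data, seed, index, acked, inflight=None):
    """Classify one block read back after a power pull.

    `acked` is the last generation known to be flushed for this block;
    `inflight` is the generation being written at pull time, if any.
    """
    if data == block_payload(seed, index, acked):
        return "ok"
    if inflight is not None and data == block_payload(seed, index, inflight):
        return "ok"  # the in-flight write made it to the media
    # Damage to a block that was not being written is the "unrelated data
    # being lost" case that the test is meant to expose.
    return "in-flight-damage" if inflight is not None else "unrelated-loss"


def pulls_needed(p_per_pull: float, confidence: float = 0.95) -> int:
    """Smallest n with 1 - (1 - p)**n >= confidence (independent pulls assumed)."""
    if not 0.0 < p_per_pull < 1.0:
        raise ValueError("p_per_pull must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_per_pull))


if __name__ == "__main__":
    for p in (0.5, 0.1, 0.01):
        print(f"p={p}: {pulls_needed(p)} pulls for 95% odds of seeing a loss")
```

read the other way around, if n pulls show no loss at all, the usual "rule of three" puts the 95% upper bound on the per-pull loss probability at roughly 3/n, so a clean run of 300 pulls only suggests the problem shows up in less than about 1% of pulls.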

with this we could have people (or businesses; I think the tech hardware
sites would jump on this given some sort of accepted test) test various
devices and report whether the test detects unrelated data being lost.

for USB devices there may be a way to use the power management functions
to cut power to the device without requiring it to be physically pulled.
if this is the case (even if it only works on some specific chipsets), it
would drastically speed up the testing.

David Lang