On Mon, 2009-03-30 at 09:58 -0700, Linus Torvalds wrote:
>
> On Mon, 30 Mar 2009, Mark Lord wrote:
> >
> > I spent an entire day recently, trying to see if I could significantly fill
> > up the 32MB cache on a 750GB Hitachi SATA drive here.
> >
> > With deliberate/random write patterns, big and small, near and far,
> > I could not fill the drive with anything approaching a full second
> > of latent write-cache flush time.
> >
> > Not even close.  Which is a pity, because I really wanted to do some testing
> > related to a deep write cache.  But it just wouldn't happen.
> >
> > I tried this again on a 16MB cache of a Seagate drive, no difference.
> >
> > Bummer.  :)
>
> Try it with laptop drives. You might get to a second, or at least hundreds
> of ms (not counting the spinup delay if it went to sleep, obviously). You
> probably tested desktop drives (that 750GB Hitachi one is not a low end
> one, and I assume the Seagate one isn't either).

I had some fun trying things with this, and I've been able to reliably
trigger write-cache stalls of ~60 seconds on my Seagate 500GB SATA
drive.  The worst I saw was 214 seconds.

It took a little experimentation, and I had to switch to the noop
scheduler (no idea why).

Also, I had to watch vmstat closely.  When the test first started,
vmstat was reporting 500KB/s or so of write throughput.  After the test
ran for a few minutes, vmstat jumped up to 8MB/s.  My guess is that the
drive has some internal threshold for when it decides to only write in
cache.  The switch to 8MB/s is when it switched to cache-only goodness.

Or perhaps the attached program is buggy and I'll end up looking
silly... it was some quick coding.

The test forks two procs.  One proc does 4k writes to the first 26MB of
the test file (/dev/sdb for me).  These writes are O_DIRECT, and use a
block size of 4k.  The idea is that we fill the cache with work that is
very beneficial to keep in cache, but that the drive will tend to flush
out because it is filling up tracks.

The second proc O_DIRECT writes to two adjacent sectors far away from
the hot writes from the first proc, and it puts in a timestamp from
just before the write.  Every second or so, this timestamp is printed
to stderr.  The drive will want to keep these two sectors in cache
because we are constantly overwriting them.

(It's worth mentioning this is a destructive test.  Running it on
/dev/sdb will overwrite the first 64MB of the drive!!!!)

Sample output:

# ./wb-latency /dev/sdb
Found tv 1238434622.461527
starting hot writes run
starting tester run
current time 1238435045.529751
current time 1238435046.531250
...
current time 1238435063.772456
current time 1238435064.788639
current time 1238435065.814101
current time 1238435066.847704

Right here, I pull the power cord.  The box comes back up, and I run:

# ./wb-latency -c /dev/sdb
Found tv 1238435067.347829

When -c is passed, it just reads the timestamp out of the timestamp
block and exits.  You compare this value with the value printed just
before you pulled the plug.

For the run here, the two values are within .5s of each other.  The
tester only prints the time every one second, so anything that close is
very good.  I had pulled the plug before the drive got into that fast
8MB/s mode, so the drive was doing a pretty good job of fairly
servicing the cache.

My drive has a cache of 32MB.  Smaller caches probably need a smaller
hot zone.

-chris
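[Editor's note: the attached wb-latency program is not included in this
excerpt.  The sketch below is a minimal reconstruction from the
description in the mail, not Chris's actual code; the program name, the
26MB hot zone, and the 63MB offset of the timestamp sectors are
assumptions.  Like the original, it is destructive to the device you
point it at.]

/*
 * wb-latency (sketch): hammer a hot zone with 4k O_DIRECT writes while
 * repeatedly overwriting two far-away sectors with a timestamp.  After
 * pulling the power, run with -c to see the last timestamp that
 * actually reached the media.  DESTRUCTIVE to the given device.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/time.h>

#define BLOCK_SIZE	4096
#define HOT_ZONE	(26 * 1024 * 1024)	/* hot random-write area */
#define STAMP_OFFSET	(63 * 1024 * 1024)	/* assumed: far from hot zone */
#define STAMP_SIZE	1024			/* two adjacent 512 byte sectors */

/* child: rewrite random 4k blocks inside the hot zone forever */
static void hot_writes(int fd, char *buf)
{
	fprintf(stderr, "starting hot writes run\n");
	while (1) {
		off_t block = random() % (HOT_ZONE / BLOCK_SIZE);
		if (pwrite(fd, buf, BLOCK_SIZE, block * BLOCK_SIZE) != BLOCK_SIZE) {
			perror("hot pwrite");
			exit(1);
		}
	}
}

/* parent: keep overwriting the far-away sectors with the current time */
static void stamp_writes(int fd, char *buf)
{
	struct timeval tv;
	time_t last = 0;

	fprintf(stderr, "starting tester run\n");
	while (1) {
		gettimeofday(&tv, NULL);
		memcpy(buf, &tv, sizeof(tv));
		if (pwrite(fd, buf, STAMP_SIZE, STAMP_OFFSET) != STAMP_SIZE) {
			perror("stamp pwrite");
			exit(1);
		}
		if (tv.tv_sec != last) {	/* print roughly once a second */
			fprintf(stderr, "current time %ld.%06ld\n",
				(long)tv.tv_sec, (long)tv.tv_usec);
			last = tv.tv_sec;
		}
	}
}

int main(int argc, char **argv)
{
	int check, fd;
	const char *dev;
	struct timeval tv;
	char *buf;

	if (argc < 2) {
		fprintf(stderr, "usage: wb-latency [-c] <device>\n");
		return 1;
	}
	check = argc > 2 && !strcmp(argv[1], "-c");
	dev = argv[argc - 1];

	if (posix_memalign((void **)&buf, BLOCK_SIZE, BLOCK_SIZE)) {
		perror("posix_memalign");
		return 1;
	}
	memset(buf, 0, BLOCK_SIZE);

	fd = open(dev, (check ? O_RDONLY : O_RDWR) | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* report the last timestamp that actually made it to the platter */
	if (pread(fd, buf, STAMP_SIZE, STAMP_OFFSET) != STAMP_SIZE) {
		perror("pread");
		return 1;
	}
	memcpy(&tv, buf, sizeof(tv));
	fprintf(stderr, "Found tv %ld.%06ld\n", (long)tv.tv_sec, (long)tv.tv_usec);
	if (check)
		return 0;

	if (fork() == 0)
		hot_writes(fd, buf);
	stamp_writes(fd, buf);
	return 0;
}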