* cfq misbehaving on 2.6.11-1.14_FC3
@ 2005-06-10 22:54 spaminos-ker
  2005-06-11  9:29 ` Andrew Morton
  0 siblings, 1 reply; 16+ messages in thread
From: spaminos-ker @ 2005-06-10 22:54 UTC (permalink / raw)
  To: linux-kernel

Hello, I am running into a very bad problem on one of my production servers.

* the config
Linux Fedora core 3 latest everything, kernel 2.6.11-1.14_FC3
AMD Opteron 2 GHz, 1 G RAM, 80 GB Hard drive (IDE, Western Digital)

I have a log processor running in the background, it's using sqlite for storing
the information it finds in the logs. It takes a few hours to complete a run.
It's clearly I/O bound (SleepAVG = 98%, according to /proc/pid/status).
I have to use the cfq scheduler because it's the only scheduler that is fair
between processes (or should be, keep reading).

* the problem
Now, after an hour or so of processing, the machine becomes very unresponsive
when trying to do new disk operations. I say new because existing processes
that stream data to disk don't seem to suffer so much.

On the other hand, opening a blank new file in vi and saving it takes about 5
minutes or so.
Logging in with ssh just times out (so I have to keep a connection open to
avoid being locked out). << that's where it's a really bad problem for me :)

Now, if I switch the disk to anticipatory or deadline, by setting
/sys/block/hda/queue/scheduler, things go back to regular times very quickly.
Saving a file in vi takes about 12 seconds (slow, but not unbearable,
considering the machine is doing a lot of things).
Logging in takes less than a second.
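
For reference, checking and switching the scheduler is just a sysfs read/write
(a sketch, assuming the disk is hda as above; the name in brackets is the
active one):

# show the compiled-in schedulers, active one in brackets
cat /sys/block/hda/queue/scheduler
# switch on the fly; applies to requests queued from now on
echo deadline > /sys/block/hda/queue/scheduler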

I did an strace on the process that is causing havoc, and the usage pattern
is:
* open files
* about 5000 combinations of
  llseek+read
  llseek+write
  in 1000-byte requests
* close files
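
A rough shell approximation of that pattern, purely illustrative ("datafile"
is a stand-in for the real sqlite file, and the writes below clobber its
contents, so only point this at scratch data):

# ~5000 seek+read / seek+write pairs in 1000-byte chunks at random offsets
for i in $(seq 1 5000); do
    off=$((RANDOM % 1000))
    dd if=datafile of=/dev/null bs=1000 count=1 skip=$off 2>/dev/null              # llseek + read
    dd if=/dev/zero of=datafile bs=1000 count=1 seek=$off conv=notrunc 2>/dev/null # llseek + write
done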

The process is also niced to 8, but it doesn't seem to make any difference. I
found references to an "ionice" or "iorenice" syscall, but that doesn't seem to
exist anymore.
I thought that the i/o scheduler was taking the priority into account?

Is this a known problem? I also thought that timed cfq was supposed to take care
of such workloads?

Any idea on how I could improve the situation?

Thanks

Nicolas



* Re: cfq misbehaving on 2.6.11-1.14_FC3
  2005-06-10 22:54 cfq misbehaving on 2.6.11-1.14_FC3 spaminos-ker
@ 2005-06-11  9:29 ` Andrew Morton
  2005-06-14  2:19   ` spaminos-ker
  0 siblings, 1 reply; 16+ messages in thread
From: Andrew Morton @ 2005-06-11  9:29 UTC (permalink / raw)
  To: spaminos-ker; +Cc: linux-kernel

<spaminos-ker@yahoo.com> wrote:
>
> Hello, I am running into a very bad problem on one of my production servers.
> 
>  * the config
>  Linux Fedora core 3 latest everything, kernel 2.6.11-1.14_FC3
>  AMD Opteron 2 GHz, 1 G RAM, 80 GB Hard drive (IDE, Western Digital)
> 
>  I have a log processor running in the background, it's using sqlite for storing
>  the information it finds in the logs. It takes a few hours to complete a run.
>  It's clearly I/O bound (SleepAVG = 98%, according to /proc/pid/status).
>  I have to use the cfq scheduler because it's the only scheduler that is fair
>  between processes (or should be, keep reading).
> 
>  * the problem
>  Now, after an hour or so of processing, the machine becomes very unresponsive
>  when trying to do new disk operations. I say new because existing processes
>  that stream data to disk don't seem to suffer so much.

It might be useful to test 2.6.12-rc6-mm1 - it has a substantially
rewritten CFQ implementation.


* Re: cfq misbehaving on 2.6.11-1.14_FC3
  2005-06-11  9:29 ` Andrew Morton
@ 2005-06-14  2:19   ` spaminos-ker
  2005-06-14  7:03     ` Andrew Morton
  0 siblings, 1 reply; 16+ messages in thread
From: spaminos-ker @ 2005-06-14  2:19 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

--- Andrew Morton <akpm@osdl.org> wrote:
> It might be useful to test 2.6.12-rc6-mm1 - it has a substantially
> rewritten CFQ implementation.
> 

Just did, and while things seem to be a little better, cfq still gets
performance even worse than noop.

For this type of load, I think that cfq should get latencies much lower than
noop.

I ran an automated vi "write to file", to get a more persistent test, on the
different i/o schedulers.

while true ; do time vi -c '%s/a/aa/g' -c '%s/aa/a/g' -c 'x' /root/somefile >
/dev/null ; sleep 1m ; done

For some reason, doing a "cp" or appending to files is very fast. I suspect
that vi's mmap calls are the reason for the latency problem.

The times I got (to save a 200-byte file on ext3), in seconds:

cfq 13,19,23,19,23,15,14,16,14 = 17.3 avg

deadline 7,12,11,15,15,8,17,14,16,11 = 12.6 avg

noop 23,12,14,12,12,13,14,14,14 = 14.2 avg

anticipatory 9,13,13,15,19,15,23,15,12 = 14.8 avg


Here is the memory status

top - 17:07:44 up  1:42,  1 user,  load average: 3.74, 3.62, 3.29
Tasks:  55 total,   2 running,  53 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0% us,  0.0% sy,  0.0% ni,  0.0% id, 99.0% wa,  1.0% hi,  0.0% si
Mem:   1035156k total,  1019344k used,    15812k free,    30092k buffers
Swap:  4192956k total,        0k used,  4192956k free,   671724k cached

and the disk activity (as you can see, mostly writes at this point, as I think
most of the data is cached in memory).

# vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 2  1      0  20368  30320 670780    0    0    45  1189  498   201 26  2 20 52
 0  3      0  19376  30320 671916    0    0   128  1052  512   211 77  5  0 18
 0  3      0  19376  30320 671960    0    0     0  1220  543   231  3  0  0 97
 0  3      0  19128  30320 672136    0    0     0  2284  658   250 13  1  0 86
 0  3      0  19128  30320 672220    0    0     0  1160  535   222  7  0  0 93
 1  2      0  18880  30320 672376    0    0     0  1040  509   204 13  0  0 87
 0  3      0  18756  30320 672496    0    0     0  1076  514   210 11  1  0 88
 0  3      0  18260  30320 672680    0    0     0  1052  559   356 18  3  0 79
 1  1      0  19376  30328 671692    0    0     0   876  529   187 64  3  0 33
 1  3      0  18384  30340 672620    0    0   128  2856  515   197 64  5  0 31
 0  4      0  18136  30340 672856    0    0     0  1204  546   234 21  0  0 79
 0  4      0  18136  30340 672916    0    0     0  1124  530   231  5  2  0 93
 0  4      0  18136  30340 672976    0    0     0  2212  627   255  7  1  0 92
 0  4      0  18012  30340 673064    0    0     0  1092  523   235  7  1  0 92
 0  4      0  17888  30340 673228    0    0     0  1188  545   239 12  0  0 88
 1  3      0  17640  30340 673500    0    0     0  1092  515   229 26  0  0 74
 0  4      0  17392  30340 673684    0    0     0  1032  515   236 15  1  0 84
 1  1      0  17888  30348 672480    0    0     0  1560  568   249 41  4  0 55
 1  3      0  16896  30360 673524    0    0   128  1976  586   223 74  3  0 23
 0  4      0  16524  30360 673800    0    0     0  1112  522   233 25  1  0 74
 0  4      0  16524  30360 673844    0    0     0  1600  588   257  4  1  0 95




* Re: cfq misbehaving on 2.6.11-1.14_FC3
  2005-06-14  2:19   ` spaminos-ker
@ 2005-06-14  7:03     ` Andrew Morton
  2005-06-14 23:21       ` spaminos-ker
  0 siblings, 1 reply; 16+ messages in thread
From: Andrew Morton @ 2005-06-14  7:03 UTC (permalink / raw)
  To: spaminos-ker; +Cc: linux-kernel

<spaminos-ker@yahoo.com> wrote:
>
> --- Andrew Morton <akpm@osdl.org> wrote:
> > It might be useful to test 2.6.12-rc6-mm1 - it has a substantially
> > rewritten CFQ implementation.
> > 
> 
> Just did, and while things seem to be a little better, cfq still gets
> performance even worse than noop.
> 
> For this type of load, I think that cfq should get latencies much lower than
> noop.
> 
> I ran an automated vi "write to file", to get a more persistent test, on the
> different i/o schedulers.
> 
> while true ; do time vi -c '%s/a/aa/g' -c '%s/aa/a/g' -c 'x' /root/somefile >
> /dev/null ; sleep 1m ; done

Bear in mind that after one minute, all of vi's text may have been
reclaimed from pagecache, so the above would have to do a lot of randomish
reads to reload vi into memory.  Try reducing the sleep interval a lot.

> For some reason, doing a "cp" or appending to files is very fast. I suspect
> that vi's mmap calls are the reason for the latency problem.

Don't know.  Try to work out (from vmstat or diskstats) how much reading is
going on.

Try stracing the check, see if your version of vi is doing a sync() or
something odd like that.
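
Something along these lines would show it (just a sketch, reusing the file
names from your loop above):

# log only the sync-style calls vi makes while saving
strace -f -tt -e trace=fsync,fdatasync,msync,sync -o /tmp/vi.trace \
    vi -c '%s/a/aa/g' -c '%s/aa/a/g' -c 'x' /root/somefile
grep -c fsync /tmp/vi.trace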

> the times I got (to save a 200 bytes file on ext3) in seconds:
> 
> cfq 13,19,23,19,23,15,14,16,14 = 17.3 avg
> 
> deadline 7,12,11,15,15,8,17,14,16,11 = 12.6 avg
> 
> noop 23,12,14,12,12,13,14,14,14 = 14.2 avg
> 
> anticipatory 9,13,13,15,19,15,23,15,12 = 14.8 avg
> 

OK, well if the latency is mainly due to reads then one would hope that the
anticipatory scheduler would do better than that.

But what happened to this, from your first report?

> On the other hand, opening a blank new file in vi and saving it takes about 5
> minutes or so.

Are you able to reproduce that 5-minute stall in the more recent testing?



* Re: cfq misbehaving on 2.6.11-1.14_FC3
  2005-06-14  7:03     ` Andrew Morton
@ 2005-06-14 23:21       ` spaminos-ker
  2005-06-17 14:10         ` Jens Axboe
  0 siblings, 1 reply; 16+ messages in thread
From: spaminos-ker @ 2005-06-14 23:21 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

--- Andrew Morton <akpm@osdl.org> wrote:
> > For some reason, doing a "cp" or appending to files is very fast. I suspect
> > that vi's mmap calls are the reason for the latency problem.
> 
> Don't know.  Try to work out (from vmstat or diskstats) how much reading is
> going on.
> 
> Try stracing the check, see if your version of vi is doing a sync() or
> something odd like that.

The read/write pattern of the background process is about 35% reads.

vi is indeed doing a sync on the open file, and that's where the time was
spent.
So I just changed my test to simply opening a file, writing some data in it and
calling flush on the fd.

I also reduced the sleep to 1s instead of 1m, and here are the results:

cfq: 20,20,21,21,20,22,20,20,18,21 - avg 20.3
noop: 12,12,12,13,5,10,10,12,12,13 - avg 11.1
deadline: 16,9,16,14,10,6,8,8,15,9 - avg 11.1
as: 6,11,14,11,9,15,16,9,8,9 - avg 10.8

As you can see, cfq stands out (and it should stand out the other way).

> OK, well if the latency is mainly due to reads then one would hope that the
> anticipatory scheduler would do better than that.

I suspect the latency is due to writes: it seems (and correct me if I am wrong)
that write requests are enqueued in one giant queue, thus the cfq algorithm can
not be applied to the requests.

Either that, or there is a different queue that cancels out the benefits of cfq
when writing (because even though the writes are done the right way, this other
queue to the device keeps way too much data).

But then, why would other i/o schedulers perform better in that case?

> 
> But what happened to this, from your first report?
> 
> > On the other hand, opening a blank new file in vi and saving it takes about
> 5
> > minutes or so.
> 
> Are you able to reproduce that 5-minute stall in the more recent testing?
> 
> 
The most I got with this kernel is a 1-minute stall, so there is improvement
there. Yet, a single process should not be able to cause this kind of stall
with cfq.

Nicolas


------------------------------------------------------------
video meliora proboque deteriora sequor
------------------------------------------------------------


* Re: cfq misbehaving on 2.6.11-1.14_FC3
  2005-06-14 23:21       ` spaminos-ker
@ 2005-06-17 14:10         ` Jens Axboe
  2005-06-17 15:51           ` Andrea Arcangeli
  2005-06-17 23:01           ` spaminos-ker
  0 siblings, 2 replies; 16+ messages in thread
From: Jens Axboe @ 2005-06-17 14:10 UTC (permalink / raw)
  To: spaminos-ker; +Cc: Andrew Morton, linux-kernel

On Tue, Jun 14 2005, spaminos-ker@yahoo.com wrote:
> --- Andrew Morton <akpm@osdl.org> wrote:
> > > For some reason, doing a "cp" or appending to files is very fast. I suspect
> > > that vi's mmap calls are the reason for the latency problem.
> > 
> > Don't know.  Try to work out (from vmstat or diskstats) how much reading is
> > going on.
> > 
> > Try stracing the check, see if your version of vi is doing a sync() or
> > something odd like that.
> 
> The read/write patterns of the background process is about 35% reads.
> 
> vi is indeed doing a sync on the open file, and that's where the time
> was spent.  So I just changed my test to simply opening a file,
> writing some data in it and calling flush on the fd.
> 
> I also reduced the sleep to 1s instead of 1m, and here are the
> results:
> 
> cfq: 20,20,21,21,20,22,20,20,18,21 - avg 20.3
> noop: 12,12,12,13,5,10,10,12,12,13 - avg 11.1
> deadline: 16,9,16,14,10,6,8,8,15,9 - avg 11.1
> as: 6,11,14,11,9,15,16,9,8,9 - avg 10.8
> 
> As you can see, cfq stands out (and it should stand out the other
> way).

This doesn't look good (or expected) at all. In the initial posting you
mention this being an ide driver - I want to make sure if it's hda or
sata driven (eg sda or similar)?

> > OK, well if the latency is mainly due to reads then one would hope that the
> > anticipatory scheduler would do better than that.
> 
> I suspect the latency is due to writes: it seems (and correct me if I
> am wrong) that write requests are enqueued in one giant queue, thus
> the cfq algorithm can not be applied to the requests.

That is correct. Each process has a sync queue associated with it; async
requests like writes go to a per-device async queue. The cost of
tracking who dirtied a given page was too large and not worth it.
Perhaps rmap could be used to lookup who has a specific page mapped...

> But then, why would other i/o schedulers perform better in that case?

Yeah, the global write queue doesn't explain anything; the other
schedulers either share a read/write queue or have a separate single write
queue as well.

I'll try and reproduce (and fix) your problem.

-- 
Jens Axboe



* Re: cfq misbehaving on 2.6.11-1.14_FC3
  2005-06-17 14:10         ` Jens Axboe
@ 2005-06-17 15:51           ` Andrea Arcangeli
  2005-06-17 18:16             ` Jens Axboe
  2005-06-17 23:01           ` spaminos-ker
  1 sibling, 1 reply; 16+ messages in thread
From: Andrea Arcangeli @ 2005-06-17 15:51 UTC (permalink / raw)
  To: Jens Axboe; +Cc: spaminos-ker, Andrew Morton, linux-kernel

On Fri, Jun 17, 2005 at 04:10:40PM +0200, Jens Axboe wrote:
> Perhaps rmap could be used to lookup who has a specific page mapped...

I doubt it; the computing and locking cost for every single page write
would probably be too high. Doing it during swapping isn't a big deal
since cpu is mostly idle during swapouts, but doing it all the time
sounds a bit overkill.

A mechanism to pass down a pid would be much better. However I'm unsure
where you could put the info while dirtying the page. If it was an uid
it might be reasonable to have it in the address_space, but if you want
a pid as index, then it'd need to go in the page_t, which would waste
tons of space. Having a pid in the address space, may not work well with
a database or some other app with multiple processes.


* Re: cfq misbehaving on 2.6.11-1.14_FC3
  2005-06-17 15:51           ` Andrea Arcangeli
@ 2005-06-17 18:16             ` Jens Axboe
  0 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2005-06-17 18:16 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: spaminos-ker, Andrew Morton, linux-kernel

On Fri, Jun 17 2005, Andrea Arcangeli wrote:
> On Fri, Jun 17, 2005 at 04:10:40PM +0200, Jens Axboe wrote:
> > Perhaps rmap could be used to lookup who has a specific page mapped...
> 
> I doubt it; the computing and locking cost for every single page write
> would probably be too high. Doing it during swapping isn't a big deal
> since cpu is mostly idle during swapouts, but doing it all the time
> sounds a bit overkill.

We could cut the lookup down to per-request; it's not very likely that
separate threads would be competing for the exact same disk location.
But it's still not too nice...

> A mechanism to pass down a pid would be much better. However I'm unsure
> where you could put the info while dirtying the page. If it was an uid
> it might be reasonable to have it in the address_space, but if you want
> a pid as index, then it'd need to go in the page_t, which would waste
> tons of space. Having a pid in the address space, may not work well with
> a database or some other app with multiple processes.

The previous patch just added a pid_t to struct page, but I knew all
along that this was just for testing, I never intended to merge that
part.

-- 
Jens Axboe



* Re: cfq misbehaving on 2.6.11-1.14_FC3
  2005-06-17 14:10         ` Jens Axboe
  2005-06-17 15:51           ` Andrea Arcangeli
@ 2005-06-17 23:01           ` spaminos-ker
  2005-06-22  9:24             ` Jens Axboe
  1 sibling, 1 reply; 16+ messages in thread
From: spaminos-ker @ 2005-06-17 23:01 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Andrew Morton, linux-kernel

--- Jens Axboe <axboe@suse.de> wrote:
> This doesn't look good (or expected) at all. In the initial posting you
> mention this being an ide driver - I want to make sure if it's hda or
> sata driven (eg sda or similar)?

This is a regular IDE drive (a WDC WD800JB), no SATA, using hda.

I didn't mention it before, but this is on an AMD8111 board.

> 
> I'll try and reproduce (and fix) your problem.

I don't know how all this works, but would there be a way to slow down the
offending writer by not allowing too many pending write requests per process?
Is there a tunable for the size of the write queue for a given device?
Reducing it will reduce the throughput, but the latency as well.

Of course, there has to be a way to get this to work right.

To go back to high latencies, maybe a different problem (but at least closely
related):

If I start in the background the command
dd if=/dev/zero of=/tmp/somefile2 bs=1024

and then run my test program in a loop, with
while true ; do time ./io 1; sleep 1s ; done

I get:

cfq: 47,33,27,48,32,29,26,49,25,47 -> 36.3 avg
deadline: 32,28,52,33,35,29,49,39,40,33 -> 37 avg
noop: 62,47,57,39,59,44,56,49,57,47 -> 51.7 avg

Now, cfq doesn't behave worse than the others, as expected (why it
behaved worse with the real daemons, I don't know).
Still, > 30 seconds has to be improved for cfq.

the test program being:

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>      /* perror */
#include <unistd.h>     /* write, fsync, close */

int main(int argc, char **argv) {
        int fd;
        ssize_t bytes;  /* ssize_t so a failed write (-1) is caught below */

        fd = open("/tmp/somefile", O_WRONLY | O_CREAT, S_IRWXU);
        if (fd < 0) {
                perror("Could not open file");
                return 1;
        }
        bytes = write(fd, &fd, sizeof(fd));
        if (bytes < (ssize_t)sizeof(fd)) {
                perror("Could not write");
                return 2;
        }
        /* any command line argument: force the data to disk */
        if (argc != 1) {
                fsync(fd);
        }
        close(fd);
        return 0;
}



* Re: cfq misbehaving on 2.6.11-1.14_FC3
  2005-06-17 23:01           ` spaminos-ker
@ 2005-06-22  9:24             ` Jens Axboe
  2005-06-22 17:54               ` spaminos-ker
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2005-06-22  9:24 UTC (permalink / raw)
  To: spaminos-ker; +Cc: Andrew Morton, linux-kernel

On Fri, 2005-06-17 at 16:01 -0700, spaminos-ker@yahoo.com wrote:
> I don't know how all this works, but would there be a way to slow down the
> offending writer by not allowing too many pending write requests per process?
> Is there a tunable for the size of the write queue for a given device?
> Reducing it will reduce the throughput, but the latency as well.

The 2.4 SUSE kernel actually has something in place to limit in-flight
write requests against a single device. cfq will already limit the
number of write requests you can have in-flight against a single queue,
but it's request based and not size based.
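
(For reference, both the generic limit and the scheduler-specific tunables sit
in sysfs - a sketch only; exact tunable names vary between 2.6.x releases:)

# per-queue request limit, counted in requests rather than bytes
cat /sys/block/hda/queue/nr_requests
echo 64 > /sys/block/hda/queue/nr_requests
# scheduler-specific knobs (cfq's included) live one level down
ls /sys/block/hda/queue/iosched/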

> Of course, there has to be a way to get this to work right.
> 
> To go back to high latencies, maybe a different problem (but at least closely
> related):
> 
> If I start in the background the command
> dd if=/dev/zero of=/tmp/somefile2 bs=1024
> 
> and then run my test program in a loop, with
> while true ; do time ./io 1; sleep 1s ; done
> 
> I get:
> 
> cfq: 47,33,27,48,32,29,26,49,25,47 -> 36.3 avg
> deadline: 32,28,52,33,35,29,49,39,40,33 -> 37 avg
> noop: 62,47,57,39,59,44,56,49,57,47 -> 51.7 avg
> 
> Now, cfq doesn't behave worse than the others, as expected (why it
> behaved worse with the real daemons, I don't know).
> Still, > 30 seconds has to be improved for cfq.

The problem here is that cfq (and the other io schedulers) still
consider the io async even if fsync() ends up waiting for it to
complete. So there's no real QOS being applied to these pending writes,
and I don't immediately see how we can improve that situation right now.

What file system are you using? I ran your test on ext2, and it didn't
give me more than ~2 seconds latency for the fsync. Tried reiserfs now,
and it's in the 23-24 range.

-- 
Jens Axboe <axboe@suse.de>



* Re: cfq misbehaving on 2.6.11-1.14_FC3
  2005-06-22  9:24             ` Jens Axboe
@ 2005-06-22 17:54               ` spaminos-ker
  2005-06-22 20:43                 ` Jens Axboe
  0 siblings, 1 reply; 16+ messages in thread
From: spaminos-ker @ 2005-06-22 17:54 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Andrew Morton, linux-kernel

--- Jens Axboe <axboe@suse.de> wrote:
> The problem here is that cfq (and the other io schedulers) still
> consider the io async even if fsync() ends up waiting for it to
> complete. So there's no real QOS being applied to these pending writes,
> and I don't immediately see how we can improve that situation right now.
<I might sound stupid>
I still don't understand why async requests are in a different queue than the
sync ones?
Wouldn't it be simpler to consider all the IO the same, and like you pointed
out, consider synced IO to be equivalent to async + some sync (as in wait for
completion) call (fsync goes a little too far).
</I might sound stupid>

> 
> What file system are you using? I ran your test on ext2, and it didn't
> give me more than ~2 seconds latency for the fsync. Tried reiserfs now,
> and it's in the 23-24 range.
> 
I am using ext3 on Fedora Core 3.




* Re: cfq misbehaving on 2.6.11-1.14_FC3
  2005-06-22 17:54               ` spaminos-ker
@ 2005-06-22 20:43                 ` Jens Axboe
  2005-06-23 18:30                   ` spaminos-ker
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2005-06-22 20:43 UTC (permalink / raw)
  To: spaminos-ker; +Cc: Andrew Morton, linux-kernel

On Wed, Jun 22 2005, spaminos-ker@yahoo.com wrote:
> --- Jens Axboe <axboe@suse.de> wrote:
> > The problem here is that cfq (and the other io schedulers) still
> > consider the io async even if fsync() ends up waiting for it to
> > complete. So there's no real QOS being applied to these pending writes,
> > and I don't immediately see how we can improve that situation right now.
> <I might sound stupid>
> I still don't understand why async requests are in a different queue than the
> sync ones?
> Wouldn't it be simpler to consider all the IO the same, and like you pointed
> out, consider synced IO to be equivalent to async + some sync (as in wait for
> completion) call (fsync goes a little too far).
> </I might sound stupid>

First, let's cover a little terminology. All io is really async in Linux;
the block io model is inherently async in nature. So sync io is really
just async io that is being waited on immediately. When I talk about
sync and async io in the context of the io scheduler, the sync io refers
to io that is wanted right away. That would be reads or direct writes.
The async io is something that we can complete at will, where latency
typically doesn't matter. That would be normal dirtying of data that
needs to be flushed to disk.

Another property of sync io in the io scheduler is that it usually
implies that another sync io request will follow immediately (well,
almost) after one has completed. So there's a dependency relation between
sync requests that async requests don't share.

So there are different requirements for sync and async io. The io
scheduler tries to minimize latencies for async requests somewhat,
mainly just by making sure that it isn't starved for too long. However,
when you do an fsync, you want to complete lots of writes, but the io
scheduler doesn't get this info passed down. If you keep flooding the
queue with new writes, this could take quite a while to finish. We could
improve this situation by only flushing out the needed data, or just a
simple hack to only flush out already queued io (provided the fsync()
already made sure that the correct data is already queued).

I will try and play a little with this; it's definitely something that
would be interesting and worthwhile to improve.

> > What file system are you using? I ran your test on ext2, and it didn't
> > give me more than ~2 seconds latency for the fsync. Tried reiserfs now,
> > and it's in the 23-24 range.
> > 
> I am using ext3 on Fedora Core 3.

Journalled file systems will behave worse for this, because they have to
tend to the journal as well. Can you try mounting that partition as ext2
and see what numbers that gives you?

-- 
Jens Axboe



* Re: cfq misbehaving on 2.6.11-1.14_FC3
  2005-06-22 20:43                 ` Jens Axboe
@ 2005-06-23 18:30                   ` spaminos-ker
  2005-06-23 23:33                     ` Con Kolivas
  0 siblings, 1 reply; 16+ messages in thread
From: spaminos-ker @ 2005-06-23 18:30 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Andrew Morton, linux-kernel

--- Jens Axboe <axboe@suse.de> wrote:
> Journalled file systems will behave worse for this, because they have to
> tend to the journal as well. Can you try mounting that partition as ext2
> and see what numbers that gives you?

I did the tests again on a partition that I could mkfs/mount at will.

On ext3, I get about 33 seconds average latency.

And on ext2, as predicted, I have latencies averaging about 0.4 seconds.

I also tried reiserfs, and it gets about 22 seconds latency.

As you pointed out, it seems that there is a flaw in the way IO queues and
journals (which are in some ways queues as well) interact in the presence of
flushes.

Nicolas



* Re: cfq misbehaving on 2.6.11-1.14_FC3
  2005-06-23 18:30                   ` spaminos-ker
@ 2005-06-23 23:33                     ` Con Kolivas
  2005-06-24  2:33                       ` spaminos-ker
  0 siblings, 1 reply; 16+ messages in thread
From: Con Kolivas @ 2005-06-23 23:33 UTC (permalink / raw)
  To: linux-kernel, spaminos-ker; +Cc: Jens Axboe, Andrew Morton


On Fri, 24 Jun 2005 04:30, spaminos-ker@yahoo.com wrote:
> --- Jens Axboe <axboe@suse.de> wrote:
> > Journalled file systems will behave worse for this, because they have to
> > tend to the journal as well. Can you try mounting that partition as ext2
> > and see what numbers that gives you?
>
> I did the tests again on a partition that I could mkfs/mount at will.
>
> On ext3, I get about 33 seconds average latency.
>
> And on ext2, as predicted, I have latencies averaging about 0.4
> seconds.
>
> I also tried reiserfs, and it gets about 22 seconds latency.
>
> As you pointed out, it seems that there is a flaw in the way IO queues and
> journals (which are in some ways queues as well) interact in the presence
> of flushes.

I found the same, and the effect was blunted by noatime and 
journal_data_writeback (on ext3). Try them one at a time and see what you 
get.

Cheers,
Con



* Re: cfq misbehaving on 2.6.11-1.14_FC3
  2005-06-23 23:33                     ` Con Kolivas
@ 2005-06-24  2:33                       ` spaminos-ker
  2005-06-24  3:27                         ` Con Kolivas
  0 siblings, 1 reply; 16+ messages in thread
From: spaminos-ker @ 2005-06-24  2:33 UTC (permalink / raw)
  To: Con Kolivas, linux-kernel; +Cc: Jens Axboe, Andrew Morton

--- Con Kolivas <kernel@kolivas.org> wrote:
> I found the same, and the effect was blunted by noatime and 
> journal_data_writeback (on ext3). Try them one at a time and see what you 
> get.

I had to move to a different box, but get the same kind of results (for ext3
default mount options).

Here are the latencies (all cfq) I get with different values for the mount
parameters

ext2 default
0.1s

ext3 default
52.6s avg

reiser defaults
29s avg for 5 minutes,
then 12.9s avg

ext3 rw,noatime,data=writeback
0.1s avg

reiser rw,noatime,data=writeback
4s avg for 20 seconds
then 0.1 seconds avg


So, indeed adding noatime,data=writeback to the mount options improves things a
lot.
I also tried without the noatime, and that doesn't make much difference to me.
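
For the record, each run boiled down to something like this (hdb1 and
/mnt/test are placeholders for my scratch partition; ext3's data= mode
generally has to be picked at mount time rather than changed on a live
remount):

mkfs.ext3 /dev/hdb1
mount -t ext3 -o noatime,data=writeback /dev/hdb1 /mnt/test
# or the equivalent /etc/fstab entry:
# /dev/hdb1  /mnt/test  ext3  noatime,data=writeback  0 0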

That looks like a good workaround, I'll now try with the actual server and see
how things go.

Nicolas



* Re: cfq misbehaving on 2.6.11-1.14_FC3
  2005-06-24  2:33                       ` spaminos-ker
@ 2005-06-24  3:27                         ` Con Kolivas
  0 siblings, 0 replies; 16+ messages in thread
From: Con Kolivas @ 2005-06-24  3:27 UTC (permalink / raw)
  To: spaminos-ker; +Cc: linux-kernel, Jens Axboe, Andrew Morton


On Fri, 24 Jun 2005 12:33, spaminos-ker@yahoo.com wrote:
> --- Con Kolivas <kernel@kolivas.org> wrote:
> > I found the same, and the effect was blunted by noatime and
> > journal_data_writeback (on ext3). Try them one at a time and see what you
> > get.
>
> I had to move to a different box, but get the same kind of results (for
> ext3 default mount options).
>
> Here are the latencies (all cfq) I get with different values for the mount
> parameters
>
> ext2 default
> 0.1s
>
> ext3 default
> 52.6s avg
>
> reiser defaults
> 29s avg for 5 minutes,
> then 12.9s avg
>
> ext3 rw,noatime,data=writeback
> 0.1s avg
>
> reiser rw,noatime,data=writeback
> 4s avg for 20 seconds
> then 0.1 seconds avg
>
>
> So, indeed adding noatime,data=writeback to the mount options improves
> things a lot.
> I also tried without the noatime, and that doesn't make much difference to
> me.
>
> That looks like a good workaround, I'll now try with the actual server and
> see how things go.

That's more or less what I found, although noatime also helped my test
cases, just less than the journal options did. Coincidentally, I only
discovered this recently and hadn't gotten around to telling anyone how
dramatic it was, so this seemed as good a time as any. I am suspicious that
it wasn't this bad in past kernels, but I haven't been able to instrument
earlier kernels to check.

Cheers,
Con


