qemu-devel.nongnu.org archive mirror
* [Qemu-devel] [Bug 1745312] [NEW] Regression report: Disk subsystem I/O failures/issues surfacing in DOS/early Windows [two separate issues: one bisected, one root-caused]
@ 2018-01-25  7:18 i336_
From: i336_ @ 2018-01-25  7:18 UTC
  To: qemu-devel

Public bug reported:

[Headsup: This report is long-ish due to the amount of detail I've
stumbled on along the way that I think is relevant to include. I can't
speak as to the complexity of the actual bugs, but the size of this
report should not suggest that the reproduction process is particularly
headache-inducing.]

Hi!

I recently needed to fire up some ancient software for research purposes
and got very distracted discovering and playing with old versions of
Windows :). In the process I've discovered some glitches with disk I/O.

I believe I've stumbled on two completely separate issues that
coincidentally surfaced at the same time. It's possible that components
of this report will be re-filed as more specific new bugs, but I'm not
an authority on QEMU internals or how to narrow down/categorize what
I've found.

- The first bug only surfaces when the "isapc" machine type is used. It
intermittently produces "General failure {read,writ}ing drive _" under
MS-DOS 6.22, and also somehow interferes with early bootstrap of Windows
NT 4 (in NTLDR). Enabling or disabling KVM (I'm on Linux) appears to
make no difference whatsoever, which may help with debugging.

- The second issue involves
  - a WinNT4 disk image
  - created by running through a bog-standard NT4 install inside QEMU 2.9.0
  - which will now fail to boot in any version of QEMU - even version 1.0
    - but which VirtualBox will boot fine
      - but only if I point VirtualBox at QEMU's raw disk image via a
        hacked-together VMDK file
      - if the raw image is converted to VHD(X), VirtualBox will also fail
        to boot the image with exactly the same error as QEMU
      - this state of affairs is not affected by image sparseness (which makes
        sense)

I'm confident I've bisected the first issue.

I wasn't able to bisect the second issue (as all tested versions of QEMU
behaved identically), but I've figured out a working repro testcase and
I believe I've managed to pin down a solid root cause.


== #1: Intermittent I/O issues when `-M isapc` is used =====

These symptoms sometimes take a small amount of time and fiddling to
trigger, but I AM able to consistently surface them on my machine after
a short while. (I am very very interested to hear if others cannot
reproduce them.)

So, first of all:

https://github.com/qemu/qemu/commit/306ec6c3cece7004429c79c1ac93d49919f1f1cc
  (Jul 30 2013): the last version that works

https://github.com/qemu/qemu/commit/e689f7c668cbd9d08f330e17c3dd3a059c9553d3
  (Oct 30 2013): the first version that intermittently fails

Maybe check out and build these two commits while reading. *shrug*
(How to do this can be found at the end of this report, along with a
time-saving ./configure line, FWIW.)

Here are the changelists between these two revisions:

https://github.com/qemu/qemu/compare/306ec6c...e689f7c
(Compare direction: OLD to NEW) (Commits: 166  Files changed: 192)

https://github.com/qemu/qemu/compare/e689f7c...306ec6c
(Compare direction: NEW to OLD) (Commits: 30   Files changed: 22)

(Someone else more familiar with Git might know why GitHub returns
results for both compare directions, and/or if the 2nd link is useful
information. The first link returns a lot more results than the 2nd one,
at least. Does comparing new>old return deletions?)

---

Now on to the symptoms. In a moment I'll describe reproduction.

# MS-DOS 6.22

The first symptom I discovered was that trivial read and write
operations under MS-DOS would sometimes fail:

  C:\>echo test > hi

  General failure writing drive C
  Abort, Retry, Fail?

Anything else that exercises the disk behaves similarly:

  C:\>dir /s > nul

  General failure reading drive C
  Abort, Retry, Fail?

(Note that the above demonstrates both write and read failures)

(Also, FWIW, `dir /s` == `ls -R`)

The behavior of the I/O errors is not possible to characterise as it
fluctuates so much. For example something as simple as DIR can produce
wildly differing results: in one run, poking around with DIR ended with
DOS deciding C:\ was empty at one point; at another point in a different
run C:\ mysteriously dropped 50% of its contents only to magically gain
it all back moments later after some poking around in one of the
subdirectories that was still visible.

The time it takes to trigger these errors is also highly variable. QEMU
may fall over as early as hanging forever at "Starting MS-DOS...", or I
might get all the way into Windows 3.1 before it triggers (in which case
Win3.1 reports vague memory errors - of all things).

Very occasionally I've seen _SeaBIOS itself_ report "Booting from Hard
Disk..." "Boot failed: could not read the boot disk" ... "No bootable
device.", and on one occasion I even got "Non-System disk or disk error"
"Replace and strike any key when ready"!


# WinNT 4 Terminal Server

Most of the time, NTLDR will fire up normally. But every so often...

  SeaBIOS (version rel-1.7.3-117-g31b8b4e-20131206_080705-nilsson.home.kraxel.org)

  Booting from Hard Disk...
  A disk read error occurred.
  Insert a system diskette and restart
  the system.

(NB. You're seeing the old SeaBIOS version included with e689f7c, which
was the first buggy commit.)

If NT gets past this point without erroring out (ie, it makes it to the
boot menu), the rest of the system is 100% fine and there are no other
disk I/O issues whatsoever. For example, on QEMU 2.9.0 I was able to
enable disk compression, answer "Yes" to "Compress entire disk now?" and
have the process fully complete. No hitches.

This makes me wonder whether the issue is related to LBA and/or Int 13h,
or something in that general area of functionality. (I'm woefully
ignorant about such low-level details.) Perhaps DOS/Win3.1 are stuck
using a disk access mode that QEMU implements incorrectly, while NT 4
(once NTOSKRNL is up and running) is able to use a different mode or
access mechanism.

I'm really interested to get some understanding of what the root issue
is here, when this is fixed. (I wonder if it's a timing thing?)

I've observed some unusual behavior with repeated restarts. In one case,
I attempted to start NT4 multiple times, and QEMU consistently failed
with "No bootable device" each time. So, I removed `-M isapc`, promptly
got a boot menu, hit ^C, readded `-M isapc` - and continued to get a
boot menu. Yep. I'll accept "really really big coincidence" but I do
very much wonder if something else is going on here. I've observed many
similar incidents. It makes me wonder whether the contents of memory or
some other system state is an influence. Very probably not, but still...


-- Reproduction --------------------------------------------

First of all, there was unfortunately no way for me to avoid having to
post entire disk images, but I've managed to compress everything down to
174MB total download size.

FWIW, WinWorld and many other sites seem to have no operational issues
providing clear pointers to CD keys; I consider my distribution of my
installed HDD images an extension of the apparent status quo.

That being said, I've put everything on Google Drive so nobody has to
headscratch about Launchpad/Canonical/etc's stance on hosting this data.

So, this folder contains the disk images:
https://drive.google.com/drive/folders/1WdZVBh5Trs9HLC186-nHyKeqaSxfyM2c
("Download all" at the top-right will create a ZIP file, but FWIW
downloading the individual files simultaneously would implement a rough
form of download acceleration)

File meta info:

Compressed
|
|      Apparent
|      |    Actual
|      |    |
38M -> 200M (103M)  win31.img.xz
82M -> 1G   (289M)  wnt4ts-broken.img.xz
55M -> 350M (146M)  wnt4ts-intermittent.img.xz

SHA-256s:

win31:        8179b8180a2ab40bd472e8a2f3fb89fc331651e56923f94ceb9e52a78ee220d2
broken:       a2af5f0bc49a063b75f534b6ffe5b82e32ecc706a64a425b6626feccf6e3fdfa  
intermittent: 77ae8c458829ebcdd64c71042012f45d5a2788e6ebd22db9d53de9ef1a574784

(Wanted to keep the checksum lines within 80 columns)
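
The report does not say whether these checksums apply to the .xz
downloads or to the decompressed images, so a check may need to try
both. A minimal sketch, assuming GNU coreutils `sha256sum` and `xz` are
installed; `check_sha256` is a made-up helper name:

```shell
#!/bin/bash
# Sketch only; check_sha256 is a made-up helper name.

# check_sha256 FILE EXPECTED_HEX: succeed only if FILE's SHA-256 matches
check_sha256() {
	local actual
	actual=$(sha256sum "$1" 2>/dev/null | cut -d' ' -f1)
	[ -n "$actual" ] && [ "$actual" = "$2" ]
}

# Example: the checksum might refer to either the .xz or the unpacked image
# sum=8179b8180a2ab40bd472e8a2f3fb89fc331651e56923f94ceb9e52a78ee220d2
# check_sha256 win31.img.xz "$sum" && echo 'matches win31.img.xz'
# xz -dk win31.img.xz    # -k keeps the .xz next to the unpacked win31.img
# check_sha256 win31.img "$sum" && echo 'matches win31.img'
```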

And, since I can't figure out where else in this report to put this:
wnt4ts-broken.img's password is "admin" (although something seems to
have happened to the disk and NT doesn't actually boot properly :( ),
and wnt4ts-intermittent.img's password is "1234". (These were set up as
test images. Now I'm _really_ glad I used simple passwords! :) )

---


I have two testcases: DOS 6.22 (+ Windows 3.1), and Windows NT 4.


# MS-DOS

DOS is the simplest. It basically consists of

$ qemu-system-i386 -drive file=win31.img,format=raw -M isapc -enable-kvm

And then literally just playing around. Things to try include creating
files (`echo blah > file`), repeatedly seeking across the entire FAT
(`dir /s > nul` or `dir /s`), and launching Windows (`win`).

win31.img is not special (as far as I can tell) and merely consists of
the result of installing DOS 6.22 and Windows 3.1 from WinWorldPC. I've
basically just included the image for convenience.

Generally no single "run" is immune to starting Win3.1 and then
launching File Manager; if that doesn't generate an error, something is
definitely up.

The second best trigger is creating new files. That very very frequently
produces "General Failure ...", but not always.


# WinNT 4

Windows NT 4 is a bit more complicated. Because this error presumably
only occurs at a single small point very early in boot, the window in
which the glitch can surface is much, much narrower, and triggering it
often requires a larger number of tries.

Anecdotally I've had QEMU hit the boot error at the first try/run, and
after as many as 63 "successful" boots.

I made a small test harness that automates the launch process. It
consists of two shell scripts and requires tmux (and netcat).
(*Potential epilepsy warning*: if you use a light-colored terminal
background, the terminal QEMU is repeatedly invoked from will
continuously flash rapidly from white to black.)

One of the scripts is run inside a tmux session in one terminal, while
the other script is run in its own terminal (without any tmux).


I named this one `run-qemu-loop`:

--8<--------------------------------------------------------

#!/bin/bash

# ---

qemu=/path/to/qemu-system-i386
#or, alternatively: (I used the following line myself so I
#could tab-complete my way to different qemu executables)
#qemu="$1"

disk=/path/to/wnt4ts-intermittent.img

# ---

port=4444

rm -f STOP itercount

itercount=0

while :; do
	
	[ -f STOP ] && break
	
	((itercount++))
	echo $itercount > itercount
	
	$qemu \
		-enable-kvm -vga cirrus -curses -M isapc \
		-drive file="$disk",format=raw \
		-chardev socket,id=mon0,host=localhost,port=$port,server,nowait \
		-mon chardev=mon0,mode=readline
	
	#point to an otherwise-unused terminal if you like (see also: `tty`)
	#echo "$itercount run(s)" > /dev/pts/__
	
done

------------------------------------------------------------

Not much logic above; this just repeatedly runs QEMU for as long as
the file `STOP` does not exist in the current directory.

The key "magic" bit is that QEMU is launched in -curses mode.

The other key bit is that the above script is run inside tmux.


Here's `tmux-ctl-loop`:

--8<--------------------------------------------------------

#!/bin/bash

port=4444

tmux=./tmux

printf -v l '%0.0s-' {0..25}
h1="$l/ buffer dump begin \\$l"
h2="$l-\\ buffer dump end /-$l"

while :; do
	
	while :; do
		echo | nc localhost $port -q0 -w1 > /dev/null && break
		echo 'Start qemu!'
	done
	
	buffer="$(tmux -S $tmux capture-pane; tmux -S $tmux save-buffer -)"
	
	echo "$h1"
	[[ "$buffer" ]] && echo "$buffer" || echo '( * Screen buffer is empty * )'
	echo "$h2"
	
	if echo "$buffer" | grep -q 'A disk read error occurred.'; then
		
		s="<Crashed after $(< itercount) runs.>"
		echo "$s"
		echo "$s" >> stats
		
		touch STOP
		
		#echo q | nc localhost $port -q0 > /dev/null
		
		exit
		
	elif echo "$buffer" | grep -q 'OS Loader V4.00'; then
			
		echo '<Booted successfully, trying again>'
		
		echo q | nc localhost $port -q0 > /dev/null
		
	else
		
		echo '<Waiting for boot>'
		
	fi
			
done

------------------------------------------------------------

Nothing particularly amazing going on here either.

While `run-qemu-loop` is running inside tmux in the first terminal, this
script runs in the 2nd one.

The small infinite loop at the top only breaks when it can successfully
ping QEMU and it knows it's running.

Then, a screendump of the contents of the terminal QEMU is in is fetched
from tmux, and the buffer's content is analyzed.

- If NTLDR fails, the script creates `STOP` to halt run-qemu-loop and then
  exits. (The line that would send `q` to QEMU through netcat is commented
  out in this branch, so QEMU stays up with the error on screen.)

- If NTLDR loads successfully, the script sends `q` to QEMU and continues
  looping. (run-qemu-loop will not find the `STOP` file, and so restarts qemu.)

The scripts run very quickly, with 2-3 iterations per second on my i3
box.
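
Since tmux-ctl-loop appends one `<Crashed after N runs.>` line to the
`stats` file per failure, several sessions can be summarized afterwards.
A minimal sketch, assuming that line format is left unchanged;
`summarize_stats` is a made-up helper name and not part of the scripts
above:

```shell
#!/bin/bash
# Sketch only; summarize_stats is a made-up helper name.

# summarize_stats FILE: print sample count, min, max and mean of the run
# counts found in "<Crashed after N runs.>" lines; other lines are ignored
summarize_stats() {
	sed -n 's/^<Crashed after \([0-9][0-9]*\) runs\.>$/\1/p' "$1" |
	awk '
		NR == 1 || $1 < min { min = $1 }
		NR == 1 || $1 > max { max = $1 }
		{ sum += $1 }
		END {
			if (NR)
				printf "samples=%d min=%d max=%d mean=%.1f\n", NR, min, max, sum / NR
		}
	'
}

# Example:
#   summarize_stats stats
```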


# Usage

Save the two scripts above to the same directory as wnt4ts-intermittent.img,
then:

- (If port 4444 doesn't work, the value needs to be changed in both
scripts.)

- In the first terminal, run `tmux -S <file>`, where <file> names the socket
  tmux will use. This needs to match "tmux=" at the top of `tmux-ctl-loop`.
  (with `tmux=./tmux`, the command would be `tmux -S tmux`)

- Still in the first terminal (and now also inside tmux), enter
  `./run-qemu-loop`, passing the path to qemu if you're using that approach
  (refer to the first few lines of the script). Don't hit enter yet.

- Now, in the 2nd terminal, type `./tmux-ctl-loop`

- Hit enter in both terminals.
 

Rationale for timing of Enter key:

- Running run-qemu-loop first will start QEMU, and if NTLDR starts
  successfully it will immediately begin counting down from 30. If NT actually
  starts to boot and is then hard-shut-down this /may/ affect the disk image.

- tmux-ctl-loop will annoyingly spam a continuous stream of 'Start qemu!' until
  run-qemu-loop is running.

- Starting both scripts at "more or less" the same time (no rush) works out
  well.
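
The 'Start qemu!' spam could be avoided by polling the monitor port with
a short pause between attempts. This is only a sketch of one way to do
it, not what tmux-ctl-loop does: `wait_for_port` is a made-up helper, it
relies on bash's `/dev/tcp` redirection instead of netcat, and the
default retry count and delay are arbitrary.

```shell
#!/bin/bash
# Sketch only: wait for the QEMU monitor port without flooding the terminal.
# wait_for_port is a hypothetical helper, not part of the original scripts.

# wait_for_port PORT [TRIES] [DELAY]: succeed once localhost:PORT accepts a
# TCP connection; give up after TRIES attempts, sleeping DELAY seconds
# after each failed attempt
wait_for_port() {
	local port=$1 tries=${2:-50} delay=${3:-0.1} i=0
	while [ "$i" -lt "$tries" ]; do
		# bash-only: opening /dev/tcp/HOST/PORT attempts a TCP connect;
		# the subshell closes the descriptor again when it exits
		if (exec 3<>"/dev/tcp/localhost/$port") 2>/dev/null; then
			return 0
		fi
		sleep "$delay"
		i=$((i + 1))
	done
	return 1
}

# Example, in place of the inner `while :; do ... done` loop:
#   wait_for_port "$port" || { echo 'QEMU is not running'; exit 1; }
```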


Hopefully potential script modifications are obvious; for example

- changing tmux-ctl-loop to not send 'q' to qemu so you can connect to the HMP
  yourself
  (NB, if `STOP` is not created, when qemu finally exits it will of course
  promptly be relaunched)

- pointing run-qemu-loop to a modified qemu binary


== #2: QEMU-vs-VirtualBox image issue ======================

I was initially completely stumped by this issue, perhaps unsurprisingly
so. :)

wnt4ts-broken.img is a perfectly ordinary NT 4 installation that was
created in QEMU 2.9.0. I created a 1GB disk with `truncate`, picked NTFS
and installed everything (which took a while).

NT setup reboots a number of times during the install process, and IIRC
those all went just fine. However, at some point, the image began to
consistently bomb out with "A disk read error occurred. ...", and
stubbornly refused to boot, regardless of the number of boot attempts I
tried.

QEMU 2.0.0, 1.5.0, and 1.0 (the earliest version I was able to build on
my system) all consistently hit "disk read error occurred".

I tried compiling QEMU 1.0 using clang so I could build for 32-bit on my
64-bit system (GCC 7 died with "Frame pointer required, but reserved").
The resulting qemu completely crashed if I didn't enable KVM (ie, TCG
was (understandably) broken); with KVM enabled qemu didn't crash, but
NTLDR halted with the same error as on 64-bit qemu. (TL;DR, no
difference whatsoever.)

My initial reaction at this point was to try the image on another
virtualization platform. My first pick was VirtualBox.

So, I followed the official instructions for pointing VirtualBox to
physical disk images, except I substituted a /dev/loopN device I'd
pointed to the image file via losetup.

And... VirtualBox picked the image up fine and Just Worked(TM). Yay! -
but not yay. What gives?!

Confused, I then tried to convert the disk image to VHD format.
Unfortunately, for some reason, if I try `qemu-img convert ... -O vhdx
...`, VirtualBox chokes on the result:

-----

VD: error VERR_NOT_SUPPORTED opening image file
'/.../wnt4ts-broken-qemuconv.vhd' (VERR_NOT_SUPPORTED).

Result Code: NS_ERROR_FAILURE (0x80004005)
Component: MediumWrap
Interface: IMedium {4afe423b-43e0-e9d0-82e8-ceb307940dda}
Callee: IVirtualBox {0169423f-46b4-cde9-91af-1e9d5b6cd945}
Callee RC: VBOX_E_OBJECT_NOT_FOUND (0x80BB0001)

-----

Welp.

Well, a bit more digging later, and I found I could do

$ VBoxManage convertfromraw wnt4ts-broken.img wnt4ts-broken.vhd

but... as soon as I pointed VirtualBox to this, it too began to choke
with "A disk read error occurred".

And yet, the VMDK->raw image setup worked just fine.

I found I could even replace the loop device with the path of the .img
file itself and that worked just fine too.

At my wits' end, I followed some online instructions to learn about
manual CHS configuration so I could try and get the image working in
Bochs. "A disk read error occurred". I wasn't surprised.

It was at this point I began to give up, but I decided to try One Last
Thing(TM) before properly throwing in the towel.

:)

I decided to learn a bit more about how `VBoxManage internalcommands
createrawvmdk` worked, and try one thing in particular: I can edit the
.vmdk file, but can I point `createrawvmdk` at the .img file directly
too?

Turns out, yes you can.

It also turns out that this promptly caused VirtualBox to bomb out.

Interesting.

For reference, here's the VMDK file I initially created (by pointing
`createrawvmdk` at /dev/loopN) and then later edited to point straight
to the .img file, with both approaches resulting in successful boot.

--8<--------------------------------------------------------

# Disk DescriptorFile
version=1
CID=e35b9a45
parentCID=ffffffff
createType="fullDevice"

# Extent description
RW 1536000 FLAT "/absolute/full/path/to/wnt4ts-broken.img" 0

# The disk Data Base 
#DDB

ddb.virtualHWVersion = "4"
ddb.adapterType="ide"
ddb.geometry.cylinders="1523"
ddb.geometry.heads="16"
ddb.geometry.sectors="63"
ddb.uuid.image="871a6044-c8ca-48ed-b7aa-e6fc49da3db4"
ddb.uuid.parent="00000000-0000-0000-0000-000000000000"
ddb.uuid.modification="3661715c-3906-4e4a-ab65-486d140e03b8"
ddb.uuid.parentmodification="00000000-0000-0000-0000-000000000000"
ddb.geometry.biosCylinders="761"
ddb.geometry.biosHeads="32"
ddb.geometry.biosSectors="63"

------------------------------------------------------------


Here's the _diff_ of what happens if I point `createrawvmdk` at wnt4ts-broken.img directly:

--8<--------------------------------------------------------

ddb.geometry.cylinders="2080"
ddb.geometry.heads="16"
ddb.geometry.sectors="63"

------------------------------------------------------------

:D

Naturally,

$ qemu-system-i386 \
    -drive file=wnt4ts-broken.img,format=raw,cyls=1523,heads=16,secs=63 \
    -M isapc -sdl

will boot happily on 2.9.0 (notwithstanding the occasional "disk read
error occurred" documented above).

It will also boot in 1.6.0.

(POTENTIAL BUG HEADSUP: 1.0 and 1.5.0 both lock up with a blank 640x480
window and use 0% CPU if I specify `-M isapc`.)

And, of course, using these CHS values in Bochs also results in a
successful boot (after setting the CPU type to pentium).

Unfortunately, I have no idea what sequence of events caused the
creation of the VMDK file above. No invocation of `createrawvmdk` is
producing a VMDK file with the CHS settings above.
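
For what it's worth, the cylinder counts in the two descriptors are
consistent with dividing a sector count by heads x sectors-per-track and
rounding down. A small sketch of that arithmetic follows. The helper
name `chs_cylinders` is made up for illustration, and the 2097152-sector
figure assumes the "1G" image is exactly 2^30 bytes, which the report
does not state.

```shell
#!/bin/bash
# Sketch only; chs_cylinders is a made-up helper name.

# chs_cylinders TOTAL_SECTORS HEADS SECTORS_PER_TRACK
# Prints the number of whole cylinders that fit (integer division).
chs_cylinders() {
	echo $(( $1 / ($2 * $3) ))
}

chs_cylinders 1536000 16 63   # extent size in the working VMDK; prints 1523
chs_cylinders 1536000 32 63   # same extent, BIOS head count; prints 761
chs_cylinders 2097152 16 63   # assumed 1 GiB image; prints 2080
```

If that reading is right, the working descriptor describes a
1536000-sector (750 MiB) extent rather than the whole image file, which
may be a clue as to where the 1523/16/63 geometry came from. That is a
guess; nothing in the report confirms it.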

I've only just begun to learn about the intricacies of CHS. Am I to
understand that these values are stored amongst the first 512 bytes of
the disk? If this is the case, then I wonder what changed the data, and
why. I was initially only using QEMU 2.9.0, and didn't move the image to
different VMs or QEMU versions. Perhaps Windows NT got confused about
the disk CHS and rewrote it?
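
On where geometry-related values live on disk: in the classic layouts,
the MBR does not store a disk-wide geometry. Each partition table entry
stores CHS start and end addresses, and FAT/NTFS boot sectors carry
"heads" and "sectors per track" fields in their BPB. The sketch below
reads those fields from an image so they can be compared with the values
above. It assumes a classic MBR and 512-byte sectors; all helper names
are made up for illustration.

```shell
#!/bin/bash
# Sketch only; the helper names are made up for illustration. Assumes a
# classic MBR (partition table at byte 446, 16 bytes per entry) and
# 512-byte sectors.

# read_u8 FILE OFFSET: print the byte at OFFSET as a decimal number
read_u8() {
	od -An -tu1 -j "$2" -N1 "$1" | tr -d ' \n'
}

# part_chs_start FILE ENTRY: print "cylinder head sector" of the first
# sector of MBR partition ENTRY (0-3)
part_chs_start() {
	local base=$((446 + 16 * $2)) h s c
	h=$(read_u8 "$1" $((base + 1)))
	s=$(read_u8 "$1" $((base + 2)))
	c=$(read_u8 "$1" $((base + 3)))
	# the top two bits of the sector byte are cylinder bits 8 and 9
	echo "$(( c | ((s & 0xC0) << 2) )) $h $(( s & 0x3F ))"
}

# part_lba_start FILE ENTRY: print the partition's first sector (LBA),
# stored little-endian in bytes 8-11 of the entry
part_lba_start() {
	local base=$((446 + 16 * $2 + 8)) i v=0
	for i in 3 2 1 0; do
		v=$(( (v << 8) | $(read_u8 "$1" $((base + i))) ))
	done
	echo "$v"
}

# bpb_geometry FILE LBA: print "heads sectors_per_track" from the FAT/NTFS
# boot sector at LBA (16-bit little-endian fields at 0x1A and 0x18)
bpb_geometry() {
	local off=$(($2 * 512)) spt heads
	spt=$(( $(read_u8 "$1" $((off + 0x18))) | ($(read_u8 "$1" $((off + 0x19))) << 8) ))
	heads=$(( $(read_u8 "$1" $((off + 0x1A))) | ($(read_u8 "$1" $((off + 0x1B))) << 8) ))
	echo "$heads $spt"
}

# Example:
#   img=wnt4ts-broken.img
#   part_chs_start "$img" 0
#   bpb_geometry "$img" "$(part_lba_start "$img" 0)"
```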


== Sporadic BIOS-level boot failure ========================

I have multiple screenshots of SeaBIOS in QEMU 2.9.0 halting with "No
bootable device" (et al), even with the above manually-applied CHS
settings.

Commit e689f7c also presents such errors.

Commit 306ec6c does not suffer from intermittent breakage of any kind:

- No SeaBIOS flake-outs
- No "Non-system disk or disk error"
- No "A disk read error occurred"
- No "General failure ..."

While most of my confidence in commit 306ec6c is based on anecdotal
evidence, I modified `tmux-ctl-loop` a little to soak-test BIOS-level
I/O stability and left this modified version running for a few minutes.

--8<--------------------------------------------------------

#!/bin/bash

port=4444

tmux=./tmux

printf -v l '%0.0s-' {0..25}
h1="$l/ buffer dump begin \\$l"
h2="$l-\\ buffer dump end /-$l"

while :; do
	
	while :; do
		echo | nc localhost $port -q0 -w1 > /dev/null && break
		echo 'Start qemu!'
	done
	
	buffer="$(tmux -S $tmux capture-pane; tmux -S $tmux save-buffer -)"
	
	echo "$h1"
	[[ "$buffer" ]] && echo "$buffer" || echo '( * Screen buffer is empty * )'
	echo "$h2"
	
	if echo "$buffer" | grep -q 'Non-system disk' || echo "$buffer" | \
		grep -q 'No bootable device'
	then
		
		s="<Hit error after $(< itercount) runs.>"
		echo "$s"
		echo "$s" >> stats
		
		touch STOP
		
		#echo q | nc localhost $port -q0 > /dev/null
		
		exit
		
	elif echo "$buffer" | grep -q 'OS Loader V4.00' || echo "$buffer" | \
		grep -q 'A disk read error'
	then
	
		echo '<Boot did not hang at BIOS, trying again>'
		
		echo q | nc localhost $port -q0 > /dev/null
		
	else
		
		echo '<Waiting for boot>'
		
	fi
			
done

------------------------------------------------------------

For the above to work, the top of run-qemu-loop must also be modified to
read something along the lines of

disk=/path/to/wnt4ts-broken.img,cyls=1523,heads=16,secs=63

(Suggestion: modify copies of both scripts)

One small terminal-flicker-headache (and a 57°C CPU) later, I was able
to carefully observe just over 350 successful runs in which QEMU commit
306ec6c only ever produced a boot menu. No other hitches.

** Important: **

However, commit 306ec6c will fail to boot, ever, if the CHS geometry
(cylinders, heads and sectors) is not set to the values VirtualBox
"discovered". (Of note is the fact that QEMU (2.9.0) was what initially
created this image. I must admit that I don't remember what sequence of
QEMU versions I fed the image to - and I maybe, possibly, didn't think
to back the file up (sorry), so maybe something mangled something
somewhere. But VirtualBox figured it out nonetheless!)

Furthermore, feeding /dev/loopN to any QEMU version will NOT result in
correct CHS discovery (and successful boot).

This is what leads me to conclude that I've discovered two separate
issues.


== Appendix: How to build both commits =====================

It's very simple.

First, `git clone https://github.com/qemu/qemu` somewhere if you don't
already have a local copy. If you have an old git checkout that's from
2014 or later, you can use that old checkout instead. (If you want to
test an old checkout you have, the commands below will either work
perfectly or completely bomb out with no side effects.)

A full checkout is a ~183MB download. Sorry.

Next, create two new directories somewhere. Name them what you like, eg
`qemu-working` and `qemu-broken`.

Now, cd into the checkout directory, and run:

$ git archive 306ec6c3cece7004429c79c1ac93d49919f1f1cc | tar xC /path/to/qemu-working/

$ git archive e689f7c668cbd9d08f330e17c3dd3a059c9553d3 | tar xC /path/to/qemu-broken/

The paths can be relative.

Now, run this in both of the new directories:

$ ./configure --python=python2.7 --disable-libssh2 --disable-seccomp \
    --disable-usb-redir --disable-guest-agent --disable-libiscsi \
    --disable-spice --disable-smartcard-nss --disable-vhost-net \
    --disable-docs --disable-attr --disable-cap-ng --disable-vde \
    --disable-user --disable-bluez --disable-vnc-ws --disable-xen \
    --disable-brlapi --enable-debug --target-list=i386-softmmu \
    --disable-fdt

$ make -j64

You can open two terminals and configure and build both simultaneously
if you like.

On my decent but very basic (2-core+HT) i3 box, -j64 actually works
out - make doesn't actually launch too many gcc processes. You *will*
see your system load spike to ~20 though :)
(NB. Do. not. use. -j64. with. the. linux. kernel.)

On my system, a single build with -j64 takes only about 35 seconds. C
FTW. (Although this has increased to 1min20sec for more recent builds.)

Most of the configure arguments remove functionality I'll never use (in
this situation) and which will only slow down the build.

Once QEMU is built, run qemu-system-i386 directly from where it has been
built.

$ /path/to/qemu-working/i386-softmmu/qemu-system-i386 ...
$ /path/to/qemu-broken/i386-softmmu/qemu-system-i386 ...

Again, the paths can be relative.

** Affects: qemu
     Importance: Undecided
         Status: New


** Tags: disk io qemu

-- 
You received this bug notification because you are a member of
qemu-devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1745312

Bug description:
  [Headsup: This report is long-ish due to the amount of detail I've
  stumbled on along the way that I think is relevant to include. I can't
  speak as to the complexity of the actual bugs, but the size of this
  report should not suggest that the reproduction process is
  particularly headache-inducing.]

  Hi!

  I recently needed to fire up some ancient software for research
  purposes and got very distracted discovering and playing with old
  versions of Windows :). In the process I've discovered some glitches
  with disk I/O.

  I believe I've stumbled on two completely separate issues that
  coincidentally surfaced at the same time. It's possible that
  components of this report will be re-filed as more specific new bugs,
  but I'm not an authority on QEMU internals or how to narrow
  down/categorize what I've found.

  - The first bug only surfaces when the "isapc" machine type is used.
  It intermittently produces "General failure {read,writ}ing drive _"
  under MS-DOS 6.22, and also somehow interferes with early bootstrap of
  Windows NT 4 (in NTLDR). Enabling or disabling KVM (I'm on Linux)
  appears to make no difference whatsoever, which may help with
  debugging.

  - The second issue involves
    - a WinNT4 disk image
    - created by running through a bog-standard NT4 install inside QEMU 2.9.0
    - which will now fail to boot in any version of QEMU - even version 1.0
      - but which VirtualBox will boot fine
        - but only if I point VirtualBox at QEMU's raw disk image via a
          hacked-together VMDK file
        - if the raw image is converted to VHD(X), VirtualBox will also fail
          to boot the image with exactly the same error as QEMU
        - this state of affairs is not affected by image sparseness (which makes
          sense)

  I'm confident I've bisected the first issue.

  I wasn't able to bisect the second issue (as all tested versions of
  QEMU behaved identically), but I've figured out a working repro
  testcase and I believe I've managed to pin down a solid root cause.


  == #1: Intermittent I/O issues when `-M isapc` is used =====

  These symptoms sometimes take a small amount of time and fiddling to
  trigger, but I AM able to consistently surface them on my machine
  after a short while. (I am very very interested to hear if others
  cannot reproduce them.)

  So, first of all:

  https://github.com/qemu/qemu/commit/306ec6c3cece7004429c79c1ac93d49919f1f1cc
    (Jul 30 2013): the last version that works

  https://github.com/qemu/qemu/commit/e689f7c668cbd9d08f330e17c3dd3a059c9553d3
    (Oct 30 2013): the first version that intermittently fails

  Maybe lift out and build these branches while reading. *shrug*
  (How to do this can be found at the end of this report - along with a time-saving ./configure line, FWIW)

  Here are the changelists between these two revisions:

  https://github.com/qemu/qemu/compare/306ec6c...e689f7c
  (Compare direction: OLD to NEW) (Commits: 166  Files changed: 192)

  https://github.com/qemu/qemu/compare/e689f7c...306ec6c
  (Compare direction: NEW to OLD) (Commits: 30   Files changed: 22)

  (Someone else more familiar with Git might know why GitHub returns
  results for both compare directions, and/or if the 2nd link is useful
  information. The first link returns a lot more results than the 2nd
  one, at least. Does comparing new>old return deletions?)

  ---

  Now on to the symptoms. In a moment I'll describe reproduction.

  # MS-DOS 6.22

  The first symptom I discovered was that trivial read and write
  operations under MS-DOS would sometimes fail:

    C:\>echo test > hi

    General failure writing drive C
    Abort, Retry, Fail?

  Anything else that exercises the disk behaves similarly:

    C:\>dir /s > nul

    General failure reading drive C
    Abort, Retry, Fail?

  (Note that the above demonstrates both write and read failures)

  (Also, FWIW, `dir /s` == `ls -R`)

  The behavior of the I/O errors is not possible to characterise as it
  fluctuates so much. For example something as simple as DIR can produce
  wildly differing results: in one run, poking around with DIR ended
  with DOS deciding C:\ was empty at one point; at another point in a
  different run C:\ mysteriously dropped 50% of its contents only to
  magically gain it all back moments later after some poking around in
  one of the subdirectories that was still visible.

  The time it takes to trigger these errors is also highly variable.
  QEMU may fall over as early as hanging forever at "Starting MS-
  DOS...", or I might get all the way into Windows 3.1 before it
  triggers (in which case Win3.1 reports vague memory errors - of all
  things).

  Very occasionally I've seen _SeaBIOS itself_ report "Booting from Hard
  Disk..." "Boot failed: could not read the boot disk" ... "No bootable
  device.", and on one occasion I even got "Non-System disk or disk
  error" "Replace and strike any key when ready"!

  
  # WinNT 4 Terminal Server

  Most of the time, NTLDR will fire up normally. But every so often...

    SeaBIOS (version rel-1.7.3-117-g31b8b4e-
  20131206_080705-nilsson.home.kraxel.org)

    Booting from Hard Disk...
    A disk read error occurred.
    Insert a system diskette and restart
    the system.

  (NB. You're seeing the old SeaBIOS version included with e689f7c,
  which was the first buggy commit.)

  If NT gets past this point without erroring out (ie, it makes it to
  the boot menu), the rest of the system is 100% fine and there are no
  other disk I/O issues whatsoever. For example, on QEMU 2.9.0 I was
  able to enable disk compression, answer "Yes" to "Compress entire disk
  now?" and have the process fully complete. No hitches.

  This makes me vaguely recall/wonder that perhaps this could be somehow
  related to LBA and/or Int 13h, or something floating around near that
  bunch of functionality. (I'm woefully ignorant about such low-level
  details.) Perhaps DOS/Win3.1 are stuck using a disk mode that QEMU has
  a buggy implementation of, while NT 4 (once NTOSKRNL is up and
  running) is able to use a different disk mode or access mechanism.

  I'm really interested to get some understanding of what the root issue
  is here, when this is fixed. (I wonder if it's a timing thing?)

  I've observed some unusual behavior with repeated restarts. In one
  case, I attempted to start NT4 multiple times, and QEMU consistently
  failed with "No bootable device" each time. So, I removed `-M isapc`,
  promptly got a boot menu, hit ^C, readded `-M isapc` - and continued
  to get a boot menu. Yep. I'll accept "really really big coincidence"
  but I do very much wonder if something else is going on here. I've
  observed many similar incidents. It makes me wonder whether the
  contents of memory or some other system state is an influence. Very
  probably not, but still...


  -- Reproduction --------------------------------------------

  First of all, there was unfortunately no way for me to avoid having to
  post entire disk images, but I've managed to compress everything down
  to 174MB total download size.

  FWIW, WinWorld and many other sites seem to have no operational issues
  providing clear pointers to CD keys; I consider my distribution of my
  installed HDD images an extension of the apparent status quo.

  That being said, I've put everything on Google Drive so nobody has to
  headscratch about Launchpad/Canonical/etc's stance on hosting this
  data.

  So, this folder contains the disk images:
  https://drive.google.com/drive/folders/1WdZVBh5Trs9HLC186-nHyKeqaSxfyM2c
  ("Download all" at the top-right will create a ZIP file, but FWIW
  downloading the individual files simultaneously would implement a
  rough form of download acceleration)

  File meta info:

  Compressed
  |
  |      Apparent
  |      |    Actual
  |      |    |
  38M -> 200M (103M)  win31.img.xz
  82M -> 1G   (289M)  wnt4ts-broken.img.xz
  55M -> 350M (146M)  wnt4ts-intermittent.img.xz

  SHA-256s:

  win31:        8179b8180a2ab40bd472e8a2f3fb89fc331651e56923f94ceb9e52a78ee220d2
  broken:       a2af5f0bc49a063b75f534b6ffe5b82e32ecc706a64a425b6626feccf6e3fdfa  
  intermittent: 77ae8c458829ebcdd64c71042012f45d5a2788e6ebd22db9d53de9ef1a574784

  (Wanted to keep the checksum lines within 80 columns)
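  The checksums above can be checked mechanically. A minimal sketch,
  assuming GNU coreutils `sha256sum` is available; `verify_sha256` is a
  hypothetical helper name. The list above does not say whether the sums
  cover the .xz downloads or the decompressed .img files, so it may be
  necessary to try both:

```shell
#!/bin/bash
# verify_sha256 FILE EXPECTED_HEX
# Returns 0 if FILE's SHA-256 equals EXPECTED_HEX, 1 otherwise.
verify_sha256() {
	local file=$1 expected=$2 actual
	actual=$(sha256sum "$file" | cut -d' ' -f1)
	if [[ -n "$actual" && "$actual" == "$expected" ]]; then
		echo "OK: $file"
	else
		echo "MISMATCH: $file (got: ${actual:-nothing})"
		return 1
	fi
}

# Example, with the hash copied from the list above:
# verify_sha256 win31.img.xz \
#   8179b8180a2ab40bd472e8a2f3fb89fc331651e56923f94ceb9e52a78ee220d2
```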

  And, since I can't figure out where else in this report to put this,
  wnt4ts-broken.img's password is "admin" but something seems to have
  happened to the disk and NT doesn't actually boot properly :(, and
  wnt4ts-intermittent.img's password is "1234". (These were set up as
  test images. Now I'm _really_ glad I used simple passwords! :) )

  ---

  
  I have two testcases: DOS 6.22 (+ Windows 3.1), and Windows NT 4.

  
  # MS-DOS

  DOS is the simplest. It basically consists of

  $ qemu-system-i386 -drive file=win31.img,format=raw -M isapc -enable-kvm

  And then literally just playing around. Things to try include creating
  files (`echo blah > file`), repeatedly seeking across the entire FAT
  (`dir /s > nul` or `dir /s`), and launching Windows (`win`).

  win31.img is not special (as far as I can tell) and merely consists of
  the result of installing DOS 6.22 and Windows 3.1 from WinWorldPC.
  I've basically just included the image for convenience.

  Generally no single "run" is immune to starting Win3.1 and then
  launching File Manager; if that doesn't generate an error, something
  is definitely up.

  The second best trigger is creating new files. That very very
  frequently produces "General Failure ...", but not always.

  
  # WinNT 4

  Windows NT 4 is a bit more complicated. Because this error only occurs
  at presumably a single small point very early in boot, the window of
  opportunity for the glitch to surface within is much much narrower and
  thus often requires a larger number of tries.

  Anecdotally I've had QEMU hit the boot error at the first try/run, and
  after as many as 63 "successful" boots.

  I made a small test harness that automates the launch process. It
  consists of two shell scripts and requires tmux (and netcat).
  (*Potential epilepsy warning*: if you use a light-colored terminal
  background, the terminal QEMU is repeatedly invoked from will
  continuously flash rapidly from white to black.)

  One of the scripts is run inside a tmux session in one terminal, while
  the other script is run in its own terminal (without any tmux).

  
  I named this one `run-qemu-loop`:

  --8<--------------------------------------------------------

  #!/bin/bash

  # ---

  qemu=/path/to/qemu-system-i386
  #or, alternatively: (I used the following line myself so I
  #could tab-complete my way to different qemu executables)
  #qemu="$1"

  disk=/path/to/wnt4ts-intermittent.img

  # ---

  port=4444

  rm -f STOP itercount

  itercount=0

  while :; do
  	
  	[ -f STOP ] && break
  	
  	((itercount++))
  	echo $itercount > itercount
  	
  	$qemu \
  		-enable-kvm -vga cirrus -curses -M isapc \
  		-drive file="$disk",format=raw \
  		-chardev socket,id=mon0,host=localhost,port=$port,server,nowait \
  		-mon chardev=mon0,mode=readline
  	
  	#point to an otherwise-unused terminal if you like (see also: `tty`)
  	#echo "$itercount run(s)" > /dev/pts/__
  	
  done

  ------------------------------------------------------------

  Not much logic above; this just repeatedly runs QEMU for as long as
  the file `STOP` does not exist in the current directory.

  The key "magic" bit is that QEMU is launched in -curses mode.

  The other key bit is that the above script is run inside tmux.

  
  Here's `tmux-ctl-loop`:

  --8<--------------------------------------------------------

  #!/bin/bash

  port=4444

  tmux=./tmux

  printf -v l '%0.0s-' {0..25}
  h1="$l/ buffer dump begin \\$l"
  h2="$l-\\ buffer dump end /-$l"

  while :; do
  	
  	while :; do
  		echo | nc localhost $port -q0 -w1 > /dev/null && break
  		echo 'Start qemu!'
  	done
  	
  	buffer="$(tmux -S $tmux capture-pane; tmux -S $tmux save-buffer -)"
  	
  	echo "$h1"
  	[[ "$buffer" ]] && echo "$buffer" || echo '( * Screen buffer is empty * )'
  	echo "$h2"
  	
  	if echo "$buffer" | grep -q 'A disk read error occurred.'; then
  		
  		s="<Crashed after $(< itercount) runs.>"
  		echo "$s"
  		echo "$s" >> stats
  		
  		touch STOP
  		
  		#echo q | nc localhost $port -q0 > /dev/null
  		
  		exit
  		
  	elif echo "$buffer" | grep -q 'OS Loader V4.00'; then
  			
  		echo '<Booted successfully, trying again>'
  		
  		echo q | nc localhost $port -q0 > /dev/null
  		
  	else
  		
  		echo '<Waiting for boot>'
  		
  	fi
  			
  done

  ------------------------------------------------------------

  Nothing particularly amazing going on here either.

  While `run-qemu-loop` is running inside tmux in the first terminal,
  this is running in the 2nd one.

  The small infinite loop at the top only breaks when it can
  successfully ping QEMU and it knows it's running.

  Then, a screendump of the contents of the terminal QEMU is in is
  fetched from tmux, and the buffer's content is analyzed.

  - If NTLDR fails, the script creates `STOP` to halt run-qemu-loop and
    exits. (The line that would send `q` to QEMU is commented out, so QEMU
    stays up with the error on screen until you quit it yourself.)

  - If NTLDR loads successfully, the script sends `q` to QEMU and continues
    looping. (run-qemu-loop will not find the `STOP` file, and so restart qemu.)
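  The decision logic described above can also be pulled out into a
  function, which makes it easy to try against saved screen dumps. This is
  only a sketch: `classify_screen` is a made-up name, and it uses
  fixed-string matching (`grep -F`) where the script uses plain `grep`.

```shell
#!/bin/bash
# classify_screen TEXT
# Prints "fail", "ok" or "wait", using the same two markers that
# tmux-ctl-loop greps for in the captured screen contents.
classify_screen() {
	local buffer=$1
	if grep -qF 'A disk read error occurred.' <<< "$buffer"; then
		echo fail
	elif grep -qF 'OS Loader V4.00' <<< "$buffer"; then
		echo ok
	else
		echo wait
	fi
}
```

  On tmux versions that support `capture-pane -p` (print to stdout), it
  could be fed with `classify_screen "$(tmux -S ./tmux capture-pane -p)"`.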

  The scripts run very quickly, with 2-3 iterations per second on my i3
  box.
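  Because tmux-ctl-loop appends one line per failure to `stats`, the spread
  of "runs until failure" can be summarised after a few sessions. A small
  sketch, assuming the `<... after N runs.>` line format written by the
  script; `summarize_stats` is a hypothetical name:

```shell
#!/bin/bash
# summarize_stats [FILE]
# Reads lines such as "<Crashed after 63 runs.>" and prints the number of
# samples plus the minimum, maximum and mean run count.
summarize_stats() {
	local file=${1:-stats}
	sed -n 's/^<.* after \([0-9][0-9]*\) runs\.>$/\1/p' "$file" |
	awk '
		{ v = $1 + 0; sum += v }
		NR == 1 || v < min { min = v }
		NR == 1 || v > max { max = v }
		END {
			if (NR) printf "n=%d min=%d max=%d mean=%.1f\n", NR, min, max, sum / NR
			else print "n=0"
		}
	'
}
```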


  # Usage

  Save the two scripts above to the same directory as wnt4ts-intermittent.img,
  then:

  - (If port 4444 doesn't work, the value needs to be changed in both
  scripts.)

  - In the first terminal, run `tmux -S <file>`, where <file> names the socket
    tmux will use. This needs to match "tmux=" at the top of `tmux-ctl-loop`.
    (with `tmux=./tmux`, the command would be `tmux -S tmux`)

  - Still in the first terminal (and now also inside tmux), type
    `./run-qemu-loop`, passing the path to qemu if you're using that approach
    (refer to the first few lines of the script). Don't hit enter yet.

  - Now, in the 2nd terminal, type `./tmux-ctl-loop`

  - Hit enter in both terminals.
   

  Rationale for timing of Enter key:

  - Running run-qemu-loop first will start QEMU, and if NTLDR starts
    successfully its boot menu will immediately begin counting down from 30. If
    NT actually starts to boot and is then hard-shut-down this /may/ affect the
    disk image.

  - tmux-ctl-loop will annoyingly spam a continuous stream of 'Start qemu!' until
    run-qemu-loop is running.

  - Starting both scripts at "more or less" the same time (no rush) works out
    well.

  
  Hopefully potential script modifications are obvious; for example

  - changing tmux-ctl-loop to not send 'q' to qemu so you can connect to the HMP
    yourself
    (NB, if `STOP` is not created, when qemu finally exits it will of course
    promptly be relaunched)

  - pointing run-qemu-loop to a modified qemu binary


  == #2: QEMU-vs-VirtualBox image issue ======================

  I was initially completely stumped by this issue, perhaps
  unsurprisingly so. :)

  wnt4ts-broken.img is a perfectly ordinary NT 4 installation that was
  created in QEMU 2.9.0. I created a 1GB disk with `truncate`, picked
  NTFS and installed everything (which took a while).

  NT setup reboots a number of times during the install process, and IIRC
  those all went just fine. However, at some point, the image began to
  consistently bomb out with "A disk read error occurred. ...", and
  stubbornly refused to boot, regardless of the number of boot attempts
  I tried.

  QEMU 2.0.0, 1.5.0, and 1.0 (the earliest version I was able to build
  on my system) all consistently hit "disk read error occurred".

  I tried compiling QEMU 1.0 using clang so I could build for 32-bit on
  my 64-bit system (GCC 7 died with "Frame pointer required, but
  reserved"). The resulting qemu completely crashed if I didn't enable
  KVM (ie, TCG was (understandably) broken); with KVM enabled qemu
  didn't crash, but NTLDR halted with the same error as on 64-bit qemu.
  (TL;DR, no difference whatsoever.)

  My initial reaction at this point was to try the image on another
  virtualization platform. My first pick was VirtualBox.

  So, I followed the official instructions for pointing VirtualBox to
  physical disk images, except I substituted a /dev/loopN device I'd
  pointed to the image file via losetup.

  And... VirtualBox picked the image up fine and Just Worked(TM). Yay! -
  but not yay. What gives?!

  Confused, I then tried to convert the disk image to VHD format.
  Unfortunately, for some reason, if I try `qemu-img convert ... -O
  vhdx ...`, VirtualBox chokes on the result:

  -----

  VD: error VERR_NOT_SUPPORTED opening image file
  '/.../wnt4ts-broken-qemuconv.vhd' (VERR_NOT_SUPPORTED).

  Result Code: NS_ERROR_FAILURE (0x80004005)
  Component: MediumWrap
  Interface: IMedium {4afe423b-43e0-e9d0-82e8-ceb307940dda}
  Callee: IVirtualBox {0169423f-46b4-cde9-91af-1e9d5b6cd945}
  Callee RC: VBOX_E_OBJECT_NOT_FOUND (0x80BB0001)

  -----

  Welp.

  Well, a bit more digging later, and I found I could do

  $ VBoxManage convertfromraw wnt4ts-broken.img wnt4ts-broken.vhd

  but... as soon as I pointed VirtualBox to this, it too began to choke
  with "A disk read error occurred".

  And yet, the VMDK->raw image setup worked just fine.

  I found I could even replace the loop device with the path of the .img
  file itself and that worked just fine too.

  At my wits' end, I followed some online instructions to learn about
  manual CHS configuration so I could try and get the image working in
  Bochs. "A disk read error occurred". I wasn't surprised.

  It was at this point I began to give up, but I decided to try One Last
  Thing(TM) before properly throwing in the towel.

  :)

  I decided to learn a bit more about how `VBoxManage internalcommands
  createrawvmdk` worked, and try one thing in particular: I can edit the
  .vmdk file, but can I point `createrawvmdk` at the .img file directly
  too?

  Turns out, yes you can.

  It also turns out that this promptly caused VirtualBox to bomb out.

  Interesting.

  For reference, here's the VMDK file I initially created (by pointing
  `createrawvmdk` at /dev/loopN) and then later edited to point straight
  to the .img file, with both approaches resulting in successful boot.

  --8<--------------------------------------------------------

  # Disk DescriptorFile
  version=1
  CID=e35b9a45
  parentCID=ffffffff
  createType="fullDevice"

  # Extent description
  RW 1536000 FLAT "/absolute/full/path/to/wnt4ts-broken.img" 0

  # The disk Data Base 
  #DDB

  ddb.virtualHWVersion = "4"
  ddb.adapterType="ide"
  ddb.geometry.cylinders="1523"
  ddb.geometry.heads="16"
  ddb.geometry.sectors="63"
  ddb.uuid.image="871a6044-c8ca-48ed-b7aa-e6fc49da3db4"
  ddb.uuid.parent="00000000-0000-0000-0000-000000000000"
  ddb.uuid.modification="3661715c-3906-4e4a-ab65-486d140e03b8"
  ddb.uuid.parentmodification="00000000-0000-0000-0000-000000000000"
  ddb.geometry.biosCylinders="761"
  ddb.geometry.biosHeads="32"
  ddb.geometry.biosSectors="63"

  ------------------------------------------------------------

  
  Here's the _diff_ of what happens if I point `createrawvmdk` at wnt4ts-broken.img directly:

  --8<--------------------------------------------------------

  ddb.geometry.cylinders="2080"
  ddb.geometry.heads="16"
  ddb.geometry.sectors="63"

  ------------------------------------------------------------

  :D

  Naturally,

  $ qemu-system-i386 -drive file=wnt4ts-broken.img,format=raw,cyls=1523,heads=16,secs=63 -M isapc -sdl

  will boot happily on 2.9.0 (notwithstanding the occasional "disk read
  error occurred" documented above).

  It will also boot in 1.6.0.

  (POTENTIAL BUG HEADSUP: 1.0 and 1.5.0 both lock up with a blank
  640x480 window and use 0% CPU if I specify `-M isapc`.)

  And, of course, using these CHS values in Bochs also results in
  successful boot as well (after setting the CPU type to pentium).

  Unfortunately, I have no idea what sequence of events caused the
  creation of the VMDK file above. No invocation of `createrawvmdk` is
  producing a VMDK file with the CHS settings above.
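  For reference, both cylinder counts are what falls out of dividing a
  sector count by 16*63 and rounding down. A short worked sketch, assuming
  512-byte sectors; the helper names `chs_sectors` and `cyls_for` are made
  up for illustration:

```shell
#!/bin/bash
# chs_sectors C H S: total sectors addressable with that geometry.
chs_sectors() { echo $(( $1 * $2 * $3 )); }

# cyls_for SECTORS H S: number of whole cylinders that fit into SECTORS.
cyls_for() { echo $(( $1 / ($2 * $3) )); }

chs_sectors 1523 16 63     # 1535184
cyls_for 1536000 16 63     # 1523 (1536000 is the extent size in the working VMDK)
cyls_for 2097152 16 63     # 2080 (2097152 sectors = 1 GiB / 512)
```

  So 1523 cylinders matches the 1536000-sector extent recorded in the
  working VMDK (1536000 * 512 bytes is exactly 750 MiB), while 2080
  matches the full 1 GiB file. Why the extent was recorded as 1536000
  sectors is not something this arithmetic can answer.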

  I've only just begun to learn about the intricacies of CHS. Am I to
  understand that these values are stored amongst the first 512 bytes of
  the disk? If this is the case, then I wonder what changed the data,
  and why. I was initially only using QEMU 2.9.0, and didn't move the
  image to different VMs or QEMU versions. Perhaps Windows NT got
  confused about the disk CHS and rewrote it?
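  On the "first 512 bytes" question: an MBR does not store a disk geometry
  as such, but each partition table entry carries start/end CHS addresses,
  and FAT/NTFS boot sectors carry sectors-per-track and heads fields at
  offsets 0x18 and 0x1A. A sketch for dumping those from a raw image,
  assuming a classic MBR layout and 512-byte sectors; both function names
  are hypothetical:

```shell
#!/bin/bash
# mbr_first_end_chs IMAGE
# Prints "C H S" for the last sector of the first MBR partition entry.
mbr_first_end_chs() {
	local img=$1 h s c
	# The partition table starts at byte 446; bytes 5-7 of an entry hold the
	# end address: head, sector (low 6 bits) plus cylinder bits 8-9, cylinder low.
	read -r h s c < <(od -An -tu1 -j451 -N3 "$img")
	echo "$(( ((s & 0xC0) << 2) | c )) $h $(( s & 0x3F ))"
}

# bpb_geometry IMAGE BOOT_SECTOR_LBA
# Prints "SECTORS_PER_TRACK HEADS" from the boot sector at the given LBA.
bpb_geometry() {
	local img=$1 lba=$2 a b c d
	# Two little-endian 16-bit fields at offsets 0x18 and 0x1A.
	read -r a b c d < <(od -An -tu1 -j$(( lba * 512 + 24 )) -N4 "$img")
	echo "$(( a | (b << 8) )) $(( c | (d << 8) ))"
}
```

  For a partition that ends on a cylinder boundary, the end head plus one
  and the end sector hint at the heads and sectors-per-track the
  partitioning tool assumed. Whether a mismatch between these values and
  the geometry QEMU presents is what breaks this image is not established
  here.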

  
  == Sporadic BIOS-level boot failure ========================

  I have multiple screenshots of SeaBIOS in QEMU 2.9.0 halting with "No
  bootable device" (et al), even with the above manually-applied CHS
  settings.

  Commit e689f7c also presents such errors.

  Commit 306ec6c does not suffer from intermittent breakage of any kind:

  - No SeaBIOS flake-outs
  - No "Non-System disk or disk error"
  - No "A disk read error occurred"
  - No "General failure ..."

  While most of my confidence in commit 306ec6c is based on anecdotal
  evidence, I modified `tmux-ctl-loop` a little to soak-test BIOS-level
  I/O stability and left this modified version running for a few
  minutes.

  --8<--------------------------------------------------------

  #!/bin/bash

  port=4444

  tmux=./tmux

  printf -v l '%0.0s-' {0..25}
  h1="$l/ buffer dump begin \\$l"
  h2="$l-\\ buffer dump end /-$l"

  while :; do
  	
  	while :; do
  		echo | nc localhost $port -q0 -w1 > /dev/null && break
  		echo 'Start qemu!'
  	done
  	
  	buffer="$(tmux -S $tmux capture-pane; tmux -S $tmux save-buffer -)"
  	
  	echo "$h1"
  	[[ "$buffer" ]] && echo "$buffer" || echo '( * Screen buffer is empty * )'
  	echo "$h2"
  	
  	if echo "$buffer" | grep -qi 'Non-system disk' || echo "$buffer" | \
  		grep -q 'No bootable device'
  	then
  		
  		s="<Hit error after $(< itercount) runs.>"
  		echo "$s"
  		echo "$s" >> stats
  		
  		touch STOP
  		
  		#echo q | nc localhost $port -q0 > /dev/null
  		
  		exit
  		
  	elif echo "$buffer" | grep -q 'OS Loader V4.00' || echo "$buffer" | \
  		grep -q 'A disk read error'
  	then
  	
  		echo '<Boot did not hang at BIOS, trying again>'
  		
  		echo q | nc localhost $port -q0 > /dev/null
  		
  	else
  		
  		echo '<Waiting for boot>'
  		
  	fi
  			
  done

  ------------------------------------------------------------

  For the above to work, the top of run-qemu-loop must also be modified
  to read something along the lines of

  disk=/path/to/wnt4ts-broken.img,format=raw,cyls=1523,heads=16,secs=63

  (Suggestion: modify copies of both scripts)

  One small terminal-flicker-headache (and a 57°C CPU) later, I was able
  to carefully observe just over 350 successful runs in which QEMU
  commit 306ec6c only ever produced a boot menu. No other hitches.

  ** Important: **

  However, commit 306ec6c will fail to boot, ever, if the disk geometry
  (cyls/heads/secs) is not set to the values VirtualBox "discovered". (Of note
  is the fact that QEMU (2.9.0) was what initially created this image. I
  must admit that I don't remember what sequence of QEMU versions I fed
  the image to - and I maybe, possibly, didn't think to back the file up
  (sorry), so maybe something mangled something somewhere. But
  VirtualBox figured it out nonetheless!)

  Furthermore, feeding /dev/loopN to any QEMU version will NOT result in
  correct CHS discovery (and successful boot).

  This is what leads me to conclude that I've discovered two separate
  issues.


  == Appendix: How to build the branches =====================

  It's very simple.

  First, `git clone https://github.com/qemu/qemu` somewhere if you don't
  already have a local copy. If you have an old git checkout that's from
  2014 or later, you can use that old checkout instead. (If you want to
  test an old checkout you have, the commands below will either work
  perfectly or completely bomb out with no side effects.)

  A full checkout is a ~183MB download. Sorry.

  Next, create two new directories somewhere. Name them what you like,
  eg `qemu-working` and `qemu-broken`.

  Now, cd into the checkout directory, and run:

  $ git archive 306ec6c3cece7004429c79c1ac93d49919f1f1cc | tar xC /path/to/qemu-working/

  $ git archive e689f7c668cbd9d08f330e17c3dd3a059c9553d3 | tar xC /path/to/qemu-broken/

  The paths can be relative.

  Now, run this in both of the new directories:

  $ ./configure --python=python2.7 --disable-libssh2 --disable-seccomp \
      --disable-usb-redir --disable-guest-agent --disable-libiscsi \
      --disable-spice --disable-smartcard-nss --disable-vhost-net \
      --disable-docs --disable-attr --disable-cap-ng --disable-vde \
      --disable-user --disable-bluez --disable-vnc-ws --disable-xen \
      --disable-brlapi --enable-debug --target-list=i386-softmmu \
      --disable-fdt

  $ make -j64

  You can open two terminals and configure and build both simultaneously
  if you like.

  On my decent but very basic (2-core+HT) i3 box, -j64 actually works out - make doesn't actually launch too many gcc processes. You *will* see your system load spike to ~20 though :)
  (NB. Do. not. use. -j64. with. the. linux. kernel.)

  On my system, a single build with -j64 takes only about 35 seconds. C
  FTW. (Although this has increased to 1min20sec for more recent
  builds.)

  Most of the configure arguments remove functionality I'll never use
  (in this situation) and which will only slow down the build.

  Once QEMU is built, run qemu-system-i386 directly from where it has
  been built.

  $ /path/to/qemu-working/i386-softmmu/qemu-system-i386 ...
  $ /path/to/qemu-broken/i386-softmmu/qemu-system-i386 ...

  Again, the paths can be relative.
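  To narrow the 306ec6c..e689f7c range further, `git bisect run` expects
  a command whose exit status is 0 for "good", 125 for "cannot test this
  commit", and any other value from 1 to 127 for "bad". A rough sketch of
  mapping the harness's `STOP` and `itercount` files onto that convention.
  `bisect_verdict` and the default threshold of 350 runs are illustrative
  assumptions, and because the failure is intermittent a "good" verdict
  is only ever probabilistic:

```shell
#!/bin/bash
# bisect_verdict [DIR] [MIN_RUNS]
# Returns 1 (bad) if the harness created STOP in DIR, 0 (good) if at least
# MIN_RUNS iterations were recorded without a failure, else 125 (skip).
bisect_verdict() {
	local dir=${1:-.} min_runs=${2:-350} runs=0
	if [ -f "$dir/STOP" ]; then
		return 1
	fi
	if [ -f "$dir/itercount" ]; then
		runs=$(< "$dir/itercount")
	fi
	if [ "$runs" -ge "$min_runs" ]; then
		return 0
	fi
	return 125
}
```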

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1745312/+subscriptions


* Re: [Qemu-devel] [Bug 1745312] [NEW] Regression report: Disk subsystem I/O failures/issues surfacing in DOS/early Windows [two separate issues: one bisected, one root-caused]
  2018-01-25  7:18 [Qemu-devel] [Bug 1745312] [NEW] Regression report: Disk subsystem I/O failures/issues surfacing in DOS/early Windows [two separate issues: one bisected, one root-caused] i336_
@ 2018-01-29 13:37 ` Stefan Hajnoczi
  2018-01-30  7:10 ` [Qemu-devel] [Bug 1745312] " Fam Zheng
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Stefan Hajnoczi @ 2018-01-29 13:37 UTC (permalink / raw)
  To: Bug 1745312; +Cc: qemu-devel, John Snow, Fam Zheng


On Thu, Jan 25, 2018 at 07:18:52AM -0000, i336_ wrote:
> Public bug reported:
> 
> [Headsup: This report is long-ish due to the amount of detail I've
> stumbled on along the way that I think is relevant to include. I can't
> speak as to the complexity of the actual bugs, but the size of this
> report should not suggest that the reproduction process is particularly
> headache-inducing.]

I've CCed people who may be able to help.

I don't have time to read through everything you've posted.

> Hi!
> 
> I recently needed to fire up some ancient software for research purposes
> and got very distracted discovering and playing with old versions of
> Windows :). In the process I've discovered some glitches with disk I/O.
> 
> I believe I've stumbled on two completely separate issues that
> coincidentally surfaced at the same time. It's possible that components
> of this report will be re-filed as more specific new bugs, but I'm not
> an authority on QEMU internals or how to narrow down/categorize what
> I've found.
> 
> - The first bug only surfaces when the "isapc" machine type is used. It
> intermittently produces "General failure {read,writ}ing drive _" under
> MS-DOS 6.22, and also somehow interferes with early bootstrap of Windows
> NT 4 (in NTLDR). Enabling or disabling KVM (I'm on Linux) appears to
> make no difference whatsoever, which may help with debugging.

Is this using the IDE disk controller?  In that case John Snow can help
you debug what's going on at the IDE level.

> - The second issue involves
>   - a WinNT4 disk image
>   - created by running through a bog-standard NT4 install inside QEMU 2.9.0
>   - which will now fail to boot in any version of QEMU - even version 1.0
>     - but which VirtualBox will boot fine
>       - but only if I point VirtualBox at QEMU's raw disk image via a
>         hacked-together VMDK file
>       - if the raw image is converted to VHD(X), VirtualBox will also fail
>         to boot the image with exactly the same error as QEMU
>       - this state of affairs is not affected by image sparseness (which makes
>         sense)

VMDK stores the disk geometry (cylinders, heads, sectors), which may
affect guest software.  I've CCed Fam Zheng.

> 
> I'm confident I've bisected the first issue.
> 
> I wasn't able to bisect the second issue (as all tested versions of QEMU
> behaved identically), but I've figured out a working repro testcase and
> I believe I've managed to pin down a solid root cause.
> 
> 
> == #1: Intermittent I/O issues when `-M isapc` is used =====
> 
> These symptoms sometimes take a small amount of time and fiddling to
> trigger, but I AM able to consistently surface them on my machine after
> a short while. (I am very very interested to hear if others cannot
> reproduce them.)
> 
> So, first of all:
> 
> https://github.com/qemu/qemu/commit/306ec6c3cece7004429c79c1ac93d49919f1f1cc
>   (Jul 30 2013): the last version that works
> 
> https://github.com/qemu/qemu/commit/e689f7c668cbd9d08f330e17c3dd3a059c9553d3
>   (Oct 30 2013): the first version that intermittently fails
> 
> Maybe lift out and build these branches while reading. *shrug*
> (How to do this can be found at the end of this report - along with a time-saving ./configure line, FWIW)
> 
> Here are the changelists between these two revisions:
> 
> https://github.com/qemu/qemu/compare/306ec6c...e689f7c
> (Compare direction: OLD to NEW) (Commits: 166  Files changed: 192)
> 
> https://github.com/qemu/qemu/compare/e689f7c...306ec6c
> (Compare direction: NEW to OLD) (Commits: 30   Files changed: 22)
> 
> (Someone else more familiar with Git might know why GitHub returns
> results for both compare directions, and/or if the 2nd link is useful
> information. The first link returns a lot more results than the 2nd one,
> at least. Does comparing new>old return deletions?)
> 
> ---
> 
> Now on to the symptoms. In a moment I'll describe reproduction.
> 
> # MS-DOS 6.22
> 
> The first symptom I discovered was that trivial read and write
> operations under MS-DOS would sometimes fail:
> 
>   C:\>echo test > hi
> 
>   General failure writing drive C
>   Abort, Retry, Fail?
> 
> Anything else that exercises the disk behaves similarly:
> 
>   C:\>dir /s > nul
> 
>   General failure reading drive C
>   Abort, Retry, Fail?
> 
> (Note that the above demonstrates both write and read failures)
> 
> (Also, FWIW, `dir /s` == `ls -R`)
> 
> The behavior of the I/O errors is not possible to characterise as it
> fluctuates so much. For example something as simple as DIR can produce
> wildly differing results: in one run, poking around with DIR ended with
> DOS deciding C:\ was empty at one point; at another point in a different
> run C:\ mysteriously dropped 50% of its contents only to magically gain
> it all back moments later after some poking around in one of the
> subdirectories that was still visible.
> 
> The time it takes to trigger these errors is also highly variable. QEMU
> may fall over as early as hanging forever at "Starting MS-DOS...", or I
> might get all the way into Windows 3.1 before it triggers (in which case
> Win3.1 reports vague memory errors - of all things).
> 
> Very occasionally I've seen _SeaBIOS itself_ report "Booting from Hard
> Disk..." "Boot failed: could not read the boot disk" ... "No bootable
> device.", and on one occasion I even got "Non-System disk or disk error"
> "Replace and strike any key when ready"!
> 
> 
> # WinNT 4 Terminal Server
> 
> Most of the time, NTLDR will fire up normally. But every so often...
> 
>   SeaBIOS (version rel-1.7.3-117-g31b8b4e-20131206_080705-nilsson.home.kraxel.org)
> 
>   Booting from Hard Disk...
>   A disk read error occurred.
>   Insert a system diskette and restart
>   the system.
> 
> (NB. You're seeing the old SeaBIOS version included with e689f7c, which
> was the first buggy commit.)
> 
> If NT gets past this point without erroring out (ie, it makes it to the
> boot menu), the rest of the system is 100% fine and there are no other
> disk I/O issues whatsoever. For example, on QEMU 2.9.0 I was able to
> enable disk compression, answer "Yes" to "Compress entire disk now?" and
> have the process fully complete. No hitches.
> 
> This makes me vaguely recall/wonder that perhaps this could be somehow
> related to LBA and/or Int 13h, or something floating around near that
> bunch of functionality. (I'm woefully ignorant about such low-level
> details.) Perhaps DOS/Win3.1 are stuck using a disk mode that QEMU has a
> buggy implementation of, while NT 4 (once NTOSKRNL is up and running) is
> able to use a different disk mode or access mechanism.
> 
> I'm really interested to get some understanding of what the root issue
> is here, when this is fixed. (I wonder if it's a timing thing?)
> 
> I've observed some unusual behavior with repeated restarts. In one case,
> I attempted to start NT4 multiple times, and QEMU consistently failed
> with "No bootable device" each time. So, I removed `-M isapc`, promptly
> got a boot menu, hit ^C, readded `-M isapc` - and continued to get a
> boot menu. Yep. I'll accept "really really big coincidence" but I do
> very much wonder if something else is going on here. I've observed many
> similar incidents. It makes me wonder whether the contents of memory or
> some other system state is an influence. Very probably not, but still...
> 
> 
> -- Reproduction --------------------------------------------
> 
> First of all, there was unfortunately no way for me to avoid having to
> post entire disk images, but I've managed to compress everything down to
> 174MB total download size.
> 
> FWIW, WinWorld and many other sites seem to have no operational issues
> providing clear pointers to CD keys; I consider my distribution of my
> installed HDD images an extension of the apparent status quo.
> 
> That being said, I've put everything on Google Drive so nobody has to
> headscratch about Launchpad/Canonical/etc's stance on hosting this data.
> 
> So, this folder contains the disk images:
> https://drive.google.com/drive/folders/1WdZVBh5Trs9HLC186-nHyKeqaSxfyM2c
> ("Download all" at the top-right will create a ZIP file, but FWIW
> downloading the individual files simultaneously would implement a rough
> form of download acceleration)
> 
> File meta info:
> 
> Compressed
> |
> |      Apparent
> |      |    Actual
> |      |    |
> 38M -> 200M (103M)  win31.img.xz
> 82M -> 1G   (289M)  wnt4ts-broken.img.xz
> 55M -> 350M (146M)  wnt4ts-intermittent.img.xz
> 
> SHA-256s:
> 
> win31:        8179b8180a2ab40bd472e8a2f3fb89fc331651e56923f94ceb9e52a78ee220d2
> broken:       a2af5f0bc49a063b75f534b6ffe5b82e32ecc706a64a425b6626feccf6e3fdfa  
> intermittent: 77ae8c458829ebcdd64c71042012f45d5a2788e6ebd22db9d53de9ef1a574784
> 
> (Wanted to keep the checksum lines within 80 columns)
> 
> And, since I can't figure out where else in this report to put this,
> wnt4ts-broken.img's password is "admin" but something seems to have
> happened to the disk and NT doesn't actually boot properly :(, and
> wnt4ts-intermittent.img's password is "1234". (These were set up as test
> images. Now I'm _really_ glad I used simple passwords! :) )
> 
> ---
> 
> 
> I have two testcases: DOS 6.22 (+ Windows 3.1), and Windows NT 4.
> 
> 
> # MS-DOS
> 
> DOS is the simplest. It basically consists of
> 
> $ qemu-system-i386 -drive file=win31.img,format=raw -M isapc -enable-kvm
> 
> And then literally just playing around. Things to try include creating
> files (`echo blah > file`), repeatedly seeking across the entire FAT
> (`dir /s > nul` or `dir /s`), and launching Windows (`win`).
> 
> win31.img is not special (as far as I can tell) and merely consists of
> the result of installing DOS 6.22 and Windows 3.1 from WinWorldPC. I've
> basically just included the image for convenience.
> 
> Generally no single "run" is immune to starting Win3.1 and then
> launching File Manager; if that doesn't generate an error, something is
> definitely up.
> 
> The second best trigger is creating new files. That very very frequently
> produces "General Failure ...", but not always.
> 
> 
> # WinNT 4
> 
> Windows NT 4 is a bit more complicated. Because this error only occurs
> at presumably a single small point very early in boot, the window of
> opportunity for the glitch to surface within is much much narrower and
> thus often requires a larger number of tries.
> 
> Anecdotally I've had QEMU hit the boot error at the first try/run, and
> after as many as 63 "successful" boots.
> 
> I made a small test harness that automates the launch process. It
> consists of two shell scripts and requires tmux (and netcat).
> (*Potential epilepsy warning*: if you use a light-colored terminal
> background, the terminal QEMU is repeatedly invoked from will
> continuously flash rapidly from white to black.)
> 
> One of the scripts is run inside a tmux session in one terminal, while
> the other script is run in its own terminal (without any tmux).
> 
> 
> I named this one `run-qemu-loop`:
> 
> --8<--------------------------------------------------------
> 
> #!/bin/bash
> 
> # ---
> 
> qemu=/path/to/qemu-system-i386
> #or, alternatively: (I used the following line myself so I
> #could tab-complete my way to different qemu executables)
> #qemu="$1"
> 
> disk=/path/to/wnt4ts-intermittent.img
> 
> # ---
> 
> port=4444
> 
> rm -f STOP itercount
> 
> itercount=0
> 
> while :; do
> 	
> 	[ -f STOP ] && break
> 	
> 	((itercount++))
> 	echo $itercount > itercount
> 	
> 	$qemu \
> 		-enable-kvm -vga cirrus -curses -M isapc \
> 		-drive file="$disk",format=raw \
> 		-chardev socket,id=mon0,host=localhost,port=$port,server,nowait \
> 		-mon chardev=mon0,mode=readline
> 	
> 	#point to an otherwise-unused terminal if you like (see also: `tty`)
> 	#echo "$itercount run(s)" > /dev/pts/__
> 	
> done
> 
> ------------------------------------------------------------
> 
> Not much logic above; this just repeatedly runs QEMU for as long as
> the file `STOP` does not exist in the current directory.
> 
> The key "magic" bit is that QEMU is launched in -curses mode.
> 
> The other key bit is that the above script is run inside tmux.
> 
> 
> Here's `tmux-ctl-loop`:
> 
> --8<--------------------------------------------------------
> 
> #!/bin/bash
> 
> port=4444
> 
> tmux=./tmux
> 
> printf -v l '%0.0s-' {0..25}
> h1="$l/ buffer dump begin \\$l"
> h2="$l-\\ buffer dump end /-$l"
> 
> while :; do
> 	
> 	while :; do
> 		echo | nc localhost $port -q0 -w1 > /dev/null && break
> 		echo 'Start qemu!'
> 	done
> 	
> 	buffer="$(tmux -S $tmux capture-pane; tmux -S $tmux save-buffer -)"
> 	
> 	echo "$h1"
> 	[[ "$buffer" ]] && echo "$buffer" || echo '( * Screen buffer is empty * )'
> 	echo "$h2"
> 	
> 	if echo "$buffer" | grep -q 'A disk read error occurred.'; then
> 		
> 		s="<Crashed after $(< itercount) runs.>"
> 		echo "$s"
> 		echo "$s" >> stats
> 		
> 		touch STOP
> 		
> 		#echo q | nc localhost $port -q0 > /dev/null
> 		
> 		exit
> 		
> 	elif echo "$buffer" | grep -q 'OS Loader V4.00'; then
> 			
> 		echo '<Booted successfully, trying again>'
> 		
> 		echo q | nc localhost $port -q0 > /dev/null
> 		
> 	else
> 		
> 		echo '<Waiting for boot>'
> 		
> 	fi
> 			
> done
> 
> ------------------------------------------------------------
> 
> Nothing particularly amazing going on here either.
> 
> While `run-qemu-loop` is running inside tmux in the first terminal, this
> is running in the 2nd one.
> 
> The small infinite loop at the top only breaks when it can successfully
> ping QEMU and it knows it's running.
> 
> Then, a screendump of the contents of the terminal QEMU is in is fetched
> from tmux, and the buffer's content is analyzed.
> 
> - If NTLDR fails, the script creates `STOP` to halt run-qemu-loop, and
>   then the script exits. (The line that would send `q` to QEMU is commented
>   out in this branch, so the failed QEMU instance is left running.)
> 
> - If NTLDR loads successfully, the script sends `q` to QEMU and continues
>   looping. (run-qemu-loop will not find the `STOP` file, and so restarts qemu.)
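The decision described in the two bullets above can be pulled out into a small pure function, which makes it easy to check against canned screen text without tmux or QEMU. This is only a sketch: the name `classify_screen` and the three state names are made up for illustration; the two match strings are the ones `tmux-ctl-loop` greps for.

```shell
# classify_screen TEXT
# Prints one of: failed, booted, waiting (hypothetical state names).
# The error string is checked first, as in tmux-ctl-loop.
classify_screen() {
	case $1 in
		*'A disk read error occurred.'*) echo failed ;;
		*'OS Loader V4.00'*)             echo booted ;;
		*)                               echo waiting ;;
	esac
}
```

Inside the loop this could be used as `state=$(classify_screen "$buffer")` in place of the if/elif chain.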
> 
> The scripts run very quickly, with 2-3 iterations per second on my i3
> box.
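Since `tmux-ctl-loop` appends one line per failure to `stats`, repeated sessions can be summarised afterwards. A minimal sketch, assuming the file only contains lines in the `<... after N runs.>` shape the script writes; the function name `summarize_stats` is hypothetical.

```shell
# summarize_stats FILE
# FILE holds lines such as "<Crashed after 12 runs.>".
# Prints how many failures were recorded and the min/max/mean run count.
summarize_stats() {
	sed -n 's/.* after \([0-9][0-9]*\) runs\.>$/\1/p' "$1" |
	awk '
		{ v = $1 + 0; n++; sum += v
		  if (n == 1 || v < min) min = v
		  if (n == 1 || v > max) max = v }
		END {
			if (n) printf "failures=%d min=%d max=%d mean=%.1f\n", n, min, max, sum / n
			else   print "no failures recorded"
		}'
}
```

Run it as `summarize_stats stats` from the directory the scripts were started in.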
> 
> 
> # Usage
> 
> Save the two scripts above to the same directory as wnt4ts-intermittent.img,
> then:
> 
> - (If port 4444 doesn't work, the value needs to be changed in both
> scripts.)
> 
> - In the first terminal, run `tmux -S <file>`, where <file> names the socket
>   tmux will use. This needs to match "tmux=" at the top of `tmux-ctl-loop`.
>   (with `tmux=./tmux`, the command would be `tmux -S tmux`)
> 
> - Still in the first terminal (and now also inside tmux), enter
>   `./run-qemu-loop`, passing the path to qemu if you're using that approach
>   (refer to the first few lines of the script). Don't hit enter yet.
> 
> - Now, in the 2nd terminal, type `./tmux-ctl-loop`
> 
> - Hit enter in both terminals.
>  
> 
> Rationale for timing of Enter key:
> 
> - Running run-qemu-loop first will start QEMU, and if NTLDR starts
>   successfully it will immediately begin counting down from 30. If NT actually
>   starts to boot and is then hard-shut-down this /may/ affect the disk image.
> 
> - tmux-ctl-loop will annoyingly spam a continuous stream of 'Start qemu!' until
>   run-qemu-loop is running.
> 
> - Starting both scripts at "more or less" the same time (no rush) works out
>   well.
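If the start-up ordering gets in the way, the polling loop at the top of `tmux-ctl-loop` could be made less noisy. A rough sketch, not a drop-in patch: it reuses the same netcat probe the script already relies on (including `-q0`, which not every netcat build accepts), but sleeps between attempts and gives up after a bounded number of tries. The function name and its defaults are invented here.

```shell
# wait_for_monitor PORT [TRIES] [DELAY]
# Polls the QEMU monitor port with the netcat probe from tmux-ctl-loop.
# Returns 0 once the port answers, 1 after TRIES failed attempts.
wait_for_monitor() {
	local port=$1 tries=${2:-50} delay=${3:-1} i=0
	while [ "$i" -lt "$tries" ]; do
		if echo | nc localhost "$port" -q0 -w1 > /dev/null 2>&1; then
			return 0
		fi
		i=$((i + 1))
		sleep "$delay"
	done
	return 1
}
```

For example, `wait_for_monitor 4444 || { echo 'Start qemu!'; exit 1; }` prints the reminder once instead of continuously.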
> 
> 
> Hopefully potential script modifications are obvious; for example
> 
> - changing tmux-ctl-loop to not send 'q' to qemu so you can connect to the HMP
>   yourself
>   (NB, if `STOP` is not created, when qemu finally exits it will of course
>   promptly be relaunched)
> 
> - pointing run-qemu-loop to a modified qemu binary
> 
> 
> == #2: QEMU-vs-VirtualBox image issue ======================
> 
> I was initially completely stumped by this issue, perhaps unsurprisingly
> so. :)
> 
> wnt4ts-broken.img is a perfectly ordinary NT 4 installation that was
> created in QEMU 2.9.0. I created a 1GB disk with `truncate`, picked NTFS
> and installed everything (which took a while).
> 
> NT setup reboots a number of times during the install process, and IIRC
> those all went just fine. However, at some point, the image began to
> consistently bomb out with "A disk read error occurred. ...", and
> stubbornly refused to boot, regardless of the number of boot attempts I
> tried.
> 
> QEMU 2.0.0, 1.5.0, and 1.0 (the earliest version I was able to build on
> my system) all consistently hit "disk read error occurred".
> 
> I tried compiling QEMU 1.0 using clang so I could build for 32-bit on my
> 64-bit system (GCC 7 died with "Frame pointer required, but reserved").
> The resulting qemu completely crashed if I didn't enable KVM (ie, TCG
> was (understandably) broken); with KVM enabled qemu didn't crash, but
> NTLDR halted with the same error as on 64-bit qemu. (TL;DR, no
> difference whatsoever.)
> 
> My initial reaction at this point was to try the image on another
> virtualization platform. My first pick was VirtualBox.
> 
> So, I followed the official instructions for pointing VirtualBox to
> physical disk images, except I substituted a /dev/loopN device I'd
> pointed to the image file via losetup.
> 
> And... VirtualBox picked the image up fine and Just Worked(TM). Yay! -
> but not yay. What gives?!
> 
> Confused, I then tried to convert the disk image to VHD format.
> Unfortunately, for some reason, if I try `qemu-img convert ... -O vhdx
> ...`, VirtualBox chokes on the result:
> 
> -----
> 
> VD: error VERR_NOT_SUPPORTED opening image file
> '/.../wnt4ts-broken-qemuconv.vhd' (VERR_NOT_SUPPORTED).
> 
> Result Code: NS_ERROR_FAILURE (0x80004005)
> Component: MediumWrap
> Interface: IMedium {4afe423b-43e0-e9d0-82e8-ceb307940dda}
> Callee: IVirtualBox {0169423f-46b4-cde9-91af-1e9d5b6cd945}
> Callee RC: VBOX_E_OBJECT_NOT_FOUND (0x80BB0001)
> 
> -----
> 
> Welp.
> 
> Well, a bit more digging later, and I found I could do
> 
> $ VBoxManage convertfromraw wnt4ts-broken.img wnt4ts-broken.vhd
> 
> but... as soon as I pointed VirtualBox to this, it too began to choke
> with "A disk read error occurred".
> 
> And yet, the VMDK->raw image setup worked just fine.
> 
> I found I could even replace the loop device with the path of the .img
> file itself and that worked just fine too.
> 
> At my wits' end, I followed some online instructions to learn about
> manual CHS configuration so I could try and get the image working in
> Bochs. "A disk read error occurred". I wasn't surprised.
> 
> It was at this point I began to give up, but I decided to try One Last
> Thing(TM) before properly throwing in the towel.
> 
> :)
> 
> I decided to learn a bit more about how `VBoxManage internalcommands
> createrawvmdk` worked, and try one thing in particular: I can edit the
> .vmdk file, but can I point `createrawvmdk` at the .img file directly
> too?
> 
> Turns out, yes you can.
> 
> It also turns out that this promptly caused VirtualBox to bomb out.
> 
> Interesting.
> 
> For reference, here's the VMDK file I initially created (by pointing
> `createrawvmdk` at /dev/loopN) and then later edited to point straight
> to the .img file, with both approaches resulting in successful boot.
> 
> --8<--------------------------------------------------------
> 
> # Disk DescriptorFile
> version=1
> CID=e35b9a45
> parentCID=ffffffff
> createType="fullDevice"
> 
> # Extent description
> RW 1536000 FLAT "/absolute/full/path/to/wnt4ts-broken.img" 0
> 
> # The disk Data Base 
> #DDB
> 
> ddb.virtualHWVersion = "4"
> ddb.adapterType="ide"
> ddb.geometry.cylinders="1523"
> ddb.geometry.heads="16"
> ddb.geometry.sectors="63"
> ddb.uuid.image="871a6044-c8ca-48ed-b7aa-e6fc49da3db4"
> ddb.uuid.parent="00000000-0000-0000-0000-000000000000"
> ddb.uuid.modification="3661715c-3906-4e4a-ab65-486d140e03b8"
> ddb.uuid.parentmodification="00000000-0000-0000-0000-000000000000"
> ddb.geometry.biosCylinders="761"
> ddb.geometry.biosHeads="32"
> ddb.geometry.biosSectors="63"
> 
> ------------------------------------------------------------
> 
> 
> Here's the _diff_ of what happens if I point `createrawvmdk` at wnt4ts-broken.img directly:
> 
> --8<--------------------------------------------------------
> 
> ddb.geometry.cylinders="2080"
> ddb.geometry.heads="16"
> ddb.geometry.sectors="63"
> 
> ------------------------------------------------------------
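Both cylinder counts fall out of integer division of a sector count by heads times sectors-per-track. A small sketch of that arithmetic: 1536000 is the extent size from the `RW` line of the working VMDK, while 2097152 assumes the image is exactly 1 GiB of 512-byte sectors (the report only says "1GB", so that figure is an assumption). The helper name `chs_cylinders` is made up.

```shell
# chs_cylinders SECTORS HEADS SECTORS_PER_TRACK
# Whole cylinders that fit in SECTORS (integer division, remainder dropped).
chs_cylinders() {
	echo $(( $1 / ($2 * $3) ))
}

chs_cylinders 1536000 16 63   # extent size from the working VMDK
chs_cylinders 1536000 32 63   # same extent with the 32-head BIOS geometry
chs_cylinders 2097152 16 63   # 1 GiB / 512, assuming an exactly 1 GiB image
```

This prints 1523, 761 and 2080. So the working descriptor's geometry is consistent with its own 1536000-sector extent line (750 MiB), whereas 2080 is what a full 1 GiB file would yield. That hints that the working VMDK was generated against something 750 MiB in size, though nothing in the report confirms how.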
> 
> :D
> 
> Naturally,
> 
> $ qemu-system-i386 \
>     -drive file=wnt4ts-broken.img,format=raw,cyls=1523,heads=16,secs=63 \
>     -M isapc -sdl
> 
> will boot happily on 2.9.0 (notwithstanding the occasional "disk read
> error occurred" documented above).
> 
> It will also boot in 1.6.0.
> 
> (POTENTIAL BUG HEADSUP: 1.0 and 1.5.0 both lock up with a blank 640x480
> window and use 0% CPU if I specify `-M isapc`.)
> 
> And, of course, using these CHS values in Bochs also results in a
> successful boot (after setting the CPU type to pentium).
> 
> Unfortunately, I have no idea what sequence of events caused the
> creation of the VMDK file above. No invocation of `createrawvmdk` is
> producing a VMDK file with the CHS settings above.
> 
> I've only just begun to learn about the intricacies of CHS. Am I to
> understand that these values are stored amongst the first 512 bytes of
> the disk? If this is the case, then I wonder what changed the data, and
> why. I was initially only using QEMU 2.9.0, and didn't move the image to
> different VMs or QEMU versions. Perhaps Windows NT got confused about
> the disk CHS and rewrote it?
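On where geometry lives on disk: in the classic MBR layout the partition table starts at byte 446 of sector 0, and each 16-byte entry stores start and end CHS triples next to the starting LBA. Separately, the boot sector of a FAT or NTFS volume records sectors-per-track and head count at byte offsets 0x18 and 0x1A. The sketch below just dumps those fields so two copies of an image can be compared. It assumes an MBR-partitioned image with the volume in the first slot, and the function names are invented. Whether any of these fields were rewritten here is not something the report establishes.

```shell
# bytes_at FILE OFFSET COUNT: print COUNT bytes as unsigned decimals.
bytes_at() {
	od -An -v -tu1 -j"$2" -N"$3" "$1"
}

# mbr_part1 FILE: decode the first partition entry (16 bytes at offset 446).
mbr_part1() {
	set -- $(bytes_at "$1" 446 16)
	# $2..$4  = start head, sector | cylinder high bits, cylinder low byte
	# $6..$8  = the same three bytes for the end of the partition
	# $9..$12 = starting LBA, little-endian
	printf 'start C/H/S=%d/%d/%d end C/H/S=%d/%d/%d lba=%d\n' \
		$(( (($3 & 192) << 2) | $4 )) "$2" $(( $3 & 63 )) \
		$(( (($7 & 192) << 2) | $8 )) "$6" $(( $7 & 63 )) \
		$(( $9 | (${10} << 8) | (${11} << 16) | (${12} << 24) ))
}

# bpb_geometry FILE LBA: sectors/track and heads from the boot sector at LBA.
bpb_geometry() {
	set -- $(bytes_at "$1" $(( $2 * 512 + 24 )) 4)
	printf 'sectors_per_track=%d heads=%d\n' \
		$(( $1 | ($2 << 8) )) $(( $3 | ($4 << 8) ))
}
```

Typical use would be `mbr_part1 wnt4ts-broken.img`, then `bpb_geometry wnt4ts-broken.img N` with N set to the `lba=` value the first call printed.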
> 
> 
> == Sporadic BIOS-level boot failure ========================
> 
> I have multiple screenshots of SeaBIOS in QEMU 2.9.0 halting with "No
> bootable device" (et al), even with the above manually-applied CHS
> settings.
> 
> Commit e689f7c also presents such errors.
> 
> Commit 306ec6c does not suffer from intermittent breakage of any kind:
> 
> - No SeaBIOS flake-outs
> - No "Non-system disk or disk error"
> - No "A disk read error occurred"
> - No "General failure ..."
> 
> While most of my confidence in commit 306ec6c is based on anecdotal
> evidence, I modified `tmux-ctl-loop` a little to soak-test BIOS-level
> I/O stability and left this modified version running for a few minutes.
> 
> --8<--------------------------------------------------------
> 
> #!/bin/bash
> 
> port=4444
> 
> tmux=./tmux
> 
> printf -v l '%0.0s-' {0..25}
> h1="$l/ buffer dump begin \\$l"
> h2="$l-\\ buffer dump end /-$l"
> 
> while :; do
> 	
> 	while :; do
> 		echo | nc localhost $port -q0 -w1 > /dev/null && break
> 		echo 'Start qemu!'
> 	done
> 	
> 	buffer="$(tmux -S $tmux capture-pane; tmux -S $tmux save-buffer -)"
> 	
> 	echo "$h1"
> 	[[ "$buffer" ]] && echo "$buffer" || echo '( * Screen buffer is empty * )'
> 	echo "$h2"
> 	
> 	if echo "$buffer" | grep -q 'Non-system disk' || echo "$buffer" | \
> 		grep -q 'No bootable device'
> 	then
> 		
> 		s="<Hit error after $(< itercount) runs.>"
> 		echo "$s"
> 		echo "$s" >> stats
> 		
> 		touch STOP
> 		
> 		#echo q | nc localhost $port -q0 > /dev/null
> 		
> 		exit
> 		
> 	elif echo "$buffer" | grep -q 'OS Loader V4.00' || echo "$buffer" | \
> 		grep -q 'A disk read error'
> 	then
> 	
> 		echo '<Boot did not hang at BIOS, trying again>'
> 		
> 		echo q | nc localhost $port -q0 > /dev/null
> 		
> 	else
> 		
> 		echo '<Waiting for boot>'
> 		
> 	fi
> 			
> done
> 
> ------------------------------------------------------------
> 
> For the above to work, the top of run-qemu-loop must also be modified to
> read something along the lines of
> 
> disk=/path/to/wnt4ts-broken.img,cyls=1523,heads=16,secs=63
> (run-qemu-loop already appends `,format=raw` after "$disk")
> 
> (Suggestion: modify copies of both scripts)
> 
> One small terminal-flicker-headache (and a 57°C CPU) later, I was able
> to carefully observe just over 350 successful runs in which QEMU commit
> 306ec6c only ever produced a boot menu. No other hitches.
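To put 350 clean runs in perspective: if each boot failed independently with probability p, the chance of n clean runs in a row would be (1-p)^n, so the largest p still compatible with zero failures at a given confidence is 1 - (1 - confidence)^(1/n). Independence is an assumption here, and the report itself wonders whether state carries over between runs. The helper name is hypothetical.

```shell
# max_failure_rate RUNS [CONFIDENCE]
# Upper bound on the per-run failure probability after RUNS clean runs,
# assuming independent runs: p = 1 - (1 - CONFIDENCE)^(1/RUNS).
max_failure_rate() {
	awk -v n="$1" -v c="${2:-0.95}" \
		'BEGIN { printf "%.4f\n", 1 - exp(log(1 - c) / n) }'
}

max_failure_rate 350
```

For 350 runs the bound comes out at about 0.85% per boot at 95% confidence.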
> 
> ** Important: **
> 
> However, commit 306ec6c will fail to boot, ever, if the CHS geometry
> is not set to the values VirtualBox "discovered". (Of note is
> the fact that QEMU (2.9.0) was what initially created this image. I must
> admit that I don't remember what sequence of QEMU versions I fed the
> image to - and I maybe, possibly, didn't think to back the file up
> (sorry), so maybe something mangled something somewhere. But VirtualBox
> figured it out nonetheless!)
> 
> Furthermore, feeding /dev/loopN to any QEMU version will NOT result in
> correct CHS discovery (and successful boot).
> 
> This is what leads me to conclude that I've discovered two separate
> issues.
> 
> 
> == Appendix: How to build the branches =====================
> 
> It's very simple.
> 
> First, `git clone https://github.com/qemu/qemu` somewhere if you don't
> already have a local copy. If you have an old git checkout that's from
> 2014 or later, you can use that old checkout instead. (If you want to
> test an old checkout you have, the commands below will either work
> perfectly or completely bomb out with no side effects.)
> 
> A full checkout is a ~183MB download. Sorry.
> 
> Next, create two new directories somewhere. Name them what you like, eg
> `qemu-working` and `qemu-broken`.
> 
> Now, cd into the checkout directory, and run:
> 
> $ git archive 306ec6c3cece7004429c79c1ac93d49919f1f1cc \
>     | tar xC /path/to/qemu-working/
> 
> $ git archive e689f7c668cbd9d08f330e17c3dd3a059c9553d3 \
>     | tar xC /path/to/qemu-broken/
> 
> The paths can be relative.
> 
> Now, run this in both of the new directories:
> 
> $ ./configure --python=python2.7 --disable-libssh2 --disable-seccomp \
>     --disable-usb-redir --disable-guest-agent --disable-libiscsi \
>     --disable-spice --disable-smartcard-nss --disable-vhost-net \
>     --disable-docs --disable-attr --disable-cap-ng --disable-vde \
>     --disable-user --disable-bluez --disable-vnc-ws --disable-xen \
>     --disable-brlapi --enable-debug --target-list=i386-softmmu \
>     --disable-fdt
> 
> $ make -j64
> 
> You can open two terminals and configure and build both simultaneously
> if you like.
> 
> On my decent but very basic (2-core+HT) i3 box, -j64 actually works out -
> make doesn't actually launch too many gcc processes. You *will* see your
> system load spike to ~20 though :)
> (NB. Do. not. use. -j64. with. the. linux. kernel.)
> 
> On my system, a single build with -j64 takes only about 35 seconds. C
> FTW. (Although this has increased to 1min20sec for more recent builds.)
> 
> Most of the configure arguments remove functionality I'll never use (in
> this situation) and which will only slow down the build.
> 
> Once QEMU is built, run qemu-system-i386 directly from where it has been
> built.
> 
> $ /path/to/qemu-working/i386-softmmu/qemu-system-i386 ...
> $ /path/to/qemu-broken/i386-softmmu/qemu-system-i386 ...
> 
> Again, the paths can be relative.
> 
> ** Affects: qemu
>      Importance: Undecided
>          Status: New
> 
> 
> ** Tags: disk io qemu
> 
> -- 
> You received this bug notification because you are a member of qemu-
> devel-ml, which is subscribed to QEMU.
> https://bugs.launchpad.net/bugs/1745312
> 
> Title:
>   Regression report: Disk subsystem I/O failures/issues surfacing in
>   DOS/early Windows [two separate issues: one bisected, one root-caused]
> 
> Status in QEMU:
>   New
> 
> Bug description:
>   [Headsup: This report is long-ish due to the amount of detail I've
>   stumbled on along the way that I think is relevant to include. I can't
>   speak as to the complexity of the actual bugs, but the size of this
>   report should not suggest that the reproduction process is
>   particularly headache-inducing.]
> 
>   Hi!
> 
>   I recently needed to fire up some ancient software for research
>   purposes and got very distracted discovering and playing with old
>   versions of Windows :). In the process I've discovered some glitches
>   with disk I/O.
> 
>   I believe I've stumbled on two completely separate issues that
>   coincidentally surfaced at the same time. It's possible that
>   components of this report will be re-filed as more specific new bugs,
>   but I'm not an authority on QEMU internals or how to narrow
>   down/categorize what I've found.
> 
>   - The first bug only surfaces when the "isapc" machine type is used.
>   It intermittently produces "General failure {read,writ}ing drive _"
>   under MS-DOS 6.22, and also somehow interferes with early bootstrap of
>   Windows NT 4 (in NTLDR). Enabling or disabling KVM (I'm on Linux)
>   appears to make no difference whatsoever, which may help with
>   debugging.
> 
>   - The second issue involves
>     - a WinNT4 disk image
>     - created by running through a bog-standard NT4 install inside QEMU 2.9.0
>     - which will now fail to boot in any version of QEMU - even version 1.0
>       - but which VirtualBox will boot fine
>         - but only if I point VirtualBox at QEMU's raw disk image via a
>           hacked-together VMDK file
>         - if the raw image is converted to VHD(X), VirtualBox will also fail
>           to boot the image with exactly the same error as QEMU
>         - this state of affairs is not affected by image sparseness (which makes
>           sense)
> 
>   I'm confident I've bisected the first issue.
> 
>   I wasn't able to bisect the second issue (as all tested versions of
>   QEMU behaved identically), but I've figured out a working repro
>   testcase and I believe I've managed to pin down a solid root cause.
> 
> 
>   == #1: Intermittent I/O issues when `-M isapc` is used =====
> 
>   These symptoms sometimes take a small amount of time and fiddling to
>   trigger, but I AM able to consistently surface them on my machine
>   after a short while. (I am very very interested to hear if others
>   cannot reproduce them.)
> 
>   So, first of all:
> 
>   https://github.com/qemu/qemu/commit/306ec6c3cece7004429c79c1ac93d49919f1f1cc
>     (Jul 30 2013): the last version that works
> 
>   https://github.com/qemu/qemu/commit/e689f7c668cbd9d08f330e17c3dd3a059c9553d3
>     (Oct 30 2013): the first version that intermittently fails
> 
>   Maybe lift out and build these branches while reading. *shrug*
>   (How to do this can be found at the end of this report - along with a time-saving ./configure line, FWIW)
> 
>   Here are the changelists between these two revisions:
> 
>   https://github.com/qemu/qemu/compare/306ec6c...e689f7c
>   (Compare direction: OLD to NEW) (Commits: 166  Files changed: 192)
> 
>   https://github.com/qemu/qemu/compare/e689f7c...306ec6c
>   (Compare direction: NEW to OLD) (Commits: 30   Files changed: 22)
> 
>   (Someone else more familiar with Git might know why GitHub returns
>   results for both compare directions, and/or if the 2nd link is useful
>   information. The first link returns a lot more results than the 2nd
>   one, at least. Does comparing new>old return deletions?)
> 
>   ---
> 
>   Now on to the symptoms. In a moment I'll describe reproduction.
> 
>   # MS-DOS 6.22
> 
>   The first symptom I discovered was that trivial read and write
>   operations under MS-DOS would sometimes fail:
> 
>     C:\>echo test > hi
> 
>     General failure writing drive C
>     Abort, Retry, Fail?
> 
>   Anything else that exercises the disk behaves similarly:
> 
>     C:\>dir /s > nul
> 
>     General failure reading drive C
>     Abort, Retry, Fail?
> 
>   (Note that the above demonstrates both write and read failures)
> 
>   (Also, FWIW, `dir /s` == `ls -R`)
> 
>   The behavior of the I/O errors is not possible to characterise as it
>   fluctuates so much. For example something as simple as DIR can produce
>   wildly differing results: in one run, poking around with DIR ended
>   with DOS deciding C:\ was empty at one point; at another point in a
>   different run C:\ mysteriously dropped 50% of its contents only to
>   magically gain it all back moments later after some poking around in
>   one of the subdirectories that was still visible.
> 
>   The time it takes to trigger these errors is also highly variable.
>   QEMU may fall over as early as hanging forever at "Starting MS-
>   DOS...", or I might get all the way into Windows 3.1 before it
>   triggers (in which case Win3.1 reports vague memory errors - of all
>   things).
> 
>   Very occasionally I've seen _SeaBIOS itself_ report "Booting from Hard
>   Disk..." "Boot failed: could not read the boot disk" ... "No bootable
>   device.", and on one occasion I even got "Non-System disk or disk
>   error" "Replace and strike any key when ready"!
> 
>   
>   # WinNT 4 Terminal Server
> 
>   Most of the time, NTLDR will fire up normally. But every so often...
> 
>     SeaBIOS (version rel-1.7.3-117-g31b8b4e-
>   20131206_080705-nilsson.home.kraxel.org)
> 
>     Booting from Hard Disk...
>     A disk read error occurred.
>     Insert a system diskette and restart
>     the system.
> 
>   (NB. You're seeing the old SeaBIOS version included with e689f7c,
>   which was the first buggy commit.)
> 
>   If NT gets past this point without erroring out (ie, it makes it to
>   the boot menu), the rest of the system is 100% fine and there are no
>   other disk I/O issues whatsoever. For example, on QEMU 2.9.0 I was
>   able to enable disk compression, answer "Yes" to "Compress entire disk
>   now?" and have the process fully complete. No hitches.
> 
>   This makes me vaguely recall/wonder that perhaps this could be somehow
>   related to LBA and/or Int 13h, or something floating around near that
>   bunch of functionality. (I'm woefully ignorant about such low-level
>   details.) Perhaps DOS/Win3.1 are stuck using a disk mode that QEMU has
>   a buggy implementation of, while NT 4 (once NTOSKRNL is up and
>   running) is able to use a different disk mode or access mechanism.
> 
>   I'm really interested to get some understanding of what the root issue
>   is here, when this is fixed. (I wonder if it's a timing thing?)
> 
>   I've observed some unusual behavior with repeated restarts. In one
>   case, I attempted to start NT4 multiple times, and QEMU consistently
>   failed with "No bootable device" each time. So, I removed `-M isapc`,
>   promptly got a boot menu, hit ^C, readded `-M isapc` - and continued
>   to get a boot menu. Yep. I'll accept "really really big coincidence"
>   but I do very much wonder if something else is going on here. I've
>   observed many similar incidents. It makes me wonder whether the
>   contents of memory or some other system state is an influence. Very
>   probably not, but still...
> 
> 
>   -- Reproduction --------------------------------------------
> 
>   First of all, there was unfortunately no way for me to avoid having to
>   post entire disk images, but I've managed to compress everything down
>   to 174MB total download size.
> 
>   FWIW, WinWorld and many other sites seem to have no operational issues
>   providing clear pointers to CD keys; I consider my distribution of my
>   installed HDD images an extension of the apparent status quo.
> 
>   That being said, I've put everything on Google Drive so nobody has to
>   headscratch about Launchpad/Canonical/etc's stance on hosting this
>   data.
> 
>   So, this folder contains the disk images:
>   https://drive.google.com/drive/folders/1WdZVBh5Trs9HLC186-nHyKeqaSxfyM2c
>   ("Download all" at the top-right will create a ZIP file, but FWIW
>   downloading the individual files simultaneously would implement a
>   rough form of download acceleration)
> 
>   File meta info:
> 
>   Compressed
>   |
>   |      Apparent
>   |      |    Actual
>   |      |    |
>   38M -> 200M (103M)  win31.img.xz
>   82M -> 1G   (289M)  wnt4ts-broken.img.xz
>   55M -> 350M (146M)  wnt4ts-intermittent.img.xz
> 
>   SHA-256s:
> 
>   win31:        8179b8180a2ab40bd472e8a2f3fb89fc331651e56923f94ceb9e52a78ee220d2
>   broken:       a2af5f0bc49a063b75f534b6ffe5b82e32ecc706a64a425b6626feccf6e3fdfa  
>   intermittent: 77ae8c458829ebcdd64c71042012f45d5a2788e6ebd22db9d53de9ef1a574784
> 
>   (Wanted to keep the checksum lines within 80 columns)
> 
>   And, since I can't figure out where else in this report to put this,
>   wnt4ts-broken.img's password is "admin" but something seems to have
>   happened to the disk and NT doesn't actually boot properly :(, and
>   wnt4ts-intermittent.img's password is "1234". (These were set up as
>   test images. Now I'm _really_ glad I used simple passwords! :) )
> 
>   ---
> 
>   
>   I have two testcases: DOS 6.22 (+ Windows 3.1), and Windows NT 4.
> 
>   
>   # MS-DOS
> 
>   DOS is the simplest. It basically consists of
> 
>   $ qemu-system-i386 -drive file=win31.img,format=raw -M isapc -enable-
>   kvm
> 
>   And then literally just playing around. Things to try include creating
>   files (`echo blah > file`), repeatedly seeking across the entire FAT
>   (`dir /s > nul` or `dir /s`), and launching Windows (`win`).
> 
>   win31.img is not special (as far as I can tell) and merely consists of
>   the result of installing DOS 6.22 and Windows 3.1 from WinWorldPC.
>   I've basically just included the image for convenience.
> 
>   Generally no single "run" is immune to starting Win3.1 and then
>   launching File Manager; if that doesn't generate an error, something
>   is definitely up.
> 
>   The second best trigger is creating new files. That very very
>   frequently produces "General Failure ...", but not always.
> 
>   
>   # WinNT 4
> 
>   Windows NT 4 is a bit more complicated. Because this error only occurs
>   at presumably a single small point very early in boot, the window of
>   opportunity for the glitch to surface within is much much narrower and
>   thus often requires a larger number of tries.
> 
>   Anecdotally I've had QEMU hit the boot error at the first try/run, and
>   after as many as 63 "successful" boots.
> 
>   I made a small test harness that automates the launch process. It
>   consists of two shell scripts and requires tmux (and netcat).
>   (*Potential epilepsy warning*: if you use a light-colored terminal
>   background, the terminal QEMU is repeatedly invoked from will
>   continuously flash rapidly from white to black.)
> 
>   One of the scripts is run inside a tmux session in one terminal, while
>   the other script is run in its own terminal (without any tmux).
> 
>   
>   I named this one `run-qemu-loop`:
> 
>   --8<--------------------------------------------------------
> 
>   #!/bin/bash
> 
>   # ---
> 
>   qemu=/path/to/qemu-system-i386
>   #or, alternatively: (I used the following line myself so I
>   #could tab-complete my way to different qemu executables)
>   #qemu="$1"
> 
>   disk=/path/to/wnt4ts-intermittent.img
> 
>   # ---
> 
>   port=4444
> 
>   rm -f STOP itercount
> 
>   itercount=0
> 
>   while :; do
>   	
>   	[ -f STOP ] && break
>   	
>   	((itercount++))
>   	echo $itercount > itercount
>   	
>   	$qemu \
>   		-enable-kvm -vga cirrus -curses -M isapc \
>   		-drive file="$disk",format=raw \
>   		-chardev socket,id=mon0,host=localhost,port=$port,server,nowait \
>   		-mon chardev=mon0,mode=readline
>   	
>   	#point to an otherwise-unused terminal if you like (see also: `tty`)
>   	#echo "$itercount run(s)" > /dev/pts/__
>   	
>   done
> 
>   ------------------------------------------------------------
> 
>   Not much logic above; this just repeatedly runs QEMU for as long as
>   the file `STOP` does not exist in the current directory.
> 
>   The key "magic" bit is that QEMU is launched in -curses mode.
> 
>   The other key bit is that the above script is run inside tmux.
> 
>   
>   Here's `tmux-ctl-loop`:
> 
>   --8<--------------------------------------------------------
> 
>   #!/bin/bash
> 
>   port=4444
> 
>   tmux=./tmux
> 
>   printf -v l '%0.0s-' {0..25}
>   h1="$l/ buffer dump begin \\$l"
>   h2="$l-\\ buffer dump end /-$l"
> 
>   while :; do
>   	
>   	while :; do
>   		echo | nc localhost $port -q0 -w1 > /dev/null && break
>   		echo 'Start qemu!'
>   	done
>   	
>   	buffer="$(tmux -S $tmux capture-pane; tmux -S $tmux save-buffer -)"
>   	
>   	echo "$h1"
>   	[[ "$buffer" ]] && echo "$buffer" || echo '( * Screen buffer is empty * )'
>   	echo "$h2"
>   	
>   	if echo "$buffer" | grep -q 'A disk read error occurred.'; then
>   		
>   		s="<Crashed after $(< itercount) runs.>"
>   		echo "$s"
>   		echo "$s" >> stats
>   		
>   		touch STOP
>   		
>   		#echo q | nc localhost $port -q0 > /dev/null
>   		
>   		exit
>   		
>   	elif echo "$buffer" | grep -q 'OS Loader V4.00'; then
>   			
>   		echo '<Booted successfully, trying again>'
>   		
>   		echo q | nc localhost $port -q0 > /dev/null
>   		
>   	else
>   		
>   		echo '<Waiting for boot>'
>   		
>   	fi
>   			
>   done
> 
>   ------------------------------------------------------------
> 
>   Nothing particularly amazing going on here either.
> 
>   While `qemu-run-loop` is running inside tmux in the first terminal,
>   this is running in the 2nd one.
> 
>   The small infinite loop at the top only breaks when it can
>   successfully ping QEMU and it knows it's running.
> 
>   Then, a screendump of the contents of the terminal QEMU is in is
>   fetched from tmux, and the buffer's content is analyzed.
> 
>   - If NTLDR fails, the script creates `STOP` to halt qemu-run-loop,
>     sends `q` to QEMU through netcat, and then the script exits.
> 
>   - If NTLDR loads successfully, the script sends `q` to QEMU and continues
>     looping. (qemu-run-loop will not find the `STOP` file, and so restart qemu.)
> 
>   The scripts run very quickly, with 2-3 iterations per second on my i3
>   box.
> 
> 
>   # Usage
> 
>   Save the two scripts above to the same directory as wnt4ts-intermittent.img,
>   then:
> 
>   - (If port 4444 doesn't work, the value needs to be changed in both
>   scripts.)
> 
>   - In the first terminal, run `tmux -S <file>`, where <file> names the socket
>     tmux will use. This needs to match "tmux=" at the top of `tmux-ctl-loop`.
>     (with `tmux=./tmux`, the command would be `tmux -S tmux`)
> 
>   - Still in the first terminal (and now also inside tmux), enter
>     `./qemu-run-loop`, passing the path to qemu if you're using that approach
>     (refer to the first few lines of the script). Don't hit enter yet.
> 
>   - Now, in the 2nd terminal, type `./tmux-ctl-loop`
> 
>   - Hit enter in both terminals.
>    
> 
>   Rationale for timing of Enter key:
> 
>   - Running qemu-run-loop first will start QEMU, and if NTLDR starts
>     successfully it will immediately begin counting down from 30. If NT actually
>     starts to boot and is then hard-shut-down this /may/ affect the disk image
> 
>   - tmux-ctl-loop will annoyingly spam a continuous stream of 'Start qemu!' until
>     qemu-run-loop is running.
> 
>   - Starting both scripts at "more or less" the same time (no rush) works out
>     well.
> 
>   
>   Hopefully potential script modifications are obvious; for example
> 
>   - changing tmux-ctl-loop to not send 'q' to qemu so you can connect to the HMP
>     yourself
>     (NB, if `STOP` is not created, when qemu finally exits it will of course
>     promptly be relaunched)
> 
>   - pointing run-qemu-loop to a modified qemu binary
> 
> 
>   == #2: QEMU-vs-VirtualBox image issue ======================
> 
>   I was initially completely stumped by this issue, perhaps
>   unsurprisingly so. :)
> 
>   wnt4ts-broken.img is a perfectly ordinary NT 4 installation that was
>   created in QEMU 2.9.0. I created a 1GB disk with `truncate`, picked
>   NTFS and installed everything (which took a while).
> 
>   NT setup reboots a number of times during the boot process, and IIRC
>   those all went just fine. However, at some point, the image began to
>   consistently bomb out with "A disk read error occurred. ...", and
>   stubbornly refused to boot, regardless of the number of boot attempts
>   I tried.
> 
>   QEMU 2.0.0, 1.5.0, and 1.0 (the earliest version I was able to build
>   on my system) all consistently hit "disk read error occurred".
> 
>   I tried compiling QEMU 1.0 using clang so I could build for 32-bit on
>   my 64-bit system (GCC 7 died with "Frame pointer required, but
>   reserved"). The resulting qemu completely crashed if I didn't enable
>   KVM (ie, TCG was (understandably) broken); with KVM enabled qemu
>   didn't crash, but NTLDR halted with the same error as on 64-bit qemu.
>   (TL;DR, no difference whatsoever.)
> 
>   My initial reaction at this point was to try the image on another
>   virtualization platform. My first pick was VirtualBox.
> 
>   So, I followed the official instructions for pointing VirtualBox to
>   physical disk images, except I substituted a /dev/loopN device I'd
>   pointed to the image file via losetup.
> 
>   And... VirtualBox picked the image up fine and Just Worked(TM). Yay! -
>   but not yay. What gives?!
> 
>   Confused, I then tried to convert the disk image to VHD format.
>   Unfortunately, for some reason, if I try `qemu-img convert ... -O
>   vhdx ...`, VirtualBox chokes on the result:
> 
>   -----
> 
>   VD: error VERR_NOT_SUPPORTED opening image file
>   '/.../wnt4ts-broken-qemuconv.vhd' (VERR_NOT_SUPPORTED).
> 
>   Result Code: NS_ERROR_FAILURE (0x80004005)
>   Component: MediumWrap
>   Interface: IMedium {4afe423b-43e0-e9d0-82e8-ceb307940dda}
>   Callee: IVirtualBox {0169423f-46b4-cde9-91af-1e9d5b6cd945}
>   Callee RC: VBOX_E_OBJECT_NOT_FOUND (0x80BB0001)
> 
>   -----
> 
>   Welp.
> 
>   Well, a bit more digging later, and I found I could do
> 
>   $ VBoxManage convertfromraw wnt4ts-broken.img wnt4ts-broken.vhd
> 
>   but... as soon as I pointed VirtualBox to this, it too began to choke
>   with "A disk read error occurred".
> 
>   And yet, the VMDK->raw image setup worked just fine.
> 
>   I found I could even replace the loop device with the path of the .img
>   file itself and that worked just fine too.
> 
>   At my wits' end, I followed some online instructions to learn about
>   manual CHS configuration so I could try and get the image working in
>   Bochs. "A disk read error occurred". I wasn't surprised.
> 
>   It was at this point I began to give up, but I decided to try One Last
>   Thing(TM) before properly throwing in the towel.
> 
>   :)
> 
>   I decided to learn a bit more about how `VBoxManage internalcommands
>   createrawvmdk` worked, and try one thing in particular: I can edit the
>   .vmdk file, but can I point `createrawvmdk` at the .img file directly
>   too?
> 
>   Turns out, yes you can.
> 
>   It also turns out that this promptly caused VirtualBox to bomb out.
> 
>   Interesting.
> 
>   For reference, here's the VMDK file I initially created (by pointing
>   `createrawvmdk` at /dev/loopN) and then later edited to point straight
>   to the .img file, with both approaches resulting in successful boot.
> 
>   --8<--------------------------------------------------------
> 
>   # Disk DescriptorFile
>   version=1
>   CID=e35b9a45
>   parentCID=ffffffff
>   createType="fullDevice"
> 
>   # Extent description
>   RW 1536000 FLAT "/absolute/full/path/to/wnt4ts-broken.img" 0
> 
>   # The disk Data Base 
>   #DDB
> 
>   ddb.virtualHWVersion = "4"
>   ddb.adapterType="ide"
>   ddb.geometry.cylinders="1523"
>   ddb.geometry.heads="16"
>   ddb.geometry.sectors="63"
>   ddb.uuid.image="871a6044-c8ca-48ed-b7aa-e6fc49da3db4"
>   ddb.uuid.parent="00000000-0000-0000-0000-000000000000"
>   ddb.uuid.modification="3661715c-3906-4e4a-ab65-486d140e03b8"
>   ddb.uuid.parentmodification="00000000-0000-0000-0000-000000000000"
>   ddb.geometry.biosCylinders="761"
>   ddb.geometry.biosHeads="32"
>   ddb.geometry.biosSectors="63"
> 
>   ------------------------------------------------------------
> 
>   
>   Here's the _diff_ of what happens if I point `createrawvmdk` at wnt4ts-broken.img directly:
> 
>   --8<--------------------------------------------------------
> 
>   ddb.geometry.cylinders="2080"
>   ddb.geometry.heads="16"
>   ddb.geometry.sectors="63"
> 
>   ------------------------------------------------------------
> 
>   :D
> 
>   Naturally,
> 
>   $ qemu-system-i386 \
>       -drive file=wnt4ts-broken.img,format=raw,cyls=1523,heads=16,secs=63 \
>       -M isapc -sdl
> 
>   will boot happily on 2.9.0 (notwithstanding the occasional "disk read
>   error occurred" documented above).
> 
>   It will also boot in 1.6.0.
> 
>   (POTENTIAL BUG HEADSUP: 1.0 and 1.5.0 both lock up with a blank
>   640x480 window and use 0% CPU if I specify `-M isapc`.)
> 
>   And, of course, using these CHS values in Bochs also results in
>   successful boot as well (after setting the CPU type to pentium).
> 
>   Unfortunately, I have no idea what sequence of events caused the
>   creation of the VMDK file above. No invocation of `createrawvmdk` is
>   producing a VMDK file with the CHS settings above.
> 
>   I've only just begun to learn about the intricacies of CHS. Am I to
>   understand that these values are stored amongst the first 512 bytes of
>   the disk? If this is the case, then I wonder what changed the data,
>   and why. I was initially only using QEMU 2.9.0, and didn't move the
>   image to different VMs or QEMU versions. Perhaps Windows NT got
>   confused about the disk CHS and rewrote it?
> 
>   
>   == Sporadic BIOS-level boot failure ========================
> 
>   I have multiple screenshots of SeaBIOS in QEMU 2.9.0 halting with "No
>   bootable device" (et al), even with the above manually-applied CHS
>   settings.
> 
>   Commit e689f7c also presents such errors.
> 
>   Commit 306ec6c does not suffer from intermittent breakage of any kind:
> 
>   - No SeaBIOS flake-outs
>   - No "Non-system disk or disk error"
>   - No "A disk read error occurred"
>   - No "General failure ..."
> 
>   While most of my confidence in commit 306ec6c is based on anecdotal
>   evidence, I modified `tmux-ctl-loop` a little to soak-test BIOS-level
>   I/O stability and left this modified version running for a few
>   minutes.
> 
>   --8<--------------------------------------------------------
> 
>   #!/bin/bash
> 
>   port=4444
> 
>   tmux=./tmux
> 
>   printf -v l '%0.0s-' {0..25}
>   h1="$l/ buffer dump begin \\$l"
>   h2="$l-\\ buffer dump end /-$l"
> 
>   while :; do
>   	
>   	while :; do
>   		echo | nc localhost $port -q0 -w1 > /dev/null && break
>   		echo 'Start qemu!'
>   	done
>   	
>   	buffer="$(tmux -S $tmux capture-pane; tmux -S $tmux save-buffer -)"
>   	
>   	echo "$h1"
>   	[[ "$buffer" ]] && echo "$buffer" || echo '( * Screen buffer is empty * )'
>   	echo "$h2"
>   	
>   	if echo "$buffer" | grep -q 'Non-system disk' || echo "$buffer" | \
>   		grep -q 'No bootable device'
>   	then
>   		
>   		s="<Hit error after $(< itercount) runs.>"
>   		echo "$s"
>   		echo "$s" >> stats
>   		
>   		touch STOP
>   		
>   		#echo q | nc localhost $port -q0 > /dev/null
>   		
>   		exit
>   		
>   	elif echo "$buffer" | grep -q 'OS Loader V4.00' || echo "$buffer" | \
>   		grep -q 'A disk read error'
>   	then
>   	
>   		echo '<Boot did not hang at BIOS, trying again>'
>   		
>   		echo q | nc localhost $port -q0 > /dev/null
>   		
>   	else
>   		
>   		echo '<Waiting for boot>'
>   		
>   	fi
>   			
>   done
> 
>   ------------------------------------------------------------
> 
>   For the above to work, the top of run-qemu-loop must also be modified
>   to read something along the lines of
> 
>   disk=/path/to/wnt4ts-broken.img,format=raw,cyls=1523,heads=16,secs=63
> 
>   (Suggestion: modify copies of both scripts)
> 
>   One small terminal-flicker-headache (and a 57°C CPU) later, I was able
>   to carefully observe just over 350 successful runs in which QEMU
>   commit 306ec6c only ever produced a boot menu. No other hitches.
> 
>   ** Important: **
> 
>   However, commit 306ec6c will fail to boot, ever, if the CHS geometry
>   is not set to the values VirtualBox "discovered". (Of note
>   is the fact that QEMU (2.9.0) was what initially created this image. I
>   must admit that I don't remember what sequence of QEMU versions I fed
>   the image to - and I maybe, possibly, didn't think to back the file up
>   (sorry), so maybe something mangled something somewhere. But
>   VirtualBox figured it out nonetheless!)
> 
>   Furthermore, feeding /dev/loopN to any QEMU version will NOT result in
>   correct CHS discovery (and successful boot).
> 
>   This is what leads me to conclude that I've discovered two separate
>   issues.
> 
> 
>   == Appendix: How to build the branches =====================
> 
>   It's very simple.
> 
>   First, `git clone https://github.com/qemu/qemu` somewhere if you don't
>   already have a local copy. If you have an old git checkout that's from
>   2014 or later, you can use that old checkout instead. (If you want to
>   test an old checkout you have, the commands below will either work
>   perfectly or completely bomb out with no side effects.)
> 
>   A full checkout is a ~183MB download. Sorry.
> 
>   Next, create two new directories somewhere. Name them what you like,
>   eg `qemu-working` and `qemu-broken`.
> 
>   Now, cd into the checkout directory, and run:
> 
>   $ git archive 306ec6c3cece7004429c79c1ac93d49919f1f1cc \
>       | tar xC /path/to/qemu-working/
> 
>   $ git archive e689f7c668cbd9d08f330e17c3dd3a059c9553d3 \
>       | tar xC /path/to/qemu-broken/
> 
>   The paths can be relative.
> 
>   Now, run this in both of the new directories:
> 
>   $ ./configure --python=python2.7 --disable-libssh2 --disable-seccomp \
>       --disable-usb-redir --disable-guest-agent --disable-libiscsi \
>       --disable-spice --disable-smartcard-nss --disable-vhost-net \
>       --disable-docs --disable-attr --disable-cap-ng --disable-vde \
>       --disable-user --disable-bluez --disable-vnc-ws --disable-xen \
>       --disable-brlapi --enable-debug --target-list=i386-softmmu \
>       --disable-fdt
> 
>   $ make -j64
> 
>   You can open two terminals and configure and build both simultaneously
>   if you like.
> 
>   On my decent but very basic (2-core+HT) i3 box, -j64 actually works out - make doesn't actually launch too many gcc processes. You *will* see your system load spike to ~20 though :)
>   (NB. Do. not. use. -j64. with. the. linux. kernel.)
> 
>   On my system, a single build with -j64 takes only about 35 seconds. C
>   FTW. (Although this has increased to 1min20sec for more recent
>   builds.)
> 
>   Most of the configure arguments remove functionality I'll never use
>   (in this situation) and which will only slow down the build.
> 
>   Once QEMU is built, run qemu-system-i386 directly from where it has
>   been built.
> 
>   $ /path/to/qemu-working/i386-softmmu/qemu-system-i386 ...
>   $ /path/to/qemu-broken/i386-softmmu/qemu-system-i386 ...
> 
>   Again, the paths can be relative.
> 
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/qemu/+bug/1745312/+subscriptions
> 


* [Qemu-devel] [Bug 1745312] Re: Regression report: Disk subsystem I/O failures/issues surfacing in DOS/early Windows [two separate issues: one bisected, one root-caused]
  2018-01-25  7:18 [Qemu-devel] [Bug 1745312] [NEW] Regression report: Disk subsystem I/O failures/issues surfacing in DOS/early Windows [two separate issues: one bisected, one root-caused] i336_
  2018-01-29 13:37 ` Stefan Hajnoczi
@ 2018-01-30  7:10 ` Fam Zheng
  2018-01-30 19:56 ` John Snow
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Fam Zheng @ 2018-01-30  7:10 UTC (permalink / raw)
  To: qemu-devel

QEMU ignores the CHS numbers in VMDK images. From the report, it seems
VirtualBox uses it.

So like what you've discovered, for QEMU the right thing to do for such
a guest would be setting the correct values explicitly from the command
line, rather than let it decide (guess).
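
As a rough sketch of what "explicitly from the command line" can look
like, the helper below assembles the -drive argument from a geometry
triple. The function name `drive_opt` is made up for illustration; the
cyls=, heads= and secs= suboptions are the ones the report uses with
QEMU 2.9.0. As far as I know, newer QEMU releases expect the geometry
as properties on the disk device instead, so treat the option syntax as
an assumption about the version in use.

```shell
#!/bin/sh
# Hypothetical helper: build a -drive option string with explicit CHS geometry.
# The cyls/heads/secs suboptions are taken from the report (QEMU 2.9.0).
drive_opt() {
    # $1 = image path, $2 = cylinders, $3 = heads, $4 = sectors per track
    printf 'file=%s,format=raw,cyls=%s,heads=%s,secs=%s\n' "$1" "$2" "$3" "$4"
}

# Geometry from the reporter's working VMDK descriptor:
drive_opt wnt4ts-broken.img 1523 16 63
```

It would be used as
`qemu-system-i386 -drive "$(drive_opt wnt4ts-broken.img 1523 16 63)" -M isapc`.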

I have no idea about the first issue, though.

-- 
You received this bug notification because you are a member of
qemu-devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1745312

Title:
  Regression report: Disk subsystem I/O failures/issues surfacing in
  DOS/early Windows [two separate issues: one bisected, one root-caused]

Status in QEMU:
  New

Bug description:
  [Headsup: This report is long-ish due to the amount of detail I've
  stumbled on along the way that I think is relevant to include. I can't
  speak as to the complexity of the actual bugs, but the size of this
  report should not suggest that the reproduction process is
  particularly headache-inducing.]

  Hi!

  I recently needed to fire up some ancient software for research
  purposes and got very distracted discovering and playing with old
  versions of Windows :). In the process I've discovered some glitches
  with disk I/O.

  I believe I've stumbled on two completely separate issues that
  coincidentally surfaced at the same time. It's possible that
  components of this report will be re-filed as more specific new bugs,
  but I'm not an authority on QEMU internals or how to narrow
  down/categorize what I've found.

  - The first bug only surfaces when the "isapc" machine type is used.
  It intermittently produces "General failure {read,writ}ing drive _"
  under MS-DOS 6.22, and also somehow interferes with early bootstrap of
  Windows NT 4 (in NTLDR). Enabling or disabling KVM (I'm on Linux)
  appears to make no difference whatsoever, which may help with
  debugging.

  - The second issue involves
    - a WinNT4 disk image
    - created by running through a bog-standard NT4 install inside QEMU 2.9.0
    - which will now fail to boot in any version of QEMU - even version 1.0
      - but which VirtualBox will boot fine
        - but only if I point VirtualBox at QEMU's raw disk image via a
          hacked-together VMDK file
        - if the raw image is converted to VHD(X), VirtualBox will also fail
          to boot the image with exactly the same error as QEMU
        - this state of affairs is not affected by image sparseness (which makes
          sense)

  I'm confident I've bisected the first issue.

  I wasn't able to bisect the second issue (as all tested versions of
  QEMU behaved identically), but I've figured out a working repro
  testcase and I believe I've managed to pin down a solid root cause.


  == #1: Intermittent I/O issues when `-M isapc` is used =====

  These symptoms sometimes take a small amount of time and fiddling to
  trigger, but I AM able to consistently surface them on my machine
  after a short while. (I am very very interested to hear if others
  cannot reproduce them.)

  So, first of all:

  https://github.com/qemu/qemu/commit/306ec6c3cece7004429c79c1ac93d49919f1f1cc
    (Jul 30 2013): the last version that works

  https://github.com/qemu/qemu/commit/e689f7c668cbd9d08f330e17c3dd3a059c9553d3
    (Oct 30 2013): the first version that intermittently fails

  Maybe lift out and build these branches while reading. *shrug*
  (How to do this can be found at the end of this report - along with a time-saving ./configure line, FWIW)

  Here are the changelists between these two revisions:

  https://github.com/qemu/qemu/compare/306ec6c...e689f7c
  (Compare direction: OLD to NEW) (Commits: 166  Files changed: 192)

  https://github.com/qemu/qemu/compare/e689f7c...306ec6c
  (Compare direction: NEW to OLD) (Commits: 30   Files changed: 22)

  (Someone else more familiar with Git might know why GitHub returns
  results for both compare directions, and/or if the 2nd link is useful
  information. The first link returns a lot more results than the 2nd
  one, at least. Does comparing new>old return deletions?)

  ---

  Now on to the symptoms. In a moment I'll describe reproduction.

  # MS-DOS 6.22

  The first symptom I discovered was that trivial read and write
  operations under MS-DOS would sometimes fail:

    C:\>echo test > hi

    General failure writing drive C
    Abort, Retry, Fail?

  Anything else that exercises the disk behaves similarly:

    C:\>dir /s > nul

    General failure reading drive C
    Abort, Retry, Fail?

  (Note that the above demonstrates both write and read failures)

  (Also, FWIW, `dir /s` == `ls -R`)

  The behavior of the I/O errors is not possible to characterise as it
  fluctuates so much. For example something as simple as DIR can produce
  wildly differing results: in one run, poking around with DIR ended
  with DOS deciding C:\ was empty at one point; at another point in a
  different run C:\ mysteriously dropped 50% of its contents only to
  magically gain it all back moments later after some poking around in
  one of the subdirectories that was still visible.

  The time it takes to trigger these errors is also highly variable.
  QEMU may fall over as early as hanging forever at "Starting
  MS-DOS...", or I might get all the way into Windows 3.1 before it
  triggers (in which case Win3.1 reports vague memory errors - of all
  things).

  Very occasionally I've seen _SeaBIOS itself_ report "Booting from Hard
  Disk..." "Boot failed: could not read the boot disk" ... "No bootable
  device.", and on one occasion I even got "Non-System disk or disk
  error" "Replace and strike any key when ready"!

  
  # WinNT 4 Terminal Server

  Most of the time, NTLDR will fire up normally. But every so often...

    SeaBIOS (version rel-1.7.3-117-g31b8b4e-20131206_080705-nilsson.home.kraxel.org)

    Booting from Hard Disk...
    A disk read error occurred.
    Insert a system diskette and restart
    the system.

  (NB. You're seeing the old SeaBIOS version included with e689f7c,
  which was the first buggy commit.)

  If NT gets past this point without erroring out (ie, it makes it to
  the boot menu), the rest of the system is 100% fine and there are no
  other disk I/O issues whatsoever. For example, on QEMU 2.9.0 I was
  able to enable disk compression, answer "Yes" to "Compress entire disk
  now?" and have the process fully complete. No hitches.

  This makes me vaguely recall/wonder that perhaps this could be somehow
  related to LBA and/or Int 13h, or something floating around near that
  bunch of functionality. (I'm woefully ignorant about such low-level
  details.) Perhaps DOS/Win3.1 are stuck using a disk mode that QEMU has
  a buggy implementation of, while NT 4 (once NTOSKRNL is up and
  running) is able to use a different disk mode or access mechanism.

  I'm really interested to get some understanding of what the root issue
  is here, when this is fixed. (I wonder if it's a timing thing?)

  I've observed some unusual behavior with repeated restarts. In one
  case, I attempted to start NT4 multiple times, and QEMU consistently
  failed with "No bootable device" each time. So, I removed `-M isapc`,
  promptly got a boot menu, hit ^C, readded `-M isapc` - and continued
  to get a boot menu. Yep. I'll accept "really really big coincidence"
  but I do very much wonder if something else is going on here. I've
  observed many similar incidents. It makes me wonder whether the
  contents of memory or some other system state is an influence. Very
  probably not, but still...


  -- Reproduction --------------------------------------------

  First of all, there was unfortunately no way for me to avoid having to
  post entire disk images, but I've managed to compress everything down
  to 174MB total download size.

  FWIW, WinWorld and many other sites seem to have no operational issues
  providing clear pointers to CD keys; I consider my distribution of my
  installed HDD images an extension of the apparent status quo.

  That being said, I've put everything on Google Drive so nobody has to
  headscratch about Launchpad/Canonical/etc's stance on hosting this
  data.

  So, this folder contains the disk images:
  https://drive.google.com/drive/folders/1WdZVBh5Trs9HLC186-nHyKeqaSxfyM2c
  ("Download all" at the top-right will create a ZIP file, but FWIW
  downloading the individual files simultaneously would implement a
  rough form of download acceleration)

  File meta info:

  Compressed
  |
  |      Apparent
  |      |    Actual
  |      |    |
  38M -> 200M (103M)  win31.img.xz
  82M -> 1G   (289M)  wnt4ts-broken.img.xz
  55M -> 350M (146M)  wnt4ts-intermittent.img.xz

  SHA-256s:

  win31:        8179b8180a2ab40bd472e8a2f3fb89fc331651e56923f94ceb9e52a78ee220d2
  broken:       a2af5f0bc49a063b75f534b6ffe5b82e32ecc706a64a425b6626feccf6e3fdfa  
  intermittent: 77ae8c458829ebcdd64c71042012f45d5a2788e6ebd22db9d53de9ef1a574784

  (Wanted to keep the checksum lines within 80 columns)
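
  A minimal sketch for checking a download against the hashes above.
  The helper name `verify_sha256` is hypothetical, it assumes coreutils'
  `sha256sum` is available, and the report does not say whether the
  hashes cover the .img or the .img.xz files, so the file name in the
  example is a guess.

  ```shell
  #!/bin/sh
  # Hypothetical helper: succeed only if FILE hashes to EXPECTED (hex SHA-256).
  # Assumes coreutils' sha256sum is installed.
  verify_sha256() {
      # sha256sum prints "<hash>  <file>"; keep only the hash field
      actual=$(sha256sum "$1" | cut -d' ' -f1)
      [ "$actual" = "$2" ]
  }

  # Example (file name is an assumption, see above):
  # want=8179b8180a2ab40bd472e8a2f3fb89fc331651e56923f94ceb9e52a78ee220d2
  # verify_sha256 win31.img "$want" && echo OK
  ```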

  And, since I can't figure out where else in this report to put this,
  wnt4ts-broken.img's password is "admin" but something seems to have
  happened to the disk and NT doesn't actually boot properly :(, and
  wnt4ts-intermittent.img's password is "1234". (These were set up as
  test images. Now I'm _really_ glad I used simple passwords! :) )

  ---

  
  I have two testcases: DOS 6.22 (+ Windows 3.1), and Windows NT 4.

  
  # MS-DOS

  DOS is the simplest. It basically consists of

  $ qemu-system-i386 -drive file=win31.img,format=raw -M isapc -enable-kvm

  And then literally just playing around. Things to try include creating
  files (`echo blah > file`), repeatedly seeking across the entire FAT
  (`dir /s > nul` or `dir /s`), and launching Windows (`win`).

  win31.img is not special (as far as I can tell) and merely consists of
  the result of installing DOS 6.22 and Windows 3.1 from WinWorldPC.
  I've basically just included the image for convenience.

  Generally no single "run" is immune to starting Win3.1 and then
  launching File Manager; if that doesn't generate an error, something
  is definitely up.

  The second best trigger is creating new files. That very very
  frequently produces "General Failure ...", but not always.

  
  # WinNT 4

  Windows NT 4 is a bit more complicated. Because this error only occurs
  at presumably a single small point very early in boot, the window of
  opportunity for the glitch to surface within is much much narrower and
  thus often requires a larger number of tries.

  Anecdotally I've had QEMU hit the boot error at the first try/run, and
  after as many as 63 "successful" boots.

  I made a small test harness that automates the launch process. It
  consists of two shell scripts and requires tmux (and netcat).
  (*Potential epilepsy warning*: if you use a light-colored terminal
  background, the terminal QEMU is repeatedly invoked from will
  continuously flash rapidly from white to black.)

  One of the scripts is run inside a tmux session in one terminal, while
  the other script is run in its own terminal (without any tmux).

  
  I named this one `run-qemu-loop`:

  --8<--------------------------------------------------------

  #!/bin/bash

  # ---

  qemu=/path/to/qemu-system-i386
  #or, alternatively: (I used the following line myself so I
  #could tab-complete my way to different qemu executables)
  #qemu="$1"

  disk=/path/to/wnt4ts-intermittent.img

  # ---

  port=4444

  rm -f STOP itercount

  itercount=0

  while :; do
  	
  	[ -f STOP ] && break
  	
  	((itercount++))
  	echo $itercount > itercount
  	
  	$qemu \
  		-enable-kvm -vga cirrus -curses -M isapc \
  		-drive file="$disk",format=raw \
  		-chardev socket,id=mon0,host=localhost,port=$port,server,nowait \
  		-mon chardev=mon0,mode=readline
  	
  	#point to an otherwise-unused terminal if you like (see also: `tty`)
  	#echo "$itercount run(s)" > /dev/pts/__
  	
  done

  ------------------------------------------------------------

  Not much logic above; this just repeatedly runs QEMU for as long as
  the file `STOP` does not exist in the current directory.

  The key "magic" bit is that QEMU is launched in -curses mode.

  The other key bit is that the above script is run inside tmux.

  
  Here's `tmux-ctl-loop`:

  --8<--------------------------------------------------------

  #!/bin/bash

  port=4444

  tmux=./tmux

  printf -v l '%0.0s-' {0..25}
  h1="$l/ buffer dump begin \\$l"
  h2="$l-\\ buffer dump end /-$l"

  while :; do
  	
  	while :; do
  		echo | nc localhost $port -q0 -w1 > /dev/null && break
  		echo 'Start qemu!'
  	done
  	
  	buffer="$(tmux -S $tmux capture-pane; tmux -S $tmux save-buffer -)"
  	
  	echo "$h1"
  	[[ "$buffer" ]] && echo "$buffer" || echo '( * Screen buffer is empty * )'
  	echo "$h2"
  	
  	if echo "$buffer" | grep -q 'A disk read error occurred.'; then
  		
  		s="<Crashed after $(< itercount) runs.>"
  		echo "$s"
  		echo "$s" >> stats
  		
  		touch STOP
  		
  		#echo q | nc localhost $port -q0 > /dev/null
  		
  		exit
  		
  	elif echo "$buffer" | grep -q 'OS Loader V4.00'; then
  			
  		echo '<Booted successfully, trying again>'
  		
  		echo q | nc localhost $port -q0 > /dev/null
  		
  	else
  		
  		echo '<Waiting for boot>'
  		
  	fi
  			
  done

  ------------------------------------------------------------

  Nothing particularly amazing going on here either.

  While `run-qemu-loop` is running inside tmux in the first terminal,
  this is running in the 2nd one.

  The small infinite loop at the top only breaks when it can
  successfully ping QEMU and it knows it's running.

  Then, a screendump of the contents of the terminal QEMU is in is
  fetched from tmux, and the buffer's content is analyzed.

  - If NTLDR fails, the script creates `STOP` to halt run-qemu-loop and
    then exits. (Sending `q` to QEMU through netcat is commented out in
    this branch, so the failed QEMU instance stays up until closed by hand.)

  - If NTLDR loads successfully, the script sends `q` to QEMU and continues
    looping. (run-qemu-loop will not find the `STOP` file, and so restart qemu.)
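
  The decision step described in the two bullets above can be sketched
  as a small pure function, which makes it easy to try against saved
  screen dumps. The name `classify_screen` and the action words it
  prints are made up for illustration; the match strings are the ones
  the script greps for.

  ```shell
  #!/bin/sh
  # Simplified sketch of tmux-ctl-loop's decision step (hypothetical name).
  # Takes a screen dump in $1 and prints the action the loop would take.
  classify_screen() {
      case $1 in
          *'A disk read error occurred.'*) echo stop ;;   # NTLDR failed: create STOP, exit
          *'OS Loader V4.00'*)             echo retry ;;  # boot menu reached: quit QEMU, loop
          *)                               echo wait ;;   # nothing recognisable yet
      esac
  }

  classify_screen 'Booting from Hard Disk...
  A disk read error occurred.'
  ```

  Using `case` instead of repeated `echo | grep -q` avoids spawning a
  process per check, though at 2-3 iterations per second that hardly
  matters.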

  The scripts run very quickly, with 2-3 iterations per second on my i3
  box.


  # Usage

  Save the two scripts above to the same directory as wnt4ts-intermittent.img,
  then:

  - (If port 4444 doesn't work, the value needs to be changed in both
  scripts.)

  - In the first terminal, run `tmux -S <file>`, where <file> names the socket
    tmux will use. This needs to match "tmux=" at the top of `tmux-ctl-loop`.
    (with `tmux=./tmux`, the command would be `tmux -S tmux`)

  - Still in the first terminal (and now also inside tmux), enter
    `./run-qemu-loop`, passing the path to qemu if you're using that approach
    (refer to the first few lines of the script). Don't hit enter yet.

  - Now, in the 2nd terminal, type `./tmux-ctl-loop`

  - Hit enter in both terminals.
   

  Rationale for timing of Enter key:

  - Running run-qemu-loop first will start QEMU, and if NTLDR starts
    successfully it will immediately begin counting down from 30. If NT actually
    starts to boot and is then hard-shut-down this /may/ affect the disk image.

  - tmux-ctl-loop will annoyingly spam a continuous stream of 'Start qemu!' until
    run-qemu-loop is running.

  - Starting both scripts at "more or less" the same time (no rush) works out
    well.

  
  Hopefully potential script modifications are obvious; for example

  - changing tmux-ctl-loop to not send 'q' to qemu so you can connect to the HMP
    yourself
    (NB, if `STOP` is not created, when qemu finally exits it will of course
    promptly be relaunched)

  - pointing run-qemu-loop to a modified qemu binary


  == #2: QEMU-vs-VirtualBox image issue ======================

  I was initially completely stumped by this issue, perhaps
  unsurprisingly so. :)

  wnt4ts-broken.img is a perfectly ordinary NT 4 installation that was
  created in QEMU 2.9.0. I created a 1GB disk with `truncate`, picked
  NTFS and installed everything (which took a while).

  NT setup reboots a number of times during the install process, and IIRC
  those all went just fine. However, at some point, the image began to
  consistently bomb out with "A disk read error occurred. ...", and
  stubbornly refused to boot, regardless of the number of boot attempts
  I tried.

  QEMU 2.0.0, 1.5.0, and 1.0 (the earliest version I was able to build
  on my system) all consistently hit "disk read error occurred".

  I tried compiling QEMU 1.0 using clang so I could build for 32-bit on
  my 64-bit system (GCC 7 died with "Frame pointer required, but
  reserved"). The resulting qemu completely crashed if I didn't enable
  KVM (ie, TCG was (understandably) broken); with KVM enabled qemu
  didn't crash, but NTLDR halted with the same error as on 64-bit qemu.
  (TL;DR, no difference whatsoever.)

  My initial reaction at this point was to try the image on another
  virtualization platform. My first pick was VirtualBox.

  So, I followed the official instructions for pointing VirtualBox to
  physical disk images, except I substituted a /dev/loopN device I'd
  pointed to the image file via losetup.

  And... VirtualBox picked the image up fine and Just Worked(TM). Yay! -
  but not yay. What gives?!

  Confused, I then tried to convert the disk image to VHD format.
  Unfortunately, for some reason, if I try `qemu-img convert ... -O
  vhdx ...`, VirtualBox chokes on the result:

  -----

  VD: error VERR_NOT_SUPPORTED opening image file
  '/.../wnt4ts-broken-qemuconv.vhd' (VERR_NOT_SUPPORTED).

  Result Code: NS_ERROR_FAILURE (0x80004005)
  Component: MediumWrap
  Interface: IMedium {4afe423b-43e0-e9d0-82e8-ceb307940dda}
  Callee: IVirtualBox {0169423f-46b4-cde9-91af-1e9d5b6cd945}
  Callee RC: VBOX_E_OBJECT_NOT_FOUND (0x80BB0001)

  -----

  Welp.

  Well, a bit more digging later, and I found I could do

  $ VBoxManage convertfromraw wnt4ts-broken.img wnt4ts-broken.vhd

  but... as soon as I pointed VirtualBox to this, it too began to choke
  with "A disk read error occurred".

  And yet, the VMDK->raw image setup worked just fine.

  I found I could even replace the loop device with the path of the .img
  file itself and that worked just fine too.

  At my wits' end, I followed some online instructions to learn about
  manual CHS configuration so I could try and get the image working in
  Bochs. "A disk read error occurred". I wasn't surprised.

  It was at this point I began to give up, but I decided to try One Last
  Thing(TM) before properly throwing in the towel.

  :)

  I decided to learn a bit more about how `VBoxManage internalcommands
  createrawvmdk` worked, and try one thing in particular: I can edit the
  .vmdk file, but can I point `createrawvmdk` at the .img file directly
  too?

  Turns out, yes you can.

  It also turns out that this promptly caused VirtualBox to bomb out.

  Interesting.

  For reference, here's the VMDK file I initially created (by pointing
  `createrawvmdk` at /dev/loopN) and then later edited to point straight
  to the .img file, with both approaches resulting in successful boot.

  --8<--------------------------------------------------------

  # Disk DescriptorFile
  version=1
  CID=e35b9a45
  parentCID=ffffffff
  createType="fullDevice"

  # Extent description
  RW 1536000 FLAT "/absolute/full/path/to/wnt4ts-broken.img" 0

  # The disk Data Base 
  #DDB

  ddb.virtualHWVersion = "4"
  ddb.adapterType="ide"
  ddb.geometry.cylinders="1523"
  ddb.geometry.heads="16"
  ddb.geometry.sectors="63"
  ddb.uuid.image="871a6044-c8ca-48ed-b7aa-e6fc49da3db4"
  ddb.uuid.parent="00000000-0000-0000-0000-000000000000"
  ddb.uuid.modification="3661715c-3906-4e4a-ab65-486d140e03b8"
  ddb.uuid.parentmodification="00000000-0000-0000-0000-000000000000"
  ddb.geometry.biosCylinders="761"
  ddb.geometry.biosHeads="32"
  ddb.geometry.biosSectors="63"

  ------------------------------------------------------------

  
  Here's the _diff_ of what happens if I point `createrawvmdk` at wnt4ts-broken.img directly:

  --8<--------------------------------------------------------

  ddb.geometry.cylinders="2080"
  ddb.geometry.heads="16"
  ddb.geometry.sectors="63"

  ------------------------------------------------------------

  :D

  Naturally,

  $ qemu-system-i386 \
      -drive file=wnt4ts-broken.img,format=raw,cyls=1523,heads=16,secs=63 \
      -M isapc -sdl

  will boot happily on 2.9.0 (notwithstanding the occasional "disk read
  error occurred" documented above).

  It will also boot in 1.6.0.

  (POTENTIAL BUG HEADSUP: 1.0 and 1.5.0 both lock up with a blank
  640x480 window and use 0% CPU if I specify `-M isapc`.)

  And, of course, using these CHS values in Bochs also results in a
  successful boot (after setting the CPU type to pentium).

  Unfortunately, I have no idea what sequence of events caused the
  creation of the VMDK file above. No invocation of `createrawvmdk` is
  producing a VMDK file with the CHS settings above.
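
  A quick arithmetic check on the two geometries, as a minimal sketch:
  with a fixed number of heads and sectors per track, dividing a sector
  count by heads*sectors and rounding down gives the cylinder count.
  The helper name `chs_cylinders` is made up for this illustration, and
  the 2097152-sector figure assumes the image is exactly 1 GiB (2^30
  bytes).

```shell
#!/bin/bash
# Cylinders for a given total sector count, heads and sectors per track
# (integer division, so any partial trailing cylinder is dropped).
chs_cylinders() {
	echo $(( $1 / ($2 * $3) ))
}

chs_cylinders 1536000 16 63    # extent size from the working VMDK -> 1523
chs_cylinders 1536000 32 63    # same extent, 32-head BIOS geometry -> 761
chs_cylinders 2097152 16 63    # assumed 1 GiB image (2^30 / 512)   -> 2080
```

  So the 1536000-sector extent in the working VMDK gives 1523 cylinders
  (and 761 with 32 heads, matching the BIOS geometry), while a 1 GiB
  image gives 2080. The working geometry therefore appears to describe
  a 1536000-sector (750 MiB) disk rather than the whole image file,
  though this says nothing about where that extent size came from.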

  I've only just begun to learn about the intricacies of CHS. Am I to
  understand that these values are stored amongst the first 512 bytes of
  the disk? If this is the case, then I wonder what changed the data,
  and why. I was initially only using QEMU 2.9.0, and didn't move the
  image to different VMs or QEMU versions. Perhaps Windows NT got
  confused about the disk CHS and rewrote it?
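
  On the question of where CHS values live on disk, a hedged sketch
  rather than an answer: the 512-byte MBR does not record the disk
  geometry as such, but each of its four partition entries (16 bytes
  each, starting at byte 446) carries start and end addresses in C/H/S
  form next to the LBA start and sector count. The partition's own
  boot sector additionally stores sectors-per-track and head counts in
  its BIOS parameter block (offsets 0x18 and 0x1A). The helper below,
  `mbr_entry_chs`, is a made-up name; it only decodes whatever bytes
  are in the image and makes no claim about what wrote them.

```shell
#!/bin/bash
# Print the C/H/S and LBA fields of one MBR partition entry.
# Usage: mbr_entry_chs IMAGE [ENTRY]    (ENTRY is 0..3, default 0)
# The body runs in a subshell so no variables leak into the caller.
mbr_entry_chs() (
	img=$1
	n=${2:-0}
	# Read the 16 bytes of the entry as unsigned decimal values.
	set -- $(od -An -v -tu1 -j $((446 + 16 * n)) -N 16 "$img")
	[ $# -eq 16 ] || { echo "could not read entry $n from $img" >&2; exit 1; }
	# Bytes 2-4 and 6-8: head, sector (low 6 bits) plus cylinder high
	# bits (top 2 bits), then the cylinder low byte. Byte 5 is the type.
	# Bytes 9-12 and 13-16 are little-endian LBA start and sector count.
	printf 'type=0x%02x start=%d/%d/%d end=%d/%d/%d lba=%d sectors=%d\n' \
		"$5" \
		$(( (($3 & 192) << 2) | $4 )) "$2" $(( $3 & 63 )) \
		$(( (($7 & 192) << 2) | $8 )) "$6" $(( $7 & 63 )) \
		$(( $9 | (${10} << 8) | (${11} << 16) | (${12} << 24) )) \
		$(( ${13} | (${14} << 8) | (${15} << 16) | (${16} << 24) ))
)

# Example (path is a placeholder):
#   mbr_entry_chs /path/to/wnt4ts-broken.img 0
```

  Comparing the end C/H/S of the first entry against the two candidate
  geometries might show which one was in effect when the disk was
  partitioned.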

  
  == Sporadic BIOS-level boot failure ========================

  I have multiple screenshots of SeaBIOS in QEMU 2.9.0 halting with "No
  bootable device" (et al), even with the above manually-applied CHS
  settings.

  Commit e689f7c also presents such errors.

  Commit 306ec6c does not suffer from intermittent breakage of any kind:

  - No SeaBIOS flake-outs
  - No "Non-system disk or disk error"
  - No "A disk read error occurred"
  - No "General failure ..."

  While most of my confidence in commit 306ec6c is based on anecdotal
  evidence, I modified `tmux-ctl-loop` a little to soak-test BIOS-level
  I/O stability and left this modified version running for a few
  minutes.

  --8<--------------------------------------------------------

  #!/bin/bash

  port=4444

  tmux=./tmux

  printf -v l '%0.0s-' {0..25}
  h1="$l/ buffer dump begin \\$l"
  h2="$l-\\ buffer dump end /-$l"

  while :; do
  	
  	while :; do
  		echo | nc localhost $port -q0 -w1 > /dev/null && break
  		echo 'Start qemu!'
  	done
  	
  	buffer="$(tmux -S $tmux capture-pane; tmux -S $tmux save-buffer -)"
  	
  	echo "$h1"
  	[[ "$buffer" ]] && echo "$buffer" || echo '( * Screen buffer is empty * )'
  	echo "$h2"
  	
  	# -i: the message may be printed as "Non-System disk" (capital S)
  	if echo "$buffer" | grep -qi -e 'Non-system disk' -e 'No bootable device'
  	then
  		
  		s="<Hit error after $(< itercount) runs.>"
  		echo "$s"
  		echo "$s" >> stats
  		
  		touch STOP
  		
  		#echo q | nc localhost $port -q0 > /dev/null
  		
  		exit
  		
  	elif echo "$buffer" | grep -q 'OS Loader V4.00' || echo "$buffer" | \
  		grep -q 'A disk read error'
  	then
  	
  		echo '<Boot did not hang at BIOS, trying again>'
  		
  		echo q | nc localhost $port -q0 > /dev/null
  		
  	else
  		
  		echo '<Waiting for boot>'
  		
  	fi
  			
  done

  ------------------------------------------------------------
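
  The decision logic in the modified loop can also be exercised without
  running QEMU at all. The sketch below pulls the same three-way check
  into a function that reads a screen dump on stdin. The name
  `classify_screen` and the result words are invented here, and the
  first match is made case-insensitive on the assumption that the
  message may be printed as "Non-System disk" with a capital S.

```shell
#!/bin/bash
# Classify a captured screen (read from stdin) the way the loop does.
# The body runs in a subshell so `buffer` does not leak to the caller.
classify_screen() (
	buffer=$(cat)
	if printf '%s\n' "$buffer" | grep -qi -e 'Non-system disk' -e 'No bootable device'; then
		echo bios-error      # the BIOS itself failed to boot the disk
	elif printf '%s\n' "$buffer" | grep -q -e 'OS Loader V4.00' -e 'A disk read error'; then
		echo past-bios       # boot sector code ran, so the BIOS read worked
	else
		echo waiting         # nothing recognisable on screen yet
	fi
)
```

  With a reasonably recent tmux, something like `tmux -S ./tmux
  capture-pane -p | classify_screen` should feed it the live screen;
  saved dumps can be piped in the same way to check the patterns.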

  For the above to work, the top of run-qemu-loop must also be modified
  to read something along the lines of

  disk=/path/to/wnt4ts-broken.img,format=raw,cyls=1523,heads=16,secs=63

  (Suggestion: modify copies of both scripts)
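
  An alternative that avoids folding drive options into the `disk`
  variable is to keep the geometry in its own variable and assemble the
  `-drive` argument from both. This is only a sketch; `geom` and
  `drive_arg` are names made up here and are not part of the scripts
  above.

```shell
#!/bin/bash
disk=/path/to/wnt4ts-broken.img
geom=cyls=1523,heads=16,secs=63   # leave empty to let QEMU pick the geometry

# Build the value for -drive; the geometry suffix is added only when set.
drive_arg() {
	printf 'file=%s,format=raw%s\n' "$1" "${2:+,$2}"
}

drive_arg "$disk" "$geom"
# -> file=/path/to/wnt4ts-broken.img,format=raw,cyls=1523,heads=16,secs=63
```

  In a copy of run-qemu-loop, the `-drive` line would then become
  `-drive "$(drive_arg "$disk" "$geom")"`.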

  One small terminal-flicker-headache (and a 57°C CPU) later, I was able
  to carefully observe just over 350 successful runs in which QEMU
  commit 306ec6c only ever produced a boot menu. No other hitches.

  ** Important: **

  However, commit 306ec6c will never boot this image unless the CHS
  geometry is set to the values VirtualBox "discovered". (Of note
  is the fact that QEMU (2.9.0) was what initially created this image. I
  must admit that I don't remember what sequence of QEMU versions I fed
  the image to - and I maybe, possibly, didn't think to back the file up
  (sorry), so maybe something mangled something somewhere. But
  VirtualBox figured it out nonetheless!)

  Furthermore, feeding /dev/loopN to any QEMU version does NOT result
  in correct CHS discovery, and so does not boot either.

  This is what leads me to conclude that I've discovered two separate
  issues.


  == Appendix: How to build the branches =====================

  It's very simple.

  First, `git clone https://github.com/qemu/qemu` somewhere if you don't
  already have a local copy. If you have an old git checkout that's from
  2014 or later, you can use that old checkout instead. (If you want to
  test an old checkout you have, the commands below will either work
  perfectly or completely bomb out with no side effects.)

  A full checkout is a ~183MB download. Sorry.

  Next, create two new directories somewhere. Name them what you like,
  eg `qemu-working` and `qemu-broken`.

  Now, cd into the checkout directory, and run:

  $ git archive 306ec6c3cece7004429c79c1ac93d49919f1f1cc | tar xC /path/to/qemu-working/

  $ git archive e689f7c668cbd9d08f330e17c3dd3a059c9553d3 | tar xC /path/to/qemu-broken/

  The paths can be relative.

  Now, run this in both of the new directories:

  $ ./configure --python=python2.7 --disable-libssh2 --disable-seccomp \
      --disable-usb-redir --disable-guest-agent --disable-libiscsi \
      --disable-spice --disable-smartcard-nss --disable-vhost-net \
      --disable-docs --disable-attr --disable-cap-ng --disable-vde \
      --disable-user --disable-bluez --disable-vnc-ws --disable-xen \
      --disable-brlapi --enable-debug --target-list=i386-softmmu \
      --disable-fdt

  $ make -j64

  You can open two terminals and configure and build both simultaneously
  if you like.

  On my decent but very basic (2-core+HT) i3 box, -j64 works out fine:
  make doesn't end up launching too many gcc processes. You *will* see
  your system load spike to ~20 though :)
  (NB. Do. not. use. -j64. with. the. linux. kernel.)

  On my system, a single build with -j64 takes only about 35 seconds. C
  FTW. (Although this has increased to 1min20sec for more recent
  builds.)

  Most of the configure arguments remove functionality I'll never use
  (in this situation) and which will only slow down the build.

  Once QEMU is built, run qemu-system-i386 directly from where it has
  been built.

  $ /path/to/qemu-working/i386-softmmu/qemu-system-i386 ...
  $ /path/to/qemu-broken/i386-softmmu/qemu-system-i386 ...

  Again, the paths can be relative.

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1745312/+subscriptions

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Qemu-devel] [Bug 1745312] Re: Regression report: Disk subsystem I/O failures/issues surfacing in DOS/early Windows [two separate issues: one bisected, one root-caused]
  2018-01-25  7:18 [Qemu-devel] [Bug 1745312] [NEW] Regression report: Disk subsystem I/O failures/issues surfacing in DOS/early Windows [two separate issues: one bisected, one root-caused] i336_
  2018-01-29 13:37 ` Stefan Hajnoczi
  2018-01-30  7:10 ` [Qemu-devel] [Bug 1745312] " Fam Zheng
@ 2018-01-30 19:56 ` John Snow
  2018-04-30 18:06 ` Mario
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: John Snow @ 2018-01-30 19:56 UTC (permalink / raw)
  To: qemu-devel

Can you post your commandline for the MSDOS 6.22 issue? NT is known to
have a few problems and may be out of scope for what I can help with,
but I was under the assumption that MSDOS 6.22 was well-behaved in QEMU.

Commandline and steps to reproduce the error may be helpful (any
particular kind of command, workflow, etc that helps trigger the IO
errors? How big is the hard disk you are using? etc)

Thanks,
--John

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1745312

Title:
  Regression report: Disk subsystem I/O failures/issues surfacing in
  DOS/early Windows [two separate issues: one bisected, one root-caused]

Status in QEMU:
  New

Bug description:
  [Headsup: This report is long-ish due to the amount of detail I've
  stumbled on along the way that I think is relevant to include. I can't
  speak as to the complexity of the actual bugs, but the size of this
  report should not suggest that the reproduction process is
  particularly headache-inducing.]

  Hi!

  I recently needed to fire up some ancient software for research
  purposes and got very distracted discovering and playing with old
  versions of Windows :). In the process I've discovered some glitches
  with disk I/O.

  I believe I've stumbled on two completely separate issues that
  coincidentally surfaced at the same time. It's possible that
  components of this report will be re-filed as more specific new bugs,
  but I'm not an authority on QEMU internals or how to narrow
  down/categorize what I've found.

  - The first bug only surfaces when the "isapc" machine type is used.
  It intermittently produces "General failure {read,writ}ing drive _"
  under MS-DOS 6.22, and also somehow interferes with early bootstrap of
  Windows NT 4 (in NTLDR). Enabling or disabling KVM (I'm on Linux)
  appears to make no difference whatsoever, which may help with
  debugging.

  - The second issue involves
    - a WinNT4 disk image
    - created by running through a bog-standard NT4 install inside QEMU 2.9.0
    - which will now fail to boot in any version of QEMU - even version 1.0
      - but which VirtualBox will boot fine
        - but only if I point VirtualBox at QEMU's raw disk image via a
          hacked-together VMDK file
        - if the raw image is converted to VHD(X), VirtualBox will also fail
          to boot the image with exactly the same error as QEMU
        - this state of affairs is not affected by image sparseness (which makes
          sense)

  I'm confident I've bisected the first issue.

  I wasn't able to bisect the second issue (as all tested versions of
  QEMU behaved identically), but I've figured out a working repro
  testcase and I believe I've managed to pin down a solid root cause.


  == #1: Intermittent I/O issues when `-M isapc` is used =====

  These symptoms sometimes take a small amount of time and fiddling to
  trigger, but I AM able to consistently surface them on my machine
  after a short while. (I am very very interested to hear if others
  cannot reproduce them.)

  So, first of all:

  https://github.com/qemu/qemu/commit/306ec6c3cece7004429c79c1ac93d49919f1f1cc
    (Jul 30 2013): the last version that works

  https://github.com/qemu/qemu/commit/e689f7c668cbd9d08f330e17c3dd3a059c9553d3
    (Oct 30 2013): the first version that intermittently fails

  Maybe lift out and build these branches while reading. *shrug*
  (How to do this can be found at the end of this report - along with a time-saving ./configure line, FWIW)

  Here are the changelists between these two revisions:

  https://github.com/qemu/qemu/compare/306ec6c...e689f7c
  (Compare direction: OLD to NEW) (Commits: 166  Files changed: 192)

  https://github.com/qemu/qemu/compare/e689f7c...306ec6c
  (Compare direction: NEW to OLD) (Commits: 30   Files changed: 22)

  (Someone else more familiar with Git might know why GitHub returns
  results for both compare directions, and/or if the 2nd link is useful
  information. The first link returns a lot more results than the 2nd
  one, at least. Does comparing new>old return deletions?)

  ---

  Now on to the symptoms. In a moment I'll describe reproduction.

  # MS-DOS 6.22

  The first symptom I discovered was that trivial read and write
  operations under MS-DOS would sometimes fail:

    C:\>echo test > hi

    General failure writing drive C
    Abort, Retry, Fail?

  Anything else that exercises the disk behaves similarly:

    C:\>dir /s > nul

    General failure reading drive C
    Abort, Retry, Fail?

  (Note that the above demonstrates both write and read failures)

  (Also, FWIW, `dir /s` == `ls -R`)

  The behavior of the I/O errors is not possible to characterise as it
  fluctuates so much. For example something as simple as DIR can produce
  wildly differing results: in one run, poking around with DIR ended
  with DOS deciding C:\ was empty at one point; at another point in a
  different run C:\ mysteriously dropped 50% of its contents only to
  magically gain it all back moments later after some poking around in
  one of the subdirectories that was still visible.

  The time it takes to trigger these errors is also highly variable.
  QEMU may fall over as early as hanging forever at "Starting MS-
  DOS...", or I might get all the way into Windows 3.1 before it
  triggers (in which case Win3.1 reports vague memory errors - of all
  things).

  Very occasionally I've seen _SeaBIOS itself_ report "Booting from Hard
  Disk..." "Boot failed: could not read the boot disk" ... "No bootable
  device.", and on one occasion I even got "Non-System disk or disk
  error" "Replace and strike any key when ready"!

  
  # WinNT 4 Terminal Server

  Most of the time, NTLDR will fire up normally. But every so often...

    SeaBIOS (version rel-1.7.3-117-g31b8b4e-
  20131206_080705-nilsson.home.kraxel.org)

    Booting from Hard Disk...
    A disk read error occurred.
    Insert a system diskette and restart
    the system.

  (NB. You're seeing the old SeaBIOS version included with e689f7c,
  which was the first buggy commit.)

  If NT gets past this point without erroring out (ie, it makes it to
  the boot menu), the rest of the system is 100% fine and there are no
  other disk I/O issues whatsoever. For example, on QEMU 2.9.0 I was
  able to enable disk compression, answer "Yes" to "Compress entire disk
  now?" and have the process fully complete. No hitches.

  This makes me vaguely recall/wonder that perhaps this could be somehow
  related to LBA and/or Int 13h, or something floating around near that
  bunch of functionality. (I'm woefully ignorant about such low-level
  details.) Perhaps DOS/Win3.1 are stuck using a disk mode that QEMU has
  a buggy implementation of, while NT 4 (once NTOSKRNL is up and
  running) is able to use a different disk mode or access mechanism.

  I'm really interested to get some understanding of what the root issue
  is here, when this is fixed. (I wonder if it's a timing thing?)

  I've observed some unusual behavior with repeated restarts. In one
  case, I attempted to start NT4 multiple times, and QEMU consistently
  failed with "No bootable device" each time. So, I removed `-M isapc`,
  promptly got a boot menu, hit ^C, readded `-M isapc` - and continued
  to get a boot menu. Yep. I'll accept "really really big coincidence"
  but I do very much wonder if something else is going on here. I've
  observed many similar incidents. It makes me wonder whether the
  contents of memory or some other system state is an influence. Very
  probably not, but still...


  -- Reproduction --------------------------------------------

  First of all, there was unfortunately no way for me to avoid having to
  post entire disk images, but I've managed to compress everything down
  to 174MB total download size.

  FWIW, WinWorld and many other sites seem to have no operational issues
  providing clear pointers to CD keys; I consider my distribution of my
  installed HDD images an extension of the apparent status quo.

  That being said, I've put everything on Google Drive so nobody has to
  headscratch about Launchpad/Canonical/etc's stance on hosting this
  data.

  So, this folder contains the disk images:
  https://drive.google.com/drive/folders/1WdZVBh5Trs9HLC186-nHyKeqaSxfyM2c
  ("Download all" at the top-right will create a ZIP file, but FWIW
  downloading the individual files simultaneously would implement a
  rough form of download acceleration)

  File meta info:

  Compressed
  |
  |      Apparent
  |      |    Actual
  |      |    |
  38M -> 200M (103M)  win31.img.xz
  82M -> 1G   (289M)  wnt4ts-broken.img.xz
  55M -> 350M (146M)  wnt4ts-intermittent.img.xz

  SHA-256s:

  win31:        8179b8180a2ab40bd472e8a2f3fb89fc331651e56923f94ceb9e52a78ee220d2
  broken:       a2af5f0bc49a063b75f534b6ffe5b82e32ecc706a64a425b6626feccf6e3fdfa  
  intermittent: 77ae8c458829ebcdd64c71042012f45d5a2788e6ebd22db9d53de9ef1a574784

  (Wanted to keep the checksum lines within 80 columns)

  And, since I can't figure out where else in this report to put this,
  wnt4ts-broken.img's password is "admin" but something seems to have
  happened to the disk and NT doesn't actually boot properly :(, and
  wnt4ts-intermittent.img's password is "1234". (These were set up as
  test images. Now I'm _really_ glad I used simple passwords! :) )

  ---

  
  I have two testcases: DOS 6.22 (+ Windows 3.1), and Windows NT 4.

  
  # MS-DOS

  DOS is the simplest. It basically consists of

  $ qemu-system-i386 -drive file=win31.img,format=raw -M isapc -enable-
  kvm

  And then literally just playing around. Things to try include creating
  files (`echo blah > file`), repeatedly seeking across the entire FAT
  (`dir /s > nul` or `dir /s`), and launching Windows (`win`).

  win31.img is not special (as far as I can tell) and merely consists of
  the result of installing DOS 6.22 and Windows 3.1 from WinWorldPC.
  I've basically just included the image for convenience.

  Generally no single "run" is immune to starting Win3.1 and then
  launching File Manager; if that doesn't generate an error, something
  is definitely up.

  The second best trigger is creating new files. That very very
  frequently produces "General Failure ...", but not always.

  
  # WinNT 4

  Windows NT 4 is a bit more complicated. Because this error only occurs
  at presumably a single small point very early in boot, the window of
  opportunity for the glitch to surface within is much much narrower and
  thus often requires a larger number of tries.

  Anecdotally I've had QEMU hit the boot error at the first try/run, and
  after as many as 63 "successful" boots.

  I made a small test harness that automates the launch process. It
  consists of two shell scripts and requires tmux (and netcat).
  (*Potential epilepsy warning*: if you use a light-colored terminal
  background, the terminal QEMU is repeatedly invoked from will
  continuously flash rapidly from white to black.)

  One of the scripts is run inside a tmux session in one terminal, while
  the other script is run in its own terminal (without any tmux).

  
  I named this one `run-qemu-loop`:

  --8<--------------------------------------------------------

  #!/bin/bash

  # ---

  qemu=/path/to/qemu-system-i386
  #or, alternatively: (I used the following line myself so I
  #could tab-complete my way to different qemu executables)
  #qemu="$1"

  disk=/path/to/wnt4ts-intermittent.img

  # ---

  port=4444

  rm -f STOP itercount

  itercount=0

  while :; do
  	
  	[ -f STOP ] && break
  	
  	((itercount++))
  	echo $itercount > itercount
  	
  	$qemu \
  		-enable-kvm -vga cirrus -curses -M isapc \
  		-drive file="$disk",format=raw \
  		-chardev socket,id=mon0,host=localhost,port=$port,server,nowait \
  		-mon chardev=mon0,mode=readline
  	
  	#point to an otherwise-unused terminal if you like (see also: `tty`)
  	#echo "$itercount run(s)" > /dev/pts/__
  	
  done

  ------------------------------------------------------------

  Not much logic above; this just repeatedly runs QEMU for as long as
  the file `STOP` does not exist in the current directory.

  The key "magic" bit is that QEMU is launched in -curses mode.

  The other key bit is that the above script is run inside tmux.

  
  Here's `tmux-ctl-loop`:

  --8<--------------------------------------------------------

  #!/bin/bash

  port=4444

  tmux=./tmux

  printf -v l '%0.0s-' {0..25}
  h1="$l/ buffer dump begin \\$l"
  h2="$l-\\ buffer dump end /-$l"

  while :; do
  	
  	while :; do
  		echo | nc localhost $port -q0 -w1 > /dev/null && break
  		echo 'Start qemu!'
  	done
  	
  	buffer="$(tmux -S $tmux capture-pane; tmux -S $tmux save-buffer -)"
  	
  	echo "$h1"
  	[[ "$buffer" ]] && echo "$buffer" || echo '( * Screen buffer is empty * )'
  	echo "$h2"
  	
  	if echo "$buffer" | grep -q 'A disk read error occurred.'; then
  		
  		s="<Crashed after $(< itercount) runs.>"
  		echo "$s"
  		echo "$s" >> stats
  		
  		touch STOP
  		
  		#echo q | nc localhost $port -q0 > /dev/null
  		
  		exit
  		
  	elif echo "$buffer" | grep -q 'OS Loader V4.00'; then
  			
  		echo '<Booted successfully, trying again>'
  		
  		echo q | nc localhost $port -q0 > /dev/null
  		
  	else
  		
  		echo '<Waiting for boot>'
  		
  	fi
  			
  done

  ------------------------------------------------------------

  Nothing particularly amazing going on here either.

  While `qemu-run-loop` is running inside tmux in the first terminal,
  this is running in the 2nd one.

  The small infinite loop at the top only breaks when it can
  successfully ping QEMU and it knows it's running.

  Then, a screendump of the contents of the terminal QEMU is in is
  fetched from tmux, and the buffer's content is analyzed.

  - If NTLDR fails, the script creates `STOP` to halt qemu-run-loop,
    sends `q` to QEMU through netcat, and then the script exits.

  - If NTLDR loads successfully, the script sends `q` to QEMU and continues
    looping. (qemu-run-loop will not find the `STOP` file, and so restart qemu.)

  The scripts run very quickly, with 2-3 iterations per second on my i3
  box.


  # Usage

  Save the two scripts above to the same directory as wnt4ts-intermittent.img,
  then:

  - (If port 4444 doesn't work, the value needs to be changed in both
  scripts.)

  - In the first terminal, run `tmux -S <file>`, where <file> names the socket
    tmux will use. This needs to match "tmux=" at the top of `tmux-ctl-loop`.
    (with `tmux=./tmux`, the command would be `tmux -S tmux`)

  - Still in the first terminal (and now also inside tmux), enter
    `./qemu-run-loop`, passing the path to qemu if you're using that approach
    (refer to the first few lines of the script). Don't hit enter yet.

  - Now, in the 2nd terminal, type `./tmux-ctl-loop`

  - Hit enter in both terminals.
   

  Rationale for timing of Enter key:

  - Running qemu-run-loop first will start QEMU, and if NTLDR starts
    successfully it will immediately begin counting down from 30. If NT actually
    starts to boot and is then hard-shut-down this /may/ affect the disk image

  - tmux-ctl-loop will annoyingly spam a continuous stream of 'Start qemu!' until
    qemu-run-loop is running.

  - Starting both scripts at "more or less" the same time (no rush) works out
    well.

  
  Hopefully potential script modifications are obvious; for example

  - changing tmux-ctl-loop to not send 'q' to qemu so you can connect to the HMP
    yourself
    (NB, if `STOP` is not created, when qemu finally exits it will of course
    promptly be relaunched)

  - pointing run-qemu-loop to a modified qemu binary


  == #2: QEMU-vs-VirtualBox image issue ======================

  I was initially completely stumped by this issue, perhaps
  unsurprisingly so. :)

  wnt4ts-broken.img is a perfectly ordinary NT 4 installation that was
  created in QEMU 2.9.0. I created a 1GB disk with `truncate`, picked
  NTFS and installed everything (which took a while).

  NT setup reboots a number of times during the boot process, and IIRC
  those all went just fine. However, at some point, the image began to
  consistently bomb out with "A disk read error occurred. ...", and
  stubbornly refused to boot, regardless of the number of boot attempts
  I tried.

  QEMU 2.0.0, 1.5.0, and 1.0 (the earliest version I was able to build
  on my system) all consistently hit "disk read error occurred".

  I tried compiling QEMU 1.0 using clang so I could build for 32-bit on
  my 64-bit system (GCC 7 died with "Frame pointer required, but
  reserved"). The resulting qemu completely crashed if I didn't enable
  KVM (ie, TCG was (understandably) broken); with KVM enabled qemu
  didn't crash, but NTLDR halted with the same error as on 64-bit qemu.
  (TL;DR, no difference whatsoever.)

  My initial reaction at this point was to try the image on another
  virtualization platform. My first pick was VirtualBox.

  So, I followed the official instructions for pointing VirtualBox to
  physical disk images, except I substituted a /dev/loopN device I'd
  pointed to the image file via losetup.

  And... VirtualBox picked the image up fine and Just Worked(TM). Yay! -
  but not yay. What gives?!

  Confused, I then tried to convert the disk image to VHD format.
  Unfortunately, for some reason, if I try `qemu-image convert ... -O
  vhdx ...`, VirtualBox chokes on the result:

  -----

  VD: error VERR_NOT_SUPPORTED opening image file
  '/.../wnt4ts-broken-qemuconv.vhd' (VERR_NOT_SUPPORTED).

  Result Code: NS_ERROR_FAILURE (0x80004005)
  Component: MediumWrap
  Interface: IMedium {4afe423b-43e0-e9d0-82e8-ceb307940dda}
  Callee: IVirtualBox {0169423f-46b4-cde9-91af-1e9d5b6cd945}
  Callee RC: VBOX_E_OBJECT_NOT_FOUND (0x80BB0001)

  -----

  Welp.

  Well, a bit more digging later, and I found I could do

  $ VBoxManage convertfromraw wnt4ts-broken.img wnt4ts-broken.vhd

  but... as soon as I pointed VirtualBox to this, it too began to choke
  with "A disk read error occurred".

  And yet, the VMDK->raw image setup worked just fine.

  I found I could even replace the loop device with the path of the .img
  file itself and that worked just fine too.

  At my wits' end, I followed some online instructions to learn about
  manual CHS configuration so I could try and get the image working in
  Bochs. "A disk read error occurred". I wasn't surprised.

  It was at this point I began to give up, but I decided to try One Last
  Thing(TM) before properly throwing in the towel.

  :)

  I decided to learn a bit more about how `VBoxManage internalcommands
  createrawvmdk` worked, and try one thing in particular: I can edit the
  .vmdk file, but can I point `createrawvmdk` at the .img file directly
  too?

  Turns out, yes you can.

  It also turns out that this promptly caused VirtualBox to bomb out.

  Interesting.

  For reference, here's the VMDK file I initially created (by pointing
  `createrawvmdk` at /dev/loopN) and then later edited to point straight
  to the .img file, with both approaches resulting in successful boot.

  --8<--------------------------------------------------------

  # Disk DescriptorFile
  version=1
  CID=e35b9a45
  parentCID=ffffffff
  createType="fullDevice"

  # Extent description
  RW 1536000 FLAT "/absolute/full/path/to/wnt4ts-broken.img" 0

  # The disk Data Base 
  #DDB

  ddb.virtualHWVersion = "4"
  ddb.adapterType="ide"
  ddb.geometry.cylinders="1523"
  ddb.geometry.heads="16"
  ddb.geometry.sectors="63"
  ddb.uuid.image="871a6044-c8ca-48ed-b7aa-e6fc49da3db4"
  ddb.uuid.parent="00000000-0000-0000-0000-000000000000"
  ddb.uuid.modification="3661715c-3906-4e4a-ab65-486d140e03b8"
  ddb.uuid.parentmodification="00000000-0000-0000-0000-000000000000"
  ddb.geometry.biosCylinders="761"
  ddb.geometry.biosHeads="32"
  ddb.geometry.biosSectors="63"

  ------------------------------------------------------------

  
  Here's the _diff_ of what happens if I point `createrawvmdk` at wnt4ts-broken.img directly:

  --8<--------------------------------------------------------

  ddb.geometry.cylinders="2080"
  ddb.geometry.heads="16"
  ddb.geometry.sectors="63"

  ------------------------------------------------------------

  :D

  Naturally,

  $ qemu-system-i386 -drive file=wnt4ts-
  broken.img,format=raw,cyls=1523,heads=16,secs=63 -M isapc -sdl

  will boot happily on 2.9.0 (notwithstanding the occasional "disk read
  error occurred" documented above).

  It will also boot in 1.6.0.

  (POTENTIAL BUG HEADSUP: 1.0 and 1.5.0 both lock up with a blank
  640x480 window and use 0% CPU if I specify `-M isapc`.)

  And, of course, using these CHS values in Bochs also results in
  successful boot as well (after setting the CPU type to pentium).

  Unfortunately, I have no idea what sequence of events caused the
  creation of the VMDK file above. No invocation of `createrawvmdk` is
  producing a VMDK file with the CHS settings above.

  I've only just begun to learn about the intricacies of CHS. Am I to
  understand that these values are stored amongst the first 512 bytes of
  the disk? If this is the case, then I wonder what changed the data,
  and why. I was initially only using QEMU 2.9.0, and didn't move the
  image to different VMs or QEMU versions. Perhaps Windows NT got
  confused about the disk CHS and rewrote it?

  
  == Sporadic BIOS-level boot failure ========================

  I have multiple screenshots of SeaBIOS in QEMU 2.9.0 halting with "No
  bootable device" (et al), even with the above manually-applied CHS
  settings.

  Commit e689f7c also presents such errors.

  Commit 306ec6c does not suffer from intermittent breakage of any kind:

  - No SeaBIOS flake-outs
  - No "Non-system disk or disk error"
  - No "A disk error has occurred"
  - No "General failure ..."

  While most of my confidence in commit 306ec6c is based on anecdotal
  evidence, I modified `tmux-ctl-loop` a little to soak-test BIOS-level
  I/O stability and left this modified version running for a few
  minutes.

  --8<--------------------------------------------------------

  #!/bin/bash

  port=4444

  tmux=./tmux

  printf -v l '%0.0s-' {0..25}
  h1="$l/ buffer dump begin \\$l"
  h2="$l-\\ buffer dump end /-$l"

  while :; do
  	
  	while :; do
  		echo | nc localhost $port -q0 -w1 > /dev/null && break
  		echo 'Start qemu!'
  	done
  	
  	buffer="$(tmux -S $tmux capture-pane; tmux -S $tmux save-buffer -)"
  	
  	echo "$h1"
  	[[ "$buffer" ]] && echo "$buffer" || echo '( * Screen buffer is empty * )'
  	echo "$h2"
  	
  	if echo "$buffer" | grep -q 'Non-system disk' || echo "$buffer" | \
  		grep -q 'No bootable device'
  	then
  		
  		s="<Hit error after $(< itercount) runs.>"
  		echo "$s"
  		echo "$s" >> stats
  		
  		touch STOP
  		
  		#echo q | nc localhost $port -q0 > /dev/null
  		
  		exit
  		
  	elif echo "$buffer" | grep -q 'OS Loader V4.00' || echo "$buffer" | \
  		grep -q 'A disk read error'
  	then
  	
  		echo '<Boot did not hang at BIOS, trying again>'
  		
  		echo q | nc localhost $port -q0 > /dev/null
  		
  	else
  		
  		echo '<Waiting for boot>'
  		
  	fi
  			
  done

  ------------------------------------------------------------

  For the above to work, the top of run-qemu-loop must also be modified
  to read something along the lines of

  disk=/path/to/wnt4ts-broken.img,format=raw,cyls=1523,heads=16,secs=63

  (Suggestion: modify copies of both scripts)

  One small terminal-flicker-headache (and a 57°C CPU) later, I was able
  to carefully observe just over 350 successful runs in which QEMU
  commit 306ec6c only ever produced a boot menu. No other hitches.

  ** Important: **

  However, commit 306ec6c will fail to boot, ever, if the cylinders and
  geometry are not set to the values VirtualBox "discovered". (Of note
  is the fact that QEMU (2.9.0) was what initially created this image. I
  must admit that I don't remember what sequence of QEMU versions I fed
  the image to - and I maybe, possibly, didn't think to back the file up
  (sorry), so maybe something mangled something somewhere. But
  VirtualBox figured it out nonetheless!)

  Furthermore, feeding /dev/loopN to any QEMU version will NOT result in
  correct CHS discovery (and successful boot).

  This is what leads me to conclude that I've discovered two separate
  issues.


  == Appendix: How to build the branches =====================

  It's very simple.

  First, `git clone https://github.com/qemu/qemu` somewhere if you don't
  already have a local copy. If you have an old git checkout that's from
  2014 or later, you can use that old checkout instead. (If you want to
  test an old checkout you have, the commands below will either work
  perfectly or completely bomb out with no side effects.)

  A full checkout is a ~183MB download. Sorry.

  Next, create two new directories somewhere. Name them what you like,
  eg `qemu-working` and `qemu-broken`.

  Now, cd into the checkout directory, and run:

  $ git archive 306ec6c3cece7004429c79c1ac93d49919f1f1cc | tar xC
  /path/to/qemu-working/

  $ git archive e689f7c668cbd9d08f330e17c3dd3a059c9553d3 | tar xC
  /path/to/qemu-broken/

  The paths can be relative.
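
  The two steps above (create the directories, then extract each commit)
  could be wrapped up roughly like this. `extract_commit` is a
  hypothetical helper name and the example paths are placeholders;
  `-f -` is spelled out so that tar reads the archive from stdin
  regardless of its built-in default.

```shell
#!/bin/bash
# Extract one commit of a git checkout into its own directory.
# Usage: extract_commit <checkout-dir> <commit> <destination-dir>
extract_commit() {
	local repo=$1 commit=$2 dest=$3
	# Bail out early if the checkout doesn't know this commit, so that
	# tar is never fed an empty stream.
	git -C "$repo" rev-parse --verify --quiet "$commit^{commit}" \
		> /dev/null || return 1
	mkdir -p "$dest" || return 1
	git -C "$repo" archive "$commit" | tar -xf - -C "$dest"
}

# Example (placeholder paths):
# extract_commit ~/src/qemu 306ec6c3cece7004429c79c1ac93d49919f1f1cc qemu-working
# extract_commit ~/src/qemu e689f7c668cbd9d08f330e17c3dd3a059c9553d3 qemu-broken
```

  The early `rev-parse` check is what gives the "works perfectly or
  bombs out with no side effects" behavior for an old checkout that
  lacks one of the commits.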

  Now, run this in both of the new directories:

  $ ./configure --python=python2.7 --disable-libssh2 --disable-seccomp \
      --disable-usb-redir --disable-guest-agent --disable-libiscsi \
      --disable-spice --disable-smartcard-nss --disable-vhost-net \
      --disable-docs --disable-attr --disable-cap-ng --disable-vde \
      --disable-user --disable-bluez --disable-vnc-ws --disable-xen \
      --disable-brlapi --enable-debug --target-list=i386-softmmu \
      --disable-fdt

  $ make -j64

  You can open two terminals and configure and build both simultaneously
  if you like.

  On my decent but very basic (2-core+HT) i3 box, -j64 works out fine -
  make doesn't actually launch too many gcc processes. You *will* see
  your system load spike to ~20 though :)
  (NB. Do. not. use. -j64. with. the. linux. kernel.)

  On my system, a single build with -j64 takes only about 35 seconds. C
  FTW. (Although this has increased to 1min20sec for more recent
  builds.)
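
  If -j64 feels excessive, one alternative is to derive the job count
  from the number of CPUs instead. This is just a sketch: `make_jobs` is
  a hypothetical helper, `nproc` comes from GNU coreutils, and
  `getconf _NPROCESSORS_ONLN` is tried as a fallback where `nproc` is
  missing.

```shell
#!/bin/bash
# Print a make job count based on the number of online CPUs (at least 1).
make_jobs() {
	local n
	n=$(nproc 2>/dev/null) \
		|| n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) \
		|| n=1
	echo "$n"
}

# Show the command that would be run; drop the echo to actually build.
echo "make -j$(make_jobs)"
```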

  Most of the configure arguments remove functionality I'll never use
  (in this situation) and which will only slow down the build.

  Once QEMU is built, run qemu-system-i386 directly from where it has
  been built.

  $ /path/to/qemu-working/i386-softmmu/qemu-system-i386 ...
  $ /path/to/qemu-broken/i386-softmmu/qemu-system-i386 ...

  Again, the paths can be relative.

* [Qemu-devel] [Bug 1745312] Re: Regression report: Disk subsystem I/O failures/issues surfacing in DOS/early Windows [two separate issues: one bisected, one root-caused]
  2018-01-25  7:18 [Qemu-devel] [Bug 1745312] [NEW] Regression report: Disk subsystem I/O failures/issues surfacing in DOS/early Windows [two separate issues: one bisected, one root-caused] i336_
                   ` (2 preceding siblings ...)
  2018-01-30 19:56 ` John Snow
@ 2018-04-30 18:06 ` Mario
  2019-08-01 18:19 ` Mdasoh Kyaeppd
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Mario @ 2018-04-30 18:06 UTC (permalink / raw)
  To: qemu-devel

I have a similar bug: 1674114

* [Qemu-devel] [Bug 1745312] Re: Regression report: Disk subsystem I/O failures/issues surfacing in DOS/early Windows [two separate issues: one bisected, one root-caused]
  2018-01-25  7:18 [Qemu-devel] [Bug 1745312] [NEW] Regression report: Disk subsystem I/O failures/issues surfacing in DOS/early Windows [two separate issues: one bisected, one root-caused] i336_
                   ` (3 preceding siblings ...)
  2018-04-30 18:06 ` Mario
@ 2019-08-01 18:19 ` Mdasoh Kyaeppd
  2019-12-23 21:35 ` John Snow
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Mdasoh Kyaeppd @ 2019-08-01 18:19 UTC (permalink / raw)
  To: qemu-devel

Can confirm the DOS issue is present.  Here are some steps to reproduce it:
# wget http://www.freedos.org/download/download/FD12CD.iso
# apt-get install mbr fdisk parted dosfstools qemu-system-x86
# dd if=/dev/zero of=dos.img bs=512 count=1032192
# losetup /dev/loop0 dos.img
# fdisk -u=cylinders /dev/loop0
command: x
expert: h
heads: 16 (you can try different values, 16, 32, 64, 128, 255)
expert: c
cylinders (default 1024):
expert: r
command: c 
DOS compatibility flag is set...
command: n
select: p
partition (default 1):
first cylinder (default 1):
last cylinder (default 1024):
command: a
command: t
selected partition 1
type: 6
command: w
# partprobe /dev/loop0
# install-mbr -f /dev/loop0
# mkdosfs -F 16 /dev/loop0p1
# qemu-system-i386 -drive file=/dev/loop0,cache=none,format=raw,index=0 \
-drive file=FD12CD.iso,cache=none,media=cdrom,if=ide,format=raw,index=1 -boot d \
-machine isapc
--------
qemu comes up
"install to harddisk"
select your preferred language
"yes - continue with the installation"
drive C does not appear to be formatted
"yes - please erase and format drive c:"
lbacache flush write error 0c80/chs#0001
...
etc etc etc.
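
As far as I can tell, the `count=1032192` in the dd line is simply 1024 cylinders x 16 heads x 63 sectors per track. The 63 is an assumption on my part, since the steps only state the head count and the 1024-cylinder default. A small sketch of that arithmetic (`chs_sectors` is a made-up name), which also shows how many whole cylinders the same image holds for the other head counts suggested above:

```shell
#!/bin/bash
# Sector count for a CHS geometry; secs defaults to 63 (an assumption,
# see the note above).
chs_sectors() {
	local cyls=$1 heads=$2 secs=${3:-63}
	echo $(( cyls * heads * secs ))
}

count=$(chs_sectors 1024 16)
echo "count=$count"               # the dd count: 1032192
echo "bytes=$(( count * 512 ))"   # 528482304, i.e. 504MiB

# Whole cylinders that fit in the same image for each head count.
for heads in 16 32 64 128 255; do
	echo "heads=$heads cyls=$(( count / (heads * 63) ))"
done
```

With 255 heads the division is not exact (64 whole cylinders, with some sectors left over), so if the result turns out to depend on the head count, that case may be worth checking separately.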

  (Someone else more familiar with Git might know why GitHub returns
  results for both compare directions, and/or if the 2nd link is useful
  information. The first link returns a lot more results than the 2nd
  one, at least. Does comparing new>old return deletions?)
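
  On the compare-direction question: GitHub's three-dot compare `A...B`
  lists the commits reachable from B but not from A. Getting results in
  both directions therefore suggests that neither commit is an ancestor
  of the other, i.e. the two histories diverged. A minimal sketch for
  checking this in a local clone follows; the function names are made up
  for illustration, and only the standard `git rev-list` and
  `git merge-base` subcommands are used.

```bash
#!/bin/bash

# Count the commits reachable from $2 but not from $1
# (the set that "compare/$1...$2" lists).
commits_only_in() {
	git rev-list --count "$1..$2"
}

# Succeed if $1 is an ancestor of $2, i.e. history between them is linear.
is_ancestor() {
	git merge-base --is-ancestor "$1" "$2"
}

# Summarise the relationship between an "old" and a "new" commit.
describe_range() {
	local old=$1 new=$2
	echo "only in new: $(commits_only_in "$old" "$new")"
	echo "only in old: $(commits_only_in "$new" "$old")"
	if is_ancestor "$old" "$new"; then
		echo "linear: old is an ancestor of new"
	else
		echo "diverged: common base $(git merge-base "$old" "$new")"
	fi
}
```

  Run inside the qemu checkout, `describe_range 306ec6c e689f7c` would be
  expected to report the same 166/30 split, provided the local history
  matches what GitHub shows.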

  ---

  Now on to the symptoms. In a moment I'll describe reproduction.

  # MS-DOS 6.22

  The first symptom I discovered was that trivial read and write
  operations under MS-DOS would sometimes fail:

    C:\>echo test > hi

    General failure writing drive C
    Abort, Retry, Fail?

  Anything else that exercises the disk behaves similarly:

    C:\>dir /s > nul

    General failure reading drive C
    Abort, Retry, Fail?

  (Note that the above demonstrates both write and read failures)

  (Also, FWIW, `dir /s` == `ls -R`)

  The I/O errors are hard to characterise because their behavior
  fluctuates so much. For example, something as simple as DIR can produce
  wildly differing results: in one run, poking around with DIR ended
  with DOS deciding C:\ was empty at one point; at another point in a
  different run C:\ mysteriously dropped 50% of its contents only to
  magically gain it all back moments later after some poking around in
  one of the subdirectories that was still visible.

  The time it takes to trigger these errors is also highly variable.
  QEMU may fall over as early as hanging forever at
  "Starting MS-DOS...", or I might get all the way into Windows 3.1
  before it triggers (in which case Win3.1 reports vague memory errors,
  of all things).

  Very occasionally I've seen _SeaBIOS itself_ report "Booting from Hard
  Disk..." "Boot failed: could not read the boot disk" ... "No bootable
  device.", and on one occasion I even got "Non-System disk or disk
  error" "Replace and strike any key when ready"!

  
  # WinNT 4 Terminal Server

  Most of the time, NTLDR will fire up normally. But every so often...

    SeaBIOS (version rel-1.7.3-117-g31b8b4e-20131206_080705-nilsson.home.kraxel.org)

    Booting from Hard Disk...
    A disk read error occurred.
    Insert a system diskette and restart
    the system.

  (NB. You're seeing the old SeaBIOS version included with e689f7c,
  which was the first buggy commit.)

  If NT gets past this point without erroring out (ie, it makes it to
  the boot menu), the rest of the system is 100% fine and there are no
  other disk I/O issues whatsoever. For example, on QEMU 2.9.0 I was
  able to enable disk compression, answer "Yes" to "Compress entire disk
  now?" and have the process fully complete. No hitches.

  This makes me wonder whether the problem is related to LBA and/or Int
  13h, or something near that bunch of functionality. (I'm woefully
  ignorant about such low-level details.) Perhaps DOS/Win3.1 are stuck
  using a disk mode that QEMU implements incorrectly, while NT 4 (once
  NTOSKRNL is up and running) is able to use a different disk mode or
  access mechanism.

  I'm really interested to get some understanding of what the root issue
  is here, when this is fixed. (I wonder if it's a timing thing?)

  I've observed some unusual behavior with repeated restarts. In one
  case, I attempted to start NT4 multiple times, and QEMU consistently
  failed with "No bootable device" each time. So, I removed `-M isapc`,
  promptly got a boot menu, hit ^C, readded `-M isapc` - and continued
  to get a boot menu. Yep. I'll accept "really really big coincidence"
  but I do very much wonder if something else is going on here. I've
  observed many similar incidents. It makes me wonder whether the
  contents of memory or some other system state is an influence. Very
  probably not, but still...


  -- Reproduction --------------------------------------------

  First of all, there was unfortunately no way for me to avoid having to
  post entire disk images, but I've managed to compress everything down
  to 174MB total download size.

  FWIW, WinWorld and many other sites seem to have no operational issues
  providing clear pointers to CD keys; I consider my distribution of my
  installed HDD images an extension of the apparent status quo.

  That being said, I've put everything on Google Drive so nobody has to
  headscratch about Launchpad/Canonical/etc's stance on hosting this
  data.

  So, this folder contains the disk images:
  https://drive.google.com/drive/folders/1WdZVBh5Trs9HLC186-nHyKeqaSxfyM2c
  ("Download all" at the top-right will create a ZIP file, but FWIW
  downloading the individual files simultaneously would implement a
  rough form of download acceleration)

  File meta info:

  Compressed
  |
  |      Apparent
  |      |    Actual
  |      |    |
  38M -> 200M (103M)  win31.img.xz
  82M -> 1G   (289M)  wnt4ts-broken.img.xz
  55M -> 350M (146M)  wnt4ts-intermittent.img.xz

  SHA-256s:

  win31:        8179b8180a2ab40bd472e8a2f3fb89fc331651e56923f94ceb9e52a78ee220d2
  broken:       a2af5f0bc49a063b75f534b6ffe5b82e32ecc706a64a425b6626feccf6e3fdfa  
  intermittent: 77ae8c458829ebcdd64c71042012f45d5a2788e6ebd22db9d53de9ef1a574784

  (Wanted to keep the checksum lines within 80 columns)
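
  A small sketch for checking a download against the digests above. The
  report does not say whether the digests cover the compressed .xz files
  or the unpacked .img files, so which file to pass is an assumption to
  verify; the function names are illustrative.

```bash
#!/bin/bash

# Print the SHA-256 of a file, using whichever common tool is installed.
sha256_of() {
	if command -v sha256sum > /dev/null 2>&1; then
		sha256sum "$1" | cut -d' ' -f1
	else
		shasum -a 256 "$1" | cut -d' ' -f1
	fi
}

# Succeed only if the file's SHA-256 matches the expected hex digest.
verify_sha256() {
	local file=$1 expected=$2 actual
	actual=$(sha256_of "$file")
	if [ "$actual" = "$expected" ]; then
		echo "OK       $file"
	else
		echo "MISMATCH $file (got $actual)"
		return 1
	fi
}
```

  For example: `verify_sha256 win31.img 8179b8180a2a...` with the full
  digest from the list above.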

  And, since I can't figure out where else in this report to put this,
  wnt4ts-broken.img's password is "admin" but something seems to have
  happened to the disk and NT doesn't actually boot properly :(, and
  wnt4ts-intermittent.img's password is "1234". (These were set up as
  test images. Now I'm _really_ glad I used simple passwords! :) )

  ---

  
  I have two testcases: DOS 6.22 (+ Windows 3.1), and Windows NT 4.

  
  # MS-DOS

  DOS is the simplest. It basically consists of

  $ qemu-system-i386 -drive file=win31.img,format=raw -M isapc -enable-kvm

  And then literally just playing around. Things to try include creating
  files (`echo blah > file`), repeatedly seeking across the entire FAT
  (`dir /s > nul` or `dir /s`), and launching Windows (`win`).

  win31.img is not special (as far as I can tell) and merely consists of
  the result of installing DOS 6.22 and Windows 3.1 from WinWorldPC.
  I've basically just included the image for convenience.

  Generally no single "run" is immune to starting Win3.1 and then
  launching File Manager; if that doesn't generate an error, something
  is definitely up.

  The second best trigger is creating new files. That very very
  frequently produces "General Failure ...", but not always.

  
  # WinNT 4

  Windows NT 4 is a bit more complicated. Because this error only occurs
  at presumably a single small point very early in boot, the window of
  opportunity for the glitch to surface within is much much narrower and
  thus often requires a larger number of tries.

  Anecdotally I've had QEMU hit the boot error at the first try/run, and
  after as many as 63 "successful" boots.

  I made a small test harness that automates the launch process. It
  consists of two shell scripts and requires tmux (and netcat).
  (*Potential epilepsy warning*: if you use a light-colored terminal
  background, the terminal QEMU is repeatedly invoked from will
  continuously flash rapidly from white to black.)

  One of the scripts is run inside a tmux session in one terminal, while
  the other script is run in its own terminal (without any tmux).

  
  I named this one `run-qemu-loop`:

  --8<--------------------------------------------------------

  #!/bin/bash

  # ---

  qemu=/path/to/qemu-system-i386
  #or, alternatively: (I used the following line myself so I
  #could tab-complete my way to different qemu executables)
  #qemu="$1"

  disk=/path/to/wnt4ts-intermittent.img

  # ---

  port=4444

  rm -f STOP itercount

  itercount=0

  while :; do
  	
  	[ -f STOP ] && break
  	
  	((itercount++))
  	echo $itercount > itercount
  	
  	$qemu \
  		-enable-kvm -vga cirrus -curses -M isapc \
  		-drive file="$disk",format=raw \
  		-chardev socket,id=mon0,host=localhost,port=$port,server,nowait \
  		-mon chardev=mon0,mode=readline
  	
  	#point to an otherwise-unused terminal if you like (see also: `tty`)
  	#echo "$itercount run(s)" > /dev/pts/__
  	
  done

  ------------------------------------------------------------

  Not much logic above; this just repeatedly runs QEMU for as long as
  the file `STOP` does not exist in the current directory.

  The key "magic" bit is that QEMU is launched in -curses mode.

  The other key bit is that the above script is run inside tmux.

  
  Here's `tmux-ctl-loop`:

  --8<--------------------------------------------------------

  #!/bin/bash

  port=4444

  tmux=./tmux

  printf -v l '%0.0s-' {0..25}
  h1="$l/ buffer dump begin \\$l"
  h2="$l-\\ buffer dump end /-$l"

  while :; do
  	
  	while :; do
  		echo | nc localhost $port -q0 -w1 > /dev/null && break
  		echo 'Start qemu!'
  	done
  	
  	buffer="$(tmux -S $tmux capture-pane; tmux -S $tmux save-buffer -)"
  	
  	echo "$h1"
  	[[ "$buffer" ]] && echo "$buffer" || echo '( * Screen buffer is empty * )'
  	echo "$h2"
  	
  	if echo "$buffer" | grep -q 'A disk read error occurred.'; then
  		
  		s="<Crashed after $(< itercount) runs.>"
  		echo "$s"
  		echo "$s" >> stats
  		
  		touch STOP
  		
  		#echo q | nc localhost $port -q0 > /dev/null
  		
  		exit
  		
  	elif echo "$buffer" | grep -q 'OS Loader V4.00'; then
  			
  		echo '<Booted successfully, trying again>'
  		
  		echo q | nc localhost $port -q0 > /dev/null
  		
  	else
  		
  		echo '<Waiting for boot>'
  		
  	fi
  			
  done

  ------------------------------------------------------------

  Nothing particularly amazing going on here either.

  While `run-qemu-loop` is running inside tmux in the first terminal,
  this is running in the 2nd one.

  The small infinite loop at the top only breaks when it can
  successfully ping QEMU and it knows it's running.

  Then, a screendump of the contents of the terminal QEMU is in is
  fetched from tmux, and the buffer's content is analyzed.

  - If NTLDR fails, the script records the run count in `stats`, creates
    `STOP` to halt run-qemu-loop, and exits. (The line that would send `q`
    to QEMU is commented out, so the failed QEMU instance keeps running.)

  - If NTLDR loads successfully, the script sends `q` to QEMU and continues
    looping. (run-qemu-loop will not find the `STOP` file, and so restarts qemu.)

  The scripts run very quickly, with 2-3 iterations per second on my i3
  box.
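
  The decision logic described above can be isolated into one function,
  which makes it possible to test the string matching without starting
  QEMU. This is a sketch rather than part of the original harness; the
  name `classify_screen` is made up, and fixed-string matching (`grep -F`)
  is swapped in for the plain `grep` used in the script.

```bash
#!/bin/bash

# Classify a captured screen buffer the same way tmux-ctl-loop does.
# Prints one of: crashed, booted, waiting.
classify_screen() {
	local buffer=$1
	if printf '%s\n' "$buffer" | grep -qF 'A disk read error occurred.'; then
		echo crashed
	elif printf '%s\n' "$buffer" | grep -qF 'OS Loader V4.00'; then
		echo booted
	else
		echo waiting
	fi
}
```

  With a reasonably recent tmux, `capture-pane -p` prints the pane to
  stdout, so a call could look like
  `classify_screen "$(tmux -S ./tmux capture-pane -p)"`.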


  # Usage

  Save the two scripts above to the same directory as wnt4ts-intermittent.img,
  then:

  - (If port 4444 doesn't work, the value needs to be changed in both
  scripts.)

  - In the first terminal, run `tmux -S <file>`, where <file> names the socket
    tmux will use. This needs to match "tmux=" at the top of `tmux-ctl-loop`.
    (with `tmux=./tmux`, the command would be `tmux -S tmux`)

  - Still in the first terminal (and now also inside tmux), enter
    `./run-qemu-loop`, passing the path to qemu if you're using that approach
    (refer to the first few lines of the script). Don't hit enter yet.

  - Now, in the 2nd terminal, type `./tmux-ctl-loop`

  - Hit enter in both terminals.
   

  Rationale for timing of Enter key:

  - Running run-qemu-loop first will start QEMU, and if NTLDR starts
    successfully it will immediately begin counting down from 30. If NT actually
    starts to boot and is then hard-shut-down this /may/ affect the disk image.

  - tmux-ctl-loop will annoyingly spam a continuous stream of 'Start qemu!' until
    run-qemu-loop is running.

  - Starting both scripts at "more or less" the same time (no rush) works out
    well.

  
  Hopefully potential script modifications are obvious; for example

  - changing tmux-ctl-loop to not send 'q' to qemu so you can connect to the HMP
    yourself
    (NB, if `STOP` is not created, when qemu finally exits it will of course
    promptly be relaunched)

  - pointing run-qemu-loop to a modified qemu binary


  == #2: QEMU-vs-VirtualBox image issue ======================

  I was initially completely stumped by this issue, perhaps
  unsurprisingly so. :)

  wnt4ts-broken.img is a perfectly ordinary NT 4 installation that was
  created in QEMU 2.9.0. I created a 1GB disk with `truncate`, picked
  NTFS and installed everything (which took a while).

  NT setup reboots a number of times during the install process, and IIRC
  those all went just fine. However, at some point, the image began to
  consistently bomb out with "A disk read error occurred. ...", and
  stubbornly refused to boot, regardless of the number of boot attempts
  I tried.

  QEMU 2.0.0, 1.5.0, and 1.0 (the earliest version I was able to build
  on my system) all consistently hit "disk read error occurred".

  I tried compiling QEMU 1.0 using clang so I could build for 32-bit on
  my 64-bit system (GCC 7 died with "Frame pointer required, but
  reserved"). The resulting qemu completely crashed if I didn't enable
  KVM (ie, TCG was (understandably) broken); with KVM enabled qemu
  didn't crash, but NTLDR halted with the same error as on 64-bit qemu.
  (TL;DR, no difference whatsoever.)

  My initial reaction at this point was to try the image on another
  virtualization platform. My first pick was VirtualBox.

  So, I followed the official instructions for pointing VirtualBox to
  physical disk images, except I substituted a /dev/loopN device I'd
  pointed to the image file via losetup.

  And... VirtualBox picked the image up fine and Just Worked(TM). Yay! -
  but not yay. What gives?!

  Confused, I then tried to convert the disk image to VHD format.
  Unfortunately, for some reason, if I try `qemu-img convert ... -O
  vhdx ...`, VirtualBox chokes on the result:

  -----

  VD: error VERR_NOT_SUPPORTED opening image file
  '/.../wnt4ts-broken-qemuconv.vhd' (VERR_NOT_SUPPORTED).

  Result Code: NS_ERROR_FAILURE (0x80004005)
  Component: MediumWrap
  Interface: IMedium {4afe423b-43e0-e9d0-82e8-ceb307940dda}
  Callee: IVirtualBox {0169423f-46b4-cde9-91af-1e9d5b6cd945}
  Callee RC: VBOX_E_OBJECT_NOT_FOUND (0x80BB0001)

  -----

  Welp.

  Well, a bit more digging later, and I found I could do

  $ VBoxManage convertfromraw wnt4ts-broken.img wnt4ts-broken.vhd

  but... as soon as I pointed VirtualBox to this, it too began to choke
  with "A disk read error occurred".

  And yet, the VMDK->raw image setup worked just fine.

  I found I could even replace the loop device with the path of the .img
  file itself and that worked just fine too.

  At my wits' end, I followed some online instructions to learn about
  manual CHS configuration so I could try and get the image working in
  Bochs. "A disk read error occurred". I wasn't surprised.

  It was at this point I began to give up, but I decided to try One Last
  Thing(TM) before properly throwing in the towel.

  :)

  I decided to learn a bit more about how `VBoxManage internalcommands
  createrawvmdk` worked, and try one thing in particular: I can edit the
  .vmdk file, but can I point `createrawvmdk` at the .img file directly
  too?

  Turns out, yes you can.

  It also turns out that this promptly caused VirtualBox to bomb out.

  Interesting.

  For reference, here's the VMDK file I initially created (by pointing
  `createrawvmdk` at /dev/loopN) and then later edited to point straight
  to the .img file, with both approaches resulting in successful boot.

  --8<--------------------------------------------------------

  # Disk DescriptorFile
  version=1
  CID=e35b9a45
  parentCID=ffffffff
  createType="fullDevice"

  # Extent description
  RW 1536000 FLAT "/absolute/full/path/to/wnt4ts-broken.img" 0

  # The disk Data Base 
  #DDB

  ddb.virtualHWVersion = "4"
  ddb.adapterType="ide"
  ddb.geometry.cylinders="1523"
  ddb.geometry.heads="16"
  ddb.geometry.sectors="63"
  ddb.uuid.image="871a6044-c8ca-48ed-b7aa-e6fc49da3db4"
  ddb.uuid.parent="00000000-0000-0000-0000-000000000000"
  ddb.uuid.modification="3661715c-3906-4e4a-ab65-486d140e03b8"
  ddb.uuid.parentmodification="00000000-0000-0000-0000-000000000000"
  ddb.geometry.biosCylinders="761"
  ddb.geometry.biosHeads="32"
  ddb.geometry.biosSectors="63"

  ------------------------------------------------------------

  
  Here's the _diff_ of what happens if I point `createrawvmdk` at wnt4ts-broken.img directly:

  --8<--------------------------------------------------------

  ddb.geometry.cylinders="2080"
  ddb.geometry.heads="16"
  ddb.geometry.sectors="63"

  ------------------------------------------------------------

  :D

  Naturally,

  $ qemu-system-i386 \
      -drive file=wnt4ts-broken.img,format=raw,cyls=1523,heads=16,secs=63 \
      -M isapc -sdl

  will boot happily on 2.9.0 (notwithstanding the occasional "disk read
  error occurred" documented above).

  It will also boot in 1.6.0.

  (POTENTIAL BUG HEADSUP: 1.0 and 1.5.0 both lock up with a blank
  640x480 window and use 0% CPU if I specify `-M isapc`.)

  And, of course, using these CHS values in Bochs also results in a
  successful boot (after setting the CPU type to pentium).

  Unfortunately, I have no idea what sequence of events caused the
  creation of the VMDK file above. No invocation of `createrawvmdk` is
  producing a VMDK file with the CHS settings above.
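
  One observation that may help narrow this down: the numbers in both
  descriptors are consistent with plain floor division of the total
  sector count by heads x sectors-per-track. The extent line lists
  1536000 sectors, which gives 1523 cylinders at 16/63 and 761 at 32/63,
  while a 1 GiB image (2097152 sectors of 512 bytes) gives 2080. Whether
  VirtualBox actually computes the geometry this way is an assumption;
  the sketch below only reproduces the arithmetic.

```bash
#!/bin/bash

# Largest whole number of cylinders that fits in a disk of
# $1 sectors with $2 heads and $3 sectors per track.
chs_cylinders() {
	echo $(( $1 / ($2 * $3) ))
}

# Total sectors addressed by a cylinders/heads/sectors geometry.
chs_sectors() {
	echo $(( $1 * $2 * $3 ))
}
```

  So the working geometry matches a 1536000-sector (750 MiB) device
  rather than the full 1 GiB file.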

  I've only just begun to learn about the intricacies of CHS. Am I to
  understand that these values are stored amongst the first 512 bytes of
  the disk? If this is the case, then I wonder what changed the data,
  and why. I was initially only using QEMU 2.9.0, and didn't move the
  image to different VMs or QEMU versions. Perhaps Windows NT got
  confused about the disk CHS and rewrote it?
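
  For anyone wanting to look at what is actually on disk: the MBR does
  not hold the drive geometry as such, but each of its four partition
  entries (starting at byte 446, 16 bytes each) carries CHS start and end
  addresses, and a FAT or NTFS boot sector records sectors-per-track and
  heads in its BPB at offsets 0x18 and 0x1A. A hedged sketch for dumping
  those fields with `od` follows; the function names are illustrative.

```bash
#!/bin/bash

# Print the unsigned byte at byte offset $2 of file $1.
byte_at() {
	od -An -tu1 -j "$2" -N 1 "$1" | tr -d ' \n'
}

# Print the little-endian 16-bit value at byte offset $2 of file $1.
le16_at() {
	local lo hi
	lo=$(byte_at "$1" "$2")
	hi=$(byte_at "$1" $(( $2 + 1 )))
	echo $(( lo | (hi << 8) ))
}

# Print "cylinder/head/sector" for the END address stored in
# MBR partition entry $2 (0..3, default 0) of image $1.
mbr_end_chs() {
	local img=$1 base=$(( 446 + 16 * ${2:-0} )) h s c
	h=$(byte_at "$img" $(( base + 5 )))
	s=$(byte_at "$img" $(( base + 6 )))
	c=$(byte_at "$img" $(( base + 7 )))
	echo "$(( ((s & 0xC0) << 2) | c ))/$h/$(( s & 0x3F ))"
}

# Print "heads sectors-per-track" from the BPB of the boot sector
# that starts at byte offset $2 (default 0) of image $1.
bpb_geometry() {
	local img=$1 off=${2:-0}
	echo "$(le16_at "$img" $(( off + 0x1A ))) $(le16_at "$img" $(( off + 0x18 )))"
}
```

  The byte offset passed to `bpb_geometry` is the partition's start LBA
  times 512. The start LBA is stored in bytes 8-11 of the partition
  entry; sector 63 is common for installs of this era, but that is an
  assumption, e.g. `bpb_geometry wnt4ts-broken.img $((63 * 512))`.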

  
  == Sporadic BIOS-level boot failure ========================

  I have multiple screenshots of SeaBIOS in QEMU 2.9.0 halting with "No
  bootable device" (et al), even with the above manually-applied CHS
  settings.

  Commit e689f7c also presents such errors.

  Commit 306ec6c does not suffer from intermittent breakage of any kind:

  - No SeaBIOS flake-outs
  - No "Non-system disk or disk error"
  - No "A disk read error occurred"
  - No "General failure ..."

  While most of my confidence in commit 306ec6c is based on anecdotal
  evidence, I modified `tmux-ctl-loop` a little to soak-test BIOS-level
  I/O stability and left this modified version running for a few
  minutes.

  --8<--------------------------------------------------------

  #!/bin/bash

  port=4444

  tmux=./tmux

  printf -v l '%0.0s-' {0..25}
  h1="$l/ buffer dump begin \\$l"
  h2="$l-\\ buffer dump end /-$l"

  while :; do
  	
  	while :; do
  		echo | nc localhost $port -q0 -w1 > /dev/null && break
  		echo 'Start qemu!'
  	done
  	
  	buffer="$(tmux -S $tmux capture-pane; tmux -S $tmux save-buffer -)"
  	
  	echo "$h1"
  	[[ "$buffer" ]] && echo "$buffer" || echo '( * Screen buffer is empty * )'
  	echo "$h2"
  	
  	if echo "$buffer" | grep -q 'Non-system disk' || echo "$buffer" | \
  		grep -q 'No bootable device'
  	then
  		
  		s="<Hit error after $(< itercount) runs.>"
  		echo "$s"
  		echo "$s" >> stats
  		
  		touch STOP
  		
  		#echo q | nc localhost $port -q0 > /dev/null
  		
  		exit
  		
  	elif echo "$buffer" | grep -q 'OS Loader V4.00' || echo "$buffer" | \
  		grep -q 'A disk read error'
  	then
  	
  		echo '<Boot did not hang at BIOS, trying again>'
  		
  		echo q | nc localhost $port -q0 > /dev/null
  		
  	else
  		
  		echo '<Waiting for boot>'
  		
  	fi
  			
  done

  ------------------------------------------------------------

  For the above to work, the top of run-qemu-loop must also be modified
  to read something along the lines of

  disk=/path/to/wnt4ts-broken.img,format=raw,cyls=1523,heads=16,secs=63

  (Suggestion: modify copies of both scripts)
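
  Note that with the `disk=` value above, the unmodified `-drive` line
  in run-qemu-loop ends up passing `format=raw` twice. A tidier variant
  is to build the whole `-drive` value in one place. This is a sketch
  with a made-up helper name; it targets the QEMU versions discussed in
  this report, and I believe later releases moved the geometry settings
  from `-drive` to properties on the disk device, so treat the option
  names as version-dependent.

```bash
#!/bin/bash

# Build the -drive value used in this report for a raw image with an
# explicit CHS geometry (cyls/heads/secs as -drive suboptions).
drive_with_geometry() {
	local img=$1 cyls=$2 heads=$3 secs=$4
	echo "file=$img,format=raw,cyls=$cyls,heads=$heads,secs=$secs"
}
```

  In a copy of run-qemu-loop, the `-drive` line would then read
  `-drive "$(drive_with_geometry "$disk" 1523 16 63)"`, with `disk=` left
  as a plain path.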

  One small terminal-flicker-headache (and a 57°C CPU) later, I was able
  to carefully observe just over 350 successful runs in which QEMU
  commit 306ec6c only ever produced a boot menu. No other hitches.

  ** Important: **

  However, commit 306ec6c will fail to boot, ever, if the CHS geometry
  is not set to the values VirtualBox "discovered". (Of note
  is the fact that QEMU (2.9.0) was what initially created this image. I
  must admit that I don't remember what sequence of QEMU versions I fed
  the image to - and I maybe, possibly, didn't think to back the file up
  (sorry), so maybe something mangled something somewhere. But
  VirtualBox figured it out nonetheless!)

  Furthermore, feeding /dev/loopN to any QEMU version will NOT result in
  correct CHS discovery (and successful boot).

  This is what leads me to conclude that I've discovered two separate
  issues.


  == Appendix: How to build the branches =====================

  It's very simple.

  First, `git clone https://github.com/qemu/qemu` somewhere if you don't
  already have a local copy. If you have an old git checkout that's from
  2014 or later, you can use that old checkout instead. (If you want to
  test an old checkout you have, the commands below will either work
  perfectly or completely bomb out with no side effects.)

  A full checkout is a ~183MB download. Sorry.

  Next, create two new directories somewhere. Name them what you like,
  eg `qemu-working` and `qemu-broken`.

  Now, cd into the checkout directory, and run:

  $ git archive 306ec6c3cece7004429c79c1ac93d49919f1f1cc \
      | tar xC /path/to/qemu-working/

  $ git archive e689f7c668cbd9d08f330e17c3dd3a059c9553d3 \
      | tar xC /path/to/qemu-broken/

  The paths can be relative.

  Now, run this in both of the new directories:

  $ ./configure --python=python2.7 --disable-libssh2 --disable-seccomp \
      --disable-usb-redir --disable-guest-agent --disable-libiscsi \
      --disable-spice --disable-smartcard-nss --disable-vhost-net \
      --disable-docs --disable-attr --disable-cap-ng --disable-vde \
      --disable-user --disable-bluez --disable-vnc-ws --disable-xen \
      --disable-brlapi --enable-debug --target-list=i386-softmmu \
      --disable-fdt

  $ make -j64

  You can open two terminals and configure and build both simultaneously
  if you like.

  On my decent but very basic (2-core+HT) i3 box, -j64 actually works out - make doesn't actually launch too many gcc processes. You *will* see your system load spike to ~20 though :)
  (NB. Do. not. use. -j64. with. the. linux. kernel.)

  On my system, a single build with -j64 takes only about 35 seconds. C
  FTW. (Although this has increased to 1min20sec for more recent
  builds.)

  Most of the configure arguments remove functionality I'll never use
  (in this situation) and which will only slow down the build.

  Once QEMU is built, run qemu-system-i386 directly from where it has
  been built.

  $ /path/to/qemu-working/i386-softmmu/qemu-system-i386 ...
  $ /path/to/qemu-broken/i386-softmmu/qemu-system-i386 ...

  Again, the paths can be relative.

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1745312/+subscriptions


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug 1745312] Re: Regression report: Disk subsystem I/O failures/issues surfacing in DOS/early Windows [two separate issues: one bisected, one root-caused]
  2018-01-25  7:18 [Qemu-devel] [Bug 1745312] [NEW] Regression report: Disk subsystem I/O failures/issues surfacing in DOS/early Windows [two separate issues: one bisected, one root-caused] i336_
                   ` (4 preceding siblings ...)
  2019-08-01 18:19 ` Mdasoh Kyaeppd
@ 2019-12-23 21:35 ` John Snow
  2021-04-22  5:32 ` Thomas Huth
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: John Snow @ 2019-12-23 21:35 UTC (permalink / raw)
  To: qemu-devel

I will try to debug as time permits, but the priority of MS-DOS bugs is
not ... measurable with casual tools. However, there are a lot of other
IDE bugs on my plate that are very important, so I am hoping to grab a
bunch of IDE bugs at once, but no promises here.

Notably, our geometry detection is not very good; it's more than
possible we are misreporting values and confusing DOS. Our IDE disks are
also not very consistent about what standard of the spec they are trying
to emulate, so there are likely other problems there, too.

If you'd like to debug on your own, I'd recommend enabling tracing and
enabling some of the IDE trace points; some of them can be quite verbose
-- don't enable the data dumping ones. The control flow ones can
sometimes be informative: use them to guess when the guest OS got
confused, then walk your way back to a register read that would have
picked up some error bits, or detect busy-waits on registers that are
not changing and try to guess what the guest was waiting for.

https://github.com/qemu/qemu/blob/master/docs/devel/tracing.txt
https://github.com/qemu/qemu/blob/master/hw/ide/trace-events

Ignore the AHCI and ATAPI traces, and don't use the ide_data_* traces
unless you are booting a custom firmware that only performs a strict few
IO accesses -- otherwise you'll get flooded off the map.
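
One way to apply the advice above is an events file passed via
`-trace events=FILE`. The sketch below writes such a file; it assumes a
QEMU build with a trace backend enabled (for example `log`), that
glob patterns are accepted, and that a leading `-` disables a pattern.
Only the `ide_data_*` group is named in this thread; the other pattern
names are assumptions to check against hw/ide/trace-events for the
QEMU version in use.

```bash
#!/bin/bash

# Write a trace-events selection file to $1: enable the IDE events,
# then switch the noisy data-dump and ATAPI groups back off.
write_ide_trace_events() {
	cat > "$1" <<'EOF'
ide_*
-ide_data_*
-ide_atapi_*
EOF
}
```

A run could then look like
`qemu-system-i386 -M isapc -drive file=win31.img,format=raw -trace events=ide.events`
after calling `write_ide_trace_events ide.events`.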

** Changed in: qemu
     Assignee: (unassigned) => John Snow (jnsnow)

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1745312

Title:
  Regression report: Disk subsystem I/O failures/issues surfacing in
  DOS/early Windows [two separate issues: one bisected, one root-caused]

Status in QEMU:
  New

Bug description:
  [Headsup: This report is long-ish due to the amount of detail I've
  stumbled on along the way that I think is relevant to include. I can't
  speak as to the complexity of the actual bugs, but the size of this
  report should not suggest that the reproduction process is
  particularly headache-inducing.]

  Hi!

  I recently needed to fire up some ancient software for research
  purposes and got very distracted discovering and playing with old
  versions of Windows :). In the process I've discovered some glitches
  with disk I/O.

  I believe I've stumbled on two completely separate issues that
  coincidentally surfaced at the same time. It's possible that
  components of this report will be re-filed as more specific new bugs,
  but I'm not an authority on QEMU internals or how to narrow
  down/categorize what I've found.

  - The first bug only surfaces when the "isapc" machine type is used.
  It intermittently produces "General failure {read,writ}ing drive _"
  under MS-DOS 6.22, and also somehow interferes with early bootstrap of
  Windows NT 4 (in NTLDR). Enabling or disabling KVM (I'm on Linux)
  appears to make no difference whatsoever, which may help with
  debugging.

  - The second issue involves
    - a WinNT4 disk image
    - created by running through a bog-standard NT4 install inside QEMU 2.9.0
    - which will now fail to boot in any version of QEMU - even version 1.0
      - but which VirtualBox will boot fine
        - but only if I point VirtualBox at QEMU's raw disk image via a
          hacked-together VMDK file
        - if the raw image is converted to VHD(X), VirtualBox will also fail
          to boot the image with exactly the same error as QEMU
        - this state of affairs is not affected by image sparseness (which makes
          sense)

  I'm confident I've bisected the first issue.

  I wasn't able to bisect the second issue (as all tested versions of
  QEMU behaved identically), but I've figured out a working repro
  testcase and I believe I've managed to pin down a solid root cause.


  == #1: Intermittent I/O issues when `-M isapc` is used =====

  These symptoms sometimes take a small amount of time and fiddling to
  trigger, but I AM able to consistently surface them on my machine
  after a short while. (I am very very interested to hear if others
  cannot reproduce them.)

  So, first of all:

  https://github.com/qemu/qemu/commit/306ec6c3cece7004429c79c1ac93d49919f1f1cc
    (Jul 30 2013): the last version that works

  https://github.com/qemu/qemu/commit/e689f7c668cbd9d08f330e17c3dd3a059c9553d3
    (Oct 30 2013): the first version that intermittently fails

  Maybe lift out and build these branches while reading. *shrug*
  (How to do this can be found at the end of this report - along with a time-saving ./configure line, FWIW)

  Here are the changelists between these two revisions:

  https://github.com/qemu/qemu/compare/306ec6c...e689f7c
  (Compare direction: OLD to NEW) (Commits: 166  Files changed: 192)

  https://github.com/qemu/qemu/compare/e689f7c...306ec6c
  (Compare direction: NEW to OLD) (Commits: 30   Files changed: 22)

  (Someone else more familiar with Git might know why GitHub returns
  results for both compare directions, and/or if the 2nd link is useful
  information. The first link returns a lot more results than the 2nd
  one, at least. Does comparing new>old return deletions?)

  ---

  Now on to the symptoms. In a moment I'll describe reproduction.

  # MS-DOS 6.22

  The first symptom I discovered was that trivial read and write
  operations under MS-DOS would sometimes fail:

    C:\>echo test > hi

    General failure writing drive C
    Abort, Retry, Fail?

  Anything else that exercises the disk behaves similarly:

    C:\>dir /s > nul

    General failure reading drive C
    Abort, Retry, Fail?

  (Note that the above demonstrates both write and read failures)

  (Also, FWIW, `dir /s` == `ls -R`)

  The behavior of the I/O errors is not possible to characterise as it
  fluctuates so much. For example something as simple as DIR can produce
  wildly differing results: in one run, poking around with DIR ended
  with DOS deciding C:\ was empty at one point; at another point in a
  different run C:\ mysteriously dropped 50% of its contents only to
  magically gain it all back moments later after some poking around in
  one of the subdirectories that was still visible.

  The time it takes to trigger these errors is also highly variable.
  QEMU may fall over as early as hanging forever at "Starting MS-
  DOS...", or I might get all the way into Windows 3.1 before it
  triggers (in which case Win3.1 reports vague memory errors - of all
  things).

  Very occasionally I've seen _SeaBIOS itself_ report "Booting from Hard
  Disk..." "Boot failed: could not read the boot disk" ... "No bootable
  device.", and on one occasion I even got "Non-System disk or disk
  error" "Replace and strike any key when ready"!

  
  # WinNT 4 Terminal Server

  Most of the time, NTLDR will fire up normally. But every so often...

    SeaBIOS (version rel-1.7.3-117-g31b8b4e-20131206_080705-nilsson.home.kraxel.org)

    Booting from Hard Disk...
    A disk read error occurred.
    Insert a system diskette and restart
    the system.

  (NB. You're seeing the old SeaBIOS version included with e689f7c,
  which was the first buggy commit.)

  If NT gets past this point without erroring out (ie, it makes it to
  the boot menu), the rest of the system is 100% fine and there are no
  other disk I/O issues whatsoever. For example, on QEMU 2.9.0 I was
  able to enable disk compression, answer "Yes" to "Compress entire disk
  now?" and have the process fully complete. No hitches.

  This makes me wonder whether the problem is somehow related to LBA
  and/or Int 13h, or something in that neighborhood. (I'm woefully
  ignorant about such low-level details.) Perhaps DOS/Win3.1 are stuck
  using a disk access mode that QEMU implements buggily, while NT 4
  (once NTOSKRNL is up and running) is able to use a different disk
  mode or access mechanism.

  I'm really interested to get some understanding of what the root issue
  is here, when this is fixed. (I wonder if it's a timing thing?)

  I've observed some unusual behavior with repeated restarts. In one
  case, I attempted to start NT4 multiple times, and QEMU consistently
  failed with "No bootable device" each time. So, I removed `-M isapc`,
  promptly got a boot menu, hit ^C, readded `-M isapc` - and continued
  to get a boot menu. Yep. I'll accept "really really big coincidence"
  but I do very much wonder if something else is going on here. I've
  observed many similar incidents. It makes me wonder whether the
  contents of memory or some other system state is an influence. Very
  probably not, but still...


  -- Reproduction --------------------------------------------

  First of all, there was unfortunately no way for me to avoid having to
  post entire disk images, but I've managed to compress everything down
  to 174MB total download size.

  FWIW, WinWorld and many other sites seem to have no operational issues
  providing clear pointers to CD keys; I consider my distribution of my
  installed HDD images an extension of the apparent status quo.

  That being said, I've put everything on Google Drive so nobody has to
  headscratch about Launchpad/Canonical/etc's stance on hosting this
  data.

  So, this folder contains the disk images:
  https://drive.google.com/drive/folders/1WdZVBh5Trs9HLC186-nHyKeqaSxfyM2c
  ("Download all" at the top-right will create a ZIP file, but FWIW
  downloading the individual files simultaneously would implement a
  rough form of download acceleration)

  File meta info:

  Compressed
  |
  |      Apparent
  |      |    Actual
  |      |    |
  38M -> 200M (103M)  win31.img.xz
  82M -> 1G   (289M)  wnt4ts-broken.img.xz
  55M -> 350M (146M)  wnt4ts-intermittent.img.xz

  SHA-256s:

  win31:        8179b8180a2ab40bd472e8a2f3fb89fc331651e56923f94ceb9e52a78ee220d2
  broken:       a2af5f0bc49a063b75f534b6ffe5b82e32ecc706a64a425b6626feccf6e3fdfa  
  intermittent: 77ae8c458829ebcdd64c71042012f45d5a2788e6ebd22db9d53de9ef1a574784

  (Wanted to keep the checksum lines within 80 columns)
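
  A minimal sketch for checking a download against the hashes above.
  (It isn't stated above whether these hashes are of the .xz files or
  of the decompressed .img files, so it may be necessary to try both.)
  `check_sha256` is a made-up helper name; it just feeds one line to
  `sha256sum -c`:

```bash
#!/bin/bash
# Hypothetical helper: succeed only if FILE's SHA-256 equals EXPECTED.
check_sha256() {
	local file=$1 expected=$2
	# sha256sum -c reads "<hash>  <filename>" lines (two spaces) from stdin
	# and exits non-zero if any listed file does not match.
	printf '%s  %s\n' "$expected" "$file" | sha256sum -c
}

# Usage:
#   check_sha256 win31.img.xz <hash from the list above>
```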

  And, since I can't figure out where else in this report to put this,
  wnt4ts-broken.img's password is "admin" but something seems to have
  happened to the disk and NT doesn't actually boot properly :(, and
  wnt4ts-intermittent.img's password is "1234". (These were set up as
  test images. Now I'm _really_ glad I used simple passwords! :) )

  ---

  
  I have two testcases: DOS 6.22 (+ Windows 3.1), and Windows NT 4.

  
  # MS-DOS

  DOS is the simplest. It basically consists of

  $ qemu-system-i386 -drive file=win31.img,format=raw -M isapc -enable-kvm

  And then literally just playing around. Things to try include creating
  files (`echo blah > file`), repeatedly seeking across the entire FAT
  (`dir /s > nul` or `dir /s`), and launching Windows (`win`).

  win31.img is not special (as far as I can tell) and merely consists of
  the result of installing DOS 6.22 and Windows 3.1 from WinWorldPC.
  I've basically just included the image for convenience.

  Generally no single "run" is immune to starting Win3.1 and then
  launching File Manager; if that doesn't generate an error, something
  is definitely up.

  The second best trigger is creating new files. That very very
  frequently produces "General Failure ...", but not always.

  
  # WinNT 4

  Windows NT 4 is a bit more complicated. Because this error only occurs
  at presumably a single small point very early in boot, the window of
  opportunity for the glitch to surface within is much much narrower and
  thus often requires a larger number of tries.

  Anecdotally I've had QEMU hit the boot error at the first try/run, and
  after as many as 63 "successful" boots.

  I made a small test harness that automates the launch process. It
  consists of two shell scripts and requires tmux (and netcat).
  (*Potential epilepsy warning*: if you use a light-colored terminal
  background, the terminal QEMU is repeatedly invoked from will
  continuously flash rapidly from white to black.)

  One of the scripts is run inside a tmux session in one terminal, while
  the other script is run in its own terminal (without any tmux).

  
  I named this one `run-qemu-loop`:

  --8<--------------------------------------------------------

  #!/bin/bash

  # ---

  qemu=/path/to/qemu-system-i386
  #or, alternatively: (I used the following line myself so I
  #could tab-complete my way to different qemu executables)
  #qemu="$1"

  disk=/path/to/wnt4ts-intermittent.img

  # ---

  port=4444

  rm -f STOP itercount

  itercount=0

  while :; do
  	
  	[ -f STOP ] && break
  	
  	((itercount++))
  	echo $itercount > itercount
  	
  	$qemu \
  		-enable-kvm -vga cirrus -curses -M isapc \
  		-drive file="$disk",format=raw \
  		-chardev socket,id=mon0,host=localhost,port=$port,server,nowait \
  		-mon chardev=mon0,mode=readline
  	
  	#point to an otherwise-unused terminal if you like (see also: `tty`)
  	#echo "$itercount run(s)" > /dev/pts/__
  	
  done

  ------------------------------------------------------------

  Not much logic above; this just repeatedly runs QEMU for as long as
  the file `STOP` does not exist in the current directory.

  The key "magic" bit is that QEMU is launched in -curses mode.

  The other key bit is that the above script is run inside tmux.

  
  Here's `tmux-ctl-loop`:

  --8<--------------------------------------------------------

  #!/bin/bash

  port=4444

  tmux=./tmux

  printf -v l '%0.0s-' {0..25}
  h1="$l/ buffer dump begin \\$l"
  h2="$l-\\ buffer dump end /-$l"

  while :; do
  	
  	while :; do
  		echo | nc localhost $port -q0 -w1 > /dev/null && break
  		echo 'Start qemu!'
  	done
  	
  	buffer="$(tmux -S $tmux capture-pane; tmux -S $tmux save-buffer -)"
  	
  	echo "$h1"
  	[[ "$buffer" ]] && echo "$buffer" || echo '( * Screen buffer is empty * )'
  	echo "$h2"
  	
  	if echo "$buffer" | grep -q 'A disk read error occurred.'; then
  		
  		s="<Crashed after $(< itercount) runs.>"
  		echo "$s"
  		echo "$s" >> stats
  		
  		touch STOP
  		
  		#echo q | nc localhost $port -q0 > /dev/null
  		
  		exit
  		
  	elif echo "$buffer" | grep -q 'OS Loader V4.00'; then
  			
  		echo '<Booted successfully, trying again>'
  		
  		echo q | nc localhost $port -q0 > /dev/null
  		
  	else
  		
  		echo '<Waiting for boot>'
  		
  	fi
  			
  done

  ------------------------------------------------------------

  Nothing particularly amazing going on here either.

  While `run-qemu-loop` is running inside tmux in the first terminal,
  this is running in the 2nd one.

  The small infinite loop at the top only breaks when it can
  successfully ping QEMU and it knows it's running.

  Then, a screendump of the contents of the terminal QEMU is in is
  fetched from tmux, and the buffer's content is analyzed.

  - If NTLDR fails, the script creates `STOP` to halt run-qemu-loop,
    appends the run count to `stats`, and exits. (The line that would send
    `q` to QEMU is commented out in this branch, so QEMU stays running
    with the error on screen.)

  - If NTLDR loads successfully, the script sends `q` to QEMU and continues
    looping. (run-qemu-loop will not find the `STOP` file, and so restarts qemu.)

  The scripts run very quickly, with 2-3 iterations per second on my i3
  box.
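
  The decision logic described above can be pulled into one function,
  which makes it easy to try against saved screen dumps without running
  QEMU at all. This is only a minimal sketch: `classify_screen` and its
  three output words are names made up here, while the matched strings
  are the ones tmux-ctl-loop greps for.

```bash
#!/bin/bash
# Hypothetical helper: classify a captured screen (passed as $1).
# Prints "failed", "booted" or "waiting". The error is checked first,
# in the same order as the if/elif chain in tmux-ctl-loop.
classify_screen() {
	case $1 in
		*'A disk read error occurred.'*) echo failed ;;
		*'OS Loader V4.00'*)             echo booted ;;
		*)                               echo waiting ;;
	esac
}

# Usage inside the loop:
#   state=$(classify_screen "$buffer")
```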


  # Usage

  Save the two scripts above to the same directory as wnt4ts-intermittent.img,
  then:

  - (If port 4444 doesn't work, the value needs to be changed in both
  scripts.)

  - In the first terminal, run `tmux -S <file>`, where <file> names the socket
    tmux will use. This needs to match "tmux=" at the top of `tmux-ctl-loop`.
    (with `tmux=./tmux`, the command would be `tmux -S tmux`)

  - Still in the first terminal (and now also inside tmux), type
    `./run-qemu-loop`, passing the path to qemu if you're using that approach
    (refer to the first few lines of the script). Don't hit enter yet.

  - Now, in the 2nd terminal, type `./tmux-ctl-loop`

  - Hit enter in both terminals.
   

  Rationale for timing of Enter key:

  - Running run-qemu-loop first will start QEMU, and if NTLDR starts
    successfully its boot menu will immediately begin counting down from 30.
    If NT actually starts to boot and is then hard-shut-down this /may/
    affect the disk image.

  - tmux-ctl-loop will annoyingly spam a continuous stream of 'Start qemu!' until
    run-qemu-loop is running.

  - Starting both scripts at "more or less" the same time (no rush) works out
    well.

  
  Hopefully potential script modifications are obvious; for example

  - changing tmux-ctl-loop to not send 'q' to qemu so you can connect to the HMP
    yourself
    (NB, if `STOP` is not created, when qemu finally exits it will of course
    promptly be relaunched)

  - pointing run-qemu-loop to a modified qemu binary


  == #2: QEMU-vs-VirtualBox image issue ======================

  I was initially completely stumped by this issue, perhaps
  unsurprisingly so. :)

  wnt4ts-broken.img is a perfectly ordinary NT 4 installation that was
  created in QEMU 2.9.0. I created a 1GB disk with `truncate`, picked
  NTFS and installed everything (which took a while).

  NT setup reboots a number of times during the install process, and IIRC
  those all went just fine. However, at some point, the image began to
  consistently bomb out with "A disk read error occurred. ...", and
  stubbornly refused to boot, regardless of the number of boot attempts
  I tried.

  QEMU 2.0.0, 1.5.0, and 1.0 (the earliest version I was able to build
  on my system) all consistently hit "disk read error occurred".

  I tried compiling QEMU 1.0 using clang so I could build for 32-bit on
  my 64-bit system (GCC 7 died with "Frame pointer required, but
  reserved"). The resulting qemu completely crashed if I didn't enable
  KVM (ie, TCG was (understandably) broken); with KVM enabled qemu
  didn't crash, but NTLDR halted with the same error as on 64-bit qemu.
  (TL;DR, no difference whatsoever.)

  My initial reaction at this point was to try the image on another
  virtualization platform. My first pick was VirtualBox.

  So, I followed the official instructions for pointing VirtualBox to
  physical disk images, except I substituted a /dev/loopN device I'd
  pointed to the image file via losetup.

  And... VirtualBox picked the image up fine and Just Worked(TM). Yay! -
  but not yay. What gives?!

  Confused, I then tried to convert the disk image to VHD format.
  Unfortunately, for some reason, if I try `qemu-img convert ... -O
  vhdx ...`, VirtualBox chokes on the result:

  -----

  VD: error VERR_NOT_SUPPORTED opening image file
  '/.../wnt4ts-broken-qemuconv.vhd' (VERR_NOT_SUPPORTED).

  Result Code: NS_ERROR_FAILURE (0x80004005)
  Component: MediumWrap
  Interface: IMedium {4afe423b-43e0-e9d0-82e8-ceb307940dda}
  Callee: IVirtualBox {0169423f-46b4-cde9-91af-1e9d5b6cd945}
  Callee RC: VBOX_E_OBJECT_NOT_FOUND (0x80BB0001)

  -----

  Welp.

  Well, a bit more digging later, and I found I could do

  $ VBoxManage convertfromraw wnt4ts-broken.img wnt4ts-broken.vhd

  but... as soon as I pointed VirtualBox to this, it too began to choke
  with "A disk read error occurred".

  And yet, the VMDK->raw image setup worked just fine.

  I found I could even replace the loop device with the path of the .img
  file itself and that worked just fine too.

  At my wits' end, I followed some online instructions to learn about
  manual CHS configuration so I could try and get the image working in
  Bochs. "A disk read error occurred". I wasn't surprised.

  It was at this point I began to give up, but I decided to try One Last
  Thing(TM) before properly throwing in the towel.

  :)

  I decided to learn a bit more about how `VBoxManage internalcommands
  createrawvmdk` worked, and try one thing in particular: I can edit the
  .vmdk file, but can I point `createrawvmdk` at the .img file directly
  too?

  Turns out, yes you can.

  It also turns out that this promptly caused VirtualBox to bomb out.

  Interesting.

  For reference, here's the VMDK file I initially created (by pointing
  `createrawvmdk` at /dev/loopN) and then later edited to point straight
  to the .img file, with both approaches resulting in successful boot.

  --8<--------------------------------------------------------

  # Disk DescriptorFile
  version=1
  CID=e35b9a45
  parentCID=ffffffff
  createType="fullDevice"

  # Extent description
  RW 1536000 FLAT "/absolute/full/path/to/wnt4ts-broken.img" 0

  # The disk Data Base 
  #DDB

  ddb.virtualHWVersion = "4"
  ddb.adapterType="ide"
  ddb.geometry.cylinders="1523"
  ddb.geometry.heads="16"
  ddb.geometry.sectors="63"
  ddb.uuid.image="871a6044-c8ca-48ed-b7aa-e6fc49da3db4"
  ddb.uuid.parent="00000000-0000-0000-0000-000000000000"
  ddb.uuid.modification="3661715c-3906-4e4a-ab65-486d140e03b8"
  ddb.uuid.parentmodification="00000000-0000-0000-0000-000000000000"
  ddb.geometry.biosCylinders="761"
  ddb.geometry.biosHeads="32"
  ddb.geometry.biosSectors="63"

  ------------------------------------------------------------

  
  Here's the _diff_ of what happens if I point `createrawvmdk` at wnt4ts-broken.img directly:

  --8<--------------------------------------------------------

  ddb.geometry.cylinders="2080"
  ddb.geometry.heads="16"
  ddb.geometry.sectors="63"

  ------------------------------------------------------------

  :D

  Naturally,

  $ qemu-system-i386 \
      -drive file=wnt4ts-broken.img,format=raw,cyls=1523,heads=16,secs=63 \
      -M isapc -sdl

  will boot happily on 2.9.0 (notwithstanding the occasional "disk read
  error occurred" documented above).

  It will also boot in 1.6.0.

  (POTENTIAL BUG HEADSUP: 1.0 and 1.5.0 both lock up with a blank
  640x480 window and use 0% CPU if I specify `-M isapc`.)

  And, of course, using these CHS values in Bochs also results in
  successful boot as well (after setting the CPU type to pentium).

  Unfortunately, I have no idea what sequence of events caused the
  creation of the VMDK file above. No invocation of `createrawvmdk` is
  producing a VMDK file with the CHS settings above.
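
  For what it's worth, both cylinder counts are consistent with simply
  dividing a disk size by heads x sectors and rounding down: 1523 falls
  out of the 1536000-sector extent listed in the working VMDK (750 MiB
  at 512 bytes per sector), while 2080 falls out of a 1 GiB file. Below
  is a minimal sketch of that arithmetic. It assumes 512-byte sectors
  and that this is roughly how the cylinder count gets derived, and
  `chs_cylinders` is a made-up name.

```bash
#!/bin/bash
# Hypothetical helper: cylinders = floor(sectors / (heads * sectors_per_track)).
chs_cylinders() {
	local total_sectors=$1 heads=$2 spt=$3
	echo $(( total_sectors / (heads * spt) ))
}

chs_cylinders 1536000 16 63                          # VMDK extent: prints 1523
chs_cylinders $(( 1024 * 1024 * 1024 / 512 )) 16 63  # 1 GiB image: prints 2080
```

  If that reading is right, whatever `createrawvmdk` was originally
  pointed at was 750 MiB in size at the time, which might be a lead on
  where the 1523 came from.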

  I've only just begun to learn about the intricacies of CHS. Am I to
  understand that these values are stored amongst the first 512 bytes of
  the disk? If this is the case, then I wonder what changed the data,
  and why. I was initially only using QEMU 2.9.0, and didn't move the
  image to different VMs or QEMU versions. Perhaps Windows NT got
  confused about the disk CHS and rewrote it?
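
  On the first-512-bytes question: as far as I know, the MBR doesn't
  store the disk geometry as such. Each of its four partition entries
  (16 bytes each, starting at offset 446) does carry start and end
  addresses in packed C/H/S form, and these were computed with whatever
  geometry was in effect when the disk was partitioned. Below is a
  minimal sketch for dumping those fields from an image. `mbr_chs` is a
  made-up name, and it only looks at the four primary entries.

```bash
#!/bin/bash
# Hypothetical helper: print the CHS fields of each used MBR partition entry.
mbr_chs() {
	local img=$1 i off
	for i in 0 1 2 3; do
		off=$(( 446 + 16 * i ))
		# Load the 16-byte entry into $1..$16 as unsigned decimal bytes.
		set -- $(od -An -v -tu1 -j "$off" -N 16 "$img")
		# $5 is the partition type; 0 means the entry is unused.
		[ "${5:-0}" -eq 0 ] && continue
		# Head is one byte; sector is the low 6 bits of the next byte, whose
		# top 2 bits are bits 8-9 of the cylinder; the third byte is bits 0-7.
		printf 'part %d: type=0x%02x start C/H/S=%d/%d/%d end C/H/S=%d/%d/%d\n' \
			"$(( i + 1 ))" "$5" \
			"$(( (($3 & 0xc0) << 2) | $4 ))" "$2" "$(( $3 & 0x3f ))" \
			"$(( (($7 & 0xc0) << 2) | $8 ))" "$6" "$(( $7 & 0x3f ))"
	done
}

# Usage:
#   mbr_chs wnt4ts-broken.img
```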

  
  == Sporadic BIOS-level boot failure ========================

  I have multiple screenshots of SeaBIOS in QEMU 2.9.0 halting with "No
  bootable device" (et al), even with the above manually-applied CHS
  settings.

  Commit e689f7c also presents such errors.

  Commit 306ec6c does not suffer from intermittent breakage of any kind:

  - No SeaBIOS flake-outs
  - No "Non-system disk or disk error"
  - No "A disk read error occurred"
  - No "General failure ..."

  While most of my confidence in commit 306ec6c is based on anecdotal
  evidence, I modified `tmux-ctl-loop` a little to soak-test BIOS-level
  I/O stability and left this modified version running for a few
  minutes.

  --8<--------------------------------------------------------

  #!/bin/bash

  port=4444

  tmux=./tmux

  printf -v l '%0.0s-' {0..25}
  h1="$l/ buffer dump begin \\$l"
  h2="$l-\\ buffer dump end /-$l"

  while :; do
  	
  	while :; do
  		echo | nc localhost $port -q0 -w1 > /dev/null && break
  		echo 'Start qemu!'
  	done
  	
  	buffer="$(tmux -S $tmux capture-pane; tmux -S $tmux save-buffer -)"
  	
  	echo "$h1"
  	[[ "$buffer" ]] && echo "$buffer" || echo '( * Screen buffer is empty * )'
  	echo "$h2"
  	
  	if echo "$buffer" | grep -q 'Non-system disk' || echo "$buffer" | \
  		grep -q 'No bootable device'
  	then
  		
  		s="<Hit error after $(< itercount) runs.>"
  		echo "$s"
  		echo "$s" >> stats
  		
  		touch STOP
  		
  		#echo q | nc localhost $port -q0 > /dev/null
  		
  		exit
  		
  	elif echo "$buffer" | grep -q 'OS Loader V4.00' || echo "$buffer" | \
  		grep -q 'A disk read error'
  	then
  	
  		echo '<Boot did not hang at BIOS, trying again>'
  		
  		echo q | nc localhost $port -q0 > /dev/null
  		
  	else
  		
  		echo '<Waiting for boot>'
  		
  	fi
  			
  done

  ------------------------------------------------------------

  For the above to work, the top of run-qemu-loop must also be modified
  to read something along the lines of

  disk=/path/to/wnt4ts-broken.img,format=raw,cyls=1523,heads=16,secs=63

  (Suggestion: modify copies of both scripts)

  One small terminal-flicker-headache (and a 57°C CPU) later, I was able
  to carefully observe just over 350 successful runs in which QEMU
  commit 306ec6c only ever produced a boot menu. No other hitches.

  ** Important: **

  However, commit 306ec6c will fail to boot, ever, if the CHS geometry
  is not set to the values VirtualBox "discovered". (Of note
  is the fact that QEMU (2.9.0) was what initially created this image. I
  must admit that I don't remember what sequence of QEMU versions I fed
  the image to - and I maybe, possibly, didn't think to back the file up
  (sorry), so maybe something mangled something somewhere. But
  VirtualBox figured it out nonetheless!)

  Furthermore, feeding /dev/loopN to any QEMU version will NOT result in
  correct CHS discovery (and successful boot).

  This is what leads me to conclude that I've discovered two separate
  issues.


  == Appendix: How to build the branches =====================

  It's very simple.

  First, `git clone https://github.com/qemu/qemu` somewhere if you don't
  already have a local copy. If you have an old git checkout that's from
  2014 or later, you can use that old checkout instead. (If you want to
  test an old checkout you have, the commands below will either work
  perfectly or completely bomb out with no side effects.)

  A full checkout is a ~183MB download. Sorry.

  Next, create two new directories somewhere. Name them what you like,
  eg `qemu-working` and `qemu-broken`.

  Now, cd into the checkout directory, and run:

  $ git archive 306ec6c3cece7004429c79c1ac93d49919f1f1cc | tar xC /path/to/qemu-working/

  $ git archive e689f7c668cbd9d08f330e17c3dd3a059c9553d3 | tar xC /path/to/qemu-broken/

  The paths can be relative.

  Now, run this in both of the new directories:

  $ ./configure --python=python2.7 --disable-libssh2 --disable-seccomp \
      --disable-usb-redir --disable-guest-agent --disable-libiscsi \
      --disable-spice --disable-smartcard-nss --disable-vhost-net \
      --disable-docs --disable-attr --disable-cap-ng --disable-vde \
      --disable-user --disable-bluez --disable-vnc-ws --disable-xen \
      --disable-brlapi --enable-debug --target-list=i386-softmmu \
      --disable-fdt

  $ make -j64

  You can open two terminals and configure and build both simultaneously
  if you like.

  On my decent but very basic (2-core+HT) i3 box, -j64 actually works
  out - make doesn't actually launch too many gcc processes. You *will*
  see your system load spike to ~20 though :)
  (NB. Do. not. use. -j64. with. the. linux. kernel.)

  On my system, a single build with -j64 takes only about 35 seconds. C
  FTW. (Although this has increased to 1min20sec for more recent
  builds.)

  Most of the configure arguments remove functionality I'll never use
  (in this situation) and which will only slow down the build.

  Once QEMU is built, run qemu-system-i386 directly from where it has
  been built.

  $ /path/to/qemu-working/i386-softmmu/qemu-system-i386 ...
  $ /path/to/qemu-broken/i386-softmmu/qemu-system-i386 ...

  Again, the paths can be relative.

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1745312/+subscriptions



* [Bug 1745312] Re: Regression report: Disk subsystem I/O failures/issues surfacing in DOS/early Windows [two separate issues: one bisected, one root-caused]
  2018-01-25  7:18 [Qemu-devel] [Bug 1745312] [NEW] Regression report: Disk subsystem I/O failures/issues surfacing in DOS/early Windows [two separate issues: one bisected, one root-caused] i336_
                   ` (5 preceding siblings ...)
  2019-12-23 21:35 ` John Snow
@ 2021-04-22  5:32 ` Thomas Huth
  2021-04-29  9:52 ` Thomas Huth
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Thomas Huth @ 2021-04-22  5:32 UTC (permalink / raw)
  To: qemu-devel

The QEMU project is currently considering moving its bug tracking to
another system. For this we need to know which bugs are still valid
and which could be closed already. Thus we are setting older bugs to
"Incomplete" now.

If you still think this bug report here is valid, then please switch
the state back to "New" within the next 60 days, otherwise this report
will be marked as "Expired". Or please mark it as "Fix Released" if
the problem has been solved with a newer version of QEMU already.

Thank you and sorry for the inconvenience.

** Changed in: qemu
       Status: New => Incomplete

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1745312

Title:
  Regression report: Disk subsystem I/O failures/issues surfacing in
  DOS/early Windows [two separate issues: one bisected, one root-caused]

Status in QEMU:
  Incomplete

Bug description:
  [Headsup: This report is long-ish due to the amount of detail I've
  stumbled on along the way that I think is relevant to include. I can't
  speak as to the complexity of the actual bugs, but the size of this
  report should not suggest that the reproduction process is
  particularly headache-inducing.]

  Hi!

  I recently needed to fire up some ancient software for research
  purposes and got very distracted discovering and playing with old
  versions of Windows :). In the process I've discovered some glitches
  with disk I/O.

  I believe I've stumbled on two completely separate issues that
  coincidentally surfaced at the same time. It's possible that
  components of this report will be re-filed as more specific new bugs,
  but I'm not an authority on QEMU internals or how to narrow
  down/categorize what I've found.

  - The first bug only surfaces when the "isapc" machine type is used.
  It intermittently produces "General failure {read,writ}ing drive _"
  under MS-DOS 6.22, and also somehow interferes with early bootstrap of
  Windows NT 4 (in NTLDR). Enabling or disabling KVM (I'm on Linux)
  appears to make no difference whatsoever, which may help with
  debugging.

  - The second issue involves
    - a WinNT4 disk image
    - created by running through a bog-standard NT4 install inside QEMU 2.9.0
    - which will now fail to boot in any version of QEMU - even version 1.0
      - but which VirtualBox will boot fine
        - but only if I point VirtualBox at QEMU's raw disk image via a
          hacked-together VMDK file
        - if the raw image is converted to VHD(X), VirtualBox will also fail
          to boot the image with exactly the same error as QEMU
        - this state of affairs is not affected by image sparseness (which makes
          sense)

  I'm confident I've bisected the first issue.

  I wasn't able to bisect the second issue (as all tested versions of
  QEMU behaved identically), but I've figured out a working repro
  testcase and I believe I've managed to pin down a solid root cause.


  == #1: Intermittent I/O issues when `-M isapc` is used =====

  These symptoms sometimes take a small amount of time and fiddling to
  trigger, but I AM able to consistently surface them on my machine
  after a short while. (I am very very interested to hear if others
  cannot reproduce them.)

  So, first of all:

  https://github.com/qemu/qemu/commit/306ec6c3cece7004429c79c1ac93d49919f1f1cc
    (Jul 30 2013): the last version that works

  https://github.com/qemu/qemu/commit/e689f7c668cbd9d08f330e17c3dd3a059c9553d3
    (Oct 30 2013): the first version that intermittently fails

  Maybe lift out and build these branches while reading. *shrug*
  (How to do this can be found at the end of this report - along with a time-saving ./configure line, FWIW)

  Here are the changelists between these two revisions:

  https://github.com/qemu/qemu/compare/306ec6c...e689f7c
  (Compare direction: OLD to NEW) (Commits: 166  Files changed: 192)

  https://github.com/qemu/qemu/compare/e689f7c...306ec6c
  (Compare direction: NEW to OLD) (Commits: 30   Files changed: 22)

  (Someone else more familiar with Git might know why GitHub returns
  results for both compare directions, and/or if the 2nd link is useful
  information. The first link returns a lot more results than the 2nd
  one, at least. Does comparing new>old return deletions?)

  ---

  Now on to the symptoms. In a moment I'll describe reproduction.

  # MS-DOS 6.22

  The first symptom I discovered was that trivial read and write
  operations under MS-DOS would sometimes fail:

    C:\>echo test > hi

    General failure writing drive C
    Abort, Retry, Fail?

  Anything else that exercises the disk behaves similarly:

    C:\>dir /s > nul

    General failure reading drive C
    Abort, Retry, Fail?

  (Note that the above demonstrates both write and read failures)

  (Also, FWIW, `dir /s` == `ls -R`)

  The behavior of the I/O errors is not possible to characterise as it
  fluctuates so much. For example something as simple as DIR can produce
  wildly differing results: in one run, poking around with DIR ended
  with DOS deciding C:\ was empty at one point; at another point in a
  different run C:\ mysteriously dropped 50% of its contents only to
  magically gain it all back moments later after some poking around in
  one of the subdirectories that was still visible.

  The time it takes to trigger these errors is also highly variable.
  QEMU may fall over as early as hanging forever at "Starting
  MS-DOS...", or I might get all the way into Windows 3.1 before it
  triggers (in which case Win3.1 reports vague memory errors - of all
  things).

  Very occasionally I've seen _SeaBIOS itself_ report "Booting from Hard
  Disk..." "Boot failed: could not read the boot disk" ... "No bootable
  device.", and on one occasion I even got "Non-System disk or disk
  error" "Replace and strike any key when ready"!

  
  # WinNT 4 Terminal Server

  Most of the time, NTLDR will fire up normally. But every so often...

    SeaBIOS (version rel-1.7.3-117-g31b8b4e-20131206_080705-nilsson.home.kraxel.org)

    Booting from Hard Disk...
    A disk read error occurred.
    Insert a system diskette and restart
    the system.

  (NB. You're seeing the old SeaBIOS version included with e689f7c,
  which was the first buggy commit.)

  If NT gets past this point without erroring out (ie, it makes it to
  the boot menu), the rest of the system is 100% fine and there are no
  other disk I/O issues whatsoever. For example, on QEMU 2.9.0 I was
  able to enable disk compression, answer "Yes" to "Compress entire disk
  now?" and have the process fully complete. No hitches.

  This makes me wonder whether the problem is somehow related to LBA
  and/or Int 13h, or something in that neighborhood. (I'm woefully
  ignorant about such low-level details.) Perhaps DOS/Win3.1 are stuck
  using a disk access mode that QEMU implements buggily, while NT 4
  (once NTOSKRNL is up and running) is able to use a different disk
  mode or access mechanism.

  I'm really interested to get some understanding of what the root issue
  is here, when this is fixed. (I wonder if it's a timing thing?)

  I've observed some unusual behavior with repeated restarts. In one
  case, I attempted to start NT4 multiple times, and QEMU consistently
  failed with "No bootable device" each time. So, I removed `-M isapc`,
  promptly got a boot menu, hit ^C, readded `-M isapc` - and continued
  to get a boot menu. Yep. I'll accept "really really big coincidence"
  but I do very much wonder if something else is going on here. I've
  observed many similar incidents. It makes me wonder whether the
  contents of memory or some other system state is an influence. Very
  probably not, but still...


  -- Reproduction --------------------------------------------

  First of all, there was unfortunately no way for me to avoid having to
  post entire disk images, but I've managed to compress everything down
  to 174MB total download size.

  FWIW, WinWorld and many other sites seem to have no operational issues
  providing clear pointers to CD keys; I consider my distribution of my
  installed HDD images an extension of the apparent status quo.

  That being said, I've put everything on Google Drive so nobody has to
  headscratch about Launchpad/Canonical/etc's stance on hosting this
  data.

  So, this folder contains the disk images:
  https://drive.google.com/drive/folders/1WdZVBh5Trs9HLC186-nHyKeqaSxfyM2c
  ("Download all" at the top-right will create a ZIP file, but FWIW
  downloading the individual files simultaneously would implement a
  rough form of download acceleration)

  File meta info:

  Compressed
  |
  |      Apparent
  |      |    Actual
  |      |    |
  38M -> 200M (103M)  win31.img.xz
  82M -> 1G   (289M)  wnt4ts-broken.img.xz
  55M -> 350M (146M)  wnt4ts-intermittent.img.xz

  SHA-256s:

  win31:        8179b8180a2ab40bd472e8a2f3fb89fc331651e56923f94ceb9e52a78ee220d2
  broken:       a2af5f0bc49a063b75f534b6ffe5b82e32ecc706a64a425b6626feccf6e3fdfa  
  intermittent: 77ae8c458829ebcdd64c71042012f45d5a2788e6ebd22db9d53de9ef1a574784

  (Wanted to keep the checksum lines within 80 columns)

  And, since I can't figure out where else in this report to put this,
  wnt4ts-broken.img's password is "admin" but something seems to have
  happened to the disk and NT doesn't actually boot properly :(, and
  wnt4ts-intermittent.img's password is "1234". (These were set up as
  test images. Now I'm _really_ glad I used simple passwords! :) )

  ---

  
  I have two testcases: DOS 6.22 (+ Windows 3.1), and Windows NT 4.

  
  # MS-DOS

  DOS is the simplest. It basically consists of

  $ qemu-system-i386 -drive file=win31.img,format=raw -M isapc -enable-kvm

  And then literally just playing around. Things to try include creating
  files (`echo blah > file`), repeatedly seeking across the entire FAT
  (`dir /s > nul` or `dir /s`), and launching Windows (`win`).

  win31.img is not special (as far as I can tell) and merely consists of
  the result of installing DOS 6.22 and Windows 3.1 from WinWorldPC.
  I've basically just included the image for convenience.

  Generally no single "run" is immune to starting Win3.1 and then
  launching File Manager; if that doesn't generate an error, something
  is definitely up.

  The second best trigger is creating new files. That very very
  frequently produces "General Failure ...", but not always.

  
  # WinNT 4

  Windows NT 4 is a bit more complicated. Because this error only occurs
  at presumably a single small point very early in boot, the window of
  opportunity for the glitch to surface within is much much narrower and
  thus often requires a larger number of tries.

  Anecdotally I've had QEMU hit the boot error at the first try/run, and
  after as many as 63 "successful" boots.

  I made a small test harness that automates the launch process. It
  consists of two shell scripts and requires tmux (and netcat).
  (*Potential epilepsy warning*: if you use a light-colored terminal
  background, the terminal QEMU is repeatedly invoked from will
  continuously flash rapidly from white to black.)

  One of the scripts is run inside a tmux session in one terminal, while
  the other script is run in its own terminal (without any tmux).

  
  I named this one `run-qemu-loop`:

  --8<--------------------------------------------------------

  #!/bin/bash

  # ---

  qemu=/path/to/qemu-system-i386
  #or, alternatively: (I used the following line myself so I
  #could tab-complete my way to different qemu executables)
  #qemu="$1"

  disk=/path/to/wnt4ts-intermittent.img

  # ---

  port=4444

  rm -f STOP itercount

  itercount=0

  while :; do
  	
  	[ -f STOP ] && break
  	
  	((itercount++))
  	echo $itercount > itercount
  	
  	$qemu \
  		-enable-kvm -vga cirrus -curses -M isapc \
  		-drive file="$disk",format=raw \
  		-chardev socket,id=mon0,host=localhost,port=$port,server,nowait \
  		-mon chardev=mon0,mode=readline
  	
  	#point to an otherwise-unused terminal if you like (see also: `tty`)
  	#echo "$itercount run(s)" > /dev/pts/__
  	
  done

  ------------------------------------------------------------

  Not much logic above; this just repeatedly runs QEMU for as long as
  the file `STOP` does not exist in the current directory.

  The key "magic" bit is that QEMU is launched in -curses mode.

  The other key bit is that the above script is run inside tmux.
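
  The stop-file pattern the script relies on can be sketched in
  isolation as follows. This is a simplified illustration, not the
  harness itself: `run_until_stop` and `fake_qemu` are made-up names,
  and the stand-in command takes the place of the real QEMU invocation.

```shell
# run_until_stop STOPFILE CMD...: run CMD repeatedly until STOPFILE exists,
# passing the iteration number as CMD's last argument; print the final count.
run_until_stop() {
    local stop_file count
    stop_file=$1
    shift
    count=0
    rm -f "$stop_file"
    while [ ! -f "$stop_file" ]; do
        count=$((count + 1))
        "$@" "$count"
    done
    echo "$count"
}

# Demo with a stand-in for QEMU that "fails" on its third run.
demo_stop=$(mktemp -u)
fake_qemu() { [ "$1" -lt 3 ] || touch "$demo_stop"; }
run_until_stop "$demo_stop" fake_qemu    # prints 3
rm -f "$demo_stop"
```

  Because the only shared state is a file, the watcher can live in a
  completely separate terminal (or process tree) from the loop.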

  
  Here's `tmux-ctl-loop`:

  --8<--------------------------------------------------------

  #!/bin/bash

  port=4444

  tmux=./tmux

  printf -v l '%0.0s-' {0..25}
  h1="$l/ buffer dump begin \\$l"
  h2="$l-\\ buffer dump end /-$l"

  while :; do
  	
  	while :; do
  		echo | nc localhost $port -q0 -w1 > /dev/null && break
  		echo 'Start qemu!'
  	done
  	
  	buffer="$(tmux -S $tmux capture-pane; tmux -S $tmux save-buffer -)"
  	
  	echo "$h1"
  	[[ "$buffer" ]] && echo "$buffer" || echo '( * Screen buffer is empty * )'
  	echo "$h2"
  	
  	if echo "$buffer" | grep -q 'A disk read error occurred.'; then
  		
  		s="<Crashed after $(< itercount) runs.>"
  		echo "$s"
  		echo "$s" >> stats
  		
  		touch STOP
  		
  		#echo q | nc localhost $port -q0 > /dev/null
  		
  		exit
  		
  	elif echo "$buffer" | grep -q 'OS Loader V4.00'; then
  			
  		echo '<Booted successfully, trying again>'
  		
  		echo q | nc localhost $port -q0 > /dev/null
  		
  	else
  		
  		echo '<Waiting for boot>'
  		
  	fi
  			
  done

  ------------------------------------------------------------

  Nothing particularly amazing going on here either.

  While `run-qemu-loop` is running inside tmux in the first terminal,
  this is running in the 2nd one.

  The small infinite loop at the top only breaks when it can
  successfully ping QEMU and it knows it's running.

  Then, a screendump of the contents of the terminal QEMU is in is
  fetched from tmux, and the buffer's content is analyzed.

  - If NTLDR fails, the script creates `STOP` to halt run-qemu-loop and then
    exits. (The line that would send `q` to QEMU is commented out in this
    branch, so QEMU is left running.)

  - If NTLDR loads successfully, the script sends `q` to QEMU and continues
    looping. (run-qemu-loop will not find the `STOP` file, and so restarts qemu.)
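
  The decision logic described above boils down to a three-way
  classification of the captured screen text. A minimal sketch, using
  the same two marker strings as the script (`classify_buffer` is a
  made-up name):

```shell
# classify_buffer TEXT: print "error", "booted" or "waiting" depending on
# which marker string (if any) appears in the captured screen text.
classify_buffer() {
    if printf '%s\n' "$1" | grep -qF 'A disk read error occurred'; then
        echo error
    elif printf '%s\n' "$1" | grep -qF 'OS Loader V4.00'; then
        echo booted
    else
        echo waiting
    fi
}

classify_buffer 'Booting from Hard Disk...'    # prints waiting
```

  `grep -F` treats the markers as fixed strings, so the dots in them
  are matched literally.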

  The scripts run very quickly, with 2-3 iterations per second on my i3
  box.
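
  Since tmux-ctl-loop appends one "<Crashed after N runs.>" line to
  `stats` per failure, repeated sessions can be summarized afterwards.
  A minimal sketch, assuming that line format and a POSIX awk
  (`summarize_stats` is a made-up name):

```shell
# summarize_stats FILE: min/max/mean of the run counts logged in FILE.
summarize_stats() {
    sed -n 's/.*after \([0-9][0-9]*\) runs.*/\1/p' "$1" |
    awk 'NR == 1 { min = max = $1 }
         { if ($1 < min) min = $1; if ($1 > max) max = $1; sum += $1 }
         END { if (NR) printf "samples=%d min=%d max=%d mean=%.1f\n",
                              NR, min, max, sum / NR }'
}

# Hypothetical usage, from the directory the scripts were run in:
#   summarize_stats stats
```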


  # Usage

  Save the two scripts above to the same directory as wnt4ts-intermittent.img,
  then:

  - (If port 4444 doesn't work, the value needs to be changed in both
    scripts.)

  - In the first terminal, run `tmux -S <file>`, where <file> names the socket
    tmux will use. This needs to match "tmux=" at the top of `tmux-ctl-loop`.
    (with `tmux=./tmux`, the command would be `tmux -S tmux`)

  - Still in the first terminal (and now also inside tmux), enter
    `./run-qemu-loop`, passing the path to qemu if you're using that approach
    (refer to the first few lines of the script). Don't hit enter yet.

  - Now, in the 2nd terminal, type `./tmux-ctl-loop`

  - Hit enter in both terminals.
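
  Before going through the steps above, a quick preflight check can
  catch the two easiest mistakes: a missing tool, or the two scripts
  disagreeing about the port. A minimal sketch (all three function
  names are made up for illustration):

```shell
# check_tools NAME...: succeed only if every named program is on PATH.
check_tools() {
    local missing tool
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# script_port FILE: print the value of the first `port=` assignment in FILE.
script_port() {
    sed -n 's/^[[:space:]]*port=//p' "$1" | head -n 1
}

# same_port A B: succeed if both scripts assign the same port.
same_port() {
    [ "$(script_port "$1")" = "$(script_port "$2")" ]
}

# Hypothetical usage, run from the directory holding both scripts:
#   check_tools tmux nc && same_port run-qemu-loop tmux-ctl-loop
```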
   

  Rationale for timing of Enter key:

  - Running run-qemu-loop first will start QEMU, and if NTLDR starts
    successfully its boot menu will immediately begin counting down from 30. If
    NT actually starts to boot and is then hard-shut-down this /may/ affect the
    disk image.

  - tmux-ctl-loop will annoyingly spam a continuous stream of 'Start qemu!' until
    run-qemu-loop is running.

  - Starting both scripts at "more or less" the same time (no rush) works out
    well.

  
  Hopefully potential script modifications are obvious; for example

  - changing tmux-ctl-loop to not send 'q' to qemu so you can connect to the HMP
    yourself
    (NB, if `STOP` is not created, when qemu finally exits it will of course
    promptly be relaunched)

  - pointing run-qemu-loop to a modified qemu binary


  == #2: QEMU-vs-VirtualBox image issue ======================

  I was initially completely stumped by this issue, perhaps
  unsurprisingly so. :)

  wnt4ts-broken.img is a perfectly ordinary NT 4 installation that was
  created in QEMU 2.9.0. I created a 1GB disk with `truncate`, picked
  NTFS and installed everything (which took a while).

  NT setup reboots a number of times during the install process, and IIRC
  those all went just fine. However, at some point, the image began to
  consistently bomb out with "A disk read error occurred. ...", and
  stubbornly refused to boot, regardless of the number of boot attempts
  I tried.

  QEMU 2.0.0, 1.5.0, and 1.0 (the earliest version I was able to build
  on my system) all consistently hit "disk read error occurred".

  I tried compiling QEMU 1.0 using clang so I could build for 32-bit on
  my 64-bit system (GCC 7 died with "Frame pointer required, but
  reserved"). The resulting qemu completely crashed if I didn't enable
  KVM (ie, TCG was (understandably) broken); with KVM enabled qemu
  didn't crash, but NTLDR halted with the same error as on 64-bit qemu.
  (TL;DR, no difference whatsoever.)

  My initial reaction at this point was to try the image on another
  virtualization platform. My first pick was VirtualBox.

  So, I followed the official instructions for pointing VirtualBox to
  physical disk images, except I substituted a /dev/loopN device I'd
  pointed to the image file via losetup.

  And... VirtualBox picked the image up fine and Just Worked(TM). Yay! -
  but not yay. What gives?!

  Confused, I then tried to convert the disk image to VHD format.
  Unfortunately, for some reason, if I try `qemu-img convert ... -O
  vhdx ...`, VirtualBox chokes on the result:

  -----

  VD: error VERR_NOT_SUPPORTED opening image file
  '/.../wnt4ts-broken-qemuconv.vhd' (VERR_NOT_SUPPORTED).

  Result Code: NS_ERROR_FAILURE (0x80004005)
  Component: MediumWrap
  Interface: IMedium {4afe423b-43e0-e9d0-82e8-ceb307940dda}
  Callee: IVirtualBox {0169423f-46b4-cde9-91af-1e9d5b6cd945}
  Callee RC: VBOX_E_OBJECT_NOT_FOUND (0x80BB0001)

  -----

  Welp.

  Well, a bit more digging later, and I found I could do

  $ VBoxManage convertfromraw wnt4ts-broken.img wnt4ts-broken.vhd

  but... as soon as I pointed VirtualBox to this, it too began to choke
  with "A disk read error occurred".

  And yet, the VMDK->raw image setup worked just fine.

  I found I could even replace the loop device with the path of the .img
  file itself and that worked just fine too.

  At my wits' end, I followed some online instructions to learn about
  manual CHS configuration so I could try and get the image working in
  Bochs. "A disk read error occurred". I wasn't surprised.

  It was at this point I began to give up, but I decided to try One Last
  Thing(TM) before properly throwing in the towel.

  :)

  I decided to learn a bit more about how `VBoxManage internalcommands
  createrawvmdk` worked, and try one thing in particular: I can edit the
  .vmdk file, but can I point `createrawvmdk` at the .img file directly
  too?

  Turns out, yes you can.

  It also turns out that this promptly caused VirtualBox to bomb out.

  Interesting.

  For reference, here's the VMDK file I initially created (by pointing
  `createrawvmdk` at /dev/loopN) and then later edited to point straight
  to the .img file, with both approaches resulting in successful boot.

  --8<--------------------------------------------------------

  # Disk DescriptorFile
  version=1
  CID=e35b9a45
  parentCID=ffffffff
  createType="fullDevice"

  # Extent description
  RW 1536000 FLAT "/absolute/full/path/to/wnt4ts-broken.img" 0

  # The disk Data Base 
  #DDB

  ddb.virtualHWVersion = "4"
  ddb.adapterType="ide"
  ddb.geometry.cylinders="1523"
  ddb.geometry.heads="16"
  ddb.geometry.sectors="63"
  ddb.uuid.image="871a6044-c8ca-48ed-b7aa-e6fc49da3db4"
  ddb.uuid.parent="00000000-0000-0000-0000-000000000000"
  ddb.uuid.modification="3661715c-3906-4e4a-ab65-486d140e03b8"
  ddb.uuid.parentmodification="00000000-0000-0000-0000-000000000000"
  ddb.geometry.biosCylinders="761"
  ddb.geometry.biosHeads="32"
  ddb.geometry.biosSectors="63"

  ------------------------------------------------------------

  
  Here's the _diff_ of what happens if I point `createrawvmdk` at wnt4ts-broken.img directly:

  --8<--------------------------------------------------------

  ddb.geometry.cylinders="2080"
  ddb.geometry.heads="16"
  ddb.geometry.sectors="63"

  ------------------------------------------------------------

  :D
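
  The cylinder counts in the two descriptors are consistent with plain
  integer division of the sector count by heads x sectors-per-track.
  The sketch below is only arithmetic on the numbers shown above, not a
  claim about how `createrawvmdk` derives geometry. The 2097152-sector
  figure assumes the image is exactly 1 GiB of 512-byte sectors, which
  is inferred from the "1G" in the size table. (`chs_cylinders` is a
  made-up name.)

```shell
# chs_cylinders TOTAL_SECTORS HEADS SECTORS_PER_TRACK: whole cylinders that fit.
chs_cylinders() {
    echo $(( $1 / ($2 * $3) ))
}

chs_cylinders 1536000 16 63    # extent line of the working VMDK -> 1523
chs_cylinders 1536000 32 63    # same extent, BIOS geometry       -> 761
chs_cylinders 2097152 16 63    # assumed 1 GiB image              -> 2080
```

  Note that 1536000 sectors x 512 bytes is 750 MiB rather than 1 GiB;
  the report does not say where that sector count came from.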

  Naturally,

  $ qemu-system-i386 \
      -drive file=wnt4ts-broken.img,format=raw,cyls=1523,heads=16,secs=63 \
      -M isapc -sdl

  will boot happily on 2.9.0 (notwithstanding the occasional "disk read
  error occurred" documented above).

  It will also boot in 1.6.0.

  (POTENTIAL BUG HEADSUP: 1.0 and 1.5.0 both lock up with a blank
  640x480 window and use 0% CPU if I specify `-M isapc`.)

  And, of course, using these CHS values in Bochs also results in
  successful boot as well (after setting the CPU type to pentium).

  Unfortunately, I have no idea what sequence of events caused the
  creation of the VMDK file above. No invocation of `createrawvmdk` is
  producing a VMDK file with the CHS settings above.

  I've only just begun to learn about the intricacies of CHS. Am I to
  understand that these values are stored amongst the first 512 bytes of
  the disk? If this is the case, then I wonder what changed the data,
  and why. I was initially only using QEMU 2.9.0, and didn't move the
  image to different VMs or QEMU versions. Perhaps Windows NT got
  confused about the disk CHS and rewrote it?
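
  For what it's worth, the classic MBR layout has no dedicated
  disk-geometry field in the first 512 bytes. What it does have is a
  partition table whose entries carry start/end addresses in CHS form,
  and the end address usually sits on the last head and last sector of
  a cylinder, so it hints at the geometry the partitioning tool
  assumed. (The partition's own boot sector also has heads and
  sectors-per-track fields in its BPB, which this sketch does not
  read.) A minimal sketch, assuming the first 16-byte entry lives at
  byte offset 446; `mbr_first_entry` is a made-up name:

```shell
# mbr_first_entry IMAGE: decode the first partition-table entry of an MBR.
# Layout assumed: entry at offset 446; byte 4 = type, bytes 5-7 = end CHS,
# bytes 8-11 = start LBA, bytes 12-15 = sector count (both little-endian).
mbr_first_entry() {
    local img end_head end_sec end_cyl lba count
    img=$1
    # Read the 16 entry bytes as unsigned decimals into $1..$16.
    set -- $(od -An -v -tu1 -j446 -N16 "$img")
    [ "$#" -eq 16 ] || { echo "short read" >&2; return 1; }
    end_head=$6
    end_sec=$(( $7 & 63 ))                   # low 6 bits: sector
    end_cyl=$(( (($7 & 192) << 2) | $8 ))    # top 2 bits + next byte: cylinder
    lba=$(( $9 | (${10} << 8) | (${11} << 16) | (${12} << 24) ))
    count=$(( ${13} | (${14} << 8) | (${15} << 16) | (${16} << 24) ))
    printf 'type=0x%02x end_chs=%d/%d/%d lba_start=%d sectors=%d\n' \
        "$5" "$end_cyl" "$end_head" "$end_sec" "$lba" "$count"
    # The end head/sector suggest the geometry the partitioner assumed.
    echo "implied heads=$((end_head + 1)) sectors_per_track=$end_sec"
}

# Hypothetical usage:
#   mbr_first_entry wnt4ts-broken.img
```

  Comparing the implied heads/sectors against what each emulator
  presents would show whether the guest and the BIOS disagree.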

  
  == Sporadic BIOS-level boot failure ========================

  I have multiple screenshots of SeaBIOS in QEMU 2.9.0 halting with "No
  bootable device" (et al), even with the above manually-applied CHS
  settings.

  Commit e689f7c also presents such errors.

  Commit 306ec6c does not suffer from intermittent breakage of any kind:

  - No SeaBIOS flake-outs
  - No "Non-system disk or disk error"
  - No "A disk read error occurred"
  - No "General failure ..."

  While most of my confidence in commit 306ec6c is based on anecdotal
  evidence, I modified `tmux-ctl-loop` a little to soak-test BIOS-level
  I/O stability and left this modified version running for a few
  minutes.

  --8<--------------------------------------------------------

  #!/bin/bash

  port=4444

  tmux=./tmux

  printf -v l '%0.0s-' {0..25}
  h1="$l/ buffer dump begin \\$l"
  h2="$l-\\ buffer dump end /-$l"

  while :; do
  	
  	while :; do
  		echo | nc localhost $port -q0 -w1 > /dev/null && break
  		echo 'Start qemu!'
  	done
  	
  	buffer="$(tmux -S $tmux capture-pane; tmux -S $tmux save-buffer -)"
  	
  	echo "$h1"
  	[[ "$buffer" ]] && echo "$buffer" || echo '( * Screen buffer is empty * )'
  	echo "$h2"
  	
  	if echo "$buffer" | grep -q 'Non-system disk' || echo "$buffer" | \
  		grep -q 'No bootable device'
  	then
  		
  		s="<Hit error after $(< itercount) runs.>"
  		echo "$s"
  		echo "$s" >> stats
  		
  		touch STOP
  		
  		#echo q | nc localhost $port -q0 > /dev/null
  		
  		exit
  		
  	elif echo "$buffer" | grep -q 'OS Loader V4.00' || echo "$buffer" | \
  		grep -q 'A disk read error'
  	then
  	
  		echo '<Boot did not hang at BIOS, trying again>'
  		
  		echo q | nc localhost $port -q0 > /dev/null
  		
  	else
  		
  		echo '<Waiting for boot>'
  		
  	fi
  			
  done

  ------------------------------------------------------------

  For the above to work, the top of run-qemu-loop must also be modified
  to read something along the lines of

  disk=/path/to/wnt4ts-broken.img,format=raw,cyls=1523,heads=16,secs=63

  (Suggestion: modify copies of both scripts)

  One small terminal-flicker-headache (and a 57°C CPU) later, I was able
  to carefully observe just over 350 successful runs in which QEMU
  commit 306ec6c only ever produced a boot menu. No other hitches.

  ** Important: **

  However, commit 306ec6c will fail to boot, ever, if the CHS geometry
  is not set to the values VirtualBox "discovered". (Of note
  is the fact that QEMU (2.9.0) was what initially created this image. I
  must admit that I don't remember what sequence of QEMU versions I fed
  the image to - and I maybe, possibly, didn't think to back the file up
  (sorry), so maybe something mangled something somewhere. But
  VirtualBox figured it out nonetheless!)

  Furthermore, feeding /dev/loopN to any QEMU version will NOT result in
  correct CHS discovery (and successful boot).

  This is what leads me to conclude that I've discovered two separate
  issues.


  == Appendix: How to build the branches =====================

  It's very simple.

  First, `git clone https://github.com/qemu/qemu` somewhere if you don't
  already have a local copy. If you have an old git checkout that's from
  2014 or later, you can use that old checkout instead. (If you want to
  test an old checkout you have, the commands below will either work
  perfectly or completely bomb out with no side effects.)

  A full checkout is a ~183MB download. Sorry.

  Next, create two new directories somewhere. Name them what you like,
  eg `qemu-working` and `qemu-broken`.

  Now, cd into the checkout directory, and run:

  $ git archive 306ec6c3cece7004429c79c1ac93d49919f1f1cc | \
      tar xC /path/to/qemu-working/

  $ git archive e689f7c668cbd9d08f330e17c3dd3a059c9553d3 | \
      tar xC /path/to/qemu-broken/

  The paths can be relative.

  Now, run this in both of the new directories:

  $ ./configure --python=python2.7 --disable-libssh2 --disable-seccomp \
      --disable-usb-redir --disable-guest-agent --disable-libiscsi \
      --disable-spice --disable-smartcard-nss --disable-vhost-net \
      --disable-docs --disable-attr --disable-cap-ng --disable-vde \
      --disable-user --disable-bluez --disable-vnc-ws --disable-xen \
      --disable-brlapi --enable-debug --target-list=i386-softmmu \
      --disable-fdt

  $ make -j64

  You can open two terminals and configure and build both simultaneously
  if you like.

  On my decent but very basic (2-core+HT) i3 box, -j64 actually works out - make doesn't actually launch too many gcc processes. You *will* see your system load spike to ~20 though :)
  (NB. Do. not. use. -j64. with. the. linux. kernel.)
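
  If -j64 feels too aggressive for a given machine, a common
  alternative is to derive the job count from the CPU count. A minimal
  sketch, assuming coreutils `nproc` is available (`pick_jobs` is a
  made-up helper name):

```shell
# pick_jobs: print a parallel job count for make, falling back to 1
# when nproc is unavailable.
pick_jobs() {
    local n
    n=$(nproc 2>/dev/null) || n=1
    echo "$n"
}

echo "would run: make -j$(pick_jobs)"
```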

  On my system, a single build with -j64 takes only about 35 seconds. C
  FTW. (Although this has increased to 1min20sec for more recent
  builds.)

  Most of the configure arguments remove functionality I'll never use
  (in this situation) and which will only slow down the build.

  Once QEMU is built, run qemu-system-i386 directly from where it has
  been built.

  $ /path/to/qemu-working/i386-softmmu/qemu-system-i386 ...
  $ /path/to/qemu-broken/i386-softmmu/qemu-system-i386 ...

  Again, the paths can be relative.

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1745312/+subscriptions



* [Bug 1745312] Re: Regression report: Disk subsystem I/O failures/issues surfacing in DOS/early Windows [two separate issues: one bisected, one root-caused]
  2018-01-25  7:18 [Qemu-devel] [Bug 1745312] [NEW] Regression report: Disk subsystem I/O failures/issues surfacing in DOS/early Windows [two separate issues: one bisected, one root-caused] i336_
                   ` (6 preceding siblings ...)
  2021-04-22  5:32 ` Thomas Huth
@ 2021-04-29  9:52 ` Thomas Huth
  2021-04-30 16:49 ` Thomas Huth
  2022-07-07 21:03 ` Lev Kujawski
  9 siblings, 0 replies; 11+ messages in thread
From: Thomas Huth @ 2021-04-29  9:52 UTC (permalink / raw)
  To: qemu-devel

** Tags removed: qemu

-- 
You received this bug notification because you are a member of
qemu-devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1745312

Title:
  Regression report: Disk subsystem I/O failures/issues surfacing in
  DOS/early Windows [two separate issues: one bisected, one root-caused]

Status in QEMU:
  Incomplete

Bug description:
  [Headsup: This report is long-ish due to the amount of detail I've
  stumbled on along the way that I think is relevant to include. I can't
  speak as to the complexity of the actual bugs, but the size of this
  report should not suggest that the reproduction process is
  particularly headache-inducing.]

  Hi!

  I recently needed to fire up some ancient software for research
  purposes and got very distracted discovering and playing with old
  versions of Windows :). In the process I've discovered some glitches
  with disk I/O.

  I believe I've stumbled on two completely separate issues that
  coincidentally surfaced at the same time. It's possible that
  components of this report will be re-filed as more specific new bugs,
  but I'm not an authority on QEMU internals or how to narrow
  down/categorize what I've found.

  - The first bug only surfaces when the "isapc" machine type is used.
  It intermittently produces "General failure {read,writ}ing drive _"
  under MS-DOS 6.22, and also somehow interferes with early bootstrap of
  Windows NT 4 (in NTLDR). Enabling or disabling KVM (I'm on Linux)
  appears to make no difference whatsoever, which may help with
  debugging.

  - The second issue involves
    - a WinNT4 disk image
    - created by running through a bog-standard NT4 install inside QEMU 2.9.0
    - which will now fail to boot in any version of QEMU - even version 1.0
      - but which VirtualBox will boot fine
        - but only if I point VirtualBox at QEMU's raw disk image via a
          hacked-together VMDK file
        - if the raw image is converted to VHD(X), VirtualBox will also fail
          to boot the image with exactly the same error as QEMU
        - this state of affairs is not affected by image sparseness (which makes
          sense)

  I'm confident I've bisected the first issue.

  I wasn't able to bisect the second issue (as all tested versions of
  QEMU behaved identically), but I've figured out a working repro
  testcase and I believe I've managed to pin down a solid root cause.


  == #1: Intermittent I/O issues when `-M isapc` is used =====

  These symptoms sometimes take a small amount of time and fiddling to
  trigger, but I AM able to consistently surface them on my machine
  after a short while. (I am very very interested to hear if others
  cannot reproduce them.)

  So, first of all:

  https://github.com/qemu/qemu/commit/306ec6c3cece7004429c79c1ac93d49919f1f1cc
    (Jul 30 2013): the last version that works

  https://github.com/qemu/qemu/commit/e689f7c668cbd9d08f330e17c3dd3a059c9553d3
    (Oct 30 2013): the first version that intermittently fails

  Maybe lift out and build these branches while reading. *shrug*
  (How to do this can be found at the end of this report - along with a time-saving ./configure line, FWIW)

  Here are the changelists between these two revisions:

  https://github.com/qemu/qemu/compare/306ec6c...e689f7c
  (Compare direction: OLD to NEW) (Commits: 166  Files changed: 192)

  https://github.com/qemu/qemu/compare/e689f7c...306ec6c
  (Compare direction: NEW to OLD) (Commits: 30   Files changed: 22)

  (Someone else more familiar with Git might know why GitHub returns
  results for both compare directions, and/or if the 2nd link is useful
  information. The first link returns a lot more results than the 2nd
  one, at least. Does comparing new>old return deletions?)
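
  A possible answer to the question above: GitHub's `A...B` compare
  view lists the commits reachable from B but not from A. If that
  reading is right, getting results in both directions would mean the
  two commits sit on diverged lines of history (neither is a plain
  ancestor of the other), rather than the second link showing
  deletions. A minimal sketch for checking this locally with stock git
  commands (`divergence` is a made-up name; the repository path in the
  usage line is a placeholder):

```shell
# divergence REPO OLD NEW: count the commits unique to each side and say
# whether OLD is an ancestor of NEW.
divergence() {
    local repo old new only_new only_old
    repo=$1
    old=$2
    new=$3
    only_new=$(git -C "$repo" rev-list --count "$old..$new")
    only_old=$(git -C "$repo" rev-list --count "$new..$old")
    echo "only-in-new=$only_new only-in-old=$only_old"
    if git -C "$repo" merge-base --is-ancestor "$old" "$new"; then
        echo "linear: old is an ancestor of new"
    else
        echo "diverged: old is not an ancestor of new"
    fi
}

# Hypothetical usage:
#   divergence /path/to/qemu 306ec6c e689f7c
```

  A non-zero "only-in-old" count would matter when narrowing things
  down further, since changes unique to the older side are not part of
  the newer commit's history.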

  ---

  Now on to the symptoms. In a moment I'll describe reproduction.

  # MS-DOS 6.22

  The first symptom I discovered was that trivial read and write
  operations under MS-DOS would sometimes fail:

    C:\>echo test > hi

    General failure writing drive C
    Abort, Retry, Fail?

  Anything else that exercises the disk behaves similarly:

    C:\>dir /s > nul

    General failure reading drive C
    Abort, Retry, Fail?

  (Note that the above demonstrates both write and read failures)

  (Also, FWIW, `dir /s` == `ls -R`)

  The I/O errors are hard to characterise because their behavior
  fluctuates so much. For example something as simple as DIR can produce
  wildly differing results: in one run, poking around with DIR ended
  with DOS deciding C:\ was empty at one point; at another point in a
  different run C:\ mysteriously dropped 50% of its contents only to
  magically gain it all back moments later after some poking around in
  one of the subdirectories that was still visible.

  The time it takes to trigger these errors is also highly variable.
  QEMU may fall over as early as hanging forever at "Starting
  MS-DOS...", or I might get all the way into Windows 3.1 before it
  triggers (in which case Win3.1 reports vague memory errors - of all
  things).

  Very occasionally I've seen _SeaBIOS itself_ report "Booting from Hard
  Disk..." "Boot failed: could not read the boot disk" ... "No bootable
  device.", and on one occasion I even got "Non-System disk or disk
  error" "Replace and strike any key when ready"!

  
  # WinNT 4 Terminal Server

  Most of the time, NTLDR will fire up normally. But every so often...

    SeaBIOS (version rel-1.7.3-117-g31b8b4e-20131206_080705-nilsson.home.kraxel.org)

    Booting from Hard Disk...
    A disk read error occurred.
    Insert a system diskette and restart
    the system.

  (NB. You're seeing the old SeaBIOS version included with e689f7c,
  which was the first buggy commit.)

  If NT gets past this point without erroring out (ie, it makes it to
  the boot menu), the rest of the system is 100% fine and there are no
  other disk I/O issues whatsoever. For example, on QEMU 2.9.0 I was
  able to enable disk compression, answer "Yes" to "Compress entire disk
  now?" and have the process fully complete. No hitches.

  This makes me vaguely recall/wonder that perhaps this could be somehow
  related to LBA and/or Int 13h, or something floating around near that
  bunch of functionality. (I'm woefully ignorant about such low-level
  details.) Perhaps DOS/Win3.1 are stuck using a disk mode that QEMU has
  a buggy implementation of, while NT 4 (once NTOSKRNL is up and
  running) is able to use a different disk mode or access mechanism.

  I'm really interested to get some understanding of what the root issue
  is here, when this is fixed. (I wonder if it's a timing thing?)

  I've observed some unusual behavior with repeated restarts. In one
  case, I attempted to start NT4 multiple times, and QEMU consistently
  failed with "No bootable device" each time. So, I removed `-M isapc`,
  promptly got a boot menu, hit ^C, readded `-M isapc` - and continued
  to get a boot menu. Yep. I'll accept "really really big coincidence"
  but I do very much wonder if something else is going on here. I've
  observed many similar incidents. It makes me wonder whether the
  contents of memory or some other system state is an influence. Very
  probably not, but still...


  -- Reproduction --------------------------------------------

  First of all, there was unfortunately no way for me to avoid having to
  post entire disk images, but I've managed to compress everything down
  to 174MB total download size.

  FWIW, WinWorld and many other sites seem to have no operational issues
  providing clear pointers to CD keys; I consider my distribution of my
  installed HDD images an extension of the apparent status quo.

  That being said, I've put everything on Google Drive so nobody has to
  headscratch about Launchpad/Canonical/etc's stance on hosting this
  data.

  So, this folder contains the disk images:
  https://drive.google.com/drive/folders/1WdZVBh5Trs9HLC186-nHyKeqaSxfyM2c
  ("Download all" at the top-right will create a ZIP file, but FWIW
  downloading the individual files simultaneously would implement a
  rough form of download acceleration)

  File meta info:

  Compressed
  |
  |      Apparent
  |      |    Actual
  |      |    |
  38M -> 200M (103M)  win31.img.xz
  82M -> 1G   (289M)  wnt4ts-broken.img.xz
  55M -> 350M (146M)  wnt4ts-intermittent.img.xz

  SHA-256s:

  win31:        8179b8180a2ab40bd472e8a2f3fb89fc331651e56923f94ceb9e52a78ee220d2
  broken:       a2af5f0bc49a063b75f534b6ffe5b82e32ecc706a64a425b6626feccf6e3fdfa  
  intermittent: 77ae8c458829ebcdd64c71042012f45d5a2788e6ebd22db9d53de9ef1a574784

  (Wanted to keep the checksum lines within 80 columns)

  And, since I can't figure out where else in this report to put this,
  wnt4ts-broken.img's password is "admin" but something seems to have
  happened to the disk and NT doesn't actually boot properly :(, and
  wnt4ts-intermittent.img's password is "1234". (These were set up as
  test images. Now I'm _really_ glad I used simple passwords! :) )

  ---

  
  I have two testcases: DOS 6.22 (+ Windows 3.1), and Windows NT 4.

  
  # MS-DOS

  DOS is the simplest. It basically consists of

  $ qemu-system-i386 -drive file=win31.img,format=raw -M isapc -enable-kvm

  And then literally just playing around. Things to try include creating
  files (`echo blah > file`), repeatedly seeking across the entire FAT
  (`dir /s > nul` or `dir /s`), and launching Windows (`win`).

  win31.img is not special (as far as I can tell) and merely consists of
  the result of installing DOS 6.22 and Windows 3.1 from WinWorldPC.
  I've basically just included the image for convenience.

  Generally no single "run" is immune to starting Win3.1 and then
  launching File Manager; if that doesn't generate an error, something
  is definitely up.

  The second best trigger is creating new files. That very very
  frequently produces "General Failure ...", but not always.

  
  # WinNT 4

  Windows NT 4 is a bit more complicated. Because this error only occurs
  at presumably a single small point very early in boot, the window of
  opportunity for the glitch to surface within is much much narrower and
  thus often requires a larger number of tries.

  Anecdotally I've had QEMU hit the boot error at the first try/run, and
  after as many as 63 "successful" boots.

  I made a small test harness that automates the launch process. It
  consists of two shell scripts and requires tmux (and netcat).
  (*Potential epilepsy warning*: if you use a light-colored terminal
  background, the terminal QEMU is repeatedly invoked from will
  continuously flash rapidly from white to black.)

  One of the scripts is run inside a tmux session in one terminal, while
  the other script is run in its own terminal (without any tmux).

  
  I named this one `run-qemu-loop`:

  --8<--------------------------------------------------------

  #!/bin/bash

  # ---

  qemu=/path/to/qemu-system-i386
  #or, alternatively: (I used the following line myself so I
  #could tab-complete my way to different qemu executables)
  #qemu="$1"

  disk=/path/to/wnt4ts-intermittent.img

  # ---

  port=4444

  rm -f STOP itercount

  itercount=0

  while :; do
  	
  	[ -f STOP ] && break
  	
  	((itercount++))
  	echo $itercount > itercount
  	
  	$qemu \
  		-enable-kvm -vga cirrus -curses -M isapc \
  		-drive file="$disk",format=raw \
  		-chardev socket,id=mon0,host=localhost,port=$port,server,nowait \
  		-mon chardev=mon0,mode=readline
  	
  	#point to an otherwise-unused terminal if you like (see also: `tty`)
  	#echo "$itercount run(s)" > /dev/pts/__
  	
  done

  ------------------------------------------------------------

  Not much logic above; this just repeatedly runs QEMU for as long as
  the file `STOP` does not exist in the current directory.

  The key "magic" bit is that QEMU is launched in -curses mode.

  The other key bit is that the above script is run inside tmux.

  
  Here's `tmux-ctl-loop`:

  --8<--------------------------------------------------------

  #!/bin/bash

  port=4444

  tmux=./tmux

  printf -v l '%0.0s-' {0..25}
  h1="$l/ buffer dump begin \\$l"
  h2="$l-\\ buffer dump end /-$l"

  while :; do
  	
  	while :; do
  		echo | nc localhost $port -q0 -w1 > /dev/null && break
  		echo 'Start qemu!'
  	done
  	
  	buffer="$(tmux -S $tmux capture-pane; tmux -S $tmux save-buffer -)"
  	
  	echo "$h1"
  	[[ "$buffer" ]] && echo "$buffer" || echo '( * Screen buffer is empty * )'
  	echo "$h2"
  	
  	if echo "$buffer" | grep -q 'A disk read error occurred.'; then
  		
  		s="<Crashed after $(< itercount) runs.>"
  		echo "$s"
  		echo "$s" >> stats
  		
  		touch STOP
  		
  		#echo q | nc localhost $port -q0 > /dev/null
  		
  		exit
  		
  	elif echo "$buffer" | grep -q 'OS Loader V4.00'; then
  			
  		echo '<Booted successfully, trying again>'
  		
  		echo q | nc localhost $port -q0 > /dev/null
  		
  	else
  		
  		echo '<Waiting for boot>'
  		
  	fi
  			
  done

  ------------------------------------------------------------

  Nothing particularly amazing going on here either.

  While `run-qemu-loop` is running inside tmux in the first terminal,
  this is running in the 2nd one.

  The small infinite loop at the top only breaks when it can
  successfully ping QEMU and it knows it's running.

  Then, a screendump of the contents of the terminal QEMU is in is
  fetched from tmux, and the buffer's content is analyzed.

  - If NTLDR fails, the script creates `STOP` to halt run-qemu-loop and then
    exits. (The line that would send `q` to QEMU is commented out in this
    branch, so QEMU is left running.)

  - If NTLDR loads successfully, the script sends `q` to QEMU and continues
    looping. (run-qemu-loop will not find the `STOP` file, and so restarts qemu.)

  The scripts run very quickly, with 2-3 iterations per second on my i3
  box.


  # Usage

  Save the two scripts above to the same directory as wnt4ts-intermittent.img,
  then:

  - (If port 4444 doesn't work, the value needs to be changed in both
    scripts.)

  - In the first terminal, run `tmux -S <file>`, where <file> names the socket
    tmux will use. This needs to match "tmux=" at the top of `tmux-ctl-loop`.
    (with `tmux=./tmux`, the command would be `tmux -S tmux`)

  - Still in the first terminal (and now also inside tmux), enter
    `./run-qemu-loop`, passing the path to qemu if you're using that approach
    (refer to the first few lines of the script). Don't hit enter yet.

  - Now, in the 2nd terminal, type `./tmux-ctl-loop`

  - Hit enter in both terminals.
   

  Rationale for timing of Enter key:

  - Running run-qemu-loop first will start QEMU, and if NTLDR starts
    successfully its boot menu will immediately begin counting down from 30. If
    NT actually starts to boot and is then hard-shut-down this /may/ affect the
    disk image.

  - tmux-ctl-loop will annoyingly spam a continuous stream of 'Start qemu!' until
    run-qemu-loop is running.

  - Starting both scripts at "more or less" the same time (no rush) works out
    well.

  
  Hopefully potential script modifications are obvious; for example

  - changing tmux-ctl-loop to not send 'q' to qemu so you can connect to the HMP
    yourself
    (NB, if `STOP` is not created, when qemu finally exits it will of course
    promptly be relaunched)

  - pointing run-qemu-loop to a modified qemu binary


  == #2: QEMU-vs-VirtualBox image issue ======================

  I was initially completely stumped by this issue, perhaps
  unsurprisingly so. :)

  wnt4ts-broken.img is a perfectly ordinary NT 4 installation that was
  created in QEMU 2.9.0. I created a 1GB disk with `truncate`, picked
  NTFS and installed everything (which took a while).

  NT setup reboots a number of times during the install process, and IIRC
  those all went just fine. However, at some point, the image began to
  consistently bomb out with "A disk read error occurred. ...", and
  stubbornly refused to boot, no matter how many times I tried.

  QEMU 2.0.0, 1.5.0, and 1.0 (the earliest version I was able to build
  on my system) all consistently hit "disk read error occurred".

  I tried compiling QEMU 1.0 using clang so I could build for 32-bit on
  my 64-bit system (GCC 7 died with "Frame pointer required, but
  reserved"). The resulting qemu completely crashed if I didn't enable
  KVM (ie, TCG was (understandably) broken); with KVM enabled qemu
  didn't crash, but NTLDR halted with the same error as on 64-bit qemu.
  (TL;DR, no difference whatsoever.)

  My initial reaction at this point was to try the image on another
  virtualization platform. My first pick was VirtualBox.

  So, I followed the official instructions for pointing VirtualBox to
  physical disk images, except I substituted a /dev/loopN device I'd
  pointed to the image file via losetup.

  And... VirtualBox picked the image up fine and Just Worked(TM). Yay! -
  but not yay. What gives?!

  Confused, I then tried to convert the disk image to VHD format.
  Unfortunately, for some reason, if I try `qemu-img convert ... -O
  vhdx ...`, VirtualBox chokes on the result:

  -----

  VD: error VERR_NOT_SUPPORTED opening image file
  '/.../wnt4ts-broken-qemuconv.vhd' (VERR_NOT_SUPPORTED).

  Result Code: NS_ERROR_FAILURE (0x80004005)
  Component: MediumWrap
  Interface: IMedium {4afe423b-43e0-e9d0-82e8-ceb307940dda}
  Callee: IVirtualBox {0169423f-46b4-cde9-91af-1e9d5b6cd945}
  Callee RC: VBOX_E_OBJECT_NOT_FOUND (0x80BB0001)

  -----

  Welp.

  Well, a bit more digging later, and I found I could do

  $ VBoxManage convertfromraw wnt4ts-broken.img wnt4ts-broken.vhd

  but... as soon as I pointed VirtualBox to this, it too began to choke
  with "A disk read error occurred".

  And yet, the VMDK->raw image setup worked just fine.

  I found I could even replace the loop device with the path of the .img
  file itself and that worked just fine too.

  At my wits' end, I followed some online instructions to learn about
  manual CHS configuration so I could try and get the image working in
  Bochs. "A disk read error occurred". I wasn't surprised.

  It was at this point I began to give up, but I decided to try One Last
  Thing(TM) before properly throwing in the towel.

  :)

  I decided to learn a bit more about how `VBoxManage internalcommands
  createrawvmdk` worked, and try one thing in particular: I can edit the
  .vmdk file, but can I point `createrawvmdk` at the .img file directly
  too?

  Turns out, yes you can.

  It also turns out that this promptly caused VirtualBox to bomb out.

  Interesting.

  For reference, here's the VMDK file I initially created (by pointing
  `createrawvmdk` at /dev/loopN) and then later edited to point straight
  to the .img file, with both approaches resulting in successful boot.

  --8<--------------------------------------------------------

  # Disk DescriptorFile
  version=1
  CID=e35b9a45
  parentCID=ffffffff
  createType="fullDevice"

  # Extent description
  RW 1536000 FLAT "/absolute/full/path/to/wnt4ts-broken.img" 0

  # The disk Data Base 
  #DDB

  ddb.virtualHWVersion = "4"
  ddb.adapterType="ide"
  ddb.geometry.cylinders="1523"
  ddb.geometry.heads="16"
  ddb.geometry.sectors="63"
  ddb.uuid.image="871a6044-c8ca-48ed-b7aa-e6fc49da3db4"
  ddb.uuid.parent="00000000-0000-0000-0000-000000000000"
  ddb.uuid.modification="3661715c-3906-4e4a-ab65-486d140e03b8"
  ddb.uuid.parentmodification="00000000-0000-0000-0000-000000000000"
  ddb.geometry.biosCylinders="761"
  ddb.geometry.biosHeads="32"
  ddb.geometry.biosSectors="63"

  ------------------------------------------------------------

  
  Here's the _diff_ of what happens if I point `createrawvmdk` at wnt4ts-broken.img directly:

  --8<--------------------------------------------------------

  ddb.geometry.cylinders="2080"
  ddb.geometry.heads="16"
  ddb.geometry.sectors="63"

  ------------------------------------------------------------
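
  Both cylinder counts are what plain integer division of a sector count
  by heads * sectors-per-track gives. A minimal sketch of that
  arithmetic, assuming 512-byte sectors and assuming the 1GB image is
  exactly 1 GiB (2097152 sectors); the function name `cylinders` is made
  up for illustration:

```shell
#!/bin/sh

# Cylinders for a disk of $1 sectors with $2 heads and $3 sectors per track.
# Integer division: a trailing partial cylinder is dropped.
cylinders() {
	echo $(( $1 / ($2 * $3) ))
}

cylinders 1536000 16 63    # extent size from the working VMDK  -> 1523
cylinders 1536000 32 63    # same extent, BIOS geometry         -> 761
cylinders 2097152 16 63    # 1 GiB of 512-byte sectors (assumed) -> 2080
```

  If those assumptions hold, 2080 is simply what the full file size
  yields, while 1523 and 761 both match the 1536000-sector extent
  (750 MiB) recorded in the working descriptor. That would suggest the
  working geometry was derived from a device of that size rather than
  from the full file, but this is only an inference from the numbers.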

  :D

  Naturally,

  $ qemu-system-i386 \
      -drive file=wnt4ts-broken.img,format=raw,cyls=1523,heads=16,secs=63 \
      -M isapc -sdl

  will boot happily on 2.9.0 (notwithstanding the occasional "disk read
  error occurred" documented above).

  It will also boot in 1.6.0.

  (POTENTIAL BUG HEADSUP: 1.0 and 1.5.0 both lock up with a blank
  640x480 window and use 0% CPU if I specify `-M isapc`.)

  And, of course, using these CHS values in Bochs also results in a
  successful boot (after setting the CPU type to pentium).

  Unfortunately, I have no idea what sequence of events caused the
  creation of the VMDK file above. No invocation of `createrawvmdk` is
  producing a VMDK file with the CHS settings above.

  I've only just begun to learn about the intricacies of CHS. Am I to
  understand that these values are stored amongst the first 512 bytes of
  the disk? If this is the case, then I wonder what changed the data,
  and why. I was initially only using QEMU 2.9.0, and didn't move the
  image to different VMs or QEMU versions. Perhaps Windows NT got
  confused about the disk CHS and rewrote it?
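
  On the "first 512 bytes" question: in a classic MBR the disk geometry
  is not stored as such, but each 16-byte partition table entry
  (starting at byte 446) records the partition's start and end in CHS
  form, and the end head and end sector are commonly used to guess the
  geometry. A minimal sketch for peeking at the first entry, assuming
  the image has a classic MBR with its first partition ending on a
  cylinder boundary; the function name `guess_geometry` is made up for
  illustration:

```shell
#!/bin/sh

# Print "<heads> <sectors-per-track>" guessed from the end-CHS field of
# the first MBR partition entry of image $1.
# Entry layout: byte 5 = end head, byte 6 = end sector in the low 6 bits
# (the top 2 bits belong to the cylinder). 446 + 5 = 451.
guess_geometry() {
	set -- $(od -An -tu1 -j 451 -N 2 "$1")
	end_head=$1
	end_sector=$(( $2 & 63 ))
	echo "$(( end_head + 1 )) $end_sector"
}

# Example:
#   guess_geometry wnt4ts-broken.img
```

  Comparing the result against the heads/sectors values in the two VMDK
  descriptors may help show which geometry the partition table was
  written for.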

  
  == Sporadic BIOS-level boot failure ========================

  I have multiple screenshots of SeaBIOS in QEMU 2.9.0 halting with "No
  bootable device" (et al), even with the above manually-applied CHS
  settings.

  Commit e689f7c also presents such errors.

  Commit 306ec6c does not suffer from intermittent breakage of any kind:

  - No SeaBIOS flake-outs
  - No "Non-system disk or disk error"
  - No "A disk read error occurred"
  - No "General failure ..."

  While most of my confidence in commit 306ec6c is based on anecdotal
  evidence, I modified `tmux-ctl-loop` a little to soak-test BIOS-level
  I/O stability and left this modified version running for a few
  minutes.

  --8<--------------------------------------------------------

  #!/bin/bash

  port=4444

  tmux=./tmux

  printf -v l '%0.0s-' {0..25}
  h1="$l/ buffer dump begin \\$l"
  h2="$l-\\ buffer dump end /-$l"

  while :; do
  	
  	while :; do
  		echo | nc localhost $port -q0 -w1 > /dev/null && break
  		echo 'Start qemu!'
  	done
  	
  	buffer="$(tmux -S $tmux capture-pane; tmux -S $tmux save-buffer -)"
  	
  	echo "$h1"
  	[[ "$buffer" ]] && echo "$buffer" || echo '( * Screen buffer is empty * )'
  	echo "$h2"
  	
  	if echo "$buffer" | grep -q 'Non-system disk' || echo "$buffer" | \
  		grep -q 'No bootable device'
  	then
  		
  		s="<Hit error after $(< itercount) runs.>"
  		echo "$s"
  		echo "$s" >> stats
  		
  		touch STOP
  		
  		#echo q | nc localhost $port -q0 > /dev/null
  		
  		exit
  		
  	elif echo "$buffer" | grep -q 'OS Loader V4.00' || echo "$buffer" | \
  		grep -q 'A disk read error'
  	then
  	
  		echo '<Boot did not hang at BIOS, trying again>'
  		
  		echo q | nc localhost $port -q0 > /dev/null
  		
  	else
  		
  		echo '<Waiting for boot>'
  		
  	fi
  			
  done

  ------------------------------------------------------------

  For the above to work, the top of run-qemu-loop must also be modified
  to read something along the lines of

  disk=/path/to/wnt4ts-broken.img,format=raw,cyls=1523,heads=16,secs=63

  (Suggestion: modify copies of both scripts)

  One small terminal-flicker-headache (and a 57°C CPU) later, I was able
  to carefully observe just over 350 successful runs in which QEMU
  commit 306ec6c only ever produced a boot menu. No other hitches.

  ** Important: **

  However, commit 306ec6c will fail to boot, ever, if the CHS geometry
  (cylinders, heads, sectors) is not set to the values VirtualBox
  "discovered". (Of note is the fact that QEMU (2.9.0) was what
  initially created this image. I must admit that I don't remember what
  sequence of QEMU versions I fed the image to - and I maybe, possibly,
  didn't think to back the file up (sorry), so maybe something mangled
  something somewhere. But VirtualBox figured it out nonetheless!)

  Furthermore, feeding /dev/loopN to any QEMU version will NOT result in
  correct CHS discovery (and successful boot).

  This is what leads me to conclude that I've discovered two separate
  issues.


  == Appendix: How to build the branches =====================

  It's very simple.

  First, `git clone https://github.com/qemu/qemu` somewhere if you don't
  already have a local copy. If you have an old git checkout that's from
  2014 or later, you can use that old checkout instead. (If you want to
  test an old checkout you have, the commands below will either work
  perfectly or completely bomb out with no side effects.)

  A full checkout is a ~183MB download. Sorry.

  Next, create two new directories somewhere. Name them what you like,
  eg `qemu-working` and `qemu-broken`.

  Now, cd into the checkout directory, and run:

  $ git archive 306ec6c3cece7004429c79c1ac93d49919f1f1cc | \
      tar xC /path/to/qemu-working/

  $ git archive e689f7c668cbd9d08f330e17c3dd3a059c9553d3 | \
      tar xC /path/to/qemu-broken/

  The paths can be relative.

  Now, run this in both of the new directories:

  $ ./configure --python=python2.7 --disable-libssh2 --disable-seccomp \
      --disable-usb-redir --disable-guest-agent --disable-libiscsi \
      --disable-spice --disable-smartcard-nss --disable-vhost-net \
      --disable-docs --disable-attr --disable-cap-ng --disable-vde \
      --disable-user --disable-bluez --disable-vnc-ws --disable-xen \
      --disable-brlapi --enable-debug --target-list=i386-softmmu \
      --disable-fdt

  $ make -j64

  You can open two terminals and configure and build both simultaneously
  if you like.

  On my decent but very basic (2-core+HT) i3 box, -j64 works out fine:
  make doesn't actually launch too many gcc processes. You *will* see
  your system load spike to ~20 though :)
  (NB. Do. not. use. -j64. with. the. linux. kernel.)

  On my system, a single build with -j64 takes only about 35 seconds. C
  FTW. (Although this has increased to 1min20sec for more recent
  builds.)

  Most of the configure arguments remove functionality I'll never use
  (in this situation) and which will only slow down the build.

  Once QEMU is built, run qemu-system-i386 directly from where it has
  been built.

  $ /path/to/qemu-working/i386-softmmu/qemu-system-i386 ...
  $ /path/to/qemu-broken/i386-softmmu/qemu-system-i386 ...

  Again, the paths can be relative.

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1745312/+subscriptions


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug 1745312] Re: Regression report: Disk subsystem I/O failures/issues surfacing in DOS/early Windows [two separate issues: one bisected, one root-caused]
  2018-01-25  7:18 [Qemu-devel] [Bug 1745312] [NEW] Regression report: Disk subsystem I/O failures/issues surfacing in DOS/early Windows [two separate issues: one bisected, one root-caused] i336_
                   ` (7 preceding siblings ...)
  2021-04-29  9:52 ` Thomas Huth
@ 2021-04-30 16:49 ` Thomas Huth
  2022-07-07 21:03 ` Lev Kujawski
  9 siblings, 0 replies; 11+ messages in thread
From: Thomas Huth @ 2021-04-30 16:49 UTC (permalink / raw)
  To: qemu-devel

This is an automated cleanup. This bug report has been moved
to QEMU's new bug tracker on gitlab.com and thus gets marked
as 'expired' now. Please continue with the discussion here:

 https://gitlab.com/qemu-project/qemu/-/issues/56


** Changed in: qemu
       Status: Incomplete => Expired

** Changed in: qemu
     Assignee: John Snow (jnsnow) => (unassigned)

** Bug watch added: gitlab.com/qemu-project/qemu/-/issues #56
   https://gitlab.com/qemu-project/qemu/-/issues/56

-- 
You received this bug notification because you are a member of
qemu-devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1745312

Title:
  Regression report: Disk subsystem I/O failures/issues surfacing in
  DOS/early Windows [two separate issues: one bisected, one root-caused]

Status in QEMU:
  Expired

Bug description:
  [Headsup: This report is long-ish due to the amount of detail I've
  stumbled on along the way that I think is relevant to include. I can't
  speak as to the complexity of the actual bugs, but the size of this
  report should not suggest that the reproduction process is
  particularly headache-inducing.]

  Hi!

  I recently needed to fire up some ancient software for research
  purposes and got very distracted discovering and playing with old
  versions of Windows :). In the process I've discovered some glitches
  with disk I/O.

  I believe I've stumbled on two completely separate issues that
  coincidentally surfaced at the same time. It's possible that
  components of this report will be re-filed as more specific new bugs,
  but I'm not an authority on QEMU internals or how to narrow
  down/categorize what I've found.

  - The first bug only surfaces when the "isapc" machine type is used.
  It intermittently produces "General failure {read,writ}ing drive _"
  under MS-DOS 6.22, and also somehow interferes with early bootstrap of
  Windows NT 4 (in NTLDR). Enabling or disabling KVM (I'm on Linux)
  appears to make no difference whatsoever, which may help with
  debugging.

  - The second issue involves
    - a WinNT4 disk image
    - created by running through a bog-standard NT4 install inside QEMU 2.9.0
    - which will now fail to boot in any version of QEMU - even version 1.0
      - but which VirtualBox will boot fine
        - but only if I point VirtualBox at QEMU's raw disk image via a
          hacked-together VMDK file
        - if the raw image is converted to VHD(X), VirtualBox will also fail
          to boot the image with exactly the same error as QEMU
        - this state of affairs is not affected by image sparseness (which makes
          sense)

  I'm confident I've bisected the first issue.

  I wasn't able to bisect the second issue (as all tested versions of
  QEMU behaved identically), but I've figured out a working repro
  testcase and I believe I've managed to pin down a solid root cause.


  == #1: Intermittent I/O issues when `-M isapc` is used =====

  These symptoms sometimes take a small amount of time and fiddling to
  trigger, but I AM able to consistently surface them on my machine
  after a short while. (I am very very interested to hear if others
  cannot reproduce them.)

  So, first of all:

  https://github.com/qemu/qemu/commit/306ec6c3cece7004429c79c1ac93d49919f1f1cc
    (Jul 30 2013): the last version that works

  https://github.com/qemu/qemu/commit/e689f7c668cbd9d08f330e17c3dd3a059c9553d3
    (Oct 30 2013): the first version that intermittently fails

  Maybe check out and build these two commits while reading. *shrug*
  (How to do this can be found at the end of this report - along with a
  time-saving ./configure line, FWIW)

  Here are the changelists between these two revisions:

  https://github.com/qemu/qemu/compare/306ec6c...e689f7c
  (Compare direction: OLD to NEW) (Commits: 166  Files changed: 192)

  https://github.com/qemu/qemu/compare/e689f7c...306ec6c
  (Compare direction: NEW to OLD) (Commits: 30   Files changed: 22)

  (Someone else more familiar with Git might know why GitHub returns
  results for both compare directions, and/or if the 2nd link is useful
  information. The first link returns a lot more results than the 2nd
  one, at least. Does comparing new>old return deletions?)
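
  On the compare-direction question: GitHub's three-dot compare lists
  the commits reachable from the right-hand revision but not from the
  left-hand one. A non-empty list in both directions therefore means
  neither commit is an ancestor of the other, i.e. the two revisions sit
  on diverged lines of history rather than one strictly after the other.
  A minimal sketch to check this locally, assuming a qemu checkout that
  contains both commits (the helper name `only_in` is made up for
  illustration):

```shell
#!/bin/sh

# Count commits reachable from $2 but not from $1.
only_in() {
	git rev-list --count "$1..$2"
}

# Run inside the qemu checkout:
#   only_in 306ec6c e689f7c    # commits only on the e689f7c side
#   only_in e689f7c 306ec6c    # commits only on the 306ec6c side
#   git merge-base --is-ancestor 306ec6c e689f7c \
#       && echo "linear history" || echo "diverged"
```

  The two counts should correspond to the 166 and 30 shown by GitHub,
  though that has not been verified here.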

  ---

  Now on to the symptoms. In a moment I'll describe reproduction.

  # MS-DOS 6.22

  The first symptom I discovered was that trivial read and write
  operations under MS-DOS would sometimes fail:

    C:\>echo test > hi

    General failure writing drive C
    Abort, Retry, Fail?

  Anything else that exercises the disk behaves similarly:

    C:\>dir /s > nul

    General failure reading drive C
    Abort, Retry, Fail?

  (Note that the above demonstrates both write and read failures)

  (Also, FWIW, `dir /s` == `ls -R`)

  The I/O errors are hard to characterise because their behavior
  fluctuates so much. For example something as simple as DIR can produce
  wildly differing results: in one run, poking around with DIR ended
  with DOS deciding C:\ was empty at one point; at another point in a
  different run C:\ mysteriously dropped 50% of its contents only to
  magically gain it all back moments later after some poking around in
  one of the subdirectories that was still visible.

  The time it takes to trigger these errors is also highly variable.
  QEMU may fall over as early as hanging forever at "Starting
  MS-DOS...", or I might get all the way into Windows 3.1 before it
  triggers (in which case Win3.1 reports vague memory errors - of all
  things).

  Very occasionally I've seen _SeaBIOS itself_ report "Booting from Hard
  Disk..." "Boot failed: could not read the boot disk" ... "No bootable
  device.", and on one occasion I even got "Non-System disk or disk
  error" "Replace and strike any key when ready"!

  
  # WinNT 4 Terminal Server

  Most of the time, NTLDR will fire up normally. But every so often...

    SeaBIOS (version rel-1.7.3-117-g31b8b4e-20131206_080705-nilsson.home.kraxel.org)

    Booting from Hard Disk...
    A disk read error occurred.
    Insert a system diskette and restart
    the system.

  (NB. You're seeing the old SeaBIOS version included with e689f7c,
  which was the first buggy commit.)

  If NT gets past this point without erroring out (ie, it makes it to
  the boot menu), the rest of the system is 100% fine and there are no
  other disk I/O issues whatsoever. For example, on QEMU 2.9.0 I was
  able to enable disk compression, answer "Yes" to "Compress entire disk
  now?" and have the process fully complete. No hitches.

  This makes me wonder whether this is somehow related to LBA and/or
  Int 13h, or something floating around near that bunch of
  functionality. (I'm woefully ignorant about such low-level details.)
  Perhaps DOS/Win3.1 are stuck using a disk mode that QEMU has a buggy
  implementation of, while NT 4 (once NTOSKRNL is up and running) is
  able to use a different disk mode or access mechanism.

  I'm really interested to get some understanding of what the root issue
  is here, when this is fixed. (I wonder if it's a timing thing?)

  I've observed some unusual behavior with repeated restarts. In one
  case, I attempted to start NT4 multiple times, and QEMU consistently
  failed with "No bootable device" each time. So, I removed `-M isapc`,
  promptly got a boot menu, hit ^C, readded `-M isapc` - and continued
  to get a boot menu. Yep. I'll accept "really really big coincidence"
  but I do very much wonder if something else is going on here. I've
  observed many similar incidents. It makes me wonder whether the
  contents of memory or some other system state is an influence. Very
  probably not, but still...


  -- Reproduction --------------------------------------------

  First of all, there was unfortunately no way for me to avoid having to
  post entire disk images, but I've managed to compress everything down
  to 174MB total download size.

  FWIW, WinWorld and many other sites seem to have no operational issues
  providing clear pointers to CD keys; I consider my distribution of my
  installed HDD images an extension of the apparent status quo.

  That being said, I've put everything on Google Drive so nobody has to
  headscratch about Launchpad/Canonical/etc's stance on hosting this
  data.

  So, this folder contains the disk images:
  https://drive.google.com/drive/folders/1WdZVBh5Trs9HLC186-nHyKeqaSxfyM2c
  ("Download all" at the top-right will create a ZIP file, but FWIW
  downloading the individual files simultaneously would implement a
  rough form of download acceleration)

  File meta info:

  Compressed
  |
  |      Apparent
  |      |    Actual
  |      |    |
  38M -> 200M (103M)  win31.img.xz
  82M -> 1G   (289M)  wnt4ts-broken.img.xz
  55M -> 350M (146M)  wnt4ts-intermittent.img.xz

  SHA-256s:

  win31:        8179b8180a2ab40bd472e8a2f3fb89fc331651e56923f94ceb9e52a78ee220d2
  broken:       a2af5f0bc49a063b75f534b6ffe5b82e32ecc706a64a425b6626feccf6e3fdfa  
  intermittent: 77ae8c458829ebcdd64c71042012f45d5a2788e6ebd22db9d53de9ef1a574784

  (Wanted to keep the checksum lines within 80 columns)
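
  A download can be checked against the digests above with
  `sha256sum`. A minimal sketch, assuming GNU coreutils; whether each
  digest applies to the compressed .xz file or to the unpacked .img is
  not stated, so try both. The function name `verify_sha256` is made up
  for illustration:

```shell
#!/bin/sh

# Succeed if file $1 has SHA-256 digest $2.
verify_sha256() {
	actual=$(sha256sum "$1" | cut -d' ' -f1)
	[ "$actual" = "$2" ]
}

# Example:
#   verify_sha256 win31.img \
#       8179b8180a2ab40bd472e8a2f3fb89fc331651e56923f94ceb9e52a78ee220d2 \
#       && echo OK || echo MISMATCH
```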

  And, since I can't figure out where else in this report to put this,
  wnt4ts-broken.img's password is "admin" but something seems to have
  happened to the disk and NT doesn't actually boot properly :(, and
  wnt4ts-intermittent.img's password is "1234". (These were set up as
  test images. Now I'm _really_ glad I used simple passwords! :) )

  ---

  
  I have two testcases: DOS 6.22 (+ Windows 3.1), and Windows NT 4.

  
  # MS-DOS

  DOS is the simplest. It basically consists of

  $ qemu-system-i386 -drive file=win31.img,format=raw -M isapc -enable-kvm

  And then literally just playing around. Things to try include creating
  files (`echo blah > file`), repeatedly seeking across the entire FAT
  (`dir /s > nul` or `dir /s`), and launching Windows (`win`).

  win31.img is not special (as far as I can tell) and merely consists of
  the result of installing DOS 6.22 and Windows 3.1 from WinWorldPC.
  I've basically just included the image for convenience.

  Generally no single "run" is immune to starting Win3.1 and then
  launching File Manager; if that doesn't generate an error, something
  is definitely up.

  The second best trigger is creating new files. That very very
  frequently produces "General Failure ...", but not always.

  
  # WinNT 4

  Windows NT 4 is a bit more complicated. Because this error only occurs
  at presumably a single small point very early in boot, the window of
  opportunity for the glitch to surface within is much much narrower and
  thus often requires a larger number of tries.

  Anecdotally I've had QEMU hit the boot error at the first try/run, and
  after as many as 63 "successful" boots.

  I made a small test harness that automates the launch process. It
  consists of two shell scripts and requires tmux (and netcat).
  (*Potential epilepsy warning*: if you use a light-colored terminal
  background, the terminal QEMU is repeatedly invoked from will
  continuously flash rapidly from white to black.)

  One of the scripts is run inside a tmux session in one terminal, while
  the other script is run in its own terminal (without any tmux).

  
  I named this one `run-qemu-loop`:

  --8<--------------------------------------------------------

  #!/bin/bash

  # ---

  qemu=/path/to/qemu-system-i386
  #or, alternatively: (I used the following line myself so I
  #could tab-complete my way to different qemu executables)
  #qemu="$1"

  disk=/path/to/wnt4ts-intermittent.img

  # ---

  port=4444

  rm -f STOP itercount

  itercount=0

  while :; do
  	
  	[ -f STOP ] && break
  	
  	((itercount++))
  	echo $itercount > itercount
  	
  	$qemu \
  		-enable-kvm -vga cirrus -curses -M isapc \
  		-drive file="$disk",format=raw \
  		-chardev socket,id=mon0,host=localhost,port=$port,server,nowait \
  		-mon chardev=mon0,mode=readline
  	
  	#point to an otherwise-unused terminal if you like (see also: `tty`)
  	#echo "$itercount run(s)" > /dev/pts/__
  	
  done

  ------------------------------------------------------------

  Not much logic above; this just repeatedly runs QEMU for as long as
  the file `STOP` does not exist in the current directory.

  The key "magic" bit is that QEMU is launched in -curses mode.

  The other key bit is that the above script is run inside tmux.

  
  Here's `tmux-ctl-loop`:

  --8<--------------------------------------------------------

  #!/bin/bash

  port=4444

  tmux=./tmux

  printf -v l '%0.0s-' {0..25}
  h1="$l/ buffer dump begin \\$l"
  h2="$l-\\ buffer dump end /-$l"

  while :; do
  	
  	while :; do
  		echo | nc localhost $port -q0 -w1 > /dev/null && break
  		echo 'Start qemu!'
  	done
  	
  	buffer="$(tmux -S $tmux capture-pane; tmux -S $tmux save-buffer -)"
  	
  	echo "$h1"
  	[[ "$buffer" ]] && echo "$buffer" || echo '( * Screen buffer is empty * )'
  	echo "$h2"
  	
  	if echo "$buffer" | grep -q 'A disk read error occurred.'; then
  		
  		s="<Crashed after $(< itercount) runs.>"
  		echo "$s"
  		echo "$s" >> stats
  		
  		touch STOP
  		
  		#echo q | nc localhost $port -q0 > /dev/null
  		
  		exit
  		
  	elif echo "$buffer" | grep -q 'OS Loader V4.00'; then
  			
  		echo '<Booted successfully, trying again>'
  		
  		echo q | nc localhost $port -q0 > /dev/null
  		
  	else
  		
  		echo '<Waiting for boot>'
  		
  	fi
  			
  done

  ------------------------------------------------------------

  Nothing particularly amazing going on here either.

  While `run-qemu-loop` is running inside tmux in the first terminal,
  this is running in the 2nd one.

  The small infinite loop at the top only breaks when it can
  successfully connect to QEMU's monitor port, which tells it QEMU is
  running.

  Then, a dump of the tmux pane QEMU is running in is fetched from
  tmux, and the buffer's content is analyzed.

  - If NTLDR fails, the script creates `STOP` to halt run-qemu-loop, appends
    the iteration count to `stats`, and exits. (The line that would send `q`
    to QEMU is commented out, so the failed QEMU instance stays up.)

  - If NTLDR loads successfully, the script sends `q` to QEMU and continues
    looping. (run-qemu-loop will not find the `STOP` file, and so restart qemu.)

  The scripts run very quickly, with 2-3 iterations per second on my i3
  box.


  # Usage

  Save the two scripts above to the same directory as wnt4ts-intermittent.img,
  then:

  - (If port 4444 doesn't work, the value needs to be changed in both
  scripts.)

  - In the first terminal, run `tmux -S <file>`, where <file> names the socket
    tmux will use. This needs to match "tmux=" at the top of `tmux-ctl-loop`.
    (with `tmux=./tmux`, the command would be `tmux -S tmux`)

  - Still in the first terminal (and now also inside tmux), type
    `./run-qemu-loop`, passing the path to qemu if you're using that approach
    (refer to the first few lines of the script). Don't hit enter yet.

  - Now, in the 2nd terminal, type `./tmux-ctl-loop`

  - Hit enter in both terminals.
   

  Rationale for timing of Enter key:

  - Running run-qemu-loop first will start QEMU, and if NTLDR starts
    successfully it will immediately begin counting down from 30. If NT actually
    starts to boot and is then hard-shut-down this /may/ affect the disk image.

  - tmux-ctl-loop will annoyingly spam a continuous stream of 'Start qemu!' until
    run-qemu-loop is running.

  - Starting both scripts at "more or less" the same time (no rush) works out
    well.

  
  Hopefully potential script modifications are obvious; for example

  - changing tmux-ctl-loop to not send 'q' to qemu so you can connect to the HMP
    yourself
    (NB, if `STOP` is not created, when qemu finally exits it will of course
    promptly be relaunched)

  - pointing run-qemu-loop to a modified qemu binary


  == #2: QEMU-vs-VirtualBox image issue ======================

  I was initially completely stumped by this issue, perhaps
  unsurprisingly so. :)

  wnt4ts-broken.img is a perfectly ordinary NT 4 installation that was
  created in QEMU 2.9.0. I created a 1GB disk with `truncate`, picked
  NTFS and installed everything (which took a while).

  NT setup reboots a number of times during the install process, and IIRC
  those all went just fine. However, at some point, the image began to
  consistently bomb out with "A disk read error occurred. ...", and
  stubbornly refused to boot, no matter how many times I tried.

  QEMU 2.0.0, 1.5.0, and 1.0 (the earliest version I was able to build
  on my system) all consistently hit "disk read error occurred".

  I tried compiling QEMU 1.0 using clang so I could build for 32-bit on
  my 64-bit system (GCC 7 died with "Frame pointer required, but
  reserved"). The resulting qemu completely crashed if I didn't enable
  KVM (ie, TCG was (understandably) broken); with KVM enabled qemu
  didn't crash, but NTLDR halted with the same error as on 64-bit qemu.
  (TL;DR, no difference whatsoever.)

  My initial reaction at this point was to try the image on another
  virtualization platform. My first pick was VirtualBox.

  So, I followed the official instructions for pointing VirtualBox to
  physical disk images, except I substituted a /dev/loopN device I'd
  pointed to the image file via losetup.

  And... VirtualBox picked the image up fine and Just Worked(TM). Yay! -
  but not yay. What gives?!

  Confused, I then tried to convert the disk image to VHD format.
  Unfortunately, for some reason, if I try `qemu-img convert ... -O
  vhdx ...`, VirtualBox chokes on the result:

  -----

  VD: error VERR_NOT_SUPPORTED opening image file
  '/.../wnt4ts-broken-qemuconv.vhd' (VERR_NOT_SUPPORTED).

  Result Code: NS_ERROR_FAILURE (0x80004005)
  Component: MediumWrap
  Interface: IMedium {4afe423b-43e0-e9d0-82e8-ceb307940dda}
  Callee: IVirtualBox {0169423f-46b4-cde9-91af-1e9d5b6cd945}
  Callee RC: VBOX_E_OBJECT_NOT_FOUND (0x80BB0001)

  -----

  Welp.

  Well, a bit more digging later, and I found I could do

  $ VBoxManage convertfromraw wnt4ts-broken.img wnt4ts-broken.vhd

  but... as soon as I pointed VirtualBox to this, it too began to choke
  with "A disk read error occurred".

  And yet, the VMDK->raw image setup worked just fine.

  I found I could even replace the loop device with the path of the .img
  file itself and that worked just fine too.

  At my wits' end, I followed some online instructions to learn about
  manual CHS configuration so I could try and get the image working in
  Bochs. "A disk read error occurred". I wasn't surprised.

  It was at this point I began to give up, but I decided to try One Last
  Thing(TM) before properly throwing in the towel.

  :)

  I decided to learn a bit more about how `VBoxManage internalcommands
  createrawvmdk` worked, and try one thing in particular: I can edit the
  .vmdk file, but can I point `createrawvmdk` at the .img file directly
  too?

  Turns out, yes you can.

  It also turns out that this promptly caused VirtualBox to bomb out.

  Interesting.

  For reference, here's the VMDK file I initially created (by pointing
  `createrawvmdk` at /dev/loopN) and then later edited to point straight
  to the .img file, with both approaches resulting in successful boot.

  --8<--------------------------------------------------------

  # Disk DescriptorFile
  version=1
  CID=e35b9a45
  parentCID=ffffffff
  createType="fullDevice"

  # Extent description
  RW 1536000 FLAT "/absolute/full/path/to/wnt4ts-broken.img" 0

  # The disk Data Base 
  #DDB

  ddb.virtualHWVersion = "4"
  ddb.adapterType="ide"
  ddb.geometry.cylinders="1523"
  ddb.geometry.heads="16"
  ddb.geometry.sectors="63"
  ddb.uuid.image="871a6044-c8ca-48ed-b7aa-e6fc49da3db4"
  ddb.uuid.parent="00000000-0000-0000-0000-000000000000"
  ddb.uuid.modification="3661715c-3906-4e4a-ab65-486d140e03b8"
  ddb.uuid.parentmodification="00000000-0000-0000-0000-000000000000"
  ddb.geometry.biosCylinders="761"
  ddb.geometry.biosHeads="32"
  ddb.geometry.biosSectors="63"

  ------------------------------------------------------------

  
  Here's the _diff_ of what happens if I point `createrawvmdk` at wnt4ts-broken.img directly:

  --8<--------------------------------------------------------

  ddb.geometry.cylinders="2080"
  ddb.geometry.heads="16"
  ddb.geometry.sectors="63"

  ------------------------------------------------------------
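As a cross-check on the two descriptor files above, both cylinder counts are consistent with dividing a sector count by heads x sectors-per-track and rounding down. The sketch below is arithmetic only and makes no claim about how `createrawvmdk` actually derives geometry. The function names are made up for illustration, and the 2097152-sector figure assumes the image is exactly 1 GiB of 512-byte sectors.

```shell
# Total sectors addressable by a C/H/S geometry.
chs_sectors() {
    echo $(( $1 * $2 * $3 ))
}

# Largest whole number of cylinders that fits in SECTORS for given heads/spt.
# Usage: cyls_for SECTORS HEADS SECTORS_PER_TRACK
cyls_for() {
    echo $(( $1 / ($2 * $3) ))
}

cyls_for 1536000 16 63   # extent size from the working descriptor
cyls_for 2097152 16 63   # assumed 1 GiB image, in 512-byte sectors
chs_sectors 1523 16 63   # capacity described by the working geometry
```

The first call gives 1523 and the second gives 2080. One possible reading, and it is only a guess, is that the working descriptor was generated while the backing device was 1536000 sectors (750 MiB) long, which would explain why re-running `createrawvmdk` against the full-size image no longer reproduces it.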

  :D

  Naturally,

  $ qemu-system-i386 -drive file=wnt4ts-broken.img,format=raw,cyls=1523,heads=16,secs=63 -M isapc -sdl

  will boot happily on 2.9.0 (notwithstanding the occasional "disk read
  error occurred" documented above).

  It will also boot in 1.6.0.

  (POTENTIAL BUG HEADSUP: 1.0 and 1.5.0 both lock up with a blank
  640x480 window and use 0% CPU if I specify `-M isapc`.)

  And, of course, using these CHS values in Bochs also results in a
  successful boot (after setting the CPU type to pentium).

  Unfortunately, I have no idea what sequence of events caused the
  creation of the VMDK file above. No invocation of `createrawvmdk` is
  producing a VMDK file with the CHS settings above.

  I've only just begun to learn about the intricacies of CHS. Am I to
  understand that these values are stored amongst the first 512 bytes of
  the disk? If this is the case, then I wonder what changed the data,
  and why. I was initially only using QEMU 2.9.0, and didn't move the
  image to different VMs or QEMU versions. Perhaps Windows NT got
  confused about the disk CHS and rewrote it?

  
  == Sporadic BIOS-level boot failure ========================

  I have multiple screenshots of SeaBIOS in QEMU 2.9.0 halting with "No
  bootable device" (et al), even with the above manually-applied CHS
  settings.

  Commit e689f7c also presents such errors.

  Commit 306ec6c does not suffer from intermittent breakage of any kind:

  - No SeaBIOS flake-outs
  - No "Non-system disk or disk error"
  - No "A disk read error occurred"
  - No "General failure ..."

  While most of my confidence in commit 306ec6c is based on anecdotal
  evidence, I modified `tmux-ctl-loop` a little to soak-test BIOS-level
  I/O stability and left this modified version running for a few
  minutes.

  --8<--------------------------------------------------------

  #!/bin/bash

  port=4444

  tmux=./tmux

  printf -v l '%0.0s-' {0..25}
  h1="$l/ buffer dump begin \\$l"
  h2="$l-\\ buffer dump end /-$l"

  while :; do
  	
  	while :; do
  		echo | nc localhost $port -q0 -w1 > /dev/null && break
  		echo 'Start qemu!'
  	done
  	
  	buffer="$(tmux -S $tmux capture-pane; tmux -S $tmux save-buffer -)"
  	
  	echo "$h1"
  	[[ "$buffer" ]] && echo "$buffer" || echo '( * Screen buffer is empty * )'
  	echo "$h2"
  	
  	if echo "$buffer" | grep -q 'Non-system disk' || echo "$buffer" | \
  		grep -q 'No bootable device'
  	then
  		
  		s="<Hit error after $(< itercount) runs.>"
  		echo "$s"
  		echo "$s" >> stats
  		
  		touch STOP
  		
  		#echo q | nc localhost $port -q0 > /dev/null
  		
  		exit
  		
  	elif echo "$buffer" | grep -q 'OS Loader V4.00' || echo "$buffer" | \
  		grep -q 'A disk read error'
  	then
  	
  		echo '<Boot did not hang at BIOS, trying again>'
  		
  		echo q | nc localhost $port -q0 > /dev/null
  		
  	else
  		
  		echo '<Waiting for boot>'
  		
  	fi
  			
  done

  ------------------------------------------------------------

  For the above to work, the top of run-qemu-loop must also be modified
  to read something along the lines of

  disk=/path/to/wnt4ts-broken.img,format=raw,cyls=1523,heads=16,secs=63

  (Suggestion: modify copies of both scripts)

  One small terminal-flicker-headache (and a 57°C CPU) later, I was able
  to carefully observe just over 350 successful runs in which QEMU
  commit 306ec6c only ever produced a boot menu. No other hitches.

  ** Important: **

  However, commit 306ec6c will never boot this image unless the CHS
  geometry is set to the values VirtualBox "discovered". (Of note
  is the fact that QEMU (2.9.0) was what initially created this image. I
  must admit that I don't remember what sequence of QEMU versions I fed
  the image to - and I maybe, possibly, didn't think to back the file up
  (sorry), so maybe something mangled something somewhere. But
  VirtualBox figured it out nonetheless!)

  Furthermore, feeding /dev/loopN to any QEMU version will NOT result in
  correct CHS discovery (and successful boot).

  This is what leads me to conclude that I've discovered two separate
  issues.


  == Appendix: How to build the branches =====================

  It's very simple.

  First, `git clone https://github.com/qemu/qemu` somewhere if you don't
  already have a local copy. If you have an old git checkout that's from
  2014 or later, you can use that old checkout instead. (If you want to
  test an old checkout you have, the commands below will either work
  perfectly or completely bomb out with no side effects.)

  A full checkout is a ~183MB download. Sorry.

  Next, create two new directories somewhere. Name them what you like,
  eg `qemu-working` and `qemu-broken`.

  Now, cd into the checkout directory, and run:

  $ git archive 306ec6c3cece7004429c79c1ac93d49919f1f1cc | tar xC /path/to/qemu-working/

  $ git archive e689f7c668cbd9d08f330e17c3dd3a059c9553d3 | tar xC /path/to/qemu-broken/

  The paths can be relative.

  Now, run this in both of the new directories:

  $ ./configure --python=python2.7 --disable-libssh2 --disable-seccomp \
      --disable-usb-redir --disable-guest-agent --disable-libiscsi \
      --disable-spice --disable-smartcard-nss --disable-vhost-net \
      --disable-docs --disable-attr --disable-cap-ng --disable-vde \
      --disable-user --disable-bluez --disable-vnc-ws --disable-xen \
      --disable-brlapi --enable-debug --target-list=i386-softmmu \
      --disable-fdt

  $ make -j64

  You can open two terminals and configure and build both simultaneously
  if you like.

  On my decent but very basic (2-core+HT) i3 box, -j64 actually works out - make doesn't actually launch too many gcc processes. You *will* see your system load spike to ~20 though :)
  (NB. Do. not. use. -j64. with. the. linux. kernel.)

  On my system, a single build with -j64 takes only about 35 seconds. C
  FTW. (Although this has increased to 1min20sec for more recent
  builds.)

  Most of the configure arguments remove functionality I'll never use
  (in this situation) and which will only slow down the build.

  Once QEMU is built, run qemu-system-i386 directly from where it has
  been built.

  $ /path/to/qemu-working/i386-softmmu/qemu-system-i386 ...
  $ /path/to/qemu-broken/i386-softmmu/qemu-system-i386 ...

  Again, the paths can be relative.

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1745312/+subscriptions


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug 1745312] Re: Regression report: Disk subsystem I/O failures/issues surfacing in DOS/early Windows [two separate issues: one bisected, one root-caused]
  2018-01-25  7:18 [Qemu-devel] [Bug 1745312] [NEW] Regression report: Disk subsystem I/O failures/issues surfacing in DOS/early Windows [two separate issues: one bisected, one root-caused] i336_
                   ` (8 preceding siblings ...)
  2021-04-30 16:49 ` Thomas Huth
@ 2022-07-07 21:03 ` Lev Kujawski
  9 siblings, 0 replies; 11+ messages in thread
From: Lev Kujawski @ 2022-07-07 21:03 UTC (permalink / raw)
  To: qemu-devel

Hi,

Thanks to everyone who contributed information to this report. As for
issue #1 from David, I cannot reproduce the intermittent MS-DOS or
Windows NT 4 I/O failures with the latest git revision (a74c66b1). I am
similarly unable to reproduce Mdasoh's issue.

For the NT 4 testing script, I had to substitute '-display curses' for
'-curses' to accommodate the changes in QEMU, and match against 'Please
select' from the boot loader menu rather than 'OS Loader V4.00', which
disappears too quickly.
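
The adjusted matching can be sketched as a small classifier over the captured screen text. This is an illustration rather than the exact change made to the harness: `classify_screen` is a hypothetical helper, and only the two strings named in this thread are taken from the source.

```shell
# Classify a captured screen buffer into one of three states.
classify_screen() {
    case $1 in
        *'A disk read error occurred'*) echo fail ;;     # NTLDR bootstrap failed
        *'Please select'*)              echo booted ;;   # boot loader menu reached
        *)                              echo waiting ;;  # nothing conclusive yet
    esac
}

classify_screen 'Please select the operating system to start:'
```

Matching on text that stays on screen for the whole menu countdown avoids depending on a banner that may be gone before the next capture.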

For issue #2, the root seems to be that both SeaBIOS and QEMU default to
LARGE/ECHS disk translation for small disks (<4 GiB). If you apply the
patch at

https://patchew.org/QEMU/20220707204045.999544-1-lkujaw@member.fsf.org/

you should be able to get to the NT 4 boot loader using

qemu-system-i386 \
    -blockdev node-name=hda,driver=file,filename=./wnt4ts-broken.img \
    -device ide-hd,drive=hda,bus=ide.0,unit=0,bios-chs-trans=lba
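
To make the translation issue concrete: one common form of LARGE (ECHS) translation halves the cylinder count and doubles the head count until the cylinders fit within the 1024 that INT 13h can address. The sketch below is a generic illustration under that assumption, not code taken from SeaBIOS or QEMU. `echs_translate` is a hypothetical name and the 128-head cap is an assumption of this sketch.

```shell
# Bit-shift (ECHS-style) translation of a physical C/H/S geometry.
# Usage: echs_translate CYLINDERS HEADS SECTORS_PER_TRACK
echs_translate() {
    c=$1 h=$2 s=$3
    # Halve cylinders and double heads until cylinders fit in 1024.
    while [ "$c" -gt 1024 ] && [ "$h" -lt 128 ]; do
        c=$(( c / 2 ))
        h=$(( h * 2 ))
    done
    echo "$c $h $s"
}

echs_translate 1523 16 63
echs_translate 2080 16 63
```

For 1523/16/63 this yields 761/32/63, which matches the `ddb.geometry.bios*` values in the VMDK descriptor quoted in the report. For 2080/16/63 it yields 520/64/63.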

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1745312

Title:
  Regression report: Disk subsystem I/O failures/issues surfacing in
  DOS/early Windows [two separate issues: one bisected, one root-caused]

Status in QEMU:
  Expired

Bug description:
  [Headsup: This report is long-ish due to the amount of detail I've
  stumbled on along the way that I think is relevant to include. I can't
  speak as to the complexity of the actual bugs, but the size of this
  report should not suggest that the reproduction process is
  particularly headache-inducing.]

  Hi!

  I recently needed to fire up some ancient software for research
  purposes and got very distracted discovering and playing with old
  versions of Windows :). In the process I've discovered some glitches
  with disk I/O.

  I believe I've stumbled on two completely separate issues that
  coincidentally surfaced at the same time. It's possible that
  components of this report will be re-filed as more specific new bugs,
  but I'm not an authority on QEMU internals or how to narrow
  down/categorize what I've found.

  - The first bug only surfaces when the "isapc" machine type is used.
  It intermittently produces "General failure {read,writ}ing drive _"
  under MS-DOS 6.22, and also somehow interferes with early bootstrap of
  Windows NT 4 (in NTLDR). Enabling or disabling KVM (I'm on Linux)
  appears to make no difference whatsoever, which may help with
  debugging.

  - The second issue involves
    - a WinNT4 disk image
    - created by running through a bog-standard NT4 install inside QEMU 2.9.0
    - which will now fail to boot in any version of QEMU - even version 1.0
      - but which VirtualBox will boot fine
        - but only if I point VirtualBox at QEMU's raw disk image via a
          hacked-together VMDK file
        - if the raw image is converted to VHD(X), VirtualBox will also fail
          to boot the image with exactly the same error as QEMU
        - this state of affairs is not affected by image sparseness (which makes
          sense)

  I'm confident I've bisected the first issue.

  I wasn't able to bisect the second issue (as all tested versions of
  QEMU behaved identically), but I've figured out a working repro
  testcase and I believe I've managed to pin down a solid root cause.


  == #1: Intermittent I/O issues when `-M isapc` is used =====

  These symptoms sometimes take a small amount of time and fiddling to
  trigger, but I AM able to consistently surface them on my machine
  after a short while. (I am very very interested to hear if others
  cannot reproduce them.)

  So, first of all:

  https://github.com/qemu/qemu/commit/306ec6c3cece7004429c79c1ac93d49919f1f1cc
    (Jul 30 2013): the last version that works

  https://github.com/qemu/qemu/commit/e689f7c668cbd9d08f330e17c3dd3a059c9553d3
    (Oct 30 2013): the first version that intermittently fails

  Maybe lift out and build these branches while reading. *shrug*
  (How to do this can be found at the end of this report - along with a time-saving ./configure line, FWIW)

  Here are the changelists between these two revisions:

  https://github.com/qemu/qemu/compare/306ec6c...e689f7c
  (Compare direction: OLD to NEW) (Commits: 166  Files changed: 192)

  https://github.com/qemu/qemu/compare/e689f7c...306ec6c
  (Compare direction: NEW to OLD) (Commits: 30   Files changed: 22)

  (Someone else more familiar with Git might know why GitHub returns
  results for both compare directions, and/or if the 2nd link is useful
  information. The first link returns a lot more results than the 2nd
  one, at least. Does comparing new>old return deletions?)
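
On the compare question: as far as I know, GitHub's compare view lists the commits reachable from the right-hand side that are not reachable from the left-hand side. Non-empty results in both directions would then mean that neither commit is an ancestor of the other, i.e. the two sit on diverged lines of history. The sketch below shows how this could be checked locally. The helper names are made up, and it assumes it is run inside a QEMU checkout that contains both commits.

```shell
# Commits reachable from $2 but not from $1 (what `git log $1..$2` lists).
commits_only_in() {
    git rev-list --count "$1..$2"
}

# Summarise how two commits relate to each other.
compare_both_ways() {
    echo "only in $2: $(commits_only_in "$1" "$2")"
    echo "only in $1: $(commits_only_in "$2" "$1")"
    echo "merge base: $(git merge-base "$1" "$2")"
}

# Inside a QEMU checkout:
#   compare_both_ways 306ec6c e689f7c
```

If both counts are non-zero, the "last good" and "first bad" commits do not bound a simple linear range. `git bisect` copes with that case by testing the merge base first.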

  ---

  Now on to the symptoms. In a moment I'll describe reproduction.

  # MS-DOS 6.22

  The first symptom I discovered was that trivial read and write
  operations under MS-DOS would sometimes fail:

    C:\>echo test > hi

    General failure writing drive C
    Abort, Retry, Fail?

  Anything else that exercises the disk behaves similarly:

    C:\>dir /s > nul

    General failure reading drive C
    Abort, Retry, Fail?

  (Note that the above demonstrates both write and read failures)

  (Also, FWIW, `dir /s` == `ls -R`)

  The behavior of the I/O errors is not possible to characterise as it
  fluctuates so much. For example something as simple as DIR can produce
  wildly differing results: in one run, poking around with DIR ended
  with DOS deciding C:\ was empty at one point; at another point in a
  different run C:\ mysteriously dropped 50% of its contents only to
  magically gain it all back moments later after some poking around in
  one of the subdirectories that was still visible.

  The time it takes to trigger these errors is also highly variable.
  QEMU may fall over as early as hanging forever at "Starting
  MS-DOS...", or I might get all the way into Windows 3.1 before it
  triggers (in which case Win3.1 reports vague memory errors - of all
  things).

  Very occasionally I've seen _SeaBIOS itself_ report "Booting from Hard
  Disk..." "Boot failed: could not read the boot disk" ... "No bootable
  device.", and on one occasion I even got "Non-System disk or disk
  error" "Replace and strike any key when ready"!

  
  # WinNT 4 Terminal Server

  Most of the time, NTLDR will fire up normally. But every so often...

    SeaBIOS (version rel-1.7.3-117-g31b8b4e-20131206_080705-nilsson.home.kraxel.org)

    Booting from Hard Disk...
    A disk read error occurred.
    Insert a system diskette and restart
    the system.

  (NB. You're seeing the old SeaBIOS version included with e689f7c,
  which was the first buggy commit.)

  If NT gets past this point without erroring out (ie, it makes it to
  the boot menu), the rest of the system is 100% fine and there are no
  other disk I/O issues whatsoever. For example, on QEMU 2.9.0 I was
  able to enable disk compression, answer "Yes" to "Compress entire disk
  now?" and have the process fully complete. No hitches.

  This makes me wonder whether this could be related to LBA and/or Int
  13h, or something floating around near that bunch of functionality.
  (I'm woefully ignorant about such low-level details.) Perhaps
  DOS/Win3.1 are stuck using a disk mode that QEMU has a buggy
  implementation of, while NT 4 (once NTOSKRNL is up and running) is
  able to use a different disk mode or access mechanism.

  I'm really interested to get some understanding of what the root issue
  is here, when this is fixed. (I wonder if it's a timing thing?)

  I've observed some unusual behavior with repeated restarts. In one
  case, I attempted to start NT4 multiple times, and QEMU consistently
  failed with "No bootable device" each time. So, I removed `-M isapc`,
  promptly got a boot menu, hit ^C, readded `-M isapc` - and continued
  to get a boot menu. Yep. I'll accept "really really big coincidence"
  but I do very much wonder if something else is going on here. I've
  observed many similar incidents. It makes me wonder whether the
  contents of memory or some other system state is an influence. Very
  probably not, but still...


  -- Reproduction --------------------------------------------

  First of all, there was unfortunately no way for me to avoid having to
  post entire disk images, but I've managed to compress everything down
  to 174MB total download size.

  FWIW, WinWorld and many other sites seem to have no operational issues
  providing clear pointers to CD keys; I consider my distribution of my
  installed HDD images an extension of the apparent status quo.

  That being said, I've put everything on Google Drive so nobody has to
  headscratch about Launchpad/Canonical/etc's stance on hosting this
  data.

  So, this folder contains the disk images:
  https://drive.google.com/drive/folders/1WdZVBh5Trs9HLC186-nHyKeqaSxfyM2c
  ("Download all" at the top-right will create a ZIP file, but FWIW
  downloading the individual files simultaneously would implement a
  rough form of download acceleration)

  File meta info:

  Compressed
  |
  |      Apparent
  |      |    Actual
  |      |    |
  38M -> 200M (103M)  win31.img.xz
  82M -> 1G   (289M)  wnt4ts-broken.img.xz
  55M -> 350M (146M)  wnt4ts-intermittent.img.xz

  SHA-256s:

  win31:        8179b8180a2ab40bd472e8a2f3fb89fc331651e56923f94ceb9e52a78ee220d2
  broken:       a2af5f0bc49a063b75f534b6ffe5b82e32ecc706a64a425b6626feccf6e3fdfa  
  intermittent: 77ae8c458829ebcdd64c71042012f45d5a2788e6ebd22db9d53de9ef1a574784

  (Wanted to keep the checksum lines within 80 columns)
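
A download can be checked against the list above along these lines. The report does not say whether the checksums were taken over the compressed `.xz` files or over the decompressed images, so it may be necessary to try both. `verify_sha256` is a hypothetical helper built on coreutils' `sha256sum`.

```shell
# Succeeds (exit status 0) when FILE hashes to EXPECTED.
# Usage: verify_sha256 FILE EXPECTED_HEX_DIGEST
verify_sha256() {
    actual=$(sha256sum "$1" | cut -d' ' -f1)
    [ "$actual" = "$2" ]
}

# Example:
#   verify_sha256 win31.img.xz \
#       8179b8180a2ab40bd472e8a2f3fb89fc331651e56923f94ceb9e52a78ee220d2 \
#       && echo OK || echo MISMATCH
```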

  And, since I can't figure out where else in this report to put this,
  wnt4ts-broken.img's password is "admin" but something seems to have
  happened to the disk and NT doesn't actually boot properly :(, and
  wnt4ts-intermittent.img's password is "1234". (These were set up as
  test images. Now I'm _really_ glad I used simple passwords! :) )

  ---

  
  I have two testcases: DOS 6.22 (+ Windows 3.1), and Windows NT 4.

  
  # MS-DOS

  DOS is the simplest. It basically consists of

  $ qemu-system-i386 -drive file=win31.img,format=raw -M isapc -enable-kvm

  And then literally just playing around. Things to try include creating
  files (`echo blah > file`), repeatedly seeking across the entire FAT
  (`dir /s > nul` or `dir /s`), and launching Windows (`win`).

  win31.img is not special (as far as I can tell) and merely consists of
  the result of installing DOS 6.22 and Windows 3.1 from WinWorldPC.
  I've basically just included the image for convenience.

  Generally no single "run" is immune to starting Win3.1 and then
  launching File Manager; if that doesn't generate an error, something
  is definitely up.

  The second best trigger is creating new files. That very very
  frequently produces "General Failure ...", but not always.

  
  # WinNT 4

  Windows NT 4 is a bit more complicated. Because this error only occurs
  at presumably a single small point very early in boot, the window of
  opportunity for the glitch to surface within is much much narrower and
  thus often requires a larger number of tries.

  Anecdotally I've had QEMU hit the boot error at the first try/run, and
  after as many as 63 "successful" boots.

  I made a small test harness that automates the launch process. It
  consists of two shell scripts and requires tmux (and netcat).
  (*Potential epilepsy warning*: if you use a light-colored terminal
  background, the terminal QEMU is repeatedly invoked from will
  continuously flash rapidly from white to black.)

  One of the scripts is run inside a tmux session in one terminal, while
  the other script is run in its own terminal (without any tmux).

  
  I named this one `run-qemu-loop`:

  --8<--------------------------------------------------------

  #!/bin/bash

  # ---

  qemu=/path/to/qemu-system-i386
  #or, alternatively: (I used the following line myself so I
  #could tab-complete my way to different qemu executables)
  #qemu="$1"

  disk=/path/to/wnt4ts-intermittent.img

  # ---

  port=4444

  rm -f STOP itercount

  itercount=0

  while :; do
  	
  	[ -f STOP ] && break
  	
  	((itercount++))
  	echo $itercount > itercount
  	
  	$qemu \
  		-enable-kvm -vga cirrus -curses -M isapc \
  		-drive file="$disk",format=raw \
  		-chardev socket,id=mon0,host=localhost,port=$port,server,nowait \
  		-mon chardev=mon0,mode=readline
  	
  	#point to an otherwise-unused terminal if you like (see also: `tty`)
  	#echo "$itercount run(s)" > /dev/pts/__
  	
  done

  ------------------------------------------------------------

  Not much logic above; this just repeatedly runs QEMU for as long as
  the file `STOP` does not exist in the current directory.

  The key "magic" bit is that QEMU is launched in -curses mode.

  The other key bit is that the above script is run inside tmux.

  
  Here's `tmux-ctl-loop`:

  --8<--------------------------------------------------------

  #!/bin/bash

  port=4444

  tmux=./tmux

  printf -v l '%0.0s-' {0..25}
  h1="$l/ buffer dump begin \\$l"
  h2="$l-\\ buffer dump end /-$l"

  while :; do
  	
  	while :; do
  		echo | nc localhost $port -q0 -w1 > /dev/null && break
  		echo 'Start qemu!'
  	done
  	
  	buffer="$(tmux -S $tmux capture-pane; tmux -S $tmux save-buffer -)"
  	
  	echo "$h1"
  	[[ "$buffer" ]] && echo "$buffer" || echo '( * Screen buffer is empty * )'
  	echo "$h2"
  	
  	if echo "$buffer" | grep -q 'A disk read error occurred.'; then
  		
  		s="<Crashed after $(< itercount) runs.>"
  		echo "$s"
  		echo "$s" >> stats
  		
  		touch STOP
  		
  		#echo q | nc localhost $port -q0 > /dev/null
  		
  		exit
  		
  	elif echo "$buffer" | grep -q 'OS Loader V4.00'; then
  			
  		echo '<Booted successfully, trying again>'
  		
  		echo q | nc localhost $port -q0 > /dev/null
  		
  	else
  		
  		echo '<Waiting for boot>'
  		
  	fi
  			
  done

  ------------------------------------------------------------

  Nothing particularly amazing going on here either.

  While `run-qemu-loop` is running inside tmux in the first terminal,
  this is running in the 2nd one.

  The small infinite loop at the top only breaks when it can
  successfully ping QEMU and it knows it's running.

  Then, a screendump of the contents of the terminal QEMU is in is
  fetched from tmux, and the buffer's content is analyzed.

  - If NTLDR fails, the script creates `STOP` to halt run-qemu-loop and
    then exits. QEMU is left running with the error on screen: the line
    that would send `q` to it through netcat is commented out.

  - If NTLDR loads successfully, the script sends `q` to QEMU and continues
    looping. (run-qemu-loop will not find the `STOP` file, and so restarts qemu.)

  The scripts run very quickly, with 2-3 iterations per second on my i3
  box.


  # Usage

  Save the two scripts above to the same directory as wnt4ts-intermittent.img,
  then:

  - (If port 4444 doesn't work, the value needs to be changed in both
  scripts.)

  - In the first terminal, run `tmux -S <file>`, where <file> names the socket
    tmux will use. This needs to match "tmux=" at the top of `tmux-ctl-loop`.
    (with `tmux=./tmux`, the command would be `tmux -S tmux`)

  - Still in the first terminal (and now also inside tmux), type
    `./run-qemu-loop`, passing the path to qemu if you're using that approach
    (refer to the first few lines of the script). Don't hit enter yet.

  - Now, in the 2nd terminal, type `./tmux-ctl-loop`

  - Hit enter in both terminals.
   

  Rationale for timing of Enter key:

  - Running run-qemu-loop first will start QEMU, and if NTLDR starts
    successfully it will immediately begin counting down from 30. If NT actually
    starts to boot and is then hard-shut-down, this /may/ affect the disk image.

  - tmux-ctl-loop will annoyingly spam a continuous stream of 'Start qemu!' until
    run-qemu-loop is running.

  - Starting both scripts at "more or less" the same time (no rush) works out
    well.

  
  Hopefully potential script modifications are obvious; for example

  - changing tmux-ctl-loop to not send 'q' to qemu so you can connect to the HMP
    yourself
    (NB, if `STOP` is not created, when qemu finally exits it will of course
    promptly be relaunched)

  - pointing run-qemu-loop to a modified qemu binary


  == #2: QEMU-vs-VirtualBox image issue ======================

  I was initially completely stumped by this issue, perhaps
  unsurprisingly so. :)

  wnt4ts-broken.img is a perfectly ordinary NT 4 installation that was
  created in QEMU 2.9.0. I created a 1GB disk with `truncate`, picked
  NTFS and installed everything (which took a while).

  NT setup reboots a number of times during the install process, and IIRC
  those all went just fine. However, at some point, the image began to
  consistently bomb out with "A disk read error occurred. ...", and
  stubbornly refused to boot, regardless of the number of boot attempts
  I tried.

  QEMU 2.0.0, 1.5.0, and 1.0 (the earliest version I was able to build
  on my system) all consistently hit "disk read error occurred".

  I tried compiling QEMU 1.0 using clang so I could build for 32-bit on
  my 64-bit system (GCC 7 died with "Frame pointer required, but
  reserved"). The resulting qemu completely crashed if I didn't enable
  KVM (ie, TCG was (understandably) broken); with KVM enabled qemu
  didn't crash, but NTLDR halted with the same error as on 64-bit qemu.
  (TL;DR, no difference whatsoever.)

  My initial reaction at this point was to try the image on another
  virtualization platform. My first pick was VirtualBox.

  So, I followed the official instructions for pointing VirtualBox to
  physical disk images, except I substituted a /dev/loopN device I'd
  pointed to the image file via losetup.

  And... VirtualBox picked the image up fine and Just Worked(TM). Yay! -
  but not yay. What gives?!

  Confused, I then tried to convert the disk image to VHD format.
  Unfortunately, for some reason, if I try `qemu-img convert ... -O
  vhdx ...`, VirtualBox chokes on the result:

  -----

  VD: error VERR_NOT_SUPPORTED opening image file
  '/.../wnt4ts-broken-qemuconv.vhd' (VERR_NOT_SUPPORTED).

  Result Code: NS_ERROR_FAILURE (0x80004005)
  Component: MediumWrap
  Interface: IMedium {4afe423b-43e0-e9d0-82e8-ceb307940dda}
  Callee: IVirtualBox {0169423f-46b4-cde9-91af-1e9d5b6cd945}
  Callee RC: VBOX_E_OBJECT_NOT_FOUND (0x80BB0001)

  -----

  Welp.

  Well, a bit more digging later, and I found I could do

  $ VBoxManage convertfromraw wnt4ts-broken.img wnt4ts-broken.vhd

  but... as soon as I pointed VirtualBox to this, it too began to choke
  with "A disk read error occurred".

  And yet, the VMDK->raw image setup worked just fine.

  I found I could even replace the loop device with the path of the .img
  file itself and that worked just fine too.

  At my wits' end, I followed some online instructions to learn about
  manual CHS configuration so I could try and get the image working in
  Bochs. "A disk read error occurred". I wasn't surprised.

  It was at this point I began to give up, but I decided to try One Last
  Thing(TM) before properly throwing in the towel.

  :)

  I decided to learn a bit more about how `VBoxManage internalcommands
  createrawvmdk` worked, and try one thing in particular: I can edit the
  .vmdk file, but can I point `createrawvmdk` at the .img file directly
  too?

  Turns out, yes you can.

  It also turns out that this promptly caused VirtualBox to bomb out.

  Interesting.

  For reference, here's the VMDK file I initially created (by pointing
  `createrawvmdk` at /dev/loopN) and then later edited to point straight
  to the .img file, with both approaches resulting in successful boot.

  --8<--------------------------------------------------------

  # Disk DescriptorFile
  version=1
  CID=e35b9a45
  parentCID=ffffffff
  createType="fullDevice"

  # Extent description
  RW 1536000 FLAT "/absolute/full/path/to/wnt4ts-broken.img" 0

  # The disk Data Base 
  #DDB

  ddb.virtualHWVersion = "4"
  ddb.adapterType="ide"
  ddb.geometry.cylinders="1523"
  ddb.geometry.heads="16"
  ddb.geometry.sectors="63"
  ddb.uuid.image="871a6044-c8ca-48ed-b7aa-e6fc49da3db4"
  ddb.uuid.parent="00000000-0000-0000-0000-000000000000"
  ddb.uuid.modification="3661715c-3906-4e4a-ab65-486d140e03b8"
  ddb.uuid.parentmodification="00000000-0000-0000-0000-000000000000"
  ddb.geometry.biosCylinders="761"
  ddb.geometry.biosHeads="32"
  ddb.geometry.biosSectors="63"

  ------------------------------------------------------------

  
  Here's the _diff_ of what happens if I point `createrawvmdk` at wnt4ts-broken.img directly:

  --8<--------------------------------------------------------

  ddb.geometry.cylinders="2080"
  ddb.geometry.heads="16"
  ddb.geometry.sectors="63"

  ------------------------------------------------------------

  :D

  Naturally,

  $ qemu-system-i386 -drive file=wnt4ts-broken.img,format=raw,cyls=1523,heads=16,secs=63 -M isapc -sdl

  will boot happily on 2.9.0 (notwithstanding the occasional "disk read
  error occurred" documented above).

  It will also boot in 1.6.0.

  (POTENTIAL BUG HEADSUP: 1.0 and 1.5.0 both lock up with a blank
  640x480 window and use 0% CPU if I specify `-M isapc`.)

  And, of course, using these CHS values in Bochs also results in a
  successful boot (after setting the CPU type to pentium).

  Unfortunately, I have no idea what sequence of events caused the
  creation of the VMDK file above. No invocation of `createrawvmdk` is
  producing a VMDK file with the CHS settings above.

  I've only just begun to learn about the intricacies of CHS. Am I to
  understand that these values are stored amongst the first 512 bytes of
  the disk? If this is the case, then I wonder what changed the data,
  and why. I was initially only using QEMU 2.9.0, and didn't move the
  image to different VMs or QEMU versions. Perhaps Windows NT got
  confused about the disk CHS and rewrote it?
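
On where CHS information lives: the drive geometry itself is not a field in the first sector, but each MBR partition entry carries packed start and end CHS addresses, written using whatever geometry the partitioning tool assumed at the time. A rough way to inspect them is sketched below. The helper names are hypothetical, and the offsets are the standard MBR layout (first partition entry at byte 446), not anything specific to this image.

```shell
# Decode a packed 3-byte CHS address (head, sector/cyl-high, cyl-low).
decode_chs() {
    cyl=$(( (($2 & 0xC0) << 2) | $3 ))   # top 2 bits of byte 2 extend the cylinder
    sec=$(( $2 & 0x3F ))                 # low 6 bits of byte 2 are the sector
    echo "C=$cyl H=$1 S=$sec"
}

# Print start/end CHS of the first partition entry of a raw image.
first_partition_chs() {
    # od prints the 16 bytes of the entry as unsigned decimal values.
    set -- $(od -An -v -tu1 -j446 -N16 "$1")
    echo "start: $(decode_chs "$2" "$3" "$4")"
    echo "end:   $(decode_chs "$6" "$7" "$8")"
}

# Example:
#   first_partition_chs wnt4ts-broken.img
```

The boot sector of a FAT or NTFS partition additionally records heads and sectors per track in its BIOS parameter block, so boot code that uses CHS reads can end up depending on the geometry that was in effect when the partition was formatted.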

  
  == Sporadic BIOS-level boot failure ========================

  I have multiple screenshots of SeaBIOS in QEMU 2.9.0 halting with "No
  bootable device" (et al), even with the above manually-applied CHS
  settings.

  Commit e689f7c also presents such errors.

  Commit 306ec6c does not suffer from intermittent breakage of any kind:

  - No SeaBIOS flake-outs
  - No "Non-system disk or disk error"
  - No "A disk read error occurred"
  - No "General failure ..."

  While most of my confidence in commit 306ec6c is based on anecdotal
  evidence, I modified `tmux-ctl-loop` a little to soak-test BIOS-level
  I/O stability and left this modified version running for a few
  minutes.

  --8<--------------------------------------------------------

  #!/bin/bash

  port=4444

  tmux=./tmux

  printf -v l '%0.0s-' {0..25}
  h1="$l/ buffer dump begin \\$l"
  h2="$l-\\ buffer dump end /-$l"

  while :; do
  	
  	while :; do
  		echo | nc localhost $port -q0 -w1 > /dev/null && break
  		echo 'Start qemu!'
  	done
  	
  	buffer="$(tmux -S $tmux capture-pane; tmux -S $tmux save-buffer -)"
  	
  	echo "$h1"
  	[[ "$buffer" ]] && echo "$buffer" || echo '( * Screen buffer is empty * )'
  	echo "$h2"
  	
  	if echo "$buffer" | grep -q 'Non-system disk' || echo "$buffer" | \
  		grep -q 'No bootable device'
  	then
  		
  		s="<Hit error after $(< itercount) runs.>"
  		echo "$s"
  		echo "$s" >> stats
  		
  		touch STOP
  		
  		#echo q | nc localhost $port -q0 > /dev/null
  		
  		exit
  		
  	elif echo "$buffer" | grep -q 'OS Loader V4.00' || echo "$buffer" | \
  		grep -q 'A disk read error'
  	then
  	
  		echo '<Boot did not hang at BIOS, trying again>'
  		
  		echo q | nc localhost $port -q0 > /dev/null
  		
  	else
  		
  		echo '<Waiting for boot>'
  		
  	fi
  			
  done

  ------------------------------------------------------------

  For the above to work, the top of run-qemu-loop must also be modified
  to read something along the lines of

  disk=/path/to/wnt4ts-broken.img,format=raw,cyls=1523,heads=16,secs=63

  (Suggestion: modify copies of both scripts)

  One small terminal-flicker-headache (and a 57°C CPU) later, I was able
  to carefully observe just over 350 successful runs in which QEMU
  commit 306ec6c only ever produced a boot menu. No other hitches.
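
To put the 350 clean runs in perspective: if each boot failed independently with probability p, the chance of seeing n clean boots in a row would be (1 - p)^n. The sketch below just evaluates that expression. The independence assumption and any value chosen for p are illustrative, since the report gives no measured failure rate, and `p_all_clean` is a made-up name.

```shell
# Probability of N consecutive clean runs given per-run failure probability P.
# Usage: p_all_clean P N
p_all_clean() {
    awk -v p="$1" -v n="$2" 'BEGIN { printf "%.4f\n", (1 - p) ^ n }'
}

p_all_clean 0.02 350   # hypothetical 2% failure rate per boot
```

Even a hypothetical failure rate as low as 2% per boot would make 350 consecutive clean runs very unlikely (well under 1%), so the soak test is reasonable evidence that 306ec6c does not share the intermittent failure.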

  ** Important: **

  However, commit 306ec6c will never boot this image unless the CHS
  geometry is set to the values VirtualBox "discovered". (Of note
  is the fact that QEMU (2.9.0) was what initially created this image. I
  must admit that I don't remember what sequence of QEMU versions I fed
  the image to - and I maybe, possibly, didn't think to back the file up
  (sorry), so maybe something mangled something somewhere. But
  VirtualBox figured it out nonetheless!)

  Furthermore, feeding /dev/loopN to any QEMU version will NOT result in
  correct CHS discovery (and successful boot).

  This is what leads me to conclude that I've discovered two separate
  issues.
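
  For reference, a geometry like the one in the `disk=` line above
  (cyls=1523,heads=16,secs=63) can be multiplied out and compared with
  the size of the image file. The sketch below assumes 512-byte sectors;
  `chs_bytes` is a hypothetical helper name, not something from the
  scripts in this report.

```shell
# Hypothetical helper: number of bytes addressable through a C/H/S
# geometry, ASSUMING 512-byte sectors.
# Usage: chs_bytes <cyls> <heads> <secs>
chs_bytes() {
	echo $(( $1 * $2 * $3 * 512 ))
}

chs_bytes 1523 16 63    # prints 786014208

# To get the image's actual size for comparison (GNU stat):
#   stat -c %s /path/to/wnt4ts-broken.img
```

  A raw image may be slightly larger than this figure if it does not end
  exactly on a cylinder boundary.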


  == Appendix: How to build the branches =====================

  It's very simple.

  First, `git clone https://github.com/qemu/qemu` somewhere if you don't
  already have a local copy. If you have an old git checkout that's from
  2014 or later, you can use that old checkout instead. (If you want to
  test an old checkout you have, the commands below will either work
  perfectly or completely bomb out with no side effects.)

  A full checkout is a ~183MB download. Sorry.

  Next, create two new directories somewhere. Name them what you like,
  eg `qemu-working` and `qemu-broken`.

  Now, cd into the checkout directory, and run:

  $ git archive 306ec6c3cece7004429c79c1ac93d49919f1f1cc | \
      tar xC /path/to/qemu-working/

  $ git archive e689f7c668cbd9d08f330e17c3dd3a059c9553d3 | \
      tar xC /path/to/qemu-broken/

  The paths can be relative.
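
  If you are reusing an old checkout, it can be checked for each commit
  before anything is written. The following is a sketch only:
  `export_commit` is a hypothetical wrapper name, and it relies on
  `git -C` and `git cat-file -e`, so it assumes a reasonably recent git.

```shell
# Hypothetical helper: export one commit from a checkout into a directory,
# refusing to create anything if the checkout lacks that commit.
# Usage: export_commit <checkout> <commit> <dest>
export_commit() {
	if ! git -C "$1" cat-file -e "$2^{commit}" 2>/dev/null; then
		echo "commit $2 not found in $1" >&2
		return 1
	fi
	mkdir -p "$3" && git -C "$1" archive "$2" | tar -xC "$3"
}

# Example calls (run from inside the checkout):
#   export_commit . 306ec6c3cece7004429c79c1ac93d49919f1f1cc /path/to/qemu-working
#   export_commit . e689f7c668cbd9d08f330e17c3dd3a059c9553d3 /path/to/qemu-broken
```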

  Now, run this in both of the new directories:

  $ ./configure --python=python2.7 --disable-libssh2 --disable-seccomp \
      --disable-usb-redir --disable-guest-agent --disable-libiscsi \
      --disable-spice --disable-smartcard-nss --disable-vhost-net \
      --disable-docs --disable-attr --disable-cap-ng --disable-vde \
      --disable-user --disable-bluez --disable-vnc-ws --disable-xen \
      --disable-brlapi --enable-debug --target-list=i386-softmmu \
      --disable-fdt

  $ make -j64

  You can open two terminals and configure and build both simultaneously
  if you like.

  On my decent but very basic (2-core+HT) i3 box, -j64 works out fine -
  make doesn't actually launch too many gcc processes. You *will* see
  your system load spike to ~20 though :)
  (NB. Do. not. use. -j64. with. the. linux. kernel.)

  On my system, a single build with -j64 takes only about 35 seconds. C
  FTW. (Although this has increased to 1min20sec for more recent
  builds.)
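
  If -j64 feels too aggressive for your machine, the job count can be
  taken from the number of online CPUs instead. A small sketch, assuming
  either `nproc` (GNU coreutils) or `getconf` is available; `build_jobs`
  is a hypothetical name used only here.

```shell
# Hypothetical helper: choose a -j value from the number of online CPUs.
# Tries nproc first, then getconf, and falls back to 1 if both fail.
build_jobs() {
	nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1
}

# Example:
#   make -j"$(build_jobs)"
```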

  Most of the configure arguments remove functionality I'll never use
  (in this situation) and which will only slow down the build.

  Once QEMU is built, run qemu-system-i386 directly from where it has
  been built.

  $ /path/to/qemu-working/i386-softmmu/qemu-system-i386 ...
  $ /path/to/qemu-broken/i386-softmmu/qemu-system-i386 ...

  Again, the paths can be relative.
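
  Before starting a long test run it may be worth confirming that both
  builds actually produced a binary. A minimal sketch; `qemu_version` is
  a hypothetical helper, and the directory layout is the one shown
  above.

```shell
# Hypothetical helper: print the first line of --version output for the
# i386 binary of one build, or fail if no executable is there.
# Usage: qemu_version <build-dir>
qemu_version() {
	bin=$1/i386-softmmu/qemu-system-i386
	[ -x "$bin" ] || { echo "no executable at $bin" >&2; return 1; }
	"$bin" --version | head -n 1
}

# Example:
#   qemu_version /path/to/qemu-working
#   qemu_version /path/to/qemu-broken
```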

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1745312/+subscriptions




end of thread, other threads:[~2022-07-07 21:13 UTC | newest]

Thread overview: 11+ messages
2018-01-25  7:18 [Qemu-devel] [Bug 1745312] [NEW] Regression report: Disk subsystem I/O failures/issues surfacing in DOS/early Windows [two separate issues: one bisected, one root-caused] i336_
2018-01-29 13:37 ` Stefan Hajnoczi
2018-01-30  7:10 ` [Qemu-devel] [Bug 1745312] " Fam Zheng
2018-01-30 19:56 ` John Snow
2018-04-30 18:06 ` Mario
2019-08-01 18:19 ` Mdasoh Kyaeppd
2019-12-23 21:35 ` John Snow
2021-04-22  5:32 ` Thomas Huth
2021-04-29  9:52 ` Thomas Huth
2021-04-30 16:49 ` Thomas Huth
2022-07-07 21:03 ` Lev Kujawski
